Kaggle NLP Absolute Beginners#
The first tutorial in the Kaggle Natural Language Processing Guide.
NLP For Classification#
Classification is one of the more useful applications of NLP. It can be used for things like organizing documents by topic or sentiment analysis (finding out whether people are saying positive or negative things about your product).
U.S. Patent Phrase to Phrase Matching Competition#
Compare two words or short phrases.
- Original competition: score each pair from 0 to 1 based on whether they’re similar or not
  - 0 = totally different meaning
  - 0.5 = somewhat similar meaning
  - 1 = identical meaning
- Classification version (what we’ll do here): classify the pairs of words or phrases into Different, Similar, or Identical categories
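To make the relationship between the two versions concrete, here is a minimal sketch of turning a 0 to 1 score into one of the three categories. The function name `score_to_category` and the cut-offs (exactly 0 is Different, exactly 1 is Identical, everything in between is Similar) are assumptions for illustration; the notes above don't say where the boundaries should sit.

```python
def score_to_category(score):
    """Map a 0-1 similarity score to a category.

    Hypothetical helper: the boundaries below are an assumption,
    not something defined by the competition.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score == 0.0:
        return "Different"  # totally different meaning
    if score == 1.0:
        return "Identical"  # identical meaning
    return "Similar"  # anything in between counts as somewhat similar


labels = [score_to_category(s) for s in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(labels)  # ['Different', 'Similar', 'Similar', 'Similar', 'Identical']
```

A different split (for example, treating 0.25 as Different) would be just as easy to encode; the point is only that the classification labels can be derived from the original scores.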
from pathlib import Path
import pandas as pd
Get the Dataset#
We’ll get the dataset from Kaggle.
One problem: when you download a dataset from a Kaggle competition, you need to agree to the competition rules, including a rule to not make the data available to people who haven’t agreed to them. So I can’t just add it to my publicly-available repo.
I could download it manually from the webpage and put it in the right place, but since I can’t add it to tracked files, I’d have to repeat that manual step for every notebook that uses it each time I clone the repo.
Instead, I install the Kaggle API so this notebook can download the dataset itself, without tracking the data in Git.
1. If you haven’t already, go to the Competition page, go to the Data tab, and Accept the rules of the competition to be allowed to download the dataset.
2. If not already installed, install the API (usually with pip install kaggle, but since I’m using UV as a dependency manager, I used uv add kaggle. Running uv sync in this repo should install it with all the other dependencies).
3. On the Kaggle website, create or log in to your account, then click the Profile picture -> Settings -> API -> Create new Token to download kaggle.json to your computer.
4. Move that file to ~/.kaggle/kaggle.json (~ is the home directory).
   note: I use Sphinx with myst_nb to turn these notebooks into documentation, and myst_nb runs the notebooks to check if they still work. Since I can’t commit the kaggle.json file to the repo without making my private Kaggle API key publicly available, I specify the API key with environment variables instead: KAGGLE_USERNAME and KAGGLE_KEY. Get those values out of kaggle.json and add them to GitHub Secrets for the GitHub Actions pipeline to use.
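Before running the download, it can help to check which of the two credential sources described above is actually present. The sketch below uses only the standard library; `kaggle_credential_source` is a hypothetical helper, not part of the Kaggle API, and the order it checks in (environment variables first, then the token file) is an assumption made for this example.

```python
import json
import os
from pathlib import Path


def kaggle_credential_source(env=None, json_path=None):
    """Report where Kaggle credentials were found (hypothetical helper).

    Returns "environment", "file", or None if neither source is set up.
    """
    env = os.environ if env is None else env

    # Environment variables, e.g. populated from GitHub Secrets in CI
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"

    # Token file downloaded from the Kaggle website
    if json_path is None:
        json_path = Path.home() / ".kaggle" / "kaggle.json"
    json_path = Path(json_path)
    if json_path.is_file():
        token = json.loads(json_path.read_text())
        if token.get("username") and token.get("key"):
            return "file"

    return None


# Usage with a made-up environment, so no real credentials are needed
fake_env = {"KAGGLE_USERNAME": "me", "KAGGLE_KEY": "not-a-real-key"}
print(kaggle_credential_source(env=fake_env))  # environment
```

Calling `kaggle_credential_source()` with no arguments checks the real environment and `~/.kaggle/kaggle.json`; a result of `None` would suggest neither setup step has been completed on that machine.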
Run the cell below to download and unzip the dataset if it doesn’t already exist.
Initially this gave me a "Forbidden URL" error, but later it worked. Possibly I hadn’t accepted the rules for the competition yet.
# download and unzip the dataset to this folder if not already downloaded
data_dir = Path("us-patent-phrase-to-phrase-matching")
if not data_dir.exists():
    import kaggle
    import zipfile

    kaggle.api.competition_download_cli(str(data_dir))  # download the dataset from Kaggle as zip file
    zip_path = data_dir.with_suffix(".zip")  # path to the downloaded zip file
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(data_dir)  # unzip the file
    zip_path.unlink()  # delete the zip file after unzipping
Downloading us-patent-phrase-to-phrase-matching.zip to /home/runner/work/aiml-notes/aiml-notes/docs/source/portfolio/kaggle
0%| | 0.00/682k [00:00<?, ?B/s]
100%|██████████| 682k/682k [00:00<00:00, 480MB/s]
Examine the Dataset#
# import and check the dataset. Looks like it's already scoring similarity of word/phrase pairs.
train_df = pd.read_csv(data_dir / "train.csv")
train_df
| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |
| 36472 | 8d135da0b55b8c88 | wood article | wooden substrate | B44 | 0.50 |
36473 rows × 5 columns
# get descriptive statistics on the object (string) columns
train_df.describe(include="object")
| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |
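To see what the four rows of that summary mean, here is a rough stdlib-only sketch that computes the same statistics by hand for a handful of anchor values copied from the preview earlier. `describe_strings` is a hypothetical helper written for this note, and it assumes there are no missing values (pandas leaves missing values out of `count`).

```python
from collections import Counter

# A hand-picked subset of anchor values from the train.csv preview (not the full column)
anchors = ["abatement", "abatement", "abatement", "wood article", "wood article"]


def describe_strings(values):
    """Compute count, unique, top, and freq for a list of strings."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]  # most frequent value and how often it occurs
    return {
        "count": len(values),   # number of values
        "unique": len(counts),  # number of distinct values
        "top": top,
        "freq": freq,
    }


summary = describe_strings(anchors)
print(summary)  # {'count': 5, 'unique': 2, 'top': 'abatement', 'freq': 3}
```

Read the same way, the real output says there are 36473 rows but only 733 distinct anchors, so each anchor is paired with many different targets.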