Kaggle NLP Absolute Beginners

NLP For Classification

Classification is one of the more useful applications of NLP. It can be used for tasks like organizing documents by topic or sentiment analysis (finding out whether people are saying positive or negative things about your product).

U.S. Patent Phrase to Phrase Matching Competition

  • compare two words or short phrases

    • original competition:

      • score each pair from 0 to 1 based on how similar the two meanings are

      • 0 = totally different meaning, 1 = identical meaning, 0.5 = somewhat similar meaning

    • classification version (what we’ll do here)

      • classify the pairs of words or phrases into Different, Similar, or Identical categories
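To make the classification version concrete, here is a minimal sketch of turning a 0-1 similarity score into one of the three categories. The cut points are my own assumption (the competition only defines the scores, not category boundaries), and score_to_category is a hypothetical helper name, not something from the competition or a library.

```python
def score_to_category(score: float) -> str:
    """Map a 0-1 similarity score to a coarse category.

    The thresholds below are assumed for illustration only.
    """
    if score < 0.25:
        return "Different"  # assumed: only scores near 0 count as different
    if score < 1.0:
        return "Similar"    # assumed: anything between the two extremes
    return "Identical"      # a score of 1 means identical meaning


example_scores = [0.0, 0.5, 1.0]
example_categories = [score_to_category(s) for s in example_scores]
print(example_categories)
```

Different cut points would be just as defensible, so treat the mapping as a starting point rather than the competition's definition.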

from pathlib import Path
import pandas as pd

Get the Dataset

  • we’ll be getting the dataset from Kaggle.

    • One problem - when you download a dataset from a Kaggle competition, you need to agree to the competition rules, including a rule not to make the data available to people who haven’t agreed to them. So I can’t just add it to my publicly-available repo.

    • I could download it manually from the webpage and put it in the right place, but since I can’t add it to tracked files, I’d have to repeat that manual step for every notebook that needs the data each time I cloned the repo.

  • Instead, use the Kaggle API to download the dataset into this folder so the notebook can load it, and keep the data untracked in Git.

    • If you haven’t already, go to the Competition page, go to the Data tab, and Accept the rules of the competition to be allowed to download the dataset.

    • If not already installed, install the API. This is usually done with pip install kaggle, but since I’m using uv as a dependency manager, I used uv add kaggle. Running uv sync in this repo should install it along with all the other dependencies.

    • On the Kaggle website, create or log in to your account, then click the profile picture -> Settings -> API -> Create new Token to download kaggle.json to your computer.

      • Move that file to ~/.kaggle/kaggle.json (~ is the home directory)

      • note: I use Sphinx with myst_nb to turn these notebooks into documentation, and myst_nb runs the notebooks to check that they still work. I can’t commit kaggle.json to the repo without making my private Kaggle API key publicly available, so for that pipeline the credentials are specified with environment variables instead: KAGGLE_USERNAME and KAGGLE_KEY. Get those values out of kaggle.json and add them to GitHub Secrets for the GitHub Actions pipeline to use.

    • run the cell below to download and unzip the dataset if it doesn’t already exist.

    • initially this gave me a "Forbidden URL" error, but later it worked. Possibly I hadn’t accepted the competition rules yet.
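Before running the download, it can help to check that credentials are findable through either route described above. This is a rough sketch: find_kaggle_credentials is a hypothetical helper of mine, not part of the kaggle package, and the order it checks (environment variables first, then the token file) is this helper's own choice, not necessarily what the Kaggle client does internally. It assumes kaggle.json holds "username" and "key" fields.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(config_dir=None, environ=None):
    """Return (username, key) from env vars or kaggle.json, or None if neither is set."""
    environ = os.environ if environ is None else environ
    config_dir = Path.home() / ".kaggle" if config_dir is None else Path(config_dir)

    # Route 1: environment variables (what the CI pipeline uses)
    username = environ.get("KAGGLE_USERNAME")
    key = environ.get("KAGGLE_KEY")
    if username and key:
        return username, key

    # Route 2: the token file downloaded from the Kaggle settings page
    token_path = config_dir / "kaggle.json"
    if token_path.exists():
        token = json.loads(token_path.read_text())
        return token["username"], token["key"]

    return None


if find_kaggle_credentials() is None:
    print("No Kaggle credentials found in environment variables or ~/.kaggle/kaggle.json")
```

The config_dir and environ parameters exist only so the helper can be exercised without touching real credentials.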

# download and unzip the dataset to this folder if not already downloaded
data_dir = Path("us-patent-phrase-to-phrase-matching")
if not data_dir.exists():
    import kaggle
    import zipfile
    kaggle.api.competition_download_cli(str(data_dir))  # download the dataset from Kaggle as zip file
    zip_path = data_dir.with_suffix(".zip")  # path to the downloaded zip file
    with zipfile.ZipFile(zip_path) as zf:  # open the zip and close it again when done
        zf.extractall(data_dir)  # unzip the files into the data folder
    zip_path.unlink()  # delete the zip file after unzipping
Downloading us-patent-phrase-to-phrase-matching.zip to /home/runner/work/aiml-notes/aiml-notes/docs/source/portfolio/kaggle
  0%|          | 0.00/682k [00:00<?, ?B/s]
100%|██████████| 682k/682k [00:00<00:00, 480MB/s]


Examine the Dataset

# import and check the dataset. Looks like it's already scoring similarity of word/phrase pairs.
train_df = pd.read_csv(data_dir / "train.csv")
train_df
       id                anchor        target                  context  score
0      37d61fd2272659b1  abatement     abatement of pollution  A47      0.50
1      7b9652b17b68b7a4  abatement     act of abating          A47      0.75
2      36d72442aefd8232  abatement     active catalyst         A47      0.25
3      5296b0c19e1ce60e  abatement     eliminating process     A47      0.50
4      54c1e3b9184cb5b6  abatement     forest region           A47      0.00
...    ...               ...           ...                     ...      ...
36468  8e1386cbefd7f245  wood article  wooden article          B44      1.00
36469  42d9e032d1cd3242  wood article  wooden box              B44      0.50
36470  208654ccb9e14fa3  wood article  wooden handle           B44      0.50
36471  756ec035e694722b  wood article  wooden material         B44      0.75
36472  8d135da0b55b8c88  wood article  wooden substrate        B44      0.50

36473 rows × 5 columns

# get descriptive statistics on the object (string) columns
train_df.describe(include="object")  
        id                anchor                       target       context
count   36473             36473                        36473        36473
unique  36473             733                          29340        106
top     37d61fd2272659b1  component composite coating  composition  H01
freq    1                 152                          24           2186
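A natural follow-up is to look at how the scores are distributed and how many targets each anchor has. The sketch below rebuilds only the first five rows shown in the output above as a small stand-in frame (sample_df is my own name), so the counts describe that sample and not the full dataset. On the real data the same calls would run against train_df.

```python
import pandas as pd

# First five rows of train.csv as displayed above (id column omitted)
sample_df = pd.DataFrame(
    {
        "anchor": ["abatement"] * 5,
        "target": [
            "abatement of pollution",
            "act of abating",
            "active catalyst",
            "eliminating process",
            "forest region",
        ],
        "context": ["A47"] * 5,
        "score": [0.50, 0.75, 0.25, 0.50, 0.00],
    }
)

# How often each score value occurs, ordered by score
score_counts = sample_df["score"].value_counts().sort_index()

# Number of distinct targets paired with each anchor
targets_per_anchor = sample_df.groupby("anchor")["target"].nunique()

print(score_counts)
print(targets_per_anchor)
```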