Kaggle NLP Absolute Beginners#

Notes on "Getting Started with NLP for Absolute Beginners", the first tutorial in the Kaggle Natural Language Processing Guide.

NLP For Classification#

Classification is one of the more useful applications of NLP. It can be used for tasks such as organizing documents by topic or sentiment analysis (finding out whether people are saying positive or negative things about your product).

U.S. Patent Phrase to Phrase Matching Competition#

The task is to compare two words or short phrases.
- original competition:
  - score each pair from 0 to 1 based on how similar the meanings are
  - 0 = totally different meaning, 0.5 = somewhat similar meaning, 1 = identical meaning
- classification version (what we’ll do here):
  - classify each pair of words or phrases as Different, Similar, or Identical
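One way to turn the 0-1 scores into the three categories is a simple threshold rule. This is only a sketch: the function name `score_to_label` and the cut-offs (exactly 0 is Different, exactly 1 is Identical, everything in between is Similar) are my own assumptions for illustration, not something the competition defines.

```python
def score_to_label(score: float) -> str:
    """Map a 0-1 similarity score to a category name.

    The cut-offs are assumed for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score == 0.0:
        return "Different"
    if score == 1.0:
        return "Identical"
    return "Similar"


# Scores taken from the first rows of train.csv shown further down.
labels = [score_to_label(s) for s in (0.50, 0.75, 0.25, 0.00, 1.00)]
```

A different split (for example treating 0.25 as Different) would be just as defensible; the point is only that the regression target can be binned into classes.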

from pathlib import Path
import pandas as pd

Get the Dataset#

We’ll get the dataset from Kaggle.
- One problem: to download a dataset from a Kaggle competition, you need to agree to the competition rules, including a rule not to make the data available to people who haven’t agreed to them. So I can’t just add it to my publicly available repo.
- I could download it manually from the webpage and put it in the right place, but since I can’t add it to tracked files, I’d need to redo that by hand for every notebook that uses it each time I cloned the repo.

Instead, use the Kaggle API to download the dataset so this notebook can import it, without tracking it in Git:
- If you haven’t already, go to the Competition page, open the Data tab, and accept the rules of the competition so you are allowed to download the dataset.
- If it is not already installed, install the API (usually with pip install kaggle, but since I’m using uv as a dependency manager, I used uv add kaggle; running uv sync in this repo should install it along with all the other dependencies).
- On the Kaggle website, create an account or log in, then click the profile picture -> Settings -> API -> Create new Token to download kaggle.json to your computer.
  - Move that file to ~/.kaggle/kaggle.json (~ is the home directory).
  - Note: I use Sphinx with myst_nb to turn these notebooks into documentation, and myst_nb runs the notebooks to check that they still work. Since I can’t commit kaggle.json to the repo without exposing my private Kaggle API key, the key is supplied through environment variables instead: KAGGLE_USERNAME and KAGGLE_KEY. Copy those values out of kaggle.json and add them to GitHub Secrets for the GitHub Actions pipeline to use.
- Run the download cell below to download and unzip the dataset if it doesn’t already exist.
- Initially this gave me a "Forbidden URL" error, but later it worked. Possibly I hadn’t accepted the rules for the competition yet.
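Before running the download, it can help to confirm that credentials are visible in one of the two places described above. The helper below is a hypothetical sketch (`find_kaggle_credentials` is my own name, not part of the Kaggle package), and the order in which it checks the two sources is just a choice made for this example. It assumes kaggle.json holds "username" and "key" fields.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def find_kaggle_credentials(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Return "env" or "file" depending on where credentials were found, else None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # Environment variables, as used for the GitHub Actions pipeline.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Token file downloaded from the Kaggle settings page.
    token_path = home / ".kaggle" / "kaggle.json"
    if token_path.is_file():
        token = json.loads(token_path.read_text())
        if token.get("username") and token.get("key"):
            return "file"

    return None
```

Calling `find_kaggle_credentials()` with no arguments checks the real environment and home directory; a result of `None` suggests the download is likely to fail with an authentication error.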

# download and unzip the dataset to this folder if not already downloaded
data_dir = Path("us-patent-phrase-to-phrase-matching")
if not data_dir.exists():
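    import zipfile

    import kaggle  # imported here so it is only needed when the data is missing

    kaggle.api.competition_download_cli(str(data_dir))  # download the dataset from Kaggle as a zip file
    zip_path = data_dir.with_suffix(".zip")  # path to the downloaded zip file
    with zipfile.ZipFile(zip_path) as zip_file:  # close the archive when done
        zip_file.extractall(data_dir)  # unzip into the dataset folder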
    zip_path.unlink()  # delete the zip file after unzipping

Downloading us-patent-phrase-to-phrase-matching.zip to /home/runner/work/aiml-notes/aiml-notes/docs/source/portfolio/kaggle

  0%|          | 0.00/682k [00:00<?, ?B/s]

100%|██████████| 682k/682k [00:00<00:00, 480MB/s]

Examine the Dataset#

# import and check the dataset. Looks like it's already scoring similarity of word/phrase pairs.
train_df = pd.read_csv(data_dir / "train.csv")
train_df

	id	anchor	target	context	score
0	37d61fd2272659b1	abatement	abatement of pollution	A47	0.50
1	7b9652b17b68b7a4	abatement	act of abating	A47	0.75
2	36d72442aefd8232	abatement	active catalyst	A47	0.25
3	5296b0c19e1ce60e	abatement	eliminating process	A47	0.50
4	54c1e3b9184cb5b6	abatement	forest region	A47	0.00
...	...	...	...	...	...
36468	8e1386cbefd7f245	wood article	wooden article	B44	1.00
36469	42d9e032d1cd3242	wood article	wooden box	B44	0.50
36470	208654ccb9e14fa3	wood article	wooden handle	B44	0.50
36471	756ec035e694722b	wood article	wooden material	B44	0.75
36472	8d135da0b55b8c88	wood article	wooden substrate	B44	0.50

36473 rows × 5 columns

# get descriptive statistics on the object (string) columns
train_df.describe(include="object")  

	id	anchor	target	context
count	36473	36473	36473	36473
unique	36473	733	29340	106
top	37d61fd2272659b1	component composite coating	composition	H01
freq	1	152	24	2186
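To make the rows of that summary concrete, here is a small sketch that recomputes count, unique, top and freq by hand on a few of the rows shown above. The helper name `summarize` is hypothetical, and the seven-row sample is only a slice of train.csv, so its numbers differ from the full table.

```python
import pandas as pd

# Seven rows copied from the train.csv preview above (first five and two from the end).
sample = pd.DataFrame(
    {
        "anchor": ["abatement"] * 5 + ["wood article"] * 2,
        "target": [
            "abatement of pollution",
            "act of abating",
            "active catalyst",
            "eliminating process",
            "forest region",
            "wooden article",
            "wooden box",
        ],
        "context": ["A47"] * 5 + ["B44"] * 2,
    }
)


def summarize(column: pd.Series) -> dict:
    """Recompute the four describe() rows for one text column."""
    counts = column.value_counts()  # sorted by frequency, most common first
    return {
        "count": int(column.count()),    # non-null values
        "unique": int(column.nunique()), # distinct values
        "top": counts.index[0],          # most frequent value
        "freq": int(counts.iloc[0]),     # how often "top" occurs
    }


anchor_summary = summarize(sample["anchor"])
context_summary = summarize(sample["context"])
```

Read this way, the full table says there are only 733 distinct anchors across 36473 rows, while almost every target is different, and a single context code (H01) covers 2186 rows.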
