How can Term Frequency-Inverse Document Frequency (TF-IDF) be used for document classification?
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is commonly used in document classification tasks to represent the significance of words in a document relative to the entire dataset. Here's how TF-IDF can be used for document classification:
Compute TF-IDF Matrix:
For each document in the dataset, calculate the TF-IDF score for each term in the document. TF-IDF is calculated as the product of Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF) measures how frequently a term appears in a document.
Inverse Document Frequency (IDF) measures how rare a term is across the entire dataset, down-weighting terms that appear in many documents.
TF-IDF is then given by:
TF-IDF(t, d) = TF(t, d) * IDF(t)
where TF(t, d) = (number of times term t appears in document d) / (total number of terms in d), and IDF(t) = log(N / number of documents containing t), with N the total number of documents in the corpus.
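As a minimal sketch, these formulas translate directly into Python (assuming documents are already tokenized into lists of words, and using the natural logarithm, as the worked example below does):

```python
import math

def tf(term, doc_tokens):
    # Term Frequency: occurrences of `term` in this document,
    # normalized by the document's total token count.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse Document Frequency: log(N / number of documents
    # containing `term`), using the natural logarithm.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    # TF-IDF is simply the product of the two.
    return tf(term, doc_tokens) * idf(term, corpus)
```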
Vectorization:
- Represent each document as a vector in the TF-IDF space. Each element of the vector corresponds to the TF-IDF score of a term in the document.
Classification Algorithm:
- Train a machine learning classifier (e.g., Support Vector Machine or Naive Bayes) using the TF-IDF vectors as features.
Predictions:
- Given a new document, calculate its TF-IDF vector using the same vocabulary and IDF values learned during training.
- Use the trained classifier to predict the document's class based on its TF-IDF representation.
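Putting the four steps together, here is a minimal sketch using scikit-learn; `train_texts`, `train_labels`, and `new_doc` are hypothetical placeholders you would replace with your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training data -- replace with your own corpus.
train_texts = ["a labeled document ...", "another labeled document ..."]
train_labels = ["class_a", "class_b"]

# Steps 1-2: fit the vectorizer on the training corpus and
# turn each document into a TF-IDF vector.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Step 3: train a classifier on the TF-IDF features.
clf = LinearSVC()
clf.fit(X_train, train_labels)

# Step 4: vectorize a new document with the *same* fitted
# vectorizer (same vocabulary and IDF values), then predict.
new_doc = ["an unseen document ..."]
X_new = vectorizer.transform(new_doc)
print(clf.predict(X_new))
```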
Example:
Suppose you have a dataset of movie reviews with two classes: positive and negative. You want to classify new reviews into one of these classes.
Compute TF-IDF Matrix:
- Calculate TF-IDF scores for each term in each document based on the entire dataset.
Vectorization:
- Represent each movie review as a vector in the TF-IDF space.
Classification Algorithm:
- Train a classifier using the TF-IDF vectors and corresponding class labels.
Predictions:
- Given a new movie review, calculate its TF-IDF vector using the vocabulary and IDF values fitted on the training reviews.
- Use the trained classifier to predict whether the review is positive or negative.
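Concretely, the movie-review pipeline might look like the following sketch; the four reviews are made-up toy data for illustration only, and a real model would need far more:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data -- invented reviews for illustration only.
reviews = [
    "A wonderful film with brilliant acting",
    "Absolutely loved it, a joy to watch",
    "Dull plot and terrible pacing",
    "A boring mess, complete waste of time",
]
labels = ["positive", "positive", "negative", "negative"]

# Fit the vectorizer and build the TF-IDF feature matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# Train a Naive Bayes classifier on the TF-IDF features.
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen review.
new_review = vectorizer.transform(["What a brilliant and wonderful movie"])
print(clf.predict(new_review))  # likely ['positive'] on this toy data
```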
TF-IDF helps in capturing the importance of words in a document while considering their relevance across the entire dataset. It often results in more informative and discriminative features for document classification compared to simple Bag of Words representations.
Worked Example
Documents:
- "The quick brown fox jumped over the lazy dog."
- "The lazy dog barked at the fox."
Step 1: Tokenization:
- Tokenize each document into individual words.
Document 1:
["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
Document 2:
["The", "lazy", "dog", "barked", "at", "the", "fox"]Step 2: Compute TF (Term Frequency):
- Calculate the frequency of each term in each document (counting case-insensitively, so "The" and "the" are the same term).
Document 1:
- TF("The") = 2/9
- TF("quick") = 1/9
- ... and so on for each term.
Document 2:
- TF("The") = 2/7
- TF("lazy") = 1/7
- ... and so on for each term.
Step 3: Compute IDF (Inverse Document Frequency):
- Calculate the inverse document frequency for each term in the entire corpus (both documents), using the natural logarithm, so log(2) ≈ 0.69.
- Total number of documents: N = 2
- IDF("The") = log(N / number of documents containing "The") = log(2 / 2) = 0
- IDF("quick") = log(2 / 1) ≈ 0.69
- IDF("brown") = log(2 / 1) ≈ 0.69
- IDF("fox") = log(2 / 2) = 0
- IDF("jumped") = log(2 / 1) ≈ 0.69
- IDF("over") = log(2 / 1) ≈ 0.69
- IDF("lazy") = log(2 / 2) = 0
- IDF("dog") = log(2 / 2) = 0
- IDF("barked") = log(2 / 1) ≈ 0.69
- IDF("at") = log(2 / 1) ≈ 0.69
Step 4: Compute TF-IDF:
- Multiply the TF and IDF values for each term in each document.
Document 1:
- TF-IDF("The") = (2/9) * 0 = 0
- TF-IDF("quick") = (1/9) * 0.69 = 0.0767
- ... and so on for each term.
Document 2:
- TF-IDF("The") = (2/7) * 0 = 0
- TF-IDF("lazy") = (1/7) * 0 = 0
- ... and so on for each term.
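To check the arithmetic, here is a short Python script that reproduces Steps 1-4 above, lowercasing tokens so that "The" and "the" count as the same term, as the TF figures assume:

```python
import math

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog barked at the fox.",
]

# Step 1: tokenize (lowercase, strip the trailing period).
tokenized = [d.lower().replace(".", "").split() for d in docs]

# Step 3 first, since IDF is shared across documents:
# log(N / number of documents containing the term).
vocab = sorted(set(t for doc in tokenized for t in doc))
idf = {
    t: math.log(len(tokenized) / sum(t in doc for doc in tokenized))
    for t in vocab
}

# Steps 2 and 4: per-document TF and TF-IDF.
for i, doc in enumerate(tokenized, start=1):
    print(f"Document {i}:")
    for t in sorted(set(doc)):
        tf = doc.count(t) / len(doc)
        print(f"  {t}: TF={tf:.3f}  IDF={idf[t]:.2f}  TF-IDF={tf * idf[t]:.4f}")
```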
These TF-IDF values provide a numerical representation of the importance of each term in each document relative to the entire dataset. In a real-world scenario, you would typically use software libraries like scikit-learn to automate this process for larger datasets.
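For reference, running the same two sentences through scikit-learn looks like the sketch below. Note that TfidfVectorizer's default formula differs from the textbook one used above: it applies a smoothed IDF of log((1 + N) / (1 + df)) + 1 and L2-normalizes each row, so its numbers will not match the hand computation exactly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog barked at the fox.",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary term.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))
```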