How can Term Frequency-Inverse Document Frequency (TF-IDF) be used for document classification?


Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is commonly used in document classification tasks to represent the significance of words in a document relative to the entire dataset. Here's how TF-IDF can be used for document classification:

  1. Compute TF-IDF Matrix:

    • For each document in the dataset, calculate the TF-IDF score for each term in the document. TF-IDF is calculated as the product of Term Frequency (TF) and Inverse Document Frequency (IDF).

    • Term Frequency (TF) measures how frequently a term appears in a document: $\mathrm{TF}(t, d) = \dfrac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$

    • Inverse Document Frequency (IDF) measures the importance of a term across the entire dataset: $\mathrm{IDF}(t, D) = \log\left(\dfrac{\text{total number of documents in the dataset } D}{\text{number of documents containing term } t}\right)$

    • TF-IDF is then given by: $\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$

  2. Vectorization:

    • Represent each document as a vector in the TF-IDF space. Each element of the vector corresponds to the TF-IDF score of a term in the document.
  3. Classification Algorithm:

    • Train a machine learning classifier (e.g., Support Vector Machine, Naive Bayes, etc.) using the TF-IDF vectors as features.
  4. Predictions:

    • Given a new document, calculate its TF-IDF vector using the vocabulary and IDF weights learned from the training set. The IDF values are not recomputed on the new document.
    • Use the trained classifier to predict the document's class based on its TF-IDF representation.
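
The feature-extraction part of these steps can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the helper names `tokenize`, `fit_idf`, and `transform`, and the toy training sentences, are made up for this example. The point it shows is that the IDF weights are learned once from the training documents and then reused, unchanged, for any new document.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical minimal tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(train_docs):
    """Learn IDF(t) = log(N / df(t)) from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map a document to a TF-IDF vector over the training vocabulary."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty documents
    # Terms that never appeared in training have no IDF weight and are ignored.
    return [counts[term] / total * idf[term] for term in sorted(idf)]


# Toy training data, invented for illustration.
train_docs = ["good movie", "bad movie", "good acting"]
idf = fit_idf(train_docs)
vocabulary = sorted(idf)

train_vectors = [transform(doc, idf) for doc in train_docs]
new_vector = transform("good good film", idf)  # "film" is unseen, so it is dropped

print(vocabulary)
print(new_vector)
```

The resulting `train_vectors` (with their labels) are what a classifier would be trained on, and `new_vector` is what it would receive at prediction time.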


Suppose you have a dataset of movie reviews with two classes: positive and negative. You want to classify new reviews into one of these classes.

  1. Compute TF-IDF Matrix:

    • Calculate TF-IDF scores for each term in each document based on the entire dataset.
  2. Vectorization:

    • Represent each movie review as a vector in the TF-IDF space.
  3. Classification Algorithm:

    • Train a classifier using the TF-IDF vectors and corresponding class labels.
  4. Predictions:

    • Given a new movie review, calculate its TF-IDF vector using the vocabulary and IDF weights learned from the training reviews.
    • Use the trained classifier to predict whether the review is positive or negative.
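
As a rough sketch of this workflow, the snippet below chains scikit-learn's `TfidfVectorizer` with a `MultinomialNB` classifier. The four training reviews, their labels, and the two new reviews are invented purely for illustration; a real dataset would be far larger, and Naive Bayes is just one of the classifiers that could be used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy reviews, for illustration only.
train_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull and awful waste of time",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline learns the vocabulary and IDF weights during fit(),
# then reuses them when transforming new reviews inside predict().
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

new_reviews = ["wonderful story, great film", "terrible and boring"]
predictions = model.predict(new_reviews)
print(list(predictions))
```

Wrapping both steps in one pipeline is a design choice that makes it hard to accidentally refit the vectorizer on new data.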

TF-IDF captures the importance of words in a document while accounting for how common they are across the entire dataset. Because terms that occur in most documents are down-weighted, it often yields more discriminative features for document classification than the raw counts of a simple Bag of Words representation.



As a worked example, consider a corpus of two documents:

  1. "The quick brown fox jumped over the lazy dog."
  2. "The lazy dog barked at the fox."

Step 1: Tokenization:

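  • Tokenize each document into individual words, dropping punctuation. Case is ignored when counting, so "The" and "the" are treated as the same term in the steps that follow.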

Document 1:

["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]

Document 2:

["The", "lazy", "dog", "barked", "at", "the", "fox"]

Step 2: Compute TF (Term Frequency):

  • Calculate the frequency of each term in each document.

Document 1:

  • TF("The") = 2/9
  • TF("quick") = 1/9
  • ... and so on for each term.

Document 2:

  • TF("The") = 2/7
  • TF("lazy") = 1/7
  • ... and so on for each term.

Step 3: Compute IDF (Inverse Document Frequency):

  • Calculate the inverse document frequency for each term in the entire corpus (both documents), using the natural logarithm.
  • Total number of documents (N) = 2
  • IDF("The") = log(N / Number of documents containing "The") = log(2 / 2) = 0
  • IDF("quick") = log(2 / 1) = 0.69
  • IDF("brown") = log(2 / 1) = 0.69
  • IDF("fox") = log(2 / 2) = 0
  • IDF("jumped") = log(2 / 1) = 0.69
  • IDF("over") = log(2 / 1) = 0.69
  • IDF("lazy") = log(2 / 2) = 0
  • IDF("dog") = log(2 / 2) = 0
  • IDF("barked") = log(2 / 1) = 0.69
  • IDF("at") = log(2 / 1) = 0.69

Step 4: Compute TF-IDF:

  • Multiply the TF and IDF values for each term in each document.

Document 1:

  • TF-IDF("The") = (2/9) * 0 = 0
  • TF-IDF("quick") = (1/9) * 0.69 = 0.0767
  • ... and so on for each term.

Document 2:

  • TF-IDF("The") = (2/7) * 0 = 0
  • TF-IDF("lazy") = (1/7) * 0 = 0
  • ... and so on for each term.
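
The hand calculation above can be reproduced with a few lines of plain Python. This is a sketch that assumes the simplest possible preprocessing (lowercasing, removing the period, splitting on spaces) and the unsmoothed formula IDF = log(N / df) with the natural logarithm. The variable names are illustrative only.

```python
import math
from collections import Counter

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog barked at the fox.",
]

# Lowercase and strip the period so "The" and "the" count as one term.
tokenized = [doc.lower().replace(".", "").split() for doc in docs]

# Document frequency: in how many documents does each term occur?
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# TF-IDF per document: (count / document length) * IDF.
tfidf = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf.append({term: counts[term] / len(tokens) * idf[term] for term in counts})

for i, scores in enumerate(tfidf, start=1):
    print(f"Document {i}:", {term: round(score, 4) for term, score in scores.items()})
```

With full floating-point precision, "quick" comes out as about 0.0770 rather than 0.0767, because the hand calculation rounds the IDF to 0.69 before multiplying. Every term that appears in both documents scores exactly 0, which is expected in a corpus this small.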

These TF-IDF values provide a numerical representation of the importance of each term in each document relative to the entire dataset. In a real-world scenario, you would typically use software libraries like scikit-learn to automate this process for larger datasets.
