Steps involved in Bag of Words (BoW):

1. Construction of a d-dimensional dictionary

  • Here, we create an array of all the unique words in the document corpus.
  • Let there be ‘d’ unique words.
  • Every unique word corresponds to one dimension.

Note 1: A text, which could be a word or a sentence, is known as a document in NLP.

Note 2: A collection of such documents is known as a document corpus.

1.1 Example:

Let there be two documents in the document corpus as given below:

  1. This car drives good and is expensive.
  2. This car is not expensive and drives good.

We create a dictionary (or an array) of all the unique words in the document corpus:

[This, car, drives, good, and, is, expensive, not]
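The dictionary construction above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `build_vocabulary` is hypothetical, and the tokenization (splitting on whitespace and stripping the trailing period, with case preserved) is an assumption chosen to match the example.

```python
def build_vocabulary(corpus):
    """Return the unique words of the corpus in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in corpus:
        # Assumption: strip the trailing period and split on whitespace.
        for word in document.rstrip(".").split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


corpus = [
    "This car drives good and is expensive.",
    "This car is not expensive and drives good.",
]

vocabulary = build_vocabulary(corpus)
print(vocabulary)
# ['This', 'car', 'drives', 'good', 'and', 'is', 'expensive', 'not']
print(len(vocabulary))  # d = 8
```

Keeping the words in order of first appearance makes the result match the array shown above; any fixed ordering would work, as long as the same ordering is used for every document.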

2. Creating a vector for each document

  • For every document, we create a d-dimensional vector.
  • Every dimension of a vector corresponds to a unique word.
  • The value of each dimension is the number of times the corresponding unique word occurs in the given document.

Note 3: BoW generally produces sparse vectors. In a sparse vector, most of the dimensions have the value 0.

2.1 Example:

Let vectors v1 and v2 correspond to document 1 and document 2 respectively. Then these vectors are represented as:

v1 = [1 1 1 1 1 1 1 0]

v2 = [1 1 1 1 1 1 1 1]
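The vectors above can be reproduced with a short Python sketch. The helper name `vectorize` is hypothetical, and the tokenization (whitespace split, trailing period stripped) is an assumption chosen to match the example; the dictionary is written out literally so that the snippet is self-contained.

```python
from collections import Counter

# Dictionary from the example in step 1.
vocabulary = ["This", "car", "drives", "good", "and", "is", "expensive", "not"]


def vectorize(document, vocabulary):
    """Return the d-dimensional count vector of a document."""
    # Count how often each word occurs in the document.
    counts = Counter(document.rstrip(".").split())
    # One entry per dictionary word; a Counter returns 0 for absent words.
    return [counts[word] for word in vocabulary]


v1 = vectorize("This car drives good and is expensive.", vocabulary)
v2 = vectorize("This car is not expensive and drives good.", vocabulary)

print(v1)  # [1, 1, 1, 1, 1, 1, 1, 0]
print(v2)  # [1, 1, 1, 1, 1, 1, 1, 1]
```

The two vectors differ only in the last dimension, which corresponds to the word "not". Because BoW records counts and ignores word order, the different word order in the two documents has no effect on the vectors.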
