Robust document and text similarity measures are essential for building applications such as recommendation systems, search engines, and information retrieval systems, as well as other ML/AI applications such as news aggregators and automated recruitment systems that match CVs to job specifications. In general, text similarity measures how close words/tokens, tweets, phrases, sentences, paragraphs, and entire documents are to each other, both lexically and semantically. Texts are lexically similar if they have similar character sequences or structure, and semantically similar if they have the same meaning, describe similar concepts, and are used in the same context.

This tutorial demonstrates a number of strategies for feature extraction, i.e., transforming documents into numeric feature vectors. This transformation is a prerequisite for computing the similarity between documents. Typically, each strategy involves four steps: 1) using standard natural language pre-processing techniques to prepare and clean the documents, 2) transforming the document text into numeric vectors/embeddings, 3) calculating document similarity using metrics such as cosine similarity, Euclidean distance, and Jaccard similarity, and 4) validating the findings.
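To make the four steps concrete, here is a minimal sketch in plain Python. It is not the tutorial's own implementation: the toy corpus, the tiny stop-word list, the use of simple term-count vectors for the feature extraction step, and all function names are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "A cat sat on a mat!",
    "Stock markets fell sharply today.",
]

# Tiny illustrative stop-word list (an assumption; real pipelines use larger ones).
STOP_WORDS = {"a", "an", "the", "on"}


def preprocess(text):
    """Step 1: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(token_lists):
    """Step 2: build term-count vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Step 3a: cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Step 3b: straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_similarity(tokens_a, tokens_b):
    """Step 3c: size of the token-set intersection divided by the size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


token_lists = [preprocess(doc) for doc in DOCS]
vocab, vectors = vectorize(token_lists)

# Step 4 (a crude sanity check, not a full validation): print every pair
# so the scores can be compared against human judgement of the texts.
for i in range(len(DOCS)):
    for j in range(i + 1, len(DOCS)):
        print(
            f"doc{i} vs doc{j}: "
            f"cosine={cosine_similarity(vectors[i], vectors[j]):.2f}, "
            f"euclidean={euclidean_distance(vectors[i], vectors[j]):.2f}, "
            f"jaccard={jaccard_similarity(token_lists[i], token_lists[j]):.2f}"
        )
```

Note that cosine and Jaccard are similarities (higher means more alike), while Euclidean is a distance (lower means more alike). Count vectors capture only lexical overlap; capturing semantic similarity generally requires embeddings rather than raw counts.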




Building Document Similarity Solutions