Chinese recipes search engine and recommendation function
The object of this project is to build a search engine for recipes searching and recommendation. Sometimes existing recipe websites always return recipes that are fancy or endorsed by many users because of the beautiful look of the dish. The instructions for those recipes may not clear or may be too fancy to conduct. And the suggestion system always recommends users recipes that are most popular but not meet the users’ taste. It is meaningful to build a search and recommendation engine to retrieve and suggest the recipes that users are truly interested in.
The recipe data for this project was scraped from a Chinese recipe website called xiachufang.com. The following code could be used to scrape recipes from xiachufang.com website. The package I used is called beautifulsoup which is a popular python library for web scraping and scrawling.
Each line of record in this recipe dataset is a recipe. There are about 900 records I scraped here. Those instructions are written in Chinese. I used the jieba package to process sentences to words. Chinese word segmentation is another interesting part in this project. Jieba searching engine mode separating function will take the input sentences or sentence and return a list of words. I processed this list and concatenated these words together and separated them by an empty space. I fed the reprocessed document to metapy, which makes it easy to get all terms and build the inverted index of the recipes dataset.
Recommendation feature method
A document can be represented by a vector . Keywords of a short document (such as news) are not many, so we can extract the keywords of each news, and use the TFIDF value of these keywords to constitute the vector representation of the document, which can greatly reduce the amount of similarity calculation, while maintaining a good recommendation effect. The jiaba package could return terms and corresponding TFIDF value. Top 20 terms of each document are collected to represent the document. Then an M*N matrix is constructed, where M represents the number of all documents and N is the number of all selected unique terms in M documents. This matrix is a sparse matrix. By using the pairwise distance method in sklearn and pandas package, cosine similarity can be calculated. Below is the code to calculated cosine similarities. The code was modified from here.
Results and discussion for recommendation function
Below are the scatter plots of cosine similarity for document 1 and document 2. As we can see that most documents has a cosine similarity below 0.15. Since we just pick top 5 documents, the relevance between the original document and recommended documents are predictable.
Information retrieval models for search engine
In this project, I used probabilistic retrieval models. Computing the probability that if we know that a relevant document is retrieved is D P(R|Q,D) is hard. Using P(R=1|D,Q) /P(R= 0|D,Q) as ranking strategy and computing P(D,Q∣R=r) after applying Bayes’ rule is easier.
Using document generation model, P(Q,D|R) = P(D|Q,R) P(Q|R). I used BM25 as the metric to rank documents. Given a query Q, containing keywords q1, q2…, the BM25 score of a document D is:
c(t,D) is qi’s term frequency in the document D, |D| is the length of the document D in words, and avdl is the average document length in the text collection from which documents are drawn. k1, k3 and b are free parameters. By tuning k1, k3 and b, BM25 function could be adjusted properly. Here I used NDCG metric for evaluation. Normalized discounted cumulative gain at rank 10 is calculated for each query. Below is the BM25 results with b = 0.5, k1 = 1.2 and k3 = 500 as the baseline.
The parameter k1 controls term frequency scaling. For some key ingredients, they usually has large frequencies in the recipe. In this case, using raw term frequency is reason-able. If one ingredients just occur one time, then it may be just the side dish. In this scenario, a binary model is more suitable. So I keep k1 to be at 1.2 to get a relatively stable performance and balance the key ingredient searching and side dish searching. Besides k1, b is also an important parameter in BM25 function because it controls document length normalization. As stated in introduction session, I would like to pick out doable recipes that has reasonable workload and should gave enough details at the same time.In order to achieve these goals, document length normalization becomes the key factor. I run experiment test on the influence of b. And the results shows that BM25 gave the best performance when b is set to be 0.6. Below are the results of changing value of parameter b.
Using query generation, P(Q,D|R) = P(Q|D,R) P(D|R). And language models provide us with a way to quantify the likelihood of sequence of words. For language model, smoothing is a very important step. There are several smoothing methods such as additive smoothing, absolute discounting, Jelinek-Mercer smoothing and Dirichlet prior smoothing. All these methods are trying to discount the probability of words seen in a document and re-allocate the extra counts so that unseen words will have a non-zero count. In this project, I used Dirichlet prior smoothing as shown below because it has a good performance for search engine. Below are the results for language model with Dirichlet prior smoothing with different smoothing parameters. The model has the best aerage performance at the tuning parameter is equals to 150.
Metapy provides BM25, Dirichlet prior smoothing language model and some other models. Check out here for more information. The code I used to pick out top 10 query results is shown below. By changing different rankers, one can choose to use different models.
Results and discussion for information retrieval models
The BM25 and language model with Dirichlet smoothing gave good performance for recipes retrieval. No matter models based on document or query, Chinese word segmentation is the key part. Without properly separating Chinese words from sentences, further study will have no sense.
The parameter k1 controls term frequency scaling. For some key ingredients, they usually has large frequencies in the recipe. In this case, using raw term frequency is reasonable. If one ingredients just occur one time, then it may be just the side dish. In this scenario, a binary model is more suitable. So I keep k1 to be at 1.2 to get a relatively stable performance and balance the key ingredient searching and side dish searching. Besides k1, b is also an important parameter in BM25 function because it controls document length normalization. As stated in introduction session, I would like to pick out doable recipes that has reasonable workload and should gave enough details at the same time. In order to achieve these goals, document length normalization becomes the key factor. I run experiment test on the influence of b. And the results shows that BM25 gave the best performance when b is set to be 0.6.
Language model with Dirichlet smoothing gave an overall better performance than BM25 function. Because it deals with terms that is not in documents.
As I stated above, I used the jieba package for Chinese word segmentation. There are several packages such as jieba, SnowNLP, THULAC, and NLPIR that can also do the similar job. The existing word segmentation methods can be divided into three categories: string matching based word segmentation method based on understanding and statistics based word segmentation method. The jieba package used the third method because this method does not need very large quantity language knowledge and information and could do a great job on recognizing ambiguity. The jieba library also provides several mode for word segmentation. One could try different methods for this part And find the best one for recipe search engine.
There are mainly three types of recommendation engine. The first kind is a general recommendation engine. The administrator of a certain website will recommend the same things which is calculated to be a popular item among all users. The second kind of engine is based on the data source, such as demographic characteristics, tag of the items. The third kind of engine is based on the user-item two dimensions matrix that describe the user preference. Here in the project, I used a simple mechanism to calculate the suggested items. Because the dataset only contains the recipes but no users data there. In a complicated system, there might be needed to find similar users and similar items and utilize both of information. But here we just want to calculate similarity between documents. There are several ways to calculate the similarity such as Euclidean distance, Pearson’s correlation coefficient, Cosine similarity and Tanimoto coefficient .