Dialog Engine for Product Information
The implementation of this project has been divided into the following modules:
- Crawling and scraping the product information from Flipkart.
- Processing the scraped data.
- Saving the data in MongoDB (NoSQL database).
- Preprocessing the query.
- Querying the database and extracting the relevant results.
You can find the live application hosted here.
Crawling, Scraping & Processing
Scrapy and BeautifulSoup were used to crawl and scrape product data from Flipkart's website. The scraped categories are mobiles, televisions, laptops, air conditioners, refrigerators and cameras. In total, around 3000 products were extracted from these categories.
BeautifulSoup is a Python library for extracting data from HTML and XML pages.
MongoDB is a NoSQL database that stores large volumes of data under a flexible schema.
We maintain a separate collection for each product category. For example, mobiles and TVs are stored in different collections. This is an advantage while querying: once the category is known, we only need to search the corresponding collection.
The primary key of each document in MongoDB is the model name.
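The storage layout described above can be illustrated with a small sketch. It uses plain Python dictionaries as an in-memory stand-in for the MongoDB collections, so it runs without a database; the helper names (`save_product`, `find_product`), the field names and the sample products are all made up for illustration and are not taken from the project.

```python
# In-memory stand-in for the per-category collections (not real MongoDB).
# Category names, field names and values below are illustrative only.
collections = {"mobiles": {}, "televisions": {}}


def save_product(category, document):
    """Store a document in its category's collection, keyed by model name."""
    collections.setdefault(category, {})[document["model_name"]] = document


def find_product(category, model_name):
    """Look up a model in a single category's collection only."""
    return collections.get(category, {}).get(model_name)


save_product("mobiles", {"model_name": "Example Phone X",
                         "brand": "ExampleBrand",
                         "battery": "3000 mAh"})
save_product("televisions", {"model_name": "Example TV 40",
                             "brand": "ExampleBrand",
                             "screen": "40 inch"})

# Once the category is known, only that collection is searched.
print(find_product("mobiles", "Example Phone X")["battery"])
print(find_product("televisions", "Example Phone X"))
```

Keying each collection by model name mirrors the primary-key choice above: a lookup in the wrong category simply finds nothing.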
We have handled the following types of property-based queries:
- Template Based Query
- Natural Language Based Query
- Comparison Based Query
- We maintain three lists: a product name list, a brand name list and a property list.
- We extract the brand of the product from the given query by iterating through the brand name list using the edit distance algorithm, which also helps in handling spelling errors and typos.
- Each element of the product name list is a tuple of size 3: brand, model name and category.
- After extracting the brand name, we consider only the products of that brand for further processing.
- To determine the exact model name and the property name, we use a similar approach but combine the edit distance with a second similarity measure.
- The second similarity measure is calculated by dividing the maximum length of the two strings by the number of character matches between the strings.
- We take the harmonic mean of the edit distance score and the above metric to get a final similarity measure.
- We take the top 10 results for products and the single best match for the property.
- For every product name in the top 10, we query the database and display the corresponding results to the user.
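The matching steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the project's actual code: the edit distance is turned into a score as 1 - distance / longer length, "character matches" are counted as the multiset overlap of characters, and the second measure is taken as matches divided by the longer length (the reciprocal of the ratio as worded above) so that both scores lie in [0, 1] and a larger value means a closer match. The function names, the `max_distance` threshold and the sample data are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def combined_similarity(a, b):
    """Harmonic mean of an edit-based score and a character-match ratio."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    edit_score = 1.0 - edit_distance(a, b) / longest    # assumed normalisation
    shared = sum((Counter(a) & Counter(b)).values())    # assumed match count
    match_score = shared / longest
    if edit_score + match_score == 0:
        return 0.0
    return 2 * edit_score * match_score / (edit_score + match_score)


def extract_brand(query, brands, max_distance=2):
    """Return the brand closest to any query token, tolerating small typos."""
    best = None
    for token in query.lower().split():
        for brand in brands:
            distance = edit_distance(token, brand.lower())
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, brand)
    return best[1] if best else None


def rank_products(query, products, brand, top_n=10):
    """Keep products of the given brand and rank them by model-name similarity."""
    candidates = [p for p in products if p[0] == brand]
    candidates.sort(key=lambda p: combined_similarity(query, p[1]), reverse=True)
    return candidates[:top_n]


def best_property(query, properties):
    """Pick the property whose name best matches some token of the query."""
    tokens = query.lower().split()
    return max(properties, key=lambda prop: max(
        (combined_similarity(tok, prop) for tok in tokens), default=0.0))


# Illustrative data: (brand, model name, category) tuples, as in the list above.
products = [("Samsung", "Galaxy S7", "mobiles"),
            ("Samsung", "Galaxy J5", "mobiles"),
            ("Sony", "Xperia Z5", "mobiles")]
brands = ["Samsung", "Sony", "Apple"]
properties = ["battery", "display", "camera"]

query = "samsang galaxy s7 battery"   # note the misspelt brand
brand = extract_brand(query, brands)
print(brand, rank_products(query, products, brand), best_property(query, properties))
```

In this sketch the misspelt brand is still resolved because its edit distance to the correct brand name is small, and filtering by brand before ranking keeps the more expensive similarity computation to a short candidate list.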
#IIIT-H #IRE #Major_Project #Information_Retrieval_and_Extraction_Course #Dialog_Engine #Flipkart #NLP #StopWordDetection #Tokeniser #Tokenisation #Keywords_Identification #Crawling #Scraping #Python #MongoDB #BeautifulSoup #Scrapy