Data extraction is a subtask of the broader field of information extraction, where it often serves as an intermediate step. Data extraction can involve relational information extraction. Information extraction also encompasses other tasks, such as named entity recognition. In information extraction, we consider each document individually and focus on what that document contains.
The goal is always to take in the document quickly and find its important features. In data extraction, by contrast, we may want to measure the similarity between two documents, that is, to find a relation between them. This is how data extraction differs from information extraction.

The following are a few applications of information extraction that can help organizations:

Web Analytics: Information can be extracted about visitors, their demographics, and their preferences.
Ad Placement on Web Pages: An e-commerce website may want to place advertisements for a product next to text that expresses a positive opinion about it. Information extraction is used to identify the products and the type of opinion expressed.
Comparison Shopping: Information extraction helps build comparison shopping websites, which automatically crawl merchant websites to find products and their prices.
News Tracking: Specific event types, such as disease outbreaks, are tracked automatically from news sources.
Customer Care: Customer-oriented enterprises collect large amounts of unstructured data. Structured information extraction is used to identify product names and attributes in customer emails and to link those emails to specific transactions in the sales database.
Data Cleaning: Large organizations such as banks, universities, and telephone companies store millions of addresses. In their original form, these addresses have little explicit structure. Extraction puts them into a standard canonical format in which all the different fields are identified.
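
To make the data cleaning case concrete, here is a minimal sketch of turning a free-form address into named fields. It assumes a simple "number street, city, state zip" layout, and the names `ADDRESS_RE`, `STREET_ABBREVIATIONS`, and `canonicalize_address` are hypothetical, chosen only for this illustration. Real address data is far messier and usually calls for a trained extraction model rather than a single regular expression.

```python
import re
from typing import Optional

# Hypothetical pattern for simple addresses such as
# "12 main st., springfield, il 62704". This layout is an assumption.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+             # house number
    (?P<street>[^,]+?)\s*,\s*      # street name, up to the first comma
    (?P<city>[^,]+?)\s*,\s*        # city, up to the second comma
    (?P<state>[A-Za-z]{2})\s+      # two-letter state code
    (?P<zip>\d{5})                 # five-digit postal code
    \s*$""",
    re.VERBOSE,
)

# Assumed abbreviation table; a real system would need a much longer one.
STREET_ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def canonicalize_address(raw: str) -> Optional[dict]:
    """Split an address into named fields, or return None if it does not match."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    words = []
    for word in match.group("street").split():
        # Expand known abbreviations; otherwise just normalize capitalization.
        key = word.rstrip(".").lower()
        words.append(STREET_ABBREVIATIONS.get(key, word.capitalize()))
    return {
        "number": match.group("number"),
        "street": " ".join(words),
        "city": match.group("city").title(),
        "state": match.group("state").upper(),
        "zip": match.group("zip"),
    }
```

Calling `canonicalize_address("12 main st., springfield, il 62704")` returns a dictionary with the fields `number`, `street`, `city`, `state`, and `zip` filled in, while an input that does not fit the assumed layout returns `None` so it can be routed to manual review or a more robust extractor.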