Web scraping, screen scraping and web data extraction are all terms for the software techniques used to extract information from websites. You can scrape a blog post in which a company or product is mentioned, lists of classified ads, or even entire product catalogues. The point of the technique is not to copy whole pages, but to extract specific, selected pieces of information from them.
Scraping can be done from one website or from thousands.
Scraping can also be done manually, by cutting and pasting. Sometimes that is all that's needed. It can also be done automatically using various kinds of software usually referred to as “bots”. A bot can be developed by a programmer or set up using various off-the-shelf tools.
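To make the bot idea concrete, here is a minimal sketch of the extraction step in Python, using only the standard library's `html.parser`. The HTML snippet and the idea of scraping classified-ad links are invented for illustration; a real bot would fetch the page over HTTP first and would need to respect the site's terms and robots.txt.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collects the text and href of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only collect text while we are inside an <a> tag
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((text, self._current_href))
            self._current_href = None

# A made-up fragment standing in for a page of classified ads
html = '<ul><li><a href="/ad/1">Used bike</a></li><li><a href="/ad/2">Sofa</a></li></ul>'
scraper = LinkScraper()
scraper.feed(html)
# scraper.links now holds only the selected information, not the whole page:
# [('Used bike', '/ad/1'), ('Sofa', '/ad/2')]
```

In practice you would swap the hard-coded string for the response body of an HTTP request, but the core idea is the same: the bot walks the page structure and keeps only the fields you care about.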
While some people refer to web scraping and web harvesting as the same thing, I see web scraping as the extraction process, and web harvesting as the aggregation, transformation and monitoring of the information from a multitude of bots. The transformation process is also referred to as normalization of the data. This process might include filtering for errors or undesired information or actual “translating” of the information from one format to another using mapping or synonym mechanisms.
If you are about to embark on an aggregation project, make sure you understand whether you only need to scrape information or whether you also need to normalize it. Understanding the level of complexity of the needed normalization is crucial.