Scraping & Big Data


Web scraping (also called web harvesting or web data extraction) is a technique for extracting information from websites by means of programs. It often involves transforming unstructured data from web pages into structured databases for content analysis or reuse.

Programs that simulate human navigation are written and launched. As they visit web pages, these programs collect the required data and write it to files or databases. The data is then used for offline analysis.
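The extract-and-write step can be sketched in Python with the standard library alone. This is a minimal illustration, not a production scraper: the HTML snippet, the company names and phone numbers, the `listing`, `name` and `phone` class names, and the `ListingParser` and `scrape_to_csv` helpers are all hypothetical. A real program would download the pages over HTTP instead of reading an embedded string.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="name">Acme Srl</span><span class="phone">049 555 0101</span></li>
  <li class="listing"><span class="name">Beta SpA</span><span class="phone">02 555 0199</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects the text of the name and phone spans inside each listing item."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per listing found on the page
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "listing":
            self.rows.append({"name": "", "phone": ""})
        elif tag == "span" and css_class in ("name", "phone") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Text outside a tracked span (e.g. whitespace between tags) is ignored.
        if self._field:
            self.rows[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_to_csv(html):
    """Parse the page and return the extracted rows plus their CSV rendering."""
    parser = ListingParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "phone"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows, out.getvalue()


rows, csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

The unstructured page becomes a table with one row per listing, which can then be loaded into a database or a statistics package for offline analysis.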

In market research, web scraping is often used to collect contact information for specific targets. For example, for some business surveys, contact information is first collected from the Yellow Pages and then used in CATI (computer-assisted telephone interviewing) or CAWI (computer-assisted web interviewing) surveys.

Data collection on Social Networks, Twitter, Blogs ...

Big Data is information that users generate either involuntarily (for example, by browsing) or voluntarily (for example, by writing in blogs or on social networks). It also includes information generated by administrative operations, such as credit card payments.

Public Opinion Quarterly (POQ), the quarterly journal of AAPOR, dedicated a special issue to the past, present and future of research. Cooper has called this type of data “organic data”. Since hardly anyone uses this term, we will call it Big Data for clarity.

Big Data

Big Data does not replace statistical research, opinion polls or market research for one simple reason: it often contains little information about each case. The most commonly shared information is of this kind: likes for a certain brand, gender, age, location, time of post. The advantage is that Big Data can include tens of thousands of cases (e.g. Facebook likes), while a survey typically does not exceed 2,000 respondents. On the other hand, a questionnaire normally comprises dozens of questions, which makes it possible to analyze the relationships among the answers (e.g. which product do you consume, and for what reasons).

Another aspect to consider is two sources of bias in Big Data:

  • coverage: how much of the population does that particular social network cover?
  • measurement: how many members of that social network are willing to share (and make known) their opinion on a particular product?
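A small worked example shows how the two biases compound. The figures below are invented purely for illustration, and `observed_share` is a hypothetical helper, not an established estimator: it simply multiplies the share of the population that is on the network by the share of members who actually express an opinion.

```python
def observed_share(population, members, posters):
    """Return (coverage, participation, share of the population actually observed)."""
    coverage = members / population    # coverage: members of the network / population
    participation = posters / members  # measurement: members who share an opinion
    return coverage, participation, coverage * participation


# Hypothetical figures: 1,000,000 people, 400,000 on the network,
# 20,000 of whom publicly express an opinion on the product.
coverage, participation, observed = observed_share(1_000_000, 400_000, 20_000)
print(f"coverage={coverage:.0%} participation={participation:.0%} observed={observed:.0%}")
```

With these made-up numbers the network covers 40% of the population, but only 5% of its members express an opinion, so the opinions collected come from just 2% of the population. Nothing guarantees that this 2% resembles the remaining 98%.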

This service consists of collecting and systematizing the information that "surfers" leave on the web, voluntarily or involuntarily. For example: a like for a brand on Facebook, an opinion about a politician posted on Twitter or on blogs, or users' navigation paths from one site to another.
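As a rough sketch of what "systematizing" can mean in practice, the Python snippet below turns a handful of raw records into frequency tables. The records, the brand names, the field names and the `likes_by` helper are all hypothetical; the fields mirror the kind of information usually available (brand liked, gender, age, location).

```python
from collections import Counter

# Hypothetical raw records, one per collected "like".
records = [
    {"brand": "BrandA", "gender": "F", "age": 34, "location": "Padova"},
    {"brand": "BrandA", "gender": "M", "age": 41, "location": "Milano"},
    {"brand": "BrandB", "gender": "F", "age": 27, "location": "Roma"},
    {"brand": "BrandA", "gender": "F", "age": 52, "location": "Padova"},
]


def likes_by(records, *keys):
    """Count records for every combination of values of the given fields."""
    return Counter(tuple(record[key] for key in keys) for record in records)


by_brand = likes_by(records, "brand")
by_brand_gender = likes_by(records, "brand", "gender")

for (brand, gender), count in sorted(by_brand_gender.items()):
    print(brand, gender, count)
```

The same helper can cross any of the collected fields, for example `likes_by(records, "brand", "location")`, which is roughly the depth of analysis that this kind of data allows.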

This type of information is constantly growing, and it gives Demetra's techno-researchers an opportunity to provide a new service to their customers.
