Shades of Grey: The Good, The Bad, and The Ugly Side of Web Scraping
Web scraping, data mining, cookie tracking… these are just some of the many tactics businesses use to collect information from other websites for their own financial gain. So if a company uses software to “scrape” information from a third-party website, is it legal? Is it ethical? Google itself offers web scraping tools, yet courts consistently rule against their users. Either way, the practice certainly raises serious security concerns.
There is no doubt that web scraping fuels the debate about technology and ethics. Like any technology, it brings many advantages to the table. But what happens when someone abuses it for malicious purposes?
The practice of web scraping sits squarely in the grey zone of legal and ethical business conduct. In this blog, I will try to bring some clarity to these murky waters by covering:
- What web scraping is
- How web scraping is used by perpetrators
- What you need to know to protect yourself
- How a major international corporation overcame a sophisticated attack
What is Web Scraping?
Web scraping is a method of mining data from websites: software extracts all the information available on a targeted site (usually in HTML format) while simulating human behavior. Each year, more and more businesses adopt web scraping tools as part of their business intelligence and advertising initiatives. In fact, web scrapers are responsible for an estimated 22% of the world’s internet traffic.
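To make the definition concrete, here is a minimal sketch of such a tool in Python: it fetches a page and extracts structured data from the returned HTML. The URL, the CSS selectors and the browser-like User-Agent are hypothetical; real scrapers typically add proxy rotation and human-like pacing.

```python
# Minimal web scraping sketch: fetch a page over HTTP and pull structured
# data out of the returned HTML. URL and CSS selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"   # hypothetical target page

def scrape_prices(url):
    # Scrapers usually mimic a regular browser in their request headers.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    items = []
    for product in soup.select(".product"):            # hypothetical markup
        items.append({
            "name": product.select_one(".name").get_text(strip=True),
            "price": product.select_one(".price").get_text(strip=True),
        })
    return items

if __name__ == "__main__":
    for item in scrape_prices(URL):
        print(item)
```

A script like this, run in a loop across thousands of pages, is all it takes to index a site’s entire visible catalogue.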
Web scraping makes it possible to index and structure vast amounts of data, which in turn enables statistical, behavioral and qualitative analyses. This is exactly where it becomes ambiguous: such large data assets can serve either good or malicious purposes. On one hand, web scraping facilitates information transparency and availability. On the other, it poses a risk of information abuse. There have been many legal battles to define the line between legitimate and nefarious uses of this technology, yet enforcement still lags behind.
Web scraping abuse and cyber-attacks:
One of the common uses of web scraping is hunting for online bargains. Scraping tools allow an individual to continuously monitor prices in any online shop and fire off numerous purchase requests the moment a price drop is detected. These requests can be generated by a human being or by a bot, and a bot is far more efficient: it can issue multiple requests (real or fake) per minute. The result may be an emptied online store inventory or simply a significant cut into its margins. Grey marketers can later resell these goods at a higher price.
Think about it… Have you ever tried to purchase concert tickets the second they went on sale online, only to find all the good seats already gone? Later, those same seats show up on ticket broker websites for five to eight times the price. That is a result of web scraping.
Alternatively, a competitor can harvest your company’s data using such software. They can gain access not only to public information but even to gated content on your website. Web scraping lets the competition receive real-time updates on pricing, promotions, product information and other plans. Such software, bundled with a dynamic IP service, is available on dark marketplaces for pennies.
Business risks may include:
- Revenue loss
- Temporary or permanent loss of visitors and customers
- Reputation loss
- Poor SLA and user experience
- Increase in web infrastructure cost
Protecting your organization:
Publish clear Terms & Conditions (T&C) and decide whether or not you allow scrapers to run on your site. Remember that any visible content can be scraped, so identify the sensitive data on your site and block any attempt to abuse it. Web scraping has some unique characteristics, even though scrapers are constantly getting better at simulating human behavior. For example, a data-focused scraping attack will normally target the specific web pages from which information can be extracted. Since perpetrators will in most cases use spoofed or dynamic IP addresses, the challenge is to recognize the footprint of the device launching the attacks.
Here are a few recommended steps to make a scraper sweat:
- Identify repetitive interest in specific web pages and limit requests and downloads (see the sketch after this list)
- Ban offending IP addresses or devices
- Return fake data
- Gate sensitive information and grant access only with authorization
- Consider converting data to images or Flash
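To illustrate the first step, here is a minimal sketch of per-client request limiting in Python, using Flask for the web layer. The window size, the threshold and the use of the client IP as the key are illustrative assumptions, not a definitive implementation; production deployments typically rely on a WAF or a dedicated bot-management solution.

```python
# Minimal sketch: detect repetitive interest in a page and throttle it with
# a sliding-window limit per client. Thresholds and the choice of the client
# IP as the key are illustrative assumptions only.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60            # hypothetical sliding window
MAX_REQUESTS_PER_WINDOW = 30   # hypothetical per-client threshold

request_log = defaultdict(deque)   # client key -> timestamps of recent requests

@app.before_request
def throttle_repetitive_clients():
    key = request.remote_addr      # naive client key; easily spoofed or rotated
    now = time.time()
    window = request_log[key]

    # Drop timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    window.append(now)
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        abort(429)   # Too Many Requests

@app.route("/pricing")
def pricing():
    return {"product": "example", "price": 19.99}
```

The same hook is a natural place to ban offending addresses outright or to start returning fake data once a client has been flagged.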
Web Scraping and Airline Tickets
For a major US-based airline, this type of cyber-attack occurred with alarming frequency. Bots were programmed to “scrape” certain flights, routes and classes of tickets. With the bots acting as faux buyers—continuously creating but never completing reservations on those tickets—the airline was unable to sell the seats to real customers. In essence, the airline’s inventory was held hostage, and a growing number of flights were taking off with empty seats that could have been sold. (Source: Radware’s 2015-2016 Global Application & Network Security Report)
How do you know if you are at risk?
Does your business run a high volume of online transactions? If so, you are at risk of becoming a target. Remember – these bots:
- Exhaust application resources
- Illegitimately scrape sensitive information
- Seek vulnerabilities by abusing application logic
To protect applications from advanced bots, or even from coordinated human attackers, website operators need more advanced user and client identification that can detect and block illegitimate users.
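As a rough illustration of what such identification can look like at the application layer, here is a minimal sketch in Python that scores a request by its header fingerprint. The markers and weights are illustrative assumptions and are no substitute for a dedicated bot-management product, which also draws on device fingerprinting, TLS signatures and behavioral analysis.

```python
# Minimal sketch of header-based client identification. Markers and weights
# are illustrative assumptions, not a production detection model.
KNOWN_BOT_MARKERS = ("curl", "python-requests", "scrapy", "headless")

def suspicion_score(headers):
    """Return a rough score; higher means the client looks more automated."""
    score = 0
    user_agent = headers.get("User-Agent", "").lower()

    if not user_agent:
        score += 3          # real browsers always send a User-Agent
    if any(marker in user_agent for marker in KNOWN_BOT_MARKERS):
        score += 3
    if "Accept-Language" not in headers:
        score += 2          # rarely missing from real browsers
    if "Referer" not in headers and "Cookie" not in headers:
        score += 1          # cold, stateless request

    return score

# A bare scripted request scores high, a browser-like request scores low.
print(suspicion_score({"User-Agent": "python-requests/2.31"}))            # 6
print(suspicion_score({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0)",
    "Accept-Language": "en-US",
    "Cookie": "session=abc",
}))                                                                       # 0
```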
At the end of the day, technology isn’t ‘good’ or ‘bad’; it is the human user who determines what it will do. Be aware of the risks, tune your security controls, and keep learning as threats evolve.