In October 2015, the alternative-data company Eagle Alpha published its research on GoPro, the popular action-camera company, with many of its findings derived using web-scraping techniques. “The data from US electronics websites pointed to potential weakness in GoPro revenue for the third quarter of that year,” Eagle Alpha noted. “The crawled data was showing weak demand for GoPro’s products, and a negative mix shift to lower-end products that was likely to impact average selling prices. The report also highlighted weakness in the ranking of bestselling cameras, including the Session product which had recently been released.”
Despite 68 percent of analyst recommendations at the time rating the company a “buy”, Eagle Alpha insisted, correctly, that it would underperform and miss its targets for the quarter. Further reports by Eagle Alpha only reinforced its position that demand for GoPro was weakening and that the average selling price remained under pressure. As Nicholas Woodman, chief executive officer of GoPro, acknowledged at the time, “While we experienced strong year-over-year growth, this quarter marks the first time as a publicly traded company that we delivered results below the expectations that we outlined in our guidance.” Ultimately, web-scraped data proved crucial in identifying this underperformance well ahead of traditional research methods.
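To make that concrete, the short sketch below shows the kind of calculation such a finding rests on: deriving a revenue-weighted average selling price (ASP) and the low-end share of units from scraped retailer listings. The figures, model mix and column names are invented for illustration, and this is not Eagle Alpha’s actual dataset or methodology; the sketch assumes the third-party pandas library.

```python
"""Toy illustration of spotting a negative mix shift in scraped listings.
All data below is invented; it is not Eagle Alpha's dataset or method."""
import pandas as pd

# Hypothetical quarterly records scraped from retailer product pages.
listings = pd.DataFrame({
    "quarter": ["2015-Q2"] * 3 + ["2015-Q3"] * 3,
    "model":   ["HERO4 Black", "HERO4 Silver", "HERO"] * 2,
    "price":   [499.0, 399.0, 129.0, 449.0, 349.0, 119.0],
    "units":   [100, 120, 80, 60, 80, 220],  # e.g. estimated from rank data
})

def summarise(g: pd.DataFrame) -> pd.Series:
    units = g["units"].sum()
    return pd.Series({
        # Revenue-weighted average selling price across all models.
        "asp": (g["price"] * g["units"]).sum() / units,
        # Share of units coming from entry-level (sub-$200) models.
        "low_end_share": g.loc[g["price"] < 200, "units"].sum() / units,
    })

by_quarter = listings.groupby("quarter")[["price", "units"]].apply(summarise)
# A falling ASP and a rising low-end share together signal a negative mix shift.
print(by_quarter)
```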
Web scraping refers to the process of harvesting data from public websites, typically using automated software (bots) that can identify what might be deemed valuable to the end user, such as a hedge fund. According to web-scraping specialist Scrapinghub, the typical web-scraping process works in two distinct steps. First, a web crawler, or “spider”, leads the process by using artificial intelligence (AI) to browse the internet and index relevant website content. This content is then passed on to the web scraper, which is “a specialized tool designed to accurately and quickly extract data from a web page”. (A minimal sketch of this two-step process follows below.)

Eagle Alpha, founded in 2012, is one of the world’s leading companies in the alternative-data space and among the biggest proponents of web scraping. According to the firm’s director of data insights, Ronan Crosson, who spoke to Forbes in December 2019, the firm has compiled a taxonomy of 24 different types of alternative data, with the most commonly deployed alternative datasets being “web scraped data, credit card data and consumer sentiment data”. And today, companies such as Eagle Alpha and others are scraping data from a variety of online sources.
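As a rough illustration of Scrapinghub’s two-step description, the Python sketch below pairs a tiny crawler (the “spider”) with a scraper that extracts structured product data. The URL and CSS selectors are hypothetical placeholders, and the third-party requests and beautifulsoup4 packages are assumed; real-world crawlers add error handling, politeness rules and far more robust parsing.

```python
"""Minimal crawl-then-extract sketch. URL and selectors are hypothetical.
Requires the third-party `requests` and `beautifulsoup4` packages."""
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEED_URL = "https://www.example-electronics-retailer.com/cameras"  # hypothetical

def crawl(seed_url, max_pages=10):
    """Step 1, the 'spider': follow pagination links and index page content."""
    to_visit, seen = [seed_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Queue any "next page" links found on the current page.
        for link in soup.select("a.next-page"):  # hypothetical selector
            to_visit.append(urljoin(url, link["href"]))
        yield url, html
        time.sleep(1)  # be gentle with the server

def scrape(html):
    """Step 2, the scraper: extract structured product data from one page."""
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select("div.product"):  # hypothetical selector
        yield {
            "name": item.select_one("h2.title").get_text(strip=True),
            "price": item.select_one("span.price").get_text(strip=True),
        }

for url, html in crawl(SEED_URL):
    for product in scrape(html):
        print(url, product)
```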
Social-media platforms, for instance, are being heavily monitored to identify signs of changes in the sentiments being expressed, especially now that companies are increasingly announcing their latest developments on the likes of Facebook and Twitter. Indeed, Twitter even partnered with Bloomberg in 2017 to enable the popular financial-news company to deliver a real-time feed of curated Twitter data to enterprise clients, who can then integrate the data into their trading algorithms. “Our customers tell us that Twitter data is a vital part of their information-driven trading strategies, helping them uncover early trends and changes in sentiment,” Tony McManus, chief information officer of Bloomberg Enterprise Data, said at the time. “Our Twitter EDF (Event-Driven Feeds product) will help quantitative traders to capitalize on Twitter’s influence on the markets through constantly evolving curation methodologies. These include proprietary NLP [natural language processing] modelling, coupled with Bloomberg’s reputation for data quality and the expertise of a world-class news organization.”
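Bloomberg’s curation and NLP models are proprietary, so the toy sketch below shows only the general idea behind such feeds: scoring the sentiment of company-related messages with a word lexicon and aggregating the scores into a simple signal. Every word list, weight and message here is invented for illustration.

```python
"""Toy lexicon-based sentiment signal. Bloomberg's actual EDF curation and
NLP models are proprietary; all words and messages here are invented."""
POSITIVE = {"beat", "strong", "upgrade", "growth"}
NEGATIVE = {"miss", "weak", "downgrade", "recall"}

def sentiment(message: str) -> int:
    """Score one message: +1 per positive word, -1 per negative word."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def aggregate_signal(messages: list[str]) -> float:
    """Average sentiment across a stream of messages about one ticker."""
    return sum(sentiment(m) for m in messages) / len(messages)

stream = [
    "analysts upgrade $XYZ on strong holiday demand",
    "$XYZ may miss q3 guidance as demand looks weak",
]
print(aggregate_signal(stream))  # > 0 bullish, < 0 bearish in this toy model
```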
Today, web scraping is arguably proving most useful to the hedge-fund industry, which has ramped up its spending on alternative datasets in recent years to gain deeper insights into existing and prospective investments. Indeed, a growing number of fund managers now pay dedicated web-scraping companies to deliver such data in the hope of making more informed investment decisions. According to research from Greenwich Associates in 2018, 50 percent of institutional investors planned to increase their use of alternative data over the following year; of the specific types of alternative data, web-scraped data proved the most popular. The data scraped from websites can take various forms, Greenwich noted, including “product pricing, search trends, insights from expert networks, and web traffic data”.
But the practice of web scraping is not without its own set of risks. For example, some data collected through web scraping may be considered material nonpublic information (MNPI), according to New York law firm Proskauer. “If that data were collected in a manner considered deceptive, then trading on that information might implicate the anti-fraud provisions of the securities laws,” Proskauer stated in its 2017 “Annual Review and Outlook”. And should web scrapers attempt to circumvent security protocols or disguise their identities, such techniques could be considered “deceptive devices”, which would put them in violation of the Securities Exchange Act (SEA).
Moreover, Proskauer pointed out that website owners, once apprised of unwanted data-collection activities, may well implement measures such as IP (internet-protocol) address blocks and “cease and desist” letters to expressly revoke a web scraper’s access to their websites. Disputes may also arise when site owners seek to block scrapers from engaging in a handful of additional activities, Proskauer observed, including:
- extracting content from the relevant site for the third party’s commercial or competitive use,
- copying content protected by copyright,
- extracting content in contravention of the site’s terms of service or technical measures,
- disrupting the site’s operations through scraping activities, for example by causing a website outage through excessive crawling, spamming users or incurring extra IT (information technology) costs for the owner (a sketch of a “polite”, rate-limited crawler that avoids such disruption follows this list).
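One practical way scrapers try to stay on the right side of that last point is to crawl “politely”. The sketch below, using only Python’s standard library and a hypothetical target site, honours robots.txt, identifies the bot honestly and rate-limits requests; none of this, of course, settles the legal questions above.

```python
"""A 'polite' fetcher: honours robots.txt, identifies itself, throttles.
The target site and user-agent string are hypothetical placeholders."""
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "example-research-bot/0.1 (contact@example.com)"  # honest identity
BASE = "https://www.example.com"  # hypothetical site

# Fetch and parse the site's robots.txt once, up front.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(BASE + "/robots.txt")
robots.read()

def polite_fetch(path: str, delay: float = 2.0):
    """Fetch a page only if robots.txt allows it, pausing between requests."""
    url = BASE + path
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site has disallowed this path for crawlers
    time.sleep(delay)  # throttle to avoid disrupting the site's operations
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()

page = polite_fetch("/listings")
print("fetched" if page else "disallowed or skipped")
```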
But is there a sufficient regulatory framework in place to protect consumers from having their personal information scraped without authorisation? “Despite the relative maturity of e-commerce, the legality of automated data collection is still unsettled,” the Hedge Fund Law Report (HFLR) stated in 2018. And while there have been many cases examining scraping disputes under various state and federal statutes, the HFLR pointed out that the law is not uniform and that past decisions have been “fact-specific in nature”. For example, the popular classified-advertisements website Craigslist sued the online used-car-listing service Instamotor for scraping the Craigslist website to create Instamotor’s own listings and for sending unsolicited emails to Craigslist users for promotional purposes. In the end, Instamotor was ordered to pay a hefty $31 million to Craigslist for breaching its terms of use.
In some cases, however, the courts have ruled in favour of the web scrapers, perhaps most famously in the 2017 case of workforce-analytics firm hiQ Labs vs. LinkedIn. The case saw the popular professional-networking service invoke the US federal Computer Fraud and Abuse Act (CFAA) in a cease-and-desist letter demanding that hiQ stop scraping data from LinkedIn’s servers. The CFAA itself is a statute that imposes liability on anyone who “intentionally accesses a computer without authorization, or exceeds authorized access, and thereby obtains…information from any protected computer”. However, the court barred LinkedIn from denying hiQ access, reasoning that no authorisation is required to access publicly available profiles.
“Ultimately the bottom line is scraping data is big business, and it’s only going to get bigger as hedge funds begin to establish it as an industry-standard tool,” Daniel Ni, the founder of web-scraping tool Scraper API, observed in December 2019. “As time goes by, more and more hedge funds are realising that web scraping is not only hugely important, but it’s probably a mandatory practice at this point if they want to keep pace with their more savvy competition. Having access to the online activity of consumers is an absolutely indispensable tool for someone whose job it is to predict where those consumers are going to move their money.” As the competition for web-scraped data continues to heat up among hedge funds, therefore, there will only be more such legal issues to confront. But given the massive potential for web-scraped data—and alternative data as a whole—to produce useful insights, it would seem that we are only at the beginning of realising just how much value this process can generate.