The database was available for anyone to access without a password.
On October 16, 2019, two dark web researchers, Bob Diachenko and Vinny Troia, discovered a database containing a massive trove of personal records of more than 1.2 billion people.
While looking for exposures through BinaryEdge and Shodan, they stumbled upon a server whose IP address could be traced to Google Cloud Services. In total, the database held over 4 terabytes of data sitting in plain sight for public access.
The database was hosted on an exposed Elasticsearch server. The good news is that the records did not include login credentials, Social Security numbers or payment card details. Details shared by the researchers indicate that the data was scraped from social media platforms including Twitter, Facebook, LinkedIn and GitHub, a Git repository hosting service.
Additionally, it contained approximately 50 million unique phone numbers and 622 million unique email addresses.
As for the structure of the data, four different data sets appear to have been combined: three labeled as originating from a San Francisco-based data broker called People Data Labs (PDL) and one from OxyData.
However, PDL has denied owning the server. Its co-founder, Sean Thorne, stated:
“The owner of this server likely used one of our enrichment products, along with a number of other data-enrichment or licensing services.”
OxyData, which boasts of holding 4 TB of user data including 380 million profiles, also denied owning the server. Most of the data attributed to it comes from LinkedIn and includes recruiter information.
Nonetheless, despite the denials by both companies, a comparison of the exposed data with their databases shows a near-complete match, suggesting that the data did at least originate from them. The researchers elaborate on the PDL data specifically in their blog post:
“The data discovered on the open Elasticsearch server was almost a complete match to the data being returned by the People Data Labs API. The only difference being the data returned by the PDL also contained education histories. There was no education information in any of the data downloaded from the server. Everything else was exactly the same, including accounts with multiple email addresses and multiple phone numbers.”
After the researchers reported the exposure to the FBI, the data was taken offline within a few hours. Although it seems likely, it is not clear whether the FBI was responsible, since the agency has not yet commented on these developments.
Randy Koch, CEO of ARM Insight, told HackRead, “This mass data exposure incident was very damaging to the data enrichment companies allegedly associated with ownership of the data, but it was even more damaging to the billions of people who had their PII exposed to the world.”
“This incident could’ve been prevented very easily if the data enrichment companies converted their user collected data into synthetic data. Synthetic data eliminates reputational, privacy, compliance and breach headline risks. It mimics real data while removing the identifiable characteristics of users. When properly synthesized, it cannot be reverse-engineered by hackers, yet it retains all the statistical value of the original data set – so it can still be used for analytics, marketing, customer segmentation, AI algorithms and more,” Randy explained.
“Organizations that utilize synthetic data can render data misuse, accidental exposure or abuse moot. Financial services and healthcare companies are already starting to use this technology to keep end-user data private – and frankly, all organizations, especially those that collect or store mass amounts of sensitive data, should follow their lead. Otherwise, these types of mass data exposure incidents will keep occurring,” warned Randy.
The takeaway from this episode is two-fold. Firstly, even though the exposed data may already be public, since it was collected from readily available social media profiles, exploiting it becomes much easier once it is gathered in one place. This can make the job easier for hackers, who can use such information to build toward their eventual targets.
Secondly, because of the complexity introduced by the different actors involved, it will be hard to hold any one party accountable. For example, even though the server's IP address traced back to Google Cloud, Google may not reveal the account owner’s information to law enforcement agencies in its bid to maintain user privacy.
Hence, the best we as consumers can do is limit the data we make available on any public platform so that it cannot be used by an attacker tomorrow.
Not for the first time
It is worth mentioning that Elasticsearch servers have a history of being exposed to the public, putting the personal data of unsuspecting users and businesses at risk. Earlier this year, personal information of more than 20 million Russian citizens was exposed on an Elasticsearch server.
Again in May this year, personal and payment card data, including CVV codes, of millions of Canadians was exposed after an Elasticsearch database owned by Freedom Mobile was leaked online.
In December last year, another database containing personal information of 82 million Americans was exposed online. Several other data leak incidents have also involved Elasticsearch servers.