Photo: Adobe Stock

Taming the Wild West of Big Data

Feb 26, 2021


In the era of Big Data, where public and private data are abundant and keep growing, everyday residents might not realize how much data they generate throughout their days, in ways that are both innocent and serious, visible and covert. That data reveals both mundane and intimate details about their habits, movements, and lifestyles.

Every time a person uses an app to order a meal, log their 10,000 steps, look up driving directions, buy a coffee, or report a public works issue, they’re generating data. As consumers, we appreciate the conveniences this connected world provides. We accept the tradeoff that we’re surrendering a bit of personal data in exchange for a simple convenience. However, consumers accept this tradeoff with the belief that these data holders and new users of this data will act in a principled way. Some may, some may not. Therein lies the rub.

The relatively new practice of processing all of this data goes beyond long-standing disciplines like statistics and survey analysis. Technologists now use Machine Learning and Artificial Intelligence to process these large volumes of information and try to make them actionable and insightful.

Indeed, the combination of huge amounts of data plus powerful computing leads to an ethical fork in the road. Cue the literary references: We’ve opened Pandora’s Box. We’ve created a Frankenstein of data. With great power comes great responsibility.

But in real-life practice, we cannot shrug off our ethical obligations with a metaphor.

Privacy is a real issue with real risks. This is an issue worthy of thoughtful, nuanced debate to balance the public’s interest against such risks. The world of passive data has largely operated in an unregulated, Wild West fashion for too long. A common misperception among users of data is that privacy and quality insights are somehow at odds with each other.

At Replica, we believe that in order to gather useful insights — such as timely and informed decisions on how to keep residents and essential workers moving and safe during COVID — we shouldn’t need to sacrifice individual privacy.

Absent regulation or standards, the scales of data insights vs. privacy will continue to operate in a gray area where the holders and consumers of this data are left to self-police. Without clear rules of engagement, the next era of data usage requires ethical leadership and the courage to do the right thing.

At Replica, we are technologists, data scientists, and former public officials. Because this field is left to self-police, we cannot, and will not, compromise on our ethical leadership when it comes to privacy protection.

We use Replica’s technology to build models from different data sources independently, so that potentially identifying details of any individual are abstracted out before we combine these models into our aggregate outputs. We never attempt to re-identify individuals from our source data, and our contracts forbid our users from doing so as well.

We also have a goal at Replica to level the playing field between the public and private sectors. We’ve seen how the public sector frequently finds itself at the negotiation table with the large, data-rich companies that are behind popular consumer services. These companies have so much data that they know more about what’s happening in the city than the city itself. And, because that app data is first-party, self-collected data, the companies can claim to protect their own users’ privacy while still leveraging that data to shape public policy and public opinion. We believe this private vs. public information asymmetry is fundamentally unfair.

This asymmetry has put public agencies at a significant disadvantage. They are unequipped, from a data and tooling perspective, to negotiate or regulate in effective, equitable ways, which potentially undermines democratic norms and the public institutions that uphold them.

Replica has developed a way to offer the same powerful tools and data to public agencies; the difference is that we do it in a way that doesn’t compromise privacy or put the public agency at risk. In simplified terms, we build “synthetic data”: computer-generated data that preserves properties of the original data without disclosing the original raw data itself.
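To make the idea of synthetic data concrete, here is a minimal Python sketch. It is an illustration only, not Replica’s actual method: the field names (`device_id`, `home_zone`, `mode`), the helper functions, and the toy records are all hypothetical. It counts how often each combination of coarse attributes appears, then draws brand-new records from those counts, so the output follows the aggregate pattern without carrying over any identifier.

```python
import random
from collections import Counter

def fit_aggregate(records, fields):
    """Count how often each combination of coarse attributes occurs.

    Only the listed fields are read; identifiers are never stored.
    """
    return Counter(tuple(r[f] for f in fields) for r in records)

def sample_synthetic(counts, fields, n, seed=0):
    """Draw n synthetic records in proportion to the aggregate counts."""
    rng = random.Random(seed)
    combos = list(counts)
    weights = [counts[c] for c in combos]
    draws = rng.choices(combos, weights=weights, k=n)
    return [dict(zip(fields, combo)) for combo in draws]

# Hypothetical raw records; device_id stands in for any identifying field.
raw = [
    {"device_id": "a1", "home_zone": "north", "mode": "bus"},
    {"device_id": "b2", "home_zone": "north", "mode": "car"},
    {"device_id": "c3", "home_zone": "north", "mode": "bus"},
    {"device_id": "d4", "home_zone": "south", "mode": "car"},
    {"device_id": "e5", "home_zone": "south", "mode": "walk"},
]

FIELDS = ("home_zone", "mode")
counts = fit_aggregate(raw, FIELDS)
synthetic = sample_synthetic(counts, FIELDS, n=1000, seed=42)
```

A sketch like this is not a privacy guarantee on its own. With very small counts, a rare combination of attributes could still point to one person, so a production system would need further safeguards, such as suppressing small cells.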

Given the relative newness of this issue and the highly technical and nuanced elements of privacy, many public agencies act as if the need for more data should supersede the need to protect privacy. The reality is that Replica has shown you don’t need to make this tradeoff.

With that, here is the set of privacy principles we hold ourselves to, in all of the work that we do. We encourage public agencies to raise the bar and hold the companies from which they source information to the same standard:

  • Always use de-identified data, and apply additional internal de-identification measures
  • Use synthetic populations so that behavior is matched in aggregate, but never copy the specific behavior of a real person from original, ingested data
  • Build models from different data sources independently so that potentially identifying details of any individual are abstracted out before combining these models into our aggregate outputs
  • Never join data sources on keys containing sensitive data
  • Never attempt to re-identify individuals from our source data, and forbid users from doing so as well in all contracts
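Two of these principles, building from each source independently and never joining on sensitive keys, can be sketched in a few lines of Python. This is a hedged illustration, not Replica’s implementation: the `SENSITIVE_KEYS` set, the helper names, and the sample records are assumptions made for the example. Each source is first reduced to zone-level totals on its own, and the combining step refuses any key on the sensitive list.

```python
from collections import defaultdict

# Hypothetical list of fields that must never be used as a join key.
SENSITIVE_KEYS = {"device_id", "card_id", "email", "phone", "name"}

def check_key(key):
    """Raise if the requested key could identify an individual."""
    if key in SENSITIVE_KEYS:
        raise ValueError(f"refusing to use sensitive key: {key!r}")

def aggregate_by(records, key, value):
    """Sum `value` per `key` for one source, dropping every other field."""
    check_key(key)
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[value]
    return dict(totals)

def join_aggregates(left, right, key):
    """Combine two already-aggregated sources on a non-sensitive key."""
    check_key(key)
    shared = sorted(set(left) & set(right))
    return {k: (left[k], right[k]) for k in shared}

# Two hypothetical sources, each with its own identifier column.
mobile = [
    {"device_id": "a1", "zone": "north", "trips": 2},
    {"device_id": "b2", "zone": "north", "trips": 1},
    {"device_id": "c3", "zone": "south", "trips": 4},
]
fares = [
    {"card_id": "x9", "zone": "north", "boardings": 5},
    {"card_id": "y8", "zone": "south", "boardings": 3},
    {"card_id": "z7", "zone": "east", "boardings": 2},
]

# Each source is aggregated independently before anything is combined.
trips_by_zone = aggregate_by(mobile, "zone", "trips")
boardings_by_zone = aggregate_by(fares, "zone", "boardings")
joined = join_aggregates(trips_by_zone, boardings_by_zone, "zone")
```

Because the identifiers are dropped before the two sources ever meet, there is no row-level link between a device and a fare card in the combined output.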

We are in the process of creating an open-source template for privacy-protecting vendor agreements that will codify these principles. We hope this template will serve as a guide to all those who share our commitment to privacy. If you are interested in this process, we encourage you to send an email to

We call on other trailblazers in our field, those who are mining piles of data for nuggets of insight, to join us as we raise the bar for the ethical use of data and define a high standard from the outset. When tempted to compromise for the sake of a juicy insight or even a laudable public policy goal, absent clear regulations or standards, we must stand firm on our privacy principles and remember that we can still uncover useful information that is ethically sourced and privacy-protected.