Stefaan G. Verhulst
Co-Founder and Chief of Research and Development at The GovLab, New York University
David Sangokoya
Research Fellow at The GovLab, New York University

OpenUp Corporate Data While Protecting Privacy

October 30, 2014

Consider a few numbers: By the end of 2014, the number of mobile phone subscriptions worldwide is expected to reach 7 billion, nearly equal to the world’s population. More than 1.82 billion people communicate on some form of social network, and almost 14 billion sensor-laden everyday objects (trucks, health monitors, GPS devices, refrigerators, etc.) are now connected and communicating over the Internet, creating a steady stream of real-time, machine-generated data.

Much of the data generated by these devices is today controlled by corporations. These companies are in effect “owners” of terabytes of data and metadata. Companies use this data to aggregate, analyze, and track individual preferences, provide more targeted consumer experiences, and add value to the corporate bottom line.

At the same time, even as we witness a rapid “datafication” of the global economy, access to data is emerging as an increasingly critical issue, essential to addressing many of our most important social, economic, and political challenges. While the rise of the Open Data movement has opened up over a million datasets around the world, much of this openness is limited to government (and, to a lesser extent, scientific) data. Access to corporate data remains extremely limited. This is a lost opportunity. If corporate data—in the form of Web clicks, tweets, online purchases, sensor data, call data records, etc.—were made available in a de-identified and aggregated manner, researchers, public interest organizations, and third parties would gain greater insights on patterns and trends that could help inform better policies and lead to greater public good (including combatting Ebola).

Corporate data sharing holds tremendous promise. But its potential—and limitations—are also poorly understood. In what follows, we share early findings of our efforts to map this emerging open data frontier, along with a set of reflections on how to safeguard privacy and other citizen and consumer rights while sharing. Understanding the practice of shared corporate data—and assessing the associated risks—is an essential step in increasing access to socially valuable data held by businesses today. This is a challenge certainly worth exploring during the forthcoming OpenUp conference!

Understanding and classifying current corporate data sharing practices

Corporate data sharing remains very much a fledgling field. There has been little rigorous analysis of the different ways of sharing or of their impacts. Nonetheless, our initial mapping of the landscape suggests there have been six main categories of activity (i.e., ways of sharing) to date:

1. Research partnerships, in which corporations share data with universities and other research organizations. Through partnerships with corporate data providers, several research organizations are conducting experiments using de-identified and aggregated samples of consumer datasets and other sources of data to analyze social trends. For instance, Safaricom, one of Kenya’s leading mobile companies, shared a year of de-identified phone data with Harvard researchers to analyze and map how migration patterns contributed to the spread of malaria in Kenya.

2. Prizes and challenges, in which companies make data available to qualified applicants (including civic hackers, pro bono data scientists, and other expert users) who compete to develop new apps or discover innovative uses for the data. Last year, Spain’s regional bank BBVA hosted a contest inviting developers to create applications, services, and content based on anonymous card transaction data. The first prize went to an application called Qkly, which helps users manage time by estimating what time of day a given site or destination will be most crowded (thus helping users, for example, avoid lines).

3. Trusted intermediaries, where companies share data with a limited number of known partners for analysis, modeling, and other value chain activities. For example, companies from the consumer packaged goods, retail, and over-the-counter health care industries often share data with firms such as Information Resources, Inc. (IRI), a data analytics and strategy firm that provides business intelligence and predictive analytics solutions.

4. Application programming interfaces (APIs), which enable access to streams of corporate data for developers and others to conduct testing, product development, and data analytics. Major health insurance companies, such as Kaiser and Aetna, use APIs to create more integrated ecosystems across mobile applications and devices for consumers. Aetna’s CarePass API gives consumers access to their personal data to sync with wearable health platforms such as Fitbit or the Apple Watch.

5. Intelligence products, where companies share (often aggregated) data that provides general insight into market conditions, customer demographic information, or other broad trends. Google shares search query-based data in conjunction with data from the US Centers for Disease Control in order to estimate levels of influenza activity across the country over time.

6. Corporate data cooperatives or pooling, in which corporations (and other important data holders, such as government agencies) group together to create “collaborative databases” with shared data resources. For example, through its Accelerating Medicines Partnership, the US National Institutes of Health (NIH) is helping organize data pooling among the world’s largest biopharmaceutical companies in order to identify promising drug and diagnostic targets for Alzheimer’s disease, systemic lupus erythematosus, rheumatoid arthritis, and diabetes.

Assessing risks of corporate data sharing

Although shared corporate data offers several benefits for researchers, public interest organizations, and other companies, it also carries risks, especially regarding personally identifiable information (PII). When aggregated, data containing PII can help reveal trends and broad demographic patterns. But if PII is inadequately scrubbed and aggregated data can be linked back to specific individuals, the result can be identity theft, discrimination, profiling, and other violations of individual freedom. It can also lead to significant legal ramifications for corporate data providers.

Based on our initial research, we have found that most companies are aware of these risks and have taken steps to de-identify aggregated datasets. Such steps include partnering with academic experts and experimenting with new de-identification methods. It is important to point out, however, that there exist no industry standards or widely accepted best practices for de-identification of corporate data. Complete anonymization would of course provide the safest way to scrub datasets of PII, but it might also reduce the “granularity” and thus usefulness of the data.
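To make the trade-off between granularity and privacy concrete, here is a minimal sketch in Python of one common approach: drop direct identifiers, generalize the remaining fields, and suppress any group that contains too few records. The record layout (customer_id, postcode, age), the function names, and the threshold k are all assumptions chosen for illustration; they do not describe any company's actual practice.

```python
from collections import Counter

# Hypothetical record layout: {"customer_id", "postcode", "age"}.
def coarsen(record):
    """Drop the direct identifier and generalize the remaining fields."""
    decade = (record["age"] // 10) * 10
    return {
        "region": record["postcode"][:2],      # keep only a postcode prefix
        "age_band": f"{decade}-{decade + 9}",  # e.g. 34 becomes "30-39"
    }

def aggregate(records, k=5):
    """Count coarsened records per (region, age_band); suppress groups below k."""
    counts = Counter(
        (c["region"], c["age_band"]) for c in map(coarsen, records)
    )
    return {group: n for group, n in counts.items() if n >= k}

if __name__ == "__main__":
    # Illustrative data only: six similar customers and one outlier.
    sample = (
        [{"customer_id": i, "postcode": "10001", "age": 34} for i in range(6)]
        + [{"customer_id": 99, "postcode": "94105", "age": 71}]
    )
    print(aggregate(sample, k=5))
```

Raising k or shortening the postcode prefix makes individuals harder to single out but leaves coarser, less useful counts, which is the tension described above. Small-group suppression on its own does not rule out re-identification, so this should be read as a starting point rather than a sufficient safeguard.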

Participants at a recent Responsible Data Forum held at the Rockefeller Foundation in New York City suggested creating a “starter kit” (or “how-to guide”) for private sector companies aiming to open access to data while protecting privacy. In addition to this starter kit, companies, researchers, and governments could also start developing a safety ranking system based on a “taxonomy of harms.” More generally, more thought and discussion are required to determine de-identification methods and standards (including ways to prevent re-identification).

Mapping the next frontier

Beyond the broad taxonomies presented above, there exists almost no systematic analysis of the practice, risks, and impact of corporate data sharing. A more comprehensive mapping of the field of corporate data sharing is urgently needed. Such a mapping would draw on a wide range of case studies and examples to identify opportunities and gaps, evaluate risks, provide evidence of impact, determine best practices in de-identification techniques and privacy frameworks, and ultimately inspire more corporations to allow access to their data. “Opening Up” corporate data is the next frontier of open data. The potential societal benefits that could flow from accessing corporate data are tremendous—but they will only be realized when the public (consumers, citizens, and companies themselves) have solid evidence of those benefits as well as trust in the way data is shared and accessed.


This guest blog was written by Stefaan G. Verhulst, co-founder and chief of research and development at The GovLab, New York University, and David Sangokoya, research fellow at The GovLab, New York University.


