Let us have data for breakfast together

Shannon Ang, Assistant Professor of Sociology at Nanyang Technological University, is hungry for more data — here he suggests how we can better harness it for Singapore’s collective good.

How likely is it that children of poor parents in Singapore are also poor when they grow up? Is it true that companies that hire primarily foreigners at first gradually increase their proportion of local employees over time? What are the family circumstances of older adults who do low-wage work? Whose livelihoods have been most affected by the COVID-19 pandemic?

These questions have one thing in common: they can be addressed far better with data than by letting our biases and anecdotes drive public discourse. This article asks: how can we harness data to better understand our society?

Data is tremendously valuable. By mastering the art of harvesting personal information, Facebook and Google have become two of the world’s most profitable companies. The Singapore government has happily hopped onto the bandwagon: in the near future, it aims to have 20,000 public officers trained in data analytics, with public agencies embarking on at least 10 cross-agency “high-impact data analytics” projects a year.

*SM Tharman on data for breakfast. Screenshot of Facebook post.*

Deputy Prime Minister Heng Swee Keat tells us that he wants to co-create policies with us. However, without access to the data held by public agencies, it is easy for ordinary citizens to be brushed aside and told in vague terms that we “don’t understand” the situation or that a particular solution is “not practical”. This information asymmetry is not ideal for Singapore, especially if we wish to “harness our diversity as a strength”.

As a social scientist with a keen interest in understanding Singapore society better with quantitative data, I propose a few key principles for us to forge ahead. These suggestions concern the types of data I work with (i.e., survey data, administrative data), but likely extend to other forms, such as archival data. Senior Minister Tharman’s “data for breakfast” analogy is rather useful here: how can we cook up great-tasting but healthy breakfasts (of data) that benefit all Singaporeans?

*KPIs for the government’s digital plan. Screenshot from CSC website.*

It’s better when shared

“Data is not like oil, where if I consume it, I deprive you of consuming it. In fact, data has an increasing rate of return – where the more I use it together with you, the more value we create out of it.”
– Minister Chan Chun Sing

The government understands the important role of data in governance. It expects public officers to harness data to “measure the effectiveness of policies and interventions”, as well as to “spend more time analysing and designing solutions to key challenges”. These are laudable goals—policy-making should indeed be based on empirical evidence, not the whims and fancies of office holders.

Conversely, policy proposals coming from outside the government become toothless without data to back them up. The idea of co-creating policies will thus remain a distant pipe dream unless both public officers and the citizenry can rely on the same facts and wield the same data to craft their proposals.

Minister Chan is right about the value of sharing data. The government can and should lead the way in this, as others like Lim Sun Sun, Kishore Mahbubani and Pritam Singh have previously highlighted. True, many will interpret and analyse shared data in different (and sometimes contradictory) ways. But some interpretations will be more plausible and some analyses more robust than others; having data as common ground can provide a rich tapestry of ideas for open and vigorous debate.

Yet sharing data is not just about generating new insights; it is also about being sure of what we have found. We cannot simply assume that data will always be error-free or be presented in an unbiased manner, even when it comes from seemingly authoritative sources such as government agencies, news reports or academic studies. The “replication crisis” in psychology reminds us that reproducibility is an important part of empirical work. Policies and interventions must not be based on findings that result from someone intentionally “cooking the books” or from unintentional coding mistakes. Data sharing enables us to check on each other, resulting in fewer opportunities to turn “evidence-based policy-making” into “policy-based evidence-making”. The difference is subtle, but key. The former is an effort to arrive at policy positions inductively, allowing data to drive the process. The latter, in contrast, manipulates data to suit one’s desired argument (sometimes called “p-hacking” in academic circles). It is near impossible, however, to tell the difference between the two from just their results, unless independent parties have access to the same data.

*An example of “policy-based evidence-making”. Screenshot from Twitter.*

“Sharing is caring”—until it isn’t

The potential value of data also means that we should spare no effort in data protection. How can we safely share data without compromising privacy? This was the topic of my recent exchange with the Smart Nation and Digital Government Office in the Straits Times Forum. There are two key principles we must recognise:

1. Having data is having power

There are always power dynamics involved when sharing data. Its possession is often a function of positions of power. Minister Chan humbly appeals to multinational companies by saying that “we are all richer if we are able to share the data”, but patronises opposition leader Pritam Singh when the latter asks in Parliament for data on the distribution of jobs in Singapore, famously asking him: “What is the point behind the question?” Why is there such a substantial difference in tone? Notice that Minister Chan plays the role of data requestor in the first case, and data provider in the second.

Releasing data is thus not a neutral act. Kishore Mahbubani sees sharing data as an issue of trust, because sharing de-identified government data with the public is akin to sharing power (which perhaps explains why there is so much reticence in doing so). However, releasing data that identifies individuals for the purposes of hacking, doxxing or shaming is malicious and constitutes an abuse of power.

These power dynamics may play out in more nuanced ways. For instance, some were uncomfortable with how the Ministry of Social and Family Development recently made public the details of a low-wage elderly cleaner in a bid “to preserve the public’s trust in public agencies”. This may have been perfectly legal, but the huge power differential between the Ministry and the vulnerable cleaner leads one to ask: was it really necessary? Surely there must be ways to restore public trust without compromising the individual’s privacy so crudely? These important questions highlight the power dynamics involved. Notably, Natalie Pang suggests that some information was excessive and could have been withheld, which brings us to the next point.

2. Sharing data is not all-or-nothing

*“Only a Sith deals in absolutes.” – Obi-Wan Kenobi*

Sharing data is not an all-or-nothing endeavour: it is not a choice between sharing all the data (with or without identifiers) in its unadulterated form, or sharing nothing at all. Data sharing efforts should be broad (i.e., across multiple sectors) and provide sufficient granularity to be useful to researchers, while still protecting the privacy of individuals (i.e., it should prevent re-identification).

The conversation around data sharing often ends up in an unnecessary impasse about whether or not to implement a (UK-style) Freedom of Information Act. This unfortunately tends to reproduce the all-or-nothing misunderstanding. Passing a calibrated form of the Freedom of Information Act in Parliament may be desirable at some point, but government agencies can start simply by sharing survey data. We can protect personal privacy by removing identifiers and making certain measures coarser (e.g., providing age in 5-10 year bands instead of single years). Data can then be hosted on a secure platform and made available to users who register and present ethics clearance, as is done by the Inter-university Consortium for Political and Social Research at the University of Michigan. Many surveys by government agencies (e.g., the Retirement and Health Study, the National Youth Survey, the Marriage and Parenthood Survey) are highly valuable for social science research, and making these data available can encourage students learning data analytics to apply their skills towards understanding Singapore society.
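To make the two steps above concrete (removing identifiers, then coarsening age into bands), here is a minimal sketch in Python. The records, the field names (`nric`, `name`, `age`, `monthly_income`) and the helper functions are all invented for illustration and are not drawn from any actual survey; a real release would likely need further disclosure checks beyond these two steps.

```python
# Hypothetical survey records; every field name and value here is invented.
records = [
    {"nric": "S0000001A", "name": "Respondent A", "age": 67, "monthly_income": 1200},
    {"nric": "S0000002B", "name": "Respondent B", "age": 72, "monthly_income": 900},
    {"nric": "S0000003C", "name": "Respondent C", "age": 58, "monthly_income": 2500},
]

# Assumed set of columns that directly identify a person.
DIRECT_IDENTIFIERS = {"nric", "name"}


def age_band(age, width=10):
    """Return a coarse age band such as '60-69' for a single year of age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def deidentify(record, width=10):
    """Drop direct identifiers and replace exact age with an age band."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["age_band"] = age_band(cleaned.pop("age"), width)
    return cleaned


released = [deidentify(r) for r in records]
for row in released:
    print(row)
```

Narrower bands (for example `width=5`) keep more detail for researchers at the cost of a somewhat higher re-identification risk, which is the trade-off a data holder would have to calibrate.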

Good ingredients make all the difference

We need high quality data for robust analysis. Government agents often point to sites like SingStat Table Builder or data.gov.sg to argue that the state is in fact sharing a lot of data. While these highly aggregated data have their uses, they are not sufficient for robust policy analyses. SM Tharman wisely tells us to be wary of “single-factor correlations”, but data on sites like data.gov.sg do not permit multivariable analyses, which we need in order to understand the underlying reasons for an observed trend and reduce the likelihood of spurious associations. Such analyses are possible only with case-specific data on individuals or other units of interest such as neighbourhoods or schools (e.g., de-identified survey data or administrative data). Providing access to such data (sometimes referred to as “anonymised microdata”) should be prioritised over releasing endless amounts of highly aggregated data.
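A toy example may help show why case-level data matters. In the sketch below, the numbers, the group labels and the variables `x` and `y` are entirely made up: two measures look strongly correlated when all cases are pooled, yet are unrelated within each group. Only someone holding the individual records, including the grouping variable, could discover this.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented case-level rows: (group, x, y). The group stands in for a
# third factor (say, neighbourhood or school) that drives both x and y.
rows = [
    ("A", 1, 1), ("A", 1, 3), ("A", 3, 1), ("A", 3, 3),
    ("B", 7, 7), ("B", 7, 9), ("B", 9, 7), ("B", 9, 9),
]

# Single-factor view: pool everything and correlate x with y.
pooled = pearson([x for _, x, _ in rows], [y for _, _, y in rows])

# Stratified view: correlate x with y separately within each group.
within = {}
for group in sorted({g for g, _, _ in rows}):
    subset = [(x, y) for g, x, y in rows if g == group]
    within[group] = pearson([x for x, _ in subset], [y for _, y in subset])

print(f"pooled correlation: {pooled:.2f}")
for group, r in within.items():
    print(f"within group {group}: {r:.2f}")
```

Stratifying by one variable is only the simplest form of a multivariable analysis, but it already requires data that highly aggregated tables cannot supply.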

*One poll puts Tharman as the most popular choice for PM, while another says Singaporeans are “not ready for a non-Chinese” candidate. An apparent contradiction? (Photo: Wikimedia)*

We must also recognise that data quality is heavily dependent on how the data is generated. The very act of data collection occurs in a social context. The way in which we elicit data (e.g., in terms of tone, question phrasing, question order, response options) shapes the kind of responses we get. This is why even though one poll shows Senior Minister Tharman (of Indian ethnicity) as the most popular choice to assume the role of Prime Minister by far, results from another poll can be interpreted to say Singapore is “not ready for a non-Chinese Prime Minister”. The main reason for the apparent contradiction is simple: respondents were asked to rank real people (i.e., with names) in the former case, but were only asked about their abstract racial preferences in the latter (which probably meant resorting to their own racial stereotypes to inform their choice). We may disagree about which finding is more relevant, but the point is, how and what you ask matters.

Data holders and people who use data to make policy arguments should therefore be transparent about the way in which data was collected, and acknowledge the limitations inherent in their methodology. We must demand more stringent reporting of data and statistics, starting from media organisations and public institutions. Methodological details are often opaque to news readers, typically clarified on a separate platform only after a public outcry. It is a tragedy when mainstream newspapers such as The Straits Times report margins of error calculated using formulas for a probability sample, when the sample in question is in fact a non-probability sample. Such carelessness in quantifying uncertainty makes us overconfident of our knowledge, and should always be avoided.
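For readers unfamiliar with the calculation at issue, the textbook margin of error for a proportion is z × √(p(1 − p)/n). The sketch below computes it for made-up values of `p` and `n`; the function name is my own. The formula assumes a simple random (probability) sample, which is exactly why applying it to a non-probability sample gives a false sense of precision.

```python
from math import sqrt


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n. For a non-probability
    sample (e.g. an opt-in online poll) this number does not describe
    the real uncertainty and should not be reported as if it did.
    """
    return z * sqrt(p * (1 - p) / n)


# Illustrative values only: 50% support among 1,000 respondents.
moe = margin_of_error(p=0.5, n=1000)
print(f"margin of error: +/- {moe * 100:.1f} percentage points")
```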

Waste leaves us all worse off

Finally, my impression is that data wastage is a big problem in Singapore. (Ironically, there is no real data on this.) Because data is not widely shared, social science researchers often collect data already held by government agencies or other researchers. Since collecting data is expensive, overlaps in such efforts are immensely wasteful in terms of time and money (most of which likely comes from taxpayers).

I make three specific suggestions. First, grant-awarding bodies such as the Social Science Research Council (SSRC) should more intently scrutinise applications to collect data. Proposals to spend lots of money to collect data in some “interdisciplinary”, “big data”, “[insert other buzzword]” project may sound alluring, but such data may already exist with some government agency or another researcher. Instead of dispensing large sums to re-collect mostly the same data, the SSRC can play the role of an intermediary, advocating and facilitating greater data access for researchers.

Second, avoid romanticising “novel” studies, and learn to build upon existing efforts instead. Researchers should be encouraged to build on existing and ongoing studies (and rewarded for doing so) rather than endlessly start up new data collection efforts of their own. There are way too many cross-sectional and short-running (i.e., consisting of two or three waves) social science studies; researchers would benefit much more from having a large, high-quality, long-running study that is made accessible to all. They could then propose to add on topical modules to collect innovative data, rather than start completely new studies. An important step towards this goal would be to first make publicly available an exhaustive record of studies conducted by public agencies or social science researchers on publicly funded grants, so that other researchers can be made aware of the studies that already exist.

Third, more grants should be made available to conduct secondary data analyses on social science datasets. Units like the SSRC or the Government Data Office should facilitate these projects by instituting a formal and transparent process through which researchers can apply for and access data, and providing the appropriate resources to do so (e.g., providing the use of a secure data enclave). Past data is heavily underutilised if only primary data collection is funded—researchers are incentivised to collect new data without making full use of data that has already been collected. This, again, is wasteful.

*It’s probably not just food we are wasting.*

Consider ageing. To name just a few studies, there are the Retirement and Health Study, the Singapore Longitudinal Ageing Studies, the Panel of Health and Ageing in Singapore Elderly, the Well-being of the Singapore Elderly Study, and more recently the Lifelong Education for Aging Productively in Singapore study. These likely overlap a great deal with one another. They will all probably collect the usual information on older adults, such as sociodemographic factors, physical and mental health, social engagement, healthcare use and so on. Imagine if all the money that had gone (is going) into these projects were combined for one large project! It would allow us to follow cohorts over a longer time frame, and to collect a wider spectrum of information from the same people, leading to more robust analyses of individual and societal change over time. As a point of reference, the Health and Retirement Study (HRS) in the United States has been running for about 26 years (easily twice the length of any of the Singapore studies mentioned above), and contains information across many dimensions of health and social life not easily found in studies of ageing in Singapore (e.g., data on genomics, personality, debt, work/family/religious life history, entrepreneurship, time use, etc.). Researchers regularly propose new questions for future waves of data collection, building on the HRS’ rich trove of data. Importantly, because the data is publicly available, over 5,000 doctoral dissertations and research papers have been written from secondary analyses of HRS data. Singapore is likely nowhere near using its data as effectively.

Encouraging long-running data collection efforts can yield real benefits. For example, researchers from Singapore Management University were able to use data from the Singapore Life Panel to explicate the impact of the COVID-19 pandemic on older adults. This is a pleasant surprise—it is usually difficult to examine the impact of unexpected events, because the process of getting funds and ethical approval for social science research is often painfully long. Because the study was already ongoing, data on the pandemic’s effects could be obtained in a timely manner. However, the Singapore Life Panel has only been running for about five years. This means it is unable to provide robust answers for other important questions such as: how does the impact of COVID-19 on older adults’ mental health compare with that of the 2008 Global Financial Crisis? Have individuals worst hit by the Global Financial Crisis become more resilient this time around? A long-running study like the HRS is much better equipped to answer such questions. Short-term research efforts, as are common in Singapore, can only give us a small part of the picture.

If policymakers and grant-awarding bodies are serious about improving social science research, they must recognise that data wastage hurts us all. It is not only a waste of money, but a barrier preventing us from a better understanding of Singapore society.

Conclusion

I have now outlined a number of ways that Singapore can better use its (social science) data in service of societal good. Taking these steps may be an uphill task; it is not in the nature of those who have power (i.e., data) to share it, and they have no incentive to do so. Many have called for greater data sharing in the past decade, but we have made little progress. Nonetheless, the recent call to co-create policy is probably best met with an equally clear response: no data, no talk. Civil society can advocate for greater access to de-identified data from researchers and government agencies; researchers can commit to greater collaboration; grant-awarding agencies can incentivise more effective use of existing data. Hopefully, in time to come, we will all have data for breakfast together.

Academia | SG