Race-based data: Friend or foe?

Academic Views / Monday, July 12th, 2021

Shannon Ang, Assistant Professor of Sociology at Nanyang Technological University, considers whether and when it is useful to disaggregate data by race.

A recent spate of racist incidents has led to more discussion of inter-ethnic relations and race-based policies, from arguments about concepts like Chinese privilege and Critical Race Theory to policy debates about the Ethnic Integration Policy (EIP) and the Group Representation Constituency (GRC) system. Many politicians and academics have weighed in. Yet discussion of policy needs to be backed by data, especially on contentious issues like these. With more data, we can answer questions like: what proportion of inter-ethnic ties form within HDB neighbourhoods, versus at workplaces? Could forced integration foster more animosity between people of different races? This article addresses one very specific but important question: should we release data broken down by race, or not?

First, we must recognise a tension between two seemingly contradictory goals. On the one hand, we do not want race-based data to be used as a reason for stereotyping or stigma. For instance, economists point to “statistical discrimination”, where employers rely on beliefs about group statistics to evaluate individuals and rationalise discriminatory decisions. Some therefore worry that framing data by race may accentuate differences, and others suggest that race-based data should be replaced with data categorised by socioeconomic class. On the other hand, we want to uplift groups that may not be doing as well as others. As Minister in the Prime Minister’s Office Indranee Rajah points out, overly aggregated data that ignores racial differences may mask the fact that some ethnic communities are not doing as well as others. A look at findings from Census 2020 highlights why this can be crucial. Over the past decade, the overall increase in the share of those living in 1-2 room HDB flats seems modest, from 4.6% in 2010 to 6.5% in 2020. Among Malays, however, the increase is much larger: the share almost doubled, from 8.7% in 2010 to 16.0% in 2020. Without race-based data, we may miss this important change, because data from the majority race tend to overwhelm those of other ethnic groups.

This leads to a conundrum: how do we release data in a way that does not simply hand-wave racial differences away, but at the same time ensure it does not lead to racial discrimination? I propose three principles.

Recognise that race is a social construct

First, we must acknowledge that race is socially constructed. Contrary to popular belief, to say that something is “socially constructed” does not necessarily mean that it is “not real”. For example, sociologists say that illness is socially constructed. This does not mean that, say, the abnormal mass of tissue in someone’s body is not real. What it means is that society shapes most of the resultant implications, such as the stigma associated with having different forms of cancer, and expectations around what a cancer patient should do (e.g. get chemotherapy). For instance, people tend to blame lung cancer patients for their condition, even though not all lung cancer patients have a history of smoking. This means that lung cancer patients may be less likely to seek treatment or advocate for themselves.

The racial categories used by the British colonial government to classify people in Singapore changed over the years. (Table excerpted from article)

Unlike cancer, there is little biological basis for racial classification. Much of it is a result of our social imagination, from the definition of racial categories to what we think the “culture” of each race consists of (e.g. its values and practices). Racial categories in Singapore were put in place by the British colonial government, which used individuals’ characteristics such as their place of birth and/or linguistic group to classify them. Even then, the categories used to classify people changed over the years (see the table above). This fluidity tells us that the Chinese, Malay, Indian, and Others (CMIO) model is neither self-evident nor immutable. It is, to an extent, arbitrary. And because it is arbitrary, some people do not fit neatly into those categories, especially as society changes over time. For me, a striking observation on the recent racist incident involving Tan Boon Lee was that each member of the mixed-race couple was already of mixed-race parentage. This was a “second generation” mixed-race couple, but Tan had still put them into the old CMIO boxes.

None of these observations makes race any less “real” than something like cancer. Because systems and institutions throughout history have relied on racial classifications, race comes with many ramifications. In America, this means African-Americans are seven times more likely than Whites to be wrongfully convicted of murder. In Singapore, the enactment of race-based policies like the EIP may mean that minorities have to sell their HDB flats at lower prices. For decades, the Singapore state has relied upon the CMIO model in its administration of citizens, implementing race-based policies with real implications for members of each racial group. Race is real, because the shared history and circumstances of racial groups living under the consequences of these socially imposed categories are real.

What does this mean for our understanding of race-based data? As long as race-based policies and practices differentiate the experiences of Singaporeans as they go through life, variations across racial groups are likely to reflect more than just socioeconomic differences. We should not assume that socioeconomic data can completely replace race-based data, because socioeconomic status does not fully encompass what it means to live under race-based policies in Singapore.

Yet because race is shaped by historical circumstance, we must constantly question the categories we use, especially as society experiences large demographic changes. How (if at all) do new immigrants fit under the CMIO model? Should we include non-residents in the EIP? Are we forcing mixed-race persons to choose one category instead of recognising that they may fit neither (or both)? If the lived experiences and actual identities of Singapore residents no longer easily fit into the categories of Chinese, Malay, Indian, or Other, then race-based data should start to move away from these rigid categories.

Know when race-based data is useful, and when it is not

Second, we must identify circumstances when race-based data is useful, and when it is not. I will suggest just one way to discern and decide on this matter.

Students of statistics learn a simple result called the law of total variance. It states that the overall variation in outcomes is the sum of within-group variability and between-group variability. For instance, in examining the heights of men and women, there are two sources of variation: the variation in height between men and women (i.e. whether men are taller than women on average), and the variation within each group (i.e. some men are taller than other men, and likewise among women).
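
One common way to write this down is shown here. The notation is introduced purely for illustration: p_g is the share of the population in group g, \mu_g and \sigma_g^2 are the mean and variance of the outcome within group g, and \mu is the overall mean.

```latex
\sigma^2_{\text{total}}
  = \underbrace{\sum_g p_g \, \sigma_g^2}_{\text{within-group variability}}
  + \underbrace{\sum_g p_g \, (\mu_g - \mu)^2}_{\text{between-group variability}}
```

The first term averages the spread inside each group; the second measures how far the group averages sit from the overall average.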

Average height differences between men and women help to illustrate the law of total variance. (Photo: pxfuel)

Students analyse these variances to establish meaningful differences between groups. To illustrate, we might say that men tend to be taller than women. This does not mean that all men are taller than all women, which is untrue. Rather, the variability in height between men and women is substantively greater than the variability in height within each group.

So, a simple rule of thumb might be that race-based data is useful when between-race variability is larger than within-race variability. One implication of this view is that as each racial group becomes more diverse, current race-based data based on rigid CMIO categories becomes less useful. For instance, if new Chinese immigrants have very different cultural backgrounds and social circumstances from the existing Chinese population, an influx of these immigrants introduces variability within the “Chinese” ethnic group. Census data shows that in 2020, 22.7% of Chinese residents in Singapore were born outside of Singapore, versus 17.6% in 2000 (non-Singapore-born residents among “Indian” and “Other” ethnic groups have also increased drastically since the CMIO model was conceived). We then have to ask questions like: are there differences in income between Singaporean Chinese and new Chinese immigrants (or Permanent Residents)? Are those differences larger than those between Chinese and Indians (or another ethnic group)? If there are large variations in outcomes within a single ethnic group, comparing outcomes with other ethnic groups becomes much less meaningful.

Another approach would be to ask if differences between races are shrinking. For example, if people across all racial groups are becoming more educated, will attitudes towards the race of our Prime Minister begin to converge between ethnic groups? Or, as English language proficiency increases across all races, will Singlish become similarly accepted across all races? If racial groups are becoming more like one another, then it becomes less useful to release race-based data.
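
The rule of thumb that grouped data is most informative when between-group variability exceeds within-group variability can be sketched in a few lines of Python. The groups and outcome values below are invented purely for illustration (they are not census figures), and `variance_decomposition` and `between_share` are hypothetical helper names. The sketch assumes population variances, so that the within-group and between-group parts add up exactly to the total.

```python
from statistics import fmean, pvariance


def variance_decomposition(groups):
    """Return (within, between, total) population variance for grouped data.

    `groups` maps a group label to a list of numeric outcomes.
    """
    all_values = [v for values in groups.values() for v in values]
    n = len(all_values)
    grand_mean = fmean(all_values)
    # Within-group part: size-weighted average of each group's own variance.
    within = sum(len(vals) / n * pvariance(vals) for vals in groups.values())
    # Between-group part: size-weighted spread of group means around the grand mean.
    between = sum(
        len(vals) / n * (fmean(vals) - grand_mean) ** 2 for vals in groups.values()
    )
    return within, between, pvariance(all_values)


def between_share(groups):
    """Fraction of total variance that lies between groups (0 to 1)."""
    within, between, total = variance_decomposition(groups)
    return between / total


# Hypothetical outcomes (arbitrary units) for two made-up groups.
homogeneous = {"A": [10, 12, 14], "B": [20, 22, 24]}
# Same group means, but each group is now internally much more diverse.
diverse = {"A": [2, 12, 22], "B": [12, 22, 32]}

for name, data in [("homogeneous", homogeneous), ("diverse", diverse)]:
    within, between, total = variance_decomposition(data)
    print(f"{name}: within={within:.2f} between={between:.2f} "
          f"between share={between_share(data):.2f}")
```

In the first scenario most of the variation lies between the groups, so group averages are informative. In the second the group averages are unchanged, but within-group diversity dominates, which is the situation the rule of thumb warns about.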

Whether race-based data is useful therefore depends on the outcome we are looking at (e.g. health, income, housing, education), and whether the CMIO categories adequately capture variation in each of these areas. Unfortunately, we are seldom given data on variability within race groups. This brings me to the next point.

Contextualise race-based data with more data

My third and final principle is that the answer lies in releasing more data, not less. The solution to the ills of race-based data (e.g. statistical discrimination) is not to stop releasing it, but to release more data that can contextualise observed racial differences. With only race-based data, people tend to default to unhelpful cultural explanations, relying on false stereotypes such as “Malays are lazy”. But more data can help us to determine the sources of inequality that explain racial differences. To illustrate, a middle-class graduate Malay may have more in common with a middle-class graduate Chinese than with a working-class Malay. However, to look at this we need data along the simultaneous dimensions of socioeconomic status and race—data that are seldom available. More of the right data can help us get at the key question: what factors explain racial differences?

I will highlight just one simple method we can use to answer questions like these. Demographers use a nifty tool created by sociologist and demographer Evelyn Kitagawa to understand how a third factor may explain overall differences in outcomes between two groups. Consider Census 2020 data (Table 20 below), which tells us that among those who live in HDB flats, 29.9% of Chinese live in 5-room and executive HDB flats, while only 23.2% of Malays live in such flats (a difference of approximately 6.7 percentage points). One might wonder: perhaps this difference exists because Chinese households are larger than Malay households, and therefore need bigger flats? Kitagawa’s decomposition method lets us test this hunch.

This method reveals (see box) that Malays actually have larger households than Chinese, but nevertheless live in smaller flats. Put another way, if the Chinese had household sizes similar to those of Malays, we would anticipate a 24% greater difference (growing from the current 6.7 percentage points to about 8.1) between the proportions of Chinese and Malays living in 5-room and executive HDB flats. So the answer is no: Chinese do not live in larger flats because they have larger households.

We can also make another observation: the association between household size and HDB flat size is not the same for Malays and Chinese. Malays with large households are less likely than Chinese with large households to live in larger HDB flats. But why? With more data, researchers can go on to investigate why this is so. Perhaps the difference in this association is because Malay households have, on average, lower income than Chinese households.
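
A minimal Python sketch of the two-component Kitagawa decomposition follows. It is not the calculation behind the figures above: the household-size categories, shares, and rates are hypothetical numbers chosen for illustration, and `crude_rate` and `kitagawa` are assumed helper names. The gap in an overall rate between two groups is split into a composition effect (the groups are distributed differently across household sizes) and a rate effect (at the same household size, the groups differ in how likely they are to live in a large flat).

```python
def crude_rate(shares, rates):
    """Overall rate: category-specific rates weighted by composition shares."""
    return sum(p * r for p, r in zip(shares, rates))


def kitagawa(shares_a, rates_a, shares_b, rates_b):
    """Split crude_rate(A) - crude_rate(B) into (composition, rate) effects."""
    rows = list(zip(shares_a, rates_a, shares_b, rates_b))
    # Composition effect: differences in shares, weighted by the average rate.
    composition = sum((pa - pb) * (ra + rb) / 2 for pa, ra, pb, rb in rows)
    # Rate effect: differences in rates, weighted by the average share.
    rate = sum((ra - rb) * (pa + pb) / 2 for pa, ra, pb, rb in rows)
    return composition, rate


# Hypothetical inputs for three household sizes: small, medium, large.
shares_a = [0.5, 0.3, 0.2]    # group A: share of households of each size
rates_a = [0.10, 0.30, 0.60]  # group A: share of each size living in large flats
shares_b = [0.3, 0.3, 0.4]    # group B has more large households...
rates_b = [0.05, 0.20, 0.40]  # ...but lower rates at every household size

gap = crude_rate(shares_a, rates_a) - crude_rate(shares_b, rates_b)
composition, rate = kitagawa(shares_a, rates_a, shares_b, rates_b)
print(f"gap={gap:.3f} composition={composition:.3f} rate={rate:.3f}")
```

With these made-up inputs the composition effect is negative and the rate effect is positive: group B's larger households would, on their own, put it ahead, so the observed gap comes entirely from differences in rates at each household size. This mirrors the qualitative pattern described above without reproducing its numbers.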

I hope I have demonstrated here how this simple tool can further the conversation around race. Tools like the Kitagawa decomposition can help us answer questions like: how much of the racial difference in health can be explained by the way household incomes are distributed across race? How much of the racial difference in gross monthly income can be attributed to differences in educational qualifications across race? How are educational outcomes across race shaped by factors such as family structure, or social capital?

Yet publicly available data are often not detailed enough to contextualise differences between racial groups. To come up with the example above, I scoured the Department of Statistics website and numerous Institute of Policy Studies publications—still, it was just about the only data I could use for this exercise.

Addressing the issue of people resorting to false cultural stereotypes requires more information, not less. We cannot abandon race-based data altogether, because pretending racial differences don’t exist may be even more harmful for minority communities. More data, not less, is needed to counter misinformation. With more data, we can work to contextualise apparent racial differences—examining factors that explain these differences and perhaps working to close racial gaps.


I started by saying I would try to answer the question “Should we release data by race, or not?” But this tends to paint the situation as an all-or-nothing scenario, and is typically unproductive for conversations around race. A better question to ask would be: how and under what circumstances should we release race-based data? I have attempted to provide three principles to prepare us for this task. First, we must recognise that racial categories are not self-evident. This reminds us that CMIO categories are not natural or immutable—rather, they are contingent on historical circumstances and tend to serve specific interests. Current and longstanding effects of race-based policies drive some (if not most) of the observable differences between racial groups. Second, we must discern when race-based data is useful, and when it is not. Part of this means keeping an eye on within-race and between-race variability, ensuring that we emphasise only meaningful differences. With increasing diversity within race groups, race-based data by CMIO categories may become much less useful. Third, race-based data should be released in greater detail, so that users can contextualise and understand why differences exist across racial groups. For instance, differences in educational attainment may be better explained by differences in parental socioeconomic status rather than “laziness” or “grit”, but data to explore such relationships are seldom available. These three principles represent my hope that race-based data can become friend, and not foe.
