# Variability in the interpretation of probability phrases used in Dutch news articles — a risk for miscommunication

### Abstract:

Verbal probability phrases are often used in science communication to express estimated risks in words instead of numbers. In this study we look at how laypeople and statisticians interpret Dutch probability phrases that are regularly used in news articles. We found that there is a large variability in interpretations, even if the phrases are given in a neutral context. Also, statisticians do not agree on the interpretation of the phrases. We conclude that science communicators should be careful in using verbal probability expressions.

24 October 2019

24 March 2020

6 April 2020

### 1 Introduction

Probabilities and risks play a crucial role in science communication. Doctors inform their patients about the probability of a successful treatment and the risks of side-effects. Climate researchers want to convey the probability of different climate change scenarios. Science journalists report on estimated probabilities and risks in many different fields. And every day people make decisions based on these probabilities and risks. Due to this dependence of the decision maker on the information provider, it is important that the message is understood as intended in order to minimize the risk of miscommunication.

Many estimated probabilities are communicated verbally, with terms such as very likely instead of exact percentages. In that case it is important that the interpretation of the verbal probability phrase is the same for both sender and receiver. For example, in health communication there are guidelines that state that side-effects that occur in 1–10% of patients should be referred to as common. But when the side-effect of constipation was described as common to patients, they estimated on average that 34.2% of people would experience constipation [Knapp, Raynor and Berry, 2004]. Such overestimations of risks can decrease medicine adherence or can lead to nocebo effects where people will actually experience more side-effects.

Many organisations use probability scales such as the one in Table 1 as a guideline for how probability phrases should be interpreted, so that their risk communication is standardized. But how well do these translations match how people actually interpret these phrases? How do people translate these verbal probability phrases back into numbers?

Table 1: Approximate probability scale recommended for harmonized use in the European Food Safety Authority (EFSA) to express uncertainty about questions or quantities of interest [European Food Safety Authority et al., 2019].

In early studies on the interpretation of probability phrases respondents were asked to give their interpretation of a probability expression as a single value or range on a scale of 0–1 or 0–100% or were asked to rank them. The phrases were either presented out-of-context or in sentences describing a particular situation. Many of these studies were summarized in the literature reviews by Druzdzel [1989] and Visschers et al. [2009] and the meta-analysis by Theil [2002].

The overall conclusion from these studies was that, although individuals seem to be internally consistent in their ranking of probability phrases [Budescu and Wallsten, 1985] and their perception of them over time [Bryant and Norman, 1980], the interpretation of these phrases varies greatly among individuals. This interpretation variability is especially large for phrases expressing a probability in the range from 20% to 80%. For words that express extreme probabilities, such as always, certain, never, and impossible, consensus was highest. This variability of interpretations is represented by the varying widths of the subjective probability ranges in probability scales as in Table 1. These wide ranges complicate communication, because it is impossible to express a very specific probability.

Several studies also showed that the numerical interpretations of some probability phrases overlap or are very similar. For example, Reagan, Mosteller and Youtz [1989] concluded that likely is synonymous with probable, and low chance with unlikely and improbable. Synonymous words have overlapping probability ranges which would complicate a probability scale. The codification presented in Table 1 seems to avoid this complication by limiting the vocabulary to phrases with non-overlapping ranges.

Furthermore, translation issues for verbal probability expressions are important for all international organizations that publish their documents in more than one language. For example, a question that may arise within the European Food Safety Authority is whether their probability scale (Table 1) translates directly to other European languages, or whether the subjective probability ranges in the second column should be adjusted, and consequently, the expressions in the documentation text.

Most research on the numerical interpretation of probability phrases was conducted in English. There have been some replication studies in other languages, including Dutch. Most of the Dutch studies are over twenty years old. For instance, Eekhof, Mol and Pielage [1992] focused on the interpretation of 30 Dutch phrases. However, all phrases in this study expressed frequencies (like often, always, and rarely) instead of probabilities (like certain, likely, and low chance). In a later study by Timmermans [1994] some probability phrases were included, usually in combination with an adverb like quite or rather. Unfortunately, the article is written in English and does not provide the Dutch expressions used in the study, hence it is unclear exactly which Dutch expressions and adverbs were investigated. A study by Pander Maat and Klaassen [1996] focused on the interpretation of uncertainty in the information leaflets that come with medicines. Although their main interest was not in the numerical values associated with verbal probability phrases, they did investigate this for three phrases. Renooij and Witteman [1999] conducted several experiments to develop a probability scale containing both words and numbers. Their focus was on ranking seven probability phrases and developing the corresponding numerical scale. Given that the first study included many phrases but only frequencies, and the other three studies included only a few probability phrases, usually in combination with adverbs, many Dutch probability expressions still needed to be studied.

In addition to replication studies in other languages, several studies have been done to compare the interpretation variability of English probability phrases with the interpretations of their translations to other languages. Three studies, comparing English with French [Davidson and Chrisman, 1994], German [Doupnik and Richter, 2003], and Chinese [Harris et al., 2013], showed that on average the numerical interpretations of the English phrases differ from the interpretation of their counterparts in the three other languages. Additionally, in French and Chinese, the standard deviations of the numerical values related to the probability phrases were much larger than those of the original English wording. These results show that the meaning of probability expressions can get lost in translation from one language to another.

In our study we focus on the interpretation of Dutch verbal probability phrases given in neutral contexts. In the next section we give an overview of theories and results from (science) communication literature that determined the set-up of our study.

### 2 Background

#### 2.1 The communication mode preference paradox

Until recently, it was generally believed that information providers, the senders of a message, prefer to express probabilities verbally, using verbal probability expressions such as unlikely, usually and maybe, while decision makers favour numeric expressions like percentages. Druzdzel [1989] reasoned that senders prefer verbal expressions because these convey some amount of uncertainty. Including this uncertainty in the expression is favoured by senders, because probability estimates are usually based on empirical data and therefore not sufficiently precise to be translated into exact numerical statements. Hence, if a numerical value is given, its suggested precision may be misleading. On the other hand, decision makers prefer the precision of numerical expressions, since numeric values are easier to compare and to draw conclusions from. Erev and Cohen [1990] referred to this difference in preference as the communication mode preference paradox.

In more recent studies, researchers have challenged this theory, but the results are not conclusive. For example, Juanchich and Sirota [2019] concluded that people favour verbal phrases in general, but in some contexts or for specific purposes numerical expressions are preferred.

#### 2.2 Asymmetry

A complication in the interpretation of probability phrases is asymmetry. For example, having found that low chance is synonymous with unlikely and improbable, Reagan, Mosteller and Youtz [1989] also expected high chance to be synonymous with likely and probable. However, their data indicated that its synonyms are actually very likely and very probable. This unbalanced result shows that there is some asymmetry in the interpretation of probability phrases.

This phenomenon of asymmetry in the interpretation of mirrored probability phrases has been studied and confirmed by many researchers. In most studies, this imbalance is investigated at a group level by comparing the group means or medians of two complementary phrases. For instance, Lichtenstein and Newman [1967] concluded that the interpretations of likely and unlikely are asymmetric, since their means sum to (72% + 18% =) 90% and their medians sum to (75% + 16% =) 91% instead of 100%. This asymmetry was confirmed by both Reagan, Mosteller and Youtz [1989] (medians sum to 90%) and Stheeman et al. [1993] (medians sum to 80%). Furthermore, Lichtenstein and Newman [1967] focused on the influence of adverbs (such as very, quite and fairly) and found that, for instance, the means of the numeric probabilities given to quite likely and quite unlikely sum to (79% + 11% =) 90% instead of 100%.
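The group-level check described above is simple arithmetic, and a minimal Python sketch may make it concrete. The numbers are the group means and medians quoted from Lichtenstein and Newman [1967]; the function name `complement_gap` and the dictionary layout are illustrative choices, not part of any cited study.

```python
# Group-level asymmetry check for mirrored probability phrases.
# Values (in %) are the figures quoted in the text from
# Lichtenstein and Newman [1967]; the helper name is illustrative.


def complement_gap(positive: float, negative: float) -> float:
    """Return how far a mirrored pair falls short of summing to 100%.

    A gap of 0 means the pair is symmetric; a positive gap means the
    two interpretations sum to less than 100%.
    """
    return 100.0 - (positive + negative)


pairs = {
    "likely / unlikely (means)": (72, 18),
    "likely / unlikely (medians)": (75, 16),
    "quite likely / quite unlikely (means)": (79, 11),
}

gaps = {label: complement_gap(pos, neg) for label, (pos, neg) in pairs.items()}

for label, gap in gaps.items():
    print(f"{label}: sum = {100 - gap:.0f}%, gap = {gap:.0f} points")
```

A gap of zero would indicate a symmetric pair; each of the three pairs here falls 9 or 10 percentage points short.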

Previous studies have also shown that some terms actually are (almost) symmetrical. For example, very likely and very unlikely (mean interpretations sum to 96% [Lichtenstein and Newman, 1967] ), and almost always and almost never (median interpretations sum to 98% [Stheeman et al., 1993] ).

Some mirrored terms have a clear linguistic explanation for their asymmetry. For example, Mosteller and Youtz [1990] studied the terms possible and impossible and found that the interpretation of impossible is stable (around 3% for all participants of the study), while possible has distinct meanings for different people. Some respondents used the literal interpretation of possible and indicated that it could indicate any percentage between 0% and 100%, while others associated it with rare events that occur only occasionally (as in barely possible). Hence, the different interpretations of possible cause the strong asymmetry with its mirrored expression impossible. The asymmetry in the interpretation of certain and uncertain can be explained in a similar way.

The asymmetry in the interpretation of verbal probability expressions complicates the development of probability tables. A symmetric probability scale is easier to use; for example, the symmetry of the probability scale in Table 1 simplifies the use of the table. However, since research has shown that people do not necessarily interpret mirrored phrases in a symmetrical way, symmetric tables do not necessarily represent the actual interpretation of their terms.

All these research results show that the interpretations of verbal probability expressions vary too much to translate them into a (symmetrical) probability scale whose numerical probability ranges would be supported by everyone. Therefore, many researchers who initially intended to make a translation table concluded that such a codification is practically impossible [Lichtenstein and Newman, 1967; Mosteller and Youtz, 1990; Weber and Hilton, 1990; Timmermans and Mileman, 1993], or realized that the table they were using did not actually convey the intended probabilities [Pander Maat and Klaassen, 1996]. Yet many organizations still use tables like this.

#### 2.3 Context dependence

The interpretation of a probability phrase is influenced enormously by its context. For instance, compare your numerical interpretation of the word likely in the next two statements:

• It is likely that it will rain in Manchester, England, next June;
• It is likely that it will rain in Barcelona, Spain, next June.

Your numerical interpretation of likely is probably higher in the first statement than in the second. Wallsten, Fillenbaum and Cox [1986] used this example and, based on their research, predicted a difference in the numerical interpretation of these statements. In their study, they showed that an individual’s expected base-rate of a context scenario influences this person’s interpretation of the probability phrase. In this example the base-rate for the first scenario is higher (in June, rain is more probable in England than in Spain) and this influences the interpretation of the word likely.

This hypothesis on the base-rate effect was confirmed by Weber and Hilton [1990], who, additionally, provided evidence that other variables may be affecting the interpretation as well. According to their findings, the perceived severity or consequentiality of an event and its emotional valence will also influence the judged probability.

Since it was shown that context may influence the interpretation of probability phrases, many researchers decided to investigate them out-of-context. However, it was argued by Druzdzel [1989] that, if no specific context is provided, participants may invent their own context. Due to these self-created contexts, participants’ responses will portray the interpretation of the probability phrases in many completely different contexts instead of out-of-context. These different scenarios may cause extra variability in the data which makes it more difficult to draw conclusions from the results.

#### 2.4 Differences between sub-populations

In most studies, data on the interpretation of probability phrases was gathered within specific sub-populations. Participants were, for instance, physicians [Bryant and Norman, 1980], science writers [Mosteller and Youtz, 1990], radiologists [Stheeman et al., 1993], biological scientists [MacLeod and Pietravalle, 2017], or patients [Pander Maat and Klaassen, 1996]. Although all these studies showed variability in the perception of probability phrases within these sub-populations, one might wonder whether there are any differences between these groups as well. For example, Theil [2002] argued that there may be a difference between professionals who regularly make and communicate probability estimations, and persons who are inexperienced in this respect. However, his meta-analysis did not provide evidence for this hypothesis.

In studies on the use of jargon in science communication, it has been shown that there is a significant difference in the interpretation of medical terms between doctors and patients [Boyle, 1970] and of hydrological vocabulary between experts and laypeople [Venhuizen et al., 2019]. Experts may be unaware of this difference [Castro et al., 2007] and, hence, their use of jargon may cause a miscommunication of information.

Given these results on the different interpretations of jargon, there is reason to believe that there may be differences between the numerical interpretations of probability expressions of experts and laypeople as well, as Theil [2002] suggested. If this hypothesis is correct, experts may be misunderstood if they express probabilities verbally.

#### 2.5 Gaps in the literature

Summarizing, we see that despite ongoing interest in and usage of verbal probability expressions, there are large gaps in the literature. Furthermore, in most studies on this topic, the sample sizes were quite small. For instance, the number of participants in the Dutch studies lay between 78 [Timmermans, 1994] and 101 [Eekhof, Mol and Pielage, 1992]. The English studies have comparable sample sizes, for example, in the nine studies mentioned by Theil [2002] the median number of participants is 52 and the mean is 170.

Therefore, we set up a large-scale study of the interpretation of Dutch verbal probability expressions, presented in neutral contexts that are based on ordinary events. By choosing a neutral context, we try to eliminate any prior beliefs about the context. In this way we can investigate whether there is also a large variability in interpretation if it is not influenced by these prior beliefs.

Additionally, we check for synonymous phrases and asymmetry since these two characteristics are well studied in English but have not yet been analyzed in Dutch studies. Furthermore, we compare the results of statisticians with those of laypeople to check whether experts use different interpretations.

### 3 Methods

We used a survey design where probability phrases were presented in a neutral sentence to participants, and they could give their interpretation as a point estimate on a 0–100% scale. The survey was distributed online via Twitter and mailing lists of Dutch statistical societies in order to reach a large number of people.

#### 3.1 Choice of phrases

There are many Dutch probability and frequency phrases that can be studied. To make a selection for our study, we first listed the phrases used in the English studies and translated them to Dutch. For translation Google Translate [Google, 2018] and the leading Dutch dictionary Van Dale [Van Dale Uitgevers, 2018] were used. If more than one translation was appropriate, both were added to the list. Then we added the expressions from previous Dutch studies [Eekhof, Mol and Pielage, 1992; Renooij and Witteman, 1999; Pander Maat and Klaassen, 1996]. This resulted in a list of 131 phrases.

This list was too long to use in one survey, so a selection had to be made. Since the most frequently used phrases are also the most relevant, we selected the verbal probability expressions that were used at least 100 times in all online available articles of the popular Dutch news website nu.nl. To prevent too much overlap with the research by Eekhof, Mol and Pielage [1992], only the ten most commonly used frequency phrases were selected. Furthermore, some combinations of adverbs with a probability phrase were removed from the list to prevent too much overlap with the study by Timmermans [1994], and to prevent repetitions of very similar phrases. Additionally, the word undecided was removed, since it was mostly used in sport results where it has a different meaning.

This method of phrase selection resulted in a list of 29 frequency and probability expressions. These phrases, and their English translations, are given in Table 2 in appendix A. In this article, we will use the English translations. Please keep in mind that all given numerical interpretations for these phrases are actually for their Dutch counterparts.

#### 3.2 Context

As described before, the interpretation of a probability expression may be influenced by a person’s prior expectations of the phrase’s context. To avoid these base-rate effects, our aim was to formulate sentences that are neutral in the sense that everyone can imagine the situation but has few prior expectations about it. Some examples of the statements, formulated with the probability phrase likely, are

• It is likely that this plan succeeds.
• It is likely that this hotel is fully booked.
• It is likely that the team wins a match.

We tried to minimize the base-rate effect by not specifying a specific plan, hotel, or team. We developed twelve sentences like these. The complete list of these contexts is given in Table 3 in appendix B. In each sentence the verbal probability expression was printed in bold to direct more attention to it.

#### 3.3 Numeric interpretations

For each probability expression in the survey, participants gave the point estimate of their numerical interpretation in percentages (0–100%) by using a slider. After the statement, each survey item was formulated as a question. For example, the questions related to the three statements above were formulated as follows:

• What is the probability (expressed in percentages) that this plan succeeds?
• What is the probability (expressed in percentages) that this hotel is fully booked?
• What is the probability (expressed in percentages) that the team wins a match?

All probability phrases were presented individually and in a random order, and participants were required to answer each question before continuing to the next. In this way, missing data was prevented.

#### 3.4 Randomization

To prevent a systematic influence of the context on the interpretation of the probability phrase, 12 different versions of the survey were created. In every version, the probability phrase was formulated in a different context and contexts were repeated two or three times in each survey version (since 29 is not divisible by 12). All survey versions were evenly and randomly distributed among the participants by the survey software Qualtrics [2005].
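One way to obtain a design with these properties is a simple rotation of contexts over versions. The sketch below is an assumption for illustration only: the text does not state how phrases were matched to contexts, nor how Qualtrics balances versions internally. The rotation rule `context_for` and the helper `assign_versions` are hypothetical names, and the random ordering of phrases within a survey is not modelled.

```python
import random
from collections import Counter

N_PHRASES = 29   # phrases in the survey
N_CONTEXTS = 12  # neutral context sentences
N_VERSIONS = 12  # survey versions


def context_for(phrase: int, version: int) -> int:
    """Hypothetical rotation rule: shift the context index by the version."""
    return (phrase + version) % N_CONTEXTS


def build_versions() -> list[list[int]]:
    """Return, per version, the context index used for each phrase."""
    return [
        [context_for(phrase, version) for phrase in range(N_PHRASES)]
        for version in range(N_VERSIONS)
    ]


def assign_versions(n_participants: int, seed: int = 0) -> list[int]:
    """Spread versions evenly over participants, then shuffle the order."""
    assignment = [i % N_VERSIONS for i in range(n_participants)]
    random.Random(seed).shuffle(assignment)
    return assignment


versions = build_versions()

# Across the 12 versions, every phrase is paired with every context once.
phrase_contexts = [{v[p] for v in versions} for p in range(N_PHRASES)]

# Within one version, each context is used two or three times.
uses_per_context = Counter(versions[0])
print(sorted(uses_per_context.values()))
```

Under this rule every phrase appears in a different context in each version, and because 29 = 2 × 12 + 5, five contexts are used three times and seven are used twice within any one version.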

#### 3.5 Personal characteristics

After giving their interpretation of the 29 phrases, participants were asked for some personal information. This included whether they are a statistician, their highest completed education level, age, and gender. Statistician status was self-reported, using the question “Are you a statistician or do you perform statistical analyses on a weekly or monthly basis?” Education was categorized in six common categories of degrees in the Netherlands. Age was categorized in intervals of 20 years. These wide intervals were chosen to protect the anonymity of the participants and because the exact ages were not of particular interest for this research. However, age was included to check whether both young and older people participated. As with age, gender is not of particular interest for this study, but it was included to check whether participants are almost equally distributed among the genders.

All these characteristics were asked as multiple-choice questions and participants could select one of the given categories. Participants were allowed to refrain from providing their age and gender.

#### 3.6 Pilot

A pilot study showed that the length of the survey was reasonable (approximately ten minutes) and that the explanation was clear. We noticed that some participants had the tendency to base their interpretation of a phrase on their interpretations of previous phrases. This confirmed that randomization of the phrases is necessary. Additionally, it supported our decision to present one phrase at a time and to not allow participants to change their answers to previous questions. If we had permitted this, participants might have ranked their answers instead of giving the interpretations individually, which could have influenced the results. Based on the pilot study, we decided to make the original question “Are you a statistician or do you perform statistical analyses on a regular basis?” more specific by changing “on a regular basis” into “on a weekly or monthly basis”.

#### 3.7 Survey distribution

We obtained permission to distribute this survey from the ethical committee of the Faculty of Behavioural and Social Sciences of the University of Groningen (17451-O). Since we wanted to compare the interpretations of Dutch-speaking statisticians with those of non-statisticians, the survey was distributed among both groups. Statisticians were invited to participate via the mailing lists of the Netherlands Society for Statistics and Operations Research (VVSOR) and the Interuniversity Graduate School of Psychometrics and Sociometrics (IOPS). To reach non-statisticians, the survey invitation was distributed via the personal Twitter [Twitter Inc., 2018] accounts of the three authors (one of the authors is a public figure and has over 60,000 followers, many of whom are not in the academic community). Their followers were asked to participate and to share the survey in their network.

### 4 Results

#### 4.1 Participants’ characteristics

The survey was open for participation for almost four months, between 18 July 2018 and 8 November 2018. During this time, 1004 persons started the survey, of whom 115 did not finish it. These incomplete observations were removed from the data. Another 8 participants were excluded from the analysis because their native language was not Dutch. As a result, the data contain the responses of 881 participants.

The participants are evenly distributed among the genders (430 male vs. 440 female). There were many more non-statisticians than (self-reported) statisticians (655 vs. 226). Their distribution among the age groups and education levels is displayed in Figure 1. The first bar plot indicates that most participants were equally distributed among the two middle age groups (20–39 years and 40–60 years). The second bar plot shows that many of the participants were highly educated. Most statisticians have an academic education (94%) and also among the non-statisticians, the proportion of academically educated persons is large (58%). Furthermore, there are more males than females among the statisticians (59% male) and more females among the non-statisticians (55% female).

#### 4.2 Interpretation of probability phrases

The distributions of the interpreted percentages of each probability phrase are displayed by the density plots in Figure 2 and the mean values and 5% and 95% percentiles are listed on the right side of the plots. The 5% and 95% percentiles indicate the range of interpretations of 90% of the participants.
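As an illustration of these summary statistics, the following Python sketch computes the mean and the 5% and 95% percentiles for one phrase. The response values are hypothetical and are not taken from the survey data. The linear-interpolation percentile is one common convention and is an assumption here, since the text does not state which percentile definition was used.

```python
from statistics import mean


def percentile(values, p):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = p * (len(ordered) - 1) / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(values):
    """Mean, 5% and 95% percentiles, and the width of that 90% range."""
    p5 = percentile(values, 5)
    p95 = percentile(values, 95)
    return {"mean": mean(values), "p5": p5, "p95": p95, "width": p95 - p5}


# Hypothetical slider responses (in %) for a single phrase.
responses = [50, 60, 70, 70, 75, 80, 80, 80, 85, 90, 95]
summary = summarize(responses)
print(summary)
```

The `width` entry corresponds to the percentile range that is used below as a measure of consensus: the narrower the range, the more participants agree on the meaning of a phrase.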

There seems to be some consensus about the interpretation of extreme words like always, certain, and impossible: the intervals between their 5% and 95% percentiles have a width of about 20 percentage points. Surprisingly, the 95% percentile of the extreme phrase never is at 32%, which seems high for this expression.

There is even less consensus for phrases that do not represent an extreme probability. Namely, their numerical interpretations have percentile ranges with widths up to 50 percentage points. For example, 90% of the respondents interpreted the verbal probability expressions sometimes, probable, and almost always between, respectively, 11–55%, 41–86%, and 70–96%.

Also noticeable are the small peaks in the density plots, which indicate that participants often express probabilities as multiples of ten, resulting in “heaping” of the data at these round numbers. Furthermore, there was no phrase in our survey that represents 50%. The candidates liable to happen, chance, uncertain, maybe, and possible, for which 50% is the most frequently chosen interpretation, all have a large tail to the left and percentile ranges of 42–50 percentage points.
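The heaping at round numbers can be quantified as the share of responses that are exact multiples of ten. The sketch below is only an illustration: the response values are hypothetical, and it assumes that the slider recorded whole percentages, which the text does not state explicitly.

```python
def share_multiples_of(values, base=10):
    """Fraction of responses that are exact multiples of `base`."""
    if not values:
        raise ValueError("share_multiples_of() needs at least one value")
    hits = sum(1 for v in values if v % base == 0)
    return hits / len(values)


# Hypothetical whole-percentage responses for one phrase.
responses = [50, 50, 60, 55, 70, 73, 80, 40, 50, 65]

share_tens = share_multiples_of(responses, 10)
share_fives = share_multiples_of(responses, 5)

# Baseline if every whole percentage from 0 to 100 were equally likely:
# 11 of the 101 possible values are multiples of ten.
uniform_baseline = 11 / 101

print(f"multiples of 10: {share_tens:.0%} (uniform baseline {uniform_baseline:.0%})")
print(f"multiples of 5:  {share_fives:.0%}")
```

A share far above the uniform baseline of roughly 11% would point to heaping.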

#### 4.3 Asymmetry

For the usability of verbal probability expressions, (a)symmetry in the interpretation of mirrored verbal probability expressions is of interest. The imbalance in their interpretation is often investigated by checking whether the group means or group medians of the interpretations of two complementary words sum to 100%. The group means from our data are listed in Figure 2, and show that, as in English, asymmetry is present for the Dutch translations of likely and unlikely: the mean interpretation of likely in our data is 75% and the mean for unlikely is 16%, and hence these sum to 91%. Symmetry is found for phrases such as very likely and very unlikely (sum to 95%), almost always and almost never (sum to 100%), and often and not often (sum to 97%).

The results from previous studies and those listed above are based on results at a group level (group means). We also looked at the results at an individual level by plotting the density of the sums of complementary phrases, see Figure 3. These plots show that there are some mirrored pairs whose interpretations sum to about 100% for most participants, for example (almost) always and (almost) never, and very likely and very unlikely. Other complementary phrases were interpreted asymmetrically by many participants and usually sum to less than 100%, typically between 75% and 100%, for example likely and unlikely, and often and not often.
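The individual-level analysis amounts to summing, per participant, the two answers to a mirrored pair. A minimal Python sketch follows; the four participants and their answers are hypothetical, and the 5-point tolerance for counting a sum as "about 100%" is an arbitrary illustrative choice.

```python
def pair_sums(responses, positive, negative):
    """Per-participant sum of the answers to two mirrored phrases."""
    return [r[positive] + r[negative] for r in responses]


def share_near_100(sums, tolerance=5):
    """Fraction of participants whose pair sum lies within `tolerance` of 100."""
    return sum(1 for s in sums if abs(s - 100) <= tolerance) / len(sums)


# Hypothetical answers (in %) of four participants.
participants = [
    {"likely": 75, "unlikely": 15, "very likely": 90, "very unlikely": 10},
    {"likely": 70, "unlikely": 20, "very likely": 95, "very unlikely": 5},
    {"likely": 80, "unlikely": 20, "very likely": 90, "very unlikely": 5},
    {"likely": 70, "unlikely": 10, "very likely": 85, "very unlikely": 10},
]

likely_sums = pair_sums(participants, "likely", "unlikely")
very_sums = pair_sums(participants, "very likely", "very unlikely")

print("likely + unlikely:", likely_sums, share_near_100(likely_sums))
print("very likely + very unlikely:", very_sums, share_near_100(very_sums))
```

Unlike a comparison of group means, this view shows whether asymmetry is shared by most individuals or driven by a few respondents.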

As explained in section 2.2, in some cases asymmetry has a linguistic cause. Our results on the interpretation of possible and impossible confirm the findings of Mosteller and Youtz [1990]. Figure 2 shows that impossible has a stable interpretation that is close to 0%, while possible has a broad interpretation from 20% to 70% which peaks around 50%. The asymmetry is also confirmed by the distribution of their sums in Figure 3.

A similar pattern is found for certain and uncertain; there is a consensus on the interpretation of certain (around 100%) while the perception of uncertain varies a lot and is comparable to the interpretation of maybe, namely some value between 20% and 50% (see Figure 2). As a result, the percentages of certain and uncertain always sum to more than 100% and together peak at 150% (see Figure 3).

#### 4.4 Context

One of our concerns was that the context of the sentences influences the perception of the probability phrases. To avoid the base-rate effect, we tried to formulate the context sentences as neutrally as possible.

To check whether we succeeded in our intention, we investigated the variability of the interpretation of phrases among different contexts. Figure 4 shows the mean percentages given by the participants to each probability phrase, grouped by context, together with the intervals between the 25% and 75% percentiles. Hence, these intervals indicate the numerical interpretation of a probability phrase of half of the participants and give an indication of the uncertainty around the mean values.

This plot shows that, in general, the means of phrases are very similar for each context, with a maximum difference of 20 percentage points between contexts. Most of this variability appears for words that represent 30% to 80%. This is confirmed by the intervals indicating the uncertainty around these means. The fact that the widths of most intervals are comparable and that the intervals of each context overlap with the intervals of several other contexts suggests that none of the context sentences is systematically interpreted differently (higher/lower or more/less extreme) from the others. Even the intervals of the two context sentences that presented a negative outcome (they will go on strike and this hotel is fully booked) overlap considerably with the intervals of the other context sentences.
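The comparison of means across contexts can be sketched as follows. The records are hypothetical and the short context labels ("plan", "hotel") are illustrative abbreviations of the context sentences. The sketch only computes the per-context means and their spread per phrase, not the percentile intervals shown in Figure 4.

```python
from collections import defaultdict
from statistics import mean


def means_by_context(records):
    """Mean response per (phrase, context) from (phrase, context, value) rows."""
    grouped = defaultdict(list)
    for phrase, context, value in records:
        grouped[(phrase, context)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


def spread_per_phrase(context_means):
    """Largest difference between context means, per phrase."""
    by_phrase = defaultdict(list)
    for (phrase, _context), value in context_means.items():
        by_phrase[phrase].append(value)
    return {phrase: max(vals) - min(vals) for phrase, vals in by_phrase.items()}


# Hypothetical responses: (phrase, context, percentage).
records = [
    ("likely", "plan", 70), ("likely", "plan", 80),
    ("likely", "hotel", 75), ("likely", "hotel", 85),
    ("never", "plan", 0), ("never", "plan", 10),
    ("never", "hotel", 5), ("never", "hotel", 5),
]

context_means = means_by_context(records)
spreads = spread_per_phrase(context_means)
print(spreads)
```

A phrase whose spread stays small across all contexts is one for which the context sentences did not systematically shift the interpretation.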