Abstract
Background
Although a large body of literature is available that describes the effects of smoking, asthma and COPD on lung function, most studies are restricted to a small age range and to one factor. As a consequence, available results are incomplete and often difficult to compare, also due to the ways the effects are expressed. Furthermore, current approaches consider one type of measurement only or several types separately.
Methods
We propose a probabilistic model that expresses the effects as number of years added to chronological age or, in other words, that estimates the biological age of the lungs. Using biological age as a measure of the effects has the advantage of facilitating the understanding of their severity and comparison of results. In our model, chronological age and other factors affecting the health status of the lungs generate biological age, which in turn generates lung function measurements. This structure enables the use of multiple types of measurement to obtain a more precise estimate of the effects and parameter sharing for characterization over large age ranges and of co-occurrence of factors with little data. We treat the parameters that model smoking habits and lung diseases as random variables to obtain uncertainty in the estimated effects.
Results
We use the model to investigate the effects of smoking, asthma and COPD on the TwinsUK Registry. Our results suggest that the combination of smoking with lung disease(s) has a greater effect than smoking or lung disease(s) alone, and that, in smokers, co-occurrence of asthma and COPD is more detrimental than asthma or COPD alone.
Conclusions
The proposed model, or other models based on a similar approach, could help improve the understanding of factors affecting lung function by enabling characterizations over large age ranges and of co-occurrence of factors with little data, and by enabling the use of multiple types of measurement. The software implementing the model can be downloaded at the first author’s webpage.
Keywords:
Lung function; Biological ageing; Probabilistic model; Generative model; Posterior distributions; Smoking; Asthma; COPD; FEV_{1}; FVC
Introduction
Smoking, asthma and Chronic Obstructive Pulmonary Disease (COPD) are the primary risk factors for lung function impairment in adults. Their average effects on the lungs are commonly estimated by measuring reduction in spirometric values with respect to a population of healthy individuals [1–7]. Due to the difficulty of collecting large samples spanning the whole of adulthood, most studies are restricted to a small age range and to one factor. As a consequence, effects over all ages and combined effects are reported in only a few studies or are still missing, and results from multiple studies are often difficult to compare, partly because of the different ways in which the effects are expressed. Furthermore, current approaches consider one type of measurement only, or several types separately (mostly Forced Expiratory Volume in 1 second (FEV_{1}) or Forced Vital Capacity (FVC)), whereas a combined analysis of several types of measurement could potentially provide a more precise quantification of the effects.
In this paper we address these issues by taking the viewpoint that reduced pulmonary function corresponds to premature ageing of the lungs: we propose a model that expresses average FEV_{1} and FVC reduction in individuals that smoke and/or have asthma and/or COPD in terms of number of years that are added to the lungs, or, in other words, we propose a model that estimates biological ageing of the lungs.
Biological age has been studied mainly at the whole body level (see [8–10] for recent references). At the respiratory system level, it was first introduced in [11] as a potentially more powerful type of information than spirometric values in motivating smokers to quit. Since then, several studies have investigated this hypothesis [12,13], using as biological age of a smoker the chronological age of a nonsmoker of the same height, gender and average FEV_{1} obtained from predictive equations. This approach was designed to estimate the specific effect of smoking on a single individual rather than the average effect on an entire population, which is the interest of this paper.
We propose a generative probabilistic approach that explicitly represents biological age using an unobserved random variable – an adjustment of chronological age induced by factors that have an impact on the health status of the respiratory system such as smoking habits, lung diseases, environmental and genetic factors, etc. Our generative approach enables us to integrate multiple aspects of the problem into a single consistent framework, which allows the use of multiple types of measurement as well as sharing of information and therefore estimation with little data. The probabilistic approach enables us to deal with uncertainty and noise in the data. Furthermore, it allows us to treat the parameters that model smoking habits and lung diseases as random variables and therefore to obtain uncertainty in the estimated effects of such factors on the lungs.
We evaluate our model on a subset of the TwinsUK Registry [14]. The dataset contains FEV_{1} and FVC measurements of several individuals along with information about smoking habits, asthma, COPD, and height. By examining the posterior distributions of the parameters that model the combinations of smoking, asthma and COPD, and the posterior distributions representing the biological age associated with each combination, we are able to make general and age-specific quantitative statements about the effects of these factors.
Methods
The TwinsUK Registry is a cohort of about 12000 twins aged 16 to 100 years from all over the United Kingdom used to study heritability and genetics of age-related diseases. It includes clinical, physiological, behavioural and lifestyle data collected since 1992 either at visits to the Department of Twin Research at King’s College London or via self-administered questionnaires. For historical reasons, it encompasses predominantly females in the age range 45–65 years.
For the study, we considered female individuals with spirometry data collected between 1992 and 2010 and with recorded height. Males were excluded as their number was too small to enable reliable estimation of model parameters.
The study was approved by the St. Thomas’ Hospital Research Ethics Committee, and all twins provided signed informed consent, in accordance with the Helsinki Declaration.
FEV_{1}-FVC measurements
Spirometry tests (model 2150; Vitalograph; Buckingham, England) were performed during visits (up to five for each individual) to the department. During each test, three FEV_{1}-FVC measurements were recorded and the one corresponding to maximum FEV_{1} was selected. The measurements were included in the study if they were in the normal range, identified as between 0.5 and 7.0 litres based on [15,16]. More information can be found in [17].
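The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the tuple layout, and the choice to apply the 0.5–7.0 litre range to both FEV_{1} and FVC are assumptions, and the sample values are invented.

```python
def select_trial(trials, low=0.5, high=7.0):
    """Return the (FEV1, FVC) pair with the largest FEV1, or None if out of range.

    `trials` is a list of (fev1_litres, fvc_litres) tuples from one spirometry test.
    Applying the range check to both values is an assumption of this sketch.
    """
    best = max(trials, key=lambda t: t[0])  # trial with maximum FEV1
    in_range = all(low <= value <= high for value in best)
    return best if in_range else None


# Invented values, for illustration only.
print(select_trial([(2.61, 3.40), (2.74, 3.35), (2.70, 3.52)]))
print(select_trial([(0.41, 1.20), (0.38, 1.10), (0.40, 1.15)]))
```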
Smoking status
We considered the subset of individuals that responded consistently in different smoking-related questionnaires between 1992 and 2010 (maximum of 13 questionnaires and 52 types of question). For such individuals, only those FEV_{1}-FVC measurements for which one of the following two conditions held were included in the study:
• The individual reported to have never smoked either cigarettes, cigars or pipes in a questionnaire completed in the same (or a subsequent) year in which the measurement was recorded.
• The individual reported to be a smoker in a questionnaire completed in the same year in which the measurement was recorded.
As the same condition was satisfied for all retained measurements from the same individual, a single nonsmoker or smoker status, valid across all retained measurements, could be assigned to each individual.
Asthma and COPD status
We considered the subset of individuals that responded consistently in different asthma-related questionnaires between 1992 and 2010 (maximum of 8 questionnaires and 4 types of question). Such individuals were classified as non-asthmatic if they reported to have never suffered from asthma and as asthmatic otherwise. Diagnosis by a doctor was not always explicitly required. A similar procedure was used to determine COPD status.
All possible combinations of smoking, asthma and COPD status give rise to 8 groups (see Table 1, where H stands for healthy with respect to smoking, asthma and COPD). Only individuals of known combined status, and therefore group, were included in the study. In order to eliminate potential bias in estimating the effects of smoking, asthma and COPD due to correlation between twins and multiple visits, in all groups except Group H we disregarded at random one twin for twins belonging to the same group and retained only the most recent FEV_{1}-FVC measurement for individuals with multiple visits. Group H, which contains a considerable number of datapoints and should therefore not be heavily affected by this correlation, was exempted from this filtering, as accurate estimation of parameters b (see (3)) requires a large amount of data.
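As a rough illustration of the filtering applied to the groups other than Group H, the following Python sketch keeps the most recent measurement per individual and one randomly chosen twin per pair within a group. The record layout and key names (`pair`, `person`, `year`, `group`) are hypothetical and do not reflect the actual TwinsUK data format.

```python
import random


def filter_group(records, rng):
    """Keep one twin per pair and group, and the latest measurement per individual.

    `records` are dicts with hypothetical keys 'pair', 'person', 'year', 'group'.
    Intended for every group except Group H, which is left unfiltered.
    """
    # Retain only the most recent measurement of each individual.
    latest = {}
    for r in records:
        key = r["person"]
        if key not in latest or r["year"] > latest[key]["year"]:
            latest[key] = r
    # Twins of the same pair that fall in the same group: keep one at random.
    by_pair = {}
    for r in latest.values():
        by_pair.setdefault((r["pair"], r["group"]), []).append(r)
    return [rng.choice(twins) for twins in by_pair.values()]


# Invented records, for illustration only.
records = [
    {"pair": 1, "person": "a", "year": 2001, "group": "S"},
    {"pair": 1, "person": "a", "year": 2005, "group": "S"},
    {"pair": 1, "person": "b", "year": 2003, "group": "S"},
    {"pair": 2, "person": "c", "year": 2002, "group": "A"},
    {"pair": 2, "person": "d", "year": 2004, "group": "S"},
]
print(len(filter_group(records, random.Random(0))))
```

Twins that belong to different groups are both retained, since the text only removes one twin when both fall in the same group.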
Table 1. FEV_{1}-FVC grouping
These filtering steps are summarized in Table 2. The final dataset encompassed 4403 FEV_{1}-FVC measurements taken from individuals in the age range 18.3–82.8 years (the age of an individual, calculated from birth date and date of measurement, is expressed in decimals of year by considering 365.25 days per year). The total number of measurements and the age range of each group are indicated in Table 2. The histogram representing the number of FEV_{1}-FVC measurements available at different ages is given in Figure 1. The number of measurements available for Group H at age ranges 18–44, 45–64 and 65–83 is respectively 871, 2221 and 650.
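The age convention stated above (days between birth and measurement divided by 365.25) can be written as a short Python helper. The function name and the example dates are illustrative assumptions.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # convention stated in the text


def decimal_age(birth, measured):
    """Age in decimal years at the date of measurement."""
    return (measured - birth).days / DAYS_PER_YEAR


# Invented dates, for illustration only.
print(decimal_age(date(1960, 1, 1), date(2000, 1, 1)))
```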
Table 2. Data filtering
Figure 1. FEV_{1}-FVC measurements. Histogram of the number of FEV_{1}-FVC measurements available at each age.
Our classification does not take into account the degree of severity of asthma, COPD and smoking. Therefore, the estimated effects have to be interpreted as corresponding to the most likely degree. We are also limited by our definition of asthma and COPD, which potentially includes individuals with a self-reported diagnosis. Finally, whilst the definition of nonsmoker and smoker is based on the year in which the FEV_{1}-FVC measurement was taken, this is not the case for asthma and COPD, as we do not have precise timing information about these diseases. We nevertheless expect little error due to this, as each individual answered the questionnaires multiple times.
Definition of biological age
Before describing the proposed model in detail, we define biological age and highlight key points that guided us in the construction of the model.
Figure 2 illustrates the concept of biological ageing for smokers (Group S) relative to the reference population of healthy individuals (Group H) based on FEV_{1}. As we can see from the measurements (Figure 2(a)), smokers have on average lower FEV_{1} than healthy individuals. This becomes clearer when looking at the measurement means (Figure 2(b)), which are averages computed over an 11-year sliding window to enforce smoothness over ages. For example, smokers’ mean at age 60 (computed from age interval 55–65) is equal to that of healthy individuals at age 68. It is therefore reasonable to define smokers’ biological age at chronological age 60 as approximately 68 years. That is, biological age is defined to be the chronological age of the healthy population corresponding to the same lung function mean. This is the population-level analogue of the individual-level definition introduced in [11–13].
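The definition above can be illustrated with a small Python sketch: a sliding-window mean, followed by a lookup of the chronological age at which the healthy mean equals a given value. The function names, the use of linear interpolation between window centres, and the healthy means in the example are assumptions made for illustration; they are not values from the study.

```python
def window_mean(ages, values, centre, half_width=5.0):
    """Mean of `values` whose age lies within `half_width` years of `centre`.

    half_width=5 gives the 11-year window used in the text (e.g. 55-65 for age 60).
    """
    picked = [v for a, v in zip(ages, values) if abs(a - centre) <= half_width]
    return sum(picked) / len(picked) if picked else None


def biological_age(target_mean, healthy_curve):
    """Chronological age at which the healthy mean equals `target_mean`.

    `healthy_curve` is a list of (age, mean) pairs sorted by age with strictly
    decreasing means; linear interpolation is used between neighbouring points.
    """
    for (a0, m0), (a1, m1) in zip(healthy_curve, healthy_curve[1:]):
        if m1 <= target_mean <= m0:
            return a0 + (m0 - target_mean) * (a1 - a0) / (m0 - m1)
    return None  # target lies outside the range covered by the healthy curve


# Hypothetical healthy FEV1 means in litres, not values from the study.
healthy = [(50, 2.8), (60, 2.5), (70, 2.2), (80, 1.9)]
print(biological_age(2.26, healthy))  # age of healthy individuals with mean 2.26
```

With these invented numbers, a smokers' mean of 2.26 litres at chronological age 60 maps to a biological age of about 68 years, mirroring the example in the text.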
Figure 2. FEV_{1} of healthy individuals and smokers. (a): FEV_{1} measurements (in litres) of healthy individuals (Group H, blue) and smokers (Group S, red). (b): Mean change of FEV_{1} for healthy individuals and smokers over ages. To enforce smoothness, each mean value is calculated over an age interval of 11 years (the X-axis labels indicate the ages at the middle of the intervals). (c): Estimated average decline of FEV_{1} for healthy individuals and smokers of average height (1.62 metres) using the model defined by (1). (d): Biological ageing of smokers relative to healthy individuals inferred from (c) using as definition of biological age the chronological age of healthy individuals with equal lung function mean.
A straightforward approach to estimating biological ageing would be to compute differences in average FEV_{1} decline between healthy individuals and smokers by fitting two separate lung function models (such a separate approach was used for example in [1,18]), and subsequently deduce biological ageing from these differences. We can use, for example, the model in [15] first proposed in [19], which is considered an accurate predictor of lung function in adults. In this model, the relationship between the log of the nth lung function measurement, l^{n}, chronological age, a^{n}, and height, h^{n}, is given by the following equation:
where b = {b_{1},b_{2},b_{3},b_{4}} is a set of unknown model parameters (modelling the log of the measurement, rather than the measurement, makes the model linear in b and therefore simplifies its estimation). By computing two separate sets b, one for healthy individuals and one for smokers, we can obtain the average FEV_{1} decline for the two populations, as shown in Figure 2(c) for individuals of average height (1.62 metres). From such estimates we can deduce smokers’ biological ageing, as shown in Figure 2(d).
This simple approach has several limitations. It cannot produce reliable estimates of b for the groups of small size (all groups other than Groups H and S). A single model of all groups in which some parameters are shared among them would alleviate this problem. Linear regression models that include factors such as smoking and lung disease as covariates, e.g. [20], have this property but are limited to additive combinations of effects.
Furthermore, it is not clear how to consider multiple types of measurement, such as FEV_{1} and FVC, to obtain a more precise estimate of biological age. If two separate models for FEV_{1} and FVC are fitted, the inferred biological ages need to be combined into a single estimate. Simply taking the average (as investigated in [11]) is not optimal: for example, at young ages, for which differences between healthy individuals and smokers are absent in FVC (see Figure 3), only FEV_{1} should be considered. An approach that estimates biological age from simultaneous modelling of FEV_{1} and FVC would overcome this difficulty.
Figure 3. FVC of healthy individuals and smokers. (a): FVC measurements (in litres) of healthy individuals (Group H, blue) and smokers (Group S, red). (b): Mean change of FVC for healthy individuals and smokers over ages. To enforce smoothness, each mean value is calculated over an age interval of 11 years (the X-axis labels indicate the ages at the middle of the intervals).
Finally, a probabilistic approach would deal better with noise in the data and would allow us to obtain uncertainty in the estimated biological ages, which is particularly important when little data is available.
A probabilistic model of biological age
Our approach to taking into account the observations above is to define a probabilistic model with an explicit unobserved random variable representing biological age. This variable is an adjustment of chronological age due to smoking habits, lung diseases, environmental and genetic factors, etc., namely all factors that have an impact on the health status of the respiratory system. Biological age combined with other factors that do not affect the health status of the respiratory system but heavily influence lung function measurements, namely height and measurement noise, generate FEV_{1} and FVC.
More specifically, our probabilistic model is defined by the following equations:
In these equations, l^{n} is a two-dimensional column vector containing the log of the nth FEV_{1}-FVC measurement (n indexes the measurement rather than the individual, as in Group H each individual can have more than one measurement), a^{n} is the chronological age of the corresponding individual, h^{n} is the height, c^{n} is a discrete variable representing the group to which measurement n belongs (c^{n} ∈ {1,…,8} corresponding to {Group H, Group A, Group C, Group AC, Group S, Group SA, Group SC, Group SAC}), and the variance of the Gaussian term ε^{n}, b_{i} (i = 1,…,4) and Σ_{l} are unknown deterministic parameters. Biological age is the unobserved random variable generated by (2).
Biological age is generated as a group-dependent linear transformation of chronological age a^{n}, u_{c^{n}}a^{n} + v_{c^{n}}, with the addition of a Gaussian term ε^{n}. The term ε^{n} represents the modification to chronological age that is specific to the nth measurement and not captured at the group level, and therefore also includes all unmeasured factors such as environmental and genetic factors.
Log-measurement l^{n} is obtained as a nonlinear transformation of biological age and height h^{n} (of the same form as (1)), to which a Gaussian noise term η^{n} is added. The term η^{n} is drawn from a two-dimensional Gaussian with non-diagonal covariance matrix Σ_{l}, which accounts for the high correlation between FEV_{1} and FVC. The parameters b_{i} (i = 1,…,4) are two-dimensional column vectors that model age-related decline of FEV_{1} and FVC. They are estimated from healthy individuals only to ensure that they describe lung function decline in the absence of smoking, asthma and COPD. These parameters are common to all groups, which is crucial in enabling the inclusion of groups with a small number of available datapoints.
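The generative structure just described can be summarized compactly as follows. In this sketch, the symbol \tilde a^{n} for biological age, \sigma^{2} for the variance of ε^{n}, the zero means of the two noise terms, and the placeholder f for the nonlinear form shared with (1) are notational assumptions introduced here for illustration only.

```latex
\begin{align}
\tilde a^{n} &= u_{c^{n}}\, a^{n} + v_{c^{n}} + \varepsilon^{n},
  & \varepsilon^{n} &\sim \mathcal{N}(0, \sigma^{2}), \\
l^{n} &= f(\tilde a^{n}, h^{n}; b) + \eta^{n},
  & \eta^{n} &\sim \mathcal{N}(0, \Sigma_{l}).
\end{align}
```

The first line corresponds to the transformation of chronological age into biological age, and the second to the generation of the log-measurement from biological age and height.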
The generative process induced by the model is depicted in Figure 4, where empty nodes indicate unknown quantities, whilst filled nodes indicate known quantities.
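To make the generative process concrete, the following Python sketch draws one (log FEV_{1}, log FVC) pair. It is an illustration under stated assumptions rather than the authors' implementation: the quadratic-in-age form of `log_lung_function`, the single noise standard deviation shared by both measurements, the correlation parameter `rho`, and every numeric value are invented.

```python
import math
import random


def log_lung_function(bio_age, height, b):
    """Placeholder for the nonlinear map in (3); this particular form is an assumption."""
    b1, b2, b3, b4 = b
    return b1 + b2 * math.log(height) + b3 * bio_age + b4 * bio_age ** 2


def sample_measurement(age, height, u, v, sigma_age, b_fev1, b_fvc, sd, rho, rng):
    """Draw one (log FEV1, log FVC) pair following the generative process."""
    # Step 1: chronological age -> biological age (group-level line plus noise).
    bio_age = u * age + v + rng.gauss(0.0, sigma_age)
    # Step 2: correlated measurement noise via a 2x2 Cholesky factor.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    eta1 = sd * z1
    eta2 = sd * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    # Step 3: biological age and height -> log measurements.
    return (log_lung_function(bio_age, height, b_fev1) + eta1,
            log_lung_function(bio_age, height, b_fvc) + eta2)


rng = random.Random(0)
# All numbers below are made up for illustration.
b_fev1 = (0.9, 1.8, -0.004, -0.00008)
b_fvc = (1.1, 1.9, -0.002, -0.00007)
healthy = sample_measurement(60, 1.62, 1.0, 0.0, 2.0, b_fev1, b_fvc, 0.1, 0.8, rng)
smoker = sample_measurement(60, 1.62, 1.2, -4.0, 2.0, b_fev1, b_fvc, 0.1, 0.8, rng)
print(healthy, smoker)
```

Note that the same parameters `b_fev1` and `b_fvc` are used for both individuals; only the slope and intercept of the age transformation differ between groups, which is the parameter sharing described in the text.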
Figure 4. Probabilistic model of biological age. Generative process induced by our model. The two plate sections indicate that the enclosed structures are repeated for all 8 groups and N measurements. Combined smoking, asthma and COPD status c^{n} (through parameters u_{c^{n}} and v_{c^{n}} and the addition of a noise term ε^{n} representing influence of smoking habits, lung disease, and unmeasured factors such as environmental and genetic factors that are specific to the nth measurement) transforms chronological age a^{n} into biological age. Biological age and height h^{n} generate (through parameters b and the addition of a noise term η^{n} representing measurement noise) lung function measurement l^{n}.
The linear transformation of chronological age contains both a slope u_{c^{n}} and an intercept v_{c^{n}}. The slope determines the rate at which biological age changes with chronological age. Only positive values of u_{c^{n}} are to be expected, as they indicate that biological age increases with chronological age: u_{c^{n}} = 1 indicates an increase rate of one year per year, whilst u_{c^{n}} > 1 (u_{c^{n}} < 1) indicates an increase rate higher (lower) than one year per year. For example, Figure 2(d) implies u_{5} > 1. The intercept determines the value of biological age at birth.
Parameters b_{i} (i = 1,…,4), the variance of ε^{n} and Σ_{l} are treated as deterministic quantities and their values are learned as detailed in the Appendix. Parameters u_{j} and v_{j} (j = 1,…,8) are treated as independent Gaussian random variables with a large prior variance. This enables us to obtain uncertainty in the estimated effects of smoking, asthma and COPD. The large variance makes the prior uninformative, which ensures that the posterior variance, and therefore uncertainty in the estimated effects, depends fully on the data.
In a probabilistic formulation, we can write the model as
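where the symbol ^{T} indicates the transpose operator and I is the identity matrix. To simplify the notation, in the rest of the paper we omit conditioning on all quantities that are not treated as random, namely μ, Σ, a^{n}, c^{n}, the variance of ε^{n}, h^{n}, b and Σ_{l}. The three basic Gaussian density functions defining the model (the prior on (u_{j},v_{j}), the density of biological age given (u_{c^{n}},v_{c^{n}}), and the density of l^{n} given biological age) are therefore written without these quantities.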
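For concreteness, the three Gaussian densities can be sketched as below once the conditioning on non-random quantities is dropped. As before, \tilde a^{n} (biological age), \sigma^{2} (variance of ε^{n}) and f (the nonlinear form shared with (1)) are notational assumptions of this sketch; μ and Σ denote the prior mean and covariance of (u_{j},v_{j})^{T}, with Σ diagonal because u_{j} and v_{j} are a priori independent.

```latex
\begin{align}
p(u_{j}, v_{j}) &= \mathcal{N}\big((u_{j}, v_{j})^{T};\, \mu,\, \Sigma\big), \\
p(\tilde a^{n} \mid u_{c^{n}}, v_{c^{n}}) &= \mathcal{N}\big(\tilde a^{n};\, u_{c^{n}} a^{n} + v_{c^{n}},\, \sigma^{2}\big), \\
p(l^{n} \mid \tilde a^{n}) &= \mathcal{N}\big(l^{n};\, f(\tilde a^{n}, h^{n}; b),\, \Sigma_{l}\big).
\end{align}
```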
Inference
In order to make deductions about the effects of smoking, asthma and COPD, we need to infer the posterior distributions of the group parameters given all N measurements, p(u_{j},v_{j} | l^{1},…,l^{N}) (j = 1,…,8), and the posterior distributions describing the biological age of each group at chronological age a, p(u_{j}a + v_{j} | l^{1},…,l^{N}). An analysis of p(u_{j},v_{j} | l^{1},…,l^{N}) enables us to make general (summarized over all ages) statements about the groups: lack of or small overlap of some of these distributions indicates fundamentally different biological ageing of the corresponding groups. An analysis of p(u_{j}a + v_{j} | l^{1},…,l^{N}) enables us to make statements which are specific to age a.
As explained above, we treat u_{j} and v_{j} as a priori independent random variables with Gaussian distributions. The joint posterior distribution factorizes as
where {l^{n} | c^{n} = j} denotes the subset of measurements belonging to group j. The factors p(u_{j},v_{j} | {l^{n} | c^{n} = j}) have unknown analytical form, as the transformation from biological age to the measurements (3) is nonlinear. We estimated them numerically and found that they are all indistinguishable from Gaussian density functions. As a consequence, we also found that p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}) are Gaussian. A detailed explanation of how to estimate these posterior distributions is given in the Appendix.
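The two statements above can be written out explicitly. The first line is the group-wise factorization described in the text. The second is the standard result that a linear combination of jointly Gaussian variables is Gaussian; the symbols m_{j} and S_{j} for the posterior mean vector and covariance matrix of (u_{j},v_{j})^{T} are names assumed here for illustration.

```latex
\begin{align}
p(u_{1}, v_{1}, \ldots, u_{8}, v_{8} \mid l^{1}, \ldots, l^{N})
  &= \prod_{j=1}^{8} p\big(u_{j}, v_{j} \mid \{ l^{n} \mid c^{n} = j \}\big), \\
(u_{j}, v_{j})^{T} \sim \mathcal{N}(m_{j}, S_{j})
  \;&\Longrightarrow\;
  u_{j} a + v_{j} \sim \mathcal{N}\big( (a, 1)\, m_{j},\; (a, 1)\, S_{j}\, (a, 1)^{T} \big).
\end{align}
```

Expanding the variance gives a^{2} S_{uu} + 2a S_{uv} + S_{vv}, a quadratic in a, so the uncertainty about biological age depends on the chronological age at which it is evaluated.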
Results
In the next two sections we analyse the posterior distributions p(u_{j},v_{j} | {l^{n} | c^{n} = j}) and p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}) obtained when fitting the proposed model to our dataset.
Analysis of posterior distributions p(u_{j},v_{j} | {l^{n} | c^{n} = j})
Figure 5(a) shows the contour plots of p(u_{j},v_{j} | {l^{n} | c^{n} = j}): each ellipse is centred at the mean and encloses 95% of the distribution.
Figure 5. Posterior distributions p(u_{j},v_{j} | {l^{n} | c^{n} = j}). (a): Contour plots of the posterior distributions p(u_{j},v_{j} | {l^{n} | c^{n} = j}). For each group, we show an ellipse centred at the mean and enclosing 95% of the distribution. (b): Linear transformation of chronological age a, u_{j}a+v_{j}, for 100 pairs (u_{j},v_{j}) sampled from p(u_{j},v_{j} | {l^{n} | c^{n} = j}) for Groups H (continuous blue) and SC (dashed green), showing that uncertainty is higher at young and old ages and lower at middle ages.
We can see that the posterior distributions have different spread, depending on the combined effect of number and dispersion of measurements. For Group H (continuous-blue ellipse), the high number of available measurements makes the distribution highly peaked around u_{1} = 1, v_{1} = 0, despite the high dispersion at each age (see Figure 2(a) and Figure 3(a)). This highlights an important point about how to interpret the posterior distributions: they provide us with a measure of uncertainty on the estimated average biological ageing. Thus, even if dispersion at each age is high, the model can still be certain about the average biological age.
The major axes of the ellipses all have very similar directions, expressing the fact that increasing the slope u_{j} requires decreasing the intercept v_{j} and vice versa. This means that samples from the posterior distributions give linear transformations of chronological ages intersecting at middle ages, as shown in Figure 5(b) for Groups H and SC. In other words, uncertainty about biological age is higher at young and old ages than at middle ages, which is what we would expect from the distribution of measurements shown in Figure 1.
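The link between a negative slope-intercept correlation and lower uncertainty at middle ages can be checked numerically. The sketch below computes the standard deviation of u_{j}a + v_{j} for a jointly Gaussian (u_{j},v_{j}) and the age at which it is smallest. The function names and the posterior moments are hypothetical and chosen only for illustration.

```python
import math


def linear_combination_sd(a, var_u, var_v, cov_uv):
    """Standard deviation of u*a + v when (u, v) is jointly Gaussian."""
    return math.sqrt(a * a * var_u + 2.0 * a * cov_uv + var_v)


def age_of_least_uncertainty(var_u, cov_uv):
    """Age minimizing the variance a^2*var_u + 2*a*cov_uv + var_v."""
    return -cov_uv / var_u


# Hypothetical posterior moments with negatively correlated slope and intercept.
var_u, var_v, cov_uv = 0.0004, 1.3, -0.022
for a in (20, 55, 80):
    print(a, round(linear_combination_sd(a, var_u, var_v, cov_uv), 3))
print(age_of_least_uncertainty(var_u, cov_uv))
```

With a negative covariance the minimum falls at a positive age, and the standard deviation grows towards both younger and older ages, which is the pattern described above.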
With the exception of Group C (continuous-green ellipse), for which there is small overlap, unhealthy groups do not overlap with Group H, indicating that biological ageing differs from chronological ageing.
If we consider Group A (continuous-red ellipse) versus Group SA (dashed-red ellipse), Group C versus Group SC (dashed-green ellipse), and Group AC (continuous-cyan ellipse) versus Group SAC (dashed-cyan ellipse), we can see that the ellipses do not overlap (considerably) and that the centre of the smoking ellipse is closer to the upper-right corner than the centre of the non-smoking ellipse, which means that smoking in addition to having lung disease(s) induces a significant increase in ageing with respect to having lung disease(s) alone. The fact that Group S (dashed-blue ellipse) does not overlap with Groups SA, SC and SAC and is closer to the lower-left corner signifies that this increase in ageing is not due to smoking alone but is a truly combined effect. We can therefore conclude that the combination of smoking with lung disease(s) has a more severe effect on ageing than smoking or lung disease(s) alone. Lack of overlap despite the very small number of available measurements, which causes considerable spread of some of these distributions, makes us confident about this conclusion.
Comparison of Groups A and C with Group AC and comparison of Groups SA and SC with Group SAC reveal the effect of co-occurrence of asthma and COPD versus either disease. Unlike the non-smoking case, for which the large overlap does not enable us to draw conclusions, in the smoking case the posterior distributions indicate a substantial increase in ageing when the diseases co-occur.
Analysis of posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j})
Figure 6(a) shows the standard deviations of p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}). As discussed above, the standard deviations, and therefore uncertainties about the estimated effects, are lower at middle ages, for which more measurements are available. Figure 6(b–f) shows the posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}) at ages 20, 45, 55, 65 and 80 years: the length between two stars equals 2 × 1.96 times the standard deviation. Figure 7 illustrates the behaviour of the posterior distributions every 5 years: each rectangle is centred at the mean and its length equals 2 × 1.96 times the standard deviation.
Figure 6. Posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}). (a): Standard deviations of the posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}). (b–f): Posterior distributions p(u_{j}a+v_{j} | {l^{n} | c^{n} = j}) for ages a = 20, 45, 55, 65 and 80 years. The length between two stars equals 2 × 1.96 times the standard deviation. The legend in (b) is valid for all plots.
Figure 7. Posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}) over all ages. Posterior distributions p(u_{j}a + v_{j} | {l^{n} | c^{n} = j}) for age a in the range 20–80 years in 5-year steps. Each rectangle is centred at the mean and its length equals 2 × 1.96 times the standard deviation.
From these figures we can see that, at the extreme ages of 20 and 80 years, for which the standard deviations are higher, some of the general conclusions made in the previous section are no longer valid. More specifically, at age 20 there is considerable overlap between Groups A and SA, between Groups C and SC, and between Groups AC and SAC. Therefore, it is not possible to deduce from the posterior distributions that the combination of smoking with lung disease(s) has a more severe effect on ageing than smoking or lung disease(s) alone at this early age. Similarly, we cannot draw conclusions about co-occurrence of asthma and COPD versus either disease. At age 80, Groups AC and SAC are significantly different, as are Groups S and SAC, so that we can conclude that the combination of smoking with asthma-COPD (by asthma-COPD we indicate co-occurrence of asthma and COPD) has a more severe effect on ageing. However, this is not the case for asthma and COPD alone. Furthermore, we cannot conclude that the combined effect of asthma and COPD is higher than the single effects. By looking at the other ages, we can see that the full set of statements made in the previous section is valid for the age range 50–60.
Notice that the difference between Groups H and S is already significant at age 30. This shows that at young ages the model is considering FEV_{1} measurements only to determine smokers’ biological age, as desired (see discussion of Figure 3 above).
This age-specific analysis has enabled us to determine at which ages the general statements about differences in groups made in the previous section are valid. However, it also reveals an important difference between younger and older ages, namely that, with the exception of Groups A and C, mean distances of unhealthy groups from Group H are substantially higher at older ages. Thus the effects of most combinations of factors seem to increase with age.
In Table 3 we give the estimated number of years that are added to chronological age (mean ± 1 standard deviation) for the age range 45–64. From the table we can make a final interesting observation: at age 50 the effect of combined smoking with asthma-COPD seems more severe than additive. Indeed, when considering 1.96 times the standard deviation, the sum of the maximum numbers of years added to chronological age in Groups S and AC is 23.8, whilst the minimum number of years added in Group SAC is 23.6.
Table 3. Estimated number of years added to chronological age
Discussion
To date, biological age of the lungs has been used at the individual level to investigate its effectiveness in motivating smokers to quit. In this paper, we have used biological age of the lungs at the population level to analyse the average effects of smoking, asthma and COPD on the health status of the respiratory system. As for the individual level case, knowing how much older, on average, the lungs of individuals that smoke and/or have lung disease(s) look relative to the healthy population enables a more immediate understanding of the impact of these factors on the health status of the lungs. However, with this work we have shown that modelling lung function through biological age has additional benefits.
Such modelling makes it possible to combine multiple types of measurement properly, in order to obtain a more precise estimate of the health status of the respiratory system. We have seen that our approach correctly deals with the case in which lung function differences are not evident in one type of measurement.
Such modelling also enables parameter sharing for characterization over large age ranges and of co-occurrences of factors with little data. We obtained results that are in agreement with the literature (see the next section) using a small amount of data. Furthermore, we could compare cases that have not been previously analysed, such as nonsmokers with asthma and COPD versus smokers with asthma and COPD.
By treating the parameters that model smoking and lung diseases as random variables, we could obtain uncertainty in the estimated effects of such factors on the lungs.
Finally, such modelling enables more immediate interpretation and comparison of results within and among different studies than approaches expressing effects in spirometric values. Whilst we did not show that in this paper, the following examples can clarify this point. Suppose that Studies A and B find that FEV_{1} mean value at age 60 in the healthy population is respectively 2.75 and 2.5 litres, and that both studies find that FEV_{1} mean value at age 60 in the smoking population is 2.25 litres. One has to consider the mean values of the healthy populations to understand that Study A estimates that smoking has a stronger effect than Study B. On the other hand, this would be immediately evident if biological age were used, since the estimated number of years added to chronological age in smokers in Study A would be higher than in Study B. As another example, consider investigating whether the effect of smoking on pulmonary function in females and males is different (published results on this subject are controversial [21–27]). Whilst our analysis was restricted to females, males can be easily included in the model, e.g. by having separate sets of parameters b, u and v so that only noise covariances are shared between genders. Similarly to the previous example, if spirometric values are compared as in current studies, the values of healthy males and females need to be considered to understand whether the impact of smoking is gender specific, whilst this is not the case if biological age is used, as biological age is a measure that is relative to the healthy population.
One limitation of the proposed model is that it does not account for longitudinal and twin structure, so that we had to exclude many datapoints from the analysis. We are currently investigating an extension that incorporates both types of structure by adding Gaussian terms which are shared across ages and twins.
The choice of modelling biological age as a linear transformation of chronological age, as defined in (2), was motivated by simplicity and supported by Figure 2(d). This figure indicates that smokers’ biological age is well described as a linear transformation and makes it reasonable to expect that linear or piecewise linear transformations should be valid transformations for the other groups too. As the size of our dataset was too small to enable reliable estimation of piecewise linear transformations, we restricted ourselves to linear ones. However, piecewise linear transformations would be worthy of investigation in studies in which more datapoints are available.
The form of nonlinearity in (3) enabled us to describe lung function decline in adulthood quite accurately whilst keeping the model relatively simple. However, it would be worthwhile to also consider the more flexible case in which the form is estimated, particularly when considering other types of measurement in addition to, or in place of, FEV_{1} and FVC. Some work in this direction, specifically addressing complex lung function growth in young individuals, has been done in [18] and in [16], which proposed the model of [28]. We are currently investigating modelling lung function decline with Gaussian radial basis functions.
Treating b as deterministic rather than random enabled us to use simple numerical integration for inference, avoiding the need to develop more complex approximation schemes. It is reasonable to assume that a posterior on b would be highly peaked (as is the posterior of u_{1},v_{1}, p(u_{1},v_{1} | {l^{n} : c^{n} = 1}), computed from the same individuals) and therefore that this choice had minor impact on the estimated uncertainties.
Finally, we would like to note that, whilst the proposed model can also provide single individuals with a biological age, such use of the model would require a careful analysis of how to set the measurement noise covariance Σ_{l}, as the maximum likelihood approach used in this paper could be suboptimal.
Conclusions
We have introduced a probabilistic model based on the concept of biological age to analyse the effects of smoking, asthma and COPD on female lung function. Our approach enabled us to make statements over large age ranges and about co-occurrence of factors with little data.
We have found that co-occurrence of smoking with asthma, COPD, or combined asthma and COPD has a more severe effect on ageing than smoking, asthma, COPD, or combined asthma and COPD alone. This is in agreement with the findings in [29], which suggest that the rate of decline of lung function is faster in smokers with emphysema than in ex-smokers with emphysema. This is also in line with the results in [4,20,30], which show that smoking has a strong additional ageing effect on individuals with asthma. To the best of our knowledge, results on co-occurrence of smoking with combined asthma and COPD have not been previously reported.
We have also found that co-occurrence of asthma and COPD has a more detrimental effect on the lungs than asthma or COPD alone. This is in line with recent studies that indicate a reduced quality of life in individuals with both asthma and COPD compared with individuals who have only one of the two diseases [31-33].
By analysing differences among ages, we could conclude that, with the exception of asthma and COPD alone, the effects of the combinations of factors increase with age and therefore are more severe at older ages. This is in agreement with other studies, for example [4], in which it is shown that the effects of smoking and combined smoking with asthma increase with age, whilst the effect of asthma is constant.
At age 50, for which the standard deviations are lower, our model estimated that the average number of years (± 1 standard deviation) added to chronological age by the factors is approximately as follows. Asthma: 6.6 ± 1.4; COPD: 5.7 ± 4.0; asthma-COPD: 8.8 ± 3.5; smoking: 6.6 ± 0.7; smoking-asthma: 16.8 ± 3.5; smoking-COPD: 17.2 ± 2.0; smoking-asthma-COPD: 29.5 ± 3.0.
The software implementing the model can be downloaded at the first author’s webpage.
Appendix
Below we describe how to estimate the model parameters b, and Σ_{l} and the posterior distributions p(u_{j},v_{j} | {l^{n} : c^{n} = j}) and p(u_{j}a+v_{j} | {l^{n} : c^{n} = j}). In order to avoid underflow/overflow problems, computations were performed in log scale.
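As an illustration of why log-scale computation matters here, the following minimal sketch (not the authors' implementation) sums likelihood values that underflow in linear scale. The function name `log_sum_exp` and the example log-likelihood values are hypothetical.

```python
import numpy as np


def log_sum_exp(log_values):
    """Return log(sum(exp(log_values))) without leaving log scale."""
    log_values = np.asarray(log_values, dtype=float)
    shift = np.max(log_values)
    # After subtracting the maximum, the largest term is exp(0) = 1,
    # so the sum cannot underflow to zero.
    return shift + np.log(np.sum(np.exp(log_values - shift)))


# Hypothetical log-likelihoods of a dataset at three parameter values.
log_lik = np.array([-1200.0, -1201.0, -1205.0])

naive_sum = np.sum(np.exp(log_lik))   # underflows to 0.0 in double precision
stable_log_sum = log_sum_exp(log_lik)

# Normalised weights (e.g. posterior probabilities on a grid) sum to one.
weights = np.exp(log_lik - stable_log_sum)
```

The naive sum is exactly zero, so its logarithm would be undefined, whereas the shifted version stays finite and yields properly normalised weights.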
Parameter learning
As explained above, the parameter set b was estimated from the healthy group (Group H) only to make sure that it describes lung function decline in the absence of smoking, asthma and COPD. We learned the two subsets of b corresponding to FEV_{1} and FVC separately using ordinary least squares. We then fixed b and estimated parameters and Σ_{l} using an Expectation Maximization (EM) approach [34]. More specifically, the EM approach consisted of iterating the following two steps until convergence:
• E-step: Perform inference on to compute the quantities required to perform the M-step.
• M-step: Find the values of and Σ_{l} that maximize the expectation of the complete data log-likelihood
where 〈·〉_{p(·)} denotes averaging with respect to p(·) and is computed using the values of and Σ_{l} estimated in the previous iteration.
The part of the expectation of the complete data log-likelihood that depends on and Σ_{l} is given by
We excluded the parameter set b from the EM approach as we found that otherwise the nonlinearity in FEV_{1} and FVC decline with age of healthy individuals would be transferred to the biological age (through high noise variance ) so that b would not represent normal lung function decline.
M-step: Updates for
Setting to zero the derivative of (4) with respect to
where the required moments are estimated as explained below.
M-step: Updates for Σ_{l}
Setting to zero the derivative of (4) with respect to
where , we obtain the optimal Σ_{l}
E-step: Inference on
The marginal likelihood can be estimated as
where the required integrations are computed numerically.
Then the posterior distribution can be estimated as
From this distribution, the moments required for the parameter updates (among them 〈u_{j}v_{j}〉) are computed by numerical integration.
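The numerical integration of posterior moments can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the grid ranges, the grid resolution and the toy log-posterior (an independent Gaussian in u and v, chosen so that the result can be checked by hand) are all hypothetical stand-ins for the model's actual posterior.

```python
import numpy as np

# Grid over the two parameters (ranges and resolution are hypothetical).
u = np.linspace(0.5, 1.5, 201)
v = np.linspace(-10.0, 10.0, 401)
U, V = np.meshgrid(u, v, indexing="ij")
du, dv = u[1] - u[0], v[1] - v[0]


def log_unnormalised_posterior(U, V):
    """Stand-in for log-likelihood plus log-prior; a toy Gaussian here."""
    return -0.5 * (((U - 1.1) / 0.05) ** 2 + ((V - 2.0) / 1.0) ** 2)


log_post = log_unnormalised_posterior(U, V)
# Work in log scale, then normalise so that the density integrates to one.
post = np.exp(log_post - log_post.max())
post /= post.sum() * du * dv


def moment(f):
    """Approximate the expectation of f(u, v) by a Riemann sum on the grid."""
    return np.sum(f * post) * du * dv


mean_u, mean_v = moment(U), moment(V)
mean_uv = moment(U * V)
var_u = moment(U ** 2) - mean_u ** 2
```

For the toy density the grid estimates recover the known values (mean of u near 1.1, mean of v near 2.0, variance of u near 0.0025), which is a useful sanity check before replacing the stand-in with a real log-posterior.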
Approximation
The EM approach for learning and Σ_{l} described above is time-consuming. A comparison of this approach with an approximation in which u_{j} and v_{j} are considered as deterministic did not show any difference in the learned values of and Σ_{l}. We therefore used this approximation for the presented results.
In this alternative approach, the updates for and Σ_{l} in the M-step are similar to the ones above in which the optimal values of u_{j} and v_{j} are used and becomes , computed as . The optimal values of u_{j} and v_{j} are learned by setting to zero
that is, by solving the following linear system:
where N_{j} indicates the number of measurements belonging to Group j.
Computing the effects of smoking, asthma and COPD
The posterior distributions p(u_{j},v_{j} | {l^{n} : c^{n} = j}) can be computed from (5) by numerical integration over . The posterior distributions p(u_{j}a + v_{j} | {l^{n} : c^{n} = j}) can be computed from p(u_{j},v_{j} | {l^{n} : c^{n} = j}) using the formula for the linear transformation of random variables and numerical integration. However, as we found numerically that the p(u_{j},v_{j} | {l^{n} : c^{n} = j}) are Gaussian, p(u_{j}a + v_{j} | {l^{n} : c^{n} = j}) can be computed more simply using the formula for the linear transformation of Gaussian random variables. A transformation of p(u_{j},v_{j} | {l^{n} : c^{n} = j}) was performed to correct the small deviation of the mean of p(u_{1},v_{1} | {l^{n} : c^{n} = 1}) from (1,0).
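For reference, the Gaussian case works as follows: if (u_{j},v_{j}) is Gaussian with mean m and covariance S, then u_{j}a + v_{j} is Gaussian with mean w^T m and variance w^T S w, where w = (a, 1). The sketch below applies this formula; the posterior mean and covariance used are hypothetical values chosen for illustration, not estimates from the paper, and the function name is an assumption.

```python
import numpy as np


def biological_age_posterior(a, mean_uv, cov_uv):
    """Mean and standard deviation of u*a + v for Gaussian (u, v)."""
    w = np.array([a, 1.0])            # u*a + v = w . (u, v)
    mean = w @ mean_uv
    var = w @ cov_uv @ w
    return mean, np.sqrt(var)


# Hypothetical posterior over (u, v); values are illustrative only.
mean_uv = np.array([1.2, -3.0])
cov_uv = np.array([[0.0004, -0.02],
                   [-0.02, 1.2]])

age = 50.0
bio_mean, bio_sd = biological_age_posterior(age, mean_uv, cov_uv)
years_added = bio_mean - age          # effect expressed as years added
```

Note that the variance a^2 S_uu + 2a S_uv + S_vv is quadratic in a, so when u and v are negatively correlated the uncertainty is smallest at an intermediate age (a = -S_uv / S_uu, which is 50 for the illustrative values above) and grows towards both ends of the age range.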
Abbreviations
COPD: Chronic obstructive pulmonary disease; FEV_{1}: Forced expiratory volume in one second; FVC: Forced vital capacity; EM: Expectation maximization.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SC conceived and implemented the model, performed the experiments and the data filtering, and wrote the manuscript. JW contributed to the discussion and interpretation of the model and of the experiments, revised the manuscript and gave suggestions about its structure. AV contributed to the discussion and interpretation of the model and of the experiments and revised the manuscript. HT performed data cleaning and smoking, asthma and COPD status assignment. TS contributed to experimental design and data collection. All authors read and approved the final manuscript.
Acknowledgements
Silvia Chiappa would like to thank David Barber for insightful suggestions and discussions on the content of this paper and for proposing and implementing a sampling approach to the Gaussian radial basis functions model. She is also very grateful to Andrew Brown, Zhihao Ding, David Knowles and Nevena Lazic for many useful discussions and for revising the manuscript. This work has been funded by Microsoft Research Connections, Microsoft Research Cambridge and by the EU FP7 grant EuroBATS (No. 259749). The TwinsUK study was funded by the Wellcome Trust; European Community’s Seventh Framework Programme (FP7/2007-2013). The study also receives support from the National Institute for Health Research (NIHR) Clinical Research Facility at Guy’s & St Thomas’ NHS Foundation Trust and NIHR Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. Silvia Chiappa was funded by Microsoft Research Connections and Microsoft Research Cambridge. Ana Viñuela and Hannah Tipney were funded by EuroBATS (No. 259749). Tim Spector is an NIHR Senior Investigator and is holder of an ERC Advanced Principal Investigator award.
References

1. Sherrill DL, Lebowitz MD, Knudson RJ, Burrows B: Smoking and symptom effects on the curves of lung function growth and decline.
Am Rev Respir Dis 1991, 144:17-22.

2. Kerstjens HA: Decline of FEV1 by age and smoking status: facts, figures, and fallacies.
Thorax 1997, 52(9):820-827.

3. Willemse BWM, Postma DS, Timens W, ten Hacken NHT: The impact of smoking cessation on respiratory symptoms, lung function, airway hyperresponsiveness and inflammation.
Eur Respir J 2004, 23(3):464-476.

4. James AL, Palmer LJ, Kicic E, Maxwell PS, Lagan SE, Ryan GF, Musk AW: Decline in lung function in the Busselton Health Study: the effects of asthma and cigarette smoking.
Am J Respir Crit Care Med 2005, 171(2):109-114.

5. Sears MR: Lung function decline in asthma.
Eur Respir J 2007, 30(3):411-413.

6. Lee PN, Fry JS: Systematic review of the evidence relating FEV1 decline to giving up smoking.
BMC Med 2010, 8:84-112.

7. Drummond MB, Hansel NN, Connett JE, Scanlon PD, Tashkin DP, Wise RA: Spirometric predictors of lung function decline and mortality in early chronic obstructive pulmonary disease.
Am J Respir Crit Care Med 2012, 185(12):1301-1306.

8. MacDonald SWS, Dixon RA, Cohen AL, Hazlitt JE: Biological age and 12-year cognitive change in older adults: findings from the Victoria Longitudinal Study.
Gerontology 2004, 50(2):64-81.

9. Klemera P, Doubal S: A new approach to the concept and computation of biological age.
Mech Ageing Dev 2006, 127(3):240-248.

10. Knowles DA, Parts L, Glass D, Winn JM: Inferring a measure of physiological age from multiple ageing related phenotypes.
NIPS Workshop From Statistical Genetics to Predictive Models in Personalized Medicine, 2011.

11. Morris JF, Temple W: Lung age estimation for motivating smoking cessation.
Prev Med 1985, 14:655-662.

12. Parkes G, Greenhalgh T, Griffin M, Dent R: Effect of smoking on FEV1 decline in a cross-sectional and longitudinal study of a large cohort of Japanese males.
BMJ 2008, 336:598-600.

13. Bize R, Burnand B, Mueller Y, Rège-Walther M, Camain JY, Cornuz J: Biomedical risk assessment as an aid for smoking cessation.
Cochrane Database Syst Rev 2012, 12:CD004705.

14. Moayyeri A, Hammond CJ, Valdes AM, Spector TD: Cohort profile: TwinsUK and healthy ageing twin study.
Int J Epidemiol 2013, 42:76-85.

15. Falaschetti E, Laiho J, Primatesta P, Purdon S: Prediction equations for normal and low lung function from the Health Survey for England.
Eur Respir J 2004, 23(3):456-463.

16. Stanojevic S, Wade A, Stocks J, Hankinson J, Coates AL, Pan H, Rosenthal M, Corey M, Lebecque P, Cole TJ: Reference ranges for spirometry across all ages: a new approach.
Am J Respir Crit Care Med 2008, 177(3):253-260.

17. Zhai G, Valdes AM, Cherkas L, Clement G, Strachan D, Spector TD: The interaction of genes and smoking on forced expiratory volume: a classic twin study.
Chest 2007, 132(6):1772-1777.

18. Wypij D: Spline and smoothing approaches to fitting flexible models for the analysis of pulmonary function data.
Am J Respir Crit Care Med 1996, 154:S223-S228.

19. Brändli O, Schindler C, Künzli N, Keller R, Perruchoud AP: Lung function in healthy never smoking adults: reference values and lower limits of normal of a Swiss population.
Thorax 1996, 51(3):277-283.

20. Ulrik CS, Lange P: Decline of lung function in adults with bronchial asthma.
Am J Respir Crit Care Med 1994, 150(3):629-634.

21. Xu X, Weiss ST, Rijcken B, Schouten JP: Smoking, changes in smoking habits, and rate of decline in FEV1: new insight into gender differences.
Eur Respir J 1994, 7(6):1056-1061.

22. Xu X, Li B, Wang L: Gender difference in smoking effects on adult pulmonary function.
Eur Respir J 1994, 7(3):477-483.

23. Peat JK, Woolcock AJ, Cullen K: Decline of lung function and development of chronic airflow limitation: a longitudinal study of nonsmokers and smokers in Busselton, Western Australia.
Thorax 1990, 45:32-37.

24. Anthonisen NR, Connett JE, Murray RP: Smoking and lung function of lung health study participants after 11 years.
Am J Respir Crit Care Med 2002, 166(5):675-679.

25. Chen Y, Horne SL, Dosman JA: Increased susceptibility to lung dysfunction in female smokers.
Am Rev Respir Dis 1991, 143(6):1224-1230.

26. Langhammer A, Johnsen R, Gulsvik A, Holmen TL, Bjermer L: Sex differences in lung vulnerability to tobacco smoking.
Eur Respir J 2003, 21(6):1017-1023.

27. Kohansal R, Martinez-Camblor P, Buist AS, Mannino DM, Soriano JB, Agustí A: The natural history of chronic airflow obstruction revisited: an analysis of the Framingham offspring cohort.
Am J Respir Crit Care Med 2009, 180:3-10.

28. Cole TJ, Green PJ: Smoothing reference centile curves: the LMS method and penalized likelihood.
Stat Med 1992, 11(10):1305-1319.

29. Hughes JA, Hutchison DC, Bellamy D, Dowd DE, Ryan KC, Hugh-Jones P: Annual decline of lung function in pulmonary emphysema: influence of radiological distribution.
Thorax 1982, 37:32-37.

30. Lange P, Parner J, Vestbo J, Schnohr P, Jensen G: A 15-year follow-up study of ventilatory function in adults with asthma.
New England J Med 1998, 339(17):1194-1200.

31. Gibson PJ, Simpson JL: The overlap syndrome of asthma and COPD: what are its features and how important is it?
Thorax 2009, 64:728-735.

32. Kauppi P, Kupiainen H, Lindqvist A, Tammilehto L, Kilpeläinen M, Kinnula VL, Haahtela T, Laitinen T: Overlap syndrome of asthma and COPD predicts low quality of life.
J Asthma 2011, 48(3):279-285.

33. Hardin M, Silverman EK, Barr RG, Hansel NN, Schroeder JD, Make BJ, Crapo JD, Hersh CP: The clinical features of the overlap between COPD and asthma.
Respir Res 2011, 12:127-134.

34. McLachlan GJ, Krishnan T: The EM Algorithm and Extensions. Hoboken: John Wiley & Sons; 2008.