Missing Step Count Data? Step Away From the Expectation – Maximization Algorithm

In studies that compare physical activity between groups of individuals, it is common for physical activity to be quantified by step count, measured by accelerometers or other wearable devices. Missing step count data often arise in these settings and can lead to bias or imprecision in the estimated effect if handled inappropriately. Replacing each missing value in accelerometer data with a single value using the Expectation-Maximization (EM) algorithm has been advocated in the literature, but it can lead to underestimation of variances and could seriously compromise study conclusions. We compare the performance, in terms of bias and variance, of two missing data methods, the EM algorithm and Multiple Imputation (MI), through a simulation study where data are generated from a parametric model to reflect characteristics of a trial on physical activity. We also conduct a reanalysis of the 2019 MOVE-IT trial. The EM algorithm leads to an underestimate of the variance of effects of interest, in both the simulation study and the reanalysis of the MOVE-IT trial. MI, which provides valid point and variance estimates, should be the preferred approach to handling missing data in accelerometer studies.

Wearable devices, such as pedometers and accelerometers, are becoming a popular tool in clinical and epidemiological studies for measuring participants' physical activity (Bravata et al., 2007). For example, accelerometers have been used to evaluate the impact of interventions aiming to increase exercise in a number of clinical trials (Harris et al., 2015, 2017, 2018; Ismail et al., 2019; Murray et al., 2006). These devices measure acceleration in three dimensions in very fine intervals of time, called epochs, which are then aggregated to obtain step counts on an hourly, daily, or weekly level. Compared to self-report approaches, measurements from these devices do not suffer from recall and desirability bias, and there is reduced participant burden (Ae ). However, missing step count data are a common issue in this setting. Participants may not wear the device as per protocol, and there may be entire days or parts of days where no step counts are recorded. There may also be technical issues, such as the battery running out or water damage to the device, leading to loss of information. If the analysis does not account for the missing data in an appropriate way, the resulting estimates may be biased or imprecise. Accelerometer data raise a number of broader missing data issues (Tackney et al., 2021), but we focus here on comparing the Expectation-Maximization (EM) algorithm and Multiple Imputation (MI) as methods for handling missing data. Analysis of data with missing values requires assumptions about the way in which data become missing: the missingness mechanism.
These mechanisms were categorized into three broad classes by Rubin (1976), which we describe in the accelerometer context:
• The missing completely at random (MCAR) assumption states that the probability that a step count is missing does not depend on the observed or unobserved data; for example, if a number of accelerometers become faulty by chance and stop recording data, the missingness mechanism is MCAR.
• The missing at random (MAR) assumption states that the probability that a step count is missing depends on the observed data, but not on the unobserved data; for example, if younger people are more likely to forget to wear the accelerometer, but their activity levels on days where they forget the device are similar to the activity levels of younger people on days where they wear the device, the missingness mechanism for step counts is MAR given age group.
• The missing not at random (MNAR) assumption states that the probability that a step count is missing depends on the unobserved data; this would occur, for example, if people decide not to wear the accelerometer on days where they are less active.
Here, we consider settings where daily step counts are collected, some of which are missing. The primary analysis model has step count as the outcome, and aims to compare step counts between groups. Typically, baseline step counts are accounted for in the model. We assume that the missing data mechanism is MAR. We note that in practice it is not possible to verify that the MAR assumption is met using the observed data; however, it is a natural assumption to conduct the primary analysis under. Sensitivity analysis is recommended to assess robustness of the analysis to violations of the MAR assumption; this is beyond the scope of this article. Our focus is on the statistical properties of the EM algorithm and MI for handling the missing data, in particular, bias and precision of the estimates.
There are various ways of dealing with missing data. First, maximum likelihood methods can handle missing outcome data for linear regression or mixed models (Snijders & Bosker, 2011), which give unbiased effect estimates and valid estimates of variances under the MAR assumption. However, maximum likelihood cannot readily handle missing values in both the outcome and covariates (Carpenter & Smuk, 2021), which is likely to occur in the accelerometer setting as baseline step counts are often incorporated as a covariate in the primary analysis model. This would lead to exclusion of participants with missing covariates, resulting in loss of information and potentially a reduction in statistical power. Thus, in the accelerometer setting, there are two common approaches to handling missing data: single imputation using the EM algorithm, and MI (Ae Borghese et al., 2019; Xu et al., 2018). The literature on the design and analysis of clinical trials cautions against the use of single imputation, as it can lead to underestimation of standard errors (SEs) (Dziura et al., 2013; Jakobsen et al., 2017). This has also been demonstrated in simulation studies using observational data (Avtar et al., 2019). In accelerometer studies, however, there has been some misunderstanding about the recommended approach to handling missing data. Using a simulation study, Catellier et al. (2005) compared the EM algorithm and MI in handling intermittent missing data, such as missing intervals within days or missing days within a week. They found that the estimates of mean step counts are similar in terms of bias and precision. Though they acknowledge that the EM algorithm can lead to underestimation of the variance estimates in general, the results from their simulation showing similar performance between the EM algorithm and MI have been used to justify the EM approach to imputation in other accelerometer studies.
In this study, we aim to illustrate the EM and MI approaches to handling missing data in the accelerometer setting and demonstrate their statistical properties. We carefully elucidate their performances in terms of the bias, variance, and confidence intervals (CIs) of the treatment effect in a simulation study of a simple trial set up. We then conduct a reanalysis of the 2019 MOVE-IT trial to compare the two approaches to imputation in a more complex setting, and discuss the implications.

EM Algorithm and MI
The EM algorithm is an approach to finding maximum likelihood estimates in the presence of missing data under the MAR assumption (Schafer, 1997). In the context of accelerometer outcomes, the algorithm can provide point predictions for average daily step counts, conditional on participant characteristics such as sex, age, and treatment arm. The missing daily step counts can then be imputed (replaced) by these point predictions from the EM algorithm. This results in a "complete" data set which can then be used for the primary analysis. In this analysis, all values in the "complete" data set are treated equally, regardless of whether the step count was actually observed or imputed using the EM algorithm. This may not be appropriate: the predictions of the missing values are more uncertain than the observed values, but this information is not used by the primary analysis model, which gives predictions for missing values the same status as observed values.
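To make the mechanics concrete, here is a minimal sketch, under our simplifying assumptions, of EM-style single imputation for a pair of step count variables where only the follow-up measurement has missing values (the function name em_impute and the bivariate-normal setup are ours, not taken from any published package). A full EM implementation would also carry the conditional variances of the missing values into the M-step; with only the follow-up missing, the imputed means nevertheless converge to the complete-case regression predictions.

```python
import numpy as np

def em_impute(y0, y1):
    """Single imputation of missing follow-up step counts (NaN in y1),
    assuming (y0, y1) are bivariate normal and y0 is fully observed.
    Each missing y1 is replaced by a single point prediction E[y1 | y0]."""
    miss = np.isnan(y1)
    y1 = np.where(miss, np.nanmean(y1), y1)  # crude starting values
    for _ in range(50):
        # M-step: re-estimate means and covariance from the completed data
        mu0, mu1 = y0.mean(), y1.mean()
        cov = np.cov(y0, y1)
        # E-step: refill each missing value with its conditional mean
        beta = cov[0, 1] / cov[0, 0]
        y1[miss] = mu1 + beta * (y0[miss] - mu0)
    return y1
```

At each iteration the parameters are re-estimated from the completed data and the missing entries are refilled; the resulting single "complete" data set is then passed to the primary analysis as if fully observed, which is exactly where the variance underestimation arises.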
Multiple imputation is an alternative approach to handling missing data under the MAR assumption, which accounts for the uncertainty due to the missing values. Given an imputation model, which in the accelerometer setting can be a joint model for average daily step counts conditional on characteristics such as sex, age, and treatment arm, MI creates M imputed data sets by replacing each missing value by M different plausible values generated from the imputation model. In each of the M imputed data sets, the imputed value is different, reflecting the uncertainty around the missing value. The imputed data sets are analyzed separately and the results of the M analyses are combined in a pooling step. The point estimates from the M data sets are averaged to get the pooled effect estimate, and the pooled estimate of the SE incorporates the variability within and between the M imputations (Rubin, 1976). Thus, MI gives imputed values a different status to observed values, and the uncertainty around the missing values is taken into account.
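The pooling step can be written down in a few lines; the following sketch (function name ours) implements Rubin's rules for a scalar estimate:

```python
import numpy as np

def pool(estimates, variances):
    """Pool M point estimates and their (squared-SE) variances from M
    imputed data sets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)   # one estimate per imputation
    u = np.asarray(variances, dtype=float)   # one variance per imputation
    m = len(q)
    q_bar = q.mean()                         # pooled point estimate
    w = u.mean()                             # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    total = w + (1 + 1 / m) * b              # pooled total variance
    return q_bar, total
```

The pooled SE is the square root of total; because the between-imputation component b grows with the amount of missing information, the pooled variance exceeds a single-imputation variance, which contains no such term.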
The two approaches are illustrated in Figure 1. Technical details of each procedure are provided in the Appendix.

Software
Both the EM algorithm and MI can be implemented in a wide range of statistical software by readily available packages and options.
• SPSS: Both single imputation using the EM algorithm and MI can be conducted (IBM Corp., 2020a, 2020b). • R: The package norm carries out the EM algorithm (Novo & Schafer, 2013). The package JOMO implements MI (Quartagno & Carpenter, 2020), and can be run with the interface mitml ( Grund et al., 2019), which provides tools for visualizing and analyzing multiple imputed data sets. A tutorial for JOMO and mitml is provided by Quartagno et al. (2019). MI can also be implemented in R using mice (van Buuren & Groothuis-Oudshoorn, 2011), which has an associated online vignette by Vink and van Buuren (n.d.) • Stata: The command mi impute mvn can be used to conduct both single imputation with the EM algorithm as well as MI. Furthermore, MI can be performed using the command mi impute chained (Statacorp, 2021). • SAS software: The procedures PROC MI and PROC MIA-NALYSE implement single imputation using the EM algorithm and MI (SAS Institute Inc., 2021). A tutorial is provided by Yuan (2011).

Simulation
We compare the performance of the EM algorithm and MI for handling missing data by simulating a simple randomized trial setting. We focus on the bias, variance, and CIs of the estimates of the treatment effect obtained under the two methods. In this simulation, we assume that participants provide an accelerometer step count at baseline. They are then randomized to either the treatment or control arm, and then provide a step count after 1 year.
The step counts at baseline are fully observed, but some step counts at Year 1 are MCAR. While this setup is simplistic, it will provide insight into the statistical properties of the two methods under more general MAR mechanisms. We generate step count data for this simulation through a parametric model, with parameter settings chosen to reflect characteristics of a trial on physical activity. We denote by y_{i,0} the step count for the ith patient at baseline. Assuming that there is just one observation per person, centered around a mean of 7,000 steps, we generate y_{i,0} as:

y_{i,0} = 7,000 + ε_i,

where ε_i is normally distributed with mean 0 and standard deviation 1,700. We note that, while daily step counts are large enough to be treated as continuous data, they are typically right-skewed, so a log-transformation may be necessary for linear regression. We assume for simplicity in this simulation that step counts are normally distributed. We assume that patients are randomized to one of two arms (treatment and control). The variable arm_i is an indicator variable for whether patient i received the treatment. We denote by y_{i,1} the step count for patient i postintervention, one year after baseline. The step counts at Year 1 are drawn from a normal distribution, conditional on arm, and are correlated with the step counts at baseline through the error terms:

y_{i,1} = 7,000 + 300 × arm_i + ν_i,

where ν_i is normally distributed with mean 0 and standard deviation 2,000, and ε_i and ν_i have a correlation of 0.6. The true treatment effect is thus 300, which is an effect that could realistically be observed in a trial for physical activity. We wish to test the null hypothesis that there is no effect of treatment, against the alternative hypothesis that there is an effect of treatment, with a Type I error of 5%. We generate n = 500 patients in each simulation. In this simple setting, we assume that there are no missing step counts at baseline, but there are missing step counts at Year 1.
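Under our reading of this data-generating model, one simulated data set can be produced as follows (a sketch; the Year 1 control-arm mean of 7,000 and the 1:1 randomization are illustrative assumptions, and the function name is ours):

```python
import numpy as np

def simulate_trial(n=500, effect=300, p_miss=0.3, seed=None):
    """Generate one simulated trial data set: baseline step counts y0,
    a treatment indicator, and Year 1 step counts y1 with MCAR missingness."""
    rng = np.random.default_rng(seed)
    # errors: sd 1,700 at baseline, sd 2,000 at Year 1, correlation 0.6
    cov = [[1700**2, 0.6 * 1700 * 2000],
           [0.6 * 1700 * 2000, 2000**2]]
    eps, nu = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    arm = rng.integers(0, 2, size=n)                    # 1:1 randomization (assumed)
    y0 = 7000 + eps                                     # baseline step count
    y1 = 7000 + effect * arm + nu                       # Year 1 step count
    y1 = np.where(rng.random(n) < p_miss, np.nan, y1)   # MCAR deletion
    return y0, arm, y1
```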
We explore scenarios with the proportion of missing data at Year 1 ranging from 0.1 to 0.9. Missing values are imputed using the EM algorithm and MI. The EM algorithm is implemented using the R package norm (Novo & Schafer, 2013). MI is conducted via joint modeling using the R packages JOMO (Quartagno & Carpenter, 2020) and mitml (Grund et al., 2019). We use 30 imputations. A small number of imputations, typically five or more, is sufficient for most applications (Carpenter & Kenward, 2013), but a larger number of imputations is needed for stable estimates of the standard error when the proportion of missing data is large (von Hippel, 2018). For each scenario, we simulate 2,000 data sets to ensure that we estimate the empirical SEs with a Monte Carlo SE of less than 2%. Full details of the implementation are provided in the Appendix. We then analyze the imputed data sets using a linear regression model with step count at Year 1 as the outcome, and step count at baseline and treatment as covariates. From this linear regression model, we estimate the treatment effect and its variance.
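After imputation, the primary analysis on each completed data set is an ordinary least squares fit. A self-contained sketch (function name ours) of extracting the treatment effect and its model-based variance:

```python
import numpy as np

def treatment_effect(y0, arm, y1):
    """Regress Year 1 step count on an intercept, baseline step count, and
    treatment arm; return the arm coefficient and its model-based variance."""
    n = len(y1)
    X = np.column_stack([np.ones(n), y0, arm])
    beta, *_ = np.linalg.lstsq(X, y1, rcond=None)
    resid = y1 - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])     # residual variance
    var_beta = sigma2 * np.linalg.inv(X.T @ X)    # covariance of coefficients
    return beta[2], var_beta[2, 2]                # treatment effect, variance
```

With MI, a function like this is applied to each of the M imputed data sets and the results are pooled; with the EM approach it is applied once to the singly imputed data set.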
We evaluate the two methods by considering the mean, variance, and 95% CI of the treatment effect. The mean of the treatment effect across the 2,000 simulations has expected value of 300 if the treatment effect estimate is unbiased. Furthermore, we expect the theoretical variance of the treatment effect to be similar to the empirical variance (the sample variance of the treatment effect across simulations). If the theoretical variance is underestimated by an approach, the corresponding CIs will be too narrow; conversely, if the theoretical variance is overestimated, the CIs will be too wide. Thus, we assess the performance of each approach by considering the following measures across the 2,000 replications: (a) Mean of the estimated treatment effect, which has an expected value of 300.
(b) Means of the theoretical variance and the empirical variance, which we expect to have similar values.
(c) Coverage, the proportion of 95% CIs which contain the true treatment effect (300), which we expect to be 0.95. The proportion of CIs that lie entirely below the true effect, and the proportion that lie entirely above it, should each be 0.025.
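Given the stack of estimates and SEs from the 2,000 replications, these measures can be computed directly; a sketch (the function name and the normal-approximation 1.96 CIs are our choices):

```python
import numpy as np

def performance(estimates, ses, truth=300):
    """Summarize simulation results: mean estimate, mean theoretical
    variance, empirical variance, and 95% CI coverage behavior."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(ses, dtype=float)
    lo, hi = est - 1.96 * se, est + 1.96 * se
    return {
        "mean_estimate": est.mean(),
        "mean_theoretical_var": (se**2).mean(),
        "empirical_var": est.var(ddof=1),
        "coverage": np.mean((lo <= truth) & (truth <= hi)),
        "prop_below_truth": np.mean(hi < truth),  # CI entirely below truth
        "prop_above_truth": np.mean(lo > truth),  # CI entirely above truth
    }
```

Under-coverage combined with a mean theoretical variance below the empirical variance is precisely the signature of single imputation that the simulation is designed to detect.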
In Figure 2, we see in the top panel that the estimates of the treatment effect are centered around the true value of 300 for both the MI and EM approaches; this is expected as the missing data mechanism is MCAR. We also observe that the variability of the estimates of the means increases as the proportion of missing data increases; more missing data lead to more uncertainty in the estimated treatment effect. In the middle panel, we observe that the means of the theoretical variances are very different for the two methods; while the means of the variances for MI increase as the proportion of missing data increases, the means of the variances for the EM algorithm remain constant. When we compare this to the plot of the means of the empirical variances in the bottom panel, we observe that, for MI, the theoretical variances are a reasonable estimate of the empirical variances, but for the EM algorithm, the theoretical variances are underestimating the empirical variances. The underestimate of the variances by the EM algorithm becomes increasingly large as the proportion of missing data increases.
In Figure 3, we see in the top panel that the proportion of CIs that contain the true treatment effect decreases as the proportion of missing data increases for the EM algorithm, while for MI, it appears to remain fairly constant. We also observe that the proportion of CIs that lie entirely below the true value of the treatment effect (middle panel) and the proportion that lie entirely above it (bottom panel) increase as the proportion of missing data increases for the EM algorithm, but stay constant for MI.
Overall, the simulation demonstrates that the EM algorithm underestimates the variance of the treatment effect, and the extent of underestimation increases as the proportion of missing data becomes larger. This leads to CIs that include the true treatment effect less than 95% of the time and provide a false sense of precision around the treatment effect estimate; consequently, the Type I error rate is inflated.
The result is illustrative of the implications in more complicated settings. For example, the same variance underestimation will occur if the missingness mechanism of the Year 1 step counts is MAR. The variance underestimation will also occur if additional variables, such as baseline step count, are MAR. If the missingness mechanism is MNAR, both approaches would lead to biased estimates, with the EM algorithm additionally underestimating the variance. Furthermore, if one is interested in modeling summaries of step counts, such as weekly averages, where there are intermittent days with missing data, the underestimation of the variance using the EM algorithm is also of concern.
Next, we illustrate the application of the EM algorithm and MI to the analysis of the MOVE-IT trial. Using real data, we explore a more complex setting where there are three treatment groups, and three time periods at which step counts are measured for each individual. We assume a MAR missingness mechanism and the primary analysis has weekly averaged step counts as the outcome.

Application to the MOVE-IT Trial
We compare the EM and MI approaches to imputation in the analysis of the 2019 MOVE-IT trial (Bayley et al., 2015; Ismail et al., 2019, 2020). The MOVE-IT trial investigated the effects of motivational interviewing and motivational group therapy in reducing weight and increasing physical activity for patients who are at high risk of cardiovascular disease (QRISK2 of 20% or higher; National Institute for Health and Care Excellence, 2015). The trial randomized patients between three arms: individual motivational interviewing (Arm 1), motivational group therapy (Arm 2), or usual care (Arm 3). Motivational interviewing and motivational group therapy consisted of 10 sessions over the course of a year. The participants recorded their daily physical activity with an ActiGraph GT3X accelerometer (ActiGraph) for a period of seven consecutive days on three occasions: baseline, Year 1, and Year 2. The trial provided insufficient evidence to recommend either intervention for reducing weight or increasing physical activity.
The outcome of interest is the average step count across a 7-day period (Ismail et al., 2019). Our analysis model is a mixed model with the average daily step count as the outcome. The covariates are year (Year 1 or Year 2), arm, the arm-year interaction, baseline average step count, the interaction between baseline average step count and year, gender, and age, and we have an unstructured covariance matrix. We wish to estimate the difference in average step count between individual therapy and usual care, and the difference in average step count between group therapy and usual care.
We wish to impute missing days for participants who provide at least one observed day during the trial; this means that there is some information from the participant from which information on the missing days can be recovered. Out of 1,742 patients who were randomized to a treatment, 25 did not provide any data in any of the three measurement periods, so they are excluded from this analysis. Some participants wore the device for longer than 7 days in a measurement period. Data from Days 1 to 7 are used for the analysis, unless participants provided insufficient data on the first day, in which case data from Days 2 to 8 are used instead. If a participant wore the device for less than 540 minutes in a day, that observation is considered missing (Ismail et al., 2020). Table 1 shows the percentage of the 1,717 participants who have missing data on each day at each year.
We impute the daily step counts under the assumption that the 21 step counts (for the 7 days at baseline, Year 1, and Year 2) are jointly normally distributed, dependent on gender, age, and treatment arm, and further assuming that the data are MAR. We impute the missing values separately within each arm, using the EM algorithm and using MI (M = 30 imputations). Both methods are implemented using the R package norm. Figure 4 displays the 95% CIs for the difference in average step count for each intervention compared with usual care at Year 1 and Year 2. While the point estimates for the difference between individual therapy and usual care are larger than those between group therapy and usual care within each year, and the point estimates for Year 1 are larger than those for Year 2, neither intervention shows an effect at the 5% significance level. Importantly, the CIs provided by the EM algorithm are narrower than those obtained by MI, consistent with the results of the simulation study. Detailed results are provided in the Appendix, which illustrate that the SEs of all effects are lower when using the EM algorithm compared with MI. These differences between the two methods are nontrivial; in our study, the lengths of the 95% CIs for the differences in average step count are between 11.7% and 13.7% smaller when the EM algorithm is used instead of MI. Such differences could potentially lead to different conclusions in other studies.

Discussion
While the theoretical advantages of using MI over single imputation are well known, how this plays out in practice is less clear, especially when the relatively sophisticated EM algorithm is used for single imputation. Therefore, although guidance on handling missing data for clinical trials (e.g., by Dziura et al., 2013, and Jakobsen et al., 2017) cautions against the use of single imputation, it is important to critically compare the two methods in a practically relevant context derived from a real clinical trial with accelerometer outcomes. In this paper, we therefore evaluated two approaches to handling missing data in accelerometer outcomes: single imputation of missing values using the EM algorithm (advocated by Catellier et al., 2005), and MI (Carpenter & Kenward, 2013; Rubin, 1976). Specifically, we compared the two approaches in a simulation study of a simple trial setting where the outcome is a daily step count and the data are MCAR. The results showed that the EM algorithm leads to a practically important underestimation of the variance of the treatment effect, and also reduced coverage probability; the extent of both issues increases with the proportion of missingness in the data.
We also compared the two approaches in the analysis of the MOVE-IT trial. In this more complex setting, the outcome is the average of seven consecutive days of step counts. Our analysis assumes that the data are MAR. Again, we found that the SEs of all effects are lower when using the EM algorithm compared with using MI; in consequence, using the EM algorithm can lead to an increase in Type I error. Similar results were found in an observational study of accelerometer outcomes (Avtar et al., 2019).
In applications, valid imputation of missing accelerometer outcome data requires careful consideration of a number of further issues. First, defining missingness for accelerometer outcomes is a complex task with no consensus. Second, analysis by MI typically benefits from the inclusion of carefully selected auxiliary variables, which must be good predictors of the missing accelerometer values. If they also predict the chance of those values being observed, they may correct for any bias (Carpenter & Kenward, 2013, p. 64). Inclusion of auxiliary variables can improve the plausibility of the MAR assumption. Third, analyses typically assume that the data are MAR; sensitivity analyses to explore the impact of deviations from this assumption on the results should be conducted (Carpenter & Smuk, 2021; Cro et al., 2020). For a practically grounded discussion of these issues, we refer readers to a framework for handling missing accelerometer data (Tackney et al., 2021).
In summary, our results, together with theoretical considerations, show that it is time to step away from the EM algorithm for missing step count data.

Appendix

Imputation Model for the MOVE-IT Analysis
Writing y_{i,t,d} for the step count of participant i on day d (d = 1, …, 7) of measurement occasion t (t = 0 for baseline, 1 for Year 1, 2 for Year 2), the joint imputation model is:

y_{i,0,1} = α_{01} + γ_{01} female_i + δ_{01} age_i + κ_{01} arm_i + e_{01,i},
…
y_{i,0,7} = α_{07} + γ_{07} female_i + δ_{07} age_i + κ_{07} arm_i + e_{07,i},
y_{i,1,1} = α_{11} + γ_{11} female_i + δ_{11} age_i + κ_{11} arm_i + e_{11,i},
…
y_{i,1,7} = α_{17} + γ_{17} female_i + δ_{17} age_i + κ_{17} arm_i + e_{17,i},
y_{i,2,1} = α_{21} + γ_{21} female_i + δ_{21} age_i + κ_{21} arm_i + e_{21,i},
…
y_{i,2,7} = α_{27} + γ_{27} female_i + δ_{27} age_i + κ_{27} arm_i + e_{27,i}.

Detailed Results of MOVE-IT Analysis
Table A1 provides point estimates and standard errors for the coefficients in the primary analysis model (Equation 1) using MI versus using the EM algorithm. We observe that the standard errors are lower when the EM algorithm is used, for all effects. Table A2 provides estimates of the variances of the residuals under both approaches; these estimates are similar under the two approaches, as expected.
Note. MI = multiple imputation; EM = expectation-maximization.