Unsupervised clustering is a machine learning method frequently applied to sequential or longitudinal data to provide clinical or biomedical insights. For example, clustering has been applied to physical activity data recorded by accelerometers (Dobbins & Rawassizadeh, 2018; Jones et al., 2021; Lim et al., 2019), where some studies may be interested in examining the relationship between physical activity and cardiovascular-related health outcomes. A key component of optimizing the performance of clustering algorithms is estimating the optimal number of clusters, *k* (Fu & Perry, 2020; Jones et al., 2021; Mirkin, 2011; Volkovich et al., 2011). Many common approaches, such as the Elbow method, rely on the within-cluster sum of squares, denoted here as *W*_{k}, which decreases as the number of clusters increases (Tibshirani & Walther, 2005), making it less straightforward to identify the optimal number. For the Elbow method, for example, there is no quantitative metric that identifies the “elbow,” or the point where there is a marked difference in *W*_{k}. In contrast, we were interested in treating the problem of estimating *k* as a classification problem. We therefore opted to focus on the *prediction strength* metric developed by Tibshirani and Walther (2005), which treats identifying the optimal number of clusters as a prediction problem: it finds the number of clusters that minimizes the prediction error of proposed clusters. Relying on cross-validation methods, this process involves splitting the data set into training and testing sets in the absence of true cluster labels. The algorithm iterates through a range of *k* clusters and evaluates how well the training data cluster centers “predict” the comembership pairs found from the separately *k*-clustered testing data. A *prediction strength* metric of 1 implies perfect prediction with *k*, whereas a metric of 0 implies poor prediction. The optimal number of clusters is chosen as the largest *k* whose *prediction strength* metric exceeds a given threshold.
Tibshirani and Walther (2005) used simulation studies to show that a threshold of 0.8–0.9 is appropriate for well-separated clusters.
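The procedure can be sketched compactly. The analyses in this article were conducted in R, so the following Python sketch is purely illustrative: it uses a toy one-dimensional *k*-means, and all function and variable names are our own rather than those of any package.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D k-means (illustrative only, not the paper's implementation)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: (p - centers[j]) ** 2)].append(p)
        # empty groups keep their previous center
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers

def assign(points, centers):
    return [min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points]

def prediction_strength(train, test, k):
    """Worst-case (over test clusters) proportion of test comembership pairs
    that the training centers also place in a common cluster."""
    train_centers = kmeans(train, k)
    test_labels = assign(test, kmeans(test, k))   # clusters found in test data
    cross_labels = assign(test, train_centers)    # test points via train centers
    strengths = []
    for c in range(k):
        members = [i for i, lab in enumerate(test_labels) if lab == c]
        n = len(members)
        if n < 2:
            continue  # singleton cluster: no pairs to check
        agree = sum(cross_labels[members[a]] == cross_labels[members[b]]
                    for a in range(n) for b in range(a + 1, n))
        strengths.append(agree / (n * (n - 1) / 2))
    return min(strengths) if strengths else 1.0

# Two well-separated 1-D groups, split into train and test halves
train = [0.0, 0.2, 0.4, 0.6, 10.0, 10.2, 10.4, 10.6]
test = [0.1, 0.3, 0.5, 0.7, 10.1, 10.3, 10.5, 10.7]
ps = {k: prediction_strength(train, test, k) for k in range(1, 5)}
```

For such well-separated data, the metric stays near 1 at the true *k* = 2 and tends to fall for larger *k*, which is what the threshold rule exploits.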

The application of *prediction strength* to find the optimal number of clusters for sequential and longitudinal data has not been described previously. Accelerometer recordings are a common type of time series data, which can be clustered to discover patterns of physical activity. This may be of interest, for example, to investigate whether specific patterns of activity affect health outcomes. For instance, it may be that those who exercise intensely in the evening versus in the morning have more favorable cardiovascular health. Our goal is to characterize the performance of *prediction strength* when identifying patterns of activity from accelerometer data and to generalize guidelines for its optimal application in similar types of data. For this purpose, we designed an extensive simulation study to examine the performance of *prediction strength* with different model parameters and noise scenarios. We also evaluated a preprocessing method of sorting the data by magnitude of intensity within windows throughout the time series prior to application of *prediction strength*, as others have explored in practice (Xiao et al., 2015). Our goals were threefold. First, we characterized how the recommended threshold cut-point performs in terms of finding the true number of clusters; second, we determined the potential benefit of sorting the raw accelerometer data by magnitude of intensity for finding the true number of clusters; last, we explored a graphical-based approach, Local Maxima, to determine the optimal number of clusters based on the *prediction strength* metric.

## Methods

We designed a simulation study to evaluate the effect of key input parameters in the *prediction strength* algorithm when using *k*-means clustering methods to cluster days of activity using accelerometer data on the monitor-independent movement summary (MIMS) scale.

Accelerometers provide objective measurements of physical activity in free-living environments. Many clinical studies, such as those that examine the relationship between physical activity and health outcomes, therefore incorporate data from accelerometers that record participants’ movement over multiple days (Füzéki et al., 2017; Migueles et al., 2017). Today’s accelerometers generate data that are high dimensional, with movement expressed as accelerations measured relative to Earth’s gravitational field in three orthogonal dimensions at sampling frequencies of 100 Hz or greater (Migueles et al., 2017).

Our simulation study relied on the resampling of existing accelerometer data from the Stanford GOALS study, a clinical study to evaluate a behavioral intervention to reduce obesity in high-risk children. As part of the study, children were asked to wear accelerometers for 1 week at baseline and again at 1, 2, and 3 years postrandomization (Robinson et al., 2021). Simulated data were created by stitching together blocks of 1-s epoch accelerometer data sampled from a pool of observed accelerometer data to create sequences of daily (24 hr) accelerometer activity. Each daily sequence is sampled so that it generates an a priori defined pattern of physical activity (e.g., active only in the morning, active only in the early evening, sedentary all day, etc.). Days generated with the same daily activity pattern are considered members of the same cluster. The following list gives a broad outline of the steps in our approach to evaluate our questions. We expand in greater detail in the subsections below.

1. Preprocessing and cleaning the Stanford GOALS accelerometer data
2. Formatting accelerometer data to create a pool of data for sampling
3. Defining true clusters of activity
4. Sampling from pool to generate data for days that correspond to clusters
5. Reformatting the sampled data in preparation of analysis

### Step 1: Preprocessing and Cleaning Source Data

First, we applied the mims_unit_from_files function from the MIMSunit package to convert raw accelerometer data from the Stanford GOALS study to 1-s epoch MIMS-unit data. Then, for simplicity, we removed accelerometer data from Daylight Savings Time conversion days (John et al., 2019).

### Step 2: Formatting Source Data and Creating a Sample Pool

We then pooled MIMS-unit data and applied a data reduction algorithm. This algorithm was defined by the following three parameters: (a) feature window width, *w*_{f} (e.g., 900 s); (b) feature signal, *s*_{f} (e.g., vector magnitude, *y*-axis, etc.); and (c) feature aggregation function *A*_{f}, with which to aggregate (e.g., mean, minimum, *SD*, etc.).

In our case, a feature window width of 900 s or 15 min was selected largely because we are interested in grouping similar patterns of physical activity throughout the entire day (e.g., a workout in the morning followed by sedentary behavior vs. sedentary all day except for a nighttime burst of activity), whereas patterns that describe finer deviations may not be deemed clinically meaningfully different (e.g., a workout at 8:00 a.m. vs. 8:04 a.m.). With all necessary parameters defined (*w*_{f} = 900 s, *s*_{f} = vector magnitude, *A*_{f} = mean), we divided our 1-s epoch MIMS-unit data, **X**, into *m* distinct feature windows, **X**_{1}, **X**_{2}, ..., **X**_{m}, where each window was composed of *w*_{f} contiguous epochs of data. For each feature window *k*, we calculated the aggregated value *y*_{k} = *A*_{f}(**X**_{k}, *w*_{f}, *s*_{f}). We then defined the vector **Y** = (*y*_{1}, *y*_{2}, ..., *y*_{m}) and the following functions.

Let *θ*(*y*_{k}) = **X**_{k}, which allows any *y*_{k} to be mapped back to **X**_{k}.

Let *P*_{y}(*y*_{k}) = *p*_{k}, which allows any *y*_{k} to be mapped to its percentile *p*_{k} in the empirical cumulative distribution of **Y**.
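To make the notation concrete, the aggregation and percentile mapping can be sketched as follows (an illustrative Python version, not the R/MIMSunit pipeline used in the study; `make_feature_vector` and `percentile_map` are hypothetical names):

```python
def make_feature_vector(x, w_f, a_f):
    """Split a 1-s epoch series x into m complete windows of w_f epochs and
    aggregate each with a_f; windows[k] plays the role of theta(y_k)."""
    m = len(x) // w_f
    windows = [x[i * w_f:(i + 1) * w_f] for i in range(m)]
    y = [a_f(w) for w in windows]          # y_k = A_f(X_k)
    return y, windows

def percentile_map(y):
    """P_y: percentile of each y_k in the empirical cumulative distribution of Y."""
    n = len(y)
    return {v: sum(u <= v for u in y) / n for v in y}

x = [1, 2, 3, 10, 10, 10]                  # toy 1-s epoch series (6 s)
y, windows = make_feature_vector(x, w_f=3, a_f=lambda w: sum(w) / len(w))
```

In the study itself, *w*_{f} = 900 s and *A*_{f} = mean, so each day of 86,400 epochs contributes 96 aggregated values to the pool.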

### Step 3: Defining True Clusters

While the general principles behind our approach allow for an infinite number of true clusters, we set the number of clusters for our simulation to six (Table 1). Furthermore, we defined clusters with the following structure. Clusters were composed of a single, 24-hr period, and the number of features, *n*_{f}, used to define a cluster was a function of the window width, *w*_{f} (in seconds). Given 86,400 s in a day, when *w*_{f} = 900 s, *n*_{f} = 86,400/900 = 96 feature windows.

Cluster Activity Periods Used in Each of the Scenarios

| Cluster name | Active period(s) (*v* = 0.7) | Inactive period(s) (*v* = 0.5 for CA scenarios and *v* = 0.1 for Sporadic Activity scenarios) |
| --- | --- | --- |
| Always Active | 0:00–23:59 | None |
| Always Sedentary | None | 0:00–23:59 |
| Early Morning Active | 0:00–5:59 | 6:00–23:59 |
| Morning Active | 6:00–11:59 | 0:00–5:59, 12:00–23:59 |
| Afternoon Active | 12:00–17:59 | 0:00–11:59, 18:00–23:59 |
| Evening Active | 18:00–23:59 | 0:00–17:59 |

Then, for each feature window *t* in a cluster *j*, we chose a percentile to define the median activity level for that feature window, *ν*_{jt}, and a percentile width, *p*_{jt}, to define a sampling range for that feature window. Cluster *j* is then defined in terms of feature window-specific percentile sampling ranges, (*ν*_{jt} − *p*_{jt}, *ν*_{jt} + *p*_{jt}), for *t* = 1, ..., *n*_{f}.

### Step 4: Sampling From Pool to Create One Day

A day belonging to Cluster *j* is simulated by moving iteratively through all feature windows, *t*, and sampling with equal probability a value from **Y**_{jt}, the subset of **Y** containing all *y* whose percentiles fall within the sampling range defined for that feature window. A simulated day is thus a sequence of *n*_{f} ordered, randomly sampled elements of **Y**. For example, if Cluster *j* is defined so that for the first feature window, *t* = 1, the median target percentile is *v*_{j1} = 0.6 and the noise parameter is *p*_{j1} = 0.2, then **Y**_{j1} contains all *y* that fall between the 40th and 80th percentiles of **Y** (see Table 1 for a description of the clusters considered and the “Design of Simulation Scenarios” section below for more details). Each simulated day belonging to Cluster *j* can then be constructed by applying *θ* to each sampled value and concatenating the resulting *w*_{f}-length segments of 1-s epoch MIMS-unit data.
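Under the definitions above, the per-window sampling can be sketched as follows (illustrative Python with a toy pool; `simulate_day` is a hypothetical name, and in the study the pool **Y** comes from the GOALS data):

```python
import random

def simulate_day(y_pool, targets, widths, seed=0):
    """Draw one simulated day: for each feature window t, sample uniformly
    from the values of Y whose empirical percentile lies within
    (v_t - p_t, v_t + p_t), clipped to [0, 1]."""
    rng = random.Random(seed)
    sorted_y = sorted(y_pool)
    n = len(sorted_y)
    day = []
    for v, p in zip(targets, widths):
        lo, hi = max(v - p, 0.0), min(v + p, 1.0)
        pool = [u for i, u in enumerate(sorted_y) if lo <= (i + 1) / n <= hi]
        day.append(rng.choice(pool))
    return day

y_pool = list(range(1, 101))   # toy pool: the percentile of value i is i/100
day = simulate_day(y_pool, targets=[0.6] * 4, widths=[0.2] * 4, seed=1)
```

With *v* = 0.6 and *p* = 0.2, every sampled value lies between the 40th and 80th percentiles of the pool, mirroring the worked example in the text.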

### Step 5: Reformatting the Sampled Data in Preparation of Analysis

Our data generation approach produces data in the second-epoch MIMS-unit format, but our clusters are defined in terms of aggregate feature windows. To prepare the simulated data for use in the evaluation of our analytic strategies, we repeated the data aggregation procedure used to prepare the observed MIMS-unit data for the sampling pool using the same feature window width, signal, and aggregation function. Once aggregated feature windows were created, the data were transformed into wide format, with a row comprising the data for a subject’s day.
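A minimal sketch of this long-to-wide reshaping (illustrative Python; the study’s pipeline is in R, and the field names here are hypothetical):

```python
def to_wide(records):
    """records: (subject, day, window_index, value) tuples in long format.
    Returns one ordered row of feature-window values per subject-day."""
    rows = {}
    for subj, day, t, val in records:
        rows.setdefault((subj, day), {})[t] = val
    n_windows = 1 + max(t for _, _, t, _ in records)
    return {key: [vals[t] for t in range(n_windows)]
            for key, vals in rows.items()}

recs = [("s1", "d1", 0, 1.0), ("s1", "d1", 1, 2.0),
        ("s2", "d1", 0, 3.0), ("s2", "d1", 1, 4.0)]
wide = to_wide(recs)
```

Each row of the resulting structure is one subject-day, which is the unit that *k*-means clusters in the analyses below.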

### Design of Simulation Scenarios

To explore the performance of the *prediction strength* algorithm, we simulated accelerometer data under 28 different scenarios. For each scenario, we generated 100 data sets, each with 50 subjects, where each subject had 5 days of data. Second-epoch MIMS-unit GOALS accelerometer data were used to create our sampling pool. We used *w*_{f} = 900 s (15 min), *s*_{f} = vector magnitude, *A*_{f} = mean, and *n*_{f} = 96 feature windows.

Within each of our scenarios, we had the same six clusters of physical activity patterns for a day (see Table 1; e.g., Always Active, Always Sedentary, Morning Active, etc.). The clusters differed by whether physical activity occurs (Always Active vs. Always Sedentary), and the time interval across which physical activity could occur (e.g., morning or afternoon or throughout the entire day; Table 1). Using these cluster definitions, we varied the following across scenarios: (a) within-cluster noisiness (i.e., how similar the clusters were to one another) and (b) the extent to which the physical activity occurred during active periods (Constant vs. Sporadic).

In Constant Activity (CA) scenarios, simulated data have high activity continuously in all active periods and constant low activity in all inactive periods. In contrast, in the Sporadic Activity (SA) scenarios, simulated data show activity for two randomly selected 15-min windows in active periods and constant low activity in inactive periods for that specific scenario. For example, for subjects in Cluster 3 (Early Morning Active) for CA, observed physical activity from 0:00 to 5:59 will be constant and high, and constant and low from 6:00 to 23:59. For SA, a subject in Cluster 3 (Early Morning Active) would have two 15-min periods of high activity randomly situated between 0:00 and 5:59; outside of these two 15-min periods, this subject would have constant low activity. The magnitude of active and inactive periods also differed by CA and SA scenarios. In CA, the median target percentiles (*v*) were 0.7 and 0.5 for active and inactive feature windows, respectively. In SA, the median target percentiles (*v*) were 0.7 and 0.1 for active and inactive feature windows, respectively.

We controlled cluster noisiness via the percentile width, *p*_{jt}. As *p*_{jt} increases, the range of physical activity available for sampling increases and thus the variance in physical activity observed increases. This affects the within-cluster variability of each cluster. We varied this parameter across the following 14 values shown in Table 2: 0.16, 0.165, 0.17, 0.175, 0.18, 0.185, 0.19, 0.195, 0.20, 0.22, 0.24, 0.26, 0.28, and 0.3. Figure 2a and 2b show average intensities of a simulated data set with *p*_{jt} at 0.16 and 0.3 under the CA scenarios.

Description of 28 Data Generation Scenarios (*n* = 96 Feature Windows per day; Median Target Percentiles [*v*] for Active Periods is 0.7; and Median Target Percentiles [*v*] for Inactive Periods Varies by Whether Data are Generated Assuming CA or SA)

| Label | Noise level (*p*) | Feature windows per day (*N*) | Active periods (*v*) | Inactive periods (*v*) |
| --- | --- | --- | --- | --- |
| **CA—Strategy Set A** | | | | |
| Scenario 1_CA | .16 | 96 | 0.7 | 0.5 |
| Scenario 2_CA | .165 | 96 | 0.7 | 0.5 |
| Scenario 3_CA | .17 | 96 | 0.7 | 0.5 |
| Scenario 4_CA | .175 | 96 | 0.7 | 0.5 |
| Scenario 5_CA | .18 | 96 | 0.7 | 0.5 |
| Scenario 6_CA | .185 | 96 | 0.7 | 0.5 |
| Scenario 7_CA | .19 | 96 | 0.7 | 0.5 |
| Scenario 8_CA | .195 | 96 | 0.7 | 0.5 |
| Scenario 9_CA | .20 | 96 | 0.7 | 0.5 |
| Scenario 10_CA | .22 | 96 | 0.7 | 0.5 |
| Scenario 11_CA | .24 | 96 | 0.7 | 0.5 |
| Scenario 12_CA | .26 | 96 | 0.7 | 0.5 |
| Scenario 13_CA | .28 | 96 | 0.7 | 0.5 |
| Scenario 14_CA | .30 | 96 | 0.7 | 0.5 |
| **SA—Strategy Set B** | | | | |
| Scenario 1_SA | .16 | 96 | 0.7 | 0.1 |
| Scenario 2_SA | .165 | 96 | 0.7 | 0.1 |
| Scenario 3_SA | .17 | 96 | 0.7 | 0.1 |
| Scenario 4_SA | .175 | 96 | 0.7 | 0.1 |
| Scenario 5_SA | .18 | 96 | 0.7 | 0.1 |
| Scenario 6_SA | .185 | 96 | 0.7 | 0.1 |
| Scenario 7_SA | .19 | 96 | 0.7 | 0.1 |
| Scenario 8_SA | .195 | 96 | 0.7 | 0.1 |
| Scenario 9_SA | .20 | 96 | 0.7 | 0.1 |
| Scenario 10_SA | .22 | 96 | 0.7 | 0.1 |
| Scenario 11_SA | .24 | 96 | 0.7 | 0.1 |
| Scenario 12_SA | .26 | 96 | 0.7 | 0.1 |
| Scenario 13_SA | .28 | 96 | 0.7 | 0.1 |
| Scenario 14_SA | .30 | 96 | 0.7 | 0.1 |

*Note*. CA = Constant Activity; SA = Sporadic Activity.

Figure 2—(a) Average intensities of one simulated accelerometer data set with a noise level of 0.16. (b) Average intensities of one simulated accelerometer data set with a noise level of 0.3.

Citation: Journal for the Measurement of Physical Behaviour 6, 2; 10.1123/jmpb.2022-0049


### Clustering Strategies: Sorting and Cut-Points

We wanted to understand whether sorting the data by its magnitude (highest to lowest) within prespecified time frames prior to clustering would facilitate better results. Such techniques have been utilized previously in understanding patterns in physical behavior (e.g., Xiao et al., 2015). For example, if an individual is active from 8:00 to 8:30 a.m. and sedentary otherwise, and another individual is active from 8:30 to 9:00 a.m. and sedentary otherwise, then we may consider these days as part of the same cluster (morning activity only). Clustering algorithms, however, can be sensitive to slight shifts in the data. To avoid discovery of separate clusters that may meaningfully belong to the same cluster from a behavioral or clinical standpoint, while still maintaining key temporal features like morning versus afternoon versus evening behavior, we can sort the data by its magnitude within time frames prior to clustering. In the example above, the morning period for both scenarios would look much the same; there would be 30 min of highest intensity displayed first, followed by the lower intensity values displayed for the length of time they were observed. More generally, within a prespecified time frame, the data would be sorted by highest to lowest level of magnitude. Further, the time frames themselves would be presented in their order of time, so that a time series of sorts remains. For example, if frames are defined as morning, afternoon, and evening, the data would be ordered as such: morning followed by afternoon followed by evening, where the data within the morning frame represent the highest to lowest level of magnitude measured for the length of time they were observed. We considered three different sorting strategies prior to clustering the data: (a) unsorted data (original, observed time series fully retained); (b) sorting the data in 4-hr windows (4 hr); and (c) sorting the data in 6-hr windows (6 hr).
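The within-frame sorting can be sketched as follows (illustrative Python; with 15-min feature windows, a 4-hr frame spans 16 values and a 6-hr frame spans 24, though a shorter toy frame is used here):

```python
def sort_within_frames(day, frame_len):
    """Sort values from highest to lowest inside each fixed time frame while
    keeping the frames themselves in chronological order."""
    frames = [day[i:i + frame_len] for i in range(0, len(day), frame_len)]
    return [v for frame in frames for v in sorted(frame, reverse=True)]

# Two toy days with a single activity burst at different times within the
# same frame, followed by an identical second frame
a = [0, 0, 5, 0, 1, 1, 1, 1]
b = [0, 5, 0, 0, 1, 1, 1, 1]
```

After sorting with `frame_len = 4`, both toy days become identical, so a small shift in the timing of a burst no longer splits otherwise similar days into separate clusters.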

For a given clustering algorithm, the *prediction strength* approach allows users to specify the cut-point value used to select the optimal number of clusters using the *prediction strength* metric. We varied the cut-point value from 0.3 to 0.95 in increments of 0.05. Note that in the description of the *prediction strength* metric, Tibshirani and Walther (2005) recommend that the user rely on cut-points between 0.8 and 0.9 for cluster prediction.

For each of the 28 scenarios described in Table 2, we applied *k*-means clustering using the *prediction strength* algorithm for 14 different cut-point values.

For convenience and interpretability, the performance of the *prediction strength* algorithm was evaluated and visualized within the following strategies.

1. Strategy Set A1: Unsorted data with CA
2. Strategy Set A2: Sorted data with CA using 4-hr window
3. Strategy Set A3: Sorted data with CA using 6-hr window
4. Strategy Set B1: Unsorted data with SA
5. Strategy Set B2: Sorted data with SA using 4-hr window
6. Strategy Set B3: Sorted data with SA using 6-hr window

### Estimating Clusters Using *Prediction Strength*

Recall that *prediction strength* estimates the number of clusters as the largest *k* at which the metric exceeds the cut-point, and that the cut-point is defined by the user; Tibshirani and Walther (2005) recommend a cut-point between 0.8 and 0.9. We refer to the approach of prespecifying a cut-point as the Threshold Approach, and when using a cut-point between 0.8 and 0.9, we call this the Recommended Threshold Approach. We developed another approach, which we refer to as the Local Maxima Approach, that involves identifying the smallest value of *k* at which *prediction strength* attains a local maximum. More specifically, as the *prediction strength* algorithm iterates through different values of *k*, we identify the first value of *k* that corresponds to a local maximum of the *prediction strength* metric as the estimated number of clusters. To evaluate whether a given point is a local maximum, we define windows around that point comprised of *m* neighboring points on either side. A maximum exists if the neighboring *m* points flanking the value are smaller. For this study, we used *m* = 2 to find the local maxima. If no local maximum existed, we set *k* = 1. The first local maximum is used if multiple local maxima are detected. As a caveat, setting *m* = 2 prohibits the consideration of *k* = 2, because the *prediction strength* metric always equals 1 (the maximum value) at *k* = 1. Therefore, we set the estimated number of clusters to *k* = 2 whenever the *prediction strength* at *k* = 2 is greater than the *prediction strength* at *k* = 3.
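The selection rule can be sketched as follows (our own Python rendering of the description above, including the special handling of *k* = 2; `local_maxima_k` is a hypothetical name):

```python
def local_maxima_k(ps, m=2):
    """ps: dict mapping k to its prediction strength metric. Return the first
    k whose metric exceeds all available values within m neighbors on either
    side, with the special rule for k = 2; default to k = 1 if no maximum."""
    # special rule: ps(1) is always 1, so k = 2 is chosen when ps(2) > ps(3)
    if 2 in ps and 3 in ps and ps[2] > ps[3]:
        return 2
    for k in sorted(ps):
        if k <= 2:
            continue
        neighbors = [ps[k + d] for d in range(-m, m + 1)
                     if d != 0 and (k + d) in ps]
        if neighbors and all(ps[k] > v for v in neighbors):
            return k
    return 1  # no local maximum found

# toy prediction strength curves over k = 1..7
curve = {1: 1.0, 2: 0.55, 3: 0.6, 4: 0.9, 5: 0.7, 6: 0.65, 7: 0.6}
flat = {1: 1.0, 2: 0.3, 3: 0.4, 4: 0.35, 5: 0.3, 6: 0.25}
```

For `curve`, the first interior peak is at *k* = 4; for `flat`, no qualifying maximum exists, so the rule falls back to *k* = 1.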

Both competing methods—the Recommended Threshold Approach and the Local Maxima Approach—were evaluated and compared using the metrics described below.

### Evaluation Metric

We defined prediction error to be the squared distance between the true and estimated number of clusters. We calculated the prediction error for the two competing methods as well as for every value of *k*, where *k* is smaller than or equal to 15. Specifically, for each simulated scenario and possible cut-point, the squared distance was calculated between the true number of clusters (*k* = 6) and the number of clusters selected by the *prediction strength* algorithm at that cut-point. For every data generation scenario and clustering strategy, the optimal cut-point threshold was determined by finding the cut-point threshold that resulted in the lowest squared distance. The same squared distance error was calculated for the Local Maxima Approach.
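The evaluation metric can be written compactly (illustrative Python; names are our own):

```python
def squared_distance(true_k, estimates):
    """Mean squared distance between the true and estimated numbers of clusters."""
    return sum((k - true_k) ** 2 for k in estimates) / len(estimates)

def best_cutpoint(true_k, estimates_by_cutpoint):
    """Pick the cut-point whose estimates minimize the squared distance."""
    return min(estimates_by_cutpoint,
               key=lambda c: squared_distance(true_k, estimates_by_cutpoint[c]))

# toy estimates over simulated data sets for two candidate cut-points
estimates = {0.5: [6, 6, 6], 0.8: [4, 4, 4]}
```

Here the cut-point of 0.5 recovers the true *k* = 6 exactly and would be selected as optimal.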

Code to apply the *prediction strength* algorithm, along with simulated data sets, is available here: https://purl.stanford.edu/cg185zq8485. All coding and analyses were conducted using R 3.5.1 with the following packages: MIMSunit, digest, purrr, dplyr, reshape2, lubridate, stringr, fpc, cluster, doSNOW, ggplot2, and zip.

No institutional review board or consent process was needed for this project.

## Results

Figure 3a shows the number of clusters (averaged over 100 data sets within a scenario) predicted for a given cut-point as a function of the noise of the scenario. We found considerable variance in the optimal *k* determined by the *prediction strength* algorithm as the noise in the data increased. Choice of cut-point threshold therefore had a large impact on the performance of the *prediction strength* algorithm. For example, when the cut-point was 0.3 and the data were unsorted and generated under CA, where periods with activity have consistent levels of activity (Strategy Set A1), the average optimal *k* was greater than eight for less noisy scenarios and decreased to around five for noisier scenarios. When the cut-point was larger (0.85), we observed a range of *k* from 6 for the less noisy scenarios to 1 for the noisier scenarios. For scenarios that were less noisy (noise level < 0.18), a cut-point of 0.8 (Recommended Threshold Approach) performed well for Strategy Set A1 (Figure 3a). However, for noisier data sets under Strategy Set A1, the number of clusters was largely underestimated by the Recommended Threshold Approach. When the data were noisier, a lower cut-point (0.5–0.65 vs. 0.8 or greater) had superior performance. In contrast, in Strategy Set B1, where the data are unsorted and generated under SA such that periods with activity have sporadic levels of activity (Figure 3a; Strategy Set B1), lower cut-points (0.3–0.5) overestimated the number of clusters for all noise levels. For cut-points between 0.55 and 0.65, the predicted number of clusters was close to the truth for scenarios with less noise. For the noisiest unsorted scenarios, the Recommended Threshold Approach performed well; however, across noise levels, it largely resulted in an underestimate of the number of clusters in Strategy Set B1.

Figure 3—(a) The distribution of number of clusters averaged within each scenario. The black horizontal line indicates the true number of clusters (*k* = 6). (b) Squared distance to the true cluster number (*k* = 6) by cut-point used in the *prediction strength* algorithm across scenarios generated with various noise levels. Recommended cut-point threshold highlighted in shaded area. (c) The histogram of the cut-points that achieve minimum squared distance to the true cluster. If multiple cut-points achieved the lowest squared distance, then the mean is depicted. Recommended cut-point threshold highlighted in shaded area.

Citation: Journal for the Measurement of Physical Behaviour 6, 2; 10.1123/jmpb.2022-0049


Across strategy sets where different sorting approaches are taken, the distribution of predicted *k* also varied at a given noise level (Figure 3a). For example, under Strategy Set A, we observed that sorting the data prior to clustering yielded more consistent and accurate predictions of *k* that were less dependent upon the cut-point used and varied less as the noise level increased (Figure 3a; Strategy Sets A2 and A3). Note that if the data were sorted, the Recommended Threshold Approach performed well, although cut-points between 0.6 and 0.8 were ideal. Sorting alleviated issues with overestimation at the lower cut-points for Strategy Set B as well (Figure 3a; Strategy Sets B2 and B3), but a similar pattern of underestimation to that observed with the unsorted data (Strategy Set B1) occurred for the higher cut-points of 0.75 and greater (Figure 3a; Strategy Sets B2 and B3). Thus, the Recommended Threshold Approach did not perform well in these cases.

Figure 3b shows the squared differences between the predicted and true numbers of clusters. Under CA (Strategy Sets A1–A3), the sorting strategies yielded less variation in *k* across noise levels (Strategy Sets A2 and A3), whereas Strategy Set A1 exhibited considerably larger variation. The same pattern held across sorting strategies under SA, although more variation was observed overall in the SA strategies. This again shows that sorting helps “stabilize” the *prediction strength* algorithm when processing accelerometer data.

Figure 3c is a histogram of the cut-points that achieve the lowest squared distance between the predicted *k* and the true *k*. From the figure, we observed that the Recommended Threshold Approach (cut-points of 0.8–0.9) performed well in only some of the Strategy Sets. Specifically, in Strategy Sets A2 and A3 (CA, sorted), cut-points that ranged from 0.65 to 0.85 performed well. In the other Strategy Sets, however, a lower cut-point performed better. For instance, in the SA scenarios with sorted data, the optimal cut-point was determined to lie around 0.4 (Figure 3c; Strategy Set B3).

In Figure 4, we show the average *prediction strength* by *k* and noise level within each strategy set. This graph also depicts where the local maxima occur (on average) and motivates the competing Local Maxima Approach. The Local Maxima Approach performed consistently well (on average) within each strategy set, with the exception of Strategy Set B1 (unsorted data under SA), where no maxima were found for *k* > 1 (Figure 4; Strategy Set B1).

Figure 4—The average *prediction strength* metric by number of clusters (*k*) across scenarios generated with different noise levels. The number of true clusters is indicated by the vertical reference line (*k* = 6), and the lower bound of the recommended cut-point is indicated by the horizontal reference line (0.8). The optimal *k* values found by the local maxima method are highlighted as black points. For Strategy Set B1, no optimal *k* was found by the local maxima method.

Citation: Journal for the Measurement of Physical Behaviour 6, 2; 10.1123/jmpb.2022-0049


Figure 5 shows the comparison between the Recommended Threshold and Local Maxima Approaches. The Recommended Threshold Approach and the Local Maxima Approach performed comparably for noisier scenarios under CA when the data were not sorted (Strategy Set A1). For scenarios with noise levels of 0.24 or less, however, the Local Maxima Approach was superior. Sorting largely yielded comparable performance between the two approaches under CA. Under SA, when the data were not sorted, the Local Maxima Approach performed poorly, underestimating the number of clusters across all noise levels, and worse than the Recommended Threshold Approach, which underestimated the number of clusters in less noisy scenarios and overestimated it in noisier scenarios. Sorting under SA yielded much improved performance for the Local Maxima Approach, which outperformed the Recommended Threshold Approach.

Figure 5—Comparison of the squared distances of the predicted optimal *k* to the true *k* between those obtained via the Recommended Threshold Method and those via the Local Maximum Method. The shaded area indicates where the Local Maximum Method performed better than the Recommended Threshold Method.

Citation: Journal for the Measurement of Physical Behaviour 6, 2; 10.1123/jmpb.2022-0049


## Discussion

This is the first study to evaluate the application of the *prediction strength* algorithm to clustering activity patterns from accelerometer data. Identifying different types of activity patterns (e.g., intense exercise in the morning vs. the evening vs. never) may provide insight into how the timing of activity affects clinical outcomes such as cardiovascular health. When the data were not sorted, performance of the *prediction strength* algorithm was sensitive to the choice of user-defined cut-point. The Recommended Threshold Approach worked well when the clusters were less noisy and under the CA scenarios. For noisier clusters, however, the Recommended Threshold Approach largely underestimated the number of clusters, with the optimal cut-point being lower than the recommended one. Sorting alleviated many of the issues surrounding sensitivity to the choice of cut-point that arose due to noise in the CA scenarios, whether the sorting matched the data generation (Strategy Set A3) or not (Strategy Set A2). While we are not the first to consider sorting prior to clustering accelerometer data, as Xiao et al. (2015) demonstrated in their work on functional clustering, our results provide further evidence for the utility of sorting accelerometer data prior to clustering.
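The core computation behind the *prediction strength* metric can be sketched in a few lines. The pure-Python sketch below (function names and the nearest-center assignment rule are our illustrative choices, not the authors' released code) evaluates the metric for a single value of *k*, given cluster centers fit on the training split and a separate *k*-clustering of the test split:

```python
from itertools import combinations

def nearest_center(x, centers):
    # Index of the closest training center (squared Euclidean distance).
    return min(range(len(centers)),
               key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j])))

def prediction_strength(test_points, test_labels, train_centers):
    """Prediction strength for one value of k (Tibshirani & Walther, 2005).

    For each test cluster, compute the fraction of its within-cluster pairs
    that are also co-assigned to the same *training* center; return the
    minimum over test clusters, i.e., the worst-predicted cluster.
    """
    assigned = [nearest_center(x, train_centers) for x in test_points]
    strengths = []
    for c in set(test_labels):
        idx = [i for i, lab in enumerate(test_labels) if lab == c]
        pairs = list(combinations(idx, 2))
        if not pairs:
            continue  # singleton clusters contribute no comembership pairs
        kept = sum(assigned[i] == assigned[j] for i, j in pairs)
        strengths.append(kept / len(pairs))
    return min(strengths) if strengths else 1.0
```

Because the metric is the comembership agreement of the worst-predicted test cluster, a single poorly predicted cluster drives the value toward 0.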

Under the SA scenarios, evaluated under Strategy Set B, when the data were unsorted the *prediction strength* algorithm generally did not perform well under either approach considered; lower cut-points overestimated the number of clusters, and the Recommended Threshold Approach largely underestimated it. However, while we deem Strategy Set B to comprise scenarios with six meaningfully different types of activity patterns, it makes intuitive sense that more than six clusters may be discovered using *k*-means clustering. For example, we defined the "Morning Active" cluster as having two half-hour bouts of activity occurring sometime between 6 a.m. and noon. This definition therefore includes both a day with two high-intensity activity periods in close succession (e.g., at 8 a.m. and then at 8:45 a.m.) and a day with periods farther apart (at 8 a.m. and then at 11 a.m.). While we may consider these 2 days as generated by the same cluster (sporadic morning activity only), a clustering algorithm may place them in separate clusters (early morning activity vs. early and late morning activity).

As in the CA scenarios, sorting alleviated the performance issues observed with the unsorted data for the SA scenarios. In the sorted strategies (Strategy Sets B2 and B3), we observed superior performance of lower cut-points relative to higher cut-points, and the algorithm generally performed more favorably for less noisy scenarios. The number of clusters was largely underestimated for cut-points of 0.75 or greater (which includes those under the Recommended Threshold Approach).
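As a concrete illustration of what sorting entails, the sketch below sorts activity values within consecutive windows of the day before clustering. The window length and the descending order are illustrative assumptions, not necessarily the exact scheme used in the simulations:

```python
def sort_within_windows(day, window_len):
    """Sort activity values within consecutive windows of a day.

    Sorting each window (here, descending) retains how much activity
    occurred in each period of the day while discarding the exact
    within-window ordering, which can reduce noise before clustering.
    """
    out = []
    for start in range(0, len(day), window_len):
        window = day[start:start + window_len]
        out.extend(sorted(window, reverse=True))
    return out
```

For example, a day of minute-level counts split into morning/afternoon/evening windows keeps each period's activity volume intact while aligning days that differ only in the exact minute activity occurred.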

The comparison between the two competing approaches was nuanced but yields the following recommendations. When the data are sorted, we would recommend the Local Maxima Approach. In our case, sorting the data by periods throughout the day makes clinical sense; it retains pertinent information about timing of activity within each period of the day. Depending on the research question, however, sorting may remove temporal features that are critically important. When the data are not sorted, the Local Maxima Approach outperformed the Recommended Threshold Approach for data generated under CA, but performed poorly under SA. Given these findings, we recommend the Local Maxima Approach coupled with sensitivity analyses that include sorting the data when doing so is meaningful. Furthermore, graphical views of the *prediction strength* metric by values of *k* may provide additional insight. For example, in the first panel depicted in Figure 4, for the scenario with noise level of 0.24, we observe a nonmonotonic function with a modest peak at *k* = 6 that does not meet the cut-point of 0.8–0.9 (i.e., *k* = 6 would not have been chosen under the Recommended Threshold Approach). While the *prediction strength* value is low (close to 0.5), graphical inspection may indicate that clusters defined with *k* = 6 should be considered in a downstream analysis of the data.
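The two selection rules can be made concrete as follows. This sketch reflects our reading of the approaches; the exact tie-breaking choices (e.g., preferring the largest qualifying *k*) are assumptions rather than the authors' precise rules:

```python
def choose_k_threshold(ps_by_k, threshold=0.8):
    """Recommended Threshold Approach: largest k whose prediction
    strength exceeds the chosen cut-point (None if no k qualifies)."""
    winners = [k for k, ps in ps_by_k.items() if ps > threshold]
    return max(winners) if winners else None

def choose_k_local_max(ps_by_k):
    """Local Maxima Approach (as we read it): pick an interior local
    maximum of the prediction strength curve, preferring the largest
    such k; fall back to the global maximum if none exists."""
    ks = sorted(ps_by_k)
    local = [k for i, k in enumerate(ks[1:-1], start=1)
             if ps_by_k[ks[i - 1]] < ps_by_k[k] >= ps_by_k[ks[i + 1]]]
    if local:
        return max(local)
    return max(ks, key=lambda k: ps_by_k[k])
```

Under a curve like the Figure 4 example (a modest interior peak at *k* = 6 near 0.5), the threshold rule falls back to a small *k* while the local-maximum rule recovers *k* = 6.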

There were several strengths to our study. Our simulations relied on observed accelerometer data, providing scenarios representative of real activity levels. We generated a wide variety of realistic scenarios by varying noise levels (i.e., clusters that were easy to distinguish vs. more nuanced) as well as the timing and duration of activity. There were also limitations. We considered only *k*-means clustering; our findings could differ with other clustering approaches such as *k*-medoids. We also considered only the squared distance metric to evaluate performance, and other metrics such as Normalized Mutual Information may provide a different viewpoint. We initially explored Normalized Mutual Information as a secondary metric but found that the results were concordant with those for the squared distance metric. Given that we were interested in evaluating methods for recapturing specific clusters, we believed squared distance was the most appropriate metric. Additionally, while we identified sorting as a robust strategy for improving clustering performance, we found that the choice of window length for sorting can affect the final clustering performance. Thus, additional studies exploring the optimal sorting window size are warranted.
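For reference, the evaluation metric is simply the squared difference between the predicted and true number of clusters, averaged over simulation replicates; the helper below is an illustrative sketch (the name and aggregation are our choices):

```python
def mean_squared_distance(predicted_ks, true_k):
    """Average squared distance between predicted and true numbers of
    clusters across simulation replicates; lower is better."""
    return sum((k - true_k) ** 2 for k in predicted_ks) / len(predicted_ks)
```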

The *prediction strength* metric is an intuitive and useful tool for estimating the number of clusters, but we observed sensitivity to the user-specified cut-point. For accelerometer data, we recommend that the Local Maxima Approach be used together with graphical evaluation of the *prediction strength* metric as a function of *k*. Furthermore, we strongly urge consideration of sorting the data prior to clustering if the sorting retains meaning for the research question at hand. Sensitivity analyses that vary the sorting strategy and cut-point may provide additional insight. We provide code to ease adoption of these methods.

## Acknowledgment

This manuscript is supported by the following National Institutes of Health funding source: R01LM013355.

## References

Dobbins, C., & Rawassizadeh, R. (2018). Towards clustering of mobile and smartwatch accelerometer data for physical activity recognition. *Informatics, 5*(2), Article 29. https://doi.org/10.3390/informatics5020029

Fu, W., & Perry, P.O. (2020). Estimating the number of clusters using cross-validation. *Journal of Computational and Graphical Statistics, 29*(1), 162–173. https://doi.org/10.1080/10618600.2019.1647846

Füzéki, E., Engeroff, T., & Banzer, W. (2017). Health benefits of light-intensity physical activity: A systematic review of accelerometer data of the National Health and Nutrition Examination Survey (NHANES). *Sports Medicine, 47*(9), 1769–1793. https://doi.org/10.1007/s40279-017-0724-0

John, D., Tang, Q., Albinali, F., & Intille, S. (2019). An open-source monitor-independent movement summary for accelerometer data processing. *Journal for the Measurement of Physical Behaviour, 2*(4), 268–281. https://doi.org/10.1123/jmpb.2018-0068

Jones, P.J., Catt, M., Davies, M.J., Edwardson, C.L., Mirkes, E.M., Khunti, K., Yates, T., & Rowlands, A.V. (2021). Feature selection for unsupervised machine learning of accelerometer data physical activity clusters—A systematic review. *Gait & Posture, 90,* 120–128. https://doi.org/10.1016/j.gaitpost.2021.08.007

Larose, D.T., & Larose, C.D. (2015). *Data mining and predictive analytics* (2nd ed.). Wiley.

Lim, Y., Oh, H.S., & Cheung, Y.K. (2019). Functional clustering of accelerometer data via transformed input variables. *Journal of the Royal Statistical Society: Series C, 68*(3), 495–520. https://doi.org/10.1111/rssc.12310

Migueles, J.H., Cadenas-Sanchez, C., Ekelund, U., Delisle Nyström, C., Mora-Gonzalez, J., Löf, M., Labayen, I., Ruiz, J.R., & Ortega, F.B. (2017). Accelerometer data collection and processing criteria to assess physical activity and other outcomes: A systematic review and practical considerations. *Sports Medicine, 47*(9), 1821–1845. https://doi.org/10.1007/s40279-017-0716-0

Mirkin, B. (2011). Choosing the number of clusters. *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1*(3), 252–260. https://doi.org/10.1002/widm.15

Robinson, T.N., Matheson, D., Wilson, D.M., Weintraub, D.L., Banda, J.A., McClain, A., Sanders, L.M., Haskell, W.L., Haydel, K.F., Kapphahn, K.I., Pratt, C., Truesdale, K.P., Stevens, J., & Desai, M. (2021). A community-based, multi-level, multi-setting, multi-component intervention to reduce weight gain among low socioeconomic status Latinx children with overweight or obesity: The Stanford GOALS randomised controlled trial. *Lancet Diabetes and Endocrinology, 9*(6), 336–349. https://doi.org/10.1016/s2213-8587(21)00084-x

Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. *Journal of Computational and Applied Mathematics, 20,* 53–65. https://doi.org/10.1016/0377-0427(87)90125-7

Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. *Journal of Computational and Graphical Statistics, 14*(3), 511–528. https://www.jstor.org/stable/27594130

Volkovich, Z., Barzily, Z., Weber, G.W., Toledano-Kitai, D., & Avros, R. (2011). Resampling approach for cluster model selection. *Machine Learning, 85*(1–2), 209–248. https://doi.org/10.1007/s10994-011-5236-9

Xiao, L., Huang, L., Schrack, J.A., Ferrucci, L., Zipunnikov, V., & Crainiceanu, C.M. (2015). Quantifying the lifetime circadian rhythm of physical activity: A covariate-dependent functional approach. *Biostatistics, 16*(2), 352–367. https://doi.org/10.1093/biostatistics/kxu045