With the advancement of technology, the shift from traditional paper-based assessments to computer-based testing (CBT) has accelerated, enabling the efficient collection of diverse process data. In the field of learning analytics, such data are increasingly used to predict students' academic achievement and to design intervention strategies that prevent dropout, especially in online learning environments, where their importance is even more pronounced. To interpret process data reliably, it is essential to recognize that students' response patterns may vary depending on item characteristics such as cognitive level, item type, discrimination, and difficulty. Accordingly, this study categorizes process data based on these characteristics and constructs prediction models for mathematics achievement. Linear regression and random forest techniques are applied to compare model performance and identify key predictors, with the aim of improving both predictive accuracy and interpretability. Unlike previous studies, this approach seeks to capture the interaction between item characteristics and process data more precisely.
This study used data from Korean students who participated in the mathematics domain of PISA 2022, focusing on CBT items. The dependent variables were the ten plausible values (PVs); prediction performance metrics (RMSE, MAE, MSE, R²) were calculated for each PV, and the final model evaluation was based on the average of these ten values. Independent variables included 27 background variables identified through a systematic literature review by Wang et al. (2023), as well as process data reconstructed by item characteristics (cognitive level, item type, discrimination, and difficulty). The process data were generated by computing the average response time and the average number of actions for each item category. Variables with multicollinearity issues were excluded from the analysis. Six models were constructed from different combinations of item characteristics, and each was estimated with both analysis methods, allowing for comparative analysis of model performance and key predictors.
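The construction of the process-data variables described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the column names (`student_id`, `item_id`, `response_time`, `n_actions`), the toy values, and the helper `aggregate_process_data` are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format log: one row per student-item pair.
log = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 2],
    "item_id": ["A", "B", "C", "A", "B", "C"],
    "response_time": [30.0, 50.0, 70.0, 20.0, 40.0, 90.0],
    "n_actions": [2, 4, 6, 1, 3, 5],
})

# Hypothetical item metadata: each item labelled on one characteristic.
items = pd.DataFrame({
    "item_id": ["A", "B", "C"],
    "difficulty": ["low", "high", "high"],
})


def aggregate_process_data(log, items, characteristic):
    """Average response time and action count per student and item category."""
    merged = log.merge(items[["item_id", characteristic]], on="item_id")
    means = (
        merged.groupby(["student_id", characteristic])[["response_time", "n_actions"]]
        .mean()
    )
    # One row per student, one column per (measure, category) pair.
    wide = means.unstack(characteristic)
    wide.columns = [
        f"{measure}_{characteristic}_{level}" for measure, level in wide.columns
    ]
    return wide


features = aggregate_process_data(log, items, "difficulty")
print(features)
```

Repeating the call with a different `characteristic` column (for example a cognitive-level or discrimination label) would yield the predictor blocks for the other item-characteristic models.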
Missing data were addressed using a combination of listwise deletion and K-nearest neighbors (KNN) imputation. After the optimal k values were determined, 39 variables with low missing rates were imputed with k=10, while one variable with a high missing rate (ST293Q01JA) was imputed with k=5. Log transformation was applied to continuous variables with high skewness or kurtosis, and all continuous independent variables were standardized for linear regression only.
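A minimal sketch of this preprocessing chain is given below, using scikit-learn's `KNNImputer` and `StandardScaler`. Several details are assumptions rather than facts from the study: the two-stage ordering (low-missing variables first, then ST293Q01JA using the completed columns as neighbour coordinates), the skewness and kurtosis cut-offs, the use of `log1p` instead of a plain logarithm so that zeros are handled, and the toy column names other than ST293Q01JA.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

SKEW_LIMIT = 1.0  # hypothetical cut-off, not reported in the source
KURT_LIMIT = 3.0  # hypothetical cut-off (excess kurtosis), not reported


def impute_two_stage(df, low_missing_cols, high_missing_col):
    """KNN imputation with k=10 for low-missing columns, then k=5 for one column."""
    out = df.copy()
    out[low_missing_cols] = KNNImputer(n_neighbors=10).fit_transform(
        out[low_missing_cols]
    )
    # Assumption: the completed columns define the neighbours for stage 2.
    cols = low_missing_cols + [high_missing_col]
    filled = KNNImputer(n_neighbors=5).fit_transform(out[cols])
    out[high_missing_col] = filled[:, -1]
    return out


def log_transform_skewed(df, cols):
    """Apply log1p to non-negative columns that exceed the hypothetical cut-offs."""
    out = df.copy()
    transformed = []
    for col in cols:
        if abs(out[col].skew()) > SKEW_LIMIT or out[col].kurt() > KURT_LIMIT:
            out[col] = np.log1p(out[col])
            transformed.append(col)
    return out, transformed


def standardize(df, cols):
    """Z-score the given columns (used for the linear regression input only)."""
    out = df.copy()
    out[cols] = StandardScaler().fit_transform(out[cols])
    return out


# Synthetic demo data; only the name ST293Q01JA comes from the text.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "resp_time": np.exp(np.linspace(0.0, 8.0, n)),  # deterministic, right-skewed
    "n_actions": rng.normal(10.0, 2.0, size=n),
    "ST293Q01JA": rng.normal(0.0, 1.0, size=n),
})
demo.loc[rng.choice(n, 10, replace=False), "n_actions"] = np.nan
demo.loc[rng.choice(n, 60, replace=False), "ST293Q01JA"] = np.nan

imputed = impute_two_stage(demo, ["resp_time", "n_actions"], "ST293Q01JA")
logged, transformed = log_transform_skewed(imputed, ["resp_time", "n_actions"])
scaled = standardize(logged, ["resp_time", "n_actions", "ST293Q01JA"])
```

In this sketch the random forest would be fitted on `logged` and the linear regression on `scaled`, which mirrors the statement that standardization was applied for linear regression only.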
Using the refined dataset, mathematics achievement prediction models were constructed in Python with linear regression and random forest algorithms. The dataset was randomly split into training (80%) and test (20%) sets. Linear regression was implemented with default settings, while the random forest model was tuned over n_estimators (100, 300, 500, 1,000) and max_features ('sqrt', 'log2') using GridSearchCV with five-fold cross-validation. Model performance was evaluated based on RMSE, MAE, MSE, and R².
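The modelling and evaluation steps can be sketched roughly as follows. The grid in `PARAM_GRID` is the one reported above; everything else is an assumption made for illustration: the helper names, the synthetic data, the random seed, the use of the same split for every plausible value, and the default GridSearchCV selection criterion (R² for regressors), since the text does not state which score was used. The demo runs a reduced grid and three synthetic PV columns only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hyperparameter grid as reported in the text.
PARAM_GRID = {
    "n_estimators": [100, 300, 500, 1000],
    "max_features": ["sqrt", "log2"],
}


def score(y_true, y_pred):
    """Return the four metrics named in the text for one plausible value."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "MSE": float(mse),
        "R2": float(r2_score(y_true, y_pred)),
    }


def make_forest(param_grid=PARAM_GRID, seed=0):
    """Random forest wrapped in a five-fold grid search."""
    return GridSearchCV(RandomForestRegressor(random_state=seed), param_grid, cv=5)


def evaluate_over_pvs(X, pvs, make_model, seed=0):
    """Fit one model per PV column on an 80/20 split and average the metrics."""
    per_pv = []
    for j in range(pvs.shape[1]):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, pvs[:, j], test_size=0.2, random_state=seed
        )
        model = make_model().fit(X_tr, y_tr)
        per_pv.append(score(y_te, model.predict(X_te)))
    return {name: float(np.mean([m[name] for m in per_pv])) for name in per_pv[0]}


# Synthetic demo: 5 predictors, 3 PV-like outcomes (the study used ten PVs).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
signal = 500 + 40 * X[:, 0] + 20 * X[:, 1] ** 2
pvs = np.column_stack([signal + rng.normal(0, 10, size=200) for _ in range(3)])

small_grid = {"n_estimators": [20], "max_features": ["sqrt", "log2"]}
lr_scores = evaluate_over_pvs(X, pvs, LinearRegression)
rf_scores = evaluate_over_pvs(X, pvs, lambda: make_forest(small_grid))
print(lr_scores)
print(rf_scores)
```

Note that the averaged RMSE is the mean of the per-PV RMSE values, not the square root of the averaged MSE, so the two reported numbers need not match exactly.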
Finally, the consistency between the two methods was assessed using top-10 and top-20 variable overlap rates and Spearman’s rank correlation coefficients across the six models. Key predictors were further examined by ranking standardized coefficients in linear regression and feature importances in random forest. Repeatedly influential variables and method-specific predictors were analyzed for each item characteristic-based model, offering insights into both shared and distinct factors influencing mathematics achievement.
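The two consistency measures can be illustrated with a short sketch. The variable names and importance values below are invented for the example; the only elements taken from the text are the ideas of counting shared top-k variables and correlating the two importance rankings. Ranking linear regression coefficients by absolute value is an assumption about how "ranking standardized coefficients" was done.

```python
import pandas as pd
from scipy.stats import spearmanr


def top_k_overlap(imp_a, imp_b, k):
    """Number of variables shared by the two top-k lists (by absolute size)."""
    top_a = set(imp_a.abs().nlargest(k).index)
    top_b = set(imp_b.abs().nlargest(k).index)
    return len(top_a & top_b)


def rank_agreement(imp_a, imp_b):
    """Spearman rank correlation between two importance vectors."""
    a, b = imp_a.abs().align(imp_b.abs(), join="inner")
    rho, _ = spearmanr(a, b)
    return float(rho)


# Hypothetical standardized coefficients (linear regression).
lr_coef = pd.Series({
    "self_efficacy": 0.40,
    "escs": -0.25,
    "rt_high_cog": 0.20,
    "anxiety": -0.15,
    "n_act_high_diff": 0.05,
})

# Hypothetical feature importances (random forest).
rf_importance = pd.Series({
    "self_efficacy": 0.35,
    "escs": 0.10,
    "rt_high_cog": 0.30,
    "anxiety": 0.05,
    "n_act_high_diff": 0.20,
})

overlap = top_k_overlap(lr_coef, rf_importance, k=3)
rho = rank_agreement(lr_coef, rf_importance)
print(overlap, rho)  # 2 shared variables in the top 3, rho = 0.5
```

Dividing `overlap` by `k` gives an overlap rate; applying the same functions with k=10 and k=20 to each of the six models would reproduce the kind of comparison described here.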
The main findings are summarized as follows. First, a comparison between linear regression and random forest in predicting mathematics achievement revealed that random forest consistently outperformed linear regression. In all six models, random forest exhibited lower error metrics (RMSE, MAE, MSE) and higher R², suggesting that it is better suited to capturing complex, non-linear interactions among predictor variables, such as those involving item characteristics and process data.
Second, when comparing the predictive performance of models based on item characteristics, models 2 through 6 consistently outperformed model 1. Model 1, which included only student background variables, showed the lowest performance across all metrics. In contrast, models 2 through 5, which incorporated process data classified by cognitive level, item type, discrimination, and difficulty, each improved predictive power consistently. Model 6, an integrated model including process data classified by all item characteristics, contained the largest number of predictors, yet its performance improved only marginally over models 2 through 5. Chi-square and Cramér’s V analyses revealed significant associations among process data classified by cognitive level, item difficulty, and item discrimination, indicating that overlapping variables may have weakened the independent contribution of each factor and thereby limited the gain in predictive performance.
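For readers unfamiliar with the association measure mentioned here, Cramér's V for an r × c contingency table with n observations is sqrt(χ² / (n · (min(r, c) − 1))). The sketch below computes it with SciPy's `chi2_contingency`. It assumes the test is applied to a cross-tabulation of categorical item labels; the text does not specify the exact table, and the labels and helper name are invented for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(labels_a, labels_b):
    """Chi-square test of independence plus Cramér's V for two label series."""
    table = pd.crosstab(labels_a, labels_b)
    # correction=False: no Yates continuity correction, so V can reach 1.
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return float(v), float(p_value)


# Hypothetical labels for 20 items.
difficulty = pd.Series(["low"] * 10 + ["high"] * 10, name="difficulty")
cognitive = pd.Series(["low"] * 10 + ["high"] * 10, name="cognitive")
item_type = pd.Series((["mc"] * 5 + ["open"] * 5) * 2, name="item_type")

v_overlap, p_overlap = cramers_v(difficulty, cognitive)  # identical labelling
v_indep, p_indep = cramers_v(difficulty, item_type)      # balanced, unrelated
print(v_overlap, v_indep)
```

A value of V near 1, as in the first pair, means that two classifications sort items almost identically, which is the situation in which process variables built from them would carry largely redundant information.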
Third, the top 20 predictors identified in each item-characteristic-based model using linear regression showed that process data held strong predictive power in models 2 through 6. Nearly all process variables ranked within the top 20 in each model. In particular, the average response time on high-cognitive-level items, the average response time on low-discrimination items, and the average number of actions on high-difficulty items emerged as strong predictors across all models. At the individual level, mathematics self-efficacy consistently ranked first and even surpassed process data in predictive strength. At the family level, socioeconomic status (ESCS, the PISA index of economic, social and cultural status) maintained a high ranking in all models, while at the school level, average school-level ESCS consistently ranked between second and fifth place, indicating a stable and significant influence.
Fourth, in the random forest models, all process data variables included in models 2 through 6 ranked within the top 20 predictors, reaffirming their strong predictive power. In particular, the average response times on high-difficulty, low-discrimination, and high-cognitive-level items were among the most influential predictors across all models. At the individual level, basic/applied mathematics self-efficacy consistently ranked first or second, often showing predictive power equal to or greater than that of the process data. Although mathematics self-efficacy related to reasoning and 21st-century competencies ranked slightly lower, it was still identified as an influential predictor. At the family level, both ESCS and home possessions showed moderate but stable predictive power. At the school level, average school-level ESCS ranked between third and eighth across all models, highlighting its strong and consistent impact.
Fifth, a quantitative comparison of variable consistency between the linear regression and random forest models revealed a moderate to high degree of agreement. On average, 6.5 to 8 variables overlapped within the top 10 predictors across models, and 14.2 to 15.7 variables overlapped within the top 20. The Spearman’s rank correlation coefficient for overall variable importance rankings ranged from 0.54 to 0.58 for models 2 through 6, indicating a moderate-to-strong level of similarity between the two methods.
Sixth, an analysis of the top 10 predictors across models 1 to 6 showed that process data variables representing response behavior consistently appeared among the most important predictors in both linear regression and random forest models. Specifically, the average response time on high-cognitive-level items, the average number of actions on high-difficulty items, the average response time or number of actions on low-discrimination items, and the average response time on low-cognitive-level items were consistently ranked within the top 10. These findings suggest that students’ response behaviors vary according to item characteristics such as cognitive level, difficulty, and discrimination, and that these behaviors are closely associated with achievement outcomes.
At the individual level, mathematics self-efficacy ranked first or second in all models, emerging as a powerful single predictor. At the family level, socioeconomic status was a common predictor in models 1 to 5. While other family-level indicators such as home possessions were considered, ESCS was the most consistently important indicator of students’ socioeconomic background. At the school level, average school-level ESCS was consistently identified as a top predictor across all models.
Seventh, an analysis of variables identified exclusively by either linear regression or random forest revealed method-specific patterns. While process data variables showed no strikingly different selection patterns between the methods, differences were notable in background variables. At the individual level, linear regression repeatedly identified math anxiety and tardiness as key predictors in models 2, 3, and 4. In contrast, random forest consistently selected reasoning and 21st-century-related self-efficacy across all models. At the family level, the highest level of parental education emerged as a key variable in linear regression, whereas home possessions was prioritized in random forest. At the school level, linear regression emphasized quantitative opportunity indicators such as weekly math instruction time, while random forest emphasized qualitative engagement indicators, such as participation frequency in class discussions.
In summary, this study empirically confirmed that process data derived from item characteristics function as stable and consistent predictors of mathematics achievement, regardless of the analysis method employed. Comparisons across models 1 through 6 demonstrated that process data based on various item attributes, such as cognitive level, item type, discrimination, and difficulty, contributed significantly to prediction accuracy. Notably, even though model 6 integrated multiple item characteristics, its performance did not substantially improve, suggesting that process data based on a single characteristic can still yield highly effective predictions.
Mathematics self-efficacy also emerged as the most influential predictor across all models, underscoring the strong connection between learners' perceived competence and their actual performance. This finding highlights the need for instructional strategies that not only deliver content but also support students' cognitive beliefs and emotional engagement.
Furthermore, both household- and school-level predictors that consistently ranked in the top 10 were related to socioeconomic status, indicating that economic factors operate structurally across individual, family, and institutional contexts in shaping academic achievement. This supports existing research showing that economic disparities persist as a significant driver of educational inequality.
Differences in predictor selection across methods and models also reflect the structural characteristics of each analysis method, suggesting limitations in relying on a single modeling approach. For more robust and nuanced predictions, future research should adopt multiple analytical techniques and conduct integrated interpretations that consider both common and method-specific predictors. In particular, the superior performance of random forest in capturing complex, nonlinear interactions implies that advanced machine learning techniques may be more suitable for analyzing process data.
Based on these findings, several directions for future research are proposed. First, a broader range of modeling techniques, including XGBoost, SVM, and other modern algorithms, should be compared. Second, while this study focused on total response time and the number of actions, future work should incorporate other types of PISA process data (F, V, VS variables) and sequential/time-series information. Third, because item characteristics can be perceived differently depending on the learner's ability, motivation, and strategy, prediction models should reflect learner-centered classifications of item characteristics. Finally, as this study included only a limited set of background variables, future studies should incorporate more emotional and psychological factors, especially within the family context.