Journal article
License

Closed access

Improving Group Lasso for High-Dimensional Categorical Data

Authors
Sołtys, Agnieszka
Rejchel, Wojciech
Pokarowski, Piotr
Nowakowski, Szymon
Publication date
2023
Abstract (EN)

Sparse modeling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode each category or level. The Group Lasso is a well-known, efficient algorithm for selecting continuous or categorical variables, but the estimates associated with the levels of a selected factor usually all differ. A fitted model may therefore not be sparse, which makes it difficult to interpret. To obtain a sparse solution of the Group Lasso, we propose the following two-step procedure: first, we reduce data dimensionality using the Group Lasso; then, to choose the final model, we apply an information criterion to a small family of models prepared by clustering the levels of individual factors. As a consequence, our procedure reduces the dimensionality of the Group Lasso and strongly improves the interpretability of the final model. Importantly, this reduction results in only a small increase in prediction error. In the paper, we investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real data sets and show that it outperforms state-of-the-art algorithms with respect to prediction accuracy, model dimension, and execution time. Our procedure is implemented in the R package DMRnet, which is available in the CRAN repository.
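
The second step described in the abstract (clustering the levels of a factor, then choosing among the resulting models with an information criterion) can be illustrated with a minimal Python sketch. This is not the DMRnet implementation. It assumes a single categorical factor and a continuous response, uses plain per-level means in place of Group Lasso estimates, a greedy closest-means merge as the clustering, and the Gaussian BIC as the information criterion. All function names and the toy data are hypothetical, and the Group Lasso screening step is omitted.

```python
import math


def level_means(levels, y):
    """Return {level: (count, mean of y)} for one categorical factor."""
    sums, counts = {}, {}
    for lev, val in zip(levels, y):
        sums[lev] = sums.get(lev, 0.0) + val
        counts[lev] = counts.get(lev, 0) + 1
    return {lev: (counts[lev], sums[lev] / counts[lev]) for lev in sums}


def nested_partitions(stats):
    """Build a nested family of partitions of the levels by repeatedly
    merging the two clusters whose means are closest."""
    # Each cluster is [set of levels, count, mean], kept sorted by mean.
    clusters = sorted(([{lev}, n, m] for lev, (n, m) in stats.items()),
                      key=lambda c: c[2])
    family = [[set(c[0]) for c in clusters]]
    while len(clusters) > 1:
        # With clusters sorted by mean, the closest pair is always adjacent.
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][2] - clusters[j][2])
        a, b = clusters[i], clusters[i + 1]
        n = a[1] + b[1]
        # The merged mean lies between the two old means, so order is kept.
        clusters[i:i + 2] = [[a[0] | b[0], n, (a[1] * a[2] + b[1] * b[2]) / n]]
        family.append([set(c[0]) for c in clusters])
    return family


def bic(partition, levels, y):
    """Gaussian BIC of the model with one mean per cluster of levels."""
    cluster_of = {lev: k for k, part in enumerate(partition) for lev in part}
    sums = [0.0] * len(partition)
    counts = [0] * len(partition)
    for lev, val in zip(levels, y):
        sums[cluster_of[lev]] += val
        counts[cluster_of[lev]] += 1
    means = [s / c for s, c in zip(sums, counts)]
    rss = sum((val - means[cluster_of[lev]]) ** 2
              for lev, val in zip(levels, y))
    n = len(y)
    # Floor the RSS so that a perfect fit does not produce log(0).
    return n * math.log(max(rss, 1e-12) / n) + len(partition) * math.log(n)


def select_partition(levels, y):
    """Pick the partition in the nested family with the smallest BIC."""
    family = nested_partitions(level_means(levels, y))
    return min(family, key=lambda p: bic(p, levels, y))


# Toy data: six levels, but only three distinct true means.
# The "noise" is a fixed symmetric pattern so the example is deterministic.
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5] * 2
true_mean = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 3.0, "e": 3.0, "f": 6.0}
levels, y = [], []
for lev, mu in true_mean.items():
    for off in offsets:
        levels.append(lev)
        y.append(mu + off)

chosen = select_partition(levels, y)
print([sorted(p) for p in chosen])  # prints [['a', 'b', 'c'], ['d', 'e'], ['f']]
```

In this toy setting the selected model needs three parameters instead of six, which is the kind of within-factor sparsity the abstract refers to. Because the candidate family is nested and has at most as many members as the factor has levels, evaluating the criterion on every member stays cheap.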

PBN discipline
computer science
Pages (from-to)
455-470
Open access license
Closed access