Unexpected outcome in text analysis using R with gutenbergr and cv.glmnet
I hope you're all doing well.
I'm trying to replicate a text analysis exercise in R with the gutenbergr package, and to fit a machine learning model with the glmnet package.
The problem is that the code runs, but it returns the wrong outcome.
I've included the full code below; it isn't very long and it's fully reproducible.
The first part of the code runs fine:
##### Load libraries #####
library(tidyverse)
library(tidytext)
library(udpipe)
library(gutenbergr)
library(rsample)
library(glmnet)
library(broom)      # needed for tidy() on the glmnet fit below
library(yardstick)
##### 1. Load the books from Project Gutenberg ####
twist_tale <- gutenberg_metadata %>%
  filter(
    title %in% c("A Tale of Two Cities", "Oliver Twist"),
    has_text,
    language == "en") %>%
  pull(gutenberg_id) %>%
  gutenberg_download(meta_fields = "title")
##### 2. Remove the blank lines #####
twist_tale <- twist_tale %>%
  filter(text != "")
View(twist_tale)
##### 3. Create the logical variable ######
twist_tale <- twist_tale %>%
  mutate(
    es_two_cities = case_when(
      title == "A Tale of Two Cities" ~ 1L,
      title == "Oliver Twist" ~ 0L
    ),
    line_id = row_number()
  ) %>%
  view()
##### 3.1. Save the results and count the rows ####
twist_tale %>%
  count(title)
##### 4. Prepare the dataset and model #####
dl <- udpipe_download_model(language = "english")
english_model <- udpipe_load_model(dl$file_model)
text <- twist_tale %>%
  select(doc_id = line_id, text)
twist_tale_preprocesado <- udpipe(text, english_model, parallel.cores = 4L)
##### 4b. Create the training split #####
set.seed(1234L)
twist_tale_split <- initial_split(twist_tale)
twist_tale_training <- training(twist_tale_split)
twist_tale_testing <- testing(twist_tale_split)
##### 4a. Lowercase the lemmas and clean out the junk ####
sparse_train_data <- twist_tale_preprocesado %>%
  mutate(lemma = str_to_lower(lemma)) %>%
  anti_join(stop_words, by = c("lemma" = "word")) %>%
  filter(upos %in% c("PUNCT", "SYM", "X", "NUM")) %>%
  mutate(doc_id = as.integer(doc_id)) %>%
  anti_join(twist_tale_testing, by = c("doc_id" = "line_id")) %>%
  count(doc_id, lemma) %>%
  cast_sparse(doc_id, lemma, n)
Up to this point, everything is fine. The problem arises when running the cv.glmnet() model: it returns no error or warning message, but the outcome is a row vector, and I believe it should be a matrix or tibble.
# The code that causes the problem
##### 5a. Exclude the irrelevant results and save the response ####
y <- twist_tale_training %>%
  filter(line_id %in% rownames(sparse_train_data)) %>%
  pull(es_two_cities)
##### 6a. Fit the regularized logistic regression ####
model <- cv.glmnet(sparse_train_data, y, family = "binomial",
                   keep = TRUE, trace.it = 1) # THE PROBLEM IS ACTUALLY HERE
coeficientes <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)
coeficientes %>%
  group_by(estimate > 0) %>%
  slice_max(estimate, n = 5) %>%
  ungroup()
coeficientes %>%
  group_by(estimate > 0) %>%
  slice_max(estimate, n = 5) %>%
  ungroup() %>%
  ggplot() +
  geom_col(aes(x = fct_reorder(term, estimate), y = estimate, fill = estimate > 0)) +
  coord_flip()
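As a quick sanity check for the symptom described above, one can compare the design matrix and the response before fitting. A minimal sketch, using the objects defined in steps 4a and 5a:

# Sketch: cv.glmnet() needs the i-th element of y to describe the
# document in the i-th row of x; matching lengths alone is not enough.
nrow(sparse_train_data)             # documents (rows) in x
length(y)                           # must equal nrow(sparse_train_data)
head(rownames(sparse_train_data))   # order of doc ids in x, to compare against y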
I apologize for the length of the question, but I wanted to show the full context.
I haven't run your code, but you probably need to make sure the x matrix and the y response vector are in the same order. Something along the lines of the following will do that (replacing step 5a and the beginning of your step 6a):
library(udpipe)
# name the response by line_id so it can be matched against the row names of x
y <- setNames(twist_tale_training$es_two_cities, twist_tale_training$line_id)
# dtm_align() subsets and reorders x and y so that they line up row for row
traindata <- dtm_align(x = sparse_train_data, y = y)
model <- cv.glmnet(x = traindata$x, y = traindata$y, family = "binomial",
                   keep = TRUE, trace.it = 1)
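If you'd rather not depend on dtm_align(), the same alignment can be done by indexing a named response vector with the row names of the sparse matrix. A minimal sketch, assuming the objects from the question:

# Sketch: manual x/y alignment via row names (untested).
y <- setNames(twist_tale_training$es_two_cities,
              twist_tale_training$line_id)
y <- y[rownames(sparse_train_data)]   # reorder y to match the rows of x
stopifnot(!anyNA(y), length(y) == nrow(sparse_train_data))
model <- cv.glmnet(sparse_train_data, y, family = "binomial",
                   keep = TRUE, trace.it = 1)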
Related
I need assistance with a nested for-loop function which outputs multiple tables
I'm using the diamonds data set as hypothetical data. I'm trying to compare the performance of two separate models in predicting a binary variable via misclassification error. The binary variable is decided upon arbitrary thresholds, which are set at the beginning of the code (prior to making the models, and both models are run with a different threshold each time). I want to do a for loop so that it sets a different threshold each time the loop runs. My goal is to obtain a table of the misclassification errors for each threshold. Please bear with my inelegant code. Any help is greatly appreciated.

library(tidyverse)
library(rpart)
library(rpart.plot)

data = diamonds
threshold = c(0.20, 0.70, 2.00)

for (i in 1:length(threshold)) {
  data_filtered = data %>%
    select(-c(cut, color, clarity)) %>%
    mutate(carat_dummy = ifelse(carat > threshold, 1, 0)) %>%
    drop_na()

  set.seed(14643)
  data_filtered = data_filtered %>%
    mutate(train_test = sample(c("train", "test"), n(), replace = TRUE, prob = c(0.75, 0.25)))
  data_train = data_filtered %>% filter(train_test == "train")
  data_test = data_filtered %>% filter(train_test == "test")

  # regression
  reg = lm(carat_dummy ~ depth + table + price, data = data_train)

  # decision trees
  data_train$carat_dummy = factor(data_train$carat_dummy)
  trees = rpart(carat_dummy ~ depth + table + price, data = data_train)
  tree_predict = predict(trees, newdata = data_test, type = "class")

  # linear regression error
  predict = predict(reg, data_test, type = "response")
  prediction_reg = data_test %>%
    mutate(prediction = ifelse(predict >= 0.5, 1, 0)) %>%
    select(carat_dummy, prediction)
  error_reg = prediction_reg %>%
    summarise(misclassification = mean(carat_dummy != prediction)) %>%
    add_column("Method" = "Linear Regression") %>%
    select(Method, everything())

  # decision tree error
  prediction_tree = data_test %>%
    mutate(prediction = as.numeric(as.character(tree_predict))) %>%
    select(carat_dummy, prediction)
  error_tree = prediction_tree %>%
    summarise(misclassification = mean(carat_dummy != prediction)) %>%
    add_column("Method" = "Decision Trees") %>%
    select(Method, everything())

  errors_t = rbind(error_reg, error_tree)
}
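Not a verified answer, but one common way to get one table per threshold is to index the threshold inside the loop (threshold[i] instead of threshold) and collect each iteration's result in a pre-allocated list rather than overwriting errors_t. A sketch against the code above:

# Sketch: only the changed lines are shown; the rest of the loop body
# stays as in the question.
error_tables <- vector("list", length(threshold))
for (i in 1:length(threshold)) {
  # ... inside data_filtered, use the i-th threshold:
  #   mutate(carat_dummy = ifelse(carat > threshold[i], 1, 0))
  # ... rest of the loop body unchanged ...
  error_tables[[i]] <- rbind(error_reg, error_tree)
}
names(error_tables) <- threshold
error_tables   # one misclassification table per threshold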
Multivariate time series - is there notation to select all the variables, or do they all have to be written out?
I'm working to build a multivariate time series to make predictions about labor in the United States. The fpp3 package is excellent, but I don't see a notation to model all the variables. For example, in linear regression it's possible to do this:

library(tidyverse)
mtcars.lm <- lm(mpg ~ ., data = mtcars)
summary(mtcars.lm)

to model mpg on all the remaining variables, without having to write them all out explicitly. Is there something similar in time series using the fpp3 package? For example, this returns an error:

library(tidyverse)
library(fpp3)
library(clock)

# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv',
                          col_select = c(Label, Value), show_col_types = FALSE)
All_Employees <- All_Employees %>%
  rename(Month = Label, Total_Employees = Value)
All_Employees <- All_Employees %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month) %>%
  mutate(Total_Employees_Diff = difference(Total_Employees))
index = All_Employees$Month
All_Employees <- All_Employees %>%
  filter((Month >= start_month), (Month <= end_month))

# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0500000003
Average_Hourly_Earnings <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/Average_Hourly_Earnings.csv',
                                    col_select = c(Label, Value), show_col_types = FALSE)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  rename(Month = Label, Avg_Hourly_Earnings = Value)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month) %>%
  mutate(Avg_Hourly_Earnings_Diff = difference(Avg_Hourly_Earnings))
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  filter((Month >= start_month), (Month <= end_month))

Monthly_labor_data_small <- tsibble(
  Month = All_Employees$Month,
  index = Month,
  'Total_Employees' = All_Employees$Total_Employees,
  'Avg_Earnings' = Average_Hourly_Earnings$Avg_Hourly_Earnings
)

start_month_small = yearmonth("2020 Mar")
end_month_small = yearmonth("2022 Jan")
Monthly_labor_data_small <- Monthly_labor_data_small %>%
  filter((Month >= start_month_small), (Month <= end_month_small))

Monthly_labor_data_small %>%
  model(
    linear = TSLM(Total_Employees ~ .,))

The error is:

Error in TSLM(Total_Employees ~ ., ) : unused argument (alist())

But this runs fine if I list everything out:

fit <- Monthly_labor_data_small %>%
  model(
    linear = TSLM(Total_Employees ~ Avg_Earnings + season() + trend()))
report(fit)

The full tsibble will have a large number of columns; is there a short way to list all of them, similar to what can be done in linear regression?
You should be able to do something like:

resp <- "Total_Employees"
form <- reformulate(response = resp,
                    c(setdiff(names(Monthly_labor_data_small), resp), "season()", "trend()"))

and then use form in your model. I haven't tried your examples -- if there are other variables (like a time index) that should not be explicitly included in the model, then the second argument to setdiff() should be c(resp, "excluded_var2", "excluded_var3").
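For reference, a toy illustration of what reformulate() builds (hypothetical variable names):

# reformulate() pastes the term labels into a one-sided sum and attaches
# the response, so function terms like season() pass through unchanged.
reformulate(c("x1", "x2", "season()", "trend()"), response = "y")
#> y ~ x1 + x2 + season() + trend()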
How to deal with external regressors in time series recipes?
In time series forecasting, external regressors can make a big difference. Currently I want to track the effects of external regressors using the modeltime framework. However, I could not find any helpful information on this topic so far; I only found out that you can add regressor variables with a "+" in the recipe formula. After adding the variables Transactions (number of transactions per day and store) and Open_Closed (1 = store is closed, 0 = store is open) to my recipe, I found that there was no effect on the prediction. How can I achieve this? Some reprex data:

suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(timetk))

#### DATA
data <- data.frame(
  Store = c(rep("1", 365), rep("2", 365)),
  Sales = c(seq(1, 44, length.out = 365)),
  Date = c(dates <- ymd("2013-01-01") + days(0:364)),
  Transactions = c(seq(50, 100, length.out = 365)),
  Open_Closed = sample(rep(0:1, each = 365))
)

h = 42

# split
set.seed(234)
splits <- time_series_split(data, assess = "42 days", cumulative = TRUE)

# recipe
recipe_spec <- recipe(Sales ~ Date + Transactions + Open_Closed, data) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(iso$)|(xts$)|(day)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal())
recipe_spec %>% prep() %>% juice()

#### MODELS
# elnet
model_spec_glmnet <- linear_reg(penalty = 1) %>%
  set_engine("glmnet")

wflw_fit_glmnet <- workflow() %>%
  add_model(model_spec_glmnet) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# xgboost
model_spec_xgboost <- boost_tree("regression", learn_rate = 0.35) %>%
  set_engine("xgboost")

set.seed(123)
wflw_fit_xgboost <- workflow() %>%
  add_model(model_spec_xgboost) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# sub tbl
submodels_tbl <- modeltime_table(
  wflw_fit_glmnet,
  wflw_fit_xgboost
)

submodels_tbl %>%
  modeltime_accuracy(testing(splits)) %>%
  table_modeltime_accuracy(.interactive = FALSE)
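Not an answer to the forecasting question itself, but one way to check whether the regressors enter the fitted model at all is to pull the underlying parsnip fit out of the workflow and inspect its coefficients. A sketch, assuming the wflw_fit_glmnet object from the reprex above:

# Sketch (untested against the reprex): look for non-zero coefficients
# on the external regressors in the elastic-net fit.
wflw_fit_glmnet %>%
  extract_fit_parsnip() %>%   # from the workflows package
  tidy() %>%                  # coefficients at the specified penalty
  filter(term %in% c("Transactions", "Open_Closed"))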
How to handle forecast data (melt and "unmelt") generated by modeltime prediction - lost variables
Below I created some fake forecast data using the tidyverse and modeltime packages. I have monthly data starting in 2016 and want to produce a test forecast for 2020. As you can see, the data I load comes in wide format; for use in modeltime I transform it to long data. After the modeling phase, I want to create a data frame for the 2020 prediction values. For this purpose I need to somehow "unmelt" the data, and in this process I am unfortunately losing a lot of variables: from the 240 variables that I want to forecast, I get only 49 in the end result. Maybe I am blind, or I do not know how to configure the modeltime functions correctly. I would really appreciate some help. Thanks in advance!

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))

## create some senseless data to produce forecasts on...
dates <- ymd("2016-01-01") + months(0:59)
fake_values <- c(661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239)
replicate <- rep(1, 60) %*% t.default(fake_values)
replicate <- as.data.frame(replicate)
df <- bind_cols(replicate, dates) %>% rename(c(dates = ...241))

## melt it down
data <- reshape2::melt(df, id.var = 'dates')

## make some senseless forecast on senseless data...
split_obj <- initial_time_split(data, prop = 0.8)

model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(value ~ dates, data = training(split_obj))

## model table
models_tbl_prophet <- modeltime_table(model_fit_prophet)

## calibration
calibration_tbl_prophet <- models_tbl_prophet %>%
  modeltime_calibrate(new_data = testing(split_obj))

## forecast
fc_prophet <- calibration_tbl_prophet %>%
  modeltime_forecast(
    new_data = testing(split_obj),
    actual_data = data,
    keep_data = TRUE
  )

## "unmelt" it again
fc_prophet <- fc_prophet %>% filter(str_detect(.key, "prediction"))
fc_prophet <- fc_prophet[, c(4, 9, 10)]
fc_prophet <- dplyr::filter(fc_prophet, .index >= "2020-01-01", .index <= "2020-12-01")
#fc_prophet <- fc_prophet %>% subset(fc_prophet, as.character(.index) > "2020-01-01" & as.character(.index) < "2020-12-01")

fc_wide_prophet <- fc_prophet %>%
  pivot_wider(names_from = variable, values_from = value)
Here is my full solution. I have also provided background on what I'm doing here: https://github.com/business-science/modeltime/issues/133

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))
library(timetk)

## create some senseless data to produce forecasts on...
dates <- ymd("2016-01-01") + months(0:59)
fake_values <- c(661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239)
replicate <- rep(1, 60) %*% t.default(fake_values)
replicate <- as.data.frame(replicate)
df <- bind_cols(replicate, dates) %>% rename(c(dates = ...241))

## melt it down
data <- reshape2::melt(df, id.var = 'dates')
data %>% as_tibble() -> data

data %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_time_series(dates, value, .facet_ncol = 3, .smooth = F)

## make some senseless forecast on senseless data...
split_obj <- initial_time_split(data, prop = 0.8)
split_obj %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dates, value)

split_obj_2 <- time_series_split(data, assess = "1 year", cumulative = TRUE)
split_obj_2 %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dates, value)

model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(value ~ dates, data = training(split_obj))

## model table
models_tbl_prophet <- modeltime_table(model_fit_prophet)

## calibration
calibration_tbl_prophet <- models_tbl_prophet %>%
  modeltime_calibrate(new_data = testing(split_obj_2))

## forecast
fc_prophet <- calibration_tbl_prophet %>%
  modeltime_forecast(
    new_data = testing(split_obj_2),
    actual_data = data,
    keep_data = TRUE
  )

fc_prophet %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_modeltime_forecast(.facet_ncol = 3)

## the old "unmelt" attempt, kept for reference
# fc_prophet <- fc_prophet %>% filter(str_detect(.key, "prediction"))
# fc_prophet <- fc_prophet[, c(4, 9, 10)]
# fc_prophet <- dplyr::filter(fc_prophet, .index >= "2020-01-01", .index <= "2020-12-01")
# fc_wide_prophet <- fc_prophet %>%
#   pivot_wider(names_from = variable, values_from = value)

# Make a future forecast
refit_tbl_prophet <- calibration_tbl_prophet %>%
  modeltime_refit(data = data)

future_fc_prophet <- refit_tbl_prophet %>%
  modeltime_forecast(
    new_data = data %>% group_by(variable) %>% future_frame(.length_out = "1 year"),
    actual_data = data,
    keep_data = TRUE
  )

future_fc_prophet %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_modeltime_forecast(.facet_ncol = 3)

# Reformat as wide
future_wide_tbl <- future_fc_prophet %>%
  filter(.key == "prediction") %>%
  select(.model_id, .model_desc, dates, variable, .value) %>%
  pivot_wider(
    id_cols = c(.model_id, .model_desc, dates),
    names_from = variable,
    values_from = .value
  )
future_wide_tbl[names(df)]
R Explaining Random Forest Variable Selection Sample Code
I have some sample code for random forest variable selection. We want to choose the combination of variables with the most importance and build the random forest model with the lowest OOB error. Can anyone explain the for-loop part of the function for me?

clinical_variables <- c("Age", "location", "smoke", "perianal_disease", "upper_tract",
                        "LnASCA IgA", "LnASCA IgG", "LnANCA", "LnCbir", "LnOMPC",
                        "CRP", "Albumin", "African American Race")

variable_selected_progress_biomarkers <- vector("list", 50)
error_rate_min_progress_biomarkers <- rep(NA, 50)

for (j in 1:50) {
  risk_progress_biomarker_variables <- risk_full %>%
    select(names(risk), clinical_variables) %>%
    select(-c("STRICTURE", "TIM2STRICTURE", "PENETRATING", "TIM2PENETRATING",
              "BDNF", "LASTFOLLOWUPDAYSPROGRESS", "PROGRESSED")) %>%
    names

  risk_progress_biomarker_variables_total <- vector("list", 104)
  names(risk_progress_biomarker_variables_total) <- 104:1
  error_rate_tail_progress_biomarker <- rep(NA, 104)

  for (i in 1:104) {
    set.seed(4182019)
    risk_progress_biomarker_variables_total[[i]] <- risk_progress_biomarker_variables
    rf_risk_progress_biomarker <- rfsrc(
      Surv(LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) ~ .,
      data = risk_full %>%
        select(risk_progress_biomarker_variables, LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) %>%
        mutate_if(is.factor, as.numeric),
      ntree = 1000,
      importance = TRUE
    )
    error_rate_tail_progress_biomarker[i] <- tail(rf_risk_progress_biomarker$err.rate, n = 1)

    rf_risk_progress_biomarker_importance <- rf_risk_progress_biomarker$importance %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      as.tibble() %>%
      dplyr::rename(VIMP = ".") %>%
      arrange(desc(VIMP))

    risk_progress_biomarker_variables <- rf_risk_progress_biomarker_importance %>%
      head((dim(rf_risk_progress_biomarker_importance)[1] - 1)) %>%
      # top_n((dim(rf_risk_progress_biomarker_importance)[1]-1)) %>%
      pull(rowname)
    print(i)
  }

  tibble_error_rate_tail_progress_biomarker <- tibble(
    n = 104:1,
    error_rate = error_rate_tail_progress_biomarker
  )
  suppressMessages(n_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>%
                     top_n(-1) %>% pull(n))
  suppressMessages(error_rate_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>%
                     top_n(-1) %>% pull(error_rate))

  variable_selected_progress_biomarkers[[j]] <- str_replace_all(
    risk_progress_biomarker_variables_total[[105 - n_min_progress_biomarker]], "_", "")
  error_rate_min_progress_biomarkers[j] <- error_rate_min_progress_biomarker
  print(paste("Finish", j))
}
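Reading the loop without access to the risk_full data: the inner loop performs backward elimination by variable importance (fit a survival random forest, record its final out-of-bag error, drop the least important variable, repeat until one variable remains), and the outer loop repeats that procedure 50 times, each time keeping the variable set that achieved the lowest OOB error. A schematic sketch of the inner-loop pattern, with hypothetical helper functions standing in for the rfsrc() calls:

# Schematic only - fit_model(), oob_error() and rank_by_importance()
# are hypothetical stand-ins, not functions from the question.
vars <- all_candidate_variables
oob  <- rep(NA_real_, length(vars))
sets <- vector("list", length(vars))
for (i in seq_along(oob)) {
  sets[[i]] <- vars                    # remember the current variable set
  fit       <- fit_model(vars)         # fit with the current variables
  oob[i]    <- oob_error(fit)          # record the OOB error of this fit
  vars      <- head(rank_by_importance(fit), length(vars) - 1)  # drop the worst
}
best <- which.min(oob)
sets[[best]]   # the variable set with the lowest OOB error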