Unexpected outcome in text analysis using R with gutenbergr and cv.glmnet
I hope you're all doing well.
I'm trying to replicate a text analysis exercise in R with the gutenbergr package, and to fit a machine learning model with the glmnet package.
The problem is that the code runs, but it returns the wrong outcome.
I've included the full code below; it isn't very long and it's fully reproducible.
The first part of the code runs fine:
##### Load libraries #####
library(tidyverse)
library(tidytext)
library(udpipe)
library(gutenbergr)
library(rsample)
library(glmnet)
library(broom)      # needed for tidy() on the glmnet fit below
library(yardstick)
##### 1. Load the books from Project Gutenberg ####
twist_tale <- gutenberg_metadata %>%
  filter(
    title %in% c("A Tale of Two Cities", "Oliver Twist"),
    has_text,
    language == "en") %>%
  pull(gutenberg_id) %>%
  gutenberg_download(meta_fields = "title")
##### 2. Remove the blank lines #####
twist_tale <- twist_tale %>%
  filter(text != "")
View(twist_tale)
##### 3. Create the logical variable ######
twist_tale <- twist_tale %>%
  mutate(
    es_two_cities = case_when(
      title == "A Tale of Two Cities" ~ 1L,
      title == "Oliver Twist" ~ 0L
    ),
    line_id = row_number()
  ) %>%
  view()
##### 3.1. Save the results and count the rows ####
twist_tale %>%
  count(title)
##### 4. Prepare the dataset and model #####
dl <- udpipe_download_model(language = "english")
english_model <- udpipe_load_model(dl$file_model)
text <- twist_tale %>%
  select(doc_id = line_id, text)
twist_tale_preprocesado <- udpipe(text, english_model, parallel.cores = 4L)
##### 4b. Create the training split #####
set.seed(1234L)
twist_tale_split <- initial_split(twist_tale)
twist_tale_training <- training(twist_tale_split)
twist_tale_testing <- testing(twist_tale_split)
##### 4a. Lowercase the lemmas and clean out the junk ####
sparse_train_data <- twist_tale_preprocesado %>%
  mutate(lemma = str_to_lower(lemma)) %>%
  anti_join(stop_words, by = c("lemma" = "word")) %>%
  filter(upos %in% c("PUNCT", "SYM", "X", "NUM")) %>%
  mutate(doc_id = as.integer(doc_id)) %>%
  anti_join(twist_tale_testing, by = c("doc_id" = "line_id")) %>%
  count(doc_id, lemma) %>%
  cast_sparse(doc_id, lemma, n)
Up to this point, everything is fine. The problem arises when running the cv.glmnet() model: it returns no error or warning message, but the outcome is a row vector, and I believe it should be a matrix or tibble.
# The code that causes the problem
##### 5a. Exclude the irrelevant results and save the response ####
y <- twist_tale_training %>%
  filter(line_id %in% rownames(sparse_train_data)) %>%
  pull(es_two_cities)
##### 6a. Fit the regularized logistic regression ####
model <- cv.glmnet(sparse_train_data, y, family = "binomial",
                   keep = TRUE, trace.it = 1) # THE PROBLEM IS ACTUALLY HERE
coeficientes <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)
coeficientes %>%
  group_by(estimate > 0) %>%
  slice_max(estimate, n = 5) %>%
  ungroup()
coeficientes %>%
  group_by(estimate > 0) %>%
  slice_max(estimate, n = 5) %>%
  ungroup() %>%
  ggplot() +
  geom_col(aes(x = fct_reorder(term, estimate), y = estimate, fill = estimate > 0)) +
  coord_flip()
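As a quick sanity check for the symptom described above, one can compare the design matrix and the response before fitting. A minimal sketch, using the objects defined in steps 4a and 5a:

# Sketch: cv.glmnet() needs the i-th element of y to describe the
# document in the i-th row of x; matching lengths alone is not enough.
nrow(sparse_train_data)             # documents (rows) in x
length(y)                           # must equal nrow(sparse_train_data)
head(rownames(sparse_train_data))   # order of doc ids in x, to compare against y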
I apologize for the length of the question, but I wanted to show the full context.
I haven't run your code, but you probably need to make sure the x matrix and the y response vector are in the same order. Something along the lines of the following will do that (replacing step 5a and the beginning of your step 6a):
library(udpipe)
# name the response by line_id so it can be matched against the row names of x
y <- setNames(twist_tale_training$es_two_cities, twist_tale_training$line_id)
# dtm_align() subsets and reorders x and y so that they line up row for row
traindata <- dtm_align(x = sparse_train_data, y = y)
model <- cv.glmnet(x = traindata$x, y = traindata$y, family = "binomial",
                   keep = TRUE, trace.it = 1)
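If you'd rather not depend on dtm_align(), the same alignment can be done by indexing a named response vector with the row names of the sparse matrix. A minimal sketch, assuming the objects from the question:

# Sketch: manual x/y alignment via row names (untested).
y <- setNames(twist_tale_training$es_two_cities,
              twist_tale_training$line_id)
y <- y[rownames(sparse_train_data)]   # reorder y to match the rows of x
stopifnot(!anyNA(y), length(y) == nrow(sparse_train_data))
model <- cv.glmnet(sparse_train_data, y, family = "binomial",
                   keep = TRUE, trace.it = 1)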
Related
I need assistance with a nested for-loop function which outputs multiple tables
I'm using the diamonds data set as hypothetical data. I'm trying to compare the performance of two separate models in predicting a binary variable via misclassification error. The binary variable is decided upon arbitrary thresholds, which are set at the beginning of the code (prior to making the models, and both models are run with a different threshold each time). I want to do a for loop so that it sets a different threshold each time the loop runs. My goal is to obtain a table of the misclassification errors for each threshold. Please bear with my inelegant code. Any help is greatly appreciated.

library(tidyverse)
library(rpart)
library(rpart.plot)

data = diamonds
threshold = c(0.20, 0.70, 2.00)

for (i in 1:length(threshold)) {
  data_filtered = data %>%
    select(-c(cut, color, clarity)) %>%
    mutate(carat_dummy = ifelse(carat > threshold, 1, 0)) %>%
    drop_na()

  set.seed(14643)
  data_filtered = data_filtered %>%
    mutate(train_test = sample(c("train", "test"), n(), replace = TRUE, prob = c(0.75, 0.25)))
  data_train = data_filtered %>% filter(train_test == "train")
  data_test = data_filtered %>% filter(train_test == "test")

  # regression
  reg = lm(carat_dummy ~ depth + table + price, data = data_train)

  # decision trees
  data_train$carat_dummy = factor(data_train$carat_dummy)
  trees = rpart(carat_dummy ~ depth + table + price, data = data_train)
  tree_predict = predict(trees, newdata = data_test, type = "class")

  # linear regression error
  predict = predict(reg, data_test, type = "response")
  prediction_reg = data_test %>%
    mutate(prediction = ifelse(predict >= 0.5, 1, 0)) %>%
    select(carat_dummy, prediction)
  error_reg = prediction_reg %>%
    summarise(misclassification = mean(carat_dummy != prediction)) %>%
    add_column("Method" = "Linear Regression") %>%
    select(Method, everything())

  # decision tree error
  prediction_tree = data_test %>%
    mutate(prediction = as.numeric(as.character(tree_predict))) %>%
    select(carat_dummy, prediction)
  error_tree = prediction_tree %>%
    summarise(misclassification = mean(carat_dummy != prediction)) %>%
    add_column("Method" = "Decision Trees") %>%
    select(Method, everything())

  errors_t = rbind(error_reg, error_tree)
}
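Not a verified answer, but one common way to get one table per threshold is to index the threshold inside the loop (threshold[i] instead of threshold) and collect each iteration's result in a pre-allocated list rather than overwriting errors_t. A sketch against the code above:

# Sketch: only the changed lines are shown; the rest of the loop body
# stays as in the question.
error_tables <- vector("list", length(threshold))
for (i in 1:length(threshold)) {
  # ... inside data_filtered, use the i-th threshold:
  #   mutate(carat_dummy = ifelse(carat > threshold[i], 1, 0))
  # ... rest of the loop body unchanged ...
  error_tables[[i]] <- rbind(error_reg, error_tree)
}
names(error_tables) <- threshold
error_tables   # one misclassification table per threshold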
Multivariate time series - is there notation to select all the variables, or do they all have to be written out?
I'm working to build a multivariate time series to make predictions about labor in the United States. The fpp3 package is excellent, but I don't see a notation to model all the variables. For example, in linear regression it's possible to do this:

library(tidyverse)
mtcars.lm <- lm(mpg ~ ., data = mtcars)
summary(mtcars.lm)

to model mpg on all the remaining variables, without having to write them all out explicitly. Is there something similar in time series using the fpp3 package? For example, this returns an error:

library(tidyverse)
library(fpp3)
library(clock)

# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv',
                          col_select = c(Label, Value), show_col_types = FALSE)
All_Employees <- All_Employees %>%
  rename(Month = Label, Total_Employees = Value)
All_Employees <- All_Employees %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month) %>%
  mutate(Total_Employees_Diff = difference(Total_Employees))
index = All_Employees$Month
All_Employees <- All_Employees %>%
  filter((Month >= start_month), (Month <= end_month))

# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0500000003
Average_Hourly_Earnings <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/Average_Hourly_Earnings.csv',
                                    col_select = c(Label, Value), show_col_types = FALSE)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  rename(Month = Label, Avg_Hourly_Earnings = Value)
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month) %>%
  mutate(Avg_Hourly_Earnings_Diff = difference(Avg_Hourly_Earnings))
Average_Hourly_Earnings <- Average_Hourly_Earnings %>%
  filter((Month >= start_month), (Month <= end_month))

Monthly_labor_data_small <- tsibble(
  Month = All_Employees$Month,
  index = Month,
  'Total_Employees' = All_Employees$Total_Employees,
  'Avg_Earnings' = Average_Hourly_Earnings$Avg_Hourly_Earnings
)

start_month_small = yearmonth("2020 Mar")
end_month_small = yearmonth("2022 Jan")
Monthly_labor_data_small <- Monthly_labor_data_small %>%
  filter((Month >= start_month_small), (Month <= end_month_small))

Monthly_labor_data_small %>%
  model(
    linear = TSLM(Total_Employees ~ .,))

The error is:

Error in TSLM(Total_Employees ~ ., ) : unused argument (alist())

But this runs fine if I list everything out:

fit <- Monthly_labor_data_small %>%
  model(
    linear = TSLM(Total_Employees ~ Avg_Earnings + season() + trend()))
report(fit)

The full tsibble will have a large number of columns; is there a short way to list all of them, similar to what can be done in linear regression?
You should be able to do something like:

resp <- "Total_Employees"
form <- reformulate(response = resp,
                    c(setdiff(names(Monthly_labor_data_small), resp), "season()", "trend()"))

and then use form in your model. I haven't tried your examples -- if there are other variables (like a time index) that should not be explicitly included in the model, then the second argument to setdiff() should be c(resp, "excluded_var2", "excluded_var3").
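For reference, a toy illustration of what reformulate() builds (hypothetical variable names):

# reformulate() pastes the term labels into a one-sided sum and attaches
# the response, so function terms like season() pass through unchanged.
reformulate(c("x1", "x2", "season()", "trend()"), response = "y")
#> y ~ x1 + x2 + season() + trend()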
How to deal with external regressors in time series recipes?
In time series forecasting, external regressors can make a big difference. Currently I want to track the effects of external regressors using the modeltime framework. However, I could not find any helpful information on this topic so far; I only found out that you can add regressor variables with a "+" in the recipe formula. After adding the variables Transactions (number of transactions per day and store) and Open_Closed (1 = store is closed, 0 = store is open) to my recipe, I found that there was no effect on the prediction. How can I achieve this? Some reprex data:

suppressPackageStartupMessages(library(modeltime))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(timetk))

#### DATA
data <- data.frame(
  Store = c(rep("1", 365), rep("2", 365)),
  Sales = c(seq(1, 44, length.out = 365)),
  Date = c(dates <- ymd("2013-01-01") + days(0:364)),
  Transactions = c(seq(50, 100, length.out = 365)),
  Open_Closed = sample(rep(0:1, each = 365))
)

h = 42

# split
set.seed(234)
splits <- time_series_split(data, assess = "42 days", cumulative = TRUE)

# recipe
recipe_spec <- recipe(Sales ~ Date + Transactions + Open_Closed, data) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(iso$)|(xts$)|(day)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal())
recipe_spec %>% prep() %>% juice()

#### MODELS
# elnet
model_spec_glmnet <- linear_reg(penalty = 1) %>%
  set_engine("glmnet")

wflw_fit_glmnet <- workflow() %>%
  add_model(model_spec_glmnet) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# xgboost
model_spec_xgboost <- boost_tree("regression", learn_rate = 0.35) %>%
  set_engine("xgboost")

set.seed(123)
wflw_fit_xgboost <- workflow() %>%
  add_model(model_spec_xgboost) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(splits))

# sub tbl
submodels_tbl <- modeltime_table(
  wflw_fit_glmnet,
  wflw_fit_xgboost
)

submodels_tbl %>%
  modeltime_accuracy(testing(splits)) %>%
  table_modeltime_accuracy(.interactive = FALSE)
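Not an answer to the forecasting question itself, but one way to check whether the regressors enter the fitted model at all is to pull the underlying parsnip fit out of the workflow and inspect its coefficients. A sketch, assuming the wflw_fit_glmnet object from the reprex above:

# Sketch (untested against the reprex): look for non-zero coefficients
# on the external regressors in the elastic-net fit.
wflw_fit_glmnet %>%
  extract_fit_parsnip() %>%   # from the workflows package
  tidy() %>%                  # coefficients at the specified penalty
  filter(term %in% c("Transactions", "Open_Closed"))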
How to handle forecast data (melt and "unmelt") generated by modeltime prediction - lost variables
Below I created some fake forecast data using the tidyverse and modeltime packages. I have monthly data starting in 2016 and want to produce a test forecast for 2020. As you can see, the data I load comes in wide format; for use in modeltime I transform it to long data. After the modeling phase, I want to create a data frame for the 2020 prediction values. For this purpose I need to somehow "unmelt" the data, and in this process I am unfortunately losing a lot of variables: from the 240 variables that I want to forecast, I get only 49 in the end result. Maybe I am blind, or I do not know how to configure the modeltime functions correctly. I would really appreciate some help. Thanks in advance!

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))

## create some senseless data to produce forecasts on...
dates <- ymd("2016-01-01") + months(0:59)
fake_values <- c(661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239)
replicate <- rep(1, 60) %*% t.default(fake_values)
replicate <- as.data.frame(replicate)
df <- bind_cols(replicate, dates) %>% rename(c(dates = ...241))

## melt it down
data <- reshape2::melt(df, id.var = 'dates')

## make some senseless forecast on senseless data...
split_obj <- initial_time_split(data, prop = 0.8)

model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(value ~ dates, data = training(split_obj))

## model table
models_tbl_prophet <- modeltime_table(model_fit_prophet)

## calibration
calibration_tbl_prophet <- models_tbl_prophet %>%
  modeltime_calibrate(new_data = testing(split_obj))

## forecast
fc_prophet <- calibration_tbl_prophet %>%
  modeltime_forecast(
    new_data = testing(split_obj),
    actual_data = data,
    keep_data = TRUE
  )

## "unmelt" it again
fc_prophet <- fc_prophet %>% filter(str_detect(.key, "prediction"))
fc_prophet <- fc_prophet[, c(4, 9, 10)]
fc_prophet <- dplyr::filter(fc_prophet, .index >= "2020-01-01", .index <= "2020-12-01")
#fc_prophet <- fc_prophet %>% subset(fc_prophet, as.character(.index) > "2020-01-01" & as.character(.index) < "2020-12-01")

fc_wide_prophet <- fc_prophet %>%
  pivot_wider(names_from = variable, values_from = value)
Here is my full solution. I have also provided background on what I'm doing here: https://github.com/business-science/modeltime/issues/133

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(modeltime))
library(timetk)

## create some senseless data to produce forecasts on...
dates <- ymd("2016-01-01") + months(0:59)
fake_values <- c(661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239,
                 661,678,1094,1987,3310,2105,1452,983,1107,805,675,684,436,514,668,206,19,23,365,456,1174,1760,735,366,
                 510,580,939,1127,2397,1514,1370,832,765,661,497,328,566,631,983,1876,2784,2928,2543,1508,1175,8,1733,
                 862,779,1112,1446,2407,3917,2681,2397,1246,1125,1223,1234,1239)
replicate <- rep(1, 60) %*% t.default(fake_values)
replicate <- as.data.frame(replicate)
df <- bind_cols(replicate, dates) %>% rename(c(dates = ...241))

## melt it down
data <- reshape2::melt(df, id.var = 'dates')
data %>% as_tibble() -> data

data %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_time_series(dates, value, .facet_ncol = 3, .smooth = F)

## make some senseless forecast on senseless data...
split_obj <- initial_time_split(data, prop = 0.8)
split_obj %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dates, value)

split_obj_2 <- time_series_split(data, assess = "1 year", cumulative = TRUE)
split_obj_2 %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(dates, value)

model_fit_prophet <- prophet_reg() %>%
  set_engine(engine = "prophet") %>%
  fit(value ~ dates, data = training(split_obj))

## model table
models_tbl_prophet <- modeltime_table(model_fit_prophet)

## calibration
calibration_tbl_prophet <- models_tbl_prophet %>%
  modeltime_calibrate(new_data = testing(split_obj_2))

## forecast
fc_prophet <- calibration_tbl_prophet %>%
  modeltime_forecast(
    new_data = testing(split_obj_2),
    actual_data = data,
    keep_data = TRUE
  )

fc_prophet %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_modeltime_forecast(.facet_ncol = 3)

## the old "unmelt" attempt, kept for reference
# fc_prophet <- fc_prophet %>% filter(str_detect(.key, "prediction"))
# fc_prophet <- fc_prophet[, c(4, 9, 10)]
# fc_prophet <- dplyr::filter(fc_prophet, .index >= "2020-01-01", .index <= "2020-12-01")
# fc_wide_prophet <- fc_prophet %>%
#   pivot_wider(names_from = variable, values_from = value)

# Make a future forecast
refit_tbl_prophet <- calibration_tbl_prophet %>%
  modeltime_refit(data = data)

future_fc_prophet <- refit_tbl_prophet %>%
  modeltime_forecast(
    new_data = data %>% group_by(variable) %>% future_frame(.length_out = "1 year"),
    actual_data = data,
    keep_data = TRUE
  )

future_fc_prophet %>%
  filter(as.numeric(variable) %in% 1:9) %>%
  group_by(variable) %>%
  plot_modeltime_forecast(.facet_ncol = 3)

# Reformat as wide
future_wide_tbl <- future_fc_prophet %>%
  filter(.key == "prediction") %>%
  select(.model_id, .model_desc, dates, variable, .value) %>%
  pivot_wider(
    id_cols = c(.model_id, .model_desc, dates),
    names_from = variable,
    values_from = .value
  )
future_wide_tbl[names(df)]
R Explaining Random Forest Variable Selection Sample Code
I have some sample code for random forest variable selection. We want to choose the combination of variables with the most importance and build the random forest model with the lowest OOB error. Can anyone explain the for-loop part of the function for me?

clinical_variables <- c("Age", "location", "smoke", "perianal_disease", "upper_tract",
                        "LnASCA IgA", "LnASCA IgG", "LnANCA", "LnCbir", "LnOMPC",
                        "CRP", "Albumin", "African American Race")

variable_selected_progress_biomarkers <- vector("list", 50)
error_rate_min_progress_biomarkers <- rep(NA, 50)

for (j in 1:50) {
  risk_progress_biomarker_variables <- risk_full %>%
    select(names(risk), clinical_variables) %>%
    select(-c("STRICTURE", "TIM2STRICTURE", "PENETRATING", "TIM2PENETRATING",
              "BDNF", "LASTFOLLOWUPDAYSPROGRESS", "PROGRESSED")) %>%
    names

  risk_progress_biomarker_variables_total <- vector("list", 104)
  names(risk_progress_biomarker_variables_total) <- 104:1
  error_rate_tail_progress_biomarker <- rep(NA, 104)

  for (i in 1:104) {
    set.seed(4182019)
    risk_progress_biomarker_variables_total[[i]] <- risk_progress_biomarker_variables
    rf_risk_progress_biomarker <- rfsrc(
      Surv(LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) ~ .,
      data = risk_full %>%
        select(risk_progress_biomarker_variables, LASTFOLLOWUPDAYSPROGRESS, PROGRESSED) %>%
        mutate_if(is.factor, as.numeric),
      ntree = 1000,
      importance = TRUE
    )
    error_rate_tail_progress_biomarker[i] <- tail(rf_risk_progress_biomarker$err.rate, n = 1)

    rf_risk_progress_biomarker_importance <- rf_risk_progress_biomarker$importance %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      as.tibble() %>%
      dplyr::rename(VIMP = ".") %>%
      arrange(desc(VIMP))

    risk_progress_biomarker_variables <- rf_risk_progress_biomarker_importance %>%
      head((dim(rf_risk_progress_biomarker_importance)[1] - 1)) %>%
      # top_n((dim(rf_risk_progress_biomarker_importance)[1]-1)) %>%
      pull(rowname)
    print(i)
  }

  tibble_error_rate_tail_progress_biomarker <- tibble(
    n = 104:1,
    error_rate = error_rate_tail_progress_biomarker
  )
  suppressMessages(n_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>%
                     top_n(-1) %>% pull(n))
  suppressMessages(error_rate_min_progress_biomarker <- tibble_error_rate_tail_progress_biomarker %>%
                     top_n(-1) %>% pull(error_rate))

  variable_selected_progress_biomarkers[[j]] <- str_replace_all(
    risk_progress_biomarker_variables_total[[105 - n_min_progress_biomarker]], "_", "")
  error_rate_min_progress_biomarkers[j] <- error_rate_min_progress_biomarker
  print(paste("Finish", j))
}
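Reading the loop without access to the risk_full data: the inner loop performs backward elimination by variable importance (fit a survival random forest, record its final out-of-bag error, drop the least important variable, repeat until one variable remains), and the outer loop repeats that procedure 50 times, each time keeping the variable set that achieved the lowest OOB error. A schematic sketch of the inner-loop pattern, with hypothetical helper functions standing in for the rfsrc() calls:

# Schematic only - fit_model(), oob_error() and rank_by_importance()
# are hypothetical stand-ins, not functions from the question.
vars <- all_candidate_variables
oob  <- rep(NA_real_, length(vars))
sets <- vector("list", length(vars))
for (i in seq_along(oob)) {
  sets[[i]] <- vars                    # remember the current variable set
  fit       <- fit_model(vars)         # fit with the current variables
  oob[i]    <- oob_error(fit)          # record the OOB error of this fit
  vars      <- head(rank_by_importance(fit), length(vars) - 1)  # drop the worst
}
best <- which.min(oob)
sets[[best]]   # the variable set with the lowest OOB error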