I am using the awesome RWeka package to fit a SMOreg model as implemented in Weka. While everything is working fine, I am having trouble extracting the weights from the fitted model.
Like all Weka classifier objects, my model has a nice print method that shows me all the features and their relative weights. However, I am not able to extract these weights in any way.
You can see for yourself by running the following code:
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
model_SMOreg <- SMOreg_classifier(mpg ~ ., data = mtcars)
Now, if you simply call the model
model_SMOreg
you'll see that it prints all the features used in the model with their relative weights. I would like to access those weights as a vector or, even better, as a two-column table with one column containing the feature names and the other containing the weights.
I am working on a Windows 7 x64 system, using RStudio Version 1.0.153, R 3.4.2 Short Summer and RWeka 0.4-35.
Does anyone know how to do this?
I think you cannot get this in numeric format.
attr(model_SMOreg, "meta")$class # "Weka_classifier"
getAnywhere("print.Weka_classifier")
Result:
A single object matching ‘print.Weka_classifier’ was found
It was found in the following places
registered S3 method for print from namespace RWeka
namespace:RWeka
with value
function (x, ...)
{
writeLines(.jcall(x$classifier, "S", "toString"))
invisible(x)
}
<bytecode: 0x8328630>
<environment: namespace:RWeka>
So we see: print.Weka_classifier() calls writeLines() on the result of an rJava::.jcall() call, which invokes the Java object's toString() method and returns a single string.
Thus, I think you need to parse the weights out of that text yourself, for example by capturing the printed output with the capture.output() function.
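A minimal sketch of that parsing step, using only base R. It assumes each weight line has the shape "sign, number, *, feature name" and that the intercept is a line with only a sign and a number. The sample text is hypothetical, not real model output, and parse_smoreg_lines is an illustrative name, not part of RWeka.

```r
# Hypothetical sample of the printed model text; real values will differ.
sample_lines <- c(
  "SMOreg",
  "",
  "weights (not support vectors):",
  " -       0.1448 * (normalized) cyl",
  " +       0.0689 * (normalized) disp",
  " -       0.3012 * (normalized) wt",
  " +       0.6728",
  "",
  "Number of kernel evaluations: 528 (93.562% cached)"
)

parse_smoreg_lines <- function(lines) {
  # Group 1: sign, group 2: number, group 3: whatever follows (may be empty)
  pattern <- "^\\s*([+-])\\s+([0-9.eE+-]+)\\s*(.*)$"
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) > 0]                 # keep only matching lines
  sign  <- ifelse(vapply(hits, `[`, "", 2) == "-", -1, 1)
  value <- as.numeric(vapply(hits, `[`, "", 3))
  rest  <- trimws(vapply(hits, `[`, "", 4))
  name  <- sub("^\\*\\s*", "", rest)              # drop the leading "* "
  name[name == ""] <- "intercept"                 # a line without a name is the intercept
  data.frame(feature = name, weight = sign * value, stringsAsFactors = FALSE)
}

weights <- parse_smoreg_lines(sample_lines)
print(weights)
```

With a fitted model, the input lines would presumably come from capture.output(print(model_SMOreg)); if the real layout differs from the assumed one, the pattern needs adjusting.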
Based on @knb's suggestion, I have written a function that extracts the weights from a SMOreg model and returns a tibble with one column for the feature names and one for the feature weights, with the rows sorted by descending absolute weight.
Note that this function only works for the SMOreg classifier, as the output of other classifiers is slightly different in terms of layout. However, I think the function can be easily adapted for other classifiers.
library(stringr)
library(tidyverse)
extract_weights_from_SMOreg <- function(model) {
oldw <- getOption("warn")
options(warn = -1)
raw_output <- capture.output(model)
trimmed_output <- raw_output[-c(1:3,(length(raw_output) - 4): length(raw_output))]
df <- tibble(features_name = vector(length = length(trimmed_output) + 1, "character"),
features_weight = vector(length = length(trimmed_output) + 1, "numeric"))
for (line in 1:length(trimmed_output)) {
string_as_vector <- trimmed_output[line] %>%
str_split(string = ., pattern = " ") %>%
unlist(.)
numeric_element <- trimmed_output[line] %>%
str_split(string = ., pattern = " ") %>%
unlist(.) %>%
as.numeric(.)
position_mul <- string_as_vector[is.na(numeric_element)] %>%
str_detect(string = ., pattern = "[*]") %>%
which(.)
numeric_element <- numeric_element %>%
`[`(., c(1:position_mul))
text_element <- string_as_vector[is.na(numeric_element)]
there_is_plus <- string_as_vector[is.na(numeric_element)] %>%
str_detect(string = ., pattern = "[+]") %>%
sum(.)
if (there_is_plus) { sign_is <- "+"} else { sign_is <- "-"}
feature_weight <- numeric_element[!is.na(numeric_element)]
if (sign_is == "-") {df[line, "features_weight"] <- feature_weight * -1} else {df[line, "features_weight"] <- numeric_element[!(is.na(numeric_element))]}
df[line, "features_name"] <- paste(text_element[(position_mul + 1): length(text_element)], collapse = " ")
}
intercept_line <- raw_output[length(raw_output) - 4]
there_is_plus_intercept <- intercept_line %>%
str_detect(string = ., pattern = "[+]") %>%
sum(.)
if (there_is_plus_intercept) { intercept_sign_is <- "+"} else { intercept_sign_is <- "-"}
numeric_intercept <- intercept_line %>%
str_split(string = ., pattern = " ") %>%
unlist(.) %>%
as.numeric(.) %>%
`[`(., length(.))
df[nrow(df), "features_name"] <- "intercept"
if (intercept_sign_is == "-") {df[nrow(df), "features_weight"] <- numeric_intercept * -1} else {df[nrow(df), "features_weight"] <- numeric_intercept}
options(warn = oldw)
df <- df %>%
arrange(desc(abs(features_weight)))
return(df)
}
Here is an example for one model:
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
mpg_model_weights <- extract_weights_from_SMOreg(SMOreg_classifier(data = mtcars, mpg ~ .))
mpg_model_weights
I am using the R programming language. I am trying to follow the R tutorial over here on neural networks (lstm) and time series: https://blogs.rstudio.com/ai/posts/2018-06-25-sunspots-lstm/
I decided to create my own time series data ("y.mon") for this tutorial (the same format and the same variable names) :
library(tidyverse)
library(glue)
library(forcats)
library(timetk)
library(tidyquant)
library(tibbletime)
library(cowplot)
library(recipes)
library(rsample)
library(yardstick)
library(keras)
library(tfruns)
library(dplyr)
library(lubridate)
library(tibbletime)
library(timetk)
index = seq(as.Date("1749/1/1"), as.Date("2016/1/1"),by="day")
index <- format(as.Date(index), "%Y/%m/%d")
value <- rnorm(97520,27,2.1)
final_data <- data.frame(index, value)
y.mon<-aggregate(value~format(as.Date(index),
format="%Y/%m"),data=final_data, FUN=sum)
y.mon$index = y.mon$`format(as.Date(index), format = "%Y/%m")`
y.mon$`format(as.Date(index), format = "%Y/%m")` = NULL
y.mon %>%
mutate(index = paste0(index, '/01')) %>%
tk_tbl() %>%
mutate(index = as_date(index)) %>%
as_tbl_time(index = index) -> y.mon
From here on, I follow the instructions in the tutorial (replacing the "sun_spots" data with "y.mon"). Everything works fine until this point (I posted a question yesterday that got closed for being too detailed, https://stackoverflow.com/questions/65527230/r-error-in-is-symbolx-object-not-found-keras - the code can be followed from the RStudio tutorial):
#ERROR
coln <- colnames(compare_train)[4:ncol(compare_train)]
cols <- map(coln, quo(sym(.)))
rsme_train <-
map_dbl(cols, function(col)
rmse(
compare_train,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>% mean()
rsme_train
Error in is_symbol(x) : object '.' not found
I found another Stack Overflow post that deals with a similar problem: "Getting error message while calculating rmse in a time series analysis".
According to this stackoverflow post, this first error can be resolved like this:
coln <- colnames(compare_train)[4:ncol(compare_train)]
rsme_train <-
map_df(coln, function(col)
rmse(
compare_train,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>%
pull(.estimate) %>%
mean()
rsme_train
However, the following section of the tutorial has a similar section in which the same error persists even after applying the corrections:
compare_test %>% write_csv(str_replace(model_path, ".hdf5", ".test.csv"))
compare_test[FLAGS$n_timesteps:(FLAGS$n_timesteps + 10), c(2, 4:8)] %>% print()
cols <- map(coln, quo(sym(.)))
rsme_test <-
map_dbl(cols, function(col)
rmse(
compare_test,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>% mean()
rsme_test
#errors:
Error in stri_replace_first_regex(string, pattern, fix_replacement(replacement), :
object 'model_path' not found
Error in is_symbol(x) : object '.' not found
These errors are preventing me from finishing the rest of the tutorial.
Can someone please show me how to fix these?
Thanks
Try passing coln directly to map_dbl, without building cols first:
rsme_test <- map_dbl(coln, function(col)
rmse(
compare_test,
truth = value,
estimate = !!col,
na.rm = TRUE
)) %>% mean()
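If the tidy-evaluation step keeps causing trouble, the same quantity can be computed without it. This is a minimal base R sketch that swaps yardstick's rmse() for a hand-written helper; compare_toy and rmse_base are made-up stand-ins, and the column layout (three leading columns, then one column per prediction) is assumed from the question. Note that recent yardstick versions return a tibble from rmse(), which is why the fix quoted in the question pulls .estimate.

```r
# Toy stand-in for compare_test: three leading columns, then prediction columns.
compare_toy <- data.frame(
  index  = 1:5,
  value  = c(10, 12, 11, 13, 12),
  key    = "actual",
  pred_1 = c(10.5, 11.5, 11, 14, NA),
  pred_2 = c(9, 12, 12, 13, 12)
)

# Root mean squared error, ignoring rows where either side is NA.
rmse_base <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2, na.rm = TRUE))
}

coln <- colnames(compare_toy)[4:ncol(compare_toy)]

# One RMSE per prediction column, looked up by name with [[ ]].
rmse_per_col <- vapply(
  coln,
  function(col) rmse_base(compare_toy$value, compare_toy[[col]]),
  numeric(1)
)
rmse_mean <- mean(rmse_per_col)
print(rmse_per_col)
```

Because the columns are addressed by string, no sym() or !! is needed.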
I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I already built a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
I then created tables for each characterization and created okviz filters for them.
Then I tried to predict whether a customer gets predicted as a drug user or a non-drug user based on the selections in the okviz filters. This is when everything went horribly wrong:
I created a custom R visual for each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then clicked on the diagonal arrow to get to R Studio, this happens: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you can help me
You made an error on line 34 and on line 25.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
xcharvec = strsplit(x, " ")[[1]]
xhex = as.hexmode(xcharvec)
xraw = as.raw(xhex)
unserialize(xraw)
}
# R Visual imports tables with read.csv but no argument for strings_as_factors = F.
# This means some of the chunks are truncated (ie if they had a " " at the end).
# If you convert to a character and add a space if nchar == 9999 the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
mutate( model_str = as.character(model_str) ) %>%
mutate( model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str) )
model_vct <- dataset %>%
filter(model_id == 1) %>%
distinct(model_id, chunk_id, model_str) %>%
arrange(model_id, chunk_id) %>%
pull(model_str)
finalfit.str <- paste( model_vct, collapse = "" )
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>% select(AgeLabel,GenderLabel,CountryLabel,EducationLabel,EthnicityLabel) %>% unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,newdata=data.frame(Age=userdata$AgeLabel,Gender=userdata$GenderLabel,Country=userdata$CountryLabel, Education=userdata$EducationLabel,Ethnicity=userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class =
Good Luck!
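For reference, the byte-string round trip that from_byte_string relies on can be checked outside Power BI. This is a sketch under assumptions: to_byte_string is a hypothetical helper (how the models were originally serialized is not shown), and an lm fit on mtcars stands in for the saved models.

```r
# Hypothetical encoder: R object -> space-separated hex bytes.
to_byte_string <- function(obj) {
  paste(as.character(serialize(obj, connection = NULL)), collapse = " ")
}

# Decoder, same idea as from_byte_string in the script above.
from_byte_string <- function(x) {
  xcharvec <- strsplit(x, " ")[[1]]
  unserialize(as.raw(strtoi(xcharvec, base = 16L)))
}

fit <- lm(mpg ~ wt, data = mtcars)       # stand-in model
fit_str <- to_byte_string(fit)

# Split into chunks of at most 9999 characters, as the script expects.
chunk_size <- 9999
starts <- seq(1, nchar(fit_str), by = chunk_size)
chunks <- substring(fit_str, starts, pmin(starts + chunk_size - 1, nchar(fit_str)))

# Re-assemble in order and deserialize.
restored <- from_byte_string(paste(chunks, collapse = ""))
print(predict(restored, newdata = data.frame(wt = 3)))
```

If this round trip works locally but the visual still fails, the problem is more likely in how the tables are related in the report than in the deserialization.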
I have some code which I'm looking to replicate many times, each for a different country as the suffix.
Assuming 3 countries as a simple example:
country_list <- c('ALB', 'ARE', 'ARG')
I'm trying to create a series of variables called a_m5_ALB, a_m5_ARE, a_m5_ARG etc which have various functions e.g. addcol or round_df applied to reg_math_ALB, reg_math_ARE, reg_math_ARG etc
for (i in country_list) {
paste("a_m5", i , sep = "_") <- addcol(paste("reg_math", i , sep = "_"))
}
for (i in country_list) {
paste("a_m5", i , sep = "_") <- round_df(paste("reg_math", i , sep = "_"))
}
where addcol and round_df are defined as:
addcol = function(y){
dat1 = mutate(y, p.value = ((1 - pt(q = abs(reg.t.value), df = dof))*2))
return(dat1)
}
round_df <- function(x, digits) {
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
The loop errors when any of the functions are added in brackets before the paste variable part but it works if doing it manually e.g.
a_m5_ALB <- addcol(reg_math_ALB)
Could you please help? I think it's the application of the function in a loop that I'm getting wrong.
Errors:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "character"
Error in round(x[numeric_columns], digits) :
non-numeric argument to mathematical function
Thank you
From your examples, this is a case where everything should live in a single data frame; keeping a separate variable for each country is not the right tool for the job. Say you have your per-country data frames saved as CSV files; you can then rewrite everything as:
library(tidyverse)
country_list <- c('ALB', 'ARE', 'ARG')
read_data <- function(ctry){
read_csv(paste0("/path/to/file/", "reg_math_", ctry)) %>%
add_column(country = ctry)
}
total_df <- map_dfr(country_list, read_data)
total_df %>%
mutate(p.value = (1 - pt(q = abs(reg.t.value), df = dof))*2) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 2)))  # 2 is only an example; `digits` was never defined
And it gives you immediate access to all other dplyr functions that are great for this kind of manipulation.
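If the per-country objects already exist in the session rather than on disk, a named list is a close alternative. The original loop fails because paste() returns a character string, so addcol() and round_df() receive text instead of a data frame, and a paste() call cannot be an assignment target. A minimal base R sketch, with toy data standing in for reg_math_ALB and friends (make_toy is a made-up helper):

```r
country_list <- c("ALB", "ARE", "ARG")

# Toy stand-ins for reg_math_ALB etc.; column names follow the question.
make_toy <- function(seed) {
  set.seed(seed)
  data.frame(reg.t.value = rnorm(3), dof = c(10, 20, 30))
}
reg_math <- setNames(lapply(seq_along(country_list), make_toy), country_list)

# Base R version of addcol(): two-sided p-value from the t statistic.
addcol <- function(y) {
  y$p.value <- (1 - pt(q = abs(y$reg.t.value), df = y$dof)) * 2
  y
}

round_df <- function(x, digits) {
  numeric_columns <- vapply(x, is.numeric, logical(1))
  x[numeric_columns] <- round(x[numeric_columns], digits)
  x
}

# One result per country, addressed as a_m5$ALB, a_m5[["ARE"]], ...
a_m5 <- lapply(reg_math, function(d) round_df(addcol(d), digits = 3))
print(a_m5$ALB)
```

Existing variables can be collected into such a list with mget(paste0("reg_math_", country_list)).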
I wish to write a function to analyze several identical variables in several datasets
I built the function below but it does not work well. I am not sure how to pass a name in a function dynamically. Could someone help?
There are 10 identical variables (testvar1, testvar2,..., testvar10, etc...)
in 15 different datasets (mydata1, mydata2,...mydata15, etc...)
library(readxl)
to_analyze <- function (data="mydata1", var = testvar1) {
#reading my file in
excelfile <- paste(`data`, "xlsx", sep = ".")
dataset_name <- read_excel(excelfile)
#populating the testvar1, testvar2,...
dataset_name$var_interest <- dataset_name$var #this does not work
#I was hoping it would give dataset_name$var_interest <- dataset_name$testvar1
#creating a smaller dataset
eco <- dataset_name %>%
select(id, var_interest) #I want var_interest to be testvar1 (not the value but the name)
##doing some analysis on that dataset
}
#creating another function for all the datasets (15 total)
fct_all <- function(x){
for(i in 1:15){
iq <- as.double(i)
dsn <- paste("mydata", deparse(iq), sep="")
to_analyze(data=dsn, var = x)
}
}
#applying the function for all the variables
all_var <- c(testvar1, testvar2, testvar3)
fct_all(all_var)
data$var1 and data[['var1']] are equivalent. Because the second form takes the variable name as a string, you can easily replace that string with a dynamic variable name:
var <- 'var1'
dataset_name$var_interest <- dataset_name[[var]]
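A small self-contained illustration of the difference, in base R; mydata and pick_var are made-up names for the example:

```r
mydata <- data.frame(
  id = 1:3,
  testvar1 = c(2.5, 3.1, 4.0),
  testvar2 = c("a", "b", "c")
)

# `$` looks for a column literally called "var", so this is NULL.
literal_lookup <- mydata$var

# `[[` evaluates its argument, so the string held in `var` is used.
pick_var <- function(dataset, var) {
  dataset$var_interest <- dataset[[var]]
  dataset[, c("id", "var_interest")]
}

eco <- pick_var(mydata, "testvar1")

# The same function can then be looped over several variable names.
for (v in c("testvar1", "testvar2")) {
  print(pick_var(mydata, v))
}
```

Passing the names as strings also means the vector of variables becomes c("testvar1", "testvar2", ...) rather than unquoted symbols.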
Note that dplyr accepts both strings and symbols, which means you can further simplify your function along these lines:
library(dplyr)
# Using select then rename with the var given as a string
to_analyze <- function (data = "mydata1", var = "testvar1") {
eco <- paste(data, "xlsx", sep = ".") %>%
read_excel() %>%
select(id, all_of(var)) %>%
rename(var_interest = all_of(var))
...
}
For the sake of interest, you could also use quasiquotation, but given that you'll eventually wrap your variables into a vector, I guess it's not that useful. Would be something of the form :
library(dplyr)
to_analyze <- function (data = "mydata1", var = quo(testvar1)){
quo_var <- enquo(var)
eco <- paste(data, "xlsx", sep = ".") %>%
read_excel() %>%
select(id, !!quo_var) %>%
rename(var_interest = !!quo_var)
...
}
This form allows you to call your function with the bare variable name rather than a string: to_analyze(data = "mydata1", var = testvar1). But as said before, it is probably not the most useful in your case.
Thank you @Romain. Your code worked well. I just changed the single quotes to backticks:
var <- `var`
to_analyze <- function (data = "mydata1", var = "testvar1") {
eco <- paste(data, "xlsx", sep = ".") %>%
read_excel() %>%
select(id, var) %>%
rename(var_interest = var)
...
}
The function strip() below tries to produce a brief report on the result of its operation via the tee pipe (%T>%). Because this function is in turn being handed to a wrapper function and then to purrr::pwalk, which will supply it with a bunch of dataframes one by one, I want to get a report of its operation on each dataframe along with the dataframe name; which is to say, the name of the actual dataframe that is supplied to correspond to the formal argument tib in the function below. In the example supplied, this would be "tst_df". I don't know the names in advance of running the function, as they are constructed from the filenames read from disk and various other inputs.
Somewhat to my surprise, I actually have almost all of this working, except for getting the name of the supplied dataframe. In the example below, the code that is supposed to do this is enexpr(XX), but I have also tried expr(XX), and both of these expressions applied to tib or the dot (.), with or without a preceding !!. Also deparse(substitute()) on XX, tib, and ., but without the bang bangs.
I see that the name is stripped initially by pass-by-value, and then again, maybe, by each stage of the pipe, including the T, and again, maybe, by (XX = .) in the anonymous function after the T. But I know R + tidyverse will have a way. I just hope it does not involve providing an integer to count backwards up the call stack.
tst_df <- tibble(A = 1:10, B = 11:20, C=21:30, D = 31:40)
tst_df
################################################################################
# The strip function expects a non-anonymous dataframe, from which it removes
# the rows specified in remove_rows and the columns specified in remove_cols. It
# also prints a brief report; just the df name, length and width.
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL){
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
out <- tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>% (function(XX = .){
function(XX = .)print(
paste0("length of ", enexpr(XX), " = ", nrow(XX), " Width = ", ncol(XX)))
cat("\n")
})
out
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
out_tb
Just save the name of tib at the beginning of your function,
it will be found by your reporter function:
library(dplyr)     # filter(), select()
library(magrittr)  # %T>% (tee pipe)
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL) {
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
tib_name <- as.character(substitute(tib))
report <- function(out) {
cat("output length of", tib_name, "=", nrow(out), ", width =", ncol(out), "\n")
}
tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>%
report
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
output length of tst_df = 6 , width = 2
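One caveat worth checking before handing strip() to a wrapper: substitute() captures the expression written at the immediate call site, not the original variable name. A base R sketch of that behavior, with report_dims and wrapper as made-up names:

```r
report_dims <- function(tib) {
  tib_name <- deparse(substitute(tib))   # expression used by the direct caller
  msg <- sprintf("%s: %d rows, %d cols", tib_name, nrow(tib), ncol(tib))
  message(msg)
  invisible(msg)
}

tst_df <- data.frame(A = 1:10, B = 11:20)

# Called directly, the captured name is the variable name.
msg_direct <- report_dims(tst_df)

# Called through a wrapper, the captured name is the wrapper's argument.
wrapper <- function(d) report_dims(d)
msg_wrapped <- wrapper(tst_df)

# A workaround: pass the name explicitly, e.g. while iterating over a named list.
dfs <- list(tst_df = tst_df)
msgs <- vapply(
  names(dfs),
  function(nm) sprintf("%s: %d rows, %d cols", nm, nrow(dfs[[nm]]), ncol(dfs[[nm]])),
  character(1)
)
```

So when the data frames arrive through purrr::pwalk or another wrapper, supplying the name as an extra argument is likely to be more reliable than recovering it from the call.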