I desperately need help!
I am trying to predict drug use based on 5 characteristics: Age, Gender, Education, Ethnicity, Country. I already built a tree model in R with rpart
DrugTree3 <- rpart(formula = DrugUser ~ Age+Gender+Education+Ethnicity+Country, data = traindata)
, a logistic regression model
DrugLog <- glm(formula = DrugUser ~ Age+Gender+Ethnicity+Education+Country,data = traindata, family = binomial)
, and a knn model
KnnModel <- train(form = DrugUser~., data = ModelData,method ='knn',tuneGrid=expand.grid(.k=1:100),metric='Accuracy',trControl=trainControl(method='repeatedcv',number=10,repeats=10)) .
I saved those as RDS files and uploaded them successfully in Power BI.
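(For context: the R visuals below rebuild each model from a table of hex-string chunks, so the upload side presumably looked something like the sketch below. The to_byte_string helper, the 10,000-character chunk size, and the column names are assumptions inferred from the preamble columns and the nchar == 9999 check in the script.)
# Sketch: serialize a fitted model to a space-separated hex string and split it into chunks
to_byte_string <- function(model) {
  paste(as.character(serialize(model, connection = NULL)), collapse = " ")
}
model_str <- to_byte_string(DrugTree3)
starts <- seq(1, nchar(model_str), by = 10000)
chunks <- substring(model_str, starts, pmin(starts + 9999, nchar(model_str)))
model_table <- data.frame(model_id = 1,
                          chunk_id = seq_along(chunks),
                          model_str = chunks,
                          stringsAsFactors = FALSE)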
I then created tables for each characteristic and created okviz filters for them.
Then I tried to predict whether a customer is classified as a drug user or a non-drug user based on the selections in the okviz filters. This is when everything went horribly wrong:
I created a custom R visual for each model prediction and inserted the following code in each visual:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
  xcharvec = strsplit(x, " ")[[1]]
  xhex = as.hexmode(xcharvec)
  xraw = as.raw(xhex)
  unserialize(xraw)
}
# The R visual imports tables with read.csv but without stringsAsFactors = FALSE.
# This means some of the chunks are truncated (i.e. if they had a " " at the end).
# If you convert to character and add a space back when nchar == 9999, the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
  mutate(model_str = as.character(model_str)) %>%
  mutate(model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str))
model_vct <- dataset %>%
  filter(model_id == 1) %>%
  distinct(model_id, chunk_id, model_str) %>%
  arrange(model_id, chunk_id) %>%
  pull(model_str)
finalfit.str <- paste(model_vct, collapse = "")
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>%
  select(AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel) %>%
  unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,
                        newdata = data.frame(Age = userdata$AgeLabel,
                                             Gender = userdata$GenderLabel,
                                             Country = userdata$CountryLabel,
                                             Education = userdata$EducationLabel,
                                             Ethnicity = userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Error: Can't determine relationship between fields.
What has gone wrong here?
When I then clicked on the diagonal arrow to get to R Studio, this happens: Unable to construct R script data for use in external R IDE.
I need help as I am literally going crazy over this and I don't know how to resolve the issue! I would be really happy if you could help me.
You made an error in line 34 and line 25.
Below is a fixed version of your code.
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset <- data.frame(chunk_id, model_id, model_str, AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel)
# dataset <- unique(dataset)
# Paste or type your script code here:
library(dplyr)
from_byte_string = function(x) {
  xcharvec = strsplit(x, " ")[[1]]
  xhex = as.hexmode(xcharvec)
  xraw = as.raw(xhex)
  unserialize(xraw)
}
# The R visual imports tables with read.csv but without stringsAsFactors = FALSE.
# This means some of the chunks are truncated (i.e. if they had a " " at the end).
# If you convert to character and add a space back when nchar == 9999, the deserialization works.
# (Thanks to Danny Shah)
dataset <- dataset %>%
  mutate(model_str = as.character(model_str)) %>%
  mutate(model_str = ifelse(nchar(model_str) == 9999, paste0(model_str, " "), model_str))
model_vct <- dataset %>%
  filter(model_id == 1) %>%
  distinct(model_id, chunk_id, model_str) %>%
  arrange(model_id, chunk_id) %>%
  pull(model_str)
finalfit.str <- paste(model_vct, collapse = "")
finalfit <- from_byte_string(finalfit.str)
# get the user parameters
userdata <- dataset %>%
  select(AgeLabel, GenderLabel, CountryLabel, EducationLabel, EthnicityLabel) %>%
  unique()
# and then using them to make a prediction
myprediction <- predict(finalfit,
                        newdata = data.frame(Age = userdata$AgeLabel,
                                             Gender = userdata$GenderLabel,
                                             Country = userdata$CountryLabel,
                                             Education = userdata$EducationLabel,
                                             Ethnicity = userdata$EthnicityLabel))
maxpred <- which(myprediction==max(myprediction))
myclass <- maxpred - 1
myprob <- myprediction[[maxpred]]
plot.new()
text(0.5,0.5,labels=sprintf("P(class = %s) = %s",myclass,as.character(round(myprob,2))),cex=3.5)
Good Luck!
Related
I have been trying to understand how to use possibly() to wrap a lambda/anonymous function within map_dfr() so that my iterations continue should an error be encountered. I am currently iterating over a large number of webpages and scraping them with rvest; however, some are not compiled correctly or simply do not work. I would like to note the error so that I can return to it later, while continuing to collect data from the remaining webpages. My current code is posted below, in addition to what I've tried:
df <- tibble(df, map_dfr(df$link, ~ {
  # Replicate Human Input by Forcing Random Pauses
  Sys.sleep(runif(1, 1, 3))
  # Read in the html links
  url <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
  # Full Job Description Text
  description <- url %>%
    html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
    html_text() %>% tolower()
  description <- as.character(description)
  # Hiring Insights
  hiring_insights <- url %>%
    html_elements(xpath = "//div[@id = 'hiringInsightsSectionRoot']") %>%
    html_text() %>% str_extract("#REGEX") %>%
    str_extract("#REGEX") %>%
    str_trim()
  hiring_insights <- as.character(hiring_insights)
  ### Extract Number of Hires
  hiring_insights <- str_trim(str_extract(hiring_insights, "#REGEX"))
  hiring_insights <- tolower(hiring_insights)
  ### Fill in all Missing Values with 1
  hiring_insights[which(is.na(hiring_insights))] <- "1"
  tibble(description, hiring_insights)
}))
I have tried wrapping the lambda function a few different ways but without success:
# First Attempt
df <- tibble(df, map_dfr(df$link, possibly(~ {——}, otherwise = "error")))
# Second Attempt
df <- tibble(df, map_dfr(df$link, possibly(function(x) {——}, otherwise = "error")))
# Third Attempt
df <- tibble(df, possibly(map_dfr(df$link, ~ {——}), otherwise = "error"))
# Fourth Attempt
df <- tibble(df, possibly(map_dfr(df$link, function(x) {——}), otherwise = "error"))
When writing the function with function(x) rather than with ~, I update .x to x within the lambda function when defining the url variable. However, with each of these iterations I eventually encounter a bad link and receive an HTTP 403 error, which stops the iteration and discards all of the data scraped from the previous pages. What I would like is to either have a dummy variable noting whether or not the link was bad and, if it is bad, fill in the scraped variables with whatever the otherwise argument is set to. Thank you in advance! I've really hit a wall here.
map_dfr() expects a dataframe or named vector on every iteration. Your otherwise value isn’t named, so it throws an error. To illustrate:
library(purrr)
vals <- list(1, 2, "bad", 4, 5)
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = NA_real_
  )
)
Error in `dplyr::bind_rows()`:
! Argument 3 must have names.
But if you change otherwise to return a dataframe:
map_dfr(
  vals,
  possibly(
    ~ data.frame(x = .x^2),
    otherwise = data.frame(x = NA_real_)
  )
)
   x
1  1
2  4
3 NA
4 16
5 25
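Applied to your scraping code, the same idea might look roughly like this (untested sketch; scrape_one and the bad_link flag are names I made up, and I have kept only the description part for brevity):
library(rvest)
library(httr)
library(purrr)
library(tibble)
# Fallback returns a one-row tibble with the same columns, flagging the bad link
scrape_one <- possibly(
  ~ {
    Sys.sleep(runif(1, 1, 3))
    page <- .x %>% html_session(user_agent(user_agents)) %>% read_html()
    description <- page %>%
      html_elements(xpath = "//div[@id = 'jobDescriptionText']") %>%
      html_text() %>% paste(collapse = " ") %>% tolower()
    tibble(description = description, bad_link = FALSE)
  },
  otherwise = tibble(description = NA_character_, bad_link = TRUE)
)
df <- tibble(df, map_dfr(df$link, scrape_one))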
I have a list of species and I am running an ensemble SDM modelling function on the dataset, filtering by each species, to give an ensemble SDM per species.
I have used the purrr package to get it running, and the code works fine when there is no naming convention added in. However, when it outputs the Ensemble.SDM for each species, they are all named the same thing ("ensemble.sdm"), so when I want to stack them I cannot, because the names clash.
I would like to be able to name each output of the model something different, ideally linked to the species name picked out in the line: data <- Occ_full %>% filter(NAME == .x)
The working code is written below:
list_of_species <- unique(unlist(Occ_full$NAME))
# Return unique values
output <- purrr::map(limit_list_of_species, ~ {
  data <- Occ_full %>% filter(NAME == .x)
  ensemble_modelling(c('GAM'), data, Env_Vars,
                     Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1)
})
The code I have tried to get it named within is below, but it does not work: it names each output with lots of repetitions of the row number.
output <- purrr::map(limit_list_of_species, ~ {
  data <- Occ_full %>% filter(NAME == .x)
  label <- as.character(data)
  ensemble_modelling(c('GAM'), data, Env_Vars,
                     Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1, name = label)
})
Could anyone help me please? I simply want each "output" to be named with the species name specified in the filter. Thank you
Try using split with imap -
list_of_species <- split(Occ_full, Occ_full$NAME)
output <- purrr::imap(list_of_species, ~ {
  ensemble_modelling(c('GAM'), .x, Env_Vars, Xcol = 'LONGITUDE',
                     Ycol = 'LATITUDE', rep = 1, name = .y)
})
split ensures that list_of_species is a named list (one element per species), and imap then passes each name to the function as .y.
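A toy illustration of the idea (mtcars standing in for the occurrence data):
list_by_group <- split(mtcars, mtcars$cyl)   # a named list: "4", "6", "8"
purrr::imap(list_by_group, ~ paste0("modelling group ", .y, " with ", nrow(.x), " rows"))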
So basically I am trying the following loop:
rawData = read.csv(file = "SampleData.csv")
companySplit = split(rawData, rawData$Company)
NameOfCompany <- numeric()
DateOfOrder <- character()
WhichProducts <- numeric()
for (i in 1:length(companySplit)){
  company_DateSplit = split(companySplit[[i]], companySplit[[i]]$Date)
  for (j in 1:length(company_DateSplit)){
    WhichProducts[j] <- (paste0(company_DateSplit[[j]]$ID, collapse=","))
    DateOfOrder[j] <- (paste0(company_DateSplit[[j]]$Date[1]))
    NameOfCompany[j] <- (paste0(companySplit[[i]]$Company[[1]]))
  }
}
df <- data.frame(NameOfCompany,DateOfOrder, WhichProducts)
write.csv(df, file = "basket.csv")
If you check basket.csv, there is output only for company D. I guess it is not writing the rest because of the nesting of the for loops, and I am not able to get out of it.
I need exact output as basket.csv but for all companies.
Here are the CSVs:
Input Data: Link
Output of code basket.csv: Link
The output should look like this:
Company, Date, all IDs comma separated
e.g.
A,Jan-18,(1,2,4)
A,Feb-18,(1,4)
B,Jan-18,(2,3,4)
I'm able to get this from the above code, but I am not able to save it in a CSV for all companies A, B, C and D. It saves values only for company D, which is the last value in the loop (check the output file link).
The initial error is that you import your data without the parameter stringsAsFactors = FALSE, which happens all the time. Also, looping in R is usually less efficient and harder to reason about than a more functional approach. I think what you're trying to do can be done with the aggregate function:
rawData <- read.csv(file = "SampleData.csv", stringsAsFactors = FALSE)
df <- aggregate(ID ~ Company + Date, data = rawData, FUN = paste, collapse = ",")
colnames(df) <- c("NameOfCompany", "DateOfOrder", "ID")
df = split(df, df$NameOfCompany)
Or using a tidy approach:
library(dplyr)
df <- rawData %>%
  group_by(Company, Date) %>%
  summarise(WhichProducts = paste(ID, collapse = ',')) %>%
  rename(DateOfOrder = Date) %>%
  rename(NameOfCompany = Company) %>%
  group_split()
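Either way, you can then write one basket.csv covering all companies (a small sketch building on the aggregate version above; the WhichProducts column name just mirrors your desired output):
basket <- aggregate(ID ~ Company + Date, data = rawData, FUN = paste, collapse = ",")
colnames(basket) <- c("NameOfCompany", "DateOfOrder", "WhichProducts")
# one row per company/date combination, all companies in a single file
write.csv(basket, file = "basket.csv", row.names = FALSE)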
I am using the awesome RWeka package in order to fit an SMOreg model as implemented in Weka. While everything is working fine, I have some problems extracting the weights from the fitted model.
Like all Weka classifier objects, my model has a nice print method that shows me all the features and their relative weights. However, I am not able to extract these weights in any way.
You can see for yourself by running the following code:
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
model_SMOreg <- SMOreg_classifier(mpg ~ ., data = mtcars)
Now, if you simply call the model
model_SMOreg
you'll see that it prints all the features used in the model with their relative weight. I would like to access those weights as a vector or, even better, as a 2-columns table with one column containing the names of the features and the other containing the weights.
I am working on a Windows 7 x64 system, using RStudio Version 1.0.153, R 3.4.2 Short Summer and RWeka 0.4-35.
Does someone know how to do this?
I think you cannot get this in numeric format.
attr(model_SMOreg, "meta")$class # "Weka_classifier"
getAnywhere("print.Weka_classifier")
Result:
A single object matching ‘print.Weka_classifier’ was found
It was found in the following places
registered S3 method for print from namespace RWeka
namespace:RWeka
with value
function (x, ...)
{
writeLines(.jcall(x$classifier, "S", "toString"))
invisible(x)
}
<bytecode: 0x8328630>
<environment: namespace:RWeka>
So we see: print.Weka_classifier() makes a writeLines() call, which in turn makes an rJava::.jcall() call that returns a string.
Thus, I think you need to parse the weights yourself, perhaps starting from capture.output().
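For example, something along these lines (just a sketch; in the SMOreg printout the feature terms appear as lines containing "*"):
model_lines <- capture.output(model_SMOreg)             # printed model, one string per line
weight_lines <- grep("\\*", model_lines, value = TRUE)  # keep the "weight * feature" lines
weight_lines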
Based on the suggestion of @knb, I have written a function that extracts the weights from an SMOreg model and returns a tibble with one column for the feature names and one for the feature weights, with the rows arranged by the absolute value of the weights.
Note that this function only works for the SMOreg classifier, as the output of other classifiers is slightly different in terms of layout. However, I think the function can be easily adapted for other classifiers.
library(stringr)
library(tidyverse)
extract_weights_from_SMOreg <- function(model) {
  # silence the warnings that as.numeric() produces on non-numeric tokens
  oldw <- getOption("warn")
  options(warn = -1)
  # capture the printed model and keep only the "weight * feature" lines
  # (the first three lines and the last five lines are header/footer)
  raw_output <- capture.output(model)
  trimmed_output <- raw_output[-c(1:3, (length(raw_output) - 4):length(raw_output))]
  # one row per feature, plus one extra row for the intercept
  df <- data_frame(features_name = vector(length = length(trimmed_output) + 1, "character"),
                   features_weight = vector(length = length(trimmed_output) + 1, "numeric"))
  for (line in 1:length(trimmed_output)) {
    # split the line into tokens; tokens that are not numbers become NA below
    string_as_vector <- trimmed_output[line] %>%
      str_split(string = ., pattern = " ") %>%
      unlist(.)
    numeric_element <- trimmed_output[line] %>%
      str_split(string = ., pattern = " ") %>%
      unlist(.) %>%
      as.numeric(.)
    # position of the "*" separating the weight from the feature name
    position_mul <- string_as_vector[is.na(numeric_element)] %>%
      str_detect(string = ., pattern = "[*]") %>%
      which(.)
    numeric_element <- numeric_element %>%
      `[`(., c(1:position_mul))
    text_element <- string_as_vector[is.na(numeric_element)]
    # the sign of the term is indicated by a "+" (otherwise it is negative)
    there_is_plus <- string_as_vector[is.na(numeric_element)] %>%
      str_detect(string = ., pattern = "[+]") %>%
      sum(.)
    if (there_is_plus) { sign_is <- "+" } else { sign_is <- "-" }
    feature_weight <- numeric_element[!is.na(numeric_element)]
    # store the signed weight and the feature name
    if (sign_is == "-") { df[line, "features_weight"] <- feature_weight * -1 } else { df[line, "features_weight"] <- numeric_element[!(is.na(numeric_element))] }
    df[line, "features_name"] <- paste(text_element[(position_mul + 1):length(text_element)], collapse = " ")
  }
  # the intercept sits on its own line near the end of the printed output
  intercept_line <- raw_output[length(raw_output) - 4]
  there_is_plus_intercept <- intercept_line %>%
    str_detect(string = ., pattern = "[+]") %>%
    sum(.)
  if (there_is_plus_intercept) { intercept_sign_is <- "+" } else { intercept_sign_is <- "-" }
  numeric_intercept <- intercept_line %>%
    str_split(string = ., pattern = " ") %>%
    unlist(.) %>%
    as.numeric(.) %>%
    `[`(., length(.))
  df[nrow(df), "features_name"] <- "intercept"
  if (intercept_sign_is == "-") { df[nrow(df), "features_weight"] <- numeric_intercept * -1 } else { df[nrow(df), "features_weight"] <- numeric_intercept }
  # restore the warning setting and sort features by absolute weight
  options(warn = oldw)
  df <- df %>%
    arrange(desc(abs(features_weight)))
  return(df)
}
Here is an example for one model:
library(RWeka)
data("mtcars")
SMOreg_classifier <- make_Weka_classifier("weka/classifiers/functions/SMOreg")
mpg_model_weights <- extract_weights_from_SMOreg(SMOreg_classifier(data = mtcars, mpg ~ .))
mpg_model_weights
I have been working on feature engineering and I have come up with the following code to convert a categorical dataset into a binary transaction dataset, but I am not able to get the desired output: only three columns get populated with ones, and the rest all remain zero.
Here's my code:
binarize <- function(dataset, names){
  drops <- NA
  colnames(dataset) <- paste("{", names, "}:", sep = "")
  dataset <- dataset[, !(names(dataset) %in% drops)]
  dataset$id = 1:nrow(dataset)
  xt <- dataset %>% mutate_each(funs(as.character)) %>% gather(key, values, starts_with("{"))
  xt$values <- apply(xt[, c("key", "values")], 1, paste, collapse = "")
  xt <- xt %>% mutate(present = 1) %>% spread(values, present, fill = 0)
  xt <- xt %>% distinct(id)
  rmCols <- c("id", "key")
  xt <- xt[, -which(names(xt) %in% rmCols)]
  return(xt)
}
Can someone help me debug this? The names argument is a character vector.
Thanks,
Prerit
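For what it's worth, here is a minimal sketch of the kind of output I understand binarize() to be aiming for (one 0/1 indicator column per "{column}:value" pair), written with the newer tidyr pivot functions on a toy data frame; this is an assumption about the intended result rather than a fix of the function above:
library(dplyr)
library(tidyr)
# toy input: two categorical columns
toy <- data.frame(Gender = c("M", "F", "M"),
                  Country = c("UK", "NZ", "UK"),
                  stringsAsFactors = FALSE)
toy %>%
  mutate(id = row_number()) %>%                                  # keep track of each row
  pivot_longer(-id, names_to = "key", values_to = "value") %>%   # long form: one row per cell
  mutate(col = paste0("{", key, "}:", value), present = 1) %>%   # "{column}:value" labels
  select(id, col, present) %>%
  pivot_wider(names_from = col, values_from = present, values_fill = 0)  # 0/1 indicators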