Getting an error when applying smbinning in R
I am working through an example from http://r-statistics.co/Logistic-Regression-With-R.html and have a problem with the smbinning code. I am trying to compute the Information Value using smbinning.
library(smbinning)
# segregate continuous and factor variables
factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")
continuous_vars <- c("AGE", "FNLWGT","EDUCATIONNUM", "HOURSPERWEEK", "CAPITALGAIN", "CAPITALLOSS")
iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(14)) # init for IV results
# compute IV for categoricals
for(factor_var in factor_vars){
smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var) # WOE table
if(class(smb) != "character"){ # check if some error occurred
iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
}
}
This is the code given in the tutorial. I cannot understand the reason for checking the class of the smbinning result. My general understanding of smbinning is also not that good. So I tried it without the check:
for(vars in factor_vars){
smb <- smbinning.factor(trainingData, y = "ABOVE50K", x = vars )
iv_df[iv_df$VARS == vars, "IV"] <- smb["iv"]
}
When I run this code, some of the values come back as NA. So the class check is apparently needed, but why?
Thank you very much.
Following the example to the letter, your problem would be the following:
If you do smb <- smbinning.factor(trainingData, y="ABOVE50K", x="EDUCATION") and then smb, you get
[1] "Too many categories"
str(trainingData) shows that:
$ EDUCATION : Factor w/ 16 levels...
While the smbinning documentation says that
maxcat - Specifies the maximum number of categories. Default value is 10. Name of x
must not have a dot.
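Therefore your solution is to use smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var, maxcat=16) in the for loop. This is also why the tutorial checks the class: when smbinning.factor cannot bin a variable it returns a character message such as the one above instead of a result with an iv element, so without the check there is no IV to store for that variable.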
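The reason for the class check can be illustrated without the package itself. This is a minimal sketch, assuming a hypothetical stand-in function bin_factor() that only mimics the behaviour seen above (a list with an iv element on success, a character message on failure); it is not smbinning's implementation, and the IV value is a placeholder.

```r
# Hypothetical stand-in for smbinning.factor(): returns a list with an `iv`
# element on success, or a character message when there are too many levels.
bin_factor <- function(x, maxcat = 10) {
  if (nlevels(x) > maxcat) return("Too many categories")
  list(iv = 0.5)  # placeholder, not a real Information Value
}

# Toy variables: SEX has 2 levels, EDUCATION has 16 levels
vars <- list(SEX = factor(c("F", "M", "F")),
             EDUCATION = factor(letters[1:16]))
iv_df <- data.frame(VARS = names(vars), IV = NA_real_)

for (v in names(vars)) {
  res <- bin_factor(vars[[v]])
  if (is.list(res)) {
    iv_df[iv_df$VARS == v, "IV"] <- res$iv  # success: store the IV
  } else {
    message(v, ": ", res)                   # failure: report the reason
  }
}
iv_df
```

Printing the message instead of silently skipping makes it obvious which variables need a larger maxcat.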
Related
How *not* to remove entire case from analysis, using aov_car
I'm running an ANOVA with: within: Session (Pre vs. Post) within: Condition (A, B, C) between: Group (Female, Male) Three participants are missing all of 'C' (pre and post). I don't want to completely exclude them from my analyses because I think their 'A' and 'B' data is still interesting. I have tried including na.rm=TRUE to my script, and to no avail. Is there any way that I can run my aov_car (mixed-design ANOVA) without completely remove all the data from these three participants? I keep getting the following error: Contrasts set to contr.sum for the following variables: Group. Warning message: Missing values for following ID(s): P20, R21, R22. Removing those cases from the analysis. Sample data (note, it's fudged/randomized data here): my_data <- readr::read_csv("PID,Session,Condition,Group,data P1,Pre,A,Female,0.935147485 P2,Pre,A,Female,0.290449952 P3,Pre,A,Female,0.652213856 P4,Pre,A,Female,0.349222763 P5,Pre,A,Female,0.235789135 P6,Pre,A,Female,0.268469251 P7,Pre,A,Female,0.419284033 P8,Pre,A,Female,0.797236877 P9,Pre,A,Female,0.784526027 P10,Pre,A,Female,0.44837527 P11,Pre,A,Female,0.359525572 P12,Pre,A,Male,0.923775343 P13,Pre,A,Male,0.431557872 P14,Pre,A,Male,0.425703913 P15,Pre,A,Male,0.39916012 P16,Pre,A,Male,0.168378348 P17,Pre,A,Male,0.260462544 P18,Pre,A,Male,0.945835896 P19,Pre,A,Male,0.495932288 P20,Pre,A,Male,0.045565042 P21,Pre,A,Male,0.748259161 P22,Pre,A,Male,0.426588091 P1,Pre,B,Female,0.761677517 P2,Pre,B,Female,0.985953719 P3,Pre,B,Female,0.657063156 P4,Pre,B,Female,0.166859072 P5,Pre,B,Female,0.850201269 P6,Pre,B,Female,0.227918183 P7,Pre,B,Female,0.701946655 P8,Pre,B,Female,0.079116861 P9,Pre,B,Female,0.094935181 P10,Pre,B,Female,0.376525478 P11,Pre,B,Female,0.725431114 P12,Pre,B,Male,0.922099723 P13,Pre,B,Male,0.664993697 P14,Pre,B,Male,0.450501356 P15,Pre,B,Male,0.201276143 P16,Pre,B,Male,0.735428897 P17,Pre,B,Male,0.304752274 P18,Pre,B,Male,0.393020637 P19,Pre,B,Male,0.452345203 P20,Pre,B,Male,0.697709526 P21,Pre,B,Male,0.130459291 
P22,Pre,B,Male,0.210211859 P1,Pre,C,Female,0.280820754 P2,Pre,C,Female,0.206499238 P3,Pre,C,Female,0.127540559 P4,Pre,C,Female,0.001998028 P5,Pre,C,Female,0.554408227 P6,Pre,C,Female,0.235435708 P7,Pre,C,Female,0.341077362 P8,Pre,C,Female,0.101103042 P9,Pre,C,Female,0.834297025 P10,Pre,C,Female,0.256605011 P11,Pre,C,Female,0.65647746 P12,Pre,C,Male,0.110716441 P13,Pre,C,Male,0.075856866 P14,Pre,C,Male,0.518357132 P15,Pre,C,Male,0.222078883 P16,Pre,C,Male,0.414747048 P17,Pre,C,Male,0.525522894 P18,Pre,C,Male,0.758019496 P19,Pre,C,Male,0.213927508 P20,Pre,C,Male, P21,Pre,C,Male, P22,Pre,C,Male, P1,Post,A,Female,0.435204978 P2,Post,A,Female,0.681378597 P3,Post,A,Female,0.928158111 P4,Post,A,Female,0.525061816 P5,Post,A,Female,0.46271948 P6,Post,A,Female,0.649810342 P7,Post,A,Female,0.748819476 P8,Post,A,Female,0.207494638 P9,Post,A,Female,0.060148769 P10,Post,A,Female,0.074998663 P11,Post,A,Female,0.177396477 P12,Post,A,Male,0.61446322 P13,Post,A,Male,0.367348586 P14,Post,A,Male,0.853124208 P15,Post,A,Male,0.268734518 P16,Post,A,Male,0.784226481 P17,Post,A,Male,0.892830959 P18,Post,A,Male,0.950081146 P19,Post,A,Male,0.731274982 P20,Post,A,Male,0.901554267 P21,Post,A,Male,0.170960222 P22,Post,A,Male,0.2337913 P1,Post,B,Female,0.940130538 P2,Post,B,Female,0.575209304 P3,Post,B,Female,0.84527559 P4,Post,B,Female,0.160605498 P5,Post,B,Female,0.547844182 P6,Post,B,Female,0.287795345 P7,Post,B,Female,0.010274473 P8,Post,B,Female,0.408166731 P9,Post,B,Female,0.562733542 P10,Post,B,Female,0.44217795 P11,Post,B,Female,0.390071799 P12,Post,B,Male,0.767768344 P13,Post,B,Male,0.548800315 P14,Post,B,Male,0.489825627 P15,Post,B,Male,0.783939035 P16,Post,B,Male,0.772595033 P17,Post,B,Male,0.252895712 P18,Post,B,Male,0.383513642 P19,Post,B,Male,0.709882712 P20,Post,B,Male,0.517304459 P21,Post,B,Male,0.77186642 P22,Post,B,Male,0.395415627 P1,Post,C,Female,0.649783292 P2,Post,C,Female,0.490853459 P3,Post,C,Female,0.467705056 P4,Post,C,Female,0.988740552 P5,Post,C,Female,0.413980642 
P6,Post,C,Female,0.83941706 P7,Post,C,Female,0.111722237 P8,Post,C,Female,0.501984852 P9,Post,C,Female,0.15634255 P10,Post,C,Female,0.547770503 P11,Post,C,Female,0.576203944 P12,Post,C,Male,0.857518274 P13,Post,C,Male,0.176794297 P14,Post,C,Male,0.127501287 P15,Post,C,Male,0.831191664 P16,Post,C,Male,0.257022941 P17,Post,C,Male,0.295366754 P18,Post,C,Male,0.113785049 P19,Post,C,Male,0.621389037 P20,Post,C,Male, P21,Post,C,Male, P22,Post,C,Male,") Current Code : library(tidyverse) library(car) library(afex) library(emmeans) my_anova <-aov_car(data ~ Group*Session*Condition + Error(PID/Session*Condition), na.rm = TRUE, data=my_data) I've also tried: my_anova2 <- aov_ez("PID", "data", my_data, within = c("Session", "Condition"), between = "Group", na.rm=TRUE)
Error in is.single.string(object) : argument "object" is missing, with no default
I want to parse the AAChange.refGene column and then use the biomaRt R package to extract information. My code raises Error in is.single.string(object) : argument "object" is missing, with no default, even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
  temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
  temp <- sapply(temp$peptide, nchar)
  temp <- sort(temp, decreasing = TRUE)
  temp <- names(temp[1])
  pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) : argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was a name conflict between packages (classic R), specifically biomaRt and seqinr. seqinr is used to handle the FASTA format, so the two are probably often loaded together. My solution was to call the function with an explicit namespace:
biomaRt::getSequence()
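The masking effect can be reproduced with base R alone. This is a small sketch that uses a user-defined sd() to stand in for the second package; the names are only illustrative and have nothing to do with biomaRt or seqinr themselves.

```r
# A later definition masks stats::sd for unqualified calls,
# just as a later-loaded package masks an earlier one.
sd <- function(x) "masked"

unqualified <- sd(c(1, 2, 3))         # picks up the masking definition
qualified   <- stats::sd(c(1, 2, 3))  # always the stats package version

unqualified
qualified
```

Calling conflicts() in a session lists the names that are currently defined in more than one attached environment.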
Problems following a code example. InformationValue::Woe
I'm learning new feature selection methods from this blog entry: https://www.machinelearningplus.com/machine-learning/feature-selection/ (point 9), and I stumbled upon some problems. The first was with the CSV, which I have solved.
library(InformationValue)
adult <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', sep = ',', fill = F, strip.white = T,stringsAsFactors = FALSE)
colnames(adult) <- c('age', 'WORKCLASS', 'fnlwgt', 'EDUCATION', 'educatoin_num', 'MARITALSTATUS', 'OCCUPATION', 'RELATIONSHIP', 'RACE', 'SEX', 'capital_gain', 'capital_loss', 'hours_per_week', 'NATIVECOUNTRY', 'ABOVE50K')
inputData <- adult
print(head(inputData))
But then I can't solve the next chunk:
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
  df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K)
  df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
I keep getting 0 values in IV and, of course, "Not predictive" in the STRENGTH column of the data frame. I've tried setting
factor_vars=cat_vars
but it doesn't seem to work, and quite frankly I can't figure out why.
Just solved it. In the first place, the argument stringsAsFactors = FALSE is unnecessary, since we need factors. Then, consulting the IV function and looking at the summary of the dataset, I noticed that, even though the response is a factor, the function requires a numeric input; it cannot extract the factor's "value" (level). So we must work around it.
as.numeric(inputData$ABOVE50K)
"solves it", although maybe I should change the values, since this gives 1-2 instead of the classic 0-1 response. I'm working on it. I think there has got to be an easier solution, but:
levels(inputData$ABOVE50K)
inputData$ABOVE50K2 = as.numeric(inputData$ABOVE50K)
inputData$ABOVE50K3= ifelse(inputData$ABOVE50K2 ==1,0, ifelse(inputData$ABOVE50K2==2,1,NA))
inputData$ABOVE50K3 <- factor(inputData$ABOVE50K3)
And the output is the same, so there is no need to change the levels to 0-1.
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
factor_vars= cat_vars
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
  df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3)
  df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
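For the 1-2 versus 0-1 question, a shorter conversion may be enough. This is a sketch on a toy factor, assuming the two response levels are "<=50K" and ">50K" as in the adult data; it only shows the recoding and does not call InformationValue.

```r
# Toy response with the level order fixed explicitly
y <- factor(c("<=50K", ">50K", "<=50K", ">50K"),
            levels = c("<=50K", ">50K"))

codes <- as.numeric(y)           # internal level codes: 1 for the first level, 2 for the second
y01   <- as.integer(y == ">50K") # 0/1 indicator for the ">50K" level

codes
y01
```

Comparing against the level name avoids depending on the order in which the levels happen to be stored.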
Debug error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE]: subscript out of bounds
I'm using the rpart library to build a regression tree, with the following code:
skillcraft <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00272/SkillCraft1_Dataset.csv", header = T, sep =",")
skillcraft$LeagueIndex <- factor(skillcraft$LeagueIndex)
skillcraft <- skillcraft[-1]
skillcraft$Age <- as.numeric(levels(skillcraft$Age))[skillcraft$Age]
skillcraft$TotalHours <- as.numeric( levels(skillcraft$TotalHours))[skillcraft$TotalHours]
skillcraft$HoursPerWeek <- as.numeric( levels(skillcraft$HoursPerWeek))[skillcraft$HoursPerWeek]
skillcraft <- skillcraft[complete.cases(skillcraft),]
library(caret)
set.seed(133)
skillcraft_sampling_vector <- createDataPartition( skillcraft$LeagueIndex, p = 0.8, list = F)
skillcraft_train <- skillcraft[skillcraft_sampling_vector,]
skillcraft_test <- skillcraft[-skillcraft_sampling_vector,]
library(rpart)
regtree <- rpart(LeagueIndex ~., data = skillcraft_train)
regtree_predictions <- predict(regtree, skillcraft_test)
The last line of this code throws the error:
Error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE] : subscript out of bounds
This doesn't seem very clear. I've checked that both data frames (train and test) have the same structure, and now I'm having trouble finding a way to debug this code. Can anyone help? Thanks in advance!
My best guess is that the problem lies in the LeagueIndex factor. This variable was provided as ordinal data (from Bronze to Professional) and converted to a character factor "1", "2", "3", etc. up to "8". It looks like, in addition to your error with rpart, you get a warning when partitioning the data based on this factor:
In createDataPartition(skillcraft$LeagueIndex, p = 0.8, list = F) : Some classes have no records ( 8 ) and these will be ignored
Apparently there are no records with a LeagueIndex of 8. This happens after you select complete cases here:
skillcraft <- skillcraft[complete.cases(skillcraft),]
All of the LeagueIndex=8 cases are removed, as they have missing data for Age, HoursPerWeek, and TotalHours (coerced to NA when converted via as.numeric).
skillcraft[which(skillcraft$LeagueIndex == 8), c("Age", "HoursPerWeek", "TotalHours")]
     Age HoursPerWeek TotalHours
3341   ?            ?          ?
3342   ?            ?          ?
3343   ?            ?          ?
...
Assuming you still want a factor, I believe this will work if you get rid of the unused factor level, for example with
skillcraft$LeagueIndex <- droplevels(skillcraft$LeagueIndex)
before partitioning the data. (You could do this on the training set alone in this example, but you would want the same factor levels in your test and train sets.)
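The effect of an unused level can be seen on a toy factor. This is a small sketch with made-up values, not the SkillCraft data: subsetting removes the rows but leaves the level in place until droplevels() is applied.

```r
# Toy factor with three levels, one of which ("8") is then filtered out
league <- factor(c("1", "2", "8", "2", "1"))
kept   <- league[league != "8"]

n_before <- nlevels(kept)              # the empty level "8" is still counted
n_after  <- nlevels(droplevels(kept))  # the empty level is removed

table(kept)              # shows a zero count for "8"
table(droplevels(kept))  # "8" no longer appears
```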
replacing a value in column X based on columns Y with R
I've gone through several answers and tried the following, but each either yields an error or an unwanted result. Here's the data:
Network             Campaign
Moburst_Chartboost  Test Campaign
Moburst_Chartboost  Test Campaign
Moburst_Appnext     unknown
Moburst_Appnext     1065
I'd like to replace "Test Campaign" with "1055" whenever "Network" == "Moburst_Chartboost". I realize this should be very simple, but I tried this:
dataset = read.csv('C:/Users/User/Downloads/example.csv')
for( i in 1:nrow(dataset)){
  if(dataset$Network == 'Moburst_Chartboost') dataset$Campaign <- '1055'
}
which yields:
Warning messages:
1: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" : the condition has length > 1 and only the first element will be used
2: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" : the condition has length > 1 and only the first element will be used
etc. Then I tried:
within(dataset, { dataset$Campaign <- ifelse(dataset$Network == 'Moburst_Chartboost', '1055', dataset$Campaign) })
This turned ALL 4 values in the "Campaign" column into "1055", overwriting what was there even when the condition isn't met. I also tried this:
dataset$Campaign[which(dataset$Network == 'Moburst_Chartboost')] <- 1055
which replaced the values in the first two rows of "Campaign" with NA and gave this warning:
Warning message: In `[<-.factor`(`*tmp*`, which(dataset$Network == "Moburst_Chartboost"), : invalid factor level, NA generated
Scratching my head here. I'm new to R, but this shouldn't be so hard :(
In your first attempt, you're trying to iterate over all the columns, when you only want to change the 2nd column. In your second, you're trying to assign the value "1055" to all of the 2nd column. The way to think about it is as an if else, where if the condition in col 1 is met, col 2 is changed, otherwise it remains the same.
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost", "Moburst_Appnext", "Moburst_Appnext"), Campaign = c("Test Campaign", "Test Campaign", "unknown", "1065"))
dataset$Campaign <- ifelse(dataset$Network == "Moburst_Chartboost", "1055", dataset$Campaign)
head(dataset)
             Network Campaign
1 Moburst_Chartboost     1055
2 Moburst_Chartboost     1055
3    Moburst_Appnext  unknown
4    Moburst_Appnext     1065
You may also try
dataset$Campaign[dataset$Campaign=="Test Campaign"]<-1055
to avoid the use of loops and ifelse statements, where dataset is
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost", "Moburst_Appnext", "Moburst_Appnext"), Campaign = c("Test Campaign", "Test Campaign", "unknown", 1065))
Try the following:
dataset = read.csv('C:/Users/User/Downloads/example.csv', stringsAsFactors = F)
for( i in 1:nrow(dataset)){
  if(dataset$Network[i] == 'Moburst_Chartboost') dataset$Campaign[i] <- '1055'
}
It seems you forgot the index variable. Without [i] you work on the whole vector of the data frame, resulting in the warning you mentioned. Note that I added stringsAsFactors = F to the read.csv() call to make sure the strings are indeed read as strings and not factors. With factors, this would result in a warning like this:
In `[<-.factor`(`*tmp*`, i, value = c(NA, 2L, 3L, 1L)) : invalid factor level, NA generated
Alternatively, you can do the following without using a for loop:
idx <- which(dataset$Network == 'Moburst_Chartboost')
dataset$Campaign[idx] <- '1055'
Here, idx is a vector containing the positions where Network has the value 'Moburst_Chartboost'.
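The index-based replacement can be tried on the four rows from the question without the CSV file. This is a sketch that rebuilds the data inline with character columns and, as an added assumption, also requires Campaign to equal "Test Campaign" so that other Chartboost campaigns would be left alone.

```r
# Rebuild the example data with character (not factor) columns
dataset <- data.frame(
  Network  = c("Moburst_Chartboost", "Moburst_Chartboost",
               "Moburst_Appnext", "Moburst_Appnext"),
  Campaign = c("Test Campaign", "Test Campaign", "unknown", "1065"),
  stringsAsFactors = FALSE
)

# Logical index: TRUE only where both conditions hold
hit <- dataset$Network == "Moburst_Chartboost" &
  dataset$Campaign == "Test Campaign"

dataset$Campaign[hit] <- "1055"  # replace only the matching rows
dataset
```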
Thank you for the help! Not elegant, but since this lingered with me when going to sleep last night, I decided to try to bludgeon it with some ugly code, and it worked too, just as a workaround: I separated the data into two data frames, replaced all values in one, and then bound them back together.
# subsetting only chartboost
chartboost <- subset(dataset, dataset$Network=='Moburst_Chartboost')
# replace all values in Campaign
chartboost$Campaign <-sub("^.*", "1055",chartboost$Campaign)
# subsetting only "not chartboost"
notChartboost <-subset(dataset, dataset$Network!='Moburst_Chartboost')
# binding back to single dataframe
newSet <- rbind(chartboost, notChartboost)
Ugly as a duckling but worked :)