Getting error when applying smbinning in R

I am working through an example from http://r-statistics.co/Logistic-Regression-With-R.html and have a problem with the smbinning code. I am trying to compute the Information Value using smbinning.
library(smbinning)
# segregate continuous and factor variables
factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")
continuous_vars <- c("AGE", "FNLWGT","EDUCATIONNUM", "HOURSPERWEEK", "CAPITALGAIN", "CAPITALLOSS")
iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(14)) # init for IV results
# compute IV for categoricals
for(factor_var in factor_vars){
smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var) # WOE table
if(class(smb) != "character"){ # check if some error occurred
iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
}
}
This is the code given. I cannot understand the reason for checking the class of the smbinning result. My general understanding of smbinning is also not that good.
for(vars in factor_vars){
smb <- smbinning.factor(trainingData, y = "ABOVE50K", x = vars )
iv_df[iv_df$VARS == vars, "IV"] <- smb["iv"]
}
When I run this code I get NA for some of the values. So the class check is apparently needed, but why?
Thank you very much.

Following the example to the letter, your problem would be the following:
If you do smb <- smbinning.factor(trainingData, y="ABOVE50K", x="EDUCATION") and then smb, you get
[1] "Too many categories"
str(trainingData) shows that:
$ EDUCATION : Factor w/ 16 levels...
While the smbinning documentation says that
maxcat - Specifies the maximum number of categories. Default value is 10. Name of x must not have a dot.
Therefore your solution is to use: smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var, maxcat=16) in the for loop
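To see why the class check matters: when binning fails, smbinning.factor returns a plain character message instead of a list (as the "Too many categories" output above shows), and indexing a character string by name gives NA. A minimal base-R sketch using made-up stand-ins (smb_failed, smb_ok and get_iv are hypothetical names, not part of smbinning):

```r
# Hypothetical stand-ins for the two kinds of result described above
smb_failed <- "Too many categories"   # failure: just a message string
smb_ok     <- list(iv = 0.42)         # success: a list with an iv element (value made up)

# Indexing an unnamed character vector by name yields NA,
# which is where the NA values in iv_df come from
failed_lookup <- smb_failed["iv"]

# Illustrative helper: only read iv when the result is a list
get_iv <- function(smb) {
  if (is.list(smb)) smb$iv else NA_real_
}

iv_ok     <- get_iv(smb_ok)
iv_failed <- get_iv(smb_failed)
```

Raising maxcat removes the failure for EDUCATION; a check like this would still guard against other variables that cannot be binned.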

Related

How *not* to remove entire case from analysis, using aov_car

I'm running an ANOVA with:
within: Session (Pre vs. Post)
within: Condition (A, B, C)
between: Group (Female, Male)
Three participants are missing all of 'C' (pre and post). I don't want to exclude them completely from my analyses because I think their 'A' and 'B' data are still interesting. I have tried adding na.rm=TRUE to my script, to no avail. Is there any way to run my aov_car (mixed-design ANOVA) without removing all the data from these three participants?
I keep getting the following message: Contrasts set to contr.sum for the following variables: Group. Warning message: Missing values for following ID(s): P20, R21, R22. Removing those cases from the analysis.
Sample data (note, it's fudged/randomized data here):
my_data <- readr::read_csv("PID,Session,Condition,Group,data
P1,Pre,A,Female,0.935147485
P2,Pre,A,Female,0.290449952
P3,Pre,A,Female,0.652213856
P4,Pre,A,Female,0.349222763
P5,Pre,A,Female,0.235789135
P6,Pre,A,Female,0.268469251
P7,Pre,A,Female,0.419284033
P8,Pre,A,Female,0.797236877
P9,Pre,A,Female,0.784526027
P10,Pre,A,Female,0.44837527
P11,Pre,A,Female,0.359525572
P12,Pre,A,Male,0.923775343
P13,Pre,A,Male,0.431557872
P14,Pre,A,Male,0.425703913
P15,Pre,A,Male,0.39916012
P16,Pre,A,Male,0.168378348
P17,Pre,A,Male,0.260462544
P18,Pre,A,Male,0.945835896
P19,Pre,A,Male,0.495932288
P20,Pre,A,Male,0.045565042
P21,Pre,A,Male,0.748259161
P22,Pre,A,Male,0.426588091
P1,Pre,B,Female,0.761677517
P2,Pre,B,Female,0.985953719
P3,Pre,B,Female,0.657063156
P4,Pre,B,Female,0.166859072
P5,Pre,B,Female,0.850201269
P6,Pre,B,Female,0.227918183
P7,Pre,B,Female,0.701946655
P8,Pre,B,Female,0.079116861
P9,Pre,B,Female,0.094935181
P10,Pre,B,Female,0.376525478
P11,Pre,B,Female,0.725431114
P12,Pre,B,Male,0.922099723
P13,Pre,B,Male,0.664993697
P14,Pre,B,Male,0.450501356
P15,Pre,B,Male,0.201276143
P16,Pre,B,Male,0.735428897
P17,Pre,B,Male,0.304752274
P18,Pre,B,Male,0.393020637
P19,Pre,B,Male,0.452345203
P20,Pre,B,Male,0.697709526
P21,Pre,B,Male,0.130459291
P22,Pre,B,Male,0.210211859
P1,Pre,C,Female,0.280820754
P2,Pre,C,Female,0.206499238
P3,Pre,C,Female,0.127540559
P4,Pre,C,Female,0.001998028
P5,Pre,C,Female,0.554408227
P6,Pre,C,Female,0.235435708
P7,Pre,C,Female,0.341077362
P8,Pre,C,Female,0.101103042
P9,Pre,C,Female,0.834297025
P10,Pre,C,Female,0.256605011
P11,Pre,C,Female,0.65647746
P12,Pre,C,Male,0.110716441
P13,Pre,C,Male,0.075856866
P14,Pre,C,Male,0.518357132
P15,Pre,C,Male,0.222078883
P16,Pre,C,Male,0.414747048
P17,Pre,C,Male,0.525522894
P18,Pre,C,Male,0.758019496
P19,Pre,C,Male,0.213927508
P20,Pre,C,Male,
P21,Pre,C,Male,
P22,Pre,C,Male,
P1,Post,A,Female,0.435204978
P2,Post,A,Female,0.681378597
P3,Post,A,Female,0.928158111
P4,Post,A,Female,0.525061816
P5,Post,A,Female,0.46271948
P6,Post,A,Female,0.649810342
P7,Post,A,Female,0.748819476
P8,Post,A,Female,0.207494638
P9,Post,A,Female,0.060148769
P10,Post,A,Female,0.074998663
P11,Post,A,Female,0.177396477
P12,Post,A,Male,0.61446322
P13,Post,A,Male,0.367348586
P14,Post,A,Male,0.853124208
P15,Post,A,Male,0.268734518
P16,Post,A,Male,0.784226481
P17,Post,A,Male,0.892830959
P18,Post,A,Male,0.950081146
P19,Post,A,Male,0.731274982
P20,Post,A,Male,0.901554267
P21,Post,A,Male,0.170960222
P22,Post,A,Male,0.2337913
P1,Post,B,Female,0.940130538
P2,Post,B,Female,0.575209304
P3,Post,B,Female,0.84527559
P4,Post,B,Female,0.160605498
P5,Post,B,Female,0.547844182
P6,Post,B,Female,0.287795345
P7,Post,B,Female,0.010274473
P8,Post,B,Female,0.408166731
P9,Post,B,Female,0.562733542
P10,Post,B,Female,0.44217795
P11,Post,B,Female,0.390071799
P12,Post,B,Male,0.767768344
P13,Post,B,Male,0.548800315
P14,Post,B,Male,0.489825627
P15,Post,B,Male,0.783939035
P16,Post,B,Male,0.772595033
P17,Post,B,Male,0.252895712
P18,Post,B,Male,0.383513642
P19,Post,B,Male,0.709882712
P20,Post,B,Male,0.517304459
P21,Post,B,Male,0.77186642
P22,Post,B,Male,0.395415627
P1,Post,C,Female,0.649783292
P2,Post,C,Female,0.490853459
P3,Post,C,Female,0.467705056
P4,Post,C,Female,0.988740552
P5,Post,C,Female,0.413980642
P6,Post,C,Female,0.83941706
P7,Post,C,Female,0.111722237
P8,Post,C,Female,0.501984852
P9,Post,C,Female,0.15634255
P10,Post,C,Female,0.547770503
P11,Post,C,Female,0.576203944
P12,Post,C,Male,0.857518274
P13,Post,C,Male,0.176794297
P14,Post,C,Male,0.127501287
P15,Post,C,Male,0.831191664
P16,Post,C,Male,0.257022941
P17,Post,C,Male,0.295366754
P18,Post,C,Male,0.113785049
P19,Post,C,Male,0.621389037
P20,Post,C,Male,
P21,Post,C,Male,
P22,Post,C,Male,")
Current code:
library(tidyverse)
library(car)
library(afex)
library(emmeans)
my_anova <-aov_car(data ~ Group*Session*Condition
+ Error(PID/Session*Condition), na.rm = TRUE,
data=my_data)
I've also tried:
my_anova2 <- aov_ez("PID", "data",
my_data,
within = c("Session", "Condition"),
between = "Group", na.rm=TRUE)
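Not a fix for aov_car itself, but a small base-R sketch of one way to see which participants have missing cells and to confirm that an A/B-only subset is complete. The toy data frame is made up for illustration; only the column names follow the question.

```r
# Made-up miniature of the long-format data: P2 has no value for condition C
toy <- data.frame(
  PID       = rep(c("P1", "P2", "P3"), each = 3),
  Condition = rep(c("A", "B", "C"), times = 3),
  data      = c(0.1, 0.2, 0.3,  0.4, 0.5, NA,  0.7, 0.8, 0.9)
)

# Participants with at least one missing cell
ids_with_na <- unique(toy$PID[is.na(toy$data)])

# Keeping only conditions A and B leaves no missing values in this toy example
toy_ab      <- toy[toy$Condition != "C", ]
complete_ab <- !anyNA(toy_ab$data)
```

Whether a separate analysis restricted to A and B suits the study is a design decision; the sketch only shows the bookkeeping.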

Error in is.single.string(object) : argument "object" is missing, with no default

I want to parse the AAChange.refGene column and then use biomaRt R package to extract information. My code is raising Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was a conflict between packages (classic R), specifically biomaRt and seqinr. seqinr is used to handle FASTA files, so the two are probably loaded together often, and both define a function named getSequence.
My solution was to call the function with its namespace:
biomaRt::getSequence()
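The same kind of name clash can be reproduced with base R alone. In this sketch a user-defined sd (a made-up function, standing in for whichever package was attached last) masks stats::sd, and the :: prefix picks the intended one:

```r
# A made-up function that masks stats::sd in the global environment
sd <- function(object) "masked version called"

masked_result   <- sd(c(1, 2, 3))          # dispatches to the masking function
explicit_result <- stats::sd(c(1, 2, 3))   # namespace prefix selects the stats version

# find() lists every attached location that defines the name
locations <- find("sd")

rm(sd)  # remove the mask again
```

Running find("getSequence") or conflicts() in the affected session should show in the same way which packages define the clashing name.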

Problems following a code example. InformationValue::Woe

I'm learning new feature selection methods with this entry of a blog:
https://www.machinelearningplus.com/machine-learning/feature-selection/
Point 9. And I stumbled upon some problems. First is the CV, which I have solved.
library(InformationValue)
adult <- read.csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
sep = ',', fill = F, strip.white = T,stringsAsFactors = FALSE)
colnames(adult) <- c('age', 'WORKCLASS', 'fnlwgt', 'EDUCATION',
'educatoin_num', 'MARITALSTATUS', 'OCCUPATION', 'RELATIONSHIP', 'RACE', 'SEX',
'capital_gain', 'capital_loss', 'hours_per_week', 'NATIVECOUNTRY', 'ABOVE50K')
inputData <- adult
print(head(inputData))
But then I can't solve the next chunk
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K)
df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
And I keep getting 0 values in IV and, of course, "Not predictive" in the STRENGTH column of the data frame.
I've tried to do a
factor_vars=cat_vars
But it doesn't seem to work, and quite frankly I can't figure out why.
Just solved it. First, the argument stringsAsFactors = FALSE is unnecessary, since we need factors.
Then, after consulting the IV function and looking at the summary of the dataset, I noticed that although the response is a factor, the function requires a numeric input and cannot extract the underlying value (level) by itself. So we must work around it.
as.numeric(inputData$ABOVE50K)
This "solves it", although maybe I should change the values, since it gives 1-2 instead of the classic 0-1 response. I'm working on it.
I think there has got to be an easier solution, but:
levels(inputData$ABOVE50K)
inputData$ABOVE50K2 = as.numeric(inputData$ABOVE50K)
inputData$ABOVE50K3= ifelse(inputData$ABOVE50K2 ==1,0, ifelse(inputData$ABOVE50K2==2,1,NA))
inputData$ABOVE50K3 <- factor(inputData$ABOVE50K3)
And the output is the same. So there is no need to change the levels to 0-1.
# Choose Categorical Variables to compute Info Value.
cat_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY") # get all categorical variables
factor_vars= cat_vars
# Init Output
df_iv <- data.frame(VARS=cat_vars, IV=numeric(length(cat_vars)), STRENGTH=character(length(cat_vars)), stringsAsFactors = F) # init output dataframe
# Get Information Value for each variable
for (factor_var in factor_vars){
df_iv[df_iv$VARS == factor_var, "IV"] <- InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3)
df_iv[df_iv$VARS == factor_var, "STRENGTH"] <- attr(InformationValue::IV(X=inputData[, factor_var], Y=inputData$ABOVE50K3), "howgood")
}
# Sort
df_iv <- df_iv[order(-df_iv$IV), ]
df_iv
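For the 1-2 versus 0-1 recoding discussed above, a shorter route may be possible. A sketch on a made-up three-element response (the levels are set explicitly here so the order does not depend on the locale):

```r
# Made-up response with the two income labels; level order fixed by hand
y <- factor(c("<=50K", ">50K", "<=50K"), levels = c("<=50K", ">50K"))

codes <- as.numeric(y)              # underlying level codes: 1 2 1
y01   <- as.numeric(y) - 1          # shift the codes to 0 1 0
y01_b <- as.integer(y == ">50K")    # same result, without relying on level order
```

The comparison form states which label counts as the event, which is easier to check than the level order.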

Debug error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE]: subscript out of bounds

I'm using rpart library to build a regression tree, with the following code:
skillcraft <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00272/SkillCraft1_Dataset.csv", header = T, sep =",")
skillcraft$LeagueIndex <- factor(skillcraft$LeagueIndex)
skillcraft <- skillcraft[-1]
skillcraft$Age <- as.numeric(levels(skillcraft$Age))[skillcraft$Age]
skillcraft$TotalHours <- as.numeric(
levels(skillcraft$TotalHours))[skillcraft$TotalHours]
skillcraft$HoursPerWeek <- as.numeric(
levels(skillcraft$HoursPerWeek))[skillcraft$HoursPerWeek]
skillcraft <- skillcraft[complete.cases(skillcraft),]
library(caret)
set.seed(133)
skillcraft_sampling_vector <- createDataPartition(
skillcraft$LeagueIndex, p = 0.8, list = F)
skillcraft_train <- skillcraft[skillcraft_sampling_vector,]
skillcraft_test <- skillcraft[-skillcraft_sampling_vector,]
library(rpart)
regtree <- rpart(LeagueIndex ~., data = skillcraft_train)
regtree_predictions <- predict(regtree, skillcraft_test)
The last line of this code is throwing the error:
Error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE] :
subscript out of bounds
The message isn't very clear. I've checked that both data frames (train and test) have the same structure, and now I'm having trouble finding a way to debug this code.
Can anyone help?
Thanks in advance!
My best guess is that the problem lies in the LeagueIndex factor. This variable was provided as ordinal data (from Bronze to Professional) and converted to a character factor "1", "2", "3", etc. up to "8".
It looks like in addition to your error with rpart, you get a warning when partitioning the data based on this factor:
In createDataPartition(skillcraft$LeagueIndex, p = 0.8, list = F) :
Some classes have no records ( 8 ) and these will be ignored
Apparently there are no records with LeagueIndex of 8. This seems to happen after you select complete cases here:
skillcraft <- skillcraft[complete.cases(skillcraft),]
And all of the LeagueIndex=8 cases are removed as these will have missing data for Age, HoursPerWeek, and TotalHours (coerced to NA) when converted via as.numeric.
skillcraft[which(skillcraft$LeagueIndex == 8), c("Age", "HoursPerWeek", "TotalHours")]
     Age HoursPerWeek TotalHours
3341   ?            ?          ?
3342   ?            ?          ?
3343   ?            ?          ?
...
Assuming you still want a factor, I believe this will work if you get rid of the unused factor level, for example:
skillcraft$LeagueIndex <- droplevels(skillcraft$LeagueIndex)
before partitioning the data. (In this example you could do this on the training set only, but you would want the same factor levels in your test and train sets.)
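A small base-R illustration of the unused-level effect described above, on a made-up factor rather than the SkillCraft data:

```r
# Made-up factor with levels "1", "2" and "8"
league <- factor(c("1", "2", "8", "2"))

# Filtering out the "8" rows does not remove "8" from the levels
kept          <- league[league != "8"]
levels_before <- levels(kept)     # still contains "8"
count_eight   <- table(kept)[["8"]]  # zero observations for that level

# droplevels() discards levels that no longer occur
kept_dropped <- droplevels(kept)
levels_after <- levels(kept_dropped)
```

The empty level left behind by row filtering is the kind of class with no records that the createDataPartition warning reports.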

replacing a value in column X based on columns Y with R

i've gone through several answers and tried the following, but each attempt either yields an error or an unwanted result:
here's the data:
Network             Campaign
Moburst_Chartboost  Test Campaign
Moburst_Chartboost  Test Campaign
Moburst_Appnext     unknown
Moburst_Appnext     1065
i'd like to replace "Test Campaign" with "1055" whenever "Network" == "Moburst_Chartboost". i realize this should be very simple but trying out these:
dataset = read.csv('C:/Users/User/Downloads/example.csv')
for( i in 1:nrow(dataset)){
if(dataset$Network == 'Moburst_Chartboost') dataset$Campaign <- '1055'
}
this yields an error: Warning messages:
1: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
2: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
etc.
then i tried:
within(dataset, {
dataset$Campaign <- ifelse(dataset$Network == 'Moburst_Chartboost', '1055', dataset$Campaign)
})
this turned ALL 4 values in column "Campaign" into "1055", overwriting what was there even when the condition isn't met
also this:
dataset$Campaign[which(dataset$Network == 'Moburst_Chartboost')] <- 1055
yields this error, and replaced the values in the two first rows of "Campaign" with NA:
Warning message:
In `[<-.factor`(`*tmp*`, which(dataset$Network == "Moburst_Chartboost"), :
invalid factor level, NA generated
scratching my head here. new to R but this shouldn't be so hard :(
In your first attempt, the loop runs over the rows, but neither the condition nor the assignment is indexed by i, so every iteration tests the whole Network column and assigns to the whole Campaign column.
In your second, you're trying to assign the value "1055" to all of the 2nd column.
The way to think about it is as an if else, where if the condition in col 1 is met, col 2 is changed, otherwise it remains the same.
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
                                  "Moburst_Appnext", "Moburst_Appnext"),
                      Campaign = c("Test Campaign", "Test Campaign",
                                   "unknown", "1065"),
                      stringsAsFactors = FALSE) # keep Campaign as character so ifelse() returns the strings
dataset$Campaign <- ifelse(dataset$Network == "Moburst_Chartboost",
"1055",
dataset$Campaign)
head(dataset)
             Network Campaign
1 Moburst_Chartboost     1055
2 Moburst_Chartboost     1055
3    Moburst_Appnext  unknown
4    Moburst_Appnext     1065
You may also try dataset$Campaign[dataset$Campaign=="Test Campaign"]<-1055 to avoid the use of loops and ifelse statements, where dataset is:
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
"Moburst_Appnext", "Moburst_Appnext"),
Campaign = c("Test Campaign", "Test Campaign",
"unknown", 1065))
Try the following
dataset = read.csv('C:/Users/User/Downloads/example.csv', stringsAsFactors = F)
for( i in 1:nrow(dataset)){
if(dataset$Network[i] == 'Moburst_Chartboost') dataset$Campaign[i] <- '1055'
}
It seems you forgot the index variable. Without [i] you work on the whole column of the data frame, resulting in the warning you mentioned.
Note that I added stringsAsFactors = F to the read.csv() function to make sure the strings are indeed interpreted as strings and not factors. Using factors this would result in an error like this
In `[<-.factor`(`*tmp*`, i, value = c(NA, 2L, 3L, 1L)) :
invalid factor level, NA generated
Alternatively you can do the following without using a for loop:
idx <- which(dataset$Network == 'Moburst_Chartboost')
dataset$Campaign[idx] <- '1055'
Here, idx is a vector containing the positions where Network has the value 'Moburst_Chartboost'
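The "invalid factor level" warning mentioned above can be reproduced without the CSV file. A minimal sketch on a made-up factor, showing both the problem and the character-vector workaround:

```r
# Made-up Campaign column stored as a factor, as older read.csv defaults would do
campaign <- factor(c("Test Campaign", "unknown", "1065"))

# Assigning a value that is not an existing level produces NA (with a warning)
bad <- campaign
suppressWarnings(bad[1] <- "1055")
became_na <- is.na(bad[1])

# Converting to character first lets any string be assigned
fixed <- as.character(campaign)
fixed[1] <- "1055"
```

Reading the file with stringsAsFactors = F, as in the answer above, avoids the conversion step altogether.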
thank you for the help! not elegant, but since this lingered with me when going to sleep last night i decided to bludgeon it with some ugly code, and it worked too. just as a workaround: split into two data frames, replaced all values and then bound them back together...
# subsetting only chartboost
chartboost <- subset(dataset, dataset$Network=='Moburst_Chartboost')
# replace all values in Campaign
chartboost$Campaign <-sub("^.*", "1055",chartboost$Campaign)
#subsetting only "not chartboost"
notChartboost <-subset(dataset, dataset$Network!='Moburst_Chartboost')
# binding back to single dataframe
newSet <- rbind(chartboost, notChartboost)
Ugly as a duckling but worked :)
