Calculate Multiple Information Value in R

I am new to R programming and trying to learn part time, so I apologize in advance for the naive coding and questions. I have spent about a day trying to figure out code for this and have been unable to, hence asking here.
https://www.kaggle.com/c/titanic/data?select=train.csv
I am working on the train Titanic data set from Kaggle, imported as train_data. I have cleaned up all the columns and also converted them to factors where needed.
My question is three-fold:
1. I am unable to understand why this code gives IV values of 0 for everything. What have I done wrong?
factor_vars <- colnames(train_data)
all_iv <- data.frame(VARS = factor_vars,
                     IV = numeric(length(factor_vars)),
                     STRENGTH = character(length(factor_vars)),
                     stringsAsFactors = FALSE)
for (factor_var in factor_vars) {
  all_iv[all_iv$VARS == factor_var, "IV"] <-
    InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived)
  all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
    attr(InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived), "howgood")
}
all_iv <- all_iv[order(-all_iv$IV), ]
2. I am trying to create my own function to calculate IV values for multiple columns in one go, so that I do not have to repeat myself. However, when I run the following code I get the counts of all 0s and all 1s in the whole data set instead of counts per group, as I requested. Again, what am I doing wrong in this example?
train_data %>%
  group_by(train_data[[3]]) %>%
  summarise(zero = sum(train_data[[2]] == 0),
            one = sum(train_data[[2]] == 1))
I get output
zero one
1 549 342
2 549 342
3 549 342
whereas I would anticipate an answer like:
zero one
1 80 136
2 97 87
3 372 119
What is wrong with my code?
3. Is there any pre-built function which can give IV values for all columns? While searching I found the iv.mult function, but I cannot get it to work. Any suggestion would be great.

Let's take a look at your questions:
1.
length(factor_vars)
#> [1] 12
length() returns the number of elements of your vector factor_vars, so your code numeric(length(factor_vars)) is evaluated as numeric(12), which returns a numeric vector of length 12 filled with zeros by default.
The same applies to character(length(factor_vars)), which returns a character vector of length 12 filled with empty strings "". In other words, the zeros are simply the initial values of all_iv: if the assignments inside the loop never succeed (for example, because InformationValue::IV() errors on one of the columns and the loop aborts), those defaults are what you end up looking at.
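You can check those defaults directly:
numeric(3)
#> [1] 0 0 0
character(3)
#> [1] "" "" ""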
2. Your code doesn't use correct dplyr syntax: train_data[[2]] inside summarise() refers to the whole, ungrouped column of the original data frame, so every group gets the overall totals (549 and 342). Refer to the columns by name instead:
library(dplyr)
train_data %>%
  group_by(Pclass) %>%
  summarise(zero = sum(Survived == 0),
            one = sum(Survived == 1))
returns
# A tibble: 3 x 3
  Pclass  zero   one
   <dbl> <int> <int>
1      1    80   136
2      2    97    87
3      3   372   119
which is most likely what you are looking for.
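If you want this summary for many columns without repeating yourself, you can wrap the pipeline in a small function; here is a minimal sketch, assuming Survived is the 0/1 outcome column (the helper name count_survival is just for illustration) and using dplyr's .data pronoun to group by a column given as a string:
library(dplyr)
count_survival <- function(data, col) {
  data %>%
    group_by(.data[[col]]) %>%
    summarise(zero = sum(Survived == 0),
              one = sum(Survived == 1))
}
# one call per predictor, collected in a named list
vars <- setdiff(names(train_data), "Survived")
lapply(setNames(vars, vars), function(col) count_survival(train_data, col))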
3. I don't know the meaning of IV, so I can't recommend a pre-built function for it.

Related

Assign multiple columns via vector without recycling

I am importing measurement data as a dataframe and want to include the experimental conditions in the data, which are given in the filename. I want to add new columns to the dataframe that represent the conditions, and I want to assign the columns the value specified by the filename. Later, this will facilitate comparisons to other experimental conditions once I merge the edited dataframes from each individual sample/file.
Here is an example of my pre-existing dataframe Measurements:
Measurements <- data.frame(
X = 1:4,
Length = c(130, 150, 170, 140)
)
Here are the example vectors of variables and values that would be derived from the filename:
FileVars.vec <- c("Condition", "Plant")
FileInfo.vec <- c("aKG", "1")
Here is one way that I have solved how to do what I want:
for (i in 1:length(FileVars.vec)) {
  Measurements[FileVars.vec[i]] <- FileInfo.vec[i]
}
Which gives the desired output:
  X Length Condition Plant
1 1    130       aKG     1
2 2    150       aKG     1
3 3    170       aKG     1
4 4    140       aKG     1
But my (limited) understanding of R is that it is a vectorized language that often avoids the need for for-loops. I feel like this simpler code should work:
Measurements[FileVars.vec] <- FileInfo.vec
But instead of assigning one value to each entire column, it recycles the values within each column:
  X Length Condition Plant
1 1    130       aKG   aKG
2 2    150         1     1
3 3    170       aKG   aKG
4 4    140         1     1
Is there any way to do a similar simple assignment but without recycling, i.e. one value assigned to one full column only? I imagine there's a simple formatting fix, but I've searched for a solution for >6 hours and nowhere did I see an assignment like this. I have also thought of creating a separate dataframe of just the experimental conditions and then merging it with the actual dataframe, but that seems more roundabout to me, especially with more experimental conditions and observations than in these examples.
Also, if there is a more established pipeline/package for taking information from the filename and adding it to the data in a tidy fashion, that would be marvelous as well! The original filename would be something like:
"aKG_1.csv"
Thank you for helping an R noobie! May you receive good coding karma when debugging!
We can convert to a list and then assign, which avoids the recycling of values column-wise. As it is a list, each element is treated as a unit, and the assignment fills the respective columns by recycling each element within its own column:
Measurements[FileVars.vec] <- as.list(FileInfo.vec)
Output:
Measurements
# X Length Condition Plant
#1 1 130 aKG 1
#2 2 150 aKG 1
#3 3 170 aKG 1
#4 4 140 aKG 1
If we want to restore the column types (both new columns are character at this point), use type.convert:
Measurements <- type.convert(Measurements, as.is = TRUE)
Note that by creating a vector for FileInfo.vec, it will have a single type, i.e. character. If we instead want to keep multiple types, it can be a list:
Measurements[FileVars.vec] <- list("aKG", 1)
For the second part of the question, if we have a string
str1 <- "aKG_1.csv"
and want to create two columns from it, either use read.table or strsplit:
Measurements[FileVars.vec] <- read.table(text = tools::file_path_sans_ext(str1),
                                         sep = "_", header = FALSE)

Is there an R function to select one variable from each group (group_by()) from the dataframe?

I have a dataset where two variables interest me: trial and truth. trial numbers the questions people were asked (20 in total), and truth stands for the correct answer to each question. I want to calculate the log10() of the truth for each question. I came up with this:
logT <- data %>%
  group_by(trial) %>%
  unique(truth, incomparables = F) %>%
  summarize(log10(truth))
I'm not sure if it's the best idea to work with unique(); however, on a small dataframe the syntax works for me:
trial truth
1 1 34
2 1 34
3 2 321
4 2 321
5 3 78
6 3 78
But with the original data it keeps repeating all the rows, although they are exactly the same. So I end up with 1600 obs. instead of the 20 I'm aiming for.
I used select() to keep just the relevant variables before running the pipeline, but it still doesn't work.
Where do I go wrong, or is there a better way of doing it from scratch?
Your pipe fails because unique() is not a dplyr verb: it just drops fully duplicated rows of whatever data frame it receives, so if any other column differs between rows, nothing is removed. A dplyr way could be
library(dplyr)
data %>%
  group_by(trial) %>%
  summarise(truth = first(log10(truth)))
Or, if the logarithms are already computed (unlikely),
data %>%
  group_by(trial) %>%
  summarise(truth = first(truth))
With dplyr, we can also use distinct
library(dplyr)
distinct(data)
Use unique(mydata) or distinct(mydata). Including the log10 code we have:
mydata %>%
  distinct %>%
  mutate(truth = log10(truth))
Note
The input, mydata, in reproducible form is assumed to be:
Lines <- "trial truth
1 1 34
2 1 34
3 2 321
4 2 321
5 3 78
6 3 78"
mydata <- read.table(text = Lines)
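With that input, the distinct()/log10() pipeline above returns one row per trial:
mydata %>%
  distinct() %>%
  mutate(truth = log10(truth))
#   trial    truth
# 1     1 1.531479
# 2     2 2.506505
# 3     3 1.892095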

Converting contingency tables with counts to two-column data tables with frequency columns

I would like to enter a frequency table into an R data.table.
The data are in a format like this:
       Height
Gender   3  35
m      173 125
f      323 198
... where the entries in the table (173, 125, etc.) are counts.
I have a 2 by 2 table, and I want to turn it into a two-column data.table.
The data is from a study of birds who nest at a height. The question is whether different genders of the bird prefer certain heights.
I thought the frequency table should be turned into something like this:
Gender height N
m 3 173
m 35 125
f 3 323
f 35 198
but now I'm not so sure. Some of the models I want to run need every case itemized.
Can I do this conversion in R? Ideally, I'd like a way to switch back and forth between the two formats.
This is based on a review of ?table.
Make a data frame (x) with columns for Gender, Height, and Freq, where Freq would be your N value.
Convert that to a table by using
tabledata <- xtabs(Freq ~ ., x)
There are a number of base functions that can work with this kind of data, which is obviously much more compact than individual rows.
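A sketch of that approach with the bird counts:
# long (tidy) frequency form of the 2 x 2 table
x <- data.frame(Gender = c("m", "m", "f", "f"),
                Height = c(3, 35, 3, 35),
                Freq = c(173, 125, 323, 198))
# collapse to a contingency table (factor levels sort alphabetically)
tabledata <- xtabs(Freq ~ ., x)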
Also from ?loglin, there is this example using a table:
loglin(HairEyeColor, list(c(1, 2), c(1, 3), c(2, 3)))
Thanks, everybody (@simon and @Elin) for the help. I thought I was conducting a poll that would get answers like "start with the 4-row version" or "start with the 819-row version", and you have all given me an entire toolbox of ways to move from one to the other. It's really great, informative, and way more than the inquiry deserves.
I unquestionably need to work harder and get more explicit in forming a question. I see by the -3 rating this boondoggle has earned that I'm not adding anything to the knowledge base, so I will delete the question to keep future searchers from finding it. I've had a bad run recently with my questions, and as a former teacher of the year, writer of five books, and PhD statistician, it's extremely embarrassing to have been on Stack Exchange for as long as I have and stand here with one reputation point. One. That means that my upvotes of your answers don't count for a thing.
That reputation point should be scarlet colored.
Here's what I was getting at:
In a book, a common way to express data is in a 2×2 table:
       Height
Gender   3  35
M      173 125
F      323 198
My tic-tac-sized mind sees two ways of entering that into a data table:
require(data.table)
GENDER <- c("m","m","f","f")
HEIGHT <- c(3, 35, 3, 35)
N <- c(173, 125, 323, 198)
SANDFLIERS <- data.table(GENDER, HEIGHT, N)
That gives the four-line flat-file/tidy representation of the data:
GENDER HEIGHT N
1: m 3 173
2: m 35 125
3: f 3 323
4: f 35 198
The other option is to make an 819-row data table with 173 male@3 ft, 125 male@35 ft, etc. It's not too bad if you use the rep() command and build your table columns carefully. I hate doing arithmetic, so I leave some of these numbers bare and untotaled.
# I need 173+125 males, and 323+198 females.
# One c(rep()) for "m", one c(rep()) for "f", and one c() to merge them
gender <- c(c(rep("m", 173 + 125)), c(rep("f", (323 + 198))))
# Same here, except the c() functions are one level 'deeper'. I need two
# sets for males (at heights 3 and 35; 173 and 125 of each, respectively)
# and two sets for females (at heights 3 and 35; 323 and 198, respectively)
heights <- c(c(c(rep(3, 173)), c(rep(35, 125))), c(c(rep(3, 323)), c(rep(35, 198))))
which, when merged into a data.table, gives 819 rows, one for each observed bird.
1: m 3
2: m 3
3: m 3
4: m 3
5: m 3
---
815: f 35
816: f 35
817: f 35
818: f 35
819: f 35
Now that I have the data in two formats, I start looking for ways to do plots and analyses.
I can get a mosaic plot using the 819-row version, but you can't see it because of my 1-point reputation:
mosaicplot(table(sandfliers), color = TRUE)
Mosaic Plot
and you can get a balloon plot using the 4-row version
Balloon Plot
So my question was, for those of you with lots and lots of experience with this sort of thing: do you find the 4-row or the 819-row tables more common? I can change from one to the other, but that's more code to add to the book (again I hear my editor: "You're teaching statistics, not R").
So, as I said at the top, this was just an informal poll on whether one is used more often than the other, or whether beginners are better off with one.
This is in the form of a contingency table. It isn't easy to enter directly into R but it can be done as follows (based on http://cyclismo.org/tutorial/R/tables.html):
> f <- matrix(c(173,125,323,198),nrow=2,byrow=TRUE)
> colnames(f) <- c(3,35)
> rownames(f) <- c("m","f")
> f <- as.table(f)
> f
3 35
m 173 125
f 323 198
You can then create a count or frequency table with:
> as.data.frame(f)
Var1 Var2 Freq
1 m 3 173
2 f 3 323
3 m 35 125
4 f 35 198
The R Cookbook gives a short function to convert to a table of cases (i.e. a long list of the individual items), as follows:
> countsToCases(as.data.frame(f))
... where:
# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
  # Get the row indices to pull from x
  idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
  # Drop count column
  x[[countcol]] <- NULL
  # Get the rows from x
  x[idx, ]
}
... thus you can convert the data to the format needed by any analysis method from any starting format.
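Going back the other way is a single call: table() (or xtabs(), to keep the formula interface) rebuilds the contingency table from the itemized cases, so the two formats are fully interchangeable. A quick sketch using the objects above:
cases <- countsToCases(as.data.frame(f))
# cross-tabulate the itemized rows back into a 2 x 2 table
table(cases)
# same result via the formula interface (columns are named Var1/Var2 here)
xtabs(~ Var1 + Var2, data = cases)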
(EDIT)
Another way to read in the contingency table is to start with text like this:
> ss <- " 3 35
+ m 173 125
+ f 323 198"
> read.table(text = ss, row.names = 1)
X3 X35
m 173 125
f 323 198
Instead of using text =, you can also use a file name to read the table from (for example) a CSV file.

Unable to create exactly equal data partitions using createDataPartition in R: getting 1396 and 1398 observations but need 1397 in each

I am quite familiar with R but have never had this requirement before, where I need to create exactly equal data partitions randomly using createDataPartition in R.
library(caret)
index <- createDataPartition(final_ts$SAR, p = 0.5, list = FALSE)
final_test_data <- final_ts[index, ]
final_validation_data <- final_ts[-index, ]
This code creates two datasets with 1396 and 1398 observations respectively.
I am surprised that p = 0.5 doesn't do what it is supposed to do. Does it have something to do with the resulting datasets not being able to have an odd number of observations by default?
Thanks in advance!
It has to do with the number of cases of the response variable (final_ts$SAR in your case).
For example:
y <- rep(c(0,1), 10)
table(y)
y
0 1
10 10
# even number of cases
Now we split:
# draw the index once so train and test are exact complements
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 10 obs.
train
0 1
5 5
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
If we instead build an example with an odd number of cases per class:
y <- rep(c(0,1), 11)
table(y)
y
0 1
11 11
We have:
idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)
train <- y[idx]
table(train) # we have 12 obs.
train
0 1
6 6
test <- y[-idx]
table(test) # we have 10 obs.
test
0 1
5 5
Here is another thread which explains why the number returned from createDataPartition might seem "off" to us, but matches what the function is actually trying to do.
So, it depends on what you have in final_ts$SAR and the spread of the data.
If it is a categorical variable, e.g. T and F: say you have 100 observations in total, 55 T and 45 F. When you invoke it the way your code does, it will return 51 indices, because:
55*0.5 = 27.5 and 45*0.5 = 22.5; each result is rounded up, and 28 + 23 = 51.
You can refer to below thread which has a great explanation about this when the values you want to split are numbers.
R - caret createDataPartition returns more samples than expected
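If an exactly even 1397/1397 split matters more than preserving the class balance of SAR, a base R sketch (assuming plain random sampling without stratification is acceptable for your problem) is:
set.seed(42)  # for reproducibility; any seed works
n <- nrow(final_ts)
idx <- sample(n, size = floor(n / 2))  # exactly floor(n/2) = 1397 rows
final_test_data <- final_ts[idx, ]
final_validation_data <- final_ts[-idx, ]
With your 2794 rows this yields 1397 and 1397, but unlike createDataPartition() it does not stratify on SAR, so the class proportions may differ slightly between the halves.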

setting variable value by subsetting

this is my first question, so please bear with me
I am creating a new variable age.f.sex in my dataframe wm.13 using an already existing variable SB1. In the original dataframe, SB1 indicates the age at first sexual intercourse of the women interviewed in UNICEF's Multiple Indicator Cluster Surveys. The values that SB1 can take are:
> sort(unique(wm.13$SB1))
[1] 0 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[26] 30 31 32 33 34 35 36 37 38 39 40 41 42 44 48 95 97 99
Here is the meaning of the values SB1 can take
0 means she never had sex
97 and 99 mean "does not remember/does not know"
95 means that she had her first sexual intercourse when she started living with her husband/partner (for which there is a specific variable, i.e MA9)
Any number between 0 and 95 is the declared age at their first sexual intercourse
there are also NAs that sort() does not show, but they appear if I just use unique()
I created a new variable from SB1, which I called age.f.sex.
wm.13$age.f.sex <- wm.13$SB1
I replaced the 0, 97 and 99 values with NAs, and I kept the original NAs from SB1. I did this using the following code:
wm.13$age.f.sex[wm.13$SB1 == 0] <- NA
wm.13$age.f.sex[wm.13$SB1 == 97] <- NA
wm.13$age.f.sex[wm.13$SB1 == 99] <- NA
wm.13$age.f.sex[is.na(wm.13$SB1)] <- NA
Everything worked fine up to here. However, I am in trouble with the value 95. I want to code it so that observations with value 95 in SB1 (i.e. the age at first sexual intercourse) take their value from MA9 (i.e. the age when the woman started living with her partner/husband) in my new variable age.f.sex.
I first started with this code
> wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9
but i got the following error message
Error in wm.13$age.f.sex[wm.13$SB1 == 95] <- wm.13$MA9 :
NAs are not allowed in subscripted assignments
After some research on this website, I realised that I might need to subset the right-hand side of the code too, but honestly I do not know how to do it. I have a feeling that which() or ifelse() might come of use here, but I cannot figure out their arguments. Examples I have found on this website show how to impute one specific value, but I could not find anything on subsetting according to the value the observations take in another variable.
I hope I have been clear enough. Any suggestion will be much appreciated.
Thanks, Manolo
Perhaps you could try:
wm.13$age.f.sex <- ifelse(wm.13$SB1 %in% c(0, 97, 99) | is.na(wm.13$SB1), NA,
                          ifelse(wm.13$SB1 == 95, wm.13$MA9, wm.13$SB1))
In short, it works like this: The code checks whether wm.13$SB1 is 0, 97, 99 or missing, and then returns NA. Subsequently, it checks whether wm.13$SB1 is 95, and if so, it returns the value on that row in the MA9 column. In all other cases it returns the SB1 value. Because of "wm.13$age.f.sex <-" at the beginning of the line the return values are assigned to your new age.f.sex variable.
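A quick toy check of that logic (made-up values, not the real survey data):
toy <- data.frame(SB1 = c(0, 16, 95, 97, NA),
                  MA9 = c(20, 21, 22, 23, 24))
ifelse(toy$SB1 %in% c(0, 97, 99) | is.na(toy$SB1), NA,
       ifelse(toy$SB1 == 95, toy$MA9, toy$SB1))
# [1] NA 16 22 NA NA   <- 0/97/NA become NA, 95 picks up MA9 (22)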
As the error message indicates, it is not possible to do subscripted assignments when the filter contains NAs. A way to circumvent this is to explicitly include NA as a factor level. The following example illustrates a possible way to replace 95s with their corresponding value in a second column.
# example dataframe
df <- data.frame(a = c(NA, 3, 95, NA),
                 b = 1:4)
# set a to factor with NA as one of the levels (besides those in a and b)
df$a <- factor(df$a, levels = union(df$a, df$b), exclude = NULL)
# subscripted assignment (don't forget to filter b too!)
df$a[df$a == 95] <- df$b[df$a == 95]
# restore to numeric
df$a <- as.numeric(as.character(df$a))
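An alternative that sidesteps the NA problem without the factor detour (a base R sketch, not what the answer above does) is to compute integer indices first with %in%, which returns FALSE rather than NA for missing values:
df <- data.frame(a = c(NA, 3, 95, NA), b = 1:4)
idx <- which(df$a %in% 95)  # %in% never yields NA, so the subscript is clean
df$a[idx] <- df$b[idx]
df$a
# [1] NA  3  3 NA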
