Loop to create new variable based on answers to other variables - r

I would like to create a new variable based on the answers to three other variables (A, B and C) in my data set. Each variable has three levels: "often", "sometimes" and "never". The new variable should give each individual a score ranging from 0 to 3: for each variable (A, B and C), an answer of "often" earns 1 point and any other answer earns 0.
My data set looks like this with "often" coded with 2 ; "sometimes" coded with 1 and "never" coded with 0.
A <- c(2,1,1,NA, 0,2)
B <- c(2,2,0,1,2,NA)
C <- c(2,1,NA,2,1,0)
data <- data.frame(A,B,C)
I know I could use case_when but it is a rather unwieldy solution. I was thinking of a loop but I never used loops in R. Can you help me with this loop?

Do you mean something like this?
Update: thanks to markus. His solution (rowSums(data == 2, na.rm = TRUE)) is much better than my original.
base R
data$points = rowSums(data == 2, na.rm = TRUE)
dplyr
library(dplyr)
data %>% mutate(points = rowSums(data == 2, na.rm = TRUE))
data.table
library(data.table)
setDT(data)
data[, points:=rowSums(data == 2, na.rm = TRUE)]
Output
> data
A B C points
1 2 2 2 3
2 1 2 1 1
3 1 0 NA 0
4 NA 1 2 1
5 0 2 1 1
6 2 NA 0 1
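Since the question asked for a loop, here is a minimal sketch of the same computation as an explicit for loop over the three columns. It assumes the coding from the question (2 means "often") and treats a missing answer as 0 points; the intermediate name points is only for illustration. The rowSums version above remains the shorter option.

```r
A <- c(2, 1, 1, NA, 0, 2)
B <- c(2, 2, 0, 1, 2, NA)
C <- c(2, 1, NA, 2, 1, 0)
data <- data.frame(A, B, C)

# start every individual at 0 points
points <- integer(nrow(data))
for (col in c("A", "B", "C")) {
  # TRUE counts as 1; an NA answer is treated as 0 points
  points <- points + (!is.na(data[[col]]) & data[[col]] == 2)
}
data$points <- points
```

The loop runs once per column rather than once per row, so each step is still a vectorized comparison over all individuals.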

Related

Creating a function looping over each row in R

I want to write a function that creates a new column with rowmeans for Columns 1-3, only if more than 2 questions for Columns 1-3 per row were answered, otherwise print 'N'.
Here is my dataframe:
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1), Manager3 = c(NA , 4, 2), Team1 = c(3, 4, 1))
Desired output:
Manager1 Manager2 Manager3 Team1 mean_score
       1        3       NA     3          N
       3        4        4     4    3.66667
       3        1        2     1          2
My code is as follows, but it's not working:
#create function
mean_score <- function(x) {
for (i in 1:nrow(test)){
if (sum(test[i, x] != "NA", na.rm = TRUE) >2){
test$mean_score[i] <- rowMeans(test[i, x], na.rm = TRUE)
} else
test$mean_score[i] <- print("N")
}
}
#compute function
mean_score(1:3)
What am I missing? Suggestions on better code are welcome too.
I think it is not ideal to put a character together with a numeric value, since it will convert the whole column into character. However, if that is what you want:
my_sum <- function(x, min = 2) {
  s <- mean(x, na.rm = TRUE)       # get the mean
  no_na <- sum(!is.na(x))          # count the number of non-NAs
  if (no_na > min) s else "N"      # return the mean if enough non-NAs
}
test$mean <- apply(test[,1:3],1,my_sum)
test
Manager1 Manager2 Manager3 Team1 mean
1 1 3 NA 3 N
2 3 4 4 4 3.66666666666667
3 3 1 2 1 2
str(test)
'data.frame': 3 obs. of 5 variables:
$ Manager1: num 1 3 3
$ Manager2: num 3 4 1
$ Manager3: num NA 4 2
$ Team1 : num 3 4 1
$ mean : chr "N" "3.66666666666667" "2"
You can simply use rowMeans, which returns NA for any row that contains an NA. With three columns, this is equivalent to the requirement that more than 2 questions in Columns 1-3 were answered.
test$mean_score <- rowMeans(test[,1:3])
# Manager1 Manager2 Manager3 Team1 mean_score
#1 1 3 NA 3 NA
#2 3 4 4 4 3.666667
#3 3 1 2 1 2.000000
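If the threshold ever differs from "all columns answered", a vectorized variant can count the answered questions per row first. This is only a sketch: the names cols and answered are illustrative, and it keeps the column numeric by using NA instead of "N".

```r
test <- data.frame(Manager1 = c(1, 3, 3), Manager2 = c(3, 4, 1),
                   Manager3 = c(NA, 4, 2), Team1 = c(3, 4, 1))

cols <- 1:3                                   # columns to average
answered <- rowSums(!is.na(test[, cols]))     # answered questions per row
# keep the mean only where more than 2 questions were answered
test$mean_score <- ifelse(answered > 2,
                          rowMeans(test[, cols], na.rm = TRUE),
                          NA)
```

Changing the 2 in answered > 2 adjusts the minimum number of answers without touching the rest of the code.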
GKi's answer is simpler and is the one you should use, but here is how I changed your code so that it works.
Generally, when writing a function you want the input to be the data frame (in this case test) and to build the rest of the function from there.
Another important thing of note is you probably want to make a vector of values first and then attach said vector to the dataframe as I do in the code below, but you need to make sure you create an empty vector object to do so. R doesn't really let you slowly add cell data to a dataframe, it prefers that a vector (which can be added to) of equal length be joined to it.
Also you don't need to use print() to insert a character into a vector either.
Hope this helps explain why your function was having issues, but frankly GKi's answer is better for general R use!
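mean_score <- function(x, cols = 1:3) {
  # build the result as a separate vector, then attach it to the data frame
  mean_score <- vector()
  for (i in seq_len(nrow(x))) {
    # more than 2 answered questions among the chosen columns
    if (sum(!is.na(x[i, cols])) > 2) {
      mean_score[i] <- rowMeans(x[i, cols], na.rm = TRUE)
    } else {
      mean_score[i] <- "N"
    }
  }
  x$mean_score <- mean_score
  return(x)
}
# the function returns a modified copy, so assign the result back
test <- mean_score(test)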

How to use a complicated result from paste inside data.table's "i" part?

Say I have a data table and I want to calculate a new variable based on several conditions of the old variables like this:
library(data.table)
test <- data.table(a = c(1,1,0), b = c(0,1,0), c = c(1,1,1))
test[a==1 & b==1 & c==1,test2:=1]
But I actually have many more conditions (all combinations of the different variables) which also have a different length. I draw those from a list such as:
conditions<-list(c("a","b","c"), c("b","c"))
and then I want to loop through that list and build a character vector like this (with which I want to do something before deleting it and going to the next element of the list):
mystring <- paste0(paste0(conditions[[1]], collapse = "==1 & "), "==1")
But how can I use "mystring" inside the data.table? as.function() or get() or eval() don't seem to work. Something like:
test[mystring,test3:=1]
is what I'm looking for.
For the given use case, you may use join with on = to achieve the desired goal without having to create and evaluate complex strings of conditions.
Instead of
test[a==1 & b==1 & c==1, test2 := 1][]
we can write
test[.(1, 1, 1), on = c("a", "b", "c"), test2 := 1][]
# a b c test2
#1: 1 0 1 NA
#2: 1 1 1 1
#3: 0 0 1 NA
Now, the OP had requested to loop over a list of conditions using lapply() "to do something". This can be achieved as follows
# create list of conditions for subsetting
col = list(c("a","b","c"), c("b","c"))
val = list(c(1, 1, 1), c(0, 1))
# loop over conditions
lapply(seq_along(col), function(i) test[as.list(val[[i]]), on = col[[i]], test2 := i])
#[[1]]
#
#[[2]]
# a b c test2
#1: 1 0 1 2
#2: 1 1 1 1
#3: 0 0 1 2
Note that the output of lapply() is not used because test has been modified in place:
test
# a b c test2
#1: 1 0 1 2
#2: 1 1 1 1
#3: 0 0 1 2
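If the conditions really do have to be built as strings, one commonly used option is to parse the string and evaluate it in i with eval(parse(text = ...)). The sketch below assumes the data.table package is installed and reuses test, conditions and mystring from the question; writing the loop index k into test3 is only for illustration.

```r
library(data.table)

test <- data.table(a = c(1, 1, 0), b = c(0, 1, 0), c = c(1, 1, 1))
conditions <- list(c("a", "b", "c"), c("b", "c"))

for (k in seq_along(conditions)) {
  # builds e.g. "a==1 & b==1 & c==1"
  mystring <- paste0(paste0(conditions[[k]], collapse = "==1 & "), "==1")
  # parse the string into an expression and evaluate it as the row filter
  test[eval(parse(text = mystring)), test3 := k]
}
```

Later conditions overwrite earlier ones for rows that match both. Evaluating parsed strings is harder to debug than the join shown above, so the join is usually preferable when the conditions are simple equalities.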

R- create new dataframe variable from subset of two variables with missing data NA

I have a simple example data frame with two data columns (data1 and data2) and two grouping variables (Measure 1 and 2). Measure 1 and 2 have missing data NA.
d <- data.frame(Measure1 = 1:2, Measure2 = 3:4, data1 = 1:10, data2 = 11:20)
d$Measure1[4]=NA
d$Measure2[8]=NA
d
Measure1 Measure2 data1 data2
1 1 3 1 11
2 2 4 2 12
3 1 3 3 13
4 NA 4 4 14
5 1 3 5 15
6 2 4 6 16
7 1 3 7 17
8 2 NA 8 18
9 1 3 9 19
10 2 4 10 20
I want to create a new variable (d$new) that contains data1, but only for rows where Measure1 equals 1. I tried this and get the following error:
d$new[d$Measure1 == 1] = d$data1[d$Measure1 == 1]
Error in d$new[d$Measure1 == 1] = d$data1[d$Measure1 == 1] : NAs
are not allowed in subscripted assignments
Next I would like to add to d$new the data from data2 only for rows where Measure2 equals 4. However, the missing data in Measure1 and Measure2 is causing problems in subsetting the data and assigning it to a new variable. I can think of some overly complicated solutions, but I'm sure there's an easy way I'm not thinking of. Thanks for the help!
Find rows where Measure1 is not NA and is the value you want.
measure1_notNA = which(!is.na(d$Measure1) & d$Measure1 == 1)
Initialize your new column with some default value.
d$new = NA
Replace only those rows with corresponding values from data1 column.
d$new[measure1_notNA] = d$data1[measure1_notNA]
Or, in 1 line:
d$new[d$Measure1 == 1 & !is.na(d$Measure1)] = d$data1[d$Measure1 == 1 & !is.na(d$Measure1)]
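As a small variation on the same idea, which() already drops positions where the comparison is NA, so the explicit is.na() check can be omitted. The sketch below also covers the second step from the question (data2 where Measure2 equals 4); idx1 and idx2 are illustrative names.

```r
d <- data.frame(Measure1 = 1:2, Measure2 = 3:4, data1 = 1:10, data2 = 11:20)
d$Measure1[4] <- NA
d$Measure2[8] <- NA

d$new <- NA
idx1 <- which(d$Measure1 == 1)   # which() silently skips NA comparisons
d$new[idx1] <- d$data1[idx1]
idx2 <- which(d$Measure2 == 4)   # second step: data2 where Measure2 is 4
d$new[idx2] <- d$data2[idx2]
```

In this example no row satisfies both conditions, so the order of the two assignments does not matter; if rows could match both, the later assignment would win.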
Based on the description, it seems that the OP wants to create a column 'new' based on two columns, i.e. when Measure1==1, get the corresponding elements of 'data1'; similarly, when Measure2==4, get the corresponding 'data2' values; and fill the rest with NA. We can use ifelse
d$new <- with(d, ifelse(Measure1==1 & !is.na(Measure1), data1,
ifelse(Measure2==4, data2, NA)))
We could also do this with data.table by assigning (:=) in two steps. Convert the 'data.frame' to 'data.table' (setDT(d)). Based on the logical condition (Measure1==1 & !is.na(Measure1)), we assign the column 'new' as 'data1'. This creates the column with values from 'data1' for the rows where the logical condition is TRUE and NA for the rest. In the second step, we do the same using 'Measure2/data2'.
library(data.table)
setDT(d)[Measure1==1 & !is.na(Measure1), new:= data1]
d[Measure2==4, new:= data2]

How do I take subsets of a data frame according to a grouping in R?

I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x, label) {
  cbind(x[sample(1:nrow(x), 1), ], data.frame(state = label))
}
df <- ddply(df[, c("group1", "group2", "value")],
            .(group1, group2),
            choice,
            label = "test")
Note that in this case, I am also adding an extra column to the data frame, named "state" and filled from the extra label argument passed through ddply. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance Comparison: Data Table is ~ 15 X faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
  dt[, list(value = value[sample(length(value), 1)]), 'group1, group2']
}
f2_plyr = function() {
  ddply(df, .(group1, group2), summarize,
        value = value[sample(length(value), 1)])
}
f3_by = function() {
  do.call(rbind, by(df, list(grp1 = df$group1, grp2 = df$group2),
                    FUN = function(x) x[sample(nrow(x), 1), ]))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample within groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]
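Another base R possibility, sketched here without any timing claims, is to shuffle the rows once and then keep the first row of each group1/group2 combination with duplicated(). After a random permutation, the first row seen in each group is a uniformly random member of that group. The names shuffled and picked are illustrative.

```r
group1 <- c("a","b","a","a","b","c","c","c","c",
            "c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
           "banana","durian","lemon","lime",
           "raspberry","durian","peach","nectarine",
           "banana","lemon","guava","blackberry","grape")
df <- data.frame(group1, group2, value)

set.seed(1234)
shuffled <- df[sample(nrow(df)), ]                      # random row order
# keep the first occurrence of each (group1, group2) pair
picked <- shuffled[!duplicated(shuffled[c("group1", "group2")]), ]
picked <- picked[order(picked$group1, picked$group2), ]  # tidy ordering
```

This avoids calling a function once per group, which is typically where the per-group approaches spend their time on large inputs.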

Rep values from a data frame to another data frame. apply? sapply?

I have the following data frame
data<-data.frame(ID=c("a", "b", "c", "d"), zeros=c(3,2,5,4), ones=c(1,1,2,1))
ID zeros ones
1 a 3 1
2 b 2 1
3 c 5 2
4 d 4 1
and I wish to create another data frame with 2 columns:
First column(id) the ID is repeated (zero+ones) times
Second column value should be the c(rep(0, zeros), rep(1, ones))
so that the result would be
id value
1 a 0
2 a 0
3 a 0
4 a 1
5 b 0
6 b 0
7 b 1
8 c 0
9 c 0
10 c 0
11 c 0
12 c 0
13 c 1
14 c 1
15 d 0
16 d 0
17 d 0
18 d 0
19 d 1
I tried data.frame(id=(rep(data$ID, (data$zeros+data$ones))), value=c(rep(0, data$zeros), rep(1, data$ones))) but it doesn't work. Any ideas? Thank you in advance
This is perhaps overkill, using ddply from the plyr package, but it's the first thing that came to me:
ddply(dat,.(ID),function(x){data.frame(value = rep(c(0,1),times = c(x$zeros,x$ones)))})
Oh and I changed the name of your data frame to dat to avoid a bad habit (data is the name of an oft used function).
Here's a base R solution. I prefer the overkill of plyr myself:
dat <- data.frame(ID = letters[1:4], zeros = c(3,2,5,4), ones = c(1,1,2,1))
do.call("rbind"
, apply(dat, 1, function(x)
data.frame(cbind(id = x[1], value = rep(0:1, times = x[2:3])))
)
)
Since you've already got a base R solution for the first column, this is one for your second column:
lengths<-as.vector(t(as.matrix(data[,2:3]))) #notice the t
what<-rep(c(0,1), nrow(data))
times<-rep(what, lengths)
Edit: changed a minor thing above and tested it. It works now.
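Putting both columns together, a minimal base R sketch could look like the following. The attempt in the question fails because rep() expects times to be either a single value or one count per element of x, so rep(0, data$zeros) cannot take four counts for a single 0. Interleaving the counts row by row avoids that; counts and result are illustrative names, and dat is used instead of data as suggested above.

```r
dat <- data.frame(ID = c("a", "b", "c", "d"),
                  zeros = c(3, 2, 5, 4),
                  ones = c(1, 1, 2, 1))

# interleave the counts row by row: zeros, ones, zeros, ones, ...
counts <- as.vector(t(as.matrix(dat[, c("zeros", "ones")])))

result <- data.frame(
  id    = rep(dat$ID, times = dat$zeros + dat$ones),
  # one 0/1 pair per row of dat, each repeated by its own count
  value = rep(rep(0:1, times = nrow(dat)), times = counts)
)
```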
I also prefer the plyr method, but I thought I'd throw another base R solution related to reshaping the data first, and then replicating it. (also using dat instead of data):
names(dat)[2:3] <- c("times.0", "times.1")
tmp <- reshape(dat, varying=2:3, direction="long")
tmp <- tmp[rep(seq(length=nrow(tmp)),tmp$times),c("ID","time")]
names(tmp) <- c("id","value")
tmp <- tmp[order(tmp$id, tmp$value),]
rownames(tmp) <- NULL
Not as elegant as some of the other base solutions because it requires intermediate storage, but possibly interesting.
