I'm cleaning up some survey data in R, assigning variables the value 1 or 0 based on the responses to a question. Say I had a question with 3 options (a, b, c), and a data frame with the responses and one 0/1 indicator variable per option:
df <- data.frame(a = rep(0,3), b = rep(0,3), c = rep(0,3), response = I(list(c(1),c(1,2),c(2,3))))
So I want to change the 0s to 1s if the response matches the column index (i.e. 1=a, 2=b, 3=c).
This is fairly easy to do with a loop:
for (i in seq_len(nrow(df))) df[i,df[i,"response"][[1]]] <- 1
Is there any way to do this with an apply/lapply/sapply/etc? Something like:
df <- sapply(df,function(x) x[x["response"][[1]]] <- 1)
Or should I stick with a loop?
You can use matrix indexing, from ?[:
A third form of indexing is via a numeric matrix with the one column
for each dimension: each row of the index matrix then selects a single
element of the array, and the result is a vector. Negative indices are
not allowed in the index matrix. NA and zero values are allowed: rows
of an index matrix containing a zero are ignored, whereas rows
containing an NA produce an NA in the result.
# construct a matrix representing the index where the value should be one
idx <- with(df, cbind(rep(seq_along(response), lengths(response)), unlist(response)))
idx
# [,1] [,2]
#[1,] 1 1
#[2,] 2 1
#[3,] 2 2
#[4,] 3 2
#[5,] 3 3
# do the assignment
df[idx] <- 1
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
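For what it's worth, here is a minimal sketch of the same idea on a plain matrix, which may make it easier to see that each row of the index matrix addresses exactly one cell. The names m and response are only for illustration and are not part of the answer above.

```r
# 3 respondents x 3 options, all zero to start with
m <- matrix(0, nrow = 3, ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
response <- list(1, c(1, 2), c(2, 3))

# one (row, column) pair per selected option
idx <- cbind(rep(seq_along(response), lengths(response)), unlist(response))

# sets m[1,1], m[2,1], m[2,2], m[3,2] and m[3,3] in one assignment
m[idx] <- 1
m
```

A respondent with an empty response simply contributes no rows to idx, so that row of m should stay all zero.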
Or you can try this:
library(tidyr)
library(dplyr)
df1 <- df %>% mutate(Id = row_number()) %>% unnest(response)
df[, 1:3] <- table(df1$Id, df1$response)
a b c response
1 1 0 0 1
2 1 1 0 1, 2
3 0 1 1 2, 3
Perhaps this helps
df[1:3] <- t(sapply(df$response, function(x) as.integer(names(df)[1:3] %in% names(df)[x])))
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
Or a compact option is
library(qdapTools)
df[1:3] <- mtabulate(df$response)
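If you would rather not pull in an extra package, something similar can probably be done with base R's tabulate(). This is only a sketch: response and ind are illustrative names, and it assumes the responses are positive integers no larger than the number of options.

```r
response <- list(1, c(1, 2), c(2, 3))

# tabulate() counts how often each of 1..nbins occurs in a vector;
# sapply() returns one column per respondent, so transpose the result
ind <- t(sapply(response, tabulate, nbins = 3))
colnames(ind) <- c("a", "b", "c")
ind
```

Note that these are counts, so a response that lists the same option twice would give 2 rather than 1.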
Related
I have a data frame where each observation is spread across two columns: columns 1 and 2 represent individual 1, columns 3 and 4 represent individual 2, and so on.
Basically, what I want to do is add each pair of contiguous columns so that I get each individual's real score.
In this example V1 and V2 represent individual I, and V3 and V4 represent individual II. So the resulting data frame will have half the number of columns and the same number of rows, and each value will be the sum of the values in two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))
for (i in seq(2,ncol(a),by=2)){
for (k in 1:nrow(a)){
b[k,i] <- a[k,i] + a[k,i-1]
}
}
b <- as.data.frame(b)
b <- b[,-c(seq(1,length(b),by=2))]
Is there a way to make it simpler?
We could use split.default to split the data and then do rowSums by looping over the list
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
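A small sketch of what the grouping vector looks like, in case the gl() call is hard to read. Here grp is built with integer division instead, and the Roman-numeral column names come from utils::as.roman(); both are my own choices for illustration, not part of the answer above.

```r
a <- data.frame(V1 = c(0, 1, 0, 0), V2 = c(0, 0, 1, 1),
                V3 = c(1, 0, 1, 0), V4 = c(1, 0, 1, 1))

# columns 1,2 -> group 1; columns 3,4 -> group 2
grp <- (seq_along(a) + 1) %/% 2

# split the columns by group and sum each group row-wise
res <- as.data.frame(sapply(split.default(a, grp), rowSums))

# label the groups I, II, ... as in the desired output
names(res) <- as.character(utils::as.roman(seq_along(res)))
res
```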
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
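To make the recycling explicit, the logical index can be spelled out with rep_len(). This is just a sketch of the same technique; the variable odd and the even-column check are additions of mine.

```r
a <- data.frame(V1 = c(0, 1, 0, 0), V2 = c(0, 0, 1, 1),
                V3 = c(1, 0, 1, 0), V4 = c(1, 0, 1, 1))

# pairing only makes sense when the number of columns is even
stopifnot(ncol(a) %% 2 == 0)

# c(TRUE, FALSE) recycled to the number of columns: TRUE FALSE TRUE FALSE
odd <- rep_len(c(TRUE, FALSE), ncol(a))

# odd-numbered columns plus their even-numbered neighbours
res <- a[odd] + a[!odd]
names(res) <- paste0("col", seq_along(res))
res
```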
dplyr's approach with row-wise operations (rowwise is a special type of grouping per row)
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a%>%
rowwise()%>%
transmute(I=sum(c(V1,V2)),
II=sum(c(V3,V4)))
or alternatively with a built-in row-wise variant of the sum
a %>% transmute(I = rowSums(across(1:2)),
II = rowSums(across(3:4)))
I have a data frame that contains many columns with almost identical names, like A and A...1, B and B...1, and so on. I would like to combine these columns so that A and A...1 become one column. All these columns contain 0, 1 or NA, and NAs should be treated as zeros (0). So if column A is 0,0,1,1,NA and column A...1 is 1,0,0,0,1, combined_A should be 1,0,1,1,1. In other words, if an element is 1 in either column, it should be 1 in the combined column.
Here's some code to produce example
original_table <- data.frame(A = c(0,0,1,1,NA),B = c(1,1,NA,NA,1),A...1 = c(1,0,0,0,1),B...1 = c(0,1,0,1,1))
So the original table looks like this
A B A...1 B...1
0 1 1 0
0 1 0 1
1 NA 0 0
1 NA 0 1
NA 1 1 1
The desired output table would look like this after combining.
combined_table <- data.frame(combined_A = c(1,0,1,1,1),combined_B = c(1,1,0,1,1))
combined_A combined_B
1 1
0 1
1 0
1 1
1 1
I'm fairly familiar with R, but I couldn't find any help for this problem.
We can use split.default to split based on common part in the column names. In this example, it seems we can find common columns by extracting the first letter of each column name.
substr(names(original_table), 1, 1)
#[1] "A" "B" "A" "B"
We use this to split columns and in each group use pmax to get max value in each row removing NA
as.data.frame(lapply(split.default(original_table,
substr(names(original_table), 1, 1)), function(x)
do.call(pmax, c(x, na.rm = TRUE))))
# A B
#1 1 1
#2 0 1
#3 1 0
#4 1 1
#5 1 1
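If the real column names are longer than one letter, grouping on the first character may not be enough. A possible variation, assuming the duplicates always carry a "...<number>" suffix as in the example, is to strip that suffix with sub() and split on the result. The names base_name and combined are mine.

```r
original_table <- data.frame(A = c(0, 0, 1, 1, NA), B = c(1, 1, NA, NA, 1),
                             A...1 = c(1, 0, 0, 0, 1), B...1 = c(0, 1, 0, 1, 1))

# drop a trailing "...<digits>" so that A and A...1 share the key "A"
base_name <- sub("\\.\\.\\.[0-9]+$", "", names(original_table))

# row-wise maximum within each group, ignoring NA
combined <- as.data.frame(lapply(split.default(original_table, base_name),
                                 function(x) do.call(pmax, c(x, na.rm = TRUE))))
names(combined) <- paste0("combined_", names(combined))
combined
```

One caveat: if every column in a group is NA for some row, pmax() still returns NA there, so you may want to replace remaining NAs with 0 afterwards.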
Another solution (it uses stringr and magrittr, so it is not pure base R):
Find the base column names:
library(stringr)
library(magrittr)
initial_col <- str_extract(names(original_table),"[A-Z]")%>%
unique()
> initial_col
[1] "A" "B"
Then for all columns containing these names (grep(col, names(original_table), value = TRUE)), compute the row sums and transform them into a binary output:
sapply(initial_col,function(col){
tmp <- original_table[,grep(col,names(original_table),value = TRUE)] %>%
rowSums(na.rm = TRUE)
ifelse( tmp > 0,1,0)
})
A B
[1,] 1 1
[2,] 0 1
[3,] 1 0
[4,] 1 1
[5,] 1 1
I have a dataframe that is similar to a simplified version below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
df<-data.frame(MO1,MO2,MO3)
df
I am trying to create a new variable that would scan through the observations looking for all the 1 values. I would then like the observations in this new variable to take on the name of the column variable that it was obtained from, see below:
MO1<-c("0","1","2","3")
MO2<-c("1","0","3","2")
MO3<-c("3","2","1","0")
MOTIVATION<-c("MO2","MO1","MO3","")
df2<-data.frame(MO1,MO2,MO3,MOTIVATION)
df2
Sorry, I do not know how to just show the resulting data frame, df2 from above.
I have 989 observations and 19 different MO.. variables in my dataset.
Another option
> ind <- which(df==1, arr.ind = TRUE)
> df2 <- df # just cloning df
> df2$MOTIVATION <- NA
> df2$MOTIVATION[ind[,1]] <- names(df) [ind[,2]]
> df2
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
An option is to use apply in combination with which as:
df$MOTIVATION <- apply(df,1,function(x)names(df)[which(x==1)])
df
# MO1 MO2 MO3 MOTIVATION
# 1 0 1 3 MO2
# 2 1 0 2 MO1
# 3 2 3 1 MO3
# 4 3 2 0
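One thing to watch with the apply/which version: when a row has no 1 (or more than one), the results have different lengths, so apply() returns a list and MOTIVATION becomes a list column. A possible variation, sketched here under the assumption that an empty string is acceptable for "no match", is to paste the matching names together so every row yields exactly one string. The name hits is mine.

```r
df <- data.frame(MO1 = c("0", "1", "2", "3"),
                 MO2 = c("1", "0", "3", "2"),
                 MO3 = c("3", "2", "1", "0"))

# logical matrix: TRUE where the cell equals 1
hits <- df == 1

# one string per row; several matches are joined with a comma,
# no match gives ""
df$MOTIVATION <- unname(apply(hits, 1, function(r) {
  paste(colnames(hits)[r], collapse = ",")
}))
df
```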
1) Try max.col like this. Insert a 1 in front of each row and then find the column of the last 1. Subtract 1 so that it corresponds to the original column numbers and a missing 1 gives 0. Then replace all zeros with NA and look up the corresponding column names.
ix <- max.col(cbind(1, df) == 1, "last") - 1
transform(df, MOTIVATION = names(df)[replace(ix, ix == 0, NA)])
giving:
MO1 MO2 MO3 MOTIVATION
1 0 1 3 MO2
2 1 0 2 MO1
3 2 3 1 MO3
4 3 2 0 <NA>
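In case the one-liner is hard to follow, here is the same computation split into steps so the intermediate values can be inspected. The variable names hit, last and motivation are mine; the logic is meant to be the same as in 1) above.

```r
df <- data.frame(MO1 = c("0", "1", "2", "3"),
                 MO2 = c("1", "0", "3", "2"),
                 MO3 = c("3", "2", "1", "0"))

# prepend a column of 1s, so every row has at least one TRUE
hit <- cbind(1, df) == 1

# position of the last TRUE in each row (1 means only the dummy column matched)
last <- max.col(hit, "last")

# shift back to df's column numbers; 0 now means "no 1 in this row"
ix <- last - 1
ix

# 0 becomes NA, everything else is looked up in the column names
motivation <- names(df)[replace(ix, ix == 0, NA)]
motivation
```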
2) A variation would be the following. We compute max.col and then multiply each result by 1 if there is a 1 in that row or NA if not.
df1 <- df == 1
transform(df, MOTIVATION = names(df)[max.col(df1) * match(rowSums(df1), 1)])
The following does the trick. Note that it supports the case where two columns contain "1"; I'm not sure whether that is a valid edge case for you.
(I slightly modified the original data and added MO4 so that it contains two "1" values.)
MO1<-c("0","1","2","3")
MO2<-c("1","2","3","2")
MO3<-c("3","2","1","0")
MO4<-c("3","2","1","1")
df<-data.frame(MO1,MO2,MO3,MO4)
df
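# for one row, return the names of the columns whose value is "1"
# (assumes the columns are named MO1, MO2, ... in order)
findx <- function(dfx)
{
idx <- which(dfx=="1")
res <- lapply(idx, function(x) paste0('MO', x))
res
}
found <- apply(df,1,findx) # apply over rows so that idx holds column positions
newdf <- unlist(found, use.names = FALSE)
newdf
With an output of
"MO2" "MO1" "MO3" "MO4" "MO4"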
I have a set of data on which respondents were given a series of questions, each with five response options (e.g., 1:5). Given those five options, I have a scoring key for each question, where some responses are worth full points (e.g., 2), others half points (1), and others no points (0). So, the data frame is n (people) x k (questions), and the scoring key is a k (questions) x m (responses) matrix.
What I am trying to do is to programmatically create a new dataset of the rescored items. Trivial dataset:
x <- sample(c(1:5), 50, replace = TRUE)
y <- sample(c(1:5), 50, replace = TRUE)
z <- sample(c(1:5), 50, replace = TRUE)
dat <- data.frame(cbind(x,y,z)) # 3 items, 50 observations (5 options per item)
head(dat)
x y z
1 3 1 2
2 2 1 3
3 5 3 4
4 1 4 5
5 1 3 4
6 4 5 4
# Each option is scored 0, 1, or 2:
key <- matrix(sample(c(0,0,1,1,2), size = 15, replace = TRUE), ncol=5)
key
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 1 2
[2,] 2 1 1 1 2
[3,] 2 2 1 1 2
Some other options, firstly using Map:
data.frame(Map( function(x,y) key[y,x], dat, seq_along(dat) ))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Secondly using matrix indexing on key:
newdat <- dat
newdat[] <- key[cbind( as.vector(col(dat)), unlist(dat) )]
newdat
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
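Since the question's data is random, here is a small reproducible sketch of the matrix-indexing idea using the six rows and the key printed in the question. Each row of idx is a (question, chosen option) pair, which is exactly a (row, column) address into key. The names idx and scored are mine.

```r
dat <- data.frame(x = c(3, 2, 5, 1, 1, 4),
                  y = c(1, 1, 3, 4, 3, 5),
                  z = c(2, 3, 4, 5, 4, 4))
key <- matrix(c(0, 0, 0, 1, 2,
                2, 1, 1, 1, 2,
                2, 2, 1, 1, 2), nrow = 3, byrow = TRUE)

# column 1: which question (row of key); column 2: which option (column of key)
idx <- cbind(as.vector(col(dat)), unlist(dat))

# keep the shape and names of dat, replace the values with the looked-up scores
scored <- dat
scored[] <- key[idx]
scored
```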
Things would be even simpler if you specified key as a list:
key <- list(x=c(0,0,0,1,2),y=c(2,1,1,1,2),z=c(2,2,1,1,2))
data.frame(Map("[",key,dat))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
For posterity, I was discussing this issue with a friend, who suggested another approach. The benefit of this is that it still uses mapvalues() to do the rescoring but does not require a for loop; instead, sapply passes the column index to the function, which uses it to select both the data column and the matching row of the key.
library(plyr)
scored <- sapply(1:ncol(dat), function(x, dat, key){
mapvalues(dat[,x], from = 1:ncol(key), to = key[x,])
}, dat = dat, key = key)
My current working approach uses two pieces: 1) mapvalues, from the plyr package, to do the heavy lifting: it takes a vector of data to modify and two additional parameters, "from", the original values (here 1:5), and "to", the values we want to convert them to; and 2) a for loop with index notation, in which we cycle through the questions, extract the vector for each using the current loop value, and use the same value to select the proper row from the scoring key.
library(plyr)
newdat <- matrix(data=NA, nrow=nrow(dat), ncol=ncol(dat))
for (i in 1:3) {
newdat[,i] <- mapvalues(dat[,i], from = c(1,2,3,4,5),
to = c(key[i,1], key[i,2], key[i,3], key[i,4], key[i,5]))
}
head(newdat)
[,1] [,2] [,3]
[1,] 0 2 2
[2,] 0 2 1
[3,] 2 1 1
[4,] 0 1 2
[5,] 0 1 1
[6,] 1 2 1
I am pretty happy with this solution, but if anyone has any better approaches, I would love to see them!
I would like to ask whether anyone knows a simple way to solve this kind of problem:
I need to generate all combinations of A numbers taken from the set {0,1,2,...,B} whose sum equals C.
For example, if A=2, B=3, C=2:
Solution in this case:
(1,1);(0,2);(2,0)
So the vectors are length 2 (A), sum of all its items is 2 (C), possible values for each of vectors elements come from the set {0,1,2,3} (maximum is B).
A functional version since I already started before SO updated:
A=2
B=3
C=2
myfun <- function(a=A, b=B, c=C) {
out <- do.call(expand.grid, lapply(1:a, function(x) 0:b))
return(out[rowSums(out)==c,])
}
> myfun()
Var1 Var2
3 2 0
6 1 1
9 0 2
z <- expand.grid(0:3,0:3)
z[rowSums(z)==2, ]
Var1 Var2
3 2 0
6 1 1
9 0 2
If you wanted to do the expand grid programmatically this would work:
z <- expand.grid( rep( list(0:B), A) )
You need to expand as a list so that the items remain separate. rep(0:3, 3) would not return 3 separate sequences. So for A=3:
> z <- expand.grid(rep(list(0:3), 3))
> z[rowSums(z)==2, ]
Var1 Var2 Var3
3 2 0 0
6 1 1 0
9 0 2 0
18 1 0 1
21 0 1 1
33 0 0 2
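For larger A, expand.grid builds all (B+1)^A rows before filtering, which can get big. As a rough alternative, here is a recursive sketch that only ever builds rows that can still reach the target sum. The function name compose_sum is hypothetical, and it assumes A >= 1 and non-negative integers B and C.

```r
# all length-A vectors with entries in 0..B that sum to C, one per row
compose_sum <- function(A, B, C) {
  if (C < 0) return(NULL)
  if (A == 1) {
    # a single element must equal the remaining sum and respect the bound
    if (C <= B) return(matrix(C, nrow = 1))
    return(NULL)
  }
  # try every allowed value for the first element, recurse on the rest
  pieces <- lapply(0:min(B, C), function(v) {
    rest <- compose_sum(A - 1, B, C - v)
    if (is.null(rest)) NULL else cbind(v, rest, deparse.level = 0)
  })
  # NULL pieces (dead ends) are dropped by rbind
  do.call(rbind, pieces)
}

compose_sum(2, 3, 2)
```

It returns NULL when no solution exists, so callers may want to check for that before using the result.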
Using the nifty partitions package, and more interesting values of A, B, and C:
library(partitions)
A <- 2
B <- 5
C <- 7
comps <- t(compositions(C, A))
ii <- apply(comps, 1, FUN=function(X) all(X %in% 0:B))
comps[ii, ]
# [,1] [,2]
# [1,] 5 2
# [2,] 4 3
# [3,] 3 4
# [4,] 2 5