I have a dataset in which respondents answered a series of questions, each with five response options (e.g., 1:5). For those five options, I have a scoring key for each question, where some responses are worth full points (2), others half points (1), and others no points (0). So the data frame is n (people) x k (questions), and the scoring key is a k (questions) x m (responses) matrix.
What I am trying to do is to programmatically create a new dataset of the rescored items. Trivial dataset:
x <- sample(c(1:5), 50, replace = TRUE)
y <- sample(c(1:5), 50, replace = TRUE)
z <- sample(c(1:5), 50, replace = TRUE)
dat <- data.frame(cbind(x,y,z)) # 3 items, 50 observations (5 options per item)
head(dat)
x y z
1 3 1 2
2 2 1 3
3 5 3 4
4 1 4 5
5 1 3 4
6 4 5 4
# Each option is scored 0, 1, or 2:
key <- matrix(sample(c(0,0,1,1,2), size = 15, replace = TRUE), ncol=5)
key
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 1 2
[2,] 2 1 1 1 2
[3,] 2 2 1 1 2
Some other options. First, using Map:
# x is a column of responses and y its position (the question); key[y, x] looks up each response's score
data.frame(Map( function(x,y) key[y,x], dat, seq_along(dat) ))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Second, using matrix indexing on key:
newdat <- dat
# index key by (question, response) pairs: the column number of each cell paired with its response value
newdat[] <- key[cbind( as.vector(col(dat)), unlist(dat) )]
newdat
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Things would be even simpler if you specified key as a list:
key <- list(x=c(0,0,0,1,2),y=c(2,1,1,1,2),z=c(2,2,1,1,2))
data.frame(Map("[",key,dat))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
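If you already have key as the matrix above, one way to get that list form (a quick sketch using base R's split()) is to split the matrix by row and reuse the column names of dat:
key <- setNames(split(key, row(key)), names(dat))
# gives list(x = c(0,0,0,1,2), y = c(2,1,1,1,2), z = c(2,2,1,1,2)) for the key shown above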
For posterity, I was discussing this issue with a friend, who suggested another approach. The benefit is that it still uses mapvalues() to do the rescoring, but it avoids the for loop by using sapply() over the column indices to do the indexing.
library(plyr)
scored <- sapply(1:ncol(dat), function(x, dat, key) {
  mapvalues(dat[, x], from = 1:ncol(key), to = key[x, ])
}, dat = dat, key = key)
My current working approach uses 1) mapvalues(), which lives in package plyr, to do the heavy lifting: it takes a vector of data to modify plus two additional parameters, "from", the original values (here 1:5), and "to", the values we want to convert them to; and 2) a for loop with index notation, in which we cycle through the available questions, extract the vector pertaining to each using the current loop value, and use that value to select the proper row from our scoring key.
library(plyr)
newdat <- matrix(data = NA, nrow = nrow(dat), ncol = ncol(dat))
for (i in 1:ncol(dat)) {
  newdat[, i] <- mapvalues(dat[, i], from = c(1, 2, 3, 4, 5),
                           to = c(key[i, 1], key[i, 2], key[i, 3], key[i, 4], key[i, 5]))
}
head(newdat)
[,1] [,2] [,3]
[1,] 0 2 2
[2,] 0 2 1
[3,] 2 1 1
[4,] 0 1 2
[5,] 0 1 1
[6,] 1 2 1
I am pretty happy with this solution, but if anyone has any better approaches, I would love to see them!
Related
I'm cleaning up some survey data in R, assigning variables 1/0 based on the responses to a question. Say I had a question with 3 options (a, b, c) and a data frame with the responses and the indicator variables:
df <- data.frame(a = rep(0,3), b = rep(0,3), c = rep(0,3), response = I(list(c(1),c(1,2),c(2,3))))
So I want to change the 0s to 1s if the response matches the column index (i.e. 1 = a, 2 = b, 3 = c).
This is fairly easy to do with a loop:
for (i in 1:nrow(df)) df[i, df[i, "response"][[1]]] <- 1
Is there any way to do this with an apply/lapply/sapply/etc? Something like:
df <- sapply(df,function(x) x[x["response"][[1]]] <- 1)
Or should I stick with a loop?
You can use matrix indexing, from ?[:
A third form of indexing is via a numeric matrix with the one column
for each dimension: each row of the index matrix then selects a single
element of the array, and the result is a vector. Negative indices are
not allowed in the index matrix. NA and zero values are allowed: rows
of an index matrix containing a zero are ignored, whereas rows
containing an NA produce an NA in the result.
# construct a matrix representing the index where the value should be one
idx <- with(df, cbind(rep(seq_along(response), lengths(response)), unlist(response)))
idx
# [,1] [,2]
#[1,] 1 1
#[2,] 2 1
#[3,] 2 2
#[4,] 3 2
#[5,] 3 3
# do the assignment
df[idx] <- 1
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
Or you can try this:
library(tidyr)
library(dplyr)
df1 <- df %>% mutate(Id = row_number()) %>% unnest(response)
df[, 1:3] <- table(df1$Id, df1$response)
a b c response
1 1 0 0 1
2 1 1 0 1, 2
3 0 1 1 2, 3
Perhaps this helps
df[1:3] <- t(sapply(df$response, function(x) as.integer(names(df)[1:3] %in% names(df)[x])))
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3
Or a compact option is
library(qdapTools)
df[1:3] <- mtabulate(df$response)
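With the example df above, printing the result should reproduce what the other answers show:
df
# a b c response
#1 1 0 0 1
#2 1 1 0 1, 2
#3 0 1 1 2, 3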
I have these loops:
xall <- data.frame()
for (k in 1:nrow(VectClasses))
{
  for (i in 1:nrow(VectIndVar))
  {
    xall[i, k] <- sum(VectClasses[k, ] == VectIndVar[i, ])
  }
}
The data:
VectClasses = data frame containing the characteristics of each class
VectIndVar = data frame containing each record of the database
The two for loops work and give an output I can work with; however, they take too long, hence my need for the apply family.
The output I am looking for is like this:
V1 V2 V3 V4
1 3 3 2 2
2 2 2 1 1
3 3 4 3 3
4 3 4 3 3
5 4 4 3 3
6 3 2 3 3
I tried using:
xball <- data.frame()
xball <- sapply(xball, function(i, k) {
  sum(VectClasses[k, ] == VectIndVar[i, ])})
xcall <- data.frame()
xcall <- lapply(xcall, function(i, k) {sum(VectClasses[k, ] == VectIndVar[i, ])})
but neither seems to be filling the data frame.
Reproducible data (shortened):
VectIndVar <- data.frame(a=sample(letters[1:5], 100, rep=T), b=floor(runif(100)*25),
c = sample(c(1:5), 100, rep=T),
d=sample(c(1:2), 100, rep=T))
and:
K1 <- 4
VectClasses <- VectIndVar[sample(1:nrow(VectIndVar), K1, replace = FALSE), ]
Can you help me?
I would use outer instead of *apply:
res <- outer(
1:nrow(VectIndVar),
1:nrow(VectClasses),
Vectorize(function(i,k) sum(VectIndVar[i,-1]==VectClasses[k,-1]))
)
(Thanks to this Q&A for clarifying that Vectorize is needed: outer() calls FUN once with the full vectors of row and column indices, so a scalar function like this one must be wrapped in Vectorize() to be applied pairwise.)
This gives
> head(res) # with set.seed(1) before creating the data
[,1] [,2] [,3] [,4]
[1,] 1 1 2 1
[2,] 0 0 1 0
[3,] 0 0 0 0
[4,] 0 0 1 0
[5,] 1 0 0 1
[6,] 1 1 1 1
As for speed, I would suggest using matrices instead of data.frames:
cmat <- as.matrix(VectClasses[-1]); rownames(cmat) <- VectClasses$a
imat <- as.matrix(VectIndVar[-1]); rownames(imat) <- VectIndVar$a
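For example, a rough sketch (not from the original run) of the same comparison on those matrices, which avoids the overhead of extracting rows from a data.frame:
res2 <- outer(
  seq_len(nrow(imat)),
  seq_len(nrow(cmat)),
  Vectorize(function(i, k) sum(imat[i, ] == cmat[k, ]))
)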
I am trying to split one column of a data frame into multiple columns, using the values from the original column as the new column names. Then, if a row contains a given value, the corresponding new column should get a 1, and a 0 if there is no match. I realize this is not the best way to explain it, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and I would like to expand it to wide format with 1s and 0s (or TRUE and FALSE), something like:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr with the separate function and reshape2 with the cast function, but I seem to be getting hung up on producing the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have been searching around and I cannot figure out how to summarise the data I have in my data frame (subject to some ranges). I know it can be done with some combination of daply/tapply or table, but I haven't been able to get the exact result I was expecting.
Basically I want to turn this:
part_no val1 val2 val3  val4
      2    1    2    3  45.3
      2    1    3    4 -12.3
      3    1    3    4  99.3
      3    1    5    2  -3.2
      3    1    4    3 -55.3
Into this:
part_no val4_between0_50 val4_bw50_100 val4_bw-50_0 val4_bw-100_-50
      2                1             0            1               0
      3                0             1            1               1
This is dummy data; I have a lot more rows, but the idea is the same. I just want to count the number of values for a participant that fall in each range.
If anyone could explain it sort of step by step, I would really appreciate it. I saw lots of different little posts around (using table, etc.), but none does exactly this, and my attempts only got me halfway there.
A better solution than the one below (it does not need the extra row used below, although you could move the renaming code over to this result if you wanted):
xtabs(~ part_no + cut(val4, breaks = c(-100, -50, 0, 50, 100)), data = dat)
#-------------
cut(val4, breaks = c(-100, -50, 0, 50, 100))
part_no (-100,-50] (-50,0] (0,50] (50,100]
2 0 1 1 0
3 1 1 0 1
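If you want column names like the ones requested, a sketch (not part of the original run) that reuses the renaming code from further down on this result:
tab <- xtabs(~ part_no + cut(val4, breaks = c(-100, -50, 0, 50, 100)), data = dat)
colnames(tab) <- paste("val_btwn",
                       c(-100, -50, 0, 50, 100)[1:4],
                       c(-100, -50, 0, 50, 100)[2:5],
                       sep = "_")
tab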
First try: this was a solution to a slightly different problem and would be easy to adapt to your situation. The difficulty I ran into is that my solution requires part_no to start at 1. You could assign row labels later, I suppose, or make 'part_no' a factor and use its numeric-mode value.
dat <- read.table(text="part_no val1 val2 val3 val4
1 1 2 3 -32
2 1 2 3 45.3
2 1 3 4 -12.3
3 1 3 4 99.3
3 1 5 2 -3.2
3 1 4 3 -55.3
", head=T)
levs <- 4
recs <- matrix(c(unique(dat$part_no),
                 rep(0, levs * length(unique(dat$part_no)))),
               nrow = length(unique(dat$part_no)))
recs[cbind(dat$part_no,
           1 + findInterval(dat$val4, c(-100, -50, 0, 50, 100)))] <- 1
recs
#------------------------------------
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 0 0
[2,] 2 0 1 1 0
[3,] 3 1 1 0 1
#------------------------------------
colnames(recs) <- c(names(dat)[1] ,
paste("val_btwn",
c(-100, -50, 0, 50, 100)[1:4],
c(-100, -50, 0, 50, 100)[2:5],
sep="_") )
recs
#------------------------------------
part_no val_btwn_-100_-50 val_btwn_-50_0 val_btwn_0_50 val_btwn_50_100
[1,] 1 0 1 0 0
[2,] 2 0 1 1 0
[3,] 3 1 1 0 1
And now that I think further I might use cut and xtabs next time. In fact it worked so well I am going to post it on top.
I would like to ask whether anyone knows a simple way to solve this kind of problem:
I need to generate all combinations of A numbers taken from the set {0, 1, 2, ..., B}, with their sum equal to C.
i.e. if A=2, B=3, C=2:
Solution in this case:
(1,1); (0,2); (2,0)
So the vectors have length 2 (A), the sum of each vector's elements is 2 (C), and the possible values of each element come from the set {0,1,2,3} (maximum is B).
A functional version since I already started before SO updated:
A <- 2
B <- 3
C <- 2
myfun <- function(a = A, b = B, c = C) {
  out <- do.call(expand.grid, lapply(1:a, function(x) 0:b))
  return(out[rowSums(out) == c, ])
}
> myfun()
Var1 Var2
3 2 0
6 1 1
9 0 2
z <- expand.grid(0:3, 0:3)
z[rowSums(z) == 2, ]
  Var1 Var2
3    2    0
6    1    1
9    0    2
If you wanted to build the expand.grid call programmatically, this would work:
z <- expand.grid(rep(list(0:B), A))
You need to expand as a list so that the items remain separate. rep(0:3, 3) would not return 3 separate sequences. So for A=3:
> z <- expand.grid(rep(list(0:3), 3))
> z[rowSums(z)==2, ]
Var1 Var2 Var3
3 2 0 0
6 1 1 0
9 0 2 0
18 1 0 1
21 0 1 1
33 0 0 2
Using the nifty partitions package, and more interesting values of A, B, and C:
library(partitions)
A <- 2
B <- 5
C <- 7
comps <- t(compositions(C, A))
ii <- apply(comps, 1, FUN=function(X) all(X %in% 0:B))
comps[ii, ]
# [,1] [,2]
# [1,] 5 2
# [2,] 4 3
# [3,] 3 4
# [4,] 2 5
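The partitions package can also enforce the upper bound for you; if I recall its interface correctly, blockparts() enumerates the compositions of C whose i-th part is capped by the i-th element of its first argument, so a sketch without the filtering step would be:
# each part is capped at B, so the %in% 0:B filter above is no longer needed
t(blockparts(rep(B, A), C))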