R subsetting with dplyr - r

Troubles with R subsetting and arranging datasets.
I have a dataset that looks like this:
Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1
I would need to create a new dataset for each skill, with a row for each student and a column for each observation (Correct). Like this:
Skill: 10
Student Obs1 Obs2 Obs3
64525 1 1 NA
70363 0 1 1
Skill: 15
Student Obs1 Obs2
64525 0 NA
70363 0 1
Notice that the number of columns in each skill dataset can vary, depending on the number of observations for each student. Notice also that a value can be NA if there is no such observation in the dataset (a student can attempt a skill a different number of times than other students).
I think this might be a job for the dplyr package, but I am not sure.
I really appreciate the help of the community!!

Here's a possible data.table implementation
library(data.table) # V 1.10.0
res <- setDT(df)[, .(.(dcast(.SD, Student ~ rowid(Student)))), by = Skill]
Which will result in a data.table of data.tables
res
# Skill V1
# 1: 10 <data.table>
# 2: 15 <data.table>
Which could be segmented by the Skill column
res[Skill == 10, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
Or in order to see the whole column
res[, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
#
# [[2]]
# Student 1 2
# 1: 64525 0 NA
# 2: 70363 0 1

This will get the job done.
xy <- read.table(text = "Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1", header = TRUE)
# first split by skill and work on each element;
# lapply (rather than sapply) always returns a list, even when every
# student happens to have the same number of observations
lapply(split(xy, xy$Skill), FUN = function(x) {
  # extract column Correct for each student
  out <- lapply(split(x, x$Student), FUN = "[[", "Correct")
  # pad shortest vectors with NAs at the end
  out <- mapply(out, max(lengths(out)), FUN = function(m, a) {
    c(m, rep(NA, times = (a - length(m))))
  }, SIMPLIFY = FALSE)
  do.call(rbind, out)
})
$`10`
[,1] [,2] [,3]
64525 1 1 NA
70363 0 1 1
$`15`
[,1] [,2]
64525 0 NA
70363 0 1
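For completeness, here is a minimal base R sketch of the same idea using reshape(). It is not taken from either answer above: the helper column Obs and the list name wide_by_skill are my own names, and the data are re-entered by hand from the question. It numbers each student's attempts within a skill and then spreads them into columns, so missing attempts become NA automatically.

```r
xy <- data.frame(
  Student = c(64525, 64525, 70363, 70363, 70363, 64525, 70363, 70363),
  Skill   = c(10, 10, 10, 10, 10, 15, 15, 15),
  Correct = c(1, 1, 0, 1, 1, 0, 0, 1)
)

# number each student's attempts within a skill: 1, 2, 3, ...
# (Obs is a hypothetical helper column, not part of the original data)
xy$Obs <- ave(xy$Correct, xy$Skill, xy$Student, FUN = seq_along)

# one wide data frame per skill; absent attempts are filled with NA
wide_by_skill <- lapply(split(xy, xy$Skill), function(d) {
  w <- reshape(d[, c("Student", "Obs", "Correct")],
               idvar = "Student", timevar = "Obs",
               direction = "wide", sep = "")
  names(w) <- sub("^Correct", "Obs", names(w))  # Correct1 -> Obs1, ...
  rownames(w) <- NULL
  w
})

wide_by_skill[["10"]]
```

Each element of wide_by_skill is an ordinary data frame, so wide_by_skill[["15"]] gives the table for skill 15.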

Related

Transforming list into dataframe of yes/no

I have a dataframe of values that I am trying to turn into a two-mode matrix. The first dataframe contains people and games (by id). I am trying to turn that into a dataframe that lists all games and whether a person has them or not. Can someone explain how to do this in R, or is this a question better suited to another programming language?
df<-data.frame(c(1,4,1),c(2,2,3),c(3,1,NA)) #note person3 only has 2 games... all empty spaces are filled with NA
row.names(df)<-c("person1","person2","person3")
colnames(df)<-c("game","game","game")
df
## game game game
## person1 1 2 3
## person2 4 2 1
## person3 1 3 NA
res<-data.frame(c(1,1,1),c(1,1,0),c(1,0,1),c(0,1,0))
colnames(res)<-c("1","2","3","4")
row.names(res)<-c("person1","person2","person3")
res
## 1 2 3 4
## person1 1 1 1 0
## person2 1 1 0 1
## person3 1 0 1 0
First create an empty matrix for the results:
r <- matrix(0, nrow=nrow(df), ncol=max(df, na.rm=TRUE))
row.names(r) <- row.names(df)
Then create the index matrix of (row, column) pairs for the entries that should be set to 1:
x <- matrix(c(as.vector(row(df)), as.vector(as.matrix(df))), ncol=2)
Set those entries to 1:
r[x] <- 1
r
## [,1] [,2] [,3] [,4]
## person1 1 1 1 0
## person2 1 1 0 1
## person3 1 0 1 0
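Another way to get the same 0/1 matrix, offered as a sketch rather than as part of the answer above, is to put the data in long form and cross-tabulate with table(). The variable names person, game and counts are my own. table() drops the NA entry by default, and the final comparison turns counts into 0/1 in case a game were listed twice for one person.

```r
df <- data.frame(c(1, 4, 1), c(2, 2, 3), c(3, 1, NA))
row.names(df) <- c("person1", "person2", "person3")

m <- as.matrix(df)
person <- rownames(m)[row(m)]  # person label for every cell, column by column
game   <- as.vector(m)         # game id for every cell, in the same order

counts <- table(person, game)      # NA games are dropped by default
res <- (unclass(counts) > 0) * 1   # counts -> 0/1 indicator matrix
res
```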

Splitting one Column to Multiple R and Giving logical value if true

I am trying to split one column in a data frame into multiple columns which hold the values from the original column as new column names. Then, if there was an occurrence of that value in the original column, give it a 1 in the new column, or 0 if there was no match. I realize this is not the best way to explain it, so here is an example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something such as, with 1's and 0's (or T and F):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr and the separate function, and reshape2 and the cast function, but I seem to be getting hung up on giving logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location' (one logical per row of 'df' for each value in 'u')
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
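One caveat with grepl() is that it matches substrings, so a location named "A" would also match a location named "AB". A small sketch that avoids this by comparing the split pieces exactly with %in% is shown below. It assumes "/" is always the separator, and the names parts and out are my own.

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"))

# split every Location into its pieces
parts <- strsplit(as.character(df$Location), "/", fixed = TRUE)
u <- sort(unique(unlist(parts)))

# one row per subject, one 0/1 column per distinct value (exact match)
m <- do.call(rbind, lapply(parts, function(p) as.integer(u %in% p)))
colnames(m) <- u

out <- cbind(df["subject"], m)
out
```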

Cumulative sum for positive numbers only [duplicate]

This question already has answers here:
Create counter within consecutive runs of certain values
I have this vector :
x = c(1,1,1,1,1,0,1,0,0,0,1,1)
And I want to do a cumulative sum for the positive numbers only. I should have the following vector in return:
xc = c(1,2,3,4,5,0,1,0,0,0,1,2)
How could I do it?
I've tried cumsum(x), but that computes the cumulative sum over all values and gives:
cumsum(x)
[1] 1 2 3 4 5 5 6 6 6 6 7 8
One option is
x1 <- inverse.rle(within.list(rle(x), values[!!values] <-
(cumsum(values))[!!values]))
x[x1!=0] <- ave(x[x1!=0], x1[x1!=0], FUN=seq_along)
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or a one-line code would be
x[x>0] <- with(rle(x), sequence(lengths[!!values]))
x
#[1] 1 2 3 4 5 0 1 0 0 0 1 2
Here's a possible solution using data.table v >= 1.9.5 and its new rleid function
library(data.table)
as.data.table(x)[, cumsum(x), rleid(x)]$V1
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
Base R, one line solution with Map Reduce :
> Reduce('c', Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
[1] 1 2 3 4 5 0 1 0 0 0 1 2
Or:
unlist(Map(function(u,v) if(v==0) rep(0,u) else 1:u, rle(x)$lengths, rle(x)$values))
x=c(1,1,1,1,1,0,1,0,0,0,1,1)
cumsum_ <- function(x) {
r <- rle(x)
s <- split(x, rep(seq_along(r$values), rle(x)$lengths))
return(unlist(sapply(s, cumsum), use.names = F))
}
(xc <- cumsum_(x))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2
I don't know much R, but I have written a small piece of code in Python. The logic is the same in any language. I hope this helps.
x = [1,1,1,1,1,0,1,0,0,0,1,1]
tot = 0
for i in range(len(x)):
    if x[i] != 0:
        # extend the running total while values are non-zero
        tot = tot + x[i]
        x[i] = tot
    else:
        # a zero resets the running total
        tot = 0
print(x)
x<-c(1,1,1,1,1,0,1,0,0,0,1,1)
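skumulowana <- function(x) {
  dl <- length(x)
  xx <- numeric(dl + 1)
  for (i in seq_len(dl)) {
    # reset the running total at a zero, otherwise add the current value
    xx[i + 1] <- if (x[i] == 0) 0 else xx[i] + x[i]
  }
  # drop the leading placeholder; a local variable avoids the global `<<-` assignment
  wynik <- xx[-1]
  return(wynik)
}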
skumulowana(x)
## [1] 1 2 3 4 5 0 1 0 0 0 1 2
Try this one-liner...
Reduce(function(x,y) (x+y)*(y!=0), x, accumulate=T)
split and lapply version:
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
unlist(lapply(split(x, cumsum(x==0)), cumsum))
step by step:
a <- split(x, cumsum(x==0)) # divides x into pieces where each 0 starts a new piece
b <- lapply(a, cumsum) # calculates cumsum in each piece
unlist(b) # rejoins the pieces
Result has useless names but is otherwise what you wanted:
# 01 02 03 04 05 11 12 2 3 41 42 43
# 1 2 3 4 5 0 1 0 0 0 1 2
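If the names are a nuisance, the same grouping can be handed to ave(), which returns a plain vector of the same length as x. This is a small sketch on the question's data. The name grp is my own, and it assumes, as the split version does, that every 0 starts a new run.

```r
x <- c(1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1)

grp <- cumsum(x == 0)            # run id: increases by one at every 0
xc  <- ave(x, grp, FUN = cumsum) # cumulative sum within each run, no names
xc
```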
Here is another base R solution using aggregate. The idea is to make a data frame with x and a new column named x.1 by which we can apply aggregate functions (cumsum in this case):
x <- c(1,1,1,1,1,0,1,0,0,0,1,1)
r <- rle(x)
df <- data.frame(x,
x.1=unlist(sapply(1:length(r$lengths), function(i) rep(i, r$lengths[i]))))
# df
# x x.1
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 0 2
# 7 1 3
# 8 0 4
# 9 0 4
# 10 0 4
# 11 1 5
# 12 1 5
agg <- aggregate(df$x~df$x.1, df, cumsum)
as.vector(unlist(agg$`df$x`))
# [1] 1 2 3 4 5 0 1 0 0 0 1 2

Reshaping wide dataset in interval format

I am working on a "wide" dataset, and now I would like to use a specific package (-msSurv-, for non-parametric multistate models) which requires data in interval form.
My current dataset is characterized by one row for each individual:
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header=TRUE)
where cohort is a time-fixed covariate, and s1-s3 correspond to the values that a time-varying covariate s = 1,2,3,4 takes over time (they are the distinct states visited by the individual over time). Calendar time is defined by t1-t3, and ranges from 0 to 100 for each individual.
So, for instance, individual 1 stays in state = 1 up to calendar time = 50, then he stays in state = 2 up to time = 70, and finally he stays in state = 4 up to time 100.
What I would like to obtain is a dataset in "interval" form, that is:
id cohort t.start t.stop start.s end.s
1 2 0 50 1 2
1 2 50 70 2 4
1 2 70 100 4 4
2 1 0 15 2 3
2 1 15 100 3 3
I hope the example is sufficiently clear, otherwise please let me know and I will try to further clarify.
How would you automate this reshaping? Consider that I have a relatively large number of (simulated) individuals, around 1 million.
Thank you very much for any help.
I think I understand. Does this work?
require(data.table)
dt <- data.table(dat, key=c("id", "cohort"))
dt.out <- dt[, list(t.start=c(t0,t1,t2), t.stop=c(t1,t2,t3),
start.s=c(s1,s2,s3), end.s=c(s2,s3,s3)),
by = c("id", "cohort")]
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 0
# 6: 2 1 100 0 0 0
If the output you show is indeed right and is what you require, then you can obtain it with two more lines (both are vectorised, so they should be fast):
# remove rows where start.s and end.s are both 0
dt.out <- dt.out[start.s > 0 | end.s > 0]
# where end.s == 0 there is no further transition, so carry start.s forward;
# only those rows are touched, so genuine transitions to a lower-numbered state are left alone
dt.out[end.s == 0, end.s := start.s]
> dt.out
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 3
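If data.table is not available, or the number of (state, time) pairs per row is larger than three, the same logic can be sketched in base R by building one block of intervals per slot and stacking them. This is only an illustration. It assumes the columns are named t0, s1, t1, ..., that a state of 0 marks an unused slot, and the names k, pieces, out and no_move are my own.

```r
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header = TRUE)

k <- 3  # number of (state, time) slots per row; assumed to match the column names

# one block of intervals per slot j: [t(j-1), t(j)] spent in state s(j)
pieces <- lapply(seq_len(k), function(j) {
  next_s <- if (j < k) dat[[paste0("s", j + 1)]] else dat[[paste0("s", j)]]
  data.frame(id = dat$id, cohort = dat$cohort,
             t.start = dat[[paste0("t", j - 1)]],
             t.stop  = dat[[paste0("t", j)]],
             start.s = dat[[paste0("s", j)]],
             end.s   = next_s)
})
out <- do.call(rbind, pieces)

out <- out[out$start.s > 0, ]                 # drop unused slots (state 0)
no_move <- out$end.s == 0                     # last interval: no further transition
out$end.s[no_move] <- out$start.s[no_move]
out <- out[order(out$id, out$t.start), ]
rownames(out) <- NULL
out
```

On a million rows the data.table version is likely to be faster, but the loop here runs over the k slots, not over individuals, so each step is still vectorised.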

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)

I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used for Multistate analyses (a generalization of Survival Analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted in a single variable s.
How would you do this transformation? Also, how would you come back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
idvar = c("id", "cohort"),
varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how:
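Use aggregate() to get back to "dat2" (this assumes "dat3" was created with the "time" and "dur" columns kept, i.e. without the column subset used earlier):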
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to the original wide form of "dat". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
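As an alternative sketch that skips reshape() entirely, the long form can be built directly with rep(), and the wide form recovered with rle(). This is only an illustration for the two-state layout in the question. The names long and back are my own, and the round trip assumes that s1 and s2 differ for every id, since rle() would otherwise merge the two spells into one.

```r
dat <- data.frame(id = c(1, 2), cohort = c(1, 0),
                  s1 = c(3, 1), dur1 = c(4, 4),
                  s2 = c(2, 4), dur2 = c(5, 3))

# wide -> long: repeat each state for as many rows as its duration
long <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(id = dat$id[i], cohort = dat$cohort[i],
             s = rep(c(dat$s1[i], dat$s2[i]),
                     times = c(dat$dur1[i], dat$dur2[i])))
}))

# long -> wide: run lengths of s give the durations back
# (assumes s1 != s2 within each id)
back <- do.call(rbind, lapply(split(long, long$id), function(d) {
  r <- rle(d$s)
  data.frame(id = d$id[1], cohort = d$cohort[1],
             s1 = r$values[1], dur1 = r$lengths[1],
             s2 = r$values[2], dur2 = r$lengths[2])
}))
back
```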
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for(i in 1:nrow(df)) {
hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', 1:9, sep=''))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
# drop the padding rows (state 0); testing only the state column keeps this correct for every cohort value
hist4 <- hist4[hist4$state != 0, ]
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
