I have a dataframe of values that I am trying to turn into a two-mode matrix. The first dataframe contains people and games (by id). I am trying to turn it into a dataframe that lists all games and whether a person has each of them or not. Can someone explain how to do this in R, or is this a question better suited to another programming language?
df<-data.frame(c(1,4,1),c(2,2,3),c(3,1,NA)) #note person3 only has 2 games... all empty spaces are filled with NA
row.names(df)<-c("person1","person2","person3")
colnames(df)<-c("game","game","game")
df
## game game game
## person1 1 2 3
## person2 4 2 1
## person3 1 3 NA
res<-data.frame(c(1,1,1),c(1,1,0),c(1,0,1),c(0,1,0))
colnames(res)<-c("1","2","3","4")
row.names(res)<-c("person1","person2","person3")
res
## 1 2 3 4
## person1 1 1 1 0
## person2 1 1 0 1
## person3 1 0 1 0
First create an empty matrix for the results:
r <- matrix(0, nrow=nrow(df), ncol=max(df, na.rm=TRUE))
row.names(r) <- row.names(df)
Then create the index matrix, i.e. the entries to set to 1:
x <- matrix(c(as.vector(row(df)), as.vector(as.matrix(df))), ncol = 2)  # column 1: person (row) index, column 2: game id
Set those entries to 1:
r[x] <- 1
r
## [,1] [,2] [,3] [,4]
## person1 1 1 1 0
## person2 1 1 0 1
## person3 1 0 1 0
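For comparison, a compact alternative sketch using table() (not part of the answer above; it assumes each person lists each game at most once, since table() counts occurrences):
long <- data.frame(person = rep(row.names(df), times = ncol(df)),  # one row per cell of df
                   game   = unlist(df))                            # game ids, column by column
res2 <- as.data.frame.matrix(table(long$person, long$game))        # NA games are dropped by table()
res2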
I am struggling to create a list of tables (objects of class table, not data.frame) in a loop in R.
The structure of my data is also a bit complicated - sometimes the table function does not give a 2x2 table - how can I pad tables with incomplete dimensions to 2x2 automatically?
Sample data (the real dataset is much, much larger...):
my.data <- data.frame(y.var  = c(0,1,0,1,1,1,0,1,1,0),
                      sex    = rep(c("male","female"), times = 5),
                      apple  = c(0,1,1,0,0,0,1,0,0,0),
                      orange = c(1,0,1,1,0,1,1,1,0,0),
                      ananas = c(0,0,0,0,0,0,0,0,0,0))
# y.var sex apple orange ananas
# 1 0 male 0 1 0
# 2 1 female 1 0 0
# 3 0 male 1 1 0
# ... (7 more rows)
When I create tables, for apple I get nice 2x2 tables:
table(my.data$y.var, my.data$apple)
# 0 1
# 0 2 2
# 1 5 1 .... Ok, nice 2x2 table.
table(my.data$y.var, my.data$apple, my.data$sex)
# , , = female
# 0 1
# 0 1 0
# 1 3 1
# , , = male
# 0 1
# 0 1 2
# 1 2 0 .... Ok, nice 2x2 table.
However, for ananas I only get 2x1 tables:
table(my.data$y.var, my.data$ananas)
#     0
#   0 4
#   1 6 .... NOT Ok! I need a 2x2 table like this:
#     0 1
#   0 4 0
#   1 6 0
table(my.data$y.var, my.data$ananas, my.data$sex)
# , , = female
#     0
#   0 1
#   1 4 .... NOT Ok! I need a 2x2 table like this:
#     0 1
#   0 1 0
#   1 4 0
#
# , , = male
#     0
#   0 3
#   1 2 .... NOT Ok! I need a 2x2 table like this:
#     0 1
#   0 3 0
#   1 2 0
I can build the list manually like this, but it is not very practical:
my.list <- list(table(my.data$y.var, my.data$apple),
                table(my.data$y.var, my.data$apple, my.data$sex),
                table(my.data$y.var, my.data$orange),
                table(my.data$y.var, my.data$orange, my.data$sex),
                table(my.data$y.var, my.data$ananas),
                table(my.data$y.var, my.data$ananas, my.data$sex))
How can I write a loop that corrects the table dimensions automatically? This is necessary for the analyses that follow...
We can use lapply to loop over the columns of interest after converting them to factors with the same levels, then call table and keep the output in a list:
my.data[-2] <- lapply(my.data[-2], factor, levels = 0:1)
lst1 <- lapply(my.data[3:5], function(x) table(my.data$y.var, x))
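Since the question also asks for the sex-stratified tables, the same idea should extend to the three-way case (a sketch along the same lines, assuming the factor conversion above has already been run):
lst2 <- lapply(my.data[3:5], function(x) table(my.data$y.var, x, my.data$sex))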
I am having trouble with subsetting and arranging datasets in R.
I have a dataset that looks like this:
Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1
I would need to create a new dataset for each skill, with a row for each student and a column for each observation (Correct). Like this:
Skill: 10
Student Obs1 Obs2 Obs3
64525 1 1 NA
70363 0 1 1
Skill: 15
Student Obs1 Obs2
64525 0 NA
70363 0 1
Notice that the number of columns of each skill dataset can vary, depending on the number of observations for each student. Notice also that a value can be NA if there is no such observation in the dataset (a student can try the skill a different number of times than other students).
I think this might be a job for the dplyr package, but I am not sure.
I really appreciate the help of the community!!
Here's a possible data.table implementation
library(data.table) # V 1.10.0
res <- setDT(df)[, .(.(dcast(.SD, Student ~ rowid(Student)))), by = Skill]
Which will result in a data.table of data.tables
res
# Skill V1
# 1: 10 <data.table>
# 2: 15 <data.table>
Which could be segmented by the Skill column
res[Skill == 10, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
Or in order to see the whole column
res[, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
#
# [[2]]
# Student 1 2
# 1: 64525 0 NA
# 2: 70363 0 1
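If you would rather work with a plain named list keyed by skill, one small follow-up step (a sketch; the names used here are just illustrative):
skill_list <- setNames(res$V1, paste0("Skill_", res$Skill))
skill_list[["Skill_10"]]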
This will get the job done.
xy <- read.table(text = "Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1", header = TRUE)
# first split by skill and work on each element
sapply(split(xy, xy$Skill), FUN = function(x) {
  # extract column Correct for each student
  out <- sapply(split(x, x$Student), FUN = "[[", "Correct")
  # pad shorter vectors with NAs at the end
  out <- mapply(out, max(lengths(out)), FUN = function(m, a) {
    c(m, rep(NA, times = (a - length(m))))
  }, SIMPLIFY = FALSE)
  do.call(rbind, out)
})
$`10`
[,1] [,2] [,3]
64525 1 1 NA
70363 0 1 1
$`15`
[,1] [,2]
64525 0 NA
70363 0 1
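Since you mention dplyr: for completeness, a rough sketch of the same reshaping with dplyr plus tidyr (this assumes tidyr >= 1.0 for pivot_wider() and reuses the xy data read in above):
library(dplyr)
library(tidyr)
lapply(split(xy, xy$Skill), function(d) {
  d %>%
    group_by(Student) %>%
    mutate(obs = paste0("Obs", row_number())) %>%  # number each student's attempts
    ungroup() %>%
    pivot_wider(id_cols = Student, names_from = obs, values_from = Correct)
})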
I am trying to split one column in a data frame into multiple columns, using the values from the original column as the new column names. Then, if a value occurred in the original column for a given row, the corresponding new column should get a 1, or a 0 if there is no match. I realize this is not the best way to explain it, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and I would like to expand it to wide format with 1s and 0s (or TRUE and FALSE), something like this:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr (the separate function) and reshape2 (the cast function) but seem to be getting hung up on producing the indicator values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
         type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
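If you prefer plain A/B/C/D column names, a small optional follow-up (just stripping the prefix that cSplit_e adds):
out <- cSplit_e(df, "Location", sep = "/", type = "character", drop = TRUE, fill = 0)
names(out) <- sub("^Location_", "", names(out))
out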
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)  # one logical entry per row of df
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
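Since you mention tidyr: a sketch of the same result using separate_rows() plus table() (this assumes a reasonably recent tidyr; separate_rows() splits each Location into one row per value):
library(tidyr)
df$Location <- as.character(df$Location)   # in case Location was read in as a factor
long <- separate_rows(df, Location, sep = "/")
cbind(df["subject"], as.data.frame.matrix(table(long$subject, long$Location)))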
I have a set of data on which respondents were given a series of questions, each with five response options (e.g., 1:5). Given those five options, I have a scoring key for each question, where some responses are worth full points (e.g., 2), others half points (1), and others no points (0). So, the data frame is n (people) x k (questions), and the scoring key is a k (questions) x m (responses) matrix.
What I am trying to do is to programmatically create a new dataset of the rescored items. Trivial dataset:
x <- sample(c(1:5), 50, replace = TRUE)
y <- sample(c(1:5), 50, replace = TRUE)
z <- sample(c(1:5), 50, replace = TRUE)
dat <- data.frame(cbind(x,y,z)) # 3 items, 50 observations (5 options per item)
head(dat)
x y z
1 3 1 2
2 2 1 3
3 5 3 4
4 1 4 5
5 1 3 4
6 4 5 4
# Each option is scored 0, 1, or 2:
key <- matrix(sample(c(0,0,1,1,2), size = 15, replace = TRUE), ncol=5)
key
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 1 2
[2,] 2 1 1 1 2
[3,] 2 2 1 1 2
Some other options, firstly using Map:
data.frame(Map(function(x, y) key[y, x], dat, seq_along(dat)))  # y = item index (key row), x = that item's responses (key columns)
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Secondly using matrix indexing on key:
newdat <- dat
newdat[] <- key[cbind(as.vector(col(dat)), unlist(dat))]  # index key by (item, response) pairs
newdat
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
Things would be even simpler if you specified key as a list:
key <- list(x=c(0,0,0,1,2),y=c(2,1,1,1,2),z=c(2,2,1,1,2))
data.frame(Map("[",key,dat))
# x y z
#1 0 2 2
#2 0 2 1
#3 2 1 1
#4 0 1 2
#5 0 1 1
#6 1 2 1
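For completeness, the list could also be built from the question's original key matrix rather than typed by hand (a sketch; here key refers to the 3x5 matrix from the question, not the list just defined above):
key <- setNames(split(key, row(key)), names(dat))  # one scoring vector per item, in row order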
For posterity, I was discussing this issue with a friend, who suggested another approach. The benefit of this approach is that it still uses mapvalues() to do the rescoring but does not require a for loop; instead, sapply() supplies the column index used to pick out the right row of the key.
library(plyr)
scored <- sapply(1:ncol(dat), function(x, dat, key) {
  mapvalues(dat[, x], from = 1:ncol(key), to = key[x, ])
}, dat = dat, key = key)
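One small optional follow-up: sapply() over column positions returns an unnamed matrix, so you may want to restore the item names and coerce back to a data frame:
colnames(scored) <- names(dat)
scored <- as.data.frame(scored)
head(scored)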
My current working approach uses 1) mapvalues(), from package plyr, to do the heavy lifting: it takes a vector of data to modify plus two additional parameters, "from" (the original values, here 1:5) and "to" (what we want to convert the data to); and 2) a for loop with index notation, in which we cycle through the available questions, extract the vector pertaining to each using the current loop value, and use it to select the proper row from our scoring key.
library(plyr)
newdat <- matrix(data = NA, nrow = nrow(dat), ncol = ncol(dat))
for (i in 1:3) {
  newdat[, i] <- mapvalues(dat[, i], from = c(1, 2, 3, 4, 5),
                           to = c(key[i, 1], key[i, 2], key[i, 3], key[i, 4], key[i, 5]))
}
head(newdat)
[,1] [,2] [,3]
[1,] 0 2 2
[2,] 0 2 1
[3,] 2 1 1
[4,] 0 1 2
[5,] 0 1 1
[6,] 1 2 1
I am pretty happy with this solution, but if anyone has any better approaches, I would love to see them!
I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended to be used for carrying out multistate analyses (a generalization of survival analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity, let us cap the number of distinct states that can be visited at two). The first visited state is s1 (taking values 1, 2, 3, or 4). The person stays in that state for dur1 time periods, and the same applies to the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted in a single variable s.
How would you do this transformation? Also, how would you go back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in "dat3" and, if required, creating a separate data.frame with just the subset of columns you need.
Here's how:
Use aggregate() to get back to "dat2" (this assumes "dat3" was built keeping the "time" and "dur" columns):
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to "dat1". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
        direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for (i in 1:nrow(df)) {
  hist[i, ] <- c(rep(df[i, 3], df[i, 4]), rep(df[i, 5], df[i, 6]),
                 rep(0, 9 - df[i, 4] - df[i, 6]))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', seq_along(1:9), sep=''))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
hist4 <- hist4[!(hist4[,2]==0 & hist4[,3]==0),]
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2
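For what it's worth, here is a sketch that avoids the zero padding and the cohort-based filtering entirely by repeating each state for its duration with rep() (it assumes the same two-state df read in above; additional state/duration pairs would just be appended inside the Map() call):
long <- data.frame(
  id     = rep(df$id,     times = df$dur1 + df$dur2),   # one row per period spent in any state
  cohort = rep(df$cohort, times = df$dur1 + df$dur2),
  state  = unlist(Map(function(s1, d1, s2, d2) c(rep(s1, d1), rep(s2, d2)),
                      df$s1, df$dur1, df$s2, df$dur2))
)
long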