This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
id <- c('1','1','1','2','2','3')
name <- c('myfile_1','myfile_2','myfile_4','myfile_1','myfile_2','myfile_3')
count <- c(5,4,2,1,3,1)
input <- data.frame(id, name, count)
Given the data frame input created above:
id name count
1 myfile_1 5
1 myfile_2 4
1 myfile_4 2
2 myfile_1 1
2 myfile_2 3
3 myfile_3 1
How can I reshape it into a new data frame like this?
id myfile_1 myfile_2 myfile_3 myfile_4
1 5 4 0 2
2 1 3 0 0
3 0 0 1 0
library(tidyverse)
input %>%
  spread(name, count, fill = 0)
# id myfile_1 myfile_2 myfile_3 myfile_4
#1 1 5 4 0 2
#2 2 1 3 0 0
#3 3 0 0 1 0
More details (other than the link given in the duplicate flag) on long-to-wide conversion can be found here.
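For readers who prefer to avoid extra packages, the same long-to-wide step can be sketched in base R. This is an alternative to the spread() call above, not the answerer's method; the name wide is introduced here purely for illustration. xtabs() sums count for each id/name pair and fills missing pairs with 0.

```r
id <- c('1','1','1','2','2','3')
name <- c('myfile_1','myfile_2','myfile_4','myfile_1','myfile_2','myfile_3')
count <- c(5,4,2,1,3,1)
input <- data.frame(id, name, count)

# Cross-tabulate: one row per id, one column per name, cells hold summed counts
wide <- as.data.frame.matrix(xtabs(count ~ id + name, data = input))

# Move the ids from the row names into a regular column
wide <- cbind(id = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

Note that xtabs() adds up duplicates, so if an id/name pair appeared twice the counts would be summed rather than raising an error.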
I get the correct output shown below (code beneath it) in the SumIfs_1 column, which sums the Code2 values of all rows whose Code1 is less than the current row's Code1:
Name Group Code1 Code2 SumIfs_1
1 B 1 0 1 1
2 R 1 1 0 2
3 R 1 1 2 2
4 R 2 3 0 4
5 R 2 3 1 4
6 B 3 4 2 5
7 A 3 -1 1 0
8 A 4 0 0 1
9 A 1 0 0 1
Code:
library(dplyr)
myData <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A"),
Group = c(1,1,1,2,2,3,3,4,1),
Code1 = c(0,1,1,3,3,4,-1,0,0),
Code2 = c(1,0,2,0,1,2,1,0,0)
)
myData %>% mutate(SumIfs_1 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
I'd like to extend the code with a second condition, giving a sumifs() equivalent with multiple conditions: sum only the Code2 values of rows whose Code1 is less than the current row's Code1 and whose Group is less than the current row's Group. This is further explained in this image (orange shows what already works in the Excel equivalent of the SumIfs_1 code above; yellow shows the sumifs() with the extra condition that I am trying to add, SumIfs_2):
Any recommendations for how to do this?
I'd like to stick with sapply() if possible, and more importantly I'd like to stick with dplyr or base R as I'm trying to prevent package bloat.
For what it's worth, here's my humble attempt to generate the SumIfs_2 column (which does not work):
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]][Group[1:n()] < Group[x]])))
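You're doing pretty much the same thing; you just need to combine the two conditions with & where you subset. Chaining a second [ ] subset, as in your attempt, applies the Group condition to the already-shortened vector, so the positions no longer line up.
Also, you don't need to write Code1[1:n()]: inside mutate(), Code1 already refers to every value in the Code1 column.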
I believe you are looking for
myData %>% mutate(SumIfs_2 = sapply(1:n(), function(x) sum(Code2[(Code1 < Code1[x]) & (Group < Group[x])])))
Name Group Code1 Code2 SumIfs_2
1 B 1 0 1 0
2 R 1 1 0 0
3 R 1 1 2 0
4 R 2 3 0 3
5 R 2 3 1 3
6 B 3 4 2 4
7 A 3 -1 1 0
8 A 4 0 0 1
9 A 1 0 0 0
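Since the question asks about staying close to base R, here is a minimal sketch of the same two-condition sum without dplyr. It uses vapply() rather than sapply() so the result type is checked; that swap is a choice made for this sketch, not part of the answer above.

```r
myData <- data.frame(
  Name  = c("B","R","R","R","R","B","A","A","A"),
  Group = c(1,1,1,2,2,3,3,4,1),
  Code1 = c(0,1,1,3,3,4,-1,0,0),
  Code2 = c(1,0,2,0,1,2,1,0,0)
)

# For each row i, sum Code2 over the rows that have both
# a smaller Code1 and a smaller Group than row i
myData$SumIfs_2 <- vapply(
  seq_len(nrow(myData)),
  function(i) with(myData, sum(Code2[Code1 < Code1[i] & Group < Group[i]])),
  numeric(1)
)
myData
```

Further conditions can be added the same way, by extending the logical expression with another & term.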
This question already has an answer here:
Get a square matrix out of a non symetric data frame
(1 answer)
Closed 5 years ago.
I have a data frame looking like this:
ID cat1 cat2 cat3
1 cat1_A cat2_A cat3_A
2 cat1_B cat2_A cat3_B
3 cat1_B cat2_B cat3_A
I would now like to convert this to a kind of transposed table using all values in each column as new column names, and a 0/1 (presence/absence) call for the respective column name as new value:
ID cat1_A cat1_B cat2_A cat2_B cat3_A cat3_B
1 1 0 1 0 1 0
2 0 1 1 0 0 1
3 0 1 0 1 1 0
I hope it's clear what I'd like to do, not sure how to explain it in a better way. Any help would be greatly appreciated!
Thanks!
We can use mtabulate from the qdapTools package:
library(qdapTools)
# df1 is the data frame shown in the question
res <- cbind(df1[1], mtabulate(as.data.frame(t(df1[-1]))))
row.names(res) <- NULL
res
# ID cat1_A cat2_A cat3_A cat1_B cat3_B cat2_B
#1 1 1 1 1 0 0 0
#2 2 0 1 0 1 1 0
#3 3 0 0 1 1 0 1
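If installing qdapTools is not an option, a base R sketch along the following lines should give the same presence/absence table. It is an alternative technique (stacking the values and cross-tabulating with table()), not what mtabulate does internally; df1 is rebuilt from the question's example and the name long is introduced here for illustration.

```r
df1 <- data.frame(
  ID   = 1:3,
  cat1 = c("cat1_A", "cat1_B", "cat1_B"),
  cat2 = c("cat2_A", "cat2_A", "cat2_B"),
  cat3 = c("cat3_A", "cat3_B", "cat3_A")
)

# Stack all category columns into one long vector, repeating the IDs per column
long <- data.frame(
  ID    = rep(df1$ID, times = ncol(df1) - 1),
  value = unlist(df1[-1], use.names = FALSE)
)

# Cross-tabulate ID against value and reduce the counts to 0/1
tab <- table(long$ID, long$value)
res <- cbind(ID = sort(unique(df1$ID)), as.data.frame.matrix(+(tab > 0)))
rownames(res) <- NULL
res
```

Here the columns come out in alphabetical order, which happens to match the layout requested in the question.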
This question already has answers here:
Generate a dummy-variable
(17 answers)
How do I make a dummy variable in R?
(3 answers)
Create new dummy variable columns from categorical variable
(8 answers)
How to create dummy variables?
(3 answers)
One-Hot Encoding in [R] | Categorical to Dummy Variables [duplicate]
(1 answer)
Closed 5 years ago.
I have survey results (categorical) stored in a CSV file, with multiple responses within the same cell. I'd like to split them into separate columns (dummy variables).
The data looks like
response <-c(1,2,3,123)
df <-data.frame(response)
I tried the code below
for(t in unique(df$response))
{df[paste("response",t,sep="")] <- ifelse(df$response==t,1,0)}
This is the result, but it created a separate column for 123:
head(df)
response response1 response2 response3 response123
1 1 1 0 0 0
2 2 0 1 0 0
3 3 0 0 1 0
4 123 0 0 0 1
I'd like the data to look as below
response response1 response2 response3
1 1 1 0 0
2 2 0 1 0
3 3 0 0 1
4 123 1 1 1
Appreciate your help and advice :)
We can do
df1 <- cbind(df, +(sapply(1:3, grepl, x = df$response)))
colnames(df1)[-1] <- paste0("response", colnames(df1)[-1])
df1
# response response1 response2 response3
#1 1 1 0 0
#2 2 0 1 0
#3 3 0 0 1
#4 123 1 1 1
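The grepl() call above hard-codes the codes 1:3. A possible variation, assuming every response code is a single digit (an assumption not stated in the question), is to split each cell into its digits and derive the set of codes from the data. The names digits, codes and dummies are introduced here for illustration.

```r
response <- c(1, 2, 3, 123)
df <- data.frame(response)

# Split each response into its individual digits, e.g. 123 -> "1" "2" "3"
digits <- strsplit(as.character(df$response), "")

# Every distinct digit seen in the data becomes one dummy column
codes <- sort(unique(unlist(digits)))

# One row per response, one 0/1 column per code
dummies <- t(vapply(digits,
                    function(d) as.integer(codes %in% d),
                    integer(length(codes))))
colnames(dummies) <- paste0("response", codes)

df1 <- cbind(df, dummies)
df1
```

This would not work for multi-digit codes such as 10 or 12; those would need a delimiter in the raw data.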
This question already has an answer here:
R: apply simple function to specific columns by grouped variable
(1 answer)
Closed 5 years ago.
I'm trying to convert a dataset that has multiple observations per person over a period of time. For example, person 1 can be obese at some observations and not obese (just overweight) at others during this time. Here's an example for persons 1 and 2:
ID Obese Overweight
1 NA NA
1 NA NA
1 0 1
1 1 0
1 0 0
2 NA 0
2 0 1
2 0 NA
I need to replace the values in each column to "1" if a 1 appears at all WITHIN THAT COLUMN, across a specified number of columns (there are 700+; e.g. c(5:749)) BY "ID". Ideally, the output would look like:
ID Obese Overweight
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
2 0 1
2 0 1
2 0 1
First I changed all the NAs to 0's; I then figured I could take the maximum along each column and replace (by ID), but can't find documentation on how to do this by group ("ID") AND a given set of columns (i.e. c(5:749)). Also I would not want to create new columns, but rather just replace values within columns already existing within the data frame.
I got it to work for a single variable, but couldn't translate this into a loop to go through a set of variables...
dat2 <- dat[, Obese:= max(Obese), by=ID]
Also I think a loop would take too long given the data size. Any other recommendations? Thanks in advance. Here's an example dataset:
dat <- as.data.frame(matrix(NA,18))
dat$id <- as.character(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3))
dat$ob1 <- as.character(c(NA,NA,0,1,0,NA,0,1,0,0,0,0,0,0,0,0,0,0))
dat$ob2 <- as.character(c(NA,NA,1,0,0,NA,0,0,1,0,0,0,0,1,0,0,0,0))
dat <- dat[,-1]
As for the linked page using "lapply", it doesn't seem to work in the case where all values are NA (or 0) for a given individual. In this scenario, it seems to "fill in" / impute with values from other columns (values which never appeared in that column in the original dataset); this was clearly spotted when a binary variable was imputed/replaced with a continuous value. Any idea why this may be happening?
I think tapply is helpful for this case.
Once the columns are numeric and the NAs are replaced with 0 (both done in the solution below), you can find the max for each id with
with(dat, tapply(ob1, id, max))
My solution is:
dat$ob1 <- as.numeric(dat$ob1)
dat$ob2 <- as.numeric(dat$ob2)
dat[is.na(dat)] <- 0
dat$ob1 <- with(dat,tapply(ob1,id,max)[id])
dat$ob2 <- with(dat,tapply(ob2,id,max)[id])
dat
id ob1 ob2
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 2 1 1
8 2 1 1
9 2 1 1
10 2 1 1
11 2 1 1
12 2 1 1
13 3 0 1
14 3 0 1
15 3 0 1
16 3 0 1
17 3 0 1
18 3 0 1
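The question mentions 700+ columns, so repeating the tapply() line per column does not scale. One way to generalize, sketched here with ave() instead of tapply() (a swapped-in base R function, not the answerer's code), is to loop over a vector of column names with lapply(). The vector cols is introduced for illustration; in the real data it could be names(dat)[5:749].

```r
dat <- data.frame(
  id  = as.character(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)),
  ob1 = as.character(c(NA,NA,0,1,0,NA,0,1,0,0,0,0,0,0,0,0,0,0)),
  ob2 = as.character(c(NA,NA,1,0,0,NA,0,0,1,0,0,0,0,1,0,0,0,0))
)

# Columns to overwrite in place
cols <- c("ob1", "ob2")

dat[cols] <- lapply(dat[cols], function(col) {
  col <- as.numeric(col)        # the example stores the flags as character
  col[is.na(col)] <- 0          # treat missing as "not observed"
  ave(col, dat$id, FUN = max)   # per-id maximum, repeated on every row of that id
})
dat
```

ave() returns a vector of the same length as its input, so no indexing back by id is needed and no new columns are created.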
I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described by one row. The dataset is intended for multistate analyses (a generalization of survival analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I would like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted in a single variable s.
How would you do this transformation? Also, how would you come back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
idvar = c("id", "cohort"),
varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
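For comparison, the expansion can also be sketched without reshape(), by repeating each state directly with rep(). This is a different route from the two steps above and assumes exactly two state/duration pairs per id, as in the question; dat is rebuilt here as a data frame and the name long is introduced for illustration.

```r
dat <- data.frame(id = c(1, 2), cohort = c(1, 0),
                  s1 = c(3, 1), dur1 = c(4, 4),
                  s2 = c(2, 4), dur2 = c(5, 3))

# Build one block of rows per id: s1 repeated dur1 times, then s2 repeated dur2 times
long <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(
    id     = dat$id[i],
    cohort = dat$cohort[i],
    s      = rep(c(dat$s1[i], dat$s2[i]), times = c(dat$dur1[i], dat$dur2[i]))
  )
}))
long
```

The rows come out already grouped by id and in visiting order, so no sorting step is needed.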
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how, assuming dat3 was built with the "time" and "dur" columns kept:
Use aggregate() to get back to "dat2":
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to the original "dat". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
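If only id, cohort and s were kept in the long data, the wide form can still be recovered in this example by run-length encoding the state sequence of each id with rle(). This is a sketch under two assumptions that go beyond the answer above: rows are in visiting order within each id, and consecutive states differ (two identical states in a row would merge into one run). The long data is rebuilt inline so the snippet runs on its own.

```r
long <- data.frame(
  id     = rep(c(1, 2), times = c(9, 7)),
  cohort = rep(c(1, 0), times = c(9, 7)),
  s      = rep(c(3, 2, 1, 4), times = c(4, 5, 4, 3))
)

# For each id, the runs of s give the visited states and their durations
runs <- lapply(split(long, long$id), function(d) {
  r <- rle(d$s)
  data.frame(id = d$id[1], cohort = d$cohort[1],
             s1 = r$values[1], dur1 = r$lengths[1],
             s2 = r$values[2], dur2 = r$lengths[2])
})
wide <- do.call(rbind, runs)
rownames(wide) <- NULL
wide
```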
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for(i in 1:nrow(df)) {
hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', seq_along(1:9), sep=''))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
# Drop the zero-padded periods; state 0 is only ever padding, whatever the cohort
hist4 <- hist4[hist4$state != 0, ]
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2