R: Converting wide format to long format with multiple 3 time period variables [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
Apologies if this is a simple question, but I haven't been able to find a simple solution after searching. I'm fairly new to R and am having trouble converting wide format to long format using either the melt (reshape2) or gather (tidyr) functions. The dataset that I'm working with contains 22 different time variables, each measured at 3 time periods. The problem occurs when I try to convert all of these from wide to long format at once. I have had success converting them individually, but that is long and very inefficient, so I was wondering if anyone could suggest a simpler solution. Below is a sample dataset I created that is formatted in a similar way to the dataset I am working with:
Subject <- c(1, 2, 3)
BlueTime1 <- c(2, 5, 6)
BlueTime2 <- c(4, 6, 7)
BlueTime3 <- c(1, 2, 3)
RedTime1 <- c(2, 5, 6)
RedTime2 <- c(4, 6, 7)
RedTime3 <- c(1, 2, 3)
GreenTime1 <- c(2, 5, 6)
GreenTime2 <- c(4, 6, 7)
GreenTime3 <- c(1, 2, 3)
sample.df <- data.frame(Subject, BlueTime1, BlueTime2, BlueTime3,
RedTime1, RedTime2, RedTime3,
GreenTime1,GreenTime2, GreenTime3)
A solution that has worked for me is to use the gather function from tidyr, arranging the data by Subject (so that each subject's data is grouped together), and then selecting only the subject, time period, and rating. This was done for each variable (in my case 22).
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
BlueGather <- gather(sample.df, Time_Blue, Rating_Blue, c(BlueTime1,
BlueTime2,
BlueTime3))
BlueSorted <- arrange(BlueGather, Subject)
BlueSubtracted <- select(BlueSorted, Subject, Time_Blue, Rating_Blue)
After this code, I combine everything into one data frame. This seems very slow and inefficient to me, and I was hoping that someone could help me find a simpler solution. Thank you!

The idea here is to gather() all the time variables (all variables but Subject), use separate() on key to split them into a label and a time and then spread() the label and value to obtain your desired output.
library(dplyr)
library(tidyr)
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time"), "(?<=[a-z])(?=[0-9])") %>%
spread(label, value)
Which gives:
# Subject time BlueTime GreenTime RedTime
#1 1 1 2 2 2
#2 1 2 4 4 4
#3 1 3 1 1 1
#4 2 1 5 5 5
#5 2 2 6 6 6
#6 2 3 2 2 2
#7 3 1 6 6 6
#8 3 2 7 7 7
#9 3 3 3 3 3
Note
Here we use the regex in separate() from this answer by @RichardScriven to split the column on the first encountered digit.
Edit
I understand from your comments that your dataset column names are actually in the form ColorTime_Pre, ColorTime_Post, ColorTime_Final. If that is the case, you don't have to specify a regex in separate() as the default one sep = "[^[:alnum:]]+" will match your _ and split the key into label and time accordingly:
sample.df %>%
gather(key, value, -Subject) %>%
separate(key, into = c("label", "time")) %>%
spread(label, value)
Will give:
# Subject time BlueTime GreenTime RedTime
#1 1 Final 1 1 1
#2 1 Post 4 4 4
#3 1 Pre 2 2 2
#4 2 Final 2 2 2
#5 2 Post 6 6 6
#6 2 Pre 5 5 5
#7 3 Final 3 3 3
#8 3 Post 7 7 7
#9 3 Pre 6 6 6
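With tidyr 1.0.0 or later, the gather/separate/spread chain above can also be written as a single pivot_longer() call. The following is a minimal sketch, assuming the column names end in a digit as in the sample data; the names_pattern regex is an assumption about that naming scheme and would need adjusting for suffixes such as _Pre.

```r
library(tidyr)

# Sample data from the question
sample.df <- data.frame(
  Subject = c(1, 2, 3),
  BlueTime1 = c(2, 5, 6), BlueTime2 = c(4, 6, 7), BlueTime3 = c(1, 2, 3),
  RedTime1 = c(2, 5, 6), RedTime2 = c(4, 6, 7), RedTime3 = c(1, 2, 3),
  GreenTime1 = c(2, 5, 6), GreenTime2 = c(4, 6, 7), GreenTime3 = c(1, 2, 3)
)

# ".value" keeps the first captured group (e.g. "BlueTime") as a column name;
# the second captured group (the trailing digits) goes into "time".
long.df <- pivot_longer(
  sample.df,
  cols = -Subject,
  names_to = c(".value", "time"),
  names_pattern = "^([A-Za-z]+)(\\d+)$"
)
long.df
```

Note that time comes back as a character column here; wrap it in as.integer() if a numeric time is needed.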

We can use melt from data.table, which can take multiple sets of measure columns as regex patterns:
library(data.table)
melt(setDT(sample.df), measure = patterns("^Blue", "^Red", "^Green"),
value.name = c("BlueTime", "RedTime", "GreenTime"), variable.name = "time")
# Subject time BlueTime RedTime GreenTime
#1: 1 1 2 2 2
#2: 2 1 5 5 5
#3: 3 1 6 6 6
#4: 1 2 4 4 4
#5: 2 2 6 6 6
#6: 3 2 7 7 7
#7: 1 3 1 1 1
#8: 2 3 2 2 2
#9: 3 3 3 3 3
Or, as @StevenBeaupré mentioned in the comments, if there are many patterns, one option is to build the patterns argument from the column names of the dataset after stripping the trailing digits:
melt(setDT(sample.df), measure = patterns(as.list(unique(sub("\\d+", "",
names(sample.df)[-1])))),value.name = c("BlueTime", "RedTime",
"GreenTime"), variable.name = "time")

If your goal is to convert the three colors to long this can be accomplished with the base R reshape function:
reshape(sample.df, idvar="Subject", varying=2:length(sample.df), sep="", direction="long")
Subject time BlueTime RedTime GreenTime
1.1 1 1 2 2 2
2.1 2 1 5 5 5
3.1 3 1 6 6 6
1.2 1 2 4 4 4
2.2 2 2 6 6 6
3.2 3 2 7 7 7
1.3 1 3 1 1 1
2.3 2 3 2 2 2
3.3 3 3 3 3 3
The time variable captures the 1,2,3 in the names of the wide variables. The varying argument tells reshape which variables should be converted to long. The sep argument tells reshape to look for numbers at the end of the varying variables that are not separated by any characters, while the direction argument tells the function to attempt a long conversion.
I always specify the id variable, even when it is not strictly necessary, for future reference.
If your data.frame doesn't actually have numbers for the time variable, a fairly simple solution is to change the variable names so that they do. For example, the following would replace "_Pre" with "1" at the end of any such variables.
names(df)[grep("_Pre$", names(df))] <- gsub("_Pre$", "1",
names(df)[grep("_Pre$", names(df))])
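The renaming step can be applied to all three suffixes at once before calling reshape. This is only a sketch: the small data frame and the suffix_map lookup are made up for illustration, assuming the real columns follow the ColorTime_Pre / ColorTime_Post / ColorTime_Final pattern mentioned in the comments.

```r
# Hypothetical data using the _Pre/_Post/_Final naming scheme
df <- data.frame(
  Subject = 1:2,
  BlueTime_Pre = c(2, 5),
  BlueTime_Post = c(4, 6),
  BlueTime_Final = c(1, 2)
)

# Assumed mapping from suffix to time index
suffix_map <- c("_Pre" = "1", "_Post" = "2", "_Final" = "3")

# Replace each suffix at the end of a column name with its number
for (s in names(suffix_map)) {
  names(df) <- sub(paste0(s, "$"), suffix_map[[s]], names(df))
}

# The columns are now BlueTime1, BlueTime2, BlueTime3, so sep = "" works
long <- reshape(df, idvar = "Subject", varying = 2:ncol(df),
                sep = "", direction = "long")
long
```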


Sort data.frame or data.table using vector of column names [duplicate]

This question already has answers here:
Sort a data.table fast by Ascending/Descending order
(2 answers)
Order data.table by a character vector of column names
(2 answers)
Sort a data.table programmatically using character vector of multiple column names
(1 answer)
Closed 2 years ago.
I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which is very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at, which accepts string column names:
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or, with dplyr 1.0.0 or later:
DF %>% arrange(across(all_of(sortby)))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
Also possible with dplyr, by converting the names to symbols and splicing them into arrange():
DF %>%
  arrange(!!!rlang::syms(sortby))
But Ronak's answer is more elegant.
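The base R one-liner with do.call(order, ...) can also mix ascending and descending columns, much like the order argument of setorderv. A minimal sketch, assuming R 3.3.0 or later (where order accepts a vector for decreasing when method = "radix"); the descending vector is a name introduced here for illustration:

```r
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c("C", "A")
descending <- c(FALSE, TRUE)  # C ascending, A descending

# Pass the sort columns as unnamed arguments to order(), plus a per-column
# decreasing vector, which is only supported by the radix method
ord <- do.call(
  order,
  c(unname(as.list(DF[sortby])),
    list(decreasing = descending, method = "radix"))
)
sorted <- DF[ord, ]
sorted
```

This returns a reordered copy rather than sorting by reference, so for large tables setorderv is likely still the better fit.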

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format within each group, and only for some of the variables (in this example specifically for player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
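# one row per group, one contribution column per player
wide <- pivot_wider(df, id_cols = group_id,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
# join the group-level columns back onto the per-player rows
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide, by = "group_id")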
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
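The same widen-then-join idea can be sketched without extra packages, using base R reshape and merge. This is an alternative illustration rather than the method from the answer above; note that reshape names the new columns player_contribution.1, player_contribution.2, and so on.

```r
# Data from the question
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- rep(seq(1, 3), 2)
player_decision <- seq(10, 60, 10)
player_contribution <- seq(6, 1, -1)
df <- data.frame(group_id, player_id, player_decision, player_contribution)

# Widen only the contribution column: one row per group
contrib <- df[, c("group_id", "player_id", "player_contribution")]
wide <- reshape(contrib, idvar = "group_id", timevar = "player_id",
                direction = "wide")

# Attach the group-level contributions to every player's row
answer <- merge(df[, c("group_id", "player_id", "player_decision")],
                wide, by = "group_id")
answer
```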

How do I use the tidyverse packages to get a running total of unique values occurring in a column? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 3 years ago.
I'm trying to use the tidyverse (whatever package is appropriate) to add a column (via mutate()) that is a running total of the unique values that have occurred in the column so far. Here is some toy data, showing the desired output.
data.frame("n"=c(1,1,1,6,7,8,8),"Unique cumsum"=c(1,1,1,2,3,4,4))
Who knows how to accomplish this in the tidyverse?
Here is an option with group_indices
library(dplyr)
df1%>%
mutate(unique_cumsum = group_indices(., n))
# n unique_cumsum
#1 1 1
#2 1 1
#3 1 1
#4 6 2
#5 7 3
#6 8 4
#7 8 4
data
df1 <- data.frame("n"=c(1,1,1,6,7,8,8))
Here's one way, using the fact that a factor will assign a sequential value to each unique item, and then converting the underlying factor codes with as.numeric:
data.frame("n"=c(1,1,1,6,7,8,8)) %>% mutate(unique_cumsum=as.numeric(factor(n)))
n unique_cumsum
1 1 1
2 1 1
3 1 1
4 6 2
5 7 3
6 8 4
7 8 4
Another solution:
df <- data.frame("n"=c(1,1,1,6,7,8,8))
df <- df %>% mutate(`unique cumsum` = cumsum(!duplicated(n)))
This should work even if your data is not sorted.
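To illustrate the difference on unsorted input, here is a small sketch with made-up values. The factor-based approach numbers values by their sorted order, while cumsum(!duplicated(n)) counts distinct values in order of first appearance, which is what a running total needs.

```r
n <- c(8, 1, 1, 6, 8)  # hypothetical unsorted input

# Factor codes follow the sorted levels 1, 6, 8
by_factor <- as.numeric(factor(n))

# Running count of distinct values seen so far
by_first_seen <- cumsum(!duplicated(n))

data.frame(n, by_factor, by_first_seen)
```

On sorted data, such as the example in the question, both columns agree.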

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that looks like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you can see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. Is there a way to run a command, without a loop, that places NAs where there is no data? I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging it with your original data (and using all.x = TRUE, which is like a left join in SQL), the combinations that are missing from dat are filled in with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT @jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
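The answer above does not show its result, so here is a self-contained sketch of the same complete() call with a quick check of the filled-in rows. It assumes only that tidyr is installed; the missing counter is a name introduced here for illustration.

```r
library(tidyr)

id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)

# complete() adds a row for every id/cycle combination that is absent
# from the data and fills the remaining columns with NA
dt2 <- complete(dt, id, cycle)

# Count the rows that were added
missing <- sum(is.na(dt2$value))
dt2
```

With this data, the three added rows are id 3 at cycle 2 and id 4 at cycles 1 and 3.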
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Filter ids with having count > 1 in data.table [duplicate]

This question already has answers here:
Select groups based on number of unique / distinct values
(4 answers)
Closed last month.
I would like to subset my data frame to keep only groups that have 3 or more observations on DIFFERENT days. I want to get rid of groups that have fewer than 3 observations, or whose observations are not from 3 different days.
Here is a sample data set:
Group Day
1 1
1 3
1 5
1 5
2 2
2 2
2 4
2 4
3 1
3 2
3 3
4 1
4 5
So for the above example, group 1 and group 3 will be kept and group 2 and 4 will be removed from the data frame.
I hope this makes sense, I imagine the solution will be quite simple but I can't work it out (I'm quite new to R and not very fast at coming up with solutions to things like this). I thought maybe the diff function could come in handy but didn't get much further.
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
group_by(Group) %>%
filter(n_distinct(Day) >= 3)
which gives the same result.
One idea using dplyr
library(dplyr)
df %>%
group_by(Group) %>%
filter(length(unique(Day)) >= 3)
#Source: local data frame [7 x 2]
#Groups: Group [2]
# Group Day
# (int) (int)
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#5 3 1
#6 3 2
#7 3 3
We can use base R
i1 <- rowSums(table(df1)!=0)>=3
subset(df1, Group %in% names(i1)[i1])
# Group Day
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#9 3 1
#10 3 2
#11 3 3
Or a one-liner base R would be
df1[with(df1, as.logical(ave(Day, Group, FUN = function(x) length(unique(x)) >=3))),]
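For readers who want to run these answers, the sample data from the question can be built as follows. The filtering step below is a more verbose base R sketch of the same idea (count distinct days per group, then keep the qualifying groups); the names n_days and keep are introduced here for illustration.

```r
# Sample data from the question
df1 <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
  Day   = c(1, 3, 5, 5, 2, 2, 4, 4, 1, 2, 3, 1, 5)
)

# Number of distinct days observed in each group
n_days <- tapply(df1$Day, df1$Group, function(x) length(unique(x)))

# Groups with at least 3 distinct days
keep <- names(n_days)[n_days >= 3]

# Keep all rows belonging to those groups
result <- df1[df1$Group %in% keep, ]
result
```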
