Gathering specific pairs of columns into rows with dplyr in R [duplicate]
This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert a data frame from wide to long format by gathering specific pairs of columns, as in the example shown below.
An example data frame:
df <- data.frame(id=c(1,2,3,4,5), var=c("a","d","g","f","i"),a1=c(3,5,1,2,2), b1=c(2,4,1,2,3), a2=c(8,1,2,5,1), b2=c(1,6,4,7,2), a3=c(7,7,2,3,1), b3=c(1,1,4,9,6))
Initial table:
id var a1 b1 a2 b2 a3 b3
1 1 a 3 2 8 1 7 1
2 2 d 5 4 1 6 7 1
3 3 g 1 1 2 4 2 4
4 4 f 2 2 5 7 3 9
5 5 i 2 3 1 2 1 6
Desired result:
id var a b
1 1 a 3 2
2 1 a 8 1
3 1 a 7 1
4 2 d 5 4
5 2 d 1 6
6 2 d 7 1
7 3 g 1 1
8 3 g 2 4
9 3 g 2 4
10 4 f 2 2
11 4 f 5 7
12 4 f 3 9
13 5 i 2 3
14 5 i 1 2
15 5 i 1 6
Conditions:
Each pair of ai and bi should be gathered together: there are 3 pairs of a and b ("a1 and b1", "a2 and b2", and "a3 and b3"), and the values in those pairs should be moved into a single pair of columns "a" and "b" by replicating each record three times.
The first and second fields (the id of each sample and its common variable) should be kept in each replicated row.
I thought this might be possible with gather() from the tidyverse. However, as far as I understand, gather() may not be suitable for gathering such specific pairs of fields into several specific columns (two columns in this case).
It is possible to do this by preparing three data frames separately and binding them into one (example script below), but I would prefer to do it in one continuous tidyverse pipe without interrupting the manipulation.
df1 <- df %>% dplyr::select(id,var,a1,b1)
df2 <- df %>% dplyr::select(id,var,a2,b2)
df3 <- df %>% dplyr::select(id,var,a3,b3)
df.fin <- bind_rows(df1,df2,df3)
I would appreciate your elegant suggestions using the tidyverse.
=================Additional Questions==================
@Akrun & @Camille
Thank you for your suggestions, and sorry for my late reply. I am now trying to apply your idea to my actual data frame but am still struggling with another issue.
The following are the column names in the actual data frame (sorry, I have not included any values for the columns, as they should not matter).
colnames(df) <- c("hid","mid","rel","age","gen","mlic","vlic",
"wtaz","staz","ocp","ocpot","emp","empot","expm",
"minc","otaz1","op1","dtime1","atime1","dp1","dtaz1",
"pur1", "repm1","lg1t1","lg2t1","lg3t1","lg4t1","expt1",
"otaz2","op2","dtime2","atime2","dp2","dtaz2","pur2",
"repm2","lg1t2","lg2t2","lg3t2","lg4t2","expt2",
"otaz3","op3","dtime3","atime3","dp3","dtaz3","pur3",
"repm3","lg1t3","lg2t3","lg3t3","lg4t3","expt3",
"otaz4","op4","dtime4","atime4","dp4","dtaz4","pur4",
"repm4","lg1t4","lg2t4","lg3t4","lg4t4","expt4",
"otaz5","op5","dtime5","atime5","dp5","dtaz5","pur5",
"repm5","lg1t5","lg2t5","lg3t5","lg4t5","expt5"
)
I then tried to apply your suggestions as shown below.
In the data frame, columns 1:15 are common variables and the others are repeated variables with 5 repetitions (the index 1 to 5 is at the end of each variable name). I could run the following script, but it still has a problem:
#### Convert member table into activity table
## Common variables
hm.com <- names(hm)[c(1:15)]
## Repeating variables
hm.rep <- names(hm)[c(-1:-15)]
hm.rename <- unique(sub("\\d+$","",hm.rep))
## Extract members with trips
hm.trip <- hm %>% filter(otaz!=0) %>% data.frame()
## Convert from member into trip table
test <- split(hm.rep, sub(".*[^1-9$]", "", hm.rep)) %>%
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename) %>%
arrange(hid,mid,dtime,atime) %>%
data.frame()
The result still has an issue:
I could rename the first set of repeated variables, but the fields for sets 2 to 5 still remain, and the records are not stored correctly in the data frame.
That is, a set of repeated variables (for instance, otaz2 to expt2) is not stored as additional rows under otaz~expt but stays in its original columns (otaz2 to expt2). I suppose map_df is not working correctly in my case.
========== Problem Solved ==========
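The script above contained a mistake: rename_at() was applied after map_df() had already bound the pieces together, instead of being applied to each piece inside the map_df() call.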
Wrong:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename)
Correct:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x) %>%
rename_at(16:28, ~ hm.rename))
Thank you, I can now move on to the next step.
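The difference between the two versions can be reproduced on a small data frame. The following is only a sketch: the two-pair toy data and the names rep_cols, groups and new_names are made up for illustration, and setNames() is used in place of rename_at() to keep the example short. Binding first leaves a1/b1 and a2/b2 as separate, NA-padded columns; renaming each piece before binding stacks them.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: one common column and two repeated pairs
df <- data.frame(id = 1:2, a1 = c(3, 5), b1 = c(2, 4), a2 = c(8, 1), b2 = c(1, 6))

rep_cols  <- names(df)[-1]
groups    <- split(rep_cols, sub("^\\D+", "", rep_cols))  # "1": a1, b1; "2": a2, b2
new_names <- unique(sub("\\d+$", "", rep_cols))          # "a" "b"

# Pieces bound with their original names: columns do not line up
wrong <- map_df(groups, ~ select(df, id, all_of(.x)))

# Each piece renamed before binding: columns line up and rows are stacked
right <- map_df(groups, ~ select(df, id, all_of(.x)) %>%
                  setNames(c("id", new_names)))

names(wrong)  # id, a1, b1, a2, b2 (with NAs)
names(right)  # id, a, b
```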
We could do this with melt from data.table, which can take multiple patterns in the measure argument to reshape into 'long' format. Here, one pattern matches column names that start (^) with "a" followed by digits, and the other matches names that start with "b" followed by digits.
library(data.table)
melt(setDT(df), measure = patterns("^a\\d+", "^b\\d+"),
value.name = c("a", "b"))[order(id)][, variable := NULL][]
# id var a b
# 1: 1 a 3 2
# 2: 1 a 8 1
# 3: 1 a 7 1
# 4: 2 d 5 4
# 5: 2 d 1 6
# 6: 2 d 7 1
# 7: 3 g 1 1
# 8: 3 g 2 4
# 9: 3 g 2 4
#10: 4 f 2 2
#11: 4 f 5 7
#12: 4 f 3 9
#13: 5 i 2 3
#14: 5 i 1 2
#15: 5 i 1 6
Or, using the tidyverse, we gather the columns of interest into 'long' format (be cautious when the groups of columns have different classes, which is where melt is more useful), then separate the 'key' column into two, and spread back to 'wide' format.
library(tidyverse)
df %>%
gather(key, val, a1:b3) %>%
separate(key, into = c("key1", "key2"), sep=1) %>%
spread(key1, val) %>%
select(-key2)
# id var a b
#1 1 a 3 2
#2 1 a 8 1
#3 1 a 7 1
#4 2 d 5 4
#5 2 d 1 6
#6 2 d 7 1
#7 3 g 1 1
#8 3 g 2 4
#9 3 g 2 4
#10 4 f 2 2
#11 4 f 5 7
#12 4 f 3 9
#13 5 i 2 3
#14 5 i 1 2
#15 5 i 1 6
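With newer versions of tidyr (1.0.0 and later), the gather/separate/spread chain can likely be collapsed into a single pivot_longer() call using the special ".value" name. This is a sketch rather than part of the original answer; the column name set is a made-up label for the pair index.

```r
library(tidyr)

df <- data.frame(id = c(1, 2, 3, 4, 5), var = c("a", "d", "g", "f", "i"),
                 a1 = c(3, 5, 1, 2, 2), b1 = c(2, 4, 1, 2, 3),
                 a2 = c(8, 1, 2, 5, 1), b2 = c(1, 6, 4, 7, 2),
                 a3 = c(7, 7, 2, 3, 1), b3 = c(1, 1, 4, 9, 6))

# ".value" sends the letter part of each name to its own output column;
# the digit part goes into the (hypothetical) helper column "set"
long <- pivot_longer(df, cols = a1:b3,
                     names_to = c(".value", "set"),
                     names_pattern = "([a-z]+)(\\d+)")

# Drop the helper column to match the desired layout
long <- long[, c("id", "var", "a", "b")]
long
```

Because each value column keeps its own type, this form avoids the class-mixing caveat mentioned for gather().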
This isn't very scalable, so if you end up needing more than these 3 pairs of columns, go with @akrun's answer. I just wanted to point out that the bind_rows snippet you included could, in fact, be done in one pipe:
library(tidyverse)
bind_rows(
df %>% select(id, var, a = a1, b = b1),
df %>% select(id, var, a = a2, b = b2),
df %>% select(id, var, a = a3, b = b3)
) %>%
arrange(id, var)
#> id var a b
#> 1 1 a 3 2
#> 2 1 a 8 1
#> 3 1 a 7 1
#> 4 2 d 5 4
#> 5 2 d 1 6
#> 6 2 d 7 1
#> 7 3 g 1 1
#> 8 3 g 2 4
#> 9 3 g 2 4
#> 10 4 f 2 2
#> 11 4 f 5 7
#> 12 4 f 3 9
#> 13 5 i 2 3
#> 14 5 i 1 2
#> 15 5 i 1 6
Created on 2018-05-07 by the reprex package (v0.2.0).
If you want something that scales and you like map_* functions (from purrr in the tidyverse), you can abstract the above pipeline:
1:3 %>%
map_df(~select(df, id, var, ends_with(as.character(.))) %>%
setNames(c("id", "var", "a", "b"))) %>%
arrange(id, var)
where 1:3 just represents the numbers of the pairs you have.
A base R solution (note that the stacked columns keep the names a1 and b1):
res <- do.call(rbind,lapply(1:3,function(x) setNames(df[c(1:2,2*x+(1:2))],names(df)[1:4])))
res[order(res$id),]
# id var a1 b1
# 1 1 a 3 2
# 6 1 a 8 1
# 11 1 a 7 1
# 2 2 d 5 4
# 7 2 d 1 6
# 12 2 d 7 1
# 3 3 g 1 1
# 8 3 g 2 4
# 13 3 g 2 4
# 4 4 f 2 2
# 9 4 f 5 7
# 14 4 f 3 9
# 5 5 i 2 3
# 10 5 i 1 2
# 15 5 i 1 6
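Base R's reshape() can also stack several sets of measurement columns at once. The following is a minimal sketch, assuming the df from the question; the column name set is a made-up label for the pair index and is dropped at the end.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5), var = c("a", "d", "g", "f", "i"),
                 a1 = c(3, 5, 1, 2, 2), b1 = c(2, 4, 1, 2, 3),
                 a2 = c(8, 1, 2, 5, 1), b2 = c(1, 6, 4, 7, 2),
                 a3 = c(7, 7, 2, 3, 1), b3 = c(1, 1, 4, 9, 6))

# Each element of `varying` lists the wide columns that feed one long column
long <- reshape(df, direction = "long",
                varying = list(c("a1", "a2", "a3"), c("b1", "b2", "b3")),
                v.names = c("a", "b"),
                timevar = "set",          # hypothetical name for the pair index
                idvar = c("id", "var"))

# Sort by id and pair index, then keep only the requested columns
res <- long[order(long$id, long$set), c("id", "var", "a", "b")]
rownames(res) <- NULL
res
```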
Related
How to stack multiple columns into one using R [duplicate]
This question already has answers here: Reshaping data.frame from wide to long format (8 answers) Closed 1 year ago.
I have the following data frame:
A <- c(3,5,6,7)
B <- c(2,4,5,3)
C <- c(4,6,7,8)
D <- c(2,4,5,3)
gene <- c(1,2,3,4)
df <- data.frame(gene,A,B,C,D)
df
  gene A B C D
1    1 3 2 4 2
2    2 5 4 6 4
3    3 6 5 7 5
4    4 7 3 8 3
How can I stack each lettered column into one new column called "count", with another new column called "sample" that keeps track of the original column each count value came from (i.e. I would like the following output):
count sample
    3      A
    5      A
    6      A
    7      A
    2      B
    4      B
    5      B
    3      B
    4      C
    6      C
    7      C
    8      C
    2      D
    4      D
    5      D
    3      D
Sorry this is difficult to explain, but the output data frame above should make it clear. Thanks
In base R, use stack after removing the first column:
out <- stack(df[-1])
names(out) <- c("count", "sample")
We could use pivot_longer:
library(tidyr)
library(dplyr)
df %>%
  pivot_longer(
    cols = -gene,
    names_to = "sample",
    values_to = "count"
  ) %>%
  select(-gene) %>%
  arrange(sample)
   sample count
   <chr>  <dbl>
 1 A          3
 2 A          5
 3 A          6
 4 A          7
 5 B          2
 6 B          4
 7 B          5
 8 B          3
 9 C          4
10 C          6
11 C          7
12 C          8
13 D          2
14 D          4
15 D          5
16 D          3
Replicate rows of a data frame using purrr [duplicate]
This question already has answers here: Repeat each row of data.frame the number of times specified in a column (10 answers) Closed 2 years ago.
I have the following data frame:
  Id Value Freq
1  A     8    2
2  B     7    3
3  C     2    4
and I want to obtain a new data frame by replicating each Value according to Freq:
  Id Value
1  A     8
2  A     8
3  B     7
4  B     7
5  B     7
6  C     2
7  C     2
8  C     2
9  C     2
I understand this can be done very easily with purrr (I have identified map_dfr as the most suitable function), but I cannot work out the best and most "compact" way to do it.
You can just use some nice indexing properties of data frames.
df <- data.frame(Id=c("A","B","C"),Value=c(8,7,2),Freq=c(2,3,4))
replicatedDataframe <- do.call("rbind",lapply(1:NROW(df), function(k) {
  df[rep(k,df$Freq[k]),-3]
}))
This can be done more easily using the times argument of rep:
replicatedDataframe <- df[rep(1:NROW(df),times=df$Freq),-3]
Convert Freq to a vector and unnest.
df %>%
  mutate(Freq = map(Freq, seq_len)) %>%
  unnest(Freq) %>%
  select(-Freq)
#> # A tibble: 9 x 2
#>   Id    Value
#>   <chr> <dbl>
#> 1 A         8
#> 2 A         8
#> 3 B         7
#> 4 B         7
#> 5 B         7
#> 6 C         2
#> 7 C         2
#> 8 C         2
#> 9 C         2
Dynamic select expression in function [duplicate]
This question already has answers here: Reshaping multiple sets of measurement columns (wide format) into single columns (long format) (8 answers) Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
                 S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
                 S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
   obj S1 PR1 S2 PR2
1    1  a   3  c   7
2    1  b   7  d   3
3    2  a   3  c   7
4    2  b   7  d   3
5    3  a   3  c   7
6    3  b   7  d   3
7    3  a   3  c   7
8    4  b   7  d   3
9    4  a   3  c   7
10   4  b   7  d   3
into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
                  select(., obj, S = S2, PR = PR2))}
   obj S PR
1    1 a  3
2    1 b  7
3    2 a  3
4    2 b  7
5    3 a  3
6    3 b  7
7    3 a  3
8    4 b  7
9    4 a  3
10   4 b  7
11   1 c  7
12   1 d  3
13   2 c  7
14   2 d  3
15   3 c  7
16   3 d  3
17   3 c  7
18   4 d  3
19   4 c  7
20   4 d  3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category, i.e. DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
  keep <- enquo(keep)
  build_exp <- function(x){
    paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
      parse_expr()
  }
  exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
  sel_fun <- function(num){
    df %>% select(!!keep, !!!exps)
  }
  map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
  df %>% select(!!keep,
                !!sym(cnames[[1]]) := paste0(patterns[[1]], num),
                !!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest.
stack_col <- function(dat, pat, cname, keep) {
  purrr::map2(pat, cname, ~
    dat %>%
      dplyr::select(keep, matches(.x)) %>%
      tidyr::gather(key, val, matches(.x)) %>%
      dplyr::select(-key) %>%
      dplyr::rename(!! .y := val)) %>%
    dplyr::bind_cols(.) %>%
    dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
#   obj Species PR
#1    1       a  3
#2    1       b  7
#3    2       a  3
#4    2       b  7
#5    3       a  3
#6    3       b  7
#7    3       a  3
#8    4       b  7
#9    4       a  3
#10   4       b  7
#11   1       c  7
#12   1       d  3
#13   2       c  7
#14   2       d  3
#15   3       c  7
#16   3       d  3
#17   3       c  7
#18   4       d  3
#19   4       c  7
#20   4       d  3
Also, reshaping with multiple patterns can be done with data.table::melt:
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
     value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function. The idea is to use gather and spread on the columns which start with the specific patterns. I therefore create a regex which matches the column names, first gather all of the matching columns, extract the group, and then rename the groups with the cnames. Finally, spread separates the groups into new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
  mutate(rownumber = 1:n()) %>%
  gather(new_variable, value, matches(complete_pattern)) %>%
  mutate(group = str_extract(new_variable, complete_pattern),
         group = str_replace_all(group, cnames),
         group_number = str_extract(new_variable, "\\d+")) %>%
  select(-new_variable) %>%
  spread(group, value)
#    obj rownumber group_number PR Species
# 1    1         1            1  3       a
# 2    1         1            2  7       c
# 3    1         2            1  7       b
# 4    1         2            2  3       d
# 5    2         3            1  3       a
# 6    2         3            2  7       c
# 7    2         4            1  7       b
# 8    2         4            2  3       d
# 9    3         5            1  3       a
# 10   3         5            2  7       c
# 11   3         6            1  7       b
# 12   3         6            2  3       d
# 13   3         7            1  3       a
# 14   3         7            2  7       c
# 15   4         8            1  7       b
# 16   4         8            2  3       d
# 17   4         9            1  3       a
# 18   4         9            2  7       c
# 19   4        10            1  7       b
# 20   4        10            2  3       d
Compare variable names with String to compute new variable
I want to compare the variable names in R with the string data in an id variable. Table 1 shows what my sample table currently looks like and Table 2 shows what I want to achieve. In Table 2, the corresponding value is copied into a new variable "newid" where the id variable matches the column name. The other values are also copied into new variables for easier calculation.
Table 1:
id a b c newid
c  1 2 3
a  4 2 1
b  5 8 9
Table 2:
id a b c newid rest1 rest2
c  1 2 3     3     1     2
a  4 2 1     4     2     1
b  5 8 9     8     5     9
Thank you in advance
An approach via tidyverse,
library(tidyverse)
df %>%
  gather(var, val, -id) %>%
  filter(id == var) %>%
  left_join(df, ., by = 'id') %>%
  select(-var)
which gives,
  id a b c val
1  c 1 2 3   3
2  a 4 2 1   4
3  b 5 8 9   8
Using match and subsetting with an index matrix:
DF <- read.table(text = "id a b c
c 1 2 3
a 4 2 1
b 5 8 9
", header = TRUE)
DF$newid <- DF[, -1][cbind(seq_len(nrow(DF)), match(DF$id, colnames(DF)[-1]))]
#  id a b c newid
#1  c 1 2 3     3
#2  a 4 2 1     4
#3  b 5 8 9     8
By using melt from reshape:
temp=with(melt(dt,'id'),melt(dt,'id')[id==variable,])
dt$New=temp$value[match(dt$id,temp$variable)]
dt
  id a b c New
1  c 1 2 3   3
2  a 4 2 1   4
3  b 5 8 9   8
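None of the answers shown here fills the rest1 and rest2 columns from the question. The index-matrix idea can be extended to cover them; the following is only a sketch, and the helper names vals, col_idx and rest are made up for illustration. It assumes every id value matches exactly one column name.

```r
DF <- data.frame(id = c("c", "a", "b"), a = c(1, 4, 5), b = c(2, 2, 8),
                 c = c(3, 1, 9), stringsAsFactors = FALSE)

vals    <- as.matrix(DF[, -1])             # numeric part only
col_idx <- match(DF$id, colnames(vals))    # column named by each row's id

# A two-column (row, column) matrix picks one cell per row
DF$newid <- vals[cbind(seq_len(nrow(vals)), col_idx)]

# For each row, keep the remaining values in their original column order
rest <- t(sapply(seq_len(nrow(vals)), function(i) vals[i, -col_idx[i]]))
colnames(rest) <- c("rest1", "rest2")
DF <- cbind(DF, rest)
DF
```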
How to find the date after the last occurrence of a certain observation in R?
I have data grouped using dplyr in R. I would like to find the 'date' after the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'). In other words, the 'date' where 1/2/3/4 has turned to 0. Simply finding the date for the first occurrence of 0 will not work, as in some groups 1/2/3/4 switches to 0 and then back again, which does not give the result I'd like. I would like this 'date' for each group to be given in a new column ('date.after').
For example, given the following sample of data, grouped by A (this has been simplified; my data is actually grouped by 3 variables):
A B date
a 2    1
a 2    2
a 1    5
a 0    8
b 3    1
b 3    4
b 3    6
b 0    7
b 0    9
c 1    2
c 1    3
c 1    4
I would like to achieve the following:
A B date date.after
a 2    1          8
a 2    2          8
a 1    5          8
a 0    8          8
b 3    1          7
b 3    4          7
b 3    6          7
b 0    7          7
b 0    9          7
c 1    2         NA
c 1    3         NA
c 1    4         NA
I hope this makes sense, thank you all very much for your help! This post may look familiar, I have just asked a very similar question: How to find the last occurrence of a certain observation in grouped data in R?
Here's a dplyr option:
df %>%
  group_by(A) %>%
  mutate(date_after = date[last(which(B >= 1)) + 1])
#Source: local data frame [12 x 4]
#Groups: A [3]
#
#        A     B  date date_after
#   (fctr) (int) (int)      (int)
#1       a     2     1          8
#2       a     2     2          8
#3       a     1     5          8
#4       a     0     8          8
#5       b     3     1          7
#6       b     3     4          7
#7       b     3     6          7
#8       b     0     7          7
#9       b     0     9          7
#10      c     1     2         NA
#11      c     1     3         NA
#12      c     1     4         NA
Alternatively, you could use dplyr's nth function:
df %>%
  group_by(A) %>%
  mutate(date_after = nth(date, last(which(B >= 1)) + 1))
What it does (in both cases): It computes the position of the last entry of B equal to or greater than 1, then adds 1 to that index and returns the date at that position. It returns NA if that position is not available (as is the case in the last group).
You can do exactly the same in data.table using:
library(data.table)
setDT(df)[, date_after := date[last(which(B >= 1)) + 1], by = A]
I went with dplyr since I think the code is easier to read than data.table
library(dplyr)
df %>%
  group_by(A) %>%
  mutate(Date0 = date[B == 0][1])
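The same index logic can be sketched in base R, which makes the "last position plus one" step easy to inspect per group. This is an illustration, not part of the original answers; the helper name date_after_by_group is made up, and the sketch assumes every group has at least one B >= 1 (otherwise max() would warn and return -Inf).

```r
df <- data.frame(
  A    = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c"),
  B    = c(2, 2, 1, 0, 3, 3, 3, 0, 0, 1, 1, 1),
  date = c(1, 2, 5, 8, 1, 4, 6, 7, 9, 2, 3, 4)
)

# Per group: position of the last B >= 1, plus one.
# Indexing past the end of `date` yields NA, as in group "c".
date_after_by_group <- sapply(split(df, df$A), function(g) {
  g$date[max(which(g$B >= 1)) + 1]
})

# Broadcast the per-group result back onto the rows
df$date.after <- unname(date_after_by_group[as.character(df$A)])
df
```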