How to stack multiple columns into one using R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 1 year ago.
I have the following data frame:
A <- c(3,5,6,7)
B <- c(2,4,5,3)
C <- c(4,6,7,8)
D <- c(2,4,5,3)
gene <- c(1,2,3,4)
df <- data.frame(gene,A,B,C,D)
df
gene A B C D
1 1 3 2 4 2
2 2 5 4 6 4
3 3 6 5 7 5
4 4 7 3 8 3
How can I stack each lettered column into one new column called "count", with another new column called "sample" that keeps track of the original column each count value came from? That is, I would like the following output:
count sample
3 A
5 A
6 A
7 A
2 B
4 B
5 B
3 B
4 C
6 C
7 C
8 C
2 D
4 D
5 D
3 D
Sorry this is difficult to explain but the output data frame above should make it clear.
Thanks

In base R, use stack after removing the first column, then rename the two resulting columns:
out <- stack(df[-1])
names(out) <- c("count", "sample")
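As a quick check of the base R approach, here is a minimal sketch that rebuilds the question's df and confirms the shape of the result. Note that stack() returns the source-column indicator as a factor; the as.character() conversion is an optional extra step and not part of the original answer.

```r
# Rebuild the data frame from the question
df <- data.frame(gene = 1:4,
                 A = c(3, 5, 6, 7), B = c(2, 4, 5, 3),
                 C = c(4, 6, 7, 8), D = c(2, 4, 5, 3))

out <- stack(df[-1])                    # columns are named "values" and "ind"
names(out) <- c("count", "sample")
out$sample <- as.character(out$sample)  # optional: stack() returns a factor

str(out)  # 16 rows: one per (column, gene) pair
```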

We could use pivot_longer:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(
cols = -gene,
names_to = "sample",
values_to = "count"
) %>%
select(-gene) %>%
arrange(sample)
sample count
<chr> <dbl>
1 A 3
2 A 5
3 A 6
4 A 7
5 B 2
6 B 4
7 B 5
8 B 3
9 C 4
10 C 6
11 C 7
12 C 8
13 D 2
14 D 4
15 D 5
16 D 3

Related

How to enumerate groups in R? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 2 years ago.
Which base R function, or function from another package, can I use to add a column that numbers each group in order of appearance, like the Num column in the outputs below?
Dataset
lines = "Group
C
C
C
B
B
A
A
A
A
A
A
D
D
D
D
"
dataset <- read.table(textConnection(lines), sep = ";", header = TRUE)
Try with cur_group_id() from dplyr:
library(dplyr)
#Code 1
newdf <- dataset%>%
mutate(Group=factor(Group,levels = unique(Group),ordered = T)) %>%
group_by(Group) %>% mutate(Num=cur_group_id())
Output:
# A tibble: 15 x 2
# Groups: Group [4]
Group Num
<ord> <int>
1 C 1
2 C 1
3 C 1
4 B 2
5 B 2
6 A 3
7 A 3
8 A 3
9 A 3
10 A 3
11 A 3
12 D 4
13 D 4
14 D 4
15 D 4
Or using base R:
#Code 2
dataset$Num <- as.integer(factor(dataset$Group,levels = unique(dataset$Group)))
Output:
Group Num
1 C 1
2 C 1
3 C 1
4 B 2
5 B 2
6 A 3
7 A 3
8 A 3
9 A 3
10 A 3
11 A 3
12 D 4
13 D 4
14 D 4
15 D 4
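A third variant, not taken from the answers above: match() against unique() gives the same order-of-appearance numbering without building a factor. This is only a sketch; the data frame is re-created directly here rather than through read.table, purely for brevity.

```r
# Same groups as in the question: C x3, B x2, A x6, D x4
dataset <- data.frame(
  Group = c(rep("C", 3), rep("B", 2), rep("A", 6), rep("D", 4))
)

# Position of each value in the vector of distinct values,
# which are kept in order of first appearance
dataset$Num <- match(dataset$Group, unique(dataset$Group))
dataset
```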

Transpose and Merge columns in R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
Quite new to R and I have a dataset in this format:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
But I want it in this format:
A 1
A 2
A 3
A 4
A 5
B 1
B 2
B 3
...etc.
Seems like such a simple issue but I need HELP! Thanks
df <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
stack(df)
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 1 C
12 2 C
13 3 C
14 4 C
15 5 C
Example using tidyr's gather function (loaded here via tidyverse; note that gather is superseded by pivot_longer in current tidyr):
library(tidyverse)
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5)
C <- c(1,2,3,4,5)
df <- data.frame(A,B,C)
df %>% gather(key = "key", value = "value")
key value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 B 1
7 B 2
8 B 3
9 B 4
10 B 5
11 C 1
12 C 2
13 C 3
14 C 4
15 C 5
You can use the package tidyr. This lets you choose which columns you want to gather into the column "variable".
# if not installed yet
install.packages("tidyr")
library(tidyr)
data <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
data %>% pivot_longer(c(A, B, C), names_to = "variable", values_to = "value")
# Result
variable value
<chr> <int>
1 A 1
2 B 1
3 C 1
4 A 2
5 B 2
6 C 2
7 A 3
8 B 3
9 C 3
10 A 4
11 B 4
12 C 4
13 A 5
14 B 5
15 C 5
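Note that pivot_longer() returns the rows in row-major order (A, B, C, A, ...), while the output requested in the question is grouped by column. A minimal base R sketch that produces the column-grouped order by hand, assuming all columns share one type (the names long, name and value are illustrative):

```r
df <- data.frame(A = 1:5, B = 1:5, C = 1:5)

long <- data.frame(
  name  = rep(names(df), each = nrow(df)),  # A A A A A B B ...
  value = unlist(df, use.names = FALSE)     # columns laid end to end
)
long
```

This gives the same values as stack(df), with the columns in the name-then-value order shown in the question.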

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to work with any number of columns, so that it would also work if I had S1, S2, S3, S4, or if there was an additional category, e.g. DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns, and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns that start with the specific patterns. I build a regex that matches those column names, gather all matching columns, extract the group from each name, and then rename the groups using cnames. Finally, spread separates the groups into new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
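For comparison, the general function asked for in the question can also be sketched in base R without any quasiquotation. This is an illustration rather than a fix of the original stack_col: the name stack_sets and its argument layout are made up here, keep is taken as a character vector of column names, and the sets are assumed to appear in the same order for every pattern.

```r
stack_sets <- function(dat, patterns, cnames, keep) {
  # For each pattern, find its numbered columns, e.g. "S" -> c("S1", "S2")
  cols <- lapply(patterns, function(p) {
    grep(paste0("^", p, "[0-9]+$"), names(dat), value = TRUE)
  })
  n_sets <- length(cols[[1]])
  stopifnot(all(lengths(cols) == n_sets))  # every pattern needs the same number of sets

  # Build one renamed piece per set, then stack the pieces
  pieces <- lapply(seq_len(n_sets), function(i) {
    picked <- vapply(cols, `[`, character(1), i)
    setNames(dat[c(keep, picked)], c(keep, cnames))
  })
  do.call(rbind, pieces)
}

df <- data.frame(obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                 S1 = rep(c("a", "b"), length.out = 10),
                 PR1 = rep(c(3, 7), length.out = 10),
                 S2 = rep(c("c", "d"), length.out = 10),
                 PR2 = rep(c(7, 3), length.out = 10))

res <- stack_sets(df, patterns = c("S", "PR"),
                  cnames = c("Species", "PR"), keep = "obj")
res
```

Renaming each piece before binding is what lets rbind() line the columns up; the number of sets is inferred from the column names instead of being passed in.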

Gathering specific pairs of columns into rows by dplyr in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert a data frame from wide to long format by gathering specific pairs of columns. An example is shown below:
An example of data frame
df <- data.frame(id=c(1,2,3,4,5), var=c("a","d","g","f","i"),a1=c(3,5,1,2,2), b1=c(2,4,1,2,3), a2=c(8,1,2,5,1), b2=c(1,6,4,7,2), a3=c(7,7,2,3,1), b3=c(1,1,4,9,6))
Initial table:
id var a1 b1 a2 b2 a3 b3
1 1 a 3 2 8 1 7 1
2 2 d 5 4 1 6 7 1
3 3 g 1 1 2 4 2 4
4 4 f 2 2 5 7 3 9
5 5 i 2 3 1 2 1 6
Desired result:
id var a b
1 1 a 3 2
2 1 a 8 1
3 1 a 7 1
4 2 d 5 4
5 2 d 1 6
6 2 d 7 1
7 3 g 1 1
8 3 g 2 4
9 3 g 2 4
10 4 f 2 2
11 4 f 5 7
12 4 f 3 9
13 5 i 2 3
14 5 i 1 2
15 5 i 1 6
Conditions:
Each pair of ai and bi should be gathered together. There are three pairs of a and b ("a1 and b1", "a2 and b2" and "a3 and b3"), so the values in those pairs should be moved into a single pair of columns "a and b", replicating each record three times.
The first and second fields (the id of each sample and its common variable) should be kept in each replicated row.
I thought this might be possible with gather() in the tidyverse. However, as far as I understand, gather() may not be suitable for gathering such specific pairs of fields into multiple target columns (two columns in this case).
It is possible to prepare three data frames separately and bind them into one (example script below), but I would prefer to do it in one continuous pipe in the tidyverse, without interrupting the manipulation.
df1 <- df %>% dplyr::select(id,var,a1,b1)
df2 <- df %>% dplyr::select(id,var,a2,b2)
df3 <- df %>% dplyr::select(id,var,a3,b3)
df.fin <- bind_rows(df1,df2,df3)
I would appreciate your elegant suggestions using the tidyverse.
=================Additional Questions==================
@akrun & @camille
Thank you for your suggestions, and sorry for my late reply. I am now trying to apply your idea to my actual data frame but am still struggling with another issue.
The following are the column names in the actual data frame (sorry, I have not set any values for the columns, as they should not matter here).
colnames(df) <- c("hid","mid","rel","age","gen","mlic","vlic",
"wtaz","staz","ocp","ocpot","emp","empot","expm",
"minc","otaz1","op1","dtime1","atime1","dp1","dtaz1",
"pur1", "repm1","lg1t1","lg2t1","lg3t1","lg4t1","expt1",
"otaz2","op2","dtime2","atime2","dp2","dtaz2","pur2",
"repm2","lg1t2","lg2t2","lg3t2","lg4t2","expt2",
"otaz3","op3","dtime3","atime3","dp3","dtaz3","pur3",
"repm3","lg1t3","lg2t3","lg3t3","lg4t3","expt3",
"otaz4","op4","dtime4","atime4","dp4","dtaz4","pur4",
"repm4","lg1t4","lg2t4","lg3t4","lg4t4","expt4",
"otaz5","op5","dtime5","atime5","dp5","dtaz5","pur5",
"repm5","lg1t5","lg2t5","lg3t5","lg4t5","expt5"
)
Then, I am trying to apply your suggestions as below:
In the data frame, columns 1:15 are common variables and the others are repeated variables with 5 repetitions (numbered 1 to 5 at the end of each variable name). I could run the following script but still have a problem:
#### Convert member table into activity table
## Common variables
hm.com <- names(hm)[c(1:15)]
## Repeating variables
hm.rep <- names(hm)[c(-1:-15)]
hm.rename <- unique(sub("\\d+$","",hm.rep))
## Extract members with trips
hm.trip <- hm %>% filter(otaz!=0) %>% data.frame()
## Convert from member into trip table
test <- split(hm.rep, sub(".*[^1-9$]", "", hm.rep)) %>%
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename) %>%
arrange(hid,mid,dtime,atime) %>%
data.frame()
The result still has an issue:
I could rename the first set of repeated variables, but the fields for sets 2 to 5 remain, and the records are not stored correctly in the data frame.
That is, a set of repeated variables, for instance otaz2 to expt2, is not stored as a second block of rows under otaz to expt but stays in its original columns (otaz2 to expt2). I suppose map_df is not working correctly in my case.
========== Problem Solved ==========
The script above contained a misplaced step: rename_at() has to be applied inside map_df(), to each piece before the pieces are bound together.
Wrong:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename)
Correct:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x) %>%
rename_at(16:28, ~ hm.rename))
Thank you, I can now move on to the next step.
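The fix comes down to renaming each piece before the pieces are bound. A minimal sketch on a toy data frame, where the names df, sets, wrong and right are illustrative and lapply() plus bind_rows() stands in for map_df():

```r
library(dplyr)

df <- data.frame(id = 1:2, a1 = c(3, 5), a2 = c(8, 1))
sets <- list(c("id", "a1"), c("id", "a2"))

# Binding first: a1 and a2 have different names, so bind_rows()
# keeps them as separate columns and pads the gaps with NA
wrong <- bind_rows(lapply(sets, function(cols) df[cols]))

# Renaming each piece first: the columns line up and stack
right <- bind_rows(lapply(sets, function(cols) {
  setNames(df[cols], c("id", "a"))
}))

wrong
right
```

Renaming after the bind can only relabel the already-misaligned columns, which matches the symptom described above.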
We could do this with melt from data.table, which can take multiple patterns in the measure argument to reshape into 'long' format. In this case, we use column names that start (^) with "a" followed by digits as one pattern, and those that start with "b" followed by digits as the other:
library(data.table)
melt(setDT(df), measure = patterns("^a\\d+", "^b\\d+"),
value.name = c("a", "b"))[order(id)][, variable := NULL][]
# id var a b
# 1: 1 a 3 2
# 2: 1 a 8 1
# 3: 1 a 7 1
# 4: 2 d 5 4
# 5: 2 d 1 6
# 6: 2 d 7 1
# 7: 3 g 1 1
# 8: 3 g 2 4
# 9: 3 g 2 4
#10: 4 f 2 2
#11: 4 f 5 7
#12: 4 f 3 9
#13: 5 i 2 3
#14: 5 i 1 2
#15: 5 i 1 6
Or, using the tidyverse, we gather the columns of interest into 'long' format (be cautious with groups of columns that have different classes, where melt is more useful), then separate the 'key' column into two, and spread back to 'wide' format:
library(tidyverse)
df %>%
gather(key, val, a1:b3) %>%
separate(key, into = c("key1", "key2"), sep=1) %>%
spread(key1, val) %>%
select(-key2)
# id var a b
#1 1 a 3 2
#2 1 a 8 1
#3 1 a 7 1
#4 2 d 5 4
#5 2 d 1 6
#6 2 d 7 1
#7 3 g 1 1
#8 3 g 2 4
#9 3 g 2 4
#10 4 f 2 2
#11 4 f 5 7
#12 4 f 3 9
#13 5 i 2 3
#14 5 i 1 2
#15 5 i 1 6
This isn't very scalable, so if you end up needing more than these 3 pairs of columns, go with @akrun's answer. I just wanted to point out that the bind_rows snippet you included could, in fact, be done in one pipe:
library(tidyverse)
bind_rows(
df %>% select(id, var, a = a1, b = b1),
df %>% select(id, var, a = a2, b = b2),
df %>% select(id, var, a = a3, b = b3)
) %>%
arrange(id, var)
#> id var a b
#> 1 1 a 3 2
#> 2 1 a 8 1
#> 3 1 a 7 1
#> 4 2 d 5 4
#> 5 2 d 1 6
#> 6 2 d 7 1
#> 7 3 g 1 1
#> 8 3 g 2 4
#> 9 3 g 2 4
#> 10 4 f 2 2
#> 11 4 f 5 7
#> 12 4 f 3 9
#> 13 5 i 2 3
#> 14 5 i 1 2
#> 15 5 i 1 6
Created on 2018-05-07 by the reprex package (v0.2.0).
If you want something that scales and you like map_* functions (from purrr in the tidyverse), you can abstract the above pipeline:
1:3 %>%
map_df(~select(df, id, var, ends_with(as.character(.))) %>%
setNames(c("id", "var", "a", "b"))) %>%
arrange(id, var)
where 1:3 just represents the numbers of the pairs you have.
A base R solution (note that the result keeps the names a1 and b1 for the stacked columns):
res <- do.call(rbind,lapply(1:3,function(x) setNames(df[c(1:2,2*x+(1:2))],names(df)[1:4])))
res[order(res$id),]
# id var a1 b1
# 1 1 a 3 2
# 6 1 a 8 1
# 11 1 a 7 1
# 2 2 d 5 4
# 7 2 d 1 6
# 12 2 d 7 1
# 3 3 g 1 1
# 8 3 g 2 4
# 13 3 g 2 4
# 4 4 f 2 2
# 9 4 f 5 7
# 14 4 f 3 9
# 5 5 i 2 3
# 10 5 i 1 2
# 15 5 i 1 6
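Base R also ships reshape(), which handles several sets of measurement columns at once. The following is a minimal sketch rather than one of the original answers; the column name set for the pair index is an arbitrary choice made here.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5), var = c("a", "d", "g", "f", "i"),
                 a1 = c(3, 5, 1, 2, 2), b1 = c(2, 4, 1, 2, 3),
                 a2 = c(8, 1, 2, 5, 1), b2 = c(1, 6, 4, 7, 2),
                 a3 = c(7, 7, 2, 3, 1), b3 = c(1, 1, 4, 9, 6))

long <- reshape(df, direction = "long",
                varying = list(c("a1", "a2", "a3"),   # stacked into "a"
                               c("b1", "b2", "b3")),  # stacked into "b"
                v.names = c("a", "b"),
                idvar = "id",
                timevar = "set", times = 1:3)

# Order as in the desired result, then drop the helper column
long <- long[order(long$id, long$set), ]
long$set <- NULL
rownames(long) <- NULL
long
```

Unlike the gather/spread route, each set is stacked column by column, so a and b could have different classes.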

Generate combination of data frame and vector

I know expand.grid creates all combinations of given vectors. But is there a way to generate all combinations of a data frame and a vector, treating each row of the data frame as a single unit? For instance,
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
How can I create a new data frame that is the combination of df and c without expanding df?
df.c:
a b c
1 5 9
2 6 9
3 7 9
1 5 10
2 6 10
3 7 10
Thanks!
For me, the simplest way is merge(df, as.data.frame(c)):
a b c
1 1 5 9
2 2 6 9
3 3 7 9
4 1 5 10
5 2 6 10
6 3 7 10
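This works because the two inputs share no column names: when merge() has no columns to join on, it returns the Cartesian product of the two data frames. A small sketch that makes this explicit with by = NULL (the vector is named v here instead of c, to avoid confusion with base::c):

```r
df <- data.frame(a = 1:3, b = 5:7)
v <- 9:10  # called c in the question

# by = NULL requests the Cartesian product explicitly,
# regardless of whether any column names happen to match
res <- merge(df, data.frame(c = v), by = NULL)
res
```

Spelling out by = NULL guards against an accidental join if the data frame later gains a column with the same name as the vector.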
This may not scale when your dataframe has more than two columns per row, but you can just use expand.grid on the first column and then merge the second column in.
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
combined <- expand.grid(a=df$a, c=c)
combined <- merge(combined, df)
combined[order(combined$c), ]
a c b
1 1 9 5
3 2 9 6
5 3 9 7
2 1 10 5
4 2 10 6
6 3 10 7
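One way around the scaling concern, sketched here as an assumption-labeled variant: expand over row indices instead of over a column's values, so the approach works for any number of columns and does not need column a to be unique. The names v, idx and combined are illustrative.

```r
df <- data.frame(a = 1:3, b = 5:7)
v <- 9:10  # called c in the question

# All combinations of (row index of df, element of v)
idx <- expand.grid(row = seq_len(nrow(df)), v = v)

# Pull whole rows of df by index and attach the matching element of v
combined <- cbind(df[idx$row, , drop = FALSE], c = idx$v)
rownames(combined) <- NULL
combined
```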
You could also do something like this
do.call(rbind, lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
# or using rbindlist as a fast alternative to do.call(rbind,list)
library(data.table)
rbindlist(lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
or
rbindlist(Map(data.frame, c = 9:10, MoreArgs = list(a= 1:3,b=5:7)))
This question is really old but I found one more answer.
Use tidyr's expand_grid().
library(tidyr)
expand_grid(df, c)
# A tibble: 6 × 3
a b c
<int> <int> <int>
1 1 5 9
2 1 5 10
3 2 6 9
4 2 6 10
5 3 7 9
6 3 7 10
