Related
This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
Quite new to R and I have a dataset in this format:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
But I want it in this format:
A 1
A 2
A 3
A 4
A 5
B 1
B 2
B 3
...etc.
Seems like such a simple issue but I need HELP! Thanks
df <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
stack(df)
values ind
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 1 C
12 2 C
13 3 C
14 4 C
15 5 C
Examples using dplyr's gather function:
library(tidyverse)
A <- c(1,2,3,4,5)
B <- c(1,2,3,4,5)
C <- c(1,2,3,4,5)
df <- data.frame(A,B,C)
df %>% gather(key = "key", value = "value")
key value
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
You can use the package tidyr. This let's you choose, which columns you want to gather in the column "variable".
# if not installed yet
install.packages("tidyr")
library(tidyr)
data <- data.frame(
A = 1:5,
B = 1:5,
C = 1:5
)
data %>% pivot_longer(c(A, B, C), names_to = "variable", values_to = "value")
# Result
variable value
<chr> <int>
1 A 1
2 B 1
3 C 1
4 A 2
5 B 2
6 C 2
7 A 3
8 B 3
9 C 3
10 A 4
11 B 4
12 C 4
13 A 5
14 B 5
15 C 5
I have the following dataframe
df<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I Wish to create a new dataframe df2 from df that looks as follows
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a column CFreq that takes unique values from A_Frequency, B_Frequency and groups them by ID. Then it ignores the NA values and generates the CFreq column
I have tried dplyr but am unable to get the required response
df2<-df%>%group_by(ID)%>%select(ID, A_Frequency,B_Frequency)%>%
mutate(Cfreq=unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Request someone to help me here
gather function from tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)
A different tidyverse possibility could be:
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
Base R approach would be to split the dataframe based on ID and for every list we count the number of unique enteries and create a sequence based on that.
do.call(rbind, lapply(split(df, df$ID), function(x) data.frame(ID = x$ID[1] ,
CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency B_Frequency has characters in them or some other random numbers instead of sequential numbers.
In tidyverse we can do
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7
This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
In to this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category ie DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns which starts with the specific pattern. Therefore I create a regex which matches the column names and then first gather all of them, extract the group and the rename the groups with the cnames. Finally spread takes separates the new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert a data frame from wide to long format by gathering specific pairs of columns of which example is shown below:
An example of data frame
df <- data.frame(id=c(1,2,3,4,5), var=c("a","d","g","f","i"),a1=c(3,5,1,2,2), b1=c(2,4,1,2,3), a2=c(8,1,2,5,1), b2=c(1,6,4,7,2), a3=c(7,7,2,3,1), b3=c(1,1,4,9,6))
Initial table:
id var a1 b1 a2 b2 a3 b3
1 1 a 3 2 8 1 7 1
2 2 d 5 4 1 6 7 1
3 3 g 1 1 2 4 2 4
4 4 f 2 2 5 7 3 9
5 5 i 2 3 1 2 1 6
Desired result:
id var a b
1 1 a 3 2
2 1 a 8 1
3 1 a 7 1
4 2 d 5 4
5 2 d 1 6
6 2 d 7 1
7 3 g 1 1
8 3 g 2 4
9 3 g 2 4
10 4 f 2 2
11 4 f 5 7
12 4 f 3 9
13 5 i 2 3
14 5 i 1 2
15 5 i 1 6
Conditions:
Pair of ai and bi should be gathered: As there are 3 pairs of a and b, "a1 and b1", "a2 and b2" and "a3 and b3", values in those pairs should be moved to a pair of "a and b" by replicating each record in three times
First and second fields (id of each sample and its common variable) should be kept in each replicated rows
I was thinking that it is possible to make it by gather() in tidyverse, however, as far as I understand, I suppose that gather function may not be suitable for gathering such specific pairs of fields into specific multiple columns (two columns in this case).
It is possible to make it to prepare three data frames separately and binding it into one (example scripts are shown below), however I prefer to make it in one continuous pipe operation in tidyverse not to stop manipulation.
df1 <- df %>% dplyr::select(id,var,a1,b1)
df2 <- df %>% dplyr::select(id,var,a2,b2)
df3 <- df %>% dplyr::select(id,var,a3,b3)
df.fin <- bind_rows(df1,df2,df3)
I would appreciate your elegant suggestons using tidyverse.
=================Additional Questions==================
#Akrun & Camille
Thank you for your suggestions and sorry for my late reply. I am now trying to apply your idea into actual data frame but still struggling with another issue.
Followings are column names in actual data frame (sorry, I do not set any values of each columns as it may not be a matter).
colnames(df) <- c("hid","mid","rel","age","gen","mlic","vlic",
"wtaz","staz","ocp","ocpot","emp","empot","expm",
"minc","otaz1","op1","dtime1","atime1","dp1","dtaz1",
"pur1", "repm1","lg1t1","lg2t1","lg3t1","lg4t1","expt1",
"otaz2","op2","dtime2","atime2","dp2","dtaz2","pur2",
"repm2","lg1t2","lg2t2","lg3t2","lg4t2","expt2",
"otaz3","op3","dtime3","atime3","dp3","dtaz3","pur3",
"repm3","lg1t3","lg2t3","lg3t3","lg4t3","expt3",
"otaz4","op4","dtime4","atime4","dp4","dtaz4","pur4",
"repm4","lg1t4","lg2t4","lg3t4","lg4t4","expt4",
"otaz5","op5","dtime5","atime5","dp5","dtaz5","pur5",
"repm5","lg1t5","lg2t5","lg3t5","lg4t5","expt5"
)
Then, I am trying to apply your suggestions as below:
In the data frame, columns 1:15 are commons variables and others are repeated variables with 5 repetitions (1 to 5 located at the end of each varible). I could rund following script but still have problem:
#### Convert member table into activity table
## Common variables
hm.com <- names(hm)[c(1:15)]
## Repeating variables
hm.rep <- names(hm)[c(-1:-15)]
hm.rename <- unique(sub("\\d+$","",hm.rep))
## Extract members with trips
hm.trip <- hm %>% filter(otaz!=0) %>% data.frame()
## Convert from member into trip table
test <- split(hm.rep, sub(".*[^1-9$]", "", hm.rep)) %>%
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename) %>%
arrange(hid,mid,dtime,atime) %>%
data.frame()
The result still have an issue:
I could rename first set of repeated variables, however remaining fields from 2 to 5 are still remaining and records are not appropriately stored in the data frame.
I mean that, a set of repeated variables, for instance, from otaz2 to expt2, are stored not in the second row of otaz~expt but stored in its original position (from otaz2 to expt2). I suppose map_df is not working correctly in my case.
========== Problem Solved ==========
Above script was containing incorrect manipulation:
Wrong:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename)
Correct:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x) %>%
rename_at(16:28, ~ hm.rename))
Thank you, I could go to the next step.
We could do this with melt from data.table which can take multiple patterns in the measure argument to reshape into 'long' format. In this case we are using column names that start (^) with "a" followed by numbers as one pattern and those start with "b" and followed by numbers as other
library(data.table)
melt(setDT(df), measure = patterns("^a\\d+", "^b\\d+"),
value.name = c("a", "b"))[order(id)][, variable := NULL][]
# id var a b
# 1: 1 a 3 2
# 2: 1 a 8 1
# 3: 1 a 7 1
# 4: 2 d 5 4
# 5: 2 d 1 6
# 6: 2 d 7 1
# 7: 3 g 1 1
# 8: 3 g 2 4
# 9: 3 g 2 4
#10: 4 f 2 2
#11: 4 f 5 7
#12: 4 f 3 9
#13: 5 i 2 3
#14: 5 i 1 2
#15: 5 i 1 6
Or using tidyverse, we gather the columns of interest to 'long' format (but should be cautious when dealing with groups of columns that are having different classes - where melt is more useful), then separate the 'key' column into two, and spread to 'wide' format
library(tidyverse)
df %>%
gather(key, val, a1:b3) %>%
separate(key, into = c("key1", "key2"), sep=1) %>%
spread(key1, val) %>%
select(-key2)
# id var a b
#1 1 a 3 2
#2 1 a 8 1
#3 1 a 7 1
#4 2 d 5 4
#5 2 d 1 6
#6 2 d 7 1
#7 3 g 1 1
#8 3 g 2 4
#9 3 g 2 4
#10 4 f 2 2
#11 4 f 5 7
#12 4 f 3 9
#13 5 i 2 3
#14 5 i 1 2
#15 5 i 1 6
This isn't very scaleable, so if you end up needing more than these 3 pairs of columns, go with #akrun's answer. I just wanted to point out that the bind_rows snippet you included could, in fact, be done in one pipe:
library(tidyverse)
bind_rows(
df %>% select(id, var, a = a1, b = b1),
df %>% select(id, var, a = a2, b = b2),
df %>% select(id, var, a = a3, b = b3)
) %>%
arrange(id, var)
#> id var a b
#> 1 1 a 3 2
#> 2 1 a 8 1
#> 3 1 a 7 1
#> 4 2 d 5 4
#> 5 2 d 1 6
#> 6 2 d 7 1
#> 7 3 g 1 1
#> 8 3 g 2 4
#> 9 3 g 2 4
#> 10 4 f 2 2
#> 11 4 f 5 7
#> 12 4 f 3 9
#> 13 5 i 2 3
#> 14 5 i 1 2
#> 15 5 i 1 6
Created on 2018-05-07 by the reprex package (v0.2.0).
If you want something that scales and you like map_* functions (from purrr in the tidyverse), you can abstract the above pipeline:
1:3 %>%
map_df(~select(df, id, var, ends_with(as.character(.))) %>%
setNames(c("id", "var", "a", "b"))) %>%
arrange(id, var)
where 1:3 just represents the numbers of the pairs you have.
a base R solution:
res <- do.call(rbind,lapply(1:3,function(x) setNames(df[c(1:2,2*x+(1:2))],names(df)[1:4])))
res[order(res$id),]
# id var a1 b1
# 1 1 a 3 2
# 6 1 a 8 1
# 11 1 a 7 1
# 2 2 d 5 4
# 7 2 d 1 6
# 12 2 d 7 1
# 3 3 g 1 1
# 8 3 g 2 4
# 13 3 g 2 4
# 4 4 f 2 2
# 9 4 f 5 7
# 14 4 f 3 9
# 5 5 i 2 3
# 10 5 i 1 2
# 15 5 i 1 6
how do I extract specific row of data when the column has repetitive value? my data looks like this: I want to extract the row of the end of each repeat of x (A 3 10, A 2 3 etc) or the index of the last value
Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3
Expected output
Index Name X M
3 A 3 10
5 A 2 3
10 A 5 3
13 B 3 10
15 B 2 3
Using base R duplicated and cumsum:
dups <- !duplicated(cumsum(dat$X == 1), fromLast=TRUE)
cbind(dat[dups,], Index=which(dups))
# Name X M Index
#3 A 3 10 3
#5 A 2 3 5
#10 A 5 3 10
#13 B 3 10 13
#15 B 2 3 15
A solution using dplyr.
library(dplyr)
df2 <- df %>%
mutate(Flag = ifelse(lead(X) < X, 1, 0)) %>%
mutate(Index = 1:n()) %>%
filter(Flag == 1 | is.na(Flag)) %>%
select(Index, X, M)
df2
# Index X M
# 1 3 3 10
# 2 5 2 3
# 3 10 5 3
# 4 13 3 10
# 5 15 2 3
Flag is a column showing if the next number in A is smaller than the previous number. If TRUE, Flag is 1, otherwise is 0. We can then filter for Flag == 1 or where Flag is NA, which is the last row. df2 is the final filtered data frame.
DATA
df <- read.table(text = "Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3",
header = TRUE, stringsAsFactors = FALSE)