Replicate rows of a data frame using purrr [duplicate] - r

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I have the following data frame:
Id Value Freq
1 A 8 2
2 B 7 3
3 C 2 4
and I want to obtain a new data frame by replicating each Value according to Freq:
Id Value
1 A 8
2 A 8
3 B 7
4 B 7
5 B 7
6 C 2
7 C 2
8 C 2
9 C 2
I understand this can be done very easily with purrr (I have identified map_dfr as the most suitable function), but I cannot work out the best and most compact way to do it.

You can just use some nice indexing-properties of dataframes.
df <- data.frame(Id=c("A","B","C"),Value=c(8,7,2),Freq=c(2,3,4))
replicatedDataframe <- do.call("rbind",lapply(1:NROW(df), function(k) {
df[rep(k,df$Freq[k]),-3]
}))
This can be done more easily with the times argument of rep:
replicatedDataframe <- df[rep(1:NROW(df),times=df$Freq),-3]
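To see why the one-liner works, it may help to look at the row index that rep builds. A minimal sketch, reusing the df from above (the names idx and out are only illustrative):

```r
df <- data.frame(Id = c("A", "B", "C"), Value = c(8, 7, 2), Freq = c(2, 3, 4))

# times = df$Freq repeats row index i exactly Freq[i] times
idx <- rep(seq_len(nrow(df)), times = df$Freq)
idx
#> [1] 1 1 2 2 2 3 3 3 3

# indexing with repeated row numbers duplicates those rows; -3 drops Freq
out <- df[idx, -3]
```

Because the index is computed in one vectorised call, no explicit loop or rbind is needed.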

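Convert Freq to a list-column of integer vectors (one vector of length Freq per row) and unnest.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(Freq = map(Freq, seq_len)) %>%
unnest(Freq) %>%
select(-Freq)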
#> # A tibble: 9 x 2
#> Id Value
#> <chr> <dbl>
#> 1 A 8
#> 2 A 8
#> 3 B 7
#> 4 B 7
#> 5 B 7
#> 6 C 2
#> 7 C 2
#> 8 C 2
#> 9 C 2
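As a more compact alternative, tidyr also ships uncount(), which duplicates rows according to a weights column and drops that column by default. A minimal sketch, assuming a tidyr version that provides uncount() (the name out is illustrative):

```r
library(tidyr)

df <- data.frame(Id = c("A", "B", "C"), Value = c(8, 7, 2), Freq = c(2, 3, 4))

# each row is repeated Freq times; the Freq column is removed afterwards
out <- uncount(df, Freq)
```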

Related

One-To-One apply/map in R operations

I've got 3 tibbles in a list, and I would like to add a column to each, with a different value per tibble. I'm struggling to do this without the usual for loop (granted, for may well be the best solution in this use case, but I am curious whether there's a clever solution).
library(tibble)
library(dplyr)
t1 <- tibble(a=1:10, b=2:11, c=3:12)
t2 <- tibble(a=1:10, b=2:11, c=3:12)
t3 <- tibble(a=1:10, b=2:11, c=3:12)
tlist <- list(t1, t2, t3)
names <- c("A", "B", "C")
The result I want is the same as calling dplyr::mutate on each tibble to add an extra column holding the corresponding value from names; to illustrate:
t1 %>% mutate(name=names[1])
t2 %>% mutate(name=names[2])
t3 %>% mutate(name=names[3])
I've tried lapply, sapply, and mapply (and some combinations of two of them), as well as purrr::map, but I couldn't see a way of applying a mutate with a single value to a single tibble (i.e., a one-to-one apply/map). Python has zip, which creates pairs of values that are easy to access inside apply-style functions, but I'm not aware of an equivalent facility in R.
A piece of pseudo-ish code in R (which mimics what zip in Python would be like):
args <- pair(tlist, names)
add.col <- function(arg.pair) {
arg.pair[[1]] %>% mutate(name=arg.pair[[2]])
}
res <- args %>% lapply(add.col)
Note that the apply functions do not accept a Pair object (a utility from the stats package).
It's very likely that I have a blind spot here, as I've become increasingly obsessed with apply; is there a clever way to do this one-to-one mapping?
Use map2:
library(purrr)
library(dplyr)
map2(tlist, names,
~ .x %>%
mutate(name = .y))
Or in base R with Map:
Map(function(x, y) transform(x, name = y), tlist, names)
output:
[[1]]
# A tibble: 10 × 4
a b c name
<int> <int> <int> <chr>
1 1 2 3 A
2 2 3 4 A
3 3 4 5 A
4 4 5 6 A
5 5 6 7 A
6 6 7 8 A
7 7 8 9 A
8 8 9 10 A
9 9 10 11 A
10 10 11 12 A
[[2]]
# A tibble: 10 × 4
a b c name
<int> <int> <int> <chr>
1 1 2 3 B
2 2 3 4 B
3 3 4 5 B
4 4 5 6 B
5 5 6 7 B
6 6 7 8 B
7 7 8 9 B
8 8 9 10 B
9 9 10 11 B
10 10 11 12 B
[[3]]
# A tibble: 10 × 4
a b c name
<int> <int> <int> <chr>
1 1 2 3 C
2 2 3 4 C
3 3 4 5 C
4 4 5 6 C
5 5 6 7 C
6 6 7 8 C
7 7 8 9 C
8 8 9 10 C
9 9 10 11 C
10 10 11 12 C
mapply also works:
mapply(mutate, tlist, name = names, SIMPLIFY = FALSE)
Using imap
library(purrr)
library(dplyr)
imap(setNames(tlist, names), ~ .x %>%
mutate(name = .y))
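If an explicit zip-like pairing is preferred, as in the question's pseudocode, one option is to build the pairs first and then lapply over them. A minimal sketch with smaller tibbles; labels, pairs, and add_col are illustrative names, and Map(list, ...) stands in for the hypothetical pair():

```r
library(dplyr)
library(tibble)

tlist <- list(tibble(a = 1:3), tibble(a = 4:6), tibble(a = 7:9))
labels <- c("A", "B", "C")

# Map(list, ...) pairs the i-th tibble with the i-th label, much like zip
pairs <- Map(list, tlist, labels)

# each pair is a two-element list: the tibble, then its label
add_col <- function(p) mutate(p[[1]], name = p[[2]])
res <- lapply(pairs, add_col)
```

In practice map2 or Map does the pairing and the application in one step, so the intermediate list of pairs is rarely needed.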

How to stack multiple columns into one using R [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 1 year ago.
I have the following data frame:
A <- c(3,5,6,7)
B <- c(2,4,5,3)
C <- c(4,6,7,8)
D <- c(2,4,5,3)
gene <- c(1,2,3,4)
df <- data.frame(gene,A,B,C,D)
df
gene A B C D
1 1 3 2 4 2
2 2 5 4 6 4
3 3 6 5 7 5
4 4 7 3 8 3
How can I stack the lettered columns into one new column called "count", with another new column called "sample" that keeps track of the original column each count value came from (i.e., I would like the following output):
count sample
3 A
5 A
6 A
7 A
2 B
4 B
5 B
3 B
4 C
6 C
7 C
8 C
2 D
4 D
5 D
3 D
Sorry this is difficult to explain but the output data frame above should make it clear.
Thanks
In base R, use stack after removing the first column
out <- stack(df[-1])
names(out) <- c("count", "sample")
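Note that stack() returns its columns as values and ind, and that ind is a factor. A small sketch of the same call, assuming a character sample column is wanted (the conversion is an optional extra step, not part of the answer above):

```r
df <- data.frame(gene = 1:4, A = c(3, 5, 6, 7), B = c(2, 4, 5, 3),
                 C = c(4, 6, 7, 8), D = c(2, 4, 5, 3))

out <- stack(df[-1])               # columns: values, ind
names(out) <- c("count", "sample")

# ind comes back as a factor; convert it if plain strings are preferred
out$sample <- as.character(out$sample)
```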
We could use pivot_longer:
library(tidyr)
library(dplyr)
df %>%
pivot_longer(
cols = -gene,
names_to = "sample",
values_to = "count"
) %>%
select(-gene) %>%
arrange(sample)
sample count
<chr> <dbl>
1 A 3
2 A 5
3 A 6
4 A 7
5 B 2
6 B 4
7 B 5
8 B 3
9 C 4
10 C 6
11 C 7
12 C 8
13 D 2
14 D 4
15 D 5
16 D 3

Split multiple columns based on the same delimiter [duplicate]

This question already has answers here:
Tidy method to split multiple columns using tidyr::separate
(5 answers)
Closed 2 years ago.
I struggle to find a more programmatic way to split multiple columns based on the same delimiter...
The solution should work for n columns, all to be identified by a similar pattern, in my example "^var[0-9]"
E.g.
library(tidyverse)
foo <- data.frame(var1 = paste0("a_",1:10), var2 = paste("a_",1:10), id = 1:10)
# Desired output
foo %>%
separate(var1, into = c("group1", "index1")) %>%
separate(var2, into = c("group2", "index2"))
#> group1 index1 group2 index2 id
#> 1 a 1 a 1 1
#> 2 a 2 a 2 2
#> 3 a 3 a 3 3
#> 4 a 4 a 4 4
#> 5 a 5 a 5 5
#> 6 a 6 a 6 6
#> 7 a 7 a 7 7
#> 8 a 8 a 8 8
#> 9 a 9 a 9 9
#> 10 a 10 a 10 10
This isn't very elegant, but it works (based on answers here: Tidy method to split multiple columns using tidyr::separate):
grep("var[0-9]", names(foo), value = TRUE) %>%
map_dfc(~ foo %>%
select(.x) %>%
separate(.x,
into = paste0(c("group", "index"), gsub("[^0-9]", "", .x)))
) %>%
bind_cols(id = foo$id)
Returns:
group1 index1 group2 index2 id
1 a 1 a 1 1
2 a 2 a 2 2
3 a 3 a 3 3
4 a 4 a 4 4
5 a 5 a 5 5
6 a 6 a 6 6
7 a 7 a 7 7
8 a 8 a 8 8
9 a 9 a 9 9
10 a 10 a 10 10

Gathering specific pairs of columns into rows by dplyr in R [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to convert a data frame from wide to long format by gathering specific pairs of columns; an example is shown below.
An example data frame:
df <- data.frame(id=c(1,2,3,4,5), var=c("a","d","g","f","i"),a1=c(3,5,1,2,2), b1=c(2,4,1,2,3), a2=c(8,1,2,5,1), b2=c(1,6,4,7,2), a3=c(7,7,2,3,1), b3=c(1,1,4,9,6))
Initial table:
id var a1 b1 a2 b2 a3 b3
1 1 a 3 2 8 1 7 1
2 2 d 5 4 1 6 7 1
3 3 g 1 1 2 4 2 4
4 4 f 2 2 5 7 3 9
5 5 i 2 3 1 2 1 6
Desired result:
id var a b
1 1 a 3 2
2 1 a 8 1
3 1 a 7 1
4 2 d 5 4
5 2 d 1 6
6 2 d 7 1
7 3 g 1 1
8 3 g 2 4
9 3 g 2 4
10 4 f 2 2
11 4 f 5 7
12 4 f 3 9
13 5 i 2 3
14 5 i 1 2
15 5 i 1 6
Conditions:
Each pair of ai and bi should be gathered together: there are 3 pairs of a and b ("a1 and b1", "a2 and b2", and "a3 and b3"), and the values in each pair should be moved into a single pair of columns "a and b", replicating each record three times.
The first and second fields (the id of each sample and its common variable) should be kept in each replicated row.
I thought this might be possible with gather() from the tidyverse; however, as far as I understand, gather may not be suitable for gathering such specific pairs of fields into multiple target columns (two columns in this case).
It is possible to prepare three data frames separately and bind them into one (an example script is shown below); however, I would prefer to do it in one continuous pipe in the tidyverse, without interrupting the chain.
df1 <- df %>% dplyr::select(id,var,a1,b1)
df2 <- df %>% dplyr::select(id,var,a2,b2)
df3 <- df %>% dplyr::select(id,var,a3,b3)
df.fin <- bind_rows(df1,df2,df3)
I would appreciate your elegant suggestions using tidyverse.
=================Additional Questions==================
@Akrun & Camille
Thank you for your suggestions, and sorry for my late reply. I am now trying to apply your ideas to my actual data frame, but I am still struggling with another issue.
The following are the column names of the actual data frame (sorry, I have not set any values for the columns, as they should not matter).
colnames(df) <- c("hid","mid","rel","age","gen","mlic","vlic",
"wtaz","staz","ocp","ocpot","emp","empot","expm",
"minc","otaz1","op1","dtime1","atime1","dp1","dtaz1",
"pur1", "repm1","lg1t1","lg2t1","lg3t1","lg4t1","expt1",
"otaz2","op2","dtime2","atime2","dp2","dtaz2","pur2",
"repm2","lg1t2","lg2t2","lg3t2","lg4t2","expt2",
"otaz3","op3","dtime3","atime3","dp3","dtaz3","pur3",
"repm3","lg1t3","lg2t3","lg3t3","lg4t3","expt3",
"otaz4","op4","dtime4","atime4","dp4","dtaz4","pur4",
"repm4","lg1t4","lg2t4","lg3t4","lg4t4","expt4",
"otaz5","op5","dtime5","atime5","dp5","dtaz5","pur5",
"repm5","lg1t5","lg2t5","lg3t5","lg4t5","expt5"
)
Then, I am trying to apply your suggestions as below:
In the data frame, columns 1:15 are common variables and the others are repeated variables with 5 repetitions (the index 1 to 5 is at the end of each variable name). I could run the following script, but I still have a problem:
#### Convert member table into activity table
## Common variables
hm.com <- names(hm)[c(1:15)]
## Repeating variables
hm.rep <- names(hm)[c(-1:-15)]
hm.rename <- unique(sub("\\d+$","",hm.rep))
## Extract members with trips
hm.trip <- hm %>% filter(otaz!=0) %>% data.frame()
## Convert from member into trip table
test <- split(hm.rep, sub(".*[^1-9$]", "", hm.rep)) %>%
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename) %>%
arrange(hid,mid,dtime,atime) %>%
data.frame()
The result still has an issue:
I could rename the first set of repeated variables; however, the columns for sets 2 to 5 still remain, and the records are not stored appropriately in the data frame.
I mean that a set of repeated variables, for instance otaz2 to expt2, is not stored as additional rows under otaz~expt but stays in its original columns (otaz2 to expt2). I suppose map_df is not working correctly in my case.
========== Problem Solved ==========
The script above contained a mistake: rename_at was applied after map_df had already bound the rows, instead of to each subset inside the map_df call.
Wrong:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x)) %>%
rename_at(16:28, ~ hm.rename)
Correct:
map_df(~ hm.trip %>% dplyr::select(hm.com, .x) %>%
rename_at(16:28, ~ hm.rename))
Thank you, I can now move on to the next step.
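The difference between the two versions can be reproduced on a small toy data frame. A minimal sketch: df, groups, and new_names are illustrative stand-ins for hm.trip, the split of hm.rep, and hm.rename, and setNames is used in place of rename_at for brevity.

```r
library(dplyr)
library(purrr)

df <- data.frame(id = 1:2, a1 = c(3, 5), b1 = c(2, 4), a2 = c(8, 1), b2 = c(1, 6))
rep_cols <- names(df)[-1]
groups <- split(rep_cols, sub("^[^0-9]+", "", rep_cols))  # "1": a1, b1; "2": a2, b2
new_names <- unique(sub("[0-9]+$", "", rep_cols))         # "a", "b"

# renaming only after binding: columns differ per subset, so they stay side by side
wrong <- map_df(groups, function(cols) select(df, id, all_of(cols)))

# renaming each subset before binding: columns line up and rows stack
right <- map_df(groups, function(cols) {
  setNames(select(df, id, all_of(cols)), c("id", new_names))
})
```

Here wrong keeps all five original columns padded with NA, while right has only id, a, and b.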
We could do this with melt from data.table, which can take multiple patterns in the measure argument to reshape into 'long' format. In this case, we use column names that start (^) with "a" followed by digits as one pattern, and those that start with "b" followed by digits as the other.
library(data.table)
melt(setDT(df), measure = patterns("^a\\d+", "^b\\d+"),
value.name = c("a", "b"))[order(id)][, variable := NULL][]
# id var a b
# 1: 1 a 3 2
# 2: 1 a 8 1
# 3: 1 a 7 1
# 4: 2 d 5 4
# 5: 2 d 1 6
# 6: 2 d 7 1
# 7: 3 g 1 1
# 8: 3 g 2 4
# 9: 3 g 2 4
#10: 4 f 2 2
#11: 4 f 5 7
#12: 4 f 3 9
#13: 5 i 2 3
#14: 5 i 1 2
#15: 5 i 1 6
Or using tidyverse, we gather the columns of interest into 'long' format (but be cautious when the groups of columns have different classes, where melt is more useful), then separate the 'key' column into two, and spread back to 'wide' format.
library(tidyverse)
df %>%
gather(key, val, a1:b3) %>%
separate(key, into = c("key1", "key2"), sep=1) %>%
spread(key1, val) %>%
select(-key2)
# id var a b
#1 1 a 3 2
#2 1 a 8 1
#3 1 a 7 1
#4 2 d 5 4
#5 2 d 1 6
#6 2 d 7 1
#7 3 g 1 1
#8 3 g 2 4
#9 3 g 2 4
#10 4 f 2 2
#11 4 f 5 7
#12 4 f 3 9
#13 5 i 2 3
#14 5 i 1 2
#15 5 i 1 6
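In more recent tidyr versions, gather() and spread() are superseded by pivot_longer() and pivot_wider(), and the same reshaping can be written as a single step using the ".value" sentinel in names_to. A minimal sketch on the example data, assuming a tidyr version that provides pivot_longer(); the helper column name set is illustrative:

```r
library(tidyr)
library(dplyr)

df <- data.frame(id = c(1, 2, 3, 4, 5), var = c("a", "d", "g", "f", "i"),
                 a1 = c(3, 5, 1, 2, 2), b1 = c(2, 4, 1, 2, 3),
                 a2 = c(8, 1, 2, 5, 1), b2 = c(1, 6, 4, 7, 2),
                 a3 = c(7, 7, 2, 3, 1), b3 = c(1, 1, 4, 9, 6))

long <- df %>%
  pivot_longer(
    cols = a1:b3,
    names_to = c(".value", "set"),   # letter -> output column, digits -> "set"
    names_pattern = "([ab])(\\d+)"
  ) %>%
  select(-set)
```

Because each output column keeps its own type, this approach avoids the class-mixing caveat mentioned for gather.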
This isn't very scalable, so if you end up needing more than these 3 pairs of columns, go with @akrun's answer. I just wanted to point out that the bind_rows snippet you included could, in fact, be done in one pipe:
library(tidyverse)
bind_rows(
df %>% select(id, var, a = a1, b = b1),
df %>% select(id, var, a = a2, b = b2),
df %>% select(id, var, a = a3, b = b3)
) %>%
arrange(id, var)
#> id var a b
#> 1 1 a 3 2
#> 2 1 a 8 1
#> 3 1 a 7 1
#> 4 2 d 5 4
#> 5 2 d 1 6
#> 6 2 d 7 1
#> 7 3 g 1 1
#> 8 3 g 2 4
#> 9 3 g 2 4
#> 10 4 f 2 2
#> 11 4 f 5 7
#> 12 4 f 3 9
#> 13 5 i 2 3
#> 14 5 i 1 2
#> 15 5 i 1 6
Created on 2018-05-07 by the reprex package (v0.2.0).
If you want something that scales and you like map_* functions (from purrr in the tidyverse), you can abstract the above pipeline:
1:3 %>%
map_df(~select(df, id, var, ends_with(as.character(.))) %>%
setNames(c("id", "var", "a", "b"))) %>%
arrange(id, var)
where 1:3 just represents the numbers of the pairs you have.
A base R solution:
res <- do.call(rbind,lapply(1:3,function(x) setNames(df[c(1:2,2*x+(1:2))],names(df)[1:4])))
res[order(res$id),]
# id var a1 b1
# 1 1 a 3 2
# 6 1 a 8 1
# 11 1 a 7 1
# 2 2 d 5 4
# 7 2 d 1 6
# 12 2 d 7 1
# 3 3 g 1 1
# 8 3 g 2 4
# 13 3 g 2 4
# 4 4 f 2 2
# 9 4 f 5 7
# 14 4 f 3 9
# 5 5 i 2 3
# 10 5 i 1 2
# 15 5 i 1 6

Generate combination of data frame and vector

I know expand.grid creates all combinations of the given vectors. But is there a way to generate all combinations of a data frame and a vector, treating each row of the data frame as a single unit? For instance,
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
how can I create a new data frame that combines df and c without expanding df itself?
df.c:
a b c
1 5 9
2 6 9
3 7 9
1 5 10
2 6 10
3 7 10
Thanks!
For me, the simplest way is merge(df, as.data.frame(c)); with no columns in common, merge returns the Cartesian product of the two:
a b c
1 1 5 9
2 2 6 9
3 3 7 9
4 1 5 10
5 2 6 10
6 3 7 10
This may not scale when your dataframe has more than two columns per row, but you can just use expand.grid on the first column and then merge the second column in.
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
combined <- expand.grid(a=df$a, c=c)
combined <- merge(combined, df)
> combined[order(combined$c), ]
a c b
1 1 9 5
3 2 9 6
5 3 9 7
2 1 10 5
4 2 10 6
6 3 10 7
You could also do something like this
do.call(rbind, lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
# or using rbindlist as a fast alternative to do.call(rbind,list)
library(data.table)
rbindlist(lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
or
rbindlist(Map(data.frame, c = 9:10, MoreArgs = list(a= 1:3,b=5:7)))
This question is really old but I found one more answer.
Use tidyr's expand_grid().
expand_grid(df, c)
# A tibble: 6 × 3
a b c
<int> <int> <int>
1 1 5 9
2 1 5 10
3 2 6 9
4 2 6 10
5 3 7 9
6 3 7 10
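Note that the output above varies c fastest, whereas the question's desired output keeps the rows of df together within each value of c. One way to get that ordering is to pass the vector first and reorder the columns afterwards. A minimal sketch, with the vector renamed to cvals (an illustrative name, chosen to avoid shadowing base::c):

```r
library(tidyr)

df <- data.frame(a = 1:3, b = 5:7)
cvals <- 9:10

# the first argument varies slowest, so the rows of df repeat as a block
# for each value of c
out <- expand_grid(c = cvals, df)
out <- out[c("a", "b", "c")]   # restore the column order from the question
```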
