R: Create duplicate rows based on a variable (dplyr preferred) [duplicate] - r

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 3 years ago.
I'd like to create a new list with duplicate entries based upon an existing list in R. I'm trying to use tidyverse as much as possible, so dplyr would be preferred.
Say I have a list of times where sales occurred:
df <- data.frame(time = c(0,1,2,3,4,5), sales = c(1,1,2,1,1,3))
> df
time sales
1 0 1
2 1 1
3 2 2
4 3 1
5 4 1
6 5 3
And I'd like instead to have a list with an entry for each sale:
ans <- data.frame(salesTime = c(0,1,2,2,3,4,5,5,5))
> ans
salesTime
1 0
2 1
3 2
4 2
5 3
6 4
7 5
8 5
9 5
I found an interesting example using dplyr here: Create duplicate rows based on conditions in R
But this will only allow me to create one new row when sales == n, and not create n new rows when sales == n.
Any help would be greatly appreciated.

A nice tidyr function for this is uncount():
library(dplyr)
library(tidyr)
df %>%
  uncount(sales) %>%
  rename(salesTime = time)
salesTime
1 0
2 1
3 2
3.1 2
4 3
5 4
6 5
6.1 5
6.2 5

data.frame(salesTime = rep(df$time, df$sales))
# salesTime
#1 0
#2 1
#3 2
#4 2
#5 3
#6 4
#7 5
#8 5
#9 5
If you like dplyr and pipes you can go for:
df %>% {data.frame(salesTime = rep(.$time, .$sales))}
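The rep() idea above also extends to whole rows. As a minimal sketch, assuming you want to keep every column of df (the names idx and expanded are only illustrative), you can repeat the row indices instead of a single column:

```r
df <- data.frame(time = c(0, 1, 2, 3, 4, 5), sales = c(1, 1, 2, 1, 1, 3))

# repeat each row index as many times as the sales count in that row
idx <- rep(seq_len(nrow(df)), times = df$sales)

# index the data frame by the repeated row numbers; all columns are kept
expanded <- df[idx, ]
rownames(expanded) <- NULL  # reset the 3.1-style row names
```

A row whose sales count is 0 is dropped entirely by this approach, because rep() repeats its index zero times.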

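# recent tidyr versions expect the column to unnest to be named via `cols`;
# the printed column order may differ from the output shown below
df %>% rowwise() %>% mutate(time = list(rep(time, sales))) %>% ungroup() %>% unnest(cols = time)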
## A tibble: 9 x 2
# sales time
# <dbl> <dbl>
#1 1 0
#2 1 1
#3 2 2
#4 2 2
#5 1 3
#6 1 4
#7 3 5
#8 3 5
#9 3 5


Using loops with mutate in R to sum columns with partially matching column names

df <- data.frame(x_1_jr=c(1,2,3,4), x_2_jr=c(1,2,3,4), y_1_jr=c(4,3,2,1), y_2_jr=c(4,3,2,1))
x_1_jr x_2_jr y_1_jr y_2_jr
1 1 1 4 4
2 2 2 3 3
3 3 3 2 2
4 4 4 1 1
I am trying to generate new variables that are the sum of x and y with the same column name suffix, i.e.
df <- df %>% mutate(z_1_jr= x_1_jr + y_1_jr)
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr
1 1 1 4 4 5
2 2 2 3 3 5
3 3 3 2 2 5
4 4 4 1 1 5
I could write this out for each variable combination, but I have a large number of variables (more than 50 in each of the x and y groups) and would like to use a loop. However, I'm relatively new to R and am not sure where to begin!
Can someone help? Thank you!
EDIT: for additional clarity, the dataset contains other non-numeric variables. There are >700 columns (from a large survey). x_1_jr represents, for example, the number of male individuals aged 1 year, and y_1_jr the number of female individuals aged 1 year. I am trying to get a total (male plus female) for each age group.
An option with base R
df[c("z_1_jr", "z_2_jr")] <- sapply(split.default(df,
sub("^[a-z]+_", "", names(df))), rowSums)
df
# x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
#1 1 1 4 4 5 5
#2 2 2 3 3 5 5
#3 3 3 2 2 5 5
#4 4 4 1 1 5 5
One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df %>%
  bind_cols(map_dfc(.x = unique(sub(".*?_", "_", names(df))),
                    ~ df %>%
                      transmute(!!paste0("z", .x) := rowSums(select(., ends_with(.x))))))
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
1 1 1 4 4 5 5
2 2 2 3 3 5 5
3 3 3 2 2 5 5
4 4 4 1 1 5 5
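Since the question asks for a loop and mentions other non-numeric columns, here is a minimal base R sketch of that route. It assumes every column to be summed is named x_<suffix> and has a matching y_<suffix>. The id column and the variable names x_cols, suffix and y_col are made up for illustration:

```r
df <- data.frame(id = c("a", "b", "c", "d"),   # hypothetical non-numeric column
                 x_1_jr = c(1, 2, 3, 4), x_2_jr = c(1, 2, 3, 4),
                 y_1_jr = c(4, 3, 2, 1), y_2_jr = c(4, 3, 2, 1))

# only columns starting with "x_" drive the loop, so other columns are ignored
x_cols <- grep("^x_", names(df), value = TRUE)

for (x_col in x_cols) {
  suffix <- sub("^x_", "", x_col)        # e.g. "1_jr"
  y_col  <- paste0("y_", suffix)         # matching y column name
  if (y_col %in% names(df)) {            # skip x columns without a partner
    df[[paste0("z_", suffix)]] <- df[[x_col]] + df[[y_col]]
  }
}
```

Selecting the columns by prefix first keeps unrelated survey columns out of the sums, which the suffix-only matching in the answers above does not guarantee on a wider data set.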

Vectorized solution for using previous value in a column under some condition

This is probably easy, but beyond a clunky for loop I haven't been able to find a vectorized solution for this.
df <- tibble(a=c(1,2,3,4,3,2,5,6,9), b=c(1,2,3,4,4,4,5,6,9))
Column a should be continuously increasing and should look like column b. So, whenever the next value in a is smaller than the previous value in a, the previous value should be used instead.
Thanks!
We can use lag and fill from tidyverse
library(tidyverse)
df %>%
mutate(b1 = replace(a, a < lag(a), NA)) %>%
fill(b1)
# a b b1
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 3 4 4
#6 2 4 4
#7 5 5 5
#8 6 6 6
#9 9 9 9
The logic: we replace each value in a that is smaller than the value just before it with NA, and then use fill to replace those NAs with the last non-NA value.
Using cummax() from base R:
df[["b1"]] <- cummax(df[["a"]])
> df
a b b1
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 3 4 4
6 2 4 4
7 5 5 5
8 6 6 6
9 9 9 9
Using more dplyr syntax:
df %>%
mutate(b1 = cummax(a))
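The two approaches agree on the example data, but they are not equivalent in general. The sketch below uses made-up values (not from the question) in which the series drops and then only partly recovers, and reproduces the lag-and-fill logic in base R so that it can be compared with cummax():

```r
a <- c(1, 4, 2, 3, 5)  # made-up data: drops to 2, then only partly recovers to 3

# lag-and-fill logic in base R: blank out values smaller than their predecessor
prev <- c(NA, head(a, -1))
b_lag <- a
b_lag[!is.na(prev) & a < prev] <- NA
# carry the last non-NA value forward (the first element is never NA here)
last_seen <- cummax(seq_along(b_lag) * !is.na(b_lag))
b_lag <- b_lag[last_seen]

# running maximum
b_max <- cummax(a)

b_lag  # 1 4 4 3 5
b_max  # 1 4 4 4 5
```

The lag-based version compares each value only with its immediate predecessor, so the 3 survives even though it is below the earlier 4. If the column has to be non-decreasing throughout, the running maximum is the safer choice.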

Split dataframe based on one column in r, with a non-fixed width column [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
I have a problem that is an extension of a well-covered issue here on SE. I.e:
Split a column of a data frame to multiple columns
My data has a column with a string format, comma-separated, but of no fixed length.
data = data.frame(id = c(1,2,3), treatments = c("1,2,3", "2,3", "8,9,1,2,4"))
So I would like to have my dataframe eventually be in the proper tidy/long form of:
id treatments
1 1
1 2
1 3
...
3 1
3 2
3 4
Something like separate or strsplit doesn't seem on its own to be the solution. separate fails with warnings that various columns have too many values (NB id 3 has more values than id 1).
Thanks
You can use tidyr::separate_rows:
library(tidyr)
separate_rows(data, treatments)
# id treatments
#1 1 1
#2 1 2
#3 1 3
#4 2 2
#5 2 3
#6 3 8
#7 3 9
#8 3 1
#9 3 2
#10 3 4
Using dplyr and tidyr packages:
data %>%
separate(treatments, paste0("v", 1:5)) %>%
gather(var, treatments, -id) %>%
na.exclude %>%
select(id, treatments) %>%
arrange(id)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
You can also use unnest:
library(tidyverse)
data %>%
  mutate(treatments = stringr::str_split(treatments, ",")) %>%
  unnest(cols = treatments)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
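For comparison, strsplit can do the job in base R once it is combined with rep(). This is a minimal sketch, assuming the treatments are always separated by a plain comma; the names parts and long are only illustrative:

```r
data <- data.frame(id = c(1, 2, 3),
                   treatments = c("1,2,3", "2,3", "8,9,1,2,4"),
                   stringsAsFactors = FALSE)

# one character vector of pieces per row
parts <- strsplit(data$treatments, ",", fixed = TRUE)

# repeat each id once per piece, then flatten the pieces into one column
long <- data.frame(id = rep(data$id, lengths(parts)),
                   treatments = as.numeric(unlist(parts)))
```

Here lengths(parts) gives the number of treatments per id, so rows with a different number of values need no special handling.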

How can I operate on elements of a data.frame in r, that creates a new column? [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 7 years ago.
Suppose I have a data.frame, df.
a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
I'd like to operate on it so that, for all rows that share the same combination of a and b, I compute the mean of d.
I found that using aggregate can do this,
aggregate(d ~ a + b, df, mean)
This gives me something reasonable
a b d
1 2 5
2 1 3
2 3 6
But I would ideally like to keep my original d column, and add a new column m, so that I get the original data.frame with a new column "m" that contains the averages like,
a b d m
1 2 4 5
1 2 5 5
1 2 6 5
2 1 5 3
2 3 6 6
2 1 1 3
Any ideas on how to do this "properly" in R?
library(dplyr)
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
" , header = TRUE)
# add the group mean of d as a new column while keeping every original row
df %>%
  group_by(a, b) %>%
  mutate(m = mean(d))
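A base R route to the same result is ave(), which applies a function within groups and returns a vector as long as its input. A minimal sketch on the question's data:

```r
df <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                 b = c(2, 2, 2, 1, 3, 1),
                 d = c(4, 5, 6, 5, 6, 1))

# mean of d within each (a, b) combination, aligned with the original rows
df$m <- ave(df$d, df$a, df$b, FUN = mean)

df$m  # 5 5 5 3 6 3
```

Unlike aggregate(), this does not collapse the data frame, so the original d column and the row order stay untouched.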

How to add index of a List item after melt() in R [duplicate]

This question already has answers here:
Create counter with multiple variables [duplicate]
(6 answers)
Closed 7 years ago.
I am working with a list as follows:
> library(reshape2)
> l <- list(c(2,4,9), c(4,2,6,1))
> m <- melt(l)
> m
value L1
2 1
4 1
9 1
4 2
2 2
6 2
1 2
I want to add an index i so that my resulting data frame m looks like this:
> m
i value L1
1 2 1
2 4 1
3 9 1
1 4 2
2 2 2
3 6 2
4 1 2
Here i indicates that 3 values belong to the first list element and 4 values belong to the second list element.
How can I achieve this? Can anyone help?
Just for completeness, some other options
data.table (which is basically what getanID is doing)
library(data.table)
setDT(m)[, i := seq_len(.N), L1]
dplyr
library(dplyr)
m %>%
group_by(L1) %>%
mutate(i = row_number())
Base R (from comments by @user20650)
transform(m, i = ave(L1, L1, FUN = seq_along))
You could use splitstackshape
library(splitstackshape)
getanID(m, 'L1')[]
# value L1 .id
#1: 2 1 1
#2: 4 1 2
#3: 9 1 3
#4: 4 2 1
#5: 2 2 2
#6: 6 2 3
#7: 1 2 4
Or using base R
transform(stack(setNames(l, seq_along(l))), .id= rapply(l, seq_along))
Less elegant than ave but does the work:
transform(m, i=unlist(lapply(rle(m$L1)$lengths, seq_len)))
# value L1 i
#1 2 1 1
#2 4 1 2
#3 9 1 3
#4 4 2 1
#5 2 2 2
#6 6 2 3
#7 1 2 4
Or
m$i <- sequence(rle(m$L1)$lengths)
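If the original list l is still available, the index can also be built from it directly, without melt(). This is a minimal sketch in base R; the column names value and L1 are chosen only to mirror the melt() output:

```r
l <- list(c(2, 4, 9), c(4, 2, 6, 1))

# flatten the list and record which element each value came from
m <- data.frame(value = unlist(l),
                L1 = rep(seq_along(l), lengths(l)))

# sequence() concatenates 1..n for each element length: 1 2 3 1 2 3 4
m$i <- sequence(lengths(l))
```

This relies on the rows of m being in list order. The rle()-based variants above make the same assumption, namely that equal L1 values are contiguous.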
