I am having trouble repeating the rows of my real data using dplyr. There is already another post here, repeat-rows-of-a-data-frame, but it has no dplyr solution.
I wonder what the dplyr solution would look like. I tried the code below, but it failed with this error:
Error: wrong result size (16), expected 4 or 1
library(dplyr)
df <- data.frame(column = letters[1:4])
df_rep <- df%>%
mutate(column=rep(column,each=4))
Expected output
>df_rep
column
#a
#a
#a
#a
#b
#b
#b
#b
#*
#*
#*
Using the uncount function will solve this problem as well. The column count indicates how often a row should be repeated.
library(tidyverse)
df <- tibble(letters = letters[1:4])
df
# A tibble: 4 x 1
letters
<chr>
1 a
2 b
3 c
4 d
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
uncount(count)
# A tibble: 11 x 1
letters
<chr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
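If every row should be repeated the same number of times, as in the original question, uncount() can also be given a constant instead of a count column. This is only a minimal sketch, assuming tidyr is installed; the name df_rep is taken from the question.

```r
library(tidyr)

df <- data.frame(column = letters[1:4])

# A constant weight repeats every row the same number of times,
# keeping the repeats of each row next to each other
df_rep <- uncount(df, 4)
df_rep$column
```

Because the weight is a constant rather than a column, no helper column has to be created or dropped afterwards.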
I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.
In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:
library(tidyverse)
df <- data.frame(letters = letters[1:4])
df
> df
letters
1 a
2 b
3 c
4 d
Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq_len(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <int>
1 a 1
2 a 2
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 d 1
9 d 2
10 d 3
11 d 4
If you don't want to keep the count column:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq_len(count)) %>%
select(letters)
# A tibble: 11 x 1
# Groups: letters [4]
letters
<fctr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
If you want the count to reflect the number of times each letter is repeated:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq_len(count)) %>%
mutate(count = max(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <dbl>
1 a 2
2 a 2
3 b 3
4 b 3
5 b 3
6 c 2
7 c 2
8 d 4
9 d 4
10 d 4
11 d 4
This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):
library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
# column
# 1 a
# 2 a
# 3 a
# 4 a
# 5 b
# 6 b
# 7 b
# 8 b
# 9 c
# 10 c
# 11 c
# 12 c
# 13 d
# 14 d
# 15 d
# 16 d
As @Frank suggested, a much better alternative could be
df %>% slice(rep(1:n(), each=4))
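Since slice() works on row positions, it should repeat whole rows, so other columns come along unchanged. A small sketch to illustrate this; the value column is hypothetical and only added to show that rows stay intact.

```r
library(dplyr)

# Hypothetical second column, added only to show that whole rows are repeated
df <- data.frame(column = letters[1:4], value = 1:4)

# Repeat each row index 4 times, then pick the rows at those positions
df_rep <- df %>% slice(rep(seq_len(n()), each = 4))
```

seq_len(n()) is used here instead of 1:n() so that an empty data frame gives an empty index rather than c(1, 0).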
I ran a quick benchmark to show that uncount() is a lot faster than expand():
# for the pipe
library(magrittr)
# create some test data
df_test <-
tibble::tibble(
letter = letters,
row_count = sample(1:10, size = 26, replace = TRUE)
)
# benchmark
bench <- microbenchmark::microbenchmark(
expand = df_test %>%
dplyr::group_by(letter) %>%
tidyr::expand(row_count = seq(1:row_count)),
uncount = df_test %>%
tidyr::uncount(row_count)
)
# plot the benchmark
ggplot2::autoplot(bench)
I have a dataframe in the following format, with IDs and types A/B. The dataframe is very long, with over 3000 IDs.
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B
... ...
I need to remove all rows (A and B) wherever two or more A's follow each other directly. I don't just want to remove the duplicates: if there is a run of two or more A's, I want to remove all of those A's and the B that follows, up to the next A.
id type
1 A
2 B
6 A
7 B
8 A
9 B
... ...
Do I need a loop for this problem? Any help is appreciated, thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
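The same idea can be expressed with run lengths in base R, which avoids hard-coding how many rows follow a duplicate. This is a sketch under the assumption that a run of two or more A's should be dropped together with the B run directly after it; the variable names are illustrative only.

```r
df <- data.frame(
  id = 1:13,
  type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)

# Run-length encode the type column (as.character guards against factors)
r <- rle(as.character(df$type))

# Runs of two or more consecutive "A"
long_a <- r$values == "A" & r$lengths > 1

# "B" runs that come directly after such a run
b_after_long_a <- c(FALSE, head(long_a, -1)) & r$values == "B"

# Expand the per-run decision back to one flag per row
keep <- rep(!(long_a | b_after_long_a), times = r$lengths)
result <- df[keep, ]
result
```

No explicit loop is needed, because rle() and rep() translate between rows and runs in both directions.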
I assumed there is only one B after a series of duplicated A values; if that is not the case, let me know and I will modify my code:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")
Take an example dataframe like so (the real dataframe has more columns):
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
I can use pivot_longer to collect my columns of interest (A and B) like so:
library(dplyr)
library(tidyr)
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
df
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
Now let's say I have another column C in my dataframe, making it no longer tidy
C <- seq(7, 12, 1)
df_2 <- data.frame(df, C)
df_2
Letter Number C
1 A 1 7
2 B 4 8
3 A 2 9
4 B 5 10
5 A 3 11
6 B 6 12
I want to use pivot_longer again to make df_2 tidy and get this output:
data.frame(Letter = c(rep("A", 3), rep("B", 3), rep("C", 3)),
Number = seq(1, 12, 1))
Letter Number
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Using the same strategy creates an error though:
df_2 %>%
pivot_longer(cols = "C", names_to = "Letter", values_to = "Number")
Error: Failed to create output due to bad names.
* Choose another strategy with `names_repair`
Setting names_repair to minimal runs but doesn't produce the output I want.
You can do it like this:
library(tidyverse)
df <- data.frame(A = seq(1, 3, 1),
B = seq(4, 6, 1))
df <- df %>%
pivot_longer(cols = c("A", "B"), names_to = "Letter", values_to = "Number")
C <- seq(7, 12, 1)
df_2 <- data.frame(C)
df_2 <- df_2 %>% pivot_longer(cols = C, names_to = "Letter", values_to = "Number")
df_result <- rbind(df, df_2)
Output
> df_result
# A tibble: 12 x 2
Letter Number
<chr> <dbl>
1 A 1
2 B 4
3 A 2
4 B 5
5 A 3
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
Maybe try this if it is helpful:
library(tidyverse)
#Code
df_2 %>% pivot_longer(everything()) %>%
arrange(name) %>% group_by(name) %>%
filter(!duplicated(value))
Output:
# A tibble: 12 x 2
# Groups: name [3]
name value
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
10 C 10
11 C 11
12 C 12
We could do this easily with stack
library(dplyr)
stack(df_2)[2:1] %>%
distinct() %>%
setNames(c("Letter", "Number")) # base R setNames(); set_names() is not exported by dplyr
Output
# Letter Number
#1 A 1
#2 A 2
#3 A 3
#4 B 4
#5 B 5
#6 B 6
#7 C 7
#8 C 8
#9 C 9
#10 C 10
#11 C 11
#12 C 12
Or an option with unnest/enframe
library(tidyr)
library(tibble)
unclass(df_2) %>%
enframe(name = "Letter", value = "Number") %>%
unnest(c(Number)) %>%
distinct
Or using melt
library(reshape2)
melt(df_2) %>%
distinct()
Or in a single line in base R
unique(stack(df_2)[2:1])
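Because C has a different length than A and B, another option is to skip the intermediate data frame and stack the original vectors. This is a minimal sketch, assuming the vectors A, B and C are still available separately; the name long is illustrative.

```r
A <- seq(1, 3, 1)
B <- seq(4, 6, 1)
C <- seq(7, 12, 1)

# stack() accepts a named list whose elements may differ in length
long <- stack(list(A = A, B = B, C = C))[2:1]
names(long) <- c("Letter", "Number")
long
```

Here Letter comes back as a factor; wrap it in as.character() if a character column is needed.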
I have a large tibble with one nested list column. Each element of the nested list column has 10,000 iterations and I would like to apply a cumulative sum across these iterations by a grouping variable.
I have created a minimal reproducible example below
tibble(a = list(c(1,2),c(3,4), c(5,6), c(7,8)),
c = c(1,1, 2, 2))
The intended output should be
tibble(a = list(c(1,2),c(4,6), c(5,6), c(12,14)),
c = c(1,1, 2, 2))
I tried the following syntax, but it is clearly wrong:
x <- tibble(a = list(c(1,2),c(3,4), c(5,6), c(7,8)),
c = c(1,1, 2, 2))
x %>%
group_by(c) %>%
mutate(a = map(a,cumsum))
Any help greatly appreciated. I can potentially spread the data and add across the columns but that would be slow
One base R option could be:
with(tbl, ave(a, c, FUN = function(x) Reduce(`+`, x, accumulate = TRUE)))
[[1]]
[1] 1 2
[[2]]
[1] 4 6
[[3]]
[1] 5 6
[[4]]
[1] 12 14
I think you're looking for the following though it doesn't match your desired output for the last two values (can you check that these are correct):
library(dplyr)
library(purrr)
library(tidyr)
df %>%
group_by(c) %>%
mutate(x = accumulate(a, `+`)) %>%
unnest(cols = c(a, x))
# A tibble: 8 x 3
# Groups: c [2]
a c x
<dbl> <dbl> <dbl>
1 1 1 1
2 2 1 2
3 3 1 4
4 4 1 6
5 5 2 5
6 6 2 6
7 7 2 12
8 8 2 14
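If the result should stay a list column, as in the intended output of the question, the unnest() step can be left out. A sketch assuming the input tibble from the question; x and res are just illustrative names.

```r
library(dplyr)
library(purrr)

x <- tibble(a = list(c(1, 2), c(3, 4), c(5, 6), c(7, 8)),
            c = c(1, 1, 2, 2))

# Within each group, add the vectors element-wise and keep every intermediate sum
res <- x %>%
  group_by(c) %>%
  mutate(a = accumulate(a, `+`)) %>%
  ungroup()

res$a
```

accumulate() returns one vector per row here, so the column stays a list with the same number of rows as the input.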
Input
(Say d is the data frame below.)
a b c
1 5 7
2 6 8
3 7 9
I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you.
I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).
Output
a b c
1 6 7
2 5 8
3 6 9
Use head instead
df$b <- c(6, head(df$b, -1))
# a b c
#1 1 6 7
#2 2 5 8
#3 3 6 9
You could also use lag in dplyr
library(dplyr)
df %>% mutate(b = lag(b, default = 6))
Or shift in data.table
library(data.table)
setDT(df)[, b:= shift(b, fill = 6)]
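The head() idea can be wrapped in a small base R helper for shifting by more than one position. shift_down is a hypothetical name used only for this sketch; it is not part of any package.

```r
# Hypothetical helper: shift x down by n positions and fill the top with `fill`
shift_down <- function(x, n = 1, fill = NA) {
  stopifnot(n >= 0, n <= length(x))
  if (n == 0) return(x)            # head(x, -0) would return nothing, so handle 0 separately
  c(rep(fill, n), head(x, -n))     # n fill values, then x without its last n elements
}

d <- data.frame(a = 1:3, b = 5:7, c = 7:9)
d$b <- shift_down(d$b, n = 1, fill = 6)
d
```

With the data from the question, column b becomes 6, 5, 6.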
A dplyr solution uses lag with an explicit default argument, if you prefer:
library(dplyr)
d <- tibble(a = 1:3, b = 5:7, c = 7:9)
d %>% mutate(b = lag(b, default = 6))
#> # A tibble: 3 x 3
#> a b c
#> <int> <dbl> <int>
#> 1 1 6 7
#> 2 2 5 8
#> 3 3 6 9
Created on 2019-12-05 by the reprex package (v0.3.0)
Here is a solution similar to the head approach by @Ronak Shah
df <- within(df, b <- c(runif(1), head(b, -1)))
where a uniform random number is placed in the first position of column b:
> df
a b c
1 1 0.6644704 7
2 2 5.0000000 8
3 3 6.0000000 9
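A more general approach that works for any lag or lead offset (change n, or use dplyr::lead):
library(dplyr)
d <- data.frame(a=c(1,2,3),b=c(5,6,7),c=c(7,8,9))
# no group_by(b) here: with one row per group, lag() would return only NA
d1 <- d %>% arrange(b) %>%
mutate(b1= dplyr::lag(b, n = 1, default = NA))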
I have a list of ten points with X and Y coordinates. I would like to calculate the distance between every pair of points, keeping each pair only once: only one of the distances 1-2 and 2-1 should be present. I have managed to remove the distance of each point to itself, but I could not get rid of the mirrored pairs.
library(dplyr)
# Data Generation
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)
# Temporary key Creation
df <- df %>% mutate(key = 1)
# Calculating pairwise distances
df %>% full_join(df, by = "key") %>%
mutate(dist = sqrt((X.x - X.y)^2 + (Y.x - Y.y)^2)) %>%
select(ID.x, ID.y, dist) %>% filter(!dist == 0) %>% head(11)
# Output
# ID.x ID.y dist
# 1 1 2 0.90858911
# 2 1 3 0.71154587
# 3 1 4 0.05687495
# 4 1 5 1.03885510
# 5 1 6 0.93747717
# 6 1 7 0.62070415
# 7 1 8 0.88351690
# 8 1 9 0.89651911
# 9 1 10 0.05079906
# 10 2 1 0.90858911
# 11 2 3 0.27530175
How to achieve the expected output shown below?
# Expected Output
# ID.x ID.y dist
# 1 1 2 0.90858911
# 2 1 3 0.71154587
# 3 1 4 0.05687495
# 4 1 5 1.03885510
# 5 1 6 0.93747717
# 6 1 7 0.62070415
# 7 1 8 0.88351690
# 8 1 9 0.89651911
# 9 1 10 0.05079906
# 10 2 3 0.27530175
# 11 2 4 0.5415415
This approach is also computationally slower than dist(). I would be happy to hear about faster approaches.
I would use dist on the data and then process the output into the required format. You can replace dist with any other distance function. Here I've used letters rather than numbers as ID to better show what is happening
library(tidyverse) # provides %>%, column_to_rownames(), gather(), filter(), as_tibble()
set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = letters[1:10])
df %>%
column_to_rownames("ID") %>% # make the ID the row names; dist will use these. NB: will not work on a tibble
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var = "ID.x") %>% #capture the row IDs
gather(key = ID.y, value = dist, -ID.x) %>%
filter(ID.x < ID.y) %>%
as_tibble()
# A tibble: 45 x 3
ID.x ID.y dist
<chr> <chr> <dbl>
1 a b 0.2623175
2 a c 0.7891034
3 b c 0.6856994
4 a d 0.2191960
5 b d 0.4757855
6 c d 0.8704269
7 a e 0.2730984
8 b e 0.3913770
9 c e 0.5912681
10 d e 0.2800021
# ... with 35 more rows
dist is very fast compared with looping through calculating distances.
The code can probably be made more efficient by working directly on the dist object rather than converting it into a matrix.
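One way to use the dist object without the matrix conversion is sketched below in base R. It relies on dist() storing the lower triangle column by column, which is the same order in which combn() lists the pairs (1,2), (1,3), ..., (2,3), and so on. The names idx and res are illustrative.

```r
set.seed(42)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = letters[1:10])

# Distances on the coordinate columns only
d <- dist(df[, c("X", "Y")])

# All index pairs i < j, in the same order as the values stored in d
idx <- combn(nrow(df), 2)

res <- data.frame(
  ID.x = df$ID[idx[1, ]],
  ID.y = df$ID[idx[2, ]],
  dist = as.vector(d)
)
head(res)
```

Each unordered pair appears exactly once, so no filtering of mirrored pairs or of the diagonal is needed.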
Perhaps this is a slightly simpler approach:
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)
df2 <- data.frame(ID1 = rep(1:10, each = 10),
ID2 = 1:10,
distance = as.vector(as.matrix((dist(df)))))
Then get rid of diagonal:
df2 <- df2[df2$ID1 != df2$ID2,]
Get rid of upper triangle:
df2 <- df2[df2$ID1 < df2$ID2,]
df2
ID1 ID2 distance
2 1 2 1.000615
3 1 3 2.057813
4 1 4 3.010261
5 1 5 4.039502
6 1 6 5.029982
7 1 7 6.035427
8 1 8 7.012540
9 1 9 8.006249
10 1 10 9.015352
13 2 3 1.099245
14 2 4 2.011664
...
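The distances in the output above are close to the difference between the two IDs, which suggests that dist(df) also included the ID column in the calculation. A minimal sketch, assuming only X and Y should enter the distance; d_all and d_xy are illustrative names.

```r
set.seed(1)
df <- data.frame(X = runif(10, 0, 1), Y = runif(10, 0, 1), ID = 1:10)

# dist() uses every column it is given, so ID contributes here
d_all <- as.matrix(dist(df))

# Restricting to the coordinate columns gives the geometric distances
d_xy <- as.matrix(dist(df[, c("X", "Y")]))

df2 <- data.frame(ID1 = rep(1:10, each = 10),
                  ID2 = 1:10,
                  distance = as.vector(d_xy))

# Keep each pair once; this also drops the diagonal
df2 <- df2[df2$ID1 < df2$ID2, ]
head(df2)
```

For points in the unit square, no distance can exceed sqrt(2), which makes a quick sanity check on the result.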