This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 2 years ago.
I have this dataframe in R:
V1
A
A
C
C
C
I want to add a variable like this:
V1 V2
A 1
A 2
C 1
C 2
C 3
Thanks!!
Another dplyr approach:
library(dplyr)
df %>% group_by(V1) %>% mutate(V2 = seq_along(V1))
Does this answer your question?
> df <- data.frame(V1 = c('A','A','C','C','C'))
> df
V1
1 A
2 A
3 C
4 C
5 C
> df %>% group_by(V1) %>% mutate(v2 = row_number())
# A tibble: 5 x 2
# Groups: V1 [2]
V1 v2
<fct> <int>
1 A 1
2 A 2
3 C 1
4 C 2
5 C 3
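For completeness, a base R equivalent of the same idea (my addition, not from the original answers), using ave() to number rows within each V1 group:
df$V2 <- ave(seq_along(df$V1), df$V1, FUN = seq_along)  # per-group row counter
df
#   V1 V2
# 1  A  1
# 2  A  2
# 3  C  1
# 4  C  2
# 5  C  3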
I have a specific filtering question. Here is what my sample dataset looks like:
df <- data.frame(id = c(1,2,3,3,4,5),
                 cat = c("A","A","A","B","B","B"))
> df
id cat
1 1 A
2 2 A
3 3 A
4 3 B
5 4 B
6 5 B
Grouping by id, when an id has more than one cat category, I would like to keep only cat A. So the desired output would be:
> df.1
id cat
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Any ideas?
Thanks!
If there are only two groups in cat, we can use the following logic:
df %>%
  group_by(id) %>%
  filter(!(n() == 2 & cat == "B"))
# A tibble: 5 x 2
# Groups: id [5]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
When there are multiple other letters possible in cat:
df <- data.frame(id = c(1,2,3,3,4,5,6,6,6,7),
cat= c("A","A","A","B","B","B", "A", "B", "C","D"))
df %>%
  group_by(id) %>%
  filter(!(n() >= 2 & cat %in% LETTERS[2:26]))
# A tibble: 7 x 2
# Groups: id [7]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
6 6 A
7 7 D
Explanation: n() gives the current group size. When an id has more than one row, we drop every row whose cat is not "A".
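If you want to inspect the group sizes before filtering, a quick sketch (my addition, not part of the original answer) using add_count():
df %>% add_count(id)   # adds a column n with the number of rows per id
#   id cat n
# 1  1   A 1
# 2  2   A 1
# 3  3   A 2
# 4  3   B 2
# 5  4   B 1
# 6  5   B 1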
In this example you can take the first item from each group. In other situations you may need to arrange the rows first; see the sketch after the code below.
(using dplyr)
df %>% group_by(id) %>% summarise(cat = first(cat))
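For instance, if the rows were not already ordered with "A" first within each id, something like this should work (a sketch, assuming "A" sorts before the other categories):
df %>%
  group_by(id) %>%
  arrange(cat, .by_group = TRUE) %>%   # sort within each id so "A" comes first
  summarise(cat = first(cat))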
Base R:
aggregate(
  df$cat,
  by = list(id = df$id),
  FUN = \(x) {
    unx <- unique(x)
    if (length(unx) > 1) 'A' else unx
  }
)
# id x
# 1 1 A
# 2 2 A
# 3 3 A
# 4 4 B
# 5 5 B
One approach with dplyr: after grouping by id, keep rows where the id has only one row or where cat is "A".
library(dplyr)
df %>%
  group_by(id) %>%
  filter(n() == 1 | cat == "A")
Output
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Also, if it is possible to have the same cat repeated within a single id, you can filter where the number of distinct cat is 1 (or keep if cat is "A"):
df %>%
  group_by(id) %>%
  filter(n_distinct(cat) == 1 | cat == "A")
Using base R
subset(df, cat == 'A' | id %in% names(which(table(id) == 1)))
id cat
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
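To see what the inner expression produces (my breakdown, not part of the original answer):
table(df$id)                      # how many rows each id has
# 1 2 3 4 5 
# 1 1 2 1 1 
names(which(table(df$id) == 1))   # ids that occur only once
# [1] "1" "2" "4" "5"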
This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 1 year ago.
Assuming the following list:
a <- data.frame(id = 1:3, x = 1:3)
b <- data.frame(id = 3:1, y = 4:6)
my_list <- list(a, b)
my_list
# [[1]]
# id x
# 1 1 1
# 2 2 2
# 3 3 3
# [[2]]
# id y
# 1 3 4
# 2 2 5
# 3 1 6
I now want to column bind the list elements into a data frame/tibble while matching the respective rows based on the id variable, i.e. the outcome should be:
# A tibble: 3 x 3
id x y
<int> <int> <int>
1 1 1 6
2 2 2 5
3 3 3 4
I know how to do it with some pivoting, but I'm wondering if there's a smarter way of doing it. I was hoping there is some binding function that simply allows for specifying an id column.
Current approach:
library(tidyverse)
my_list %>%
  bind_rows() %>%
  pivot_longer(cols = -id) %>%
  filter(!is.na(value)) %>%
  pivot_wider()
reduce(my_list, full_join, by='id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
If it's only 2 data frames:
invoke(full_join, my_list, by='id')
id x y
1 1 1 6
2 2 2 5
3 3 3 4
If you are using base R, any of the following should work:
Reduce(merge, my_list)
do.call(merge, my_list)
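If you want to be explicit about the join column, or need to merge more than two data frames, a base R sketch:
Reduce(function(x, y) merge(x, y, by = "id"), my_list)
#   id x y
# 1  1 1 6
# 2  2 2 5
# 3  3 3 4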
I am trying to "highlight" duplicates in my dataframe. I found various tutorials on dropping duplicates or on creating a new dataset containing only duplicates. But since I suspect something went wrong in earlier stages of my data work, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <- data.frame(a,b,c)   # what I would like to end up with
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <- data.frame(a,b)     # what I currently have
library(dplyr)
df %>%
  group_by(a, b) %>%   # for each combination of a and b
  mutate(c = n()) %>%  # count times they appear
  ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
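If you prefer a TRUE/FALSE flag instead of a count, a small variation (my sketch, not part of the original answer):
df %>%
  group_by(a, b) %>%
  mutate(dup = n() > 1) %>%   # TRUE when the a/b combination occurs more than once
  ungroup()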
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I am having trouble repeating rows of my real data using dplyr. There is already another post here, repeat-rows-of-a-data-frame, but no solution for dplyr.
Here I just wonder what the solution would be with dplyr. I tried:
library(dplyr)
df <- data.frame(column = letters[1:4])
df_rep <- df %>%
  mutate(column = rep(column, each = 4))
but it failed with this error:
Error: wrong result size (16), expected 4 or 1
Expected output
> df_rep
column
#a
#a
#a
#a
#b
#b
#b
#b
#*
#*
#*
Using the uncount() function from tidyr will solve this problem as well. The count column indicates how often each row should be repeated.
library(tidyverse)
df <- tibble(letters = letters[1:4])
df
# A tibble: 4 x 1
letters
<chr>
1 a
2 b
3 c
4 d
df %>%
  mutate(count = c(2, 3, 2, 4)) %>%
  uncount(count)
# A tibble: 11 x 1
letters
<chr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
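For the original question (every row repeated four times), uncount() should also accept a constant weight, e.g. (a quick sketch):
df %>% uncount(4)   # 16 rows: each of a, b, c, d repeated four times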
I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.
In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:
library(tidyverse)
df <- data.frame(letters = letters[1:4])
df
> df
letters
1 a
2 b
3 c
4 d
Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:
df %>%
  mutate(count = c(2, 3, 2, 4)) %>%
  group_by(letters) %>%
  expand(count = seq(1:count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <int>
1 a 1
2 a 2
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 d 1
9 d 2
10 d 3
11 d 4
If you don't want to keep the count column:
df %>%
  mutate(count = c(2, 3, 2, 4)) %>%
  group_by(letters) %>%
  expand(count = seq(1:count)) %>%
  select(letters)
# A tibble: 11 x 1
# Groups: letters [4]
letters
<fctr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
If you want the count to reflect the number of times each letter is repeated:
df %>%
  mutate(count = c(2, 3, 2, 4)) %>%
  group_by(letters) %>%
  expand(count = seq(1:count)) %>%
  mutate(count = max(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <dbl>
1 a 2
2 a 2
3 b 3
4 b 3
5 b 3
6 c 2
7 c 2
8 d 4
9 d 4
10 d 4
11 d 4
This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):
library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
  do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
# column
# 1 a
# 2 a
# 3 a
# 4 a
# 5 b
# 6 b
# 7 b
# 8 b
# 9 c
# 10 c
# 11 c
# 12 c
# 13 d
# 14 d
# 15 d
# 16 d
As #Frank suggested, a much better alternative could be:
df %>% slice(rep(1:n(), each=4))
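The base R equivalent of that slice() call, for reference (my sketch):
df[rep(seq_len(nrow(df)), each = 4), , drop = FALSE]   # drop = FALSE keeps the one-column data frame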
I did a quick benchmark to show that uncount() is a lot faster than expand():
# for the pipe
library(magrittr)
# create some test data
df_test <-
  tibble::tibble(
    letter = letters,
    row_count = sample(1:10, size = 26, replace = TRUE)
  )
# benchmark
bench <- microbenchmark::microbenchmark(
  expand = df_test %>%
    dplyr::group_by(letter) %>%
    tidyr::expand(row_count = seq(1:row_count)),
  uncount = df_test %>%
    tidyr::uncount(row_count)
)
# plot the benchmark
ggplot2::autoplot(bench)
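To see the timings as numbers rather than a plot, printing the benchmark object (or its summary) works as well:
print(bench)
summary(bench)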
Define:
df1 <- data.frame(
  id = c(rep(1,3), rep(2,3)),
  v1 = as.character(c("a","b","b", rep("c",3)))
)
such that:
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id, such that:
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply and a custom function to pick out the most frequent value:
library(plyr)

myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}

ddply(df1, .(id), .fun = myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
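A dplyr sketch of the same idea (my addition, not from the original answer), which also keeps the first value on ties because it relies on which.max():
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(freq = names(which.max(table(v1)))) %>%   # most frequent v1 per id
  ungroup()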
Another way is to use tidyverse functions:
grouping by id and v1 with group_by(), and counting the occurrences with tally()
arranging by the number of occurrences with arrange()
summarizing and picking out the first row with summarize() and first()
Therefore:
df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
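For example (a sketch reusing the summary from above; freq_map is just an illustrative name):
freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

df1 %>% left_join(freq_map, by = "id")
#   id v1 freq
# 1  1  a    b
# 2  1  b    b
# 3  1  b    b
# 4  2  c    c
# 5  2  c    c
# 6  2  c    c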
mode <- function(x) names(table(x))[ which.max(table(x)) ]
df1$freq <- ave(df1$v1, df1$id, FUN=mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c