I am trying to "highlight" duplicates in my dataframe. I found various tutorials on dropping duplicates or creating a new dataset containing only duplicates. But since I expect something went wrong in earlier stages of my datawork, I would (for now) just like to see which observations appear to be duplicates in order to understand what went wrong. I would like R to create column c
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <-data.frame(a,b)
library(dplyr)
df %>%
  group_by(a, b) %>%   # for each combination of a and b
  mutate(c = n()) %>%  # count how many times it appears
  ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
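As a variation (a sketch of my own, using the same df), you can keep just the rows whose combination occurs more than once, to eyeball only the problem cases:

df %>%
  group_by(a, b) %>%
  mutate(c = n()) %>%
  ungroup() %>%
  filter(c > 1)   # keep only rows whose a/b combination is duplicated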
I have a data frame like this:

Team  GF
A     3
B     5
A     2
A     3
B     1
B     6
Looking for output like this (just an additional column):

Team  x  avg(x)
A     3  0
B     5  0
A     2  3
A     3  2.5
B     1  5
B     6  3
avg(x) is the average of all previous instances of x (the GF column above) where Team is the same. I have the following R code, which gets the overall average, but I'm looking for the "step-wise" average.
new_df <- df %>% group_by(Team) %>% summarise(avg_x = mean(x))
Is there a way to vectorize this while only evaluating the previous rows on each "iteration"?
You want the cummean() function from dplyr, combined with lag() (note that replace_na() comes from tidyr):
df %>% group_by(Team) %>% mutate(avg_x = replace_na(lag(cummean(x)), 0))
Producing the following:
# A tibble: 6 × 3
# Groups: Team [2]
Team x avg_x
<chr> <dbl> <dbl>
1 A 3 0
2 B 5 0
3 A 2 3
4 A 3 2.5
5 B 1 5
6 B 6 3
As required.
Edit 1:
As #Ritchie Sacramento pointed out, the following is cleaner and clearer:
df %>% group_by(Team) %>% mutate(avg_x = lag(cummean(x), default = 0))
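To see why this works, here is the computation spelled out for Team A's values (a small illustration of my own):

x <- c(3, 2, 3)               # Team A's x values, in row order
cummean(x)                    # 3.0 2.5 2.67: running mean including the current row
lag(cummean(x), default = 0)  # 0.0 3.0 2.5 : shifted one row, so only previous rows count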
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum and I am not able to get the desired results. Please guide.
The logic: whenever the value in column a changes, the output counter resets and starts again from 1. Within each value of a, rows with the same value of b share the same number, and each change in b increments the number by 1. For example, in the 5th record column b is 3, so the output resets to 1; rows 5 through 8 all have b equal to 3, so they are all 1, and each later change in b (rows 9, 10 and 11) increments the output by 1.
We can group by column 'a' and then create the new column by matching 'b' against its unique values:
library(dplyr)
df2 <- df %>%
group_by(a) %>%
mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
Or another option is to coerce a factor variable to integer
df %>%
group_by(a) %>%
mutate(out = as.integer(factor(b)))
data
df <- data.frame(a, b, d)
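Since the question mentions cumsum, here is a sketch of how a cumsum-based version could look (my own variant; it assumes, as in the example data, that equal values of b sit in consecutive rows within each a):

df %>%
  group_by(a) %>%
  mutate(out = cumsum(b != lag(b, default = first(b))) + 1)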
Say I have this data.frame :
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names:
df1 %>%
count(x,y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I group by everything without mentioning individual column names, in the most compact/readable way?
We can pass the input itself to the ... argument and splice it with !!!:
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
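If the dot-splicing feels like too much magic, an equivalent spelling (my own restatement of the same trick) builds symbols from the column names explicitly:

df1 %>% count(!!!rlang::syms(names(df1)))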
With base R we could do: aggregate(setNames(df1[1], "n"), df1, length)
For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
library(data.table)
setDT(df1)
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3
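As a side note (not part of the original answer), keyby additionally sorts the result by the grouping columns:

df1[, .N, keyby = names(df1)]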
Have you considered the (now superseded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.
I have a large dataset with about 15 columns and more than 3 million rows.
Because the dataset is so big, I would like to use multidplyr on it.
Because of the data, it is impossible to just split my data frame into 12 parts. Let's say there are columns col1 and col2, each of which has several different values that repeat (in each column separately).
How can I make 12 (or n) similarly sized groups, each of which contains rows that have the same value in both col1 and col2?
Example: let's say one of the possible values in col1 is foo and one in col2 is bar. Then all rows with these values would end up in one group.
So that the question makes sense: there are always more than 12 unique combinations of col1 and col2.
If this were Python I would try to do something with for and while loops, but as this is R, there is probably a better way.
Try this:
# As you provided no example data, I created some, with each group repeated
# three times. I used dplyr from the tidyverse, grouped by the columns, and
# randomly sampled n = 2 rows per group.
library(tidyverse)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3))
# the data:
df %>%
arrange(a,b) %>%
group_by(a,b) %>%
mutate(n=1:n())
# A tibble: 78 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 A a 3
4 B b 1
5 B b 2
6 B b 3
7 C c 1
8 C c 2
9 C c 3
10 D d 1
# ... with 68 more rows
Randomly sampling two rows per group:
set.seed(123)
df %>%
arrange(a,b) %>%
group_by(a,b) %>%
mutate(n=1:n()) %>%
sample_n(2)
# A tibble: 52 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 B b 2
4 B b 3
5 C c 3
6 C c 1
7 D d 2
8 D d 3
9 E e 2
10 E e 1
# ... with 42 more rows
# Create sample data
library(dplyr)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3),
nobs=sample(1:100, 26*3,replace=T), stringsAsFactors=F)
# Get all unique combinations of col1 and col2
combos <- df %>%
group_by(a,b) %>%
summarize(n=sum(nobs)) %>%
as.data.frame(.)
# Take the 11 most common combinations; taking 12 here would silently drop
# the 12th-ranked pair, because the loop below only fills 11 list slots
top11 <- combos %>%
  arrange(desc(n)) %>%
  top_n(11, n)
top11
l <- list()
for(i in 1:11){
  l[[i]] <- combos[combos$a==top11[i,"a"] & combos$b==top11[i,"b"],]
}
l[[12]] <- combos %>%
  anti_join(top11, by=c("a","b"))
l
# This produces a list 'l' that contains twelve data frames -- the top 11 most commonly occurring pairs of col1 and col2, and all the remaining combinations in the 12th list element.
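Neither answer balances the group sizes, though. Here is a sketch of one way to get n similarly sized groups (my own assumption of an approach, with a and b standing in for col1 and col2): assign each combination, largest first, to whichever group currently holds the fewest rows.

library(dplyr)

n_groups <- 12
combo_sizes <- df %>%
  count(a, b, name = "size") %>%   # one row per combination
  arrange(desc(size))              # largest combinations first

bin_totals <- numeric(n_groups)    # rows assigned to each group so far
combo_sizes$group <- NA_integer_
for (i in seq_len(nrow(combo_sizes))) {
  g <- which.min(bin_totals)       # the emptiest group
  combo_sizes$group[i] <- g
  bin_totals[g] <- bin_totals[g] + combo_sizes$size[i]
}

# attach the group id back to the full data
df_binned <- df %>%
  left_join(select(combo_sizes, a, b, group), by = c("a", "b"))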
Define:
df1 <- data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply (from the plyr package) and a custom function to pick out the most frequent value:
library(plyr)
myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
Note that which.max will return the first occurrence of the maximum value in the case of ties. See which.is.max in the nnet package for an option that breaks ties at random.
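For illustration (my own snippet, not part of the original answer), the difference only shows up when two values are tied:

library(nnet)
tbl <- table(c("a", "a", "b", "b"))  # "a" and "b" are tied
names(tbl)[which.max(tbl)]           # always "a": the first maximum
names(tbl)[which.is.max(tbl)]        # "a" or "b", chosen at random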
Another way consists of using tidyverse functions:
- grouping first with group_by(), and counting the occurrences of the second variable with tally()
- arranging by the number of occurrences with arrange()
- summarizing and picking out the first row with summarize() and first()
Therefore:
df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
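A minimal sketch of that last step (my own addition, assuming the summary above is saved as freq_map):

freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

df2 <- df1 %>% left_join(freq_map, by = "id")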
# note: this local mode() masks base R's mode(), which reports the storage mode
mode <- function(x) names(table(x))[which.max(table(x))]
df1$freq <- ave(df1$v1, df1$id, FUN = mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c