I have two columns in a data frame, advertisementID and Payout. Many advertisementIDs have more than one Payout value, but I need to find the advertisementIDs that have only one unique Payout value. How can I do this in R?
Example:
advertisementID Payout
1 10
2 3
1 10
2 4
3 5
3 4
So the output should be like this:
advertisementID Payout
1 10
because advertisementID 1 has only one unique Payout value, which is 10.
Using base R:
new <- aggregate(Payout ~ advertisementID, dt, unique)
new[lengths(new$Payout)==1, ]
output:
advertisementID Payout
1 1 10
Or in a cleaner way with magrittr:
library(magrittr)
aggregate(Payout ~ advertisementID, dt, unique) %>% subset(lengths(Payout)==1)
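Another base R sketch, in case the list column produced by aggregate() is inconvenient. It assumes Payout is numeric and counts the distinct values per advertisementID with ave(); the names n_unique and res are only illustrative.

```r
# Sample data from the question
dt <- read.table(text = "advertisementID Payout
1 10
2 3
1 10
2 4
3 5
3 4", header = TRUE)

# Number of distinct Payout values within each advertisementID, repeated per row
n_unique <- ave(dt$Payout, dt$advertisementID,
                FUN = function(x) length(unique(x)))

# Keep rows of IDs with exactly one distinct Payout, then drop duplicate rows
res <- unique(dt[n_unique == 1, ])
res
```

Because ave() returns one value per row, the result can be used directly as a row filter without merging anything back.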
A solution using dplyr:
library(dplyr)
dt2 <- dt %>%
group_by(advertisementID) %>%
filter(n_distinct(Payout) == 1) %>%
distinct(advertisementID, Payout) %>%
ungroup()
dt2
# A tibble: 1 x 2
advertisementID Payout
<int> <int>
1 1 10
DATA
dt <- read.table(text = "advertisementID Payout
1 10
2 3
1 10
2 4
3 5
3 4",
header = TRUE)
Related
I have data with a grouping variable ("from") and values ("number"):
from number
1 1
1 1
2 1
2 2
3 2
3 2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
2 1
2 2
There are several possibilities; here's my favorite:
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
# from number
# 1: 2 1
# 2: 2 2
Basically, for each group we check whether there is any variance; if there is, we return that group's rows.
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
# from number
# 3 2 1
# 4 2 2
Edit: for non-numeric data you could try the new uniqueN function in the devel version of data.table (or use length(unique(number)) > 1 instead):
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
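For readers without data.table, the length(unique(number)) > 1 idea can be sketched in base R as well. The character data below is made up for illustration (it is not from the original post), and the names df_chr, n_per_group and keep are hypothetical.

```r
# Hypothetical example data with character values
df_chr <- data.frame(from = c(1, 1, 2, 2, 3, 3),
                     number = c("a", "a", "a", "b", "b", "b"),
                     stringsAsFactors = FALSE)

# Count distinct values per group; tapply() returns a vector named by group
n_per_group <- tapply(df_chr$number, df_chr$from,
                      function(x) length(unique(x)))

# Keep only the groups with more than one distinct value
keep <- names(n_per_group)[n_per_group > 1]
res_chr <- df_chr[df_chr$from %in% keep, ]
res_chr
```

Unlike the var() approach, this does not require the values to be numeric.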
You could try
library(dplyr)
df1 %>%
group_by(from) %>%
filter(n_distinct(number)>1)
# from number
#1 2 1
#2 2 2
Or using base R
indx <- rowSums(!!table(df1))>1
subset(df1, from %in% names(indx)[indx])
# from number
#3 2 1
#4 2 2
Or, as long as no group that should be kept contains a repeated value:
df1[with(df1, !ave(number, from, FUN=anyDuplicated)),]
# from number
#3 2 1
#4 2 2
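To see how the anyDuplicated() variant differs from counting distinct values, here is a small sketch on made-up data where one group has both a repeated value and two distinct values. The data frame df_dup and the result names are hypothetical.

```r
# Hypothetical data: group 1 has a repeated value AND two distinct values
df_dup <- data.frame(from = c(1, 1, 1, 2, 2),
                     number = c(1, 1, 2, 3, 3))

# anyDuplicated() is non-zero for group 1 because the value 1 repeats,
# so group 1 is dropped along with group 2
by_dup <- df_dup[with(df_dup, !ave(number, from, FUN = anyDuplicated)), ]

# Counting distinct values keeps group 1, which has two distinct values
by_count <- df_dup[with(df_dup,
                        ave(number, from,
                            FUN = function(x) length(unique(x))) > 1), ]

nrow(by_dup)
nrow(by_count)
```

On the question's data both give the same rows, because there the only group with more than one distinct value has no repeats.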
Using the variance idea shared by David, but done the dplyr way:
library(dplyr)
df %>%
group_by(from) %>%
mutate(variance=var(number)) %>%
filter(variance!=0) %>%
select(from,number)
#Source: local data frame [2 x 2]
#Groups: from
#from number
#1 2 1
#2 2 2
I would like to add a column that counts consecutive values. Most of what I am seeing on here is how to count duplicate values (1,1,1,1,1), but I would like to count runs where the number goes up by 1 (5,6,7,8,9). The ID column is what I have, and the Counter column is what I would like to create. Thanks!
ID Counter
5 1
6 2
7 3
8 4
10 1
11 2
13 1
14 2
15 3
16 4
A solution using the dplyr package. The idea is to calculate the difference between consecutive numbers to create a grouping column, and then assign a counter within each group.
library(dplyr)
dat2 <- dat %>%
mutate(Diff = ID - lag(ID, default = 0),
Group = cumsum(Diff != 1)) %>%
group_by(Group) %>%
mutate(Counter = row_number()) %>%
ungroup() %>%
select(-Diff, -Group)
dat2
# # A tibble: 10 x 2
# ID Counter
# <int> <int>
# 1 5 1
# 2 6 2
# 3 7 3
# 4 8 4
# 5 10 1
# 6 11 2
# 7 13 1
# 8 14 2
# 9 15 3
# 10 16 4
DATA
dat <- read.table(text = "ID
5
6
7
8
10
11
13
14
15
16",
header = TRUE, stringsAsFactors = FALSE)
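The same difference-then-group idea can be sketched without dplyr. This is only an illustration in base R; ID is typed in as a plain vector here, and run_id is a hypothetical helper name.

```r
ID <- c(5, 6, 7, 8, 10, 11, 13, 14, 15, 16)

# A new run starts at the first element and wherever the step is not exactly 1
run_id <- cumsum(c(TRUE, diff(ID) != 1))

# Number the elements within each run
Counter <- ave(ID, run_id, FUN = seq_along)
Counter
```

Here cumsum() over the "new run" flags plays the role of the Group column in the dplyr version.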
A loop version is simple:
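counter <- rep(1, length(ID))  # the first element always starts a run
d <- diff(ID)                  # compute the differences once, not per iteration
for (i in seq_along(ID)[-1]) {
  if (d[i - 1] == 1) {
    counter[i] <- counter[i - 1] + 1  # continue the current run
  } else {
    counter[i] <- 1                   # start a new run
  }
}
But this loop will perform very badly for n > 10^4! I'll try to think of a vectorized solution!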
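You can use shift() from data.table:
library(data.table)
s <- df$ID - shift(df$ID)  # difference to the previous ID; NA for the first row
s[is.na(s)] <- 1
ave(s, cumsum(s != 1), FUN = seq_along)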
[1] 1 2 3 4 1 2 1 2 3 4
This one makes use solely of highly efficient vector-arithmetic. Idea goes as follows:
1. Take the cumulative sum of the differences of ID.
2. Wherever a jump is bigger than one, subtract the cumulative value at that jump from it and from all following values.
cum <- c(0, cumsum(diff(ID))) # cumulative difference of ID relative to the first element
ccm <- cum * c(1, (diff(ID) > 1)) # keep the value only where the jump is > 1, zero elsewhere
# subtract the value at the latest jump > 1 from all following numbers (see link for reference)
# note: rep(0, n) is needed because ccm[...] starts at the first non-zero value
counter <- cum - c(rep(0, which(diff(ID) != 1)[1]),
ccm[which(ccm != 0)][cumsum(ccm != 0)]) + 1
Notes:
Reference for the highly efficient fill function by nacnudus: Fill in data frame with values from rows above
Restriction: ID must be monotonically increasing and contain at least one jump (otherwise which(...)[1] is NA and rep() fails)
That should deal with millions of rows efficiently!
Another solution:
breaks <- c(which(diff(ID)!=1), length(ID))
x <- c(breaks[1], diff(breaks))
unlist(sapply(x, seq_len))
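As a small variation on the run-length idea above, base R's sequence() concatenates seq_len() over a vector of lengths, so it can stand in for unlist(sapply(x, seq_len)). This is just a sketch, with ID typed in as a plain vector.

```r
ID <- c(5, 6, 7, 8, 10, 11, 13, 14, 15, 16)

# Last position of each run of consecutive values
breaks <- c(which(diff(ID) != 1), length(ID))
# Length of each run
x <- c(breaks[1], diff(breaks))
# sequence() builds 1..n for every run length and concatenates the results
counter <- sequence(x)
counter
```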
I have a dataframe df with a column called ID.
Multiple rows may have the same ID and I want to set a column value "occurrence" to indicate how many times the ID has been seen before.
for (i in unique(df$ID)) {
rows = df[df$ID==i, ]
for (idx in 1:nrow(rows)) {
rows[idx,'occurrence'] = idx
}
}
Unfortunately, this adds the occurrence column to rows, but it does not update the original data frame. How do I get the occurrence column added to df?
Update: The row_number() function pointed out by neilfws works great. Actually, I have a follow-up question: the data frame also has a Year column, and what I need to do is add a new column (say Prev.Year.For.This.ID) holding the year of the previous occurrence of the ID. E.g., if the input is
Year = c(1991,1992,1993,1994,1995)
ID = c(1,2,1,2,1)
df <- data.frame (Year, ID)
I'd like the output to look like this:
ID Year occurrence Prev.Year.For.This.Id
1 1991 1 <NA>
2 1992 1 <NA>
1 1993 2 1991
2 1994 2 1992
1 1995 3 1993
You can use dplyr to group_by ID; row_number() then gives the running count of occurrences within each ID.
library(dplyr)
df1 <- data.frame(ID = c(1,2,3,1,4,5,6,2,7,8,2))
df1 %>%
group_by(ID) %>%
mutate(cnt = row_number()) %>%
ungroup()
ID cnt
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 1 2
5 4 1
6 5 1
7 6 1
8 2 2
9 7 1
10 8 1
11 2 3
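For the follow-up about the year of the previous occurrence, one possible sketch extends the same grouped mutate with dplyr's lag(). It assumes the rows are already ordered by Year, and it uses the years shown in the question's expected output; the name res is only illustrative.

```r
library(dplyr)

# Years follow the expected output shown in the question
df <- data.frame(Year = c(1991, 1992, 1993, 1994, 1995),
                 ID   = c(1, 2, 1, 2, 1))

res <- df %>%
  group_by(ID) %>%
  mutate(occurrence = row_number(),
         Prev.Year.For.This.ID = lag(Year)) %>%  # NA for the first occurrence
  ungroup()
res
```

If the rows are not sorted, adding arrange(Year) before group_by() would be needed for lag() to pick the right year.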
Are you after something like the following (I made up sample data for you):
library(dplyr)
df = data.frame(ID = c(1,1,1,2,2,3))
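# row_number() - 1 counts earlier rows in the group and, unlike cumsum(ID / ID) - 1, also works when ID is 0
answer = df %>% group_by(ID) %>% mutate(occurrence = row_number() - 1) %>% as.data.frame()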
This will give something which looks like this:
ID occurrence
1 0
1 1
1 2
2 0
2 1
3 0
The dplyr package is a great tool for grouping and summarising data. I also find the code very readable when I use the pipe %>% (though, admittedly, it does take some getting used to).
> library(data.table)
> df = data.frame(ID = c(1,1,1,2,2,3))
> df <- data.table(df)
> df[, occurrence := sequence(.N), by = c("ID")]
> df
ID occurrence
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 3 1
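If no extra package is wanted, the occurrence column can also be sketched in base R with ave(), which applies a function within each ID and writes the result back in the original row order. The sample data is the same made-up data as above.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3))

# seq_along() within each ID gives 1, 2, 3, ... per group;
# assigning to df$occurrence updates the original data frame
df$occurrence <- ave(seq_along(df$ID), df$ID, FUN = seq_along)
df
```

This also addresses the original problem with the loop: the result is assigned to df itself rather than to a temporary copy of its rows.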