How to subset data based on a combination of criteria in R

I have several million rows of data and I need to create a subset. I've had no success despite trying hard and searching all over the web. The question is:
How do I create a subset containing only the smallest value of 'value' for each ID and item combination?
The data structure looks like this:
> df = data.frame(ID = c(1,1,1,1,2,2,2,2),
                  item = c('A','A','B','B','A','A','B','B'),
                  value = c(10,5,3,2,7,8,9,10))
> df
  ID item value
1  1    A    10
2  1    A     5
3  1    B     3
4  1    B     2
5  2    A     7
6  2    A     8
7  2    B     9
8  2    B    10
The result should look like this:
ID item value
 1    A     5
 1    B     2
 2    A     7
 2    B     9
Any hints greatly appreciated. Thank you!

We can use aggregate from base R with grouping variables 'ID' and 'item' to get the min of 'value':
aggregate(value~., df, min)
#  ID item value
#1  1    A     5
#2  2    A     7
#3  1    B     2
#4  2    B     9
Or using dplyr
library(dplyr)
df %>%
  group_by(ID, item) %>%
  summarise(value = min(value))
Or with data.table
library(data.table)
setDT(df)[, .(value = min(value)), .(ID, item)]
Or another option would be to order by 'value' and take the first row of each group:
setDT(df)[order(value), head(.SD, 1), .(ID, item)]
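For completeness: newer dplyr versions (1.0.0 or later; an assumption about your setup) express the same "smallest per group" idea directly with slice_min. A minimal sketch:
library(dplyr)

df %>%
  group_by(ID, item) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%  # keep one row with the smallest 'value' per group
  ungroup()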

Related

Subsetting datatable based on consecutive difference of dates

Preamble:
The main problem is how to subset a data.table based on IDs, forming subsets within an ID based on consecutive time differences. A hint regarding this would be most welcome.
The complete question/setup:
I have a dataset dt in data.table format that looks like
        date id val1 val2
    %d.%m.%Y
1 01.01.2000  1    5   10
2 09.01.2000  1    4    9
3 01.08.2000  1    3    8
4 01.01.2000  2    2    7
5 01.01.2000  3    1    6
6 14.01.2000  3    7    5
7 28.01.2000  3    8    4
8 01.06.2000  3    9    3
I want to combine observations (grouped by id) which are not more than two weeks apart, going consecutively from observation to observation. By combining I mean that for each such subset, I
keep the value of the last observation of val1
replace val2 of the last observation with the sum of all values of val2 in the group
add a counter for how many observations were combined into the group.
I.e., I want to end up with a dataset like this:
        date id val1 val2 counter
    %d.%m.%Y
2 09.01.2000  1    4   19       2
3 01.08.2000  1    3    8       1
4 01.01.2000  2    2    7       1
7 28.01.2000  3    8   15       3
8 01.06.2000  3    9    3       1
I am still trying to wrap my head around data.table functions, particularly .SD, and want to solve the issue with those tools.
So far I know
that I can indicate what I mean by first and last using setkey(dt,date)
that I can replace the last val2 of a subset with the sum
dt[, val2 := replace(val2, .N, sum(val2[-.N], na.rm = TRUE)), by=id]
that I get the length of a subset with .N
how to delete rows
that I can calculate the difference between two dates with difftime(strptime(dt$date[1], format = "%d.%m.%Y"), strptime(dt$date[2], format = "%d.%m.%Y"), units = "weeks")
However, I can't get my head around how to split the observations so that each subset contains only observations of the same id whose consecutive dates are at most two weeks apart.
Any help is appreciated. Many thanks in advance.
The trick is to use cumsum() on a condition. In this case, the condition is the gap to the previous observation being more than 14 days. Each time the condition is true, the cumulative sum increments, which starts a new group.
library(dplyr)

dt %>%
  mutate(rownumber = row_number()) %>%
  group_by(id) %>%
  mutate(interval = as.numeric(as.Date(date, format = "%d.%m.%Y") -
                                 as.Date(lag(date), format = "%d.%m.%Y"))) %>%
  mutate(interval = ifelse(is.na(interval), 0, interval)) %>%
  mutate(group = cumsum(interval > 14) + 1) %>%
  ungroup() %>%
  group_by(id, group) %>%
  summarise(
    rownumber = last(rownumber),
    date = last(date),
    val1 = last(val1),
    val2 = sum(val2),
    counter = n()
  ) %>%
  select(rownumber, date, id, val1, val2, counter)
Output
  rownumber date          id  val1  val2 counter
      <int> <chr>      <int> <int> <int>   <int>
1         2 09.01.2000     1     4    19       2
2         3 01.08.2000     1     3     8       1
3         4 01.01.2000     2     2     7       1
4         7 28.01.2000     3     8    15       3
5         8 01.06.2000     3     9     3       1
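Since the question explicitly asked for a data.table/.SD-style solution, the same cumsum trick translates directly. A sketch, assuming the rows are already ordered by date within each id (as in the example); date2 and group are helper columns introduced here:
library(data.table)

setDT(dt)[, date2 := as.Date(date, format = "%d.%m.%Y")]    # parse the dates once
dt[, group := cumsum(c(0, diff(date2)) > 14) + 1, by = id]  # new group whenever the gap exceeds 14 days
dt[, .(date = last(date), val1 = last(val1),
       val2 = sum(val2), counter = .N),
   by = .(id, group)]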

R calculate median and last row in groups for certain rows

I'm working with grouping and medians. I'd like to group a data.frame and, for each group, compute the median of certain rows (not all of them) together with the last value.
My data are something like this:
test <- data.frame(
  id = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
  value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9))
> test
   id value
1   A     1
2   A     2
3   A     3
4   A     4
5   A     5
6   B     3
7   B     4
8   B     5
9   B     1
10  B     8
11  C     3
12  C     4
13  C     2
14  C     9
For each id, I need the median of the three (number may vary, in this case three) central rows, then the last value.
First of all, I tried with only one id.
test_a <- test[which(test$id == 'A'),]
> test_a
  id value
1  A     1
2  A     2
3  A     3
4  A     4
5  A     5
For this single id, I can get the desired values like this:
median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value) # median of three central values
tail(test_a,1)$value # last value
I used this:
library(tidyverse)
test_a %>% group_by(id) %>%
  summarise(m = median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value),
            last = tail(test_a, 1)$value) %>%
  data.frame()
  id m last
1  A 3    5
But when I tried to generalize to all ids:
test %>% group_by(id) %>%
  summarise(m = median(test[(nrow(test)-3):(nrow(test)-1),]$value),
            last = tail(test, 1)$value) %>%
  data.frame()
  id m last
1  A 3    9
2  B 3    9
3  C 3    9
I think that the formulas use the full dataset to calculate the last value and the median, but I cannot see how to make them work per group. Thanks in advance.
This works:
test %>%
  group_by(id) %>%
  summarise(m = median(value[(length(value)-3):(length(value)-1)]),
            last = value[length(value)])
# A tibble: 3 x 3
      id     m  last
  <fctr> <dbl> <dbl>
1      A     3     5
2      B     4     8
3      C     3     9
You just refer to variable value instead of the whole dataset within summarise.
Edit: Here's a generalized version.
test %>%
  group_by(id) %>%
  summarise(m = ifelse(length(value) == 1, value,
                  ifelse(length(value) == 2, median(value),
                         median(value[(ceiling(length(value)/2)-1):(ceiling(length(value)/2)+1)]))),
            last = value[length(value)])
If a group has only one row, the value itself will be stored in m. If it has only two rows, the median of these two rows will be stored in m. If it has three or more rows, the middle three rows will be chosen dynamically and the median of those will be stored in m.
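An equivalent way to write the middle window (a sketch, not part of the original answer) is to clamp its bounds with pmax/pmin, which makes the one- and two-row special cases unnecessary:
library(dplyr)

test %>%
  group_by(id) %>%
  summarise(m = median(value[pmax(1, ceiling(n()/2) - 1):pmin(n(), ceiling(n()/2) + 1)]),
            last = last(value))
For a one-row group the window collapses to that single row, and for a two-row group it covers both rows, matching the ifelse logic above.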

Count subgroups in group_by with dplyr [duplicate]

I'm stuck trying to do some counting on a data frame. The gist is to group by one variable and then break each group into subgroups based on a second variable. From there I want to count the number of subgroups in each group. The sample code is this:
set.seed(123456)
df <- data.frame(User = c(rep("A", 5), rep("B", 4), rep("C", 6)),
                 Rank = c(rpois(5,1), rpois(4,2), rpois(6,3)))
#This results in an error
df %>% group_by(User) %>% group_by(Rank) %>% summarize(Res = n_groups())
So what I want is 'User A' to have 3, 'User B' to have 4, and 'User C' to have 5. In other words the data frame df would end up looking like:
   User Rank Result
1     A    2      3
2     A    2      3
3     A    1      3
4     A    0      3
5     A    0      3
6     B    1      4
7     B    2      4
8     B    0      4
9     B    6      4
10    C    1      5
11    C    4      5
12    C    3      5
13    C    5      5
14    C    5      5
15    C    8      5
I'm still learning dplyr, so I'm unsure how I should do it. How can this be achieved? Non-dplyr answers are also very welcome. Thanks in advance!
Try this:
df %>% group_by(User) %>% mutate(Result=length(unique(Rank)))
Or, using dplyr's built-in helper:
df %>% group_by(User) %>% mutate(Result=n_distinct(Rank))
A base R option would be using ave
df$Result <- with(df, ave(Rank, User, FUN = function(x) length(unique(x))))
df$Result
#[1] 3 3 3 3 3 4 4 4 4 5 5 5 5 5 5
and a data.table option is
library(data.table)
setDT(df)[, Result := uniqueN(Rank), by = User]
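Another dplyr route (a sketch, assuming dplyr 0.8 or later for the name argument of count) is to count the distinct ranks separately and join the result back onto the original rows:
library(dplyr)

counts <- df %>%
  distinct(User, Rank) %>%       # one row per unique User/Rank pair
  count(User, name = "Result")   # number of distinct ranks per user

df %>% left_join(counts, by = "User")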

How to find the first and last occurrence in a panel data set in R

I have a table:
id time
 1    1
 1    2
 1    5
 2    3
 2    2
 2    7
 3    8
 3    3
 3   14
And I want to convert it to:
id first last
 1     1    5
 2     3    7
 3     8   14
Please help!
We can use data.table. After converting the 'data.frame' to a 'data.table' (setDT(df1)) and grouping by 'id', we take the first and last value of 'time':
library(data.table)
setDT(df1)[, list(firstocc = time[1L], lastocc = time[.N]),
           by = id]
Or with dplyr, using the same approach:
library(dplyr)
df1 %>%
  group_by(id) %>%
  summarise(firstocc = first(time), lastocc = last(time))
Or with base R (no packages needed)
do.call(rbind, lapply(split(df1, df1$id),
        function(x) data.frame(id = x$id[1],
                               firstocc = x$time[1], lastocc = x$time[nrow(x)])))
If we instead need the min and max values (note this does not match the expected output: for id 2 the first occurrence is 3, but the minimum is 2), the data.table option is
setDT(df1)[, setNames(as.list(range(time)),
                      c('firstOcc', 'lastOcc')), id]
and dplyr is
df1 %>%
  group_by(id) %>%
  summarise(firstocc = min(time), lastocc = max(time))
There are many packages that can perform aggregation of this sort in R. We show how to do it without any packages and then show it with some packages.
1) Use aggregate. No packages needed.
ag <- aggregate(time ~ id, DF, function(x) c(first = min(x), last = max(x)))
giving:
> ag
  id time.first time.last
1  1          1         5
2  2          2         7
3  3          3        14
ag is a two column data frame whose second column contains a two column matrix with columns named 'first' and 'last'. If you want to flatten it to a 3 column data frame use:
do.call("cbind", ag)
giving:
     id first last
[1,]  1     1    5
[2,]  2     2    7
[3,]  3     3   14
1a) This variation of (1) is more compact at the expense of uglier column names.
aggregate(time ~ id, DF, range)
2) sqldf
library(sqldf)
sqldf("select id, min(time) first, max(time) last from DF group by id")
giving:
  id first last
1  1     1    5
2  2     2    7
3  3     3   14
3) summaryBy. summaryBy in the doBy package is much like aggregate:
library(doBy)
summaryBy(time ~ id, data = DF, FUN = c(min, max))
giving:
  id time.min time.max
1  1        1        5
2  2        2        7
3  3        3       14
Note: Here is the input DF in reproducible form:
Lines <- "id time
1 1
1 2
1 5
2 3
2 2
2 7
3 8
3 3
3 14"
DF <- read.table(text = Lines, header = TRUE)
Update: Added (1a), (2) and (3) and fixed (1).
You can also remove the interior duplicates, keeping only each id's first and last row, and then reshape to wide format:
dd <- read.table(header = TRUE, text = "id time
1 1
1 2
1 5
2 3
2 2
2 7
3 8
3 3
3 14")
d2 <- dd[!(duplicated(dd$id) & duplicated(dd$id, fromLast = TRUE)), ]
reshape(within(d2, tt <- c('first', 'last')), dir = 'wide', timevar = 'tt')
#   id time.first time.last
# 1  1          1         5
# 4  2          3         7
# 7  3          8        14

How can I get the most common combination of several columns, aggregating by others, in a data.frame?

Let's say I have a dataframe with the following structure:
id A B
1 1 1
1 1 2
1 1 2
1 2 2
1 2 3
1 2 4
1 2 5
2 1 2
2 2 2
2 3 2
2 3 5
2 3 5
2 4 6
I'd like to get the most common combination of values in A and B for each id:
id A B
1 1 2
2 3 5
I need to do this for a fairly big dataset (several million rows). I've come up with a couple of horrible, slow, and very un-idiomatic solutions; I'm sure there is an easy, R-ish way.
I think I should be using aggregate, but I can't find a way to do it that works:
> aggregate(cbind(A, B) ~ id, d, Mode)
  id A B
1  1 2 2
2  2 3 2
> # wrong!
> aggregate(interaction(A, B) ~ id, d, Mode)
  id interaction(A, B)
1  1               1.2
2  2               3.5
> # close, but I need the original columns
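Note that Mode is not a base R function, so the snippets above assume a user-defined helper. A commonly used definition (an assumption here, since the question does not show it) is:
# Returns the (first) most frequent value of x; not part of base R
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
Because aggregate applies the function to each column separately, the first attempt returns the mode of A and the mode of B independently, which is why the reported combination need not actually occur in the data.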
Using dplyr:
library(dplyr)
df %>%
  group_by(id, A, B) %>%
  mutate(n = n()) %>%
  group_by(id) %>%
  slice(which.max(n)) %>%
  select(-n)
#Source: local data frame [2 x 3]
#Groups: id
#
# id A B
#1 1 1 2
#2 2 3 5
And a similar data.table approach:
library(data.table)
setDT(df)[, .N, by=.(id, A, B)][, .SD[which.max(N)], by = id]
# id A B N
#1: 1 1 2 2
#2: 2 3 5 2
Edit to include a brief explanation:
Both approaches do essentially the same thing:
Group the data by id, A and B.
Add a column with the number of rows per group.
Group the data by id (only) and return the (first) maximum group per id.
In the data.table version, you start with setDT(df) to convert the data.frame to a data.table object.
