R - select and assign value to group based on condition in column

Having a data frame that looks like the following:
d
year pos days sal
1 2009 A 31 2000
2 2009 B 60 4000
3 2009 C 10 600
4 2010 B 10 1000
5 2010 D 90 7000
I would like to group data by year, adding days and sal, and select pos where days is maximum in the group.
The result should be like:
year pos days sal
1 2009 B 101 6600
2 2010 D 100 8000
I could deal with numeric values such as days and sal using functions like tapply(d$days, d$year, sum).
However, I have no idea how I can select pos that meets a condition on days and assign it to the group.
Any comments will be greatly appreciated!

We can use dplyr. After grouping by 'year', take the 'pos' at the row where 'days' is largest (which.max(days)), and compute the sums of 'days' and 'sal'.
library(dplyr)
d %>%
group_by(year) %>%
summarise(pos = pos[which.max(days)], days = sum(days), sal = sum(sal))
# # A tibble: 2 × 4
# year pos days sal
# <int> <chr> <int> <int>
#1 2009 B 101 6600
#2 2010 D 100 8000
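One thing worth knowing about which.max(): it returns the index of the first maximum only, so ties are resolved silently in favour of the earlier row. A minimal sketch of the difference, using made-up values (the tied days below are an assumption for illustration, they are not in the question's data):

```r
# Hypothetical group with a tie on the maximum number of days
days <- c(31, 60, 60)
pos  <- c("A", "B", "C")

# which.max() keeps only the first row that reaches the maximum
first_max <- pos[which.max(days)]

# a logical comparison keeps every row that reaches the maximum
all_max <- pos[days == max(days)]

first_max  # "B"
all_max    # "B" "C"
```

If ties can occur in your data, decide up front which of the two behaviours you want.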

A solution with base R:
m1 <- d[as.logical(with(d, ave(days, year, FUN = function(x) seq_along(x) == which.max(x)) )), c('year','pos')]
m2 <- aggregate(cbind(days, sal) ~ year, d, sum)
merge(m1, m2, by = 'year')
Or with the data.table package:
library(data.table)
setDT(d)[order(days), .(pos = pos[.N], days = sum(days), sal = sum(sal)), by = year]
The resulting data.frame / data.table:
year pos days sal
1 2009 B 101 6600
2 2010 D 100 8000
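If you prefer to stay in base R but find the ave() one-liner hard to read, a split/lapply version may be easier to follow. This is only a sketch of the same idea; the helper name summarise_year is made up for illustration.

```r
# Data from the question
d <- data.frame(year = c(2009, 2009, 2009, 2010, 2010),
                pos  = c("A", "B", "C", "B", "D"),
                days = c(31, 60, 10, 10, 90),
                sal  = c(2000, 4000, 600, 1000, 7000))

# Summarise the rows of one year: pos of the largest 'days' plus the two totals
# (summarise_year is a hypothetical helper, not part of any package)
summarise_year <- function(g) {
  data.frame(year = g$year[1],
             pos  = g$pos[which.max(g$days)],  # first maximum wins on ties
             days = sum(g$days),
             sal  = sum(g$sal))
}

# Split by year, summarise each piece, and stack the one-row results
res <- do.call(rbind, lapply(split(d, d$year), summarise_year))
rownames(res) <- NULL
res
```

The result has one row per year, with pos B for 2009 and D for 2010.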

With sqldf:
library(sqldf)
cbind.data.frame(sqldf('select year, sum(days) as days, sum(sal) as sal
from d group by year'),
sqldf('select pos from d group by year having days=max(days)'))
year days sal pos
1 2009 101 6600 B
2 2010 100 8000 D

Related

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate) # provides ymd()
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days between the first and the last detection of each individual (Ndays) and the number of distinct days on which the individual was detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr:
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
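df12 <- aggregate(Date ~ ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
df12 <- cbind(df12["ID"], df12$Date) # aggregate() returns a matrix column; unpack it into Ndays and Ndifdays
df12$Prop <- df12$Ndifdays/df12$Ndays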
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', count the unique 'Date' values with n_distinct to create 'Ndifdays', and divide 'Ndifdays' by 'Ndays' to get 'Prop'.
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
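For anyone who wants to check the arithmetic without loading any packages, here is a base R sketch of the same calculation. It assumes the dates are built with as.Date() instead of ymd(), and the rounding to three decimals is only there to match the printed tibble.

```r
# Same data as in the question, built with base R's as.Date()
df1 <- data.frame(
  ID   = rep(1:2, times = 5),
  Date = as.Date(c("2016-08-21", "2016-08-24", "2016-08-23", "2016-08-29",
                   "2016-08-27", "2016-09-02", "2016-09-01", "2016-09-09",
                   "2016-09-01", "2016-09-10"))
)

# One named vector per ID: span in days and number of distinct dates
per_id <- lapply(split(df1$Date, df1$ID), function(x) {
  c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x)))
})

# Stack the per-ID vectors into a data frame and add the proportion
res <- data.frame(ID = as.integer(names(per_id)), do.call(rbind, per_id))
res$Prop <- round(res$Ndifdays / res$Ndays, 3)
res
```

ID 1 has 4 distinct dates over 11 days (4/11 is about 0.364) and ID 2 has 5 over 17 days (about 0.294).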

Impute/ fill in missing values between time periods

I have data that often contains missing observations between time periods. I want to fill in those observations, properly incrementing the time periods, but conditional on the values of the observations. Here's an example:
df <- data.frame(id=c("a","a","b","b"), group=c("x","x","y","z"), year=c(2000,2003,2003,2005))
Which gives the 4 observation data frame
id group year
1 a x 2000
2 a x 2003
3 b y 2003
4 b z 2005
I would like to have 2 additional observations here (between #1 and #2) for 2001 and 2002, since observation #1 and #2 match on id and group. But I don't want additional observation between #3 and #4 because the id and group do not match.
You can use full_seq from tidyr - it was created exactly for tasks like this (Create the full sequence of values in a vector):
library(tidyr)
library(dplyr)
df %>%
group_by(id, group) %>%
complete(year = full_seq(year, period = 1))
id group year
<fct> <fct> <dbl>
1 a x 2000
2 a x 2001
3 a x 2002
4 a x 2003
5 b y 2003
6 b z 2005
Or using data.table
library(data.table)
setDT(df)[, .(year = year[1]:year[.N]), .(id, group)]
# id group year
#1: a x 2000
#2: a x 2001
#3: a x 2002
#4: a x 2003
#5: b y 2003
#6: b z 2005
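The same filling can also be sketched in base R, which makes the logic explicit: split on the id/group combination and build a full year sequence inside each piece. This is a rough equivalent, not a replacement for the answers above.

```r
df <- data.frame(id    = c("a", "a", "b", "b"),
                 group = c("x", "x", "y", "z"),
                 year  = c(2000, 2003, 2003, 2005))

# One piece per id/group combination that actually occurs in the data
pieces <- split(df, list(df$id, df$group), drop = TRUE)

# Within each piece, expand year to every value between its min and max
filled <- do.call(rbind, lapply(pieces, function(g) {
  data.frame(id    = g$id[1],
             group = g$group[1],
             year  = seq(min(g$year), max(g$year)))
}))
rownames(filled) <- NULL
filled
```

Only the a/x piece spans more than one year, so it is the only one that gains rows (2001 and 2002).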

Selecting distinct rows in dplyr [duplicate]

dat <- data.frame(loc.id = rep(1:2, each = 3),
year = rep(1981:1983, times = 2),
prod = c(200,300,400,150,450,350),
yld = c(1200,1250,1200,3000,3200,3200))
If I want to select for each loc.id distinct values of yld, I do this:
dat %>% group_by(loc.id) %>% distinct(yld)
loc.id yld
<int> <dbl>
1 1200
1 1250
2 3000
2 3200
However, what I want is this: within each loc.id, if several years have the same yld, keep the row with the lowest prod value. I also want the year and prod columns included in the final data frame, which should look like this:
loc.id year prod yld
1 1981 200 1200
1 1982 300 1250
2 1981 150 3000
2 1983 350 3200
We can arrange by 'prod' and then slice the first observation of each 'loc.id'/'yld' group:
dat %>%
arrange(loc.id, prod) %>%
group_by(loc.id, yld) %>%
slice(1)
# A tibble: 4 x 4
# Groups: loc.id, yld [4]
# loc.id year prod yld
# <int> <int> <dbl> <dbl>
#1 1 1981 200 1200
#2 1 1982 300 1250
#3 2 1981 150 3000
#4 2 1983 350 3200
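The "sort, then keep the first row per group" idea also works without dplyr. A minimal base R sketch, assuming the same dat as in the question:

```r
dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year   = rep(1981:1983, times = 2),
                  prod   = c(200, 300, 400, 150, 450, 350),
                  yld    = c(1200, 1250, 1200, 3000, 3200, 3200))

# Sort so that, within each loc.id/yld pair, the lowest prod comes first
sorted <- dat[order(dat$loc.id, dat$yld, dat$prod), ]

# duplicated() flags every row after the first one of each loc.id/yld pair
res <- sorted[!duplicated(sorted[c("loc.id", "yld")]), ]
res
```

Because the sort puts the lowest prod first, dropping the duplicated loc.id/yld rows leaves exactly the rows asked for.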

count and listing all factor levels of all factors

I have a data frame in R like this:
D I S ...
110 2012 1000
111 2012 2000
110 2012 1000
111 2014 2000
110 2013 1000
111 2013 2000
I want to count how many times each level occurs for each factor and save this in a data frame like this:
D Count I Count S Count ...
110 3 2012 3 1000 3
111 3 2013 2 2000 3
2014 1
or this:
D Count
110 3
111 3
I Count
2012 3
2013 2
2014 1
S Count
1000 3
2000 3
....
I tried to do it with sapply, levels, the library(dplyr) or aggregate, but it does not produce the desired output. How can I do that?
I think the most efficient way to do it, in terms of length of code and storing final output in a tidy format is this:
library(tidyverse)
# example data
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
data %>%
gather(name,value) %>% # reshape dataset
count(name, value) # count combinations
# # A tibble: 7 x 3
# name value n
# <chr> <chr> <int>
# 1 D 110 3
# 2 D 111 3
# 3 I 2012 3
# 4 I 2013 2
# 5 I 2014 1
# 6 S 1000 3
# 7 S 2000 3
The 1st column holds the name of your factor variable.
The 2nd column has the unique values of each variable.
The 3rd column is the count.
Here is a solution using data.table
data <- data.frame(D = rep(c("110", "111"), 3),
I = c(rep("2012", 3), "2014", "2013", "2013"),
S = rep(c("1000", "2000"), 3))
str(data)
# you just want
table(data$D)
table(data$I)
table(data$S)
# one option using data.table
require(data.table)
dt <- as.data.table(data)
dt # see dt
dt[, table(D)] # or dt[, .N, by = D], for one variable
paste(names(dt), "Count", sep = "_") # names of new count columns
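# add one count column per variable; .N is the group size, so each row gets the count of its own value
for (col in names(data)) dt[, (paste(col, "Count", sep = "_")) := .N, by = col]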
dt # new dt
data2 <- as.data.frame(dt)[, sort(names(dt))]
data2 # final data frame
And a pipe-based one (the %>% operator comes with dplyr) for the second output.
counts <- data %>%
lapply(table) %>%
lapply(as.data.frame)
counts
I think the easy way is by using the "plyr" R-library.
library(plyr)
count(data$D)
count(data$I)
count(data$S)
It will give you
> count(data$D)
x freq
1 110 3
2 111 3
> count(data$I)
x freq
1 2012 3
2 2013 2
3 2014 1
> count(data$S)
x freq
1 1000 3
2 2000 3
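If you want the long name/value/n layout without loading any package, table() per column plus rbind gets you there. A small sketch, assuming the same example data as in the answers above:

```r
data <- data.frame(D = rep(c("110", "111"), 3),
                   I = c(rep("2012", 3), "2014", "2013", "2013"),
                   S = rep(c("1000", "2000"), 3))

# Build one small data frame of counts per column, then stack them
counts <- do.call(rbind, lapply(names(data), function(nm) {
  tab <- table(data[[nm]])
  data.frame(name  = nm,
             value = names(tab),
             n     = as.vector(tab))
}))
counts
```

This gives 7 rows, one per column/value pair, in the same order as the tidyverse output.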

Subset R data frame contingent on the value of duplicate variables

How can I subset the following example data frame to only return one
observation for the earliest occurrence [i.e. min(year)] of each id?
id <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty <- c(100, 300, 100, 200, 100, 500)
df=data.frame(year, qty, id)
In the example above there are two observations for the "A" id, at years 2000 and 2001. In the case of duplicate ids, I would like the subset data frame to include only the first occurrence (i.e. the one at 2000) of the observations for the duplicated id.
df2 = subset(df, ???)
This is what I am trying to return:
df2
year qty id
2000 100 A
2001 100 C
2002 200 D
2003 100 E
2004 500 F
Any assistance would be greatly appreciated.
You can aggregate on minimum year + id, then merge with the original data frame to get qty:
df2 <- merge(aggregate(year ~ id, df, min), df)
# > df2
# id year qty
# 1 A 2000 100
# 2 C 2001 100
# 3 D 2002 200
# 4 E 2003 100
# 5 F 2004 500
Is this what you're looking for? Your second row looks wrong to me (it's the duplicated year, not the first).
> duplicated(df$year)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
> df[!duplicated(df$year), ]
year qty id
1 2000 100 A
2 2001 300 A
4 2002 200 D
5 2003 100 E
6 2004 500 F
Edit 1: Er, I completely misunderstood what you were asking for. I'll keep this here for completeness though.
Edit 2:
Ok, here's a solution: Sort by year (so the first entry per ID has the earliest year) and then use duplicated. I think this is the simplest solution:
> df.sort.year <- df[order(df$year), ]
> df.sort.year[!duplicated(df.sort.year$id), ]
year qty id
1 2000 100 A
3 2001 100 C
4 2002 200 D
5 2003 100 E
6 2004 500 F
Using plyr
library(plyr)
## make sure first row will be min (year)
df <- arrange(df, id, year)
df2 <- ddply(df, .(id), head, n = 1)
df2
## year qty id
## 1 2000 100 A
## 2 2001 100 C
## 3 2002 200 D
## 4 2003 100 E
## 5 2004 500 F
or using data.table. Setting the key as id, year will ensure the first row is the minimum of year.
library(data.table)
DF <- data.table(df, key = c('id','year'))
DF[,.SD[1], by = 'id']
## id year qty
## [1,] A 2000 100
## [2,] C 2001 100
## [3,] D 2002 200
## [4,] E 2003 100
## [5,] F 2004 500
There is likely a prettier way of doing this, but this is what came to mind
# use which() to get index for each id, saving only first
first_occurance <- with(df, sapply(unique(id), function(x) which(id %in% x)[1]))
df[first_occurance,]
# year qty id
#1 2000 100 A
#3 2001 100 C
#4 2002 200 D
#5 2003 100 E
#6 2004 500 F
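A word of caution on the duplicated() and "first index" approaches: they pick the first row in whatever order the data frame happens to be in, so they only return the earliest year if the rows are sorted first. A minimal sketch with the question's rows deliberately shuffled (the shuffled order is my assumption, to show why the sort matters):

```r
# Same rows as the question, but with A's 2001 row placed before its 2000 row
df <- data.frame(year = c(2001, 2000, 2001, 2002, 2003, 2004),
                 qty  = c(300, 100, 100, 200, 100, 500),
                 id   = c("A", "A", "C", "D", "E", "F"))

# Without sorting, the first row per id is not necessarily the earliest year
unsorted_first <- df[!duplicated(df$id), ]

# Sort by id and year, then test duplicates on the sorted frame itself
sorted <- df[order(df$id, df$year), ]
first  <- sorted[!duplicated(sorted$id), ]

unsorted_first$year[1]  # 2001, the wrong row for id A
first$year[1]           # 2000, the earliest year for id A
```

Note that duplicated() has to be called on the sorted data frame's own id column, otherwise the logical index refers to the original row order.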
