Aggregate, but function uses two columns - r

I'm sure this has been asked 1000 times, but I can't find the question, and can't figure it out.
I have a data.frame, with a location (a factor), a date, and a variable.
I want to find the date on which the variable is maximized, for each location.
df = data.frame(FAC = factor(rep(c("A","B","C"),each=5)), VAR = runif(15), DATE = rep(as.Date(c("2000-01-01","2000-01-02","2000-01-03","2000-01-04","2000-01-05")), times=3))
I can easily (but messily) do this with a for loop:
df_summary = data.frame(FAC = levels(df$FAC),date=as.Date(character(1)))
for(i in seq_along(levels(df$FAC))){
df_subset = subset(df,FAC == levels(df$FAC)[i])
max_date = df_subset$DATE[which.max(df_subset$VAR)]
df_summary$date[df_summary$FAC == levels(df$FAC)[i]] = max_date
}
I imagine there's a 'nice' way to do this with either aggregate or dplyr, but I can't figure it out.
My (failed) attempts:
aggregate(x=df$DATE,by=list(df$FAC),FUN=function(x) x[which.max(df$VAR)])
This doesn't work, because df$VAR isn't subset in the function.
And I don't really know how to use dplyr because I generally use base R.
Any suggestions?

In dplyr, you can do -
library(dplyr)
df %>% group_by(FAC) %>% summarise(max_date = DATE[which.max(VAR)])
In data.table -
library(data.table)
setDT(df)[, .(max_date = DATE[which.max(VAR)]), FAC]

We can use
library(dplyr)
df %>%
arrange(FAC, desc(VAR)) %>%
group_by(FAC) %>%
slice(1)
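For completeness, here is a minimal base R sketch of the same idea, without aggregate. It assumes the df from the question (rebuilt here with a fixed seed so it is reproducible); the names idx and df_summary are only illustrative. The trick is to split the row indices by FAC, so that VAR and DATE can both be looked up for the same rows:

```r
set.seed(1)
df <- data.frame(
  FAC  = factor(rep(c("A", "B", "C"), each = 5)),
  VAR  = runif(15),
  DATE = rep(as.Date("2000-01-01") + 0:4, times = 3)
)

# For each level of FAC, find the row index where VAR is largest
idx <- sapply(split(seq_len(nrow(df)), df$FAC),
              function(i) i[which.max(df$VAR[i])])

# Look up the date at those rows
df_summary <- data.frame(FAC = names(idx), date = df$DATE[idx])
df_summary
```

Working on row indices rather than on DATE alone is what the aggregate attempt was missing: the function needs to see both columns for the same subset of rows.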

Related

How to write a function that includes pipes using functions from the srvyr package?

I have a survey that I am trying to group by year and then calculate totals for certain variables. I need to do this about 20 times with different variables, so I am writing a function, but I can't seem to get it to work properly even though the code works fine outside the function.
this works fine:
mepsdsgn %>% group_by(YEAR) %>% summarise(tot_pri = survey_total(TOTPRV)) %>% select(YEAR, tot_pri)
when I try a function:
total_calc <- function(x) {mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(x)) %>% select(YEAR, total)}
total_calc(TOTPRV)
I get this error: Error in stop_for_factor(x) : object 'TOTPRV' not found
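Figured it out. The column name has to be passed as a string, e.g. total_fun("TOTPRV"):
total_fun <- function(x) {
# x is a column name given as a string; sym() turns it into a symbol and !! unquotes it
mepsdsgn %>% group_by(YEAR) %>% summarise(total = survey_total(!!sym(x), na.rm = TRUE)) %>% select(YEAR, total)
}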
There are a couple of things I'd suggest doing; see below.
# first try to make a working minimal example people can run in a new R session
library(magrittr)
library(dplyr)
dt <- data.frame(y=1:10, x=rep(letters[1:2], each=5))
# simple group and mean using the column names explicitly
dt %>% group_by(x) %>% summarise(mean(y))
# a bit of googling showed me you need to use group_by_at(vars("x")) to replicate
# using a string input
# in this function, add all arguments, so the data you use - dt & the column name - column.x
foo <- function(dt, column.x){
dt %>% group_by_at(vars(column.x)) %>% summarise(mean(y))
}
# when running a function, you need to supply the name of the column as a string, e.g. "x" not x
foo(dt, column.x="x")
I don't use dplyr, so there may be a better way
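As a possible alternative, here is a small sketch using plain dplyr (version 1.0 or later assumed) on the toy dt above, not srvyr. The function names mean_by and mean_by_name are made up for illustration. It shows two common ways to pass a column into a function: unquoted with the embrace operator {{ }}, or as a string with across(all_of()).

```r
library(dplyr)

dt <- data.frame(y = 1:10, x = rep(letters[1:2], each = 5))

# Unquoted column name: {{ }} forwards the argument into group_by()
mean_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_y = mean(y))
}

# Column name as a string: all_of() selects the column named by the string
mean_by_name <- function(data, group_name) {
  data %>%
    group_by(across(all_of(group_name))) %>%
    summarise(mean_y = mean(y))
}

res_bare <- mean_by(dt, x)         # called with a bare name
res_name <- mean_by_name(dt, "x")  # called with a string
```

Both calls should give the same grouped means. Whether {{ }} works inside survey_total() depends on srvyr, so treat that part as something to test rather than a given.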

Creating a new table with cumsum - code doesn't seem to be working?

So I am using dplyr to create a new data frame to plot: date in column 1, IDCount in column 2 and CumulativeIDCount in column 3. Here is the code that I am using to do it:
df2 <- df %>%
group_by(Date)%>%
summarise(IDCount =n(),CumulativeIDCount=cumsum(n()))
but the cumulativeIDCount column isn't cumulative, it's exactly the same as the IDCount column. Where am I going wrong with this code?
Most probably what you need is cumsum of IDCount after the grouping
library(dplyr)
df %>%
group_by(Date)%>%
summarise(IDCount =n()) %>%
mutate(CumulativeIDCount = cumsum(IDCount))
We can use data.table
library(data.table)
setDT(df)[, .(IDCount = .N), Date][, CumulativeIDCount := cumsum(IDCount)][]
Or with dplyr
library(dplyr)
df %>%
count(Date) %>%
mutate(CumulativeIDCount = cumsum(n))
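To see why the original attempt returned the same number twice, note that n() is a single number within each group, and the cumulative sum of a single number is that number. A minimal base R sketch with made-up dates (the data below is an assumption, not the asker's) that counts first and accumulates afterwards:

```r
df <- data.frame(Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                                  "2020-01-03", "2020-01-03", "2020-01-03")))

# cumsum() of one value is just that value, which is what happened per group
single <- cumsum(2L)

# Count rows per date first, then take the running total across dates
counts <- as.data.frame(table(Date = df$Date), responseName = "IDCount")
counts$CumulativeIDCount <- cumsum(counts$IDCount)
counts
```

The running total only makes sense once there is one row per date, which is why the cumsum belongs in a step after the summarise.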

Simple mutate with dplyr gives "wrong result size" error

My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
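The issue is that mutate evaluates the expression on whole columns, not row by row (as in your loop). Thus, which is applied to the full logical vector produced by as.character(subject) == subj.list. In that comparison subj.list (6 values) is recycled along the 72 rows, so which returns the 12 positions that happen to match, while mutate expects a result of length 72 or 1.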
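The recycling can be checked in base R with the InsectSprays example from the question. This is only a sketch to illustrate the mechanism; the names cmp, hits and subject.id are illustrative. It also shows match(), which is a vectorized lookup and so returns one value per row:

```r
df <- InsectSprays
names(df)[names(df) == "spray"] <- "subject"
subj.list <- unique(as.character(df$subject))

# subj.list has 6 elements and is recycled along the 72 rows
cmp <- as.character(df$subject) == subj.list
hits <- which(cmp)   # positions where the recycled value happens to match
length(hits)         # the length that mutate complained about

# Vectorized lookup: position of each subject in subj.list, one per row
subject.id <- match(as.character(df$subject), subj.list)
```

Inside mutate, match(as.character(subject), subj.list) would therefore have the right length without needing rowwise.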
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))

efficient dplyr summarise in one data frame based on intervals in another one

I frequently need to calculate means of many parameters in time series datasets based on intervals defined as "events" in a second dataset.
The example code below illustrates my current approach, which does work nicely.
As my datasets will be increasing, though, I am wondering if there is a more efficient way (example runs in ~30 s on my PC).
It is important to stay within dplyr/tidyverse (data.table ways are appreciated, but won't really help).
library(tidyverse)
#generate time series data
data <- bind_cols(
data_frame(td=seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-31 23:59"),
by = 60)),
as_data_frame(replicate(20,runif(525600))))
#generate events
events <- data_frame(
event = as.character(1:669),
start_cet = seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-01 00:00"),
by = 43200),
stop_cet = seq(from = as.POSIXct("2010-01-01 02:00"),
to = as.POSIXct("2010-12-01 02:00"),
by = 43200)
)
#calculate means of data columns within event intervals
system.time(
means <- events %>%
rowwise() %>%
mutate(s = list(data %>% select(td) %>% filter(td >= start_cet & td < stop_cet))) %>%
unnest() %>%
select(event,td) %>%
left_join(.,data) %>%
group_by(event) %>%
summarise_at(vars(V1:V20),funs(mean=mean)) %>%
ungroup()
)
Here's an efficient way of doing it using the latest devel (1.9.7+) version of data.table that takes about 10 milliseconds to run for OP sample:
library(data.table)
setDT(data); setDT(events)
data[events, on = .(td >= start_cet, td <= stop_cet), lapply(.SD, mean), by = .EACHI]
Answer to myself after ~ 3 yrs...
The mutate step in the above dplyr solution was unnecessarily complicated, as also indicated in the comment by JDLong. I now use
means2 <- events %>%
rowwise() %>%
mutate(td = list(seq(start_cet, stop_cet - 60, "min"))) %>%
unnest() %>%
select(event,td) %>%
left_join(.,data) %>%
group_by(event) %>%
summarise_at(vars(V1:V20),funs(mean=mean)) %>%
ungroup()
which is ~ 25 times faster than the old dplyr solution above.
The data.table solution is still ~ 5 times faster than this dplyr chain. However, its output is a bit messed up: instead of a column with the events, I get two columns named td, which hold the start and stop times of the events. Does any data.table expert know how to fix this?
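As a side note, when the events are sorted by start time and do not overlap (true for the sample data, but an assumption in general), the interval lookup can be done without expanding or joining anything. Below is a scaled-down base R sketch, not a tidyverse solution, with two days of data, one value column and four events; the time zone and sizes are chosen only for the example.

```r
# Two days of one-minute data with a single value column
td <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"), by = 60,
          length.out = 2 * 24 * 60)
data <- data.frame(td = td, V1 = seq_along(td))

# Four two-hour events, one starting every 12 hours
events <- data.frame(
  event     = as.character(1:4),
  start_cet = seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"), by = 43200,
                  length.out = 4)
)
events$stop_cet <- events$start_cet + 7200

# Index of the latest event that starts at or before each timestamp
i <- findInterval(as.numeric(data$td), as.numeric(events$start_cet))

# Keep only timestamps that fall before that event's stop time
inside <- i > 0 & data$td < events$stop_cet[pmax(i, 1)]

# Mean of V1 per event over the half-open interval [start, stop)
means <- aggregate(V1 ~ event,
                   data = data.frame(event = events$event[i[inside]],
                                     V1 = data$V1[inside]),
                   FUN = mean)
```

The same index vector could be attached to the data as an event column and fed into group_by and summarise, which would keep the rest of the pipeline in dplyr.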

Modify a variable in a data frame, only for some levels of a factor (possibly with dplyr)

Sample df:
df <- data.frame(x = c(runif(10,0,2*pi),runif(10,0,360)), group = gl(n = 2, k = 10, labels =c("A","B")))
I want to modify x only for group A (convert it to degrees). With base I just do:
df <- within(df,x[group == "A"] <- x[group == "A"]*180/pi)
I was wondering if there could be a way to do this with dplyr. This is wrong:
df <- df %>% filter(group == "A") %>% mutate(x = x*180/pi)
Because it returns only the subset of df where group == "A". Is there a (simple) way to do this, or is this a case where base trumps dplyr for ease of use?
We can use ifelse with a logical condition: where group is "A" it returns the converted value, and otherwise it returns the original value.
df %>%
mutate(x = ifelse(group=="A", x*180/pi, x))
Or, as @AlexIoannides mentioned, if_else from dplyr can be used instead, which is stricter about the types of the two branches.
In data.table, this can be done by assignment in place and should be more efficient.
library(data.table)
setDT(df)[group=="A", x := x*180/pi]
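The ifelse approach also works in base R without any packages. A minimal sketch on the sample df (with a fixed seed added here for reproducibility; orig is just a copy kept for comparison):

```r
set.seed(42)
df <- data.frame(x = c(runif(10, 0, 2 * pi), runif(10, 0, 360)),
                 group = gl(n = 2, k = 10, labels = c("A", "B")))
orig <- df$x

# Convert radians to degrees for group A only; group B is left untouched
df$x <- ifelse(df$group == "A", df$x * 180 / pi, df$x)
```

All 20 rows are kept, which is the difference from the filter attempt in the question.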
