tapply based on multiple indexes in R - r

I have a data frame, much like this one:
ref=rep(c("A","B"),each=240)
year=rep(rep(2014:2015,each=120),2)
month=rep(rep(1:12,each=10),4)
values=c(rep(NA,200),rnorm(100,2,1),rep(NA,50),rnorm(40,4,2),rep(NA,90))
DF=data.frame(ref,year,month,values)
I would like to compute the maximum number of consecutive NAs per reference, per year.
I have created a function, which works out the maximum number of consecutive NAs, but can only be based on one variable.
For example,
func <- function(x) {
max(rle(is.na(x))$lengths)
}
with(DF, tapply(values,ref, func))
# A B
# 200 90
with(DF, tapply(values,year, func))
# 2014 2015
# 120 90
So there is a maximum of 200 consecutive NAs in ref A overall and a maximum of 90 in ref B, which is correct. Likewise, there is a maximum of 120 consecutive NAs in 2014 and 90 in 2015.
What I'd like is a result per ref and year, such as:
A 2015 80
A 2014 120
B 2015 90
B 2014 50

There are multiple ways of doing this; one is with the plyr library:
library(plyr)
ddply(DF,c('ref','year'),summarise,NAs=max(rle(is.na(values))$lengths))
ref year NAs
1 A 2014 120
2 A 2015 80
3 B 2014 60
4 B 2015 90
Using your function, you could also try:
with(DF, tapply(values,list(ref,year), func))
which returns the same numbers as a matrix rather than a data frame:
2014 2015
A 120 80
B 60 90
You can, however, get back to the same data frame layout by applying melt() (from the reshape2 package) to this matrix.
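One caveat about the outputs above: max(rle(is.na(x))$lengths) returns the longest run of either kind, NA or non-NA, which would explain why B/2014 shows 60 (a run of 60 observed values) rather than the 50 asked for in the question. A minimal sketch of a variant that keeps only the NA runs, assuming the same data layout as the question (max_na_run is a hypothetical name, and set.seed() is added only for reproducibility):

```r
# Same layout as the question's data; set.seed() is an addition for reproducibility.
set.seed(1)
ref    <- rep(c("A", "B"), each = 240)
year   <- rep(rep(2014:2015, each = 120), 2)
values <- c(rep(NA, 200), rnorm(100, 2, 1), rep(NA, 50), rnorm(40, 4, 2), rep(NA, 90))
DF     <- data.frame(ref, year, values)

# Hypothetical helper: length of the longest run of NAs only (0 if there are none).
max_na_run <- function(x) {
  r <- rle(is.na(x))
  na_runs <- r$lengths[r$values]   # keep only runs where is.na() was TRUE
  if (length(na_runs) == 0) 0L else max(na_runs)
}

res <- with(DF, tapply(values, list(ref, year), max_na_run))
res
```

If a long layout is preferred, as.data.frame(as.table(res)) should convert the matrix into one row per ref/year pair.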

This is very similar to the tapply solution above, although I find that aggregate gives a more convenient output than tapply.
with(DF, aggregate(list(Value = values),list(Year = year,ref = ref), func))
Year ref Value
1 2014 A 120
2 2015 A 80
3 2014 B 60
4 2015 B 90

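I like the recipe format of dplyr. Note that this counts all NAs in each ref/year group rather than the longest consecutive run; the two coincide here because every group contains a single block of NAs.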
library(dplyr)
DF$values[is.na(DF$values)] <- 1
DF %>%
filter(values==1) %>%
group_by(ref,year) %>%
mutate(csum=cumsum(values)) %>%
group_by(ref,year) %>%
summarise(max(csum))
Source: local data frame [4 x 3]
Groups: ref [?]
ref year max(csum)
(fctr) (int) (dbl)
1 A 2014 120
2 A 2015 80
3 B 2014 50
4 B 2015 90

Related

Performing operation among levels of grouped variable in R/dplyr

I want to perform a calculation among the levels of a grouping variable and fit this into a dplyr/tidyverse style workflow. I know this is confusing wording, but I hope the example below helps to clarify.
Below, I want to find the difference between levels "A" and "B" for each year that I have data. One solution was to cast the data from long to wide format and use mutate() to find the difference between A and B, creating a new column with the results.
Ultimately, I'm working with a much larger dataset in which for each of N species, and for every year of sampling, I want to find the response ratio of some measured variable. Being able to keep the calculation in a long-format workflow would greatly help with later uses of the data.
library(tidyverse)
library(reshape)
set.seed(34)
test = data.frame(Year = rep(seq(2011,2020),2),
Letter = rep(c('A','B'),each = 10),
Response = sample(100,20))
test.results = test %>%
cast(Year ~ Letter, value = 'Response') %>%
mutate(diff = A - B)
#test.results
Year A B diff
2011 93 48 45
2012 33 44 -11
2013 9 80 -71
2014 10 61 -51
2015 50 67 -17
2016 8 43 -35
2017 86 20 66
2018 54 99 -45
2019 29 100 -71
2020 11 46 -35
Is there some solution where I could group by Year, and then use a function like summarize() to calculate between the levels of the variable "Letter"?
test %>%
group_by(Year) %>%
summarise(diff = ...) # something here to perform a calculation between levels A and B of the variable "Letter"
You can subset the Response values for "A" and "B" and then take the difference.
library(dplyr)
test %>%
group_by(Year) %>%
summarise(diff = Response[Letter == 'A'] - Response[Letter == 'B'])
# Year diff
# <int> <int>
# 1 2011 45
# 2 2012 -11
# 3 2013 -71
# 4 2014 -51
# 5 2015 -17
# 6 2016 -35
# 7 2017 66
# 8 2018 -45
# 9 2019 -71
#10 2020 -35
In this example, we can also take advantage of the fact that sorting with desc(Letter) puts "B" before "A" within each year, so diff() returns A - B (this assumes exactly one "A" and one "B" row per year):
test %>%
arrange(Year, desc(Letter)) %>%
group_by(Year) %>%
summarise(diff = diff(Response))
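The same subsetting idea should carry over to the response ratio mentioned in the question. A minimal sketch, assuming exactly one "A" and one "B" row per year; the toy data frame and the column name ratio are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: one A and one B observation per year.
toy <- data.frame(
  Year     = rep(c(2011, 2012), times = 2),
  Letter   = rep(c("A", "B"), each = 2),
  Response = c(10, 30, 5, 60)
)

# Ratio of level A to level B within each year, staying in long format.
ratios <- toy %>%
  group_by(Year) %>%
  summarise(ratio = Response[Letter == "A"] / Response[Letter == "B"])

ratios
```

With several species, adding the species column to group_by() alongside Year would be the natural extension.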

Cumsum w/ panel data: different start dates

I am trying to find the cumsum across different types of contracts. Each has a unique stop (i.e. delivery) date, with several months of expected delivery leading up to that date. I need to calculate the cumsum of all expected deliveries before the actual delivery date.
For some reason the cumsum/rollsum functions are not working. I have tried both DT and dplyr versions, but both have failed.
Here is a simplified data set for the problem I am working on.
df <- data.frame(report_year = c(rep(2017,10), rep(2018,10)),
report_month = c(seq(1,5,1), seq(2,6,1), seq(3,7,1), seq(2,6,1)),
delivery_year = c(rep(2017,10), rep(2018,10)),
delivery_month = c(rep(5,5),rep(6,5), rep(7,5), rep(6,5)),
sum = c(rep(seq(100,500,100), 4)),
cumsum = c(rep(c(100,300,600,1000,1500),4)))
The first 5 columns are what I currently have.
I am trying to get the last column (i.e. cumsum).
I am probably doing something wrong. Any help is appreciated.
The question did not specify which grouping columns to use, so this may need to be modified slightly depending on what you want, but the following does it without any packages:
df$cumsum <- NULL # remove the result from df shown in question
transform(df, cumsum = ave(sum, delivery_year, delivery_month, FUN = cumsum))
Note that although the above works, you may run into problems using sum and cumsum as column names because they clash with the functions of the same name, so you might want to use Sum and Cumsum instead. For example, if you don't null out the cumsum column as we did above, then FUN = cumsum inside transform() will pick up the cumsum column, which is not a function.
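A minimal sketch of the renaming suggested above, assuming the grouping is by delivery_year and delivery_month and that rows should be accumulated in report order within each group (the explicit sort is an addition, not part of the original answer):

```r
# Same data as in the question, with the value column renamed to Sum
# so it no longer shadows the base function sum().
df <- data.frame(
  report_year    = c(rep(2017, 10), rep(2018, 10)),
  report_month   = c(1:5, 2:6, 3:7, 2:6),
  delivery_year  = c(rep(2017, 10), rep(2018, 10)),
  delivery_month = c(rep(5, 5), rep(6, 5), rep(7, 5), rep(6, 5)),
  Sum            = rep(seq(100, 500, 100), 4)
)

# Sort so that each contract's reports are in chronological order.
df <- df[order(df$delivery_year, df$delivery_month,
               df$report_year, df$report_month), ]

# Running total within each delivery_year/delivery_month group.
df$Cumsum <- ave(df$Sum, df$delivery_year, df$delivery_month, FUN = cumsum)
df
```

Because no column is called cumsum here, FUN = cumsum refers unambiguously to the base function.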
Use arrange and mutate
# Import library
library(dplyr)
# Calculating cumsum
df %>%
group_by(delivery_year, delivery_month) %>%
arrange(sum) %>%
mutate(cs = cumsum(sum))
Output
report_year report_month delivery_year delivery_month sum cumsum cs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 1 2017 5 100 100 100
2 2017 2 2017 6 100 100 100
3 2018 3 2018 7 100 100 100
4 2018 2 2018 6 100 100 100
5 2017 2 2017 5 200 300 300
6 2017 3 2017 6 200 300 300
7 2018 4 2018 7 200 300 300

Subset a dataframe by unique combination of values from another dataframe in R

I have a large dataframe A similar to the following and a second one, B, containing only lat/lon values.
What I am trying to do is to subset dataframe A based on the unique combinations of lat/lon from dataframe B.
So far, I have tried the following, but it does not work.
How should I change my code in order to effectively do this?
head(A)
vals time lon lat mo year
1 5 1978-11-01 100 32 01 1988
2 3 1978-11-02 100 45 02 1988
3 3 1978-11-03 100 45 01 1998
4 9 1978-11-04 100 50 05 1998
5 1 1978-11-05 100 60 05 1998
6 4 1978-11-06 100 32 05 1998
A_subset <- subset(A, A[, "lon"] %in% B$lon | A[, "lat"] %in% B$lat)
Consider running an expand.grid on data frame B for all combination of unique coordinates. Then merge to data frame A:
B_all_combns <- expand.grid(lon = unique(B$lon), lat = unique(B$lat))
A_subset <- merge(A, B_all_combns, by=c("lon", "lat"))
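Note that expand.grid() builds every lon/lat cross-combination, including pairs that never occur together in B. If only the pairs actually present in B should match, a sketch along these lines may be closer to the goal; the contents of B below are made up for illustration, since the question does not show them:

```r
# A is loosely based on the head(A) output in the question.
A <- data.frame(
  vals = c(5, 3, 3, 9, 1, 4),
  lon  = c(100, 100, 100, 100, 100, 100),
  lat  = c(32, 45, 45, 50, 60, 32)
)

# Hypothetical B: (100, 32) appears twice, (101, 45) once.
B <- data.frame(lon = c(100, 100, 101), lat = c(32, 32, 45))

# Keep each lon/lat pair of B once, then inner-join on both columns.
B_pairs  <- unique(B[, c("lon", "lat")])
A_subset <- merge(A, B_pairs, by = c("lon", "lat"))
A_subset
```

Here the rows of A at (100, 45) are not returned, because that pair never occurs in B even though lon 100 and lat 45 each appear in it separately.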

From panel data to cross-sectional data using averages

I am very new to R so I am not sure how basic my question is, but I am stuck at the following point.
I have data that has a panel structure, similar to this
Country Year Outcome Country-characteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60
For some reason I need to put this in a cross-sectional structure such that I get averages over all years for each country; that is, in the end it should look like this:
Country Outcome Country-Characteristic
A 12 40
B 11 60
Has anybody faced a similar problem? I was playing with lapply(table$country, table$outcome, mean) but that did not work as I wanted it.
Two tips: 1- When you ask a question, you should provide a reproducible example for the data too (as I did with read.table below). 2- It's not a good idea to use "-" in column names. You should use "_" instead.
You can get a summary using the dplyr package:
df1 <- read.table(text="Country Year Outcome Countrycharacteristic
A 1990 10 40
A 1991 12 40
A 1992 14 40
B 1991 10 60
B 1992 12 60", header=TRUE, stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(Country) %>%
summarize(Outcome=mean(Outcome),Countrycharacteristic=mean(Countrycharacteristic))
# A tibble: 2 x 3
Country Outcome Countrycharacteristic
<chr> <dbl> <dbl>
1 A 12 40
2 B 11 60
We can do this in base R with aggregate, dropping the Year column (df1[-2]) so that only the remaining columns are averaged:
aggregate(.~Country, df1[-2], mean)
# Country Outcome Countrycharacteristic
#1 A 12 40
#2 B 11 60
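If dropping columns by position feels fragile, a variant that names the columns to average explicitly might look like this (a sketch using the same example data, rebuilt with data.frame() so the block runs on its own):

```r
df1 <- data.frame(
  Country               = c("A", "A", "A", "B", "B"),
  Year                  = c(1990, 1991, 1992, 1991, 1992),
  Outcome               = c(10, 12, 14, 10, 12),
  Countrycharacteristic = c(40, 40, 40, 60, 60),
  stringsAsFactors      = FALSE
)

# cbind() on the left-hand side lists exactly which columns get averaged.
res <- aggregate(cbind(Outcome, Countrycharacteristic) ~ Country,
                 data = df1, FUN = mean)
res
```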

R: subsetting all observations of individuals that have one matched observation

Sorry for another dang subsetting question; I just can't find this case described, though it must be common. Boiled-down data looks like this:
Plot Year BA
A 1980 44
A 1990 54
A 2000 66
B 1980 58
B 1990 69
B 2000 80
I want all observations for any plot with BA < 50 in 1980 -- in the above, all three A rows. I understand subset(Df, BA<50 & Year==1980) but can't figure out the next level of indexing.
Also if anyone has a better way to phrase the title I'll change it. Every way I could think of to search on only turned up the &/| questions. (So many &/| questions...)
Index your condition on Plot, checking membership with %in% in case there is more than one Plot satisfying the condition in the real data.
subset(df, Plot %in% unique(Plot[BA < 50 & Year == 1980]))
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
Or with standard evaluation [ subsetting,
df[with(df, Plot %in% unique(Plot[BA < 50 & Year == 1980])), ]
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
Another option uses dplyr. This assumes there is only one 1980 record for each plot; otherwise you may want to wrap the condition in all() or any(), depending on your desired logic:
library(dplyr)
df %>% group_by(Plot) %>% filter(BA[Year == 1980] < 50)
# Source: local data frame [3 x 3]
# Groups: Plot [1]
# Plot Year BA
# <fctr> <int> <int>
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
When some plots have multiple 1980 records, the logic of @DirtySockSniffer's answer is equivalent to df %>% group_by(Plot) %>% filter(any(BA[Year == 1980] < 50)) in dplyr.
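To make the any()/all() distinction concrete, here is a small base R sketch on made-up data in which a hypothetical plot C has two 1980 records, one below 50 and one above:

```r
df <- data.frame(
  Plot = rep(c("A", "B", "C"), each = 3),
  Year = c(1980, 1990, 2000, 1980, 1990, 2000, 1980, 1980, 2000),
  BA   = c(44, 54, 66, 58, 69, 80, 45, 55, 70),
  stringsAsFactors = FALSE
)

is1980 <- df$Year == 1980

# "any" logic: plots with at least one 1980 record below 50.
any_plots <- unique(df$Plot[is1980 & df$BA < 50])

# "all" logic: plots whose 1980 records are all below 50.
all_ok    <- tapply(df$BA[is1980] < 50, df$Plot[is1980], all)
all_plots <- names(all_ok)[all_ok]

df[df$Plot %in% any_plots, ]   # rows for plots A and C
df[df$Plot %in% all_plots, ]   # rows for plot A only
```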
We can use data.table
library(data.table)
setDT(df)[, if(all(BA[Year == 1980] < 50)) .SD, by = Plot]
# Plot Year BA
#1: A 1980 44
#2: A 1990 54
#3: A 2000 66
