How to summarize based on multiple columns in R? - r

I want to summarize the dataset based on "year", "months", and "subdist_id" columns. For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years. For example, for subdist_id 81, the mean Rainfall value of 2004 will be the mean Rainfall of months 11, 12 of 2004, and months 1,2 of 2005.
I am getting no clue how to do it, although I searched online rigorously.

Expanding on #Bloxx's answer and incorporating my comment:
# Set up example data frame:
df = data.frame(year=c(rep.int(2004,2),rep.int(2005,4)),
month=((0:5%%4)-2)%%12+1,
Rainfall=seq(.5,by=0.15,length.out=6))
Now use mutate to create year2 variable:
df %>% mutate(year2 = year - (month<3)*1) # or similar depending on the problem specs
And now apply the groupby/summarise action:
df %>% mutate(year2 = year - (month<3)*1) %>%
group_by(year2) %>%
summarise(Rainfall = mean(Rainfall))

Lets assume your dataset is called df. Is this what you are looking for?
df %>% group_by(subdist_id, year) %>% summarise(Rainfall = mean(Rainfall))

I think you can simply do this:
df %>% filter(months %in% c(1,2,11,12)) %>%
group_by(subdist_id, year=if_else(months %in% c(1,2),year-1,year)) %>%
summarize(meanRain = mean(Rainfall))
Output:
subdist_id year meanRain
<dbl> <dbl> <dbl>
1 81 2004 0.611
2 81 2005 0.228
Input:
df = data.frame(
subdist_id = 81,
year=c(2004,2004, 2005, 2005, 2005, 2005),
months=c(11,12,1,2,11,12),
Rainfall = c(.251,.333,.731,1.13,.111,.346)
)

Related

Counting the Number of Combinations of Unique Values Per ID

id = sample.int(50,1000, replace = TRUE)
years <- c("2010", "2011", "2013", "2014", "2015")
year<- sample(years, 1000, prob=c(0.2,0.2,0.2,0.2, 0.2), replace = TRUE)
my_data = data.frame(id, year)
my_data <- my_data[order(id),]
For example, patient # 1 visited the hospital twice in 2010, once in 2011, no times in 2012 ... patient # 2 visited the hospital no times in 2010, 5 times in 2011, 3 times in 2012, etc.
For this dataset, I want to find out the number of the number of times each combination of "years" appears (not including duplicates). For example - since there are 5 possible years, there will be 2^5 = 32 possible combinations :
# sample output of the final result
combinations frequency
2010 11
2011 9
2012 5
2013 1
2014 19
2015 11
2011,2012 9
2011, 2012, 2013 5
2013, 2015 1
2010,2011,2012,2013,2014,2015 19
This would be the equivalent of finding out:
How many patients visited the hospital only in 2010?
How many patients visited the hospital in 2010 and 2013?
etc.
I tried to find a way to do this different ways:
# Method 1: Did not work
library(data.table)
final = setDT(my_data)[,list(Count=.N) ,names(my_data)]
# Method 2: Did not work
final = = aggregate(.~year,rev(aggregate(.~id,my_data,paste0,collapse = ",")),length)
# Method 3: Not sure if this is correct? There are 50 rows (instead of 32) and all frequency counts are 1?
library(dplyr)
library(stringr)
df = my_data %>%
group_by(id) %>%
summarise(years = str_c(year, collapse=",")) %>%
count(years)
My Question: Can someone please show me how to fix this?
In base R:
table(tapply(my_data$year, my_data$id,
function(x) paste(sort(unique(x)), collapse=',')))
I think you want something like this:
library(dplyr)
my_data %>%
group_by(id, year) %>%
summarise(frequency = n()) %>%
group_by(id) %>%
summarise(combinations = str_c(year, collapse=",")) %>%
count(combinations)
Output:
`summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
# A tibble: 2 × 2
combinations n
<chr> <int>
1 2010,2011,2013,2014,2015 49
2 2010,2011,2013,2015 1
This means that there is one patient, which does not went to the hospital in 2014. And it seems that all patient do not went to the hospital in 2012.
By using this code, we can check which patient did not went to the hospital in 2014:
my_data %>%
group_by(id) %>%
filter(!all(2014 %in% year)) %>%
ungroup %>%
distinct(id) %>%
pull(id)
Output:
[1] 12
So patient 12 did not went to hospital in 2014.
You need to first get the combinations of years for each patient and then count them
# with data.table
setDT(my_data)
my_data_agg<-my_data[,.(combinations=paste(sort(unique(year)), collapse=",")), by=.(id)]
my_data_agg[,.N, by=.(combinations)]

matching yearly time points to preceding 365 days of data in R

I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.
For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.
There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.
I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).
Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:
For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 1982-06-09 1982 6 6-1982 aleutian_islands-5-1983 0
For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 2013-07-22 2013 7 7-2013 aleutian_islands-5-2014 0.998
The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
mutate(month_year = paste0(startmonth,"-",year)) %>%
select(region, month_year) %>%
distinct() %>%
mutate(region_month_year = paste0(region,"-",month_year))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month=seq(1, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(region_month_year = paste0(region,"-",month,"-",year)) %>% # create unique identifier
mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
month_year = paste0(month,"-",year)) %>%
select(region, month_year, ref_year) %>%
distinct() %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
month_year = paste0(month,"-",year)) %>%
left_join(surv_dat_exploded, by=c('region','month_year')) %>%
filter(!is.na(ref_year))# get rid of dates that are after any ref_year
Sounds like you want a non-equi join. This is easily done with data.table and is very fast. Here's an example that lightly modifies your MWE:
library(data.table)
# make survey dfs
ai_dat = data.table(year = c(1983, 1986, 1991, 1994, 1997),
region = "aleutian_islands", "startmonth" = 5)
ebs_dat = data.table(year = seq(1983, 1999, 1),
region = "eastern_bering_sea", "startmonth" = 5)
# bind together and create date (and cutoffdate) vars
surv_dat = rbind(ai_dat, ebs_dat)
surv_dat[, startdate := as.IDate(paste(year, startmonth, '01', sep = '-'))
][, cutoffdate := startdate - 365L]
# make temperature df
temp_dat = CJ(date=seq(as.IDate("1982-01-01"), as.IDate("1999-12-31"), "days"),
region=c('aleutian_islands','eastern_bering_sea'))
# add temperature var
temp_dat$temp = rnorm(nrow(temp_dat))
# create duplicate date variable (will make post-join processing easier)
temp_dat[, matchdate := date]
# Optional: Set keys for better join performance
setkey(surv_dat, region, startdate)
setkey(temp_dat, region, matchdate)
# Where the magic happens: Non-equi join
surv_dat = temp_dat[surv_dat, on = .(region == region,
matchdate <= startdate,
matchdate >= cutoffdate)]
# Optional: get rid of unneeded columns
surv_dat[, c('matchdate', 'matchdate.1') := NULL][]
#> date region temp year startmonth
#> 1: 1982-05-01 aleutian_islands 0.3680810 1983 5
#> 2: 1982-05-02 aleutian_islands 0.8349334 1983 5
#> 3: 1982-05-03 aleutian_islands -1.3622227 1983 5
#> 4: 1982-05-04 aleutian_islands 1.4327587 1983 5
#> 5: 1982-05-05 aleutian_islands 0.5068226 1983 5
#> ---
#> 8048: 1999-04-27 eastern_bering_sea -1.2924594 1999 5
#> 8049: 1999-04-28 eastern_bering_sea 0.7519078 1999 5
#> 8050: 1999-04-29 eastern_bering_sea -1.0185174 1999 5
#> 8051: 1999-04-30 eastern_bering_sea -1.4322252 1999 5
#> 8052: 1999-05-01 eastern_bering_sea -1.0412836 1999 5
Created on 2021-05-20 by the reprex package (v2.0.0)
Try this solution.
I basically used your reference column to generate a ref_date and estimate the difference in days between the observation and reference. Then, I used a simple ifelse to test if the dates fall within the 365 days range and then copy them to the temp_valid column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <-
rbind(ai_dat, ebs_dat) %>%
mutate(year_month = paste0(year,"-",startmonth),
region_year_month = paste0(region,"-",year,"-",startmonth))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <-
expand.grid(month=seq(01, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(year_month = paste0(year,"-",month)) %>%
mutate(region_year_month = paste0(region,"-",year,"-",month)) %>%
mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month,NA)) %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
year_month = paste0(year,"-",month))
final_df <-
left_join(temp_dat, surv_dat_exploded, by=c('region','year_month')) %>%
#split ref_column in ref_year and ref_region
separate(ref_year, c("ref_region","ref_year"), "-", extra="merge") %>%
#convert ref_year into date
mutate_at("ref_year", as.Date, format= "%Y-%M") %>%
#round it down to be in the first day of the month (not needed if the day matters)
mutate_at("ref_year", floor_date, "month" ) %>%
#difference between observed and the reference
mutate(diff_days = date - ref_year) %>%
# ifelse statement for capturing values of interest
mutate(temp_valid = ifelse(between(diff_days, -365, 0),temperature,NA))

R - Filter data by month

I apologize for my bad English, but I really need your help.
I have a .csv dataset with two columns - year and value. There is data about height of precipitation monthly from 1900 to 2019.
It looks like this:
year value
190001 100
190002 39
190003 78
190004 45
...
201912 25
I need to create two new datasets: the first one with the data for every year from July (07) to September (09) and the second one from January (01) to March (03).
Also I need to summarize this data for every year (it means I need only one value per year).
So I have data for summer 1900-2019 and winter 1900-2019.
You can use the dplyr and stringr packages to achive what you need. I created a mock data set first:
library(dplyr)
library(stringr)
df <- data.frame(time = 190001:201219, value=runif(length(190001:201219), 0, 100))
After that, we create two separate columns for month and year:
df$year <- as.numeric(str_extract(df$time, "^...."))
df$month <- as.numeric(str_extract(df$time, "..$"))
At this point, we can filter:
df_1 <- df %>% filter(between(month,7,9))
df_2 <- df %>% filter(between(month,1,3))
... and summarize:
df <- df %>% group_by(year) %>% summarise(value = sum(value))
library(tidyverse)
dat <- tribble(
~year, ~value,
190001, 100,
190002, 39,
190003, 78,
190004, 45)
Splitting the year variable into a month and year variable:
dat_prep <- dat %>%
mutate(month = str_remove(year, "^\\d{4}"), # Remove the first 4 digits
year = str_remove(year, "\\d{2}$"), # Remove the last 2 digits
across(everything(), as.numeric))
dat_prep %>%
filter(month %in% 7:9) %>% # For months Jul-Sep. Repeat with 1:3 for Jan-Mar
group_by(year) %>%
summarize(value = sum(value))

Convert aggregate result into summary table [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
given this data frame that is the result of a sum aggregation on the data.
how to transform this into the usual table as the one result of table in order to plot it properly.
for more clear picture of the desired output, it should be something like this:
Mountain Bikes Road Bikes
2005 130694 708713
2006 168445 1304031
2007 0 56112
I even tried something silly like calculating the value individually then combining them. but it's still a data frame so it think of the first column to be values instead of headers.
A solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
group_by(Category, Year) %>%
summarize(SUM = sum(x)) %>%
spread(Category, SUM, fill = 0)
dat2
# # A tibble: 3 x 3
# Year `Mountain Bikes` `Road Bikes`
# <dbl> <dbl> <dbl>
# 1 2005 130694 708713
# 2 2006 168445 1304031
# 3 2007 0 561122
DATA
dat <- data.frame(Category = paste(c("Mountain", "Road", "Mountain",
"Road", "Road"), "Bikes", sep = " "),
Year = c(2005, 2005, 2006, 2006, 2007),
x = c(130694, 708713, 168445, 1304031, 561122))

Translate Stata to R: keeping values using information of two different columns

I have a data frame like this:
df <- data.frame(date= c("2011-11-01", "2011-11-01", "2011-11-01", "2011-11-01"),
reference_year=c(2011, 2012, 2013, 2014),
mean=c(6.49, 5.55, 5.05, 4.87))
So I would like to create a new data frame with the mean in the cases where the year of the date (2011) be equal to year of the date + 1 (2012).
Using Stata I did just using this code:
gen eventtime=date(date, "YMD")
gen day=day(eventtime)
gen month=month(eventtime)
gen yr=year(eventtime)
keep if reference_year == yr+1
collapse (first) mean date, by(eventtime)
However, as a R beginner, I would like to do in R.
As an R beginner, the following may not make a whole lot of sense. But essentially, I'm splitting the date variable into 3 variables (year, month, day) and then I filter to the reference_year - 1. The %>% is called a "pipe" and can be read as "and then do this".
library(tidyverse)
df <- data.frame(date= c("2011-11-01", "2011-11-01", "2011-11-01", "2011-11-01"),
reference_year=c(2011, 2012, 2013, 2014),
mean=c(6.49, 5.55, 5.05, 4.87))
new_df <- df %>%
separate(date, c("year", "month", "day"), sep = "-") %>%
filter(year == (reference_year - 1))
#> year month day reference_year mean
#> 1 2011 11 01 2012 5.55

Resources