Summing a dataframe based on another dataframe - r

I have daily rainfall data from 10 locations across 10 years:
set.seed(123)
df <- data.frame(loc.id = rep(1:10, each = 10*365),
                 years = rep(rep(2001:2010, each = 365), times = 10),
                 day = rep(rep(1:365, times = 10), times = 10),
                 rain = runif(10*10*365, min = 0, max = 35))
I have a separate data frame that gives, for each location and year, certain days between which I want to sum the rainfall in df:
df.ref <- data.frame(loc.id = rep(1:10, each = 10),
years = rep(2001:2010,times = 10),
index1 = rep(250,times = 10*10),
index2 = sample(260:270, size = 10*10,replace = T),
index3 = sample(280:290, size = 10*10,replace = T),
index4 = sample(291:300, size= 10*10,replace = T))
df.ref
loc.id years index1 index2 index3 index4
1: 1 2001 250 264 280 296
2: 1 2002 250 269 284 298
3: 1 2003 250 268 289 293
4: 1 2004 250 266 281 295
5: 1 2005 250 260 289 293
What I want is, for each row in df.ref, to use the index values and
sum the rainfall in df between index1 and index2, index1 and index3, and index1 and index4. For example:
Using df.ref, for loc.id == 1 and years == 2001, sum the rainfall in df from day 250 to 264, day 250 to 280, and day 250 to 296 (as shown in df.ref).
Similarly, for loc.id == 1 and years == 2002, sum the rainfall from day 250 to 269, day 250 to 284, and day 250 to 298.
I did this:
library(dplyr)
ptm <- proc.time()
dat <- df.ref %>% left_join(df)
index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1 = sum(rain))
index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))
index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))
all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)
proc.time() - ptm
user system elapsed
2.36 0.64 3.06
I am looking to make my code faster, since my actual df.ref is quite large. Could anyone advise me how to make this quicker?

A non-equi join from the data.table package can be both faster and more memory efficient than dplyr::left_join (slide | video).
For each row in df.ref, find all the rain values in df whose day falls between the lower and upper index. Then sum rain by loc.id and years.
library(data.table)
setDT(df)
setDT(df.ref)

df1 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index2)
                 , by = .EACHI][
  , sum_1 := sum(rain), by = .(loc.id, years)][
  # the non-equi join leaves two columns named "day" (the index bounds),
  # so drop "day" twice, then drop rain, leaving one row per group
  , day := NULL][
  , day := NULL][
  , rain := NULL])
df2 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index3)
                 , by = .EACHI][
  , sum_2 := sum(rain), by = .(loc.id, years)][
  , day := NULL][
  , day := NULL][
  , rain := NULL])
df3 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index4)
                 , by = .EACHI][
  , sum_3 := sum(rain), by = .(loc.id, years)][
  , day := NULL][
  , day := NULL][
  , rain := NULL])
Merge all three data.tables together
df1[df2, on = .(loc.id, years)][
df3, on = .(loc.id, years)]
loc.id years sum_1 sum_2 sum_3
1: 1 1950 104159.11 222345.4 271587.1
2: 1 1951 118689.90 257450.2 347624.3
3: 1 1952 99262.27 212923.7 280877.6
4: 1 1953 72435.50 192072.7 251593.6
5: 1 1954 104021.19 242525.3 326463.4
6: 1 1955 93436.32 232653.1 304921.4
7: 1 1956 89122.79 190424.4 255535.0
8: 1 1957 135658.11 262918.7 346361.4
9: 1 1958 80064.18 220454.8 292966.4
10: 1 1959 114231.19 273181.0 349489.2
11: 2 1950 94360.69 238296.8 301751.8
12: 2 1951 93845.50 195273.7 289686.0
13: 2 1952 107692.53 245019.4 308093.7
14: 2 1953 86650.14 257225.1 332674.1
15: 2 1954 104085.83 238859.4 286350.7
16: 2 1955 101602.16 223107.3 300958.4
17: 2 1956 73912.77 198087.2 276590.1
18: 2 1957 117780.86 228299.8 305348.5
19: 2 1958 98625.45 220902.6 291583.7
20: 2 1959 109851.38 266745.2 324246.8
[ reached getOption("max.print") -- omitted 81 rows ]
Compare processing time and memory used
> time_dplyr; time_datatable
user system elapsed
2.17 0.27 2.61
user system elapsed
0.45 0.00 0.69
rowname Class MB
1 dat data.frame 508
2 df3 data.table 26
3 df2 data.table 20
4 df1 data.table 9
When testing on about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish. The test data sets had the following structure:
'data.frame': 3650000 obs. of 4 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ rain : num 10.1 27.6 14.3 30.9 32.9 ...
'data.frame': 3650000 obs. of 6 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
$ index1: num 250 250 250 250 250 250 250 250 250 250 ...
$ index2: int 270 265 262 267 266 265 262 268 260 268 ...
$ index3: int 290 287 286 289 281 285 286 285 284 283 ...
$ index4: int 298 297 296 295 298 294 296 298 298 300 ...
> time_dplyr; time_datatable
user system elapsed
95.010 33.704 128.722
user system elapsed
26.175 3.147 29.312
rowname Class MB
1 dat data.frame 50821
2 df3 data.table 2588
3 df2 data.table 2004
4 df1 data.table 888
5 df.ref data.table 97
6 df data.table 70
If I increased the number of years to 150, dplyr failed even on an HPC cluster node with 256 GB of RAM:
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted

Here's a starting point that will be much faster. It should be trivial to figure out the rest.
library(data.table)
setDT(df)
df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
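For what it's worth, here is a sketch of how that starting point might be extended to all three windows and merged; the windows helper and the sum_1/sum_2/sum_3 names are illustrative choices, not part of the answer above:
library(data.table)
setDT(df); setDT(df.ref)

# one non-equi join per upper bound, then merge the three results
windows <- c("index2", "index3", "index4")
sums <- lapply(seq_along(windows), function(i) {
  res <- df[df.ref,
            on = c("loc.id", "years", "day>=index1", paste0("day<=", windows[i])),
            .(rain = sum(rain)), by = .EACHI][, .(loc.id, years, rain)]
  setnames(res, "rain", paste0("sum_", i))
})
Reduce(function(x, y) x[y, on = .(loc.id, years)], sums)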

Related

Efficient data.table method to generate additional rows given random numbers

I have a large data.table for which I want to generate a random number (using two columns) and perform a calculation. Then I want to repeat this step 1,000 times. I am looking for a way to do this efficiently without a loop.
Example data:
> dt <- data.table(Group=c(rep("A",3),rep("B",3)),
Year=rep(2020:2022,2),
N=c(300,350,400,123,175,156),
Count=c(25,30,35,3,6,8),
Pop=c(1234,1543,1754,2500,2600,2400))
> dt
Group Year N Count Pop
1: A 2020 300 25 1234
2: A 2021 350 30 1543
3: A 2022 400 35 1754
4: B 2020 123 3 2500
5: B 2021 175 6 2600
6: B 2022 156 8 2400
> dt[, rate := rpois(.N, lambda=Count)/Pop*100000]
> dt[, value := N*(rate/100000)]
> dt
Group Year N Count Pop rate value
1: A 2020 300 25 1234 1944.8947 5.8346840
2: A 2021 350 30 1543 2009.0732 7.0317563
3: A 2022 400 35 1754 1938.4265 7.7537058
4: B 2020 123 3 2500 120.0000 0.1476000
5: B 2021 175 6 2600 115.3846 0.2019231
6: B 2022 156 8 2400 416.6667 0.6500000
I want to be able to do this calculation for value 1,000 times, and keep all instances (with an indicator column for 1-1,000 indicating which run) without using a loop. Any suggestions?
Maybe you can try replicate like below
n <- 1000
dt[, paste0(c("rate", "value"), rep(1:n, each = 2)) :=
     replicate(n, list(u <- rpois(.N, lambda = Count) / Pop * 100000,
                       N * (u / 100000)))]
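If you would rather keep the runs stacked in long format with an indicator column (as the question describes), a sketch along these lines should also work; the run column name is an assumption:
library(data.table)

n <- 1000
# simulate n times, tagging each copy of dt with a run indicator 1..n
out <- rbindlist(lapply(seq_len(n), function(i) {
  tmp <- copy(dt)
  tmp[, rate := rpois(.N, lambda = Count) / Pop * 100000]
  tmp[, value := N * (rate / 100000)]
  tmp[, run := i]
  tmp
}))
This trades some speed for a tidy result that is easy to summarise by run.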

How to calculate total and percentage while accounting for another column in R?

All,
Thanks in advance. I have this school dataset. Each category (in the Category column) covers a range of student counts (e.g., from 30 to 60 students), so I need to calculate:
the total number of classrooms that fall in each category (from category 1 to category 4), and
the percentage of classrooms that fall in the category.
For example, how many classrooms (NumOfClassrooms column) fall in Category_4, and what's the percentage of those classrooms to the total classrooms? Here is an illustrative example for my question:
ID = 1:1050
District = rep(c("AR", "CO", "AL", "KS", "IN", "ME", "KY", "ME", "MN", "NJ"), times = c(80, 120, 100, 110, 120, 100, 100, 120, 100, 100))
schoolName = randomNames::randomNames(1050, ethnicity = 5 ,which.names = "last")
Grade = rep(c("First", "Second", "Third", "Fourth"), times = c(400, 300, 200, 150))
NumOfClassrooms = sample(1:6)
StudentNumber = sample(1:90, 5)
AverageNumOfStudents = StudentNumber/NumOfClassrooms
Category = ifelse(AverageNumOfStudents > 0 & AverageNumOfStudents < 10, "category_1",
ifelse(AverageNumOfStudents >=10 & AverageNumOfStudents < 30, "category_2",
ifelse(AverageNumOfStudents >=30 & AverageNumOfStudents <= 60, "category_3",
ifelse(AverageNumOfStudents > 60 , "category_4", "NA"))))
dat = data.frame(ID, schoolName, Grade, NumOfClassrooms, StudentNumber, AverageNumOfStudents, Category)
Finally, I need to divide the results based on the "District" column into separate excel files using the following code (it should work fine once I get the above two steps).
Final_Divide = Final_df %>%
dplyr::group_by(District) %>%
dplyr::ungroup()
list_data <- split(Final_Divide,
Final_Divide$District)
options(digits=3)
Map(openxlsx::write.xlsx, list_data, paste0(names(list_data), '.xlsx'))
Thank you very much in advance.
Setting a random seed before your code for reproducibility:
set.seed(42)
# Your code creating dat
Table1 <- xtabs(NumOfClassrooms~Category, dat)
Table1
# Category
# category_1 category_2 category_4
# 1925 1575 175
Table2 <- prop.table(Table1)
round(Table2, 4) # Proportions
# Category
# category_1 category_2 category_4
# 0.5238 0.4286 0.0476
round(Table2 * 100, 2) # Percent
# Category
# category_1 category_2 category_4
# 52.38 42.86 4.76
If we include District in dat:
dat <- data.frame(ID, District, schoolName, Grade, NumOfClassrooms, StudentNumber, AverageNumOfStudents, Category)
Table3 <- xtabs(NumOfClassrooms~District+Category, dat)
addmargins(Table3)
# Category
# District category_1 category_2 category_4 Sum
# AL 187 149 16 352
# AR 143 121 14 278
# CO 220 180 20 420
# IN 220 180 20 420
# KS 198 166 19 383
# KY 187 148 17 352
# ME 407 329 36 772
# MN 176 153 17 346
# NJ 187 149 16 352
# Sum 1925 1575 175 3675
For row percentages by District:
round(prop.table(Table3, 1) * 100, 2)
# Category
# District category_1 category_2 category_4
# AL 53.12 42.33 4.55
# AR 51.44 43.53 5.04
# CO 52.38 42.86 4.76
# IN 52.38 42.86 4.76
# KS 51.70 43.34 4.96
# KY 53.12 42.05 4.83
# ME 52.72 42.62 4.66
# MN 50.87 44.22 4.91
# NJ 53.12 42.33 4.55
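If you need those row percentages as a plain data frame (for example, to feed the per-district export below), one option is base R's as.data.frame.matrix():
# convert the 2-D table of percentages into a data frame, one row per District
pct_by_district <- as.data.frame.matrix(round(prop.table(Table3, 1) * 100, 2))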
Here's a possible solution using the tidyverse
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100)
This will give us:
# Groups: Category [3]
ID District schoolName Grade NumOfClassrooms StudentNumber AverageNumOfStude~ Category `Total Classroom~ `Number of Classrooms in~ `Category Percent~
<int> <chr> <chr> <chr> <int> <int> <dbl> <chr> <int> <int> <dbl>
1 1 AR Svyatetskiy First 5 87 17.4 category~ 1050 525 50
2 2 AR Booco First 1 79 79 category~ 1050 175 16.7
3 3 AR Jones First 6 49 8.17 category~ 1050 350 33.3
4 4 AR Sapkin First 3 5 1.67 category~ 1050 350 33.3
5 5 AR Fosse First 2 35 17.5 category~ 1050 525 50
6 6 AR Vanwagenen First 4 87 21.8 category~ 1050 525 50
7 7 AR Orth First 5 79 17.4 category~ 1050 525 50
8 8 AR Moline First 1 49 79 category~ 1050 175 16.7
9 9 AR Bradford First 6 5 8.17 category~ 1050 350 33.3
10 10 AR Wollman First 3 35 1.67 category~ 1050 350 33.3
# ... with 1,040 more rows
If you need a separate table of just the category/# classrooms/percentage data:
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100) %>%
select(Category, "Number of Classrooms in Category", "Category Percentage") %>%
unique()
This gives us:
# A tibble: 3 x 3
# Groups: Category [3]
Category `Number of Classrooms in Category` `Category Percentage`
<chr> <int> <dbl>
1 category_2 525 50
2 category_4 175 16.7
3 category_1 350 33.3
Note that in your post, this code is a bit redundant:
Final_Divide = Final_df %>%
dplyr::group_by(District) %>%
dplyr::ungroup()
If you group and then immediately ungroup, you're actually just doing this:
Final_Divide <- Final_df
You could also consider adding split(.$District) to transform your data into a list all in one chunk of code:
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100) %>%
split(.$District)
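To then get one workbook per district, the split can be fed straight into the Map()/openxlsx pattern from the question:
list_data <- dat %>%
  mutate("Total Classrooms" = n()) %>%
  group_by(Category) %>%
  mutate("Number of Classrooms in Category" = n(),
         "Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100) %>%
  ungroup() %>%
  split(.$District)

Map(openxlsx::write.xlsx, list_data, paste0(names(list_data), '.xlsx'))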

Create a table out of a tibble

I have the following dataframe with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum))
Logically, I obtained 348 different rows as a tibble. Now I am struggling to create a new table with these values; it should look similar to my tibble. How can I do that? Is there even a way? I've already tried to read in an Excel file with a date range from 01/1992 to 01/2021 in order to obtain exactly 349 rows and then merge it with the rows of the tibble, but it did not work.
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date, end_date){
  ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
  data.frame(start = pmax(start_date, as.Date(ym)),
             end = pmin(end_date, as.Date(ym, frac = 1)),
             month = month.name[cycle(ym)],
             year = as.integer(ym),
             stringsAsFactors = FALSE)
}
Once you create the function, you can specify the start and end date you want:
left_table <- createmonthyear(as.Date("1991-01-01"), as.Date("2021-01-01"))
Then left join the output with what you have:
library(dplyr)
right_table <- data.frame(cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum)))
# left_table's month holds month names, so align right_table's MONTH first
right_table$month <- month.name[right_table$MONTH]
results <- left_join(left_table, right_table, by = c("year" = "YEAR", "month" = "month"))
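Alternatively, here is a sketch that skips the lookup table entirely by using tidyr::complete() to add NA rows for any missing year/month combination (assuming cps_data, YEAR, MONTH and SKILL_1 as in the question; if MONTH is a labelled vector it may need converting to plain integer first):
library(dplyr)
library(tidyr)

cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise(name = sum(SKILL_1), .groups = "drop") %>%
  complete(YEAR = 1992:2021, MONTH = 1:12) %>%  # insert NA rows for missing months
  filter(YEAR < 2021 | MONTH == 1)              # keep only 01/1992 - 01/2021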

Simple water demand supply model in R

I am trying to work out a simple soil water balance in R. Here's what I need to do:
For a given year, starting from doy 200, I need to determine the soil water SW(i), which is calculated by the following formula:
`SW(i) = SW(i-1) + Rain(i) - ETo(i)`
where SW(i-1) is the water content on the previous day, Rain(i) is the rainfall, and ETo(i) is the evapotranspiration on day i.
The conditions are that SW(i) cannot be negative or exceed SW(max), which is 20.
Here's a sample data:
library(tidyverse)
set.seed(123)
dat <- tibble(
year = rep(1980:2015, each = 100),
day = rep(200:299, times = 36),
rain = sample(0:17, size = 100*36,replace = T),
eto = sample(2:9, size = 100*36,replace = T))
SW.initial <- data.frame(year= 1980:2015, SW.199 = sample(1:10, 36, replace = T))
SW.initial is the water content for doy 199 for each year.
SW.max <- 20
dat$SW.fin <- NA
Taking the example of year 1980
dat.1980 <- dat[dat$year == 1980,]
SW.initial.1980 <- SW.initial[SW.initial$year== 1980,"SW.199"]
for(doy in dat.1980$day){
  SW <- SW.initial.1980
  SW <- SW + dat.1980[dat.1980$day == doy, "rain"] - dat.1980[dat.1980$day == doy, "eto"]
  SW <- ifelse(SW < 0, 0, ifelse(SW >= SW.max, SW.max, SW))
  dat[dat$year == 1980 & dat$day == doy, "SW.fin"] <- SW
  SW.initial.1980 <- SW
}
This loop gives me the SW for each day from doy 200 through 299 using:
`SW(i) = SW(i-1) + Rain(i) - ETo(i)`
where, for doy 200, SW(i-1) is taken from the SW.initial data frame.
I can loop through all years:
for(years in unique(dat$year)){
test <- dat[dat$year == years,]
SW.in <- SW.initial[SW.initial$year == years,"SW.199"]
for(doy in test$day){
SW <- SW.in
SW <- SW + test[test$day == doy, "rain"] - test[test$day == doy, "eto"]
SW <- ifelse(SW < 0, 0, ifelse(SW >= SW.max, SW.max, SW))
dat[dat$year == years & dat$day == doy,"SW.fin"] <- SW
SW.in <- SW
}}
I really want to avoid this long loop and was wondering if there is a cleverer (and faster) way to do this.
Does this give what you want?
edit: added grouping by year
dat %>%
  group_by(year) %>%
  mutate(sw_oneless = c(NA, day[1:(length(day) - 1)]),
         sw_oneless + rain - eto)
# A tibble: 3,600 x 6
# Groups: year [36]
year day rain eto sw_oneless `sw_oneless + rain - eto`
<int> <int> <int> <int> <int> <int>
1 1980 200 5 2 NA NA
2 1980 201 14 6 200 208
3 1980 202 7 4 201 204
4 1980 203 15 5 202 212
5 1980 204 16 5 203 214
6 1980 205 0 8 204 196
7 1980 206 9 9 205 205
8 1980 207 16 6 206 216
9 1980 208 9 9 207 207
10 1980 209 8 4 208 212
# ... with 3,590 more rows
To solve the "problem" with day 200, why don't you just keep days 199-300 from your original data? You can then run my code and use na.omit() or filter again so the rows with day 199 are gone.
Or, if you can't do that, merge your SW.initial with your dat data frame.
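For the record, the clamped recurrence cannot be written as a simple lag, but the day loop can still be avoided with Reduce() and accumulate = TRUE, run once per year. A sketch, assuming dat, SW.initial and SW.max as defined in the question:
library(dplyr)

SW.max <- 20
dat2 <- dat %>%
  left_join(SW.initial, by = "year") %>%
  arrange(year, day) %>%
  group_by(year) %>%
  mutate(SW.fin = Reduce(
    # carry the previous day's water forward, then clamp to [0, SW.max]
    function(sw_prev, i) min(max(sw_prev + rain[i] - eto[i], 0), SW.max),
    seq_len(n()),
    init = SW.199[1],
    accumulate = TRUE
  )[-1]) %>%  # drop the initial value so lengths match
  ungroup()
This removes the inner day loop; the outer year loop disappears into group_by().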

Aggregation using 'by' and secondary indices in data.table

Looking through the 4th data.table vignette (secondary indices and auto indexing) here, it looks like example 2f returns the wrong month label.
library(data.table)
flights <- read.csv(url("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv"))
setDT(flights)
The example gives:
> head(flights["JFK", max(dep_delay), keyby = month, on = "origin"])
month V1
1: 1 881
2: 1 1014
3: 1 920
4: 1 1241
5: 1 853
6: 1 798
But replicating this without using secondary indices gives:
> head(flights[origin == "JFK", max(dep_delay), keyby = month])
month V1
1: 1 881
2: 2 1014
3: 3 920
4: 4 1241
5: 5 853
6: 6 798
The error can be seen by looking for the row with dep_delay == 1014
> flights[month =="1" & dep_delay == 1014]
Empty data.table (0 rows) of 17 cols: year,month,day,dep_time,dep_delay,arr_time...
> flights[month =="2" & dep_delay == 1014]
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight origin dest air_time distance hour min
1: 2014 2 21 844 1014 1151 1007 0 DL N983DL 2459 JFK MCO 139 944 8 44
Is this an error in the example code, or a data.table flaw?
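A workaround either way: subset on the secondary index first, then aggregate the filtered rows in a separate chained step (with flights converted via setDT as above), which matches the keyed-subset result shown above:
# subset on the secondary index first, then group the already-filtered rows
head(flights["JFK", on = "origin"][, max(dep_delay), keyby = month])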
