R: do multiple linear regressions in a data table

I have a data table (read from the CSV below) of voting data. What I need to know is how many votes come in per day, on average, by year, by running a linear regression of votesneeded ~ daysuntilelection. The slope of that fit is the average number of votes coming in per day.
How can I run a linear regression function over this dataframe by year?
date,year,daysuntilelection,votesneeded
2018-01-25,2018,9,40
2018-01-29,2018,5,13
2018-01-30,2018,4,-11
2018-02-03,2018,0,-28
2019-01-23,2019,17,81
2019-02-01,2019,8,-4
2019-02-09,2019,0,-44
2020-01-17,2020,22,119
2020-01-24,2020,15,58
2020-01-30,2020,9,12
2020-02-03,2020,5,-4
2020-02-07,2020,1,-12
2021-01-08,2021,29,120
2021-01-26,2021,11,35
2021-01-29,2021,8,17
2021-02-01,2021,5,-2
2021-02-03,2021,3,-8
2021-02-06,2021,0,-10
The preferred output would be a dataframe looking something like this
year averagevotesperday
2018 8.27
2019 7.40
2020 6.55
2021 4.60
note: full data sets and analyses are at https://github.com/robhanssen/glenlake-elections, for the curious.

Do you need something like this?
library(dplyr)

dat |>
  group_by(year) |>
  summarize(
    avgVoteDay = coef(lm(votesneeded ~ daysuntilelection))[2] # slope = votes per day
  )
The output differs slightly from yours:
# A tibble: 4 x 2
   year avgVoteDay
  <int>      <dbl>
1  2018       7.76
2  2019       7.40
3  2020       6.41
4  2021       4.74
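An equivalent sketch using broom, which also exposes standard errors and p-values for each slope if you need them (the column name averagevotesperday is borrowed from your desired output; group_modify and tidy are standard dplyr/broom verbs):
library(dplyr)
library(broom)

dat %>%
  group_by(year) %>%
  group_modify(~ tidy(lm(votesneeded ~ daysuntilelection, data = .x))) %>%
  ungroup() %>%
  filter(term == "daysuntilelection") %>% # keep only the slope row per year
  select(year, averagevotesperday = estimate)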

Related

Different results of `summarize` and `group_by` with different months in time-series datasets

I have daily time-series data for more than 20 years. I want to extract the quantiles (0.1, 0.5, 0.9) over a three-month window for each year, divided into JFM (Jan-Mar), FMA (Feb-Apr), and so on up to OND (Oct-Dec). As a newbie in R, after many days of research over the past two weeks, I finally found a method to do this. However, I am stuck at the final step.
Actually, I am working with lists. But, for example, let's say we have this dataframe:
library(lubridate)
Date <- seq.Date(ymd(19700101), ymd(19721231), "day")
Q <- runif(ymd(19730101) - ymd(19700101), 1, 20)
df <- data.frame(Date, Q)
Now we subset df to keep only the three months of interest (in this case JFM and FMA):
df.JFM <- df[months(df$Date) %in% month.name[1:3], ] # cutting Jan-Mar
df.FMA <- df[months(df$Date) %in% month.name[2:4], ] # cutting Feb-Apr
Then, to find the 50% quantile for each three-month series, I use this method:
library(dplyr)
df.JFM %>%
  group_by(Year = floor_date(Date, "3 months")) %>%
  summarize(Q = quantile(Q, 0.5, na.rm = TRUE))
# A tibble: 3 x 2
  Year           Q
  <date>     <dbl>
1 1970-01-01  8.83
2 1971-01-01  9.88
3 1972-01-01 11.3
No issue with the JFM set. Let's look at the FMA set:
df.FMA %>%
  group_by(Year = floor_date(Date, "3 months")) %>%
  summarize(Q = quantile(Q, 0.5, na.rm = TRUE))
# A tibble: 6 x 2
  Year           Q
  <date>     <dbl>
1 1970-01-01  8.75
2 1970-04-01 13.5
3 1971-01-01  8.58
4 1971-04-01 13.2
5 1972-01-01 10.2
6 1972-04-01  8.29
Here we find that floor_date rounds the February dates down to January of the same year. I expected that, after cutting the data with February as the first element in the Date column, floor_date would start from February. Apparently not. I have tried other three-month series and they give the same result as the FMA set. I also tried changing the dataframe's index back to the original index from before the subset/cut, but no luck.
How can I solve this problem?
Other methods of obtaining quantiles for a given period of the year (in the sense of my aim described at the beginning of the post) are also very welcome.
Thank you.
Here, floor_date/ceiling_date always rounds in 3-month steps counted from the start of the year, not from the dates present in the data.
You can use cut instead, which starts its 3-month bins at the earliest date in the data, as your requirement demands.
library(dplyr)
df.JFM %>%
  group_by(Year = cut(Date, "3 months")) %>%
  summarize(Q = quantile(Q, 0.5, na.rm = TRUE))
#   Year           Q
#   <fct>      <dbl>
# 1 1970-01-01 11.0
# 2 1971-01-01 11.5
# 3 1972-01-01  9.57

df.FMA %>%
  group_by(Year = cut(Date, "3 months")) %>%
  summarize(Q = quantile(Q, 0.5, na.rm = TRUE))
#   Year           Q
#   <fct>      <dbl>
# 1 1970-02-01 11.3
# 2 1971-02-01 10.5
# 3 1972-02-01  9.67
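Since the original goal was the 0.1, 0.5 and 0.9 quantiles, the same cut() grouping extends naturally; a sketch (the Q10/Q50/Q90 column names are mine):
library(dplyr)

df.FMA %>%
  group_by(Year = cut(Date, "3 months")) %>%
  summarize(Q10 = quantile(Q, 0.1, na.rm = TRUE),
            Q50 = quantile(Q, 0.5, na.rm = TRUE),
            Q90 = quantile(Q, 0.9, na.rm = TRUE))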

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate differences between values in my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sep, etc., for each well by year. Note that in some years some months are missing. Any tidyverse solutions would be appreciated.
    Well  year month     value
   <dbl> <dbl> <fct>     <dbl>
 1   222  1995 February   8.53
 2   222  1995 March      8.69
 3   222  1995 April      8.92
 4   222  1995 May        9.59
 5   222  1995 June       9.59
 6   222  1995 July       9.70
 7   222  1995 August     9.66
 8   222  1995 September  9.46
 9   222  1995 October    9.49
10   222  1995 November   9.31
# ... with 18,400 more rows
library(dplyr)
library(reshape2) # dcast() comes from reshape2 (or data.table)

df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
  dcast(site + year + Well ~ month, value.var = "value") %>%
  mutate(Diff = February - August)
Thanks,
Simon
I attempted to manufacture a data set and used dplyr to create a solution. It is best practice to include a way of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)

# generate data set of all site, well, and month combinations
## define valid values
sites <- letters[1:3]
wells <- 1:5
months <- month.name

## perform a series of merges
full_sites_wells_months_set <-
  merge(sites, wells) %>%
  dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
  merge(months) %>%
  dplyr::rename(months = y) %>%
  dplyr::arrange(sites, wells)

# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8

initial_tibble <-
  full_sites_wells_months_set %>%
  dplyr::sample_frac(data_availability) %>%
  dplyr::mutate(values = runif(nrow(full_sites_wells_months_set) * data_availability)) # generate random groundwater values

# generate the final result by joining the full expected set of sites, wells, and months
# to the actual data, then group by site and well and perform the lag subtraction
final_tibble <-
  full_sites_wells_months_set %>%
  dplyr::left_join(initial_tibble) %>%
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
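Applied to the question's own columns, a minimal sketch (assuming month is a factor with month.name levels; tidyr::complete reinstates the missing months as NA rows, and diff_6_months is my own name):
library(dplyr)
library(tidyr)

df %>%
  mutate(month = factor(month, levels = month.name)) %>%
  complete(Well, year, month) %>% # reinstate missing months as NA rows
  arrange(Well, year, month) %>%
  group_by(Well, year) %>%
  mutate(diff_6_months = lag(value, 6L) - value) %>% # February - August in August's row, etc.
  ungroup()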

Summing each hydrologic year in my dataframe at 649 locations and 11,088 observations

Can someone please help me? I have a dataframe with 649 different locations, each with 11,088 observations from the last 30 years. One hydrologic year spans Sep 1 to Aug 31. The dataframe looks like this:
What I want to end up with is something like this:
In my original dataframe I also have a lot of missing data. If a location (i.e. 1.50.0) is missing more than 10% of its data in one hydrological year, I do not want to keep that year in my new dataframe.
If my question is unclear, please ask. :)
Without data it's not easy, but it may be something like this:
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)

df <- data.frame(d1 = c(rnorm(9, 5, 2), NA),
                 d2 = rnorm(10, 15, 2))
row.names(df) <- c(seq(today() - days(9), today(), "day"))

df %>%
  rownames_to_column("id") %>%
  gather(variable, value, -id) %>%
  mutate(yr = year(id)) %>%
  group_by(yr) %>%
  mutate(is_na = sum(is.na(value)) / n()) %>%
  filter(is_na < .1) %>%
  group_by(yr, variable) %>%
  summarise(res = mean(value, na.rm = TRUE)) %>%
  spread(variable, res)
# A tibble: 1 x 3
# Groups:   yr [1]
     yr    d1    d2
  <dbl> <dbl> <dbl>
1 2018.  4.41  14.7
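Note that the chain above groups by calendar year; for a true hydrologic year (Sep 1 to Aug 31) you can derive the year label first. A sketch on the same toy df, where hydro_yr is my own name and Sep-Dec rows count toward the following year's label:
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)

df %>%
  rownames_to_column("id") %>%
  gather(variable, value, -id) %>%
  mutate(hydro_yr = year(id) + (month(id) >= 9)) %>% # Sep 1 starts the next hydrologic year
  group_by(hydro_yr, variable) %>%
  filter(sum(is.na(value)) / n() <= 0.1) %>% # drop location-years with more than 10% missing
  summarise(res = sum(value, na.rm = TRUE)) %>%
  spread(variable, res)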

loop to run model on subset dataframe

I am not very experienced with loops, so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
    1 1961   1      4.85             4.090323
    1 1961   2      4.90             4.090323
    1 1961   3      2.95             4.090323
    1 1961   4      3.40             4.090323
    1 1961   5      2.90             4.090323
A dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow = 672, ncol = 4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
  for (j in 1:length(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    cr <- as.vector(coef(mn))
    nv <- length(coef(mn))
    table[i, 1:nv] <- cr
    table[i, nv + 1] <- YEARsel[i]
    table[i, nv + 2] <- MONTHsel[j]
  }
}
I've tried several options (e.g. without using a nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong). Incidentally, that is also why you only see month 12: table[i, ...] reuses row i on every pass of the inner loop, so each month overwrites the previous one.
library(readxl)
library(tidyverse)

df <- read_excel("~/Downloads/df_weather.xlsx")

df %>%
  nest(-month, -year) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = nls.control(), # substitute your own nlc control object here
                                 start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~ coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#>   month  year  alpha    phi
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1  1961  0.561 -10.8
#> 2     2  1961 -1.50  -10.5
#> 3     3  1961 -2.06   -9.77
#> 4     1  1962 -3.35   -5.48
#> 5     2  1962 -2.27   -9.97
#> 6     3  1962  0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model for each group, then we map the coefficients for each group, lastly we unnest the coefficients and spread the data from long to wide.
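For reference, the original loop can also be repaired directly by giving each (year, month) pair its own row; a sketch, assuming nlc is the nls control object defined elsewhere in your script:
for (i in seq_along(YEARsel)) {
  for (j in seq_along(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    row <- (i - 1) * length(MONTHsel) + j # unique row for every year-month pair
    table[row, 1:length(coef(mn))] <- coef(mn)
    table[row, 3] <- YEARsel[i]
    table[row, 4] <- MONTHsel[j]
  }
}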

Any workaround to do yearly statistics for non tabular data with `tidyr` or `dplyr`?

I have gridded data in a CSV file where daily precipitation observations for all grid cells are stored as plain text (the raw gridded data is right here). I want to compute yearly statistics on this grid data: the yearly total precipitation for each grid cell. Because the raw data is rather large, I have put the grid data online (please take a look at the raw gridded data online).
Basically, from the daily precipitation observations, I want to do a simple statistic: the yearly total precipitation for each grid cell across the whole coverage. Here is the whole grid coverage for which I am going to compute yearly statistics per cell.
I think dplyr or tidyr could help with this aggregation. Perhaps the first step would be to reshape the raw CSV into a long spreadsheet keyed by each grid coordinate, then simply use base::sum or dplyr::summarize to compute the yearly statistics. Here is my first attempt at getting a long spreadsheet:
raw_csvData %>%
  group_by(.$days) %>%
  tidyr::spread(spread(key = .$days, value = precip))
This script doesn't work for getting a long spreadsheet per grid cell. Then again, perhaps a long spreadsheet isn't necessary at all; maybe there is a faster, smarter way to get this done.
Basically, I intend to calculate the yearly total precipitation for each grid cell and add that aggregate as a new column, dropping all the daily observations, since I don't need them once I have the yearly totals; that would produce a cleaner, smaller output.
If I don't need to reshape the raw CSV into a long spreadsheet, what is an efficient, fast solution for this aggregation task? Is there a nice way to make this happen with dplyr or tidyr? Any ideas?
Update (per comments)
If you only need the annual total precipitation and long/lat, and nothing else, use group_by and summarise:
raw_csvData %>%
  group_by(long, lat) %>%
  summarise(total_precip = sum(precip))
Output:
# A tibble: 6 x 3
# Groups:   long [1]
   long   lat total_precip
  <dbl> <dbl>        <dbl>
1  6.12  47.4         846.
2  6.12  47.6         847.
3  6.12  47.9         852.
4  6.12  48.1         860.
5  6.12  48.4         867.
6  6.12  48.6         899.
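If the file covers more than one year, keep year in the grouping so the totals stay per year; a minimal variation on the above:
raw_csvData %>%
  group_by(year, long, lat) %>%
  summarise(total_precip = sum(precip))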
Original (for adding column to existing data frame)
You can remove days with select(-days).
Then group_by long and lat and compute grouped sums. Use mutate instead of summarise, which will add the sum column back into your original data:
raw_csvData %>%
  select(-days) %>%
  group_by(long, lat) %>%
  mutate(total_precip = sum(precip))
Output:
# A tibble: 6 x 7
# Groups:   long, lat [6]
   year month   day  long   lat precip total_precip
  <int> <int> <int> <dbl> <dbl>  <dbl>        <dbl>
1  1980     1     1  6.12  47.4   0.           846.
2  1980     1     1  6.38  47.4   0.           846.
3  1980     1     1  6.62  47.4   0.           846.
4  1980     1     1  6.88  47.4   0.           844.
5  1980     1     1  7.12  47.4   0.           853.
6  1980     1     1  7.38  47.4   1.20         880.
