Creating averages across time periods - r

I'm a beginner in R. I have the dataframe below (with more observations than shown), in which each 'id' has at most three observations, one for each of the years 1991, 1999 and 2007.
I want to create a variable avg_ln_rd, by 'id', that averages 'ln_rd' with the 'ln_rd' from the preceding wave: from 1991 if the first ln_rd value in the average is from 1999, and from 1999 if the first ln_rd value is from 2007.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I also already dropped any observations of 'id' that only exist for one of the three years.
My first thought was to create a standalone variable of ln_rd for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?

It is still a bit unclear what should happen when all three years are present in a group, but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
         avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
         avg91 = ifelse(year == 1991, avg91, NA),
         avg99 = ifelse(year == 2007, avg99, NA)) %>%
  ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
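The follow-up idea from the question's EDIT (shift ln_rd by one row within each id) can be sketched with lag(). This is only a sketch under assumptions: the column names prev_year and prev_ln_rd are made up for illustration, the toy rows are copied from the question, and the check that the previous row is exactly 8 years earlier is an assumed rule so that an id missing 1999 does not average 2007 with 1991.

```r
library(dplyr)

# Toy data with the question's layout (values copied from the question)
df <- data.frame(
  id    = c(1013, 1013, 1013, 1021, 1021, 1021),
  year  = c(1991, 1999, 2007, 1991, 1999, 2007),
  ln_rd = c(3.51, 5.64, 4.26, 0.899, 0.791, 0.704)
)

res <- df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(
    prev_year  = lag(year),    # hypothetical helper: year of the previous row in this id
    prev_ln_rd = lag(ln_rd),   # hypothetical helper: ln_rd of the previous row in this id
    # Assumption: average only when the previous row is the preceding wave (8 years earlier)
    avg_ln_rd  = if_else(year - prev_year == 8,
                         (ln_rd + prev_ln_rd) / 2,
                         NA_real_)
  ) %>%
  ungroup()

res
```

Because lag() runs inside group_by(id), the first row of every id gets NA instead of borrowing a value from the previous id.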

Related

Processing data.frame that needs order and cumulative days

With the small reproducible example below, I'd like to find the dplyr approach that produces the data.frame shown at the end of this note. The dplyr output should be sorted by date (note that the dates 1999-04-13 and 1999-03-12 are out of order) and should then accumulate, within each wy grouping (wy = "water year"; Oct 1-Sep 30), the number of days on which Q is above a threshold of 3.0.
dat <- read.table(text="
Date wy Q
1997-01-01 1997 9.82
1997-02-01 1997 3.51
1997-02-02 1997 9.35
1997-10-04 1998 0.93
1997-11-01 1998 1.66
1997-12-02 1998 0.81
1998-04-03 1998 5.65
1998-05-05 1998 7.82
1998-07-05 1998 6.33
1998-09-06 1998 0.55
1998-09-07 1998 4.54
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-01-01 1999 5.67
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-06-05 1999 3.34
1999-09-30 1999 1.99
1999-11-06 2000 5.75
2000-03-04 2000 6.28
2000-06-07 2000 0.81
2000-07-06 2000 9.66
2000-09-09 2000 9.08
2000-09-21 2000 6.72", header=TRUE)
dat$Date <- as.Date(dat$Date)
mdat <- dat %>%
group_by(wy) %>%
filter(Q > 3) %>%
?
Desired results:
Date wy Q abvThreshCum
1997-01-01 1997 9.82 1
1997-02-01 1997 3.51 2
1997-02-02 1997 9.35 3
1997-10-04 1998 0.93 0
1997-11-01 1998 1.66 0
1997-12-02 1998 0.81 0
1998-04-03 1998 5.65 1
1998-05-05 1998 7.82 2
1998-07-05 1998 6.33 3
1998-09-06 1998 0.55 3
1998-09-07 1998 4.54 4
1998-10-09 1999 6.50 1
1998-12-31 1999 2.17 1
1999-01-01 1999 5.67 2
1999-03-12 1999 4.67 3
1999-04-13 1999 5.66 4
1999-06-05 1999 3.34 5
1999-09-30 1999 1.99 5
1999-11-06 2000 5.75 1
2000-03-04 2000 6.28 2
2000-06-07 2000 0.81 2
2000-07-06 2000 9.66 3
2000-09-09 2000 9.08 4
2000-09-21 2000 6.72 5
library(dplyr)
dat %>%
  arrange(Date) %>%
  group_by(wy) %>%
  mutate(abv = cumsum(Q > 3)) %>%
  ungroup()
# # A tibble: 24 x 4
# Date wy Q abv
# <date> <int> <dbl> <int>
# 1 1997-01-01 1997 9.82 1
# 2 1997-02-01 1997 3.51 2
# 3 1997-02-02 1997 9.35 3
# 4 1997-10-04 1998 0.93 0
# 5 1997-11-01 1998 1.66 0
# 6 1997-12-02 1998 0.81 0
# 7 1998-04-03 1998 5.65 1
# 8 1998-05-05 1998 7.82 2
# 9 1998-07-05 1998 6.33 3
# 10 1998-09-06 1998 0.55 3
# # ... with 14 more rows
data.table approach
library(data.table)
setDT(dat, key = "Date")[, abvThreshCum := cumsum(Q > 3), by = .(wy)]
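
For comparison, the same running count can be sketched in base R. This is a minimal sketch, not part of the original answers: it uses a handful of rows copied from the question, sorts by Date first, and then applies cumsum() to the 0/1 indicator within each wy via ave().

```r
# A few rows copied from the question, including the out-of-order dates
dat <- data.frame(
  Date = as.Date(c("1998-10-09", "1998-12-31", "1999-04-13",
                   "1999-03-12", "1999-11-06")),
  wy   = c(1999, 1999, 1999, 1999, 2000),
  Q    = c(6.50, 2.17, 5.66, 4.67, 5.75)
)

# Sort by date so the cumulative count follows calendar order
dat <- dat[order(dat$Date), ]

# Q > 3 is TRUE/FALSE; as.integer() turns it into 1/0, and cumsum()
# within each water year gives the running number of days above 3.0
dat$abvThreshCum <- ave(as.integer(dat$Q > 3), dat$wy, FUN = cumsum)

dat
```

The count restarts at each new wy because ave() applies cumsum() separately to each group.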

How can I divide into columns the summarize() function with tidyverse?

I am struggling with the tidyverse package. I'm using the mpg dataset from R to display the issue that I'm facing (ignore if the relationships are not relevant, it is just for the sake of explaining my problem).
What I'm trying to do is to obtain the average "displ" grouped by manufacturer and year AND at the same time (and this is what I can't figure out), have several columns for each of the fuel types variable (i.e.: a column for the mean of diesel, a column for the mean of petrol, etc.).
This is the first part of the code. I'm new to R, so I really don't know what I need to add...
mpg %>%
group_by(manufacturer, year) %>%
summarize(Mean. = mean(c(displ)))
# A tibble: 30 × 3
# Groups: manufacturer [15]
manufacturer year Mean.
<chr> <int> <dbl>
1 audi 1999 2.36
2 audi 2008 2.73
3 chevrolet 1999 4.97
4 chevrolet 2008 5.12
5 dodge 1999 4.32
6 dodge 2008 4.42
7 ford 1999 4.45
8 ford 2008 4.66
9 honda 1999 1.6
10 honda 2008 1.85
# … with 20 more rows
Any help is appreciated, thank you.
Perhaps we need to reshape into 'wide' format:
library(dplyr)
library(tidyr)
library(ggplot2) # the mpg dataset ships with ggplot2
mpg %>%
  select(manufacturer, year, fl, displ) %>%
  pivot_wider(names_from = fl, values_from = displ, values_fn = mean)
-output
# A tibble: 30 x 7
manufacturer year p r e d c
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 audi 1999 2.36 NA NA NA NA
2 audi 2008 2.73 NA NA NA NA
3 chevrolet 2008 6.47 4.49 5.3 NA NA
4 chevrolet 1999 5.7 4.22 NA 6.5 NA
5 dodge 1999 NA 4.32 NA NA NA
6 dodge 2008 NA 4.42 4.42 NA NA
7 ford 1999 NA 4.45 NA NA NA
8 ford 2008 5.4 4.58 NA NA NA
9 honda 1999 1.6 1.6 NA NA NA
10 honda 2008 2 1.8 NA NA 1.8
# … with 20 more rows
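
An equivalent two-step route, sketched here as an alternative rather than as the answerer's method, is to compute the means with group_by() and summarise() first and only then spread the fuel types into columns. The column name mean_displ and the prefix "displ_" are illustrative choices, not names from the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

wide <- mpg %>%
  # One mean per manufacturer, year and fuel type
  group_by(manufacturer, year, fl) %>%
  summarise(mean_displ = mean(displ), .groups = "drop") %>%
  # One column per fuel type; combinations that never occur become NA
  pivot_wider(names_from = fl,
              values_from = mean_displ,
              names_prefix = "displ_")

wide
```

Keeping the summarise() step separate makes it easy to add further statistics (for example a count per group) before reshaping.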

Computing lags but grouping by two categories with dplyr

What I want is to create var3 using a lag (dplyr package), but the lag should be consistent with the year and the ID. I mean, the lagged value should belong to the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimum example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(Setting aside, for now, that I changed your summarize into a mutate.)
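
The code above relies on the rows already being sorted by YEAR within each ID. If that is not guaranteed, one option is to make the ordering explicit with the order_by argument of dplyr::lag(). The sketch below is an illustration under assumptions: the seed and the deliberate shuffling are added only to show that row order no longer matters.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the example is reproducible
MyData <- data.frame(
  YEAR = rep(2010:2014, 5),
  ID   = rep(1:5, each = 5),
  var1 = rnorm(n = 25, mean = 10, sd = 3),
  var2 = rnorm(n = 25, mean = 1, sd = 1)
)

# Shuffle the rows on purpose to mimic unsorted input
shuffled <- MyData[sample(nrow(MyData)), ]

res <- shuffled %>%
  group_by(ID) %>%
  # order_by = YEAR makes lag() pick the previous year within each ID,
  # whatever the physical row order is
  mutate(var3 = var1 - lag(var2, order_by = YEAR)) %>%
  ungroup() %>%
  arrange(ID, YEAR)

res
```

In an unbalanced panel the "previous" row may be more than one year back, so a check on the gap in YEAR may be needed before using the lagged value.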

Impute missing data of a single variable using the slope of a linear regression of another variable in R

Here is an extract of my dataset (df8) which contains time series from 2000 to 2018 for 194 countries.
iso3 year anc4 median
<chr> <dbl> <dbl> <dbl>
1 BIH 2000 NA 0.739
2 BIH 2001 NA 0.746
3 BIH 2002 NA 0.763
4 BIH 2003 NA 0.778
5 BIH 2004 NA 0.842
6 BIH 2005 NA 0.801
7 BIH 2006 NA 0.819
8 BIH 2007 NA 0.841
9 BIH 2008 NA 0.845
10 BIH 2009 NA 0.840
11 BIH 2010 0.842 0.856
12 BIH 2011 NA 0.873
13 BIH 2012 NA 0.867
14 BIH 2013 NA 0.889
15 BIH 2014 NA 0.879
16 BIH 2015 NA 0.883
17 BIH 2016 NA 0.854
18 BIH 2017 NA 0.891
19 BIH 2018 NA 0.920
20 BWA 2000 NA 0.739
21 BWA 2001 NA 0.746
22 BWA 2002 NA 0.763
23 BWA 2003 NA 0.778
24 BWA 2004 NA 0.842
25 BWA 2005 NA 0.801
26 BWA 2006 0.733 0.819
27 BWA 2007 NA 0.841
28 BWA 2008 NA 0.845
29 BWA 2009 NA 0.840
30 BWA 2010 NA 0.856
31 BWA 2011 NA 0.873
32 BWA 2012 NA 0.867
33 BWA 2013 NA 0.889
34 BWA 2014 NA 0.879
35 BWA 2015 NA 0.883
36 BWA 2016 NA 0.854
37 BWA 2017 NA 0.891
38 BWA 2018 NA 0.920
What I would like to do is impute the missing data for the variable anc4, using the slope of a linear regression based on the regional medians (median). I would like to do that at the country level, as the countries do not all belong to the same region.
This is what I have tried:
df_model <- df8
predictions <- vector()
for(i in unique(df_model$iso3)) {
temp <- df_model[df_model[,2]==i,]
predictions <- c(predictions,predict(lm(median~year,temp),df8[is.na(df8$anc4) & df8$iso3==i,]))
}
df8[is.na(df8$anc4),]$anc4 <- predictions
I started from the code I had been using to impute missing anc4 data with a linear regression on the observed anc4 data points and tried to adapt it to use the medians, but it did not quite work!
Thank you so much!
Your last comment made your question clear: you get the slope from the linear regression on the medians and you get the intercept from the only non-missing value.
However, there was a rather serious flaw in your code: you should never grow a vector inside a for loop. Use *apply functions, or even better use *map functions from the purrr package. If you have a very good reason to use a for loop, at least preallocate its size.
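
To illustrate the point about not growing a vector inside a loop, here is a minimal sketch that collects one slope per country with vapply(), which allocates the result once. The toy data and the country codes AAA and BBB are hypothetical, chosen so that the slopes are easy to verify by eye.

```r
# Hypothetical toy data: two countries with a regional median per year
df_model <- data.frame(
  iso3   = rep(c("AAA", "BBB"), each = 4),
  year   = rep(2000:2003, 2),
  median = c(0.70, 0.72, 0.74, 0.76,
             0.50, 0.53, 0.56, 0.59)
)

countries <- unique(df_model$iso3)

# vapply() returns a numeric vector of known length, one element per country,
# instead of repeatedly extending a vector with c()
slopes <- vapply(countries, function(ctry) {
  temp <- df_model[df_model$iso3 == ctry, ]
  unname(coef(lm(median ~ year, data = temp))["year"])
}, numeric(1))

slopes
```

The numeric(1) template also makes vapply() fail loudly if a model ever returns something other than a single number.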
Since you get the intercept from outside the model, you cannot use predict here. Fortunately, it is rather simple to predict manually when using a linear model.
Here is my solution using the dplyr syntax. If you are not familiar with it, I urge you to read about it.
x <- df_model %>%
  group_by(iso3) %>%
  mutate(
    slope = lm(median ~ year)$coefficients["year"],
    intercept = anc4[!is.na(anc4)] - slope * year[!is.na(anc4)],
    anc4_imput = intercept + year * slope,
    anc4_error = anc4 - anc4_imput,
  )
x
#> # A tibble: 38 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2000 NA 0.739 0.00844 -16.1 0.758 NA
#> 2 BIH 2001 NA 0.746 0.00844 -16.1 0.766 NA
#> 3 BIH 2002 NA 0.763 0.00844 -16.1 0.774 NA
#> 4 BIH 2003 NA 0.778 0.00844 -16.1 0.783 NA
#> 5 BIH 2004 NA 0.842 0.00844 -16.1 0.791 NA
#> 6 BIH 2005 NA 0.801 0.00844 -16.1 0.800 NA
#> 7 BIH 2006 NA 0.819 0.00844 -16.1 0.808 NA
#> 8 BIH 2007 NA 0.841 0.00844 -16.1 0.817 NA
#> 9 BIH 2008 NA 0.845 0.00844 -16.1 0.825 NA
#> 10 BIH 2009 NA 0.84 0.00844 -16.1 0.834 NA
#> # ... with 28 more rows
#error is negligible
x %>% filter(!is.na(anc4))
#> # A tibble: 2 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2010 0.842 0.856 0.00844 -16.1 0.842 1.22e-15
#> 2 BWA 2006 10.7 10.8 0.00844 -6.20 10.7 0.
#Created on 2020-06-12 by the reprex package (v0.3.0)
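
The same imputation can also be written relative to the single observed point, which avoids the large intercept: imputed value = observed anc4 + slope * (year - observed year). The sketch below is an algebraically equivalent variant, not the answerer's code. It assumes exactly one non-missing anc4 per country, and the column names anchor_year and anchor_val as well as the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: one country, one observed anc4 value (in 2002)
df <- data.frame(
  iso3   = "AAA",
  year   = 2000:2004,
  anc4   = c(NA, NA, 0.80, NA, NA),
  median = c(0.70, 0.72, 0.74, 0.76, 0.78)
)

imputed <- df %>%
  group_by(iso3) %>%
  mutate(
    # Slope of the regional median over time within the country
    slope       = coef(lm(median ~ year))[["year"]],
    # Assumption: exactly one observed anc4 per country; take it as the anchor
    anchor_year = year[!is.na(anc4)][1],
    anchor_val  = anc4[!is.na(anc4)][1],
    # Line with the median's slope passing through the observed point
    anc4_imput  = anchor_val + slope * (year - anchor_year)
  ) %>%
  ungroup()

imputed
```

With a median slope of 0.02 per year, the imputed series here runs from 0.76 in 2000 to 0.84 in 2004 and passes through the observed 0.80 in 2002.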

common data/sample in two dataframes in R

I'm trying to compare model-based forecasts from two different models. Model number 2 however requires more non-missing data and has thus more missing values (NA) than model 1.
I am now wondering how I can quickly query both dataframes for non-missing values and identify the common sample. I used to work with Excel, and the function
=IF(AND(ISVALUE(a1);ISVALUE(b1));then;else)
comes to my mind but I don't know how to do it properly with R.
This is my df from model 1: Every observation is clearly identified by id and time.
(the rownumbers on the left are from my overall dataframe and are identical in both dataframes.)
> head(model1)
id time f1 f2 f3 f4 f5
9 1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
10 1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
11 1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
12 1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
13 1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
14 1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737
and this model 2:
> head(model2)
id time meanf1 meanf2 meanf3 meanf4 meanf5
9 1 1995 4.56 5.14 6.05 NA NA
10 1 1996 4.38 4.94 NA NA NA
11 1 1997 4.05 4.51 NA NA NA
12 1 1998 4.07 5.04 6.52 NA NA
13 1 1999 3.61 4.96 NA NA NA
14 1 2000 4.35 4.83 6.46 NA NA
Thank you for your help and hints.
The function complete.cases flags the rows that have no missing values in any column. The pairs (f4, meanf4) and (f5, meanf5) have no "common" non-missing values in the sample data, hence have no observations. Is this what you were looking for?
#Read Data
model1=read.table(text='id time f1 f2 f3 f4 f5
1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737',header=TRUE)
model2=read.table(text=' id time meanf1 meanf2 meanf3 meanf4 meanf5
1 1995 4.56 5.14 6.05 NA NA
1 1996 4.38 4.94 NA NA NA
1 1997 4.05 4.51 NA NA NA
1 1998 4.07 5.04 6.52 NA NA
1 1999 3.61 4.96 NA NA NA
1 2000 4.35 4.83 6.46 NA NA',header=TRUE)
#column indices of f1..f5 = 3..7
#merge data for each f1..f5 and keep only non-missing values using complete.cases()
DF_list = lapply(3:7, function(x) {
  DF = merge(model1[, c(1, 2, x)], model2[, c(1, 2, x)], by = c("id", "time"))
  DF = DF[complete.cases(DF), ]
  return(DF)
})
DF_list
#[[1]]
# id time f1 meanf1
#1 1 1995 16.351261 4.56
#2 1 1996 15.942914 4.38
#3 1 1997 24.187390 4.05
#4 1 1998 3.101094 4.07
#5 1 1999 33.562234 3.61
#6 1 2000 59.979666 4.35
#
#[[2]]
# id time f2 meanf2
#1 1 1995 -1.856662 5.14
#2 1 1996 -1.749530 4.94
#3 1 1997 15.099166 4.51
#4 1 1998 -10.455754 5.04
#5 1 1999 2.610512 4.96
#6 1 2000 -45.106093 4.83
#
#[[3]]
# id time f3 meanf3
#1 1 1995 6.577671 6.05
#4 1 1998 -9.674086 6.52
#6 1 2000 -100.352866 6.46
#
#[[4]]
#[1] id time f4 meanf4
#<0 rows> (or 0-length row.names)
#
#[[5]]
#[1] id time f5 meanf5
#<0 rows> (or 0-length row.names)
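
If a single merged table with a TRUE/FALSE flag per pair is more convenient than a list (closer to the Excel IF(AND(...)) idea from the question), a sketch along the following lines may help. It uses a trimmed, rounded subset of the question's data, and the column name common_f3 is a hypothetical choice.

```r
# Trimmed toy versions of the two model outputs (rounded values from the question)
model1 <- data.frame(id = 1, time = 1995:1997,
                     f1 = c(16.35, 15.94, 24.19),
                     f3 = c(6.58, 2.89, 14.28))
model2 <- data.frame(id = 1, time = 1995:1997,
                     meanf1 = c(4.56, 4.38, 4.05),
                     meanf3 = c(6.05, NA, NA))

# Join on the keys so each row holds both models' forecasts
both <- merge(model1, model2, by = c("id", "time"))

# Flag rows where both forecasts of one pair are available
both$common_f3 <- !is.na(both$f3) & !is.na(both$meanf3)

# Rows that are complete across all forecast columns at once
common_all <- both[complete.cases(both[, c("f1", "meanf1", "f3", "meanf3")]), ]

both
common_all
```

The flag column keeps every row visible, while complete.cases() on the selected columns gives the strictest common sample.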
