Year-to-year variation - r

I have a question about how to calculate year-to-year correlations for certain statistics. For example, with baseball data: for specific players, is Average or On-Base Percentage more consistent over time? That is, which of the two fluctuates more from one season to the next?
My data currently is in the following format:
name Season ARuns Lag.ARuns BRuns Lag.BRuns
321 Abad Andy 2003 -1.05 NA -1.19 NA
3158 Abercrombie Reggie 2006 27.42 NA -.42 NA
1312 Abercrombie Reggie 2007 7.65 27.42 .15 -.42
1069 Abercrombie Reggie 2008 5.34 7.65 -1.81 .15
4614 Abernathy Brent 2002 46.71 NA -.86 NA
707 Abernathy Brent 2003 -2.29 46.71 -.33 -.86
1297 Abernathy Brent 2005 5.59 -2.29 3.53 -.33
6024 Abreu Bobby 2002 102.89 NA 2.70 NA
6087 Abreu Bobby 2003 113.23 102.89 4.39 2.70
6177 Abreu Bobby 2004 128.60 113.23 2.29 4.39
Any ideas would be appreciated!
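A minimal sketch of one common approach, assuming the data frame is named dat and has the columns shown above: correlate each statistic with its own one-season lag, then compare the two coefficients. The statistic with the higher lag-1 correlation is the more consistent one.
# Pearson correlation of each statistic with its previous-season value;
# use = "complete.obs" drops the NA rows from each player's first season.
r_A <- cor(dat$ARuns, dat$Lag.ARuns, use = "complete.obs")
r_B <- cor(dat$BRuns, dat$Lag.BRuns, use = "complete.obs")
c(ARuns = r_A, BRuns = r_B)  # higher value = more year-to-year consistency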

Related

Creating averages across time periods

I'm a beginner in R, but I have the data frame below (with more observations), in which each 'id' is observed for at most three years: 91, 99, and 07.
I want to create a variable avg_ln_rd by 'id' that averages ln_rd with the ln_rd from year 91 when the paired observation is from 99, and with the ln_rd from year 99 when the paired observation is from 07.
id year ln_rd
<dbl> <dbl> <dbl>
1 1013 1991 3.51
2 1013 1999 5.64
3 1013 2007 4.26
4 1021 1991 0.899
5 1021 1999 0.791
6 1021 2007 0.704
7 1034 1991 2.58
8 1034 1999 3.72
9 1034 2007 4.95
10 1037 1991 0.262
I have also already dropped any 'id' that is observed in only one of the three years.
My first thought was to create a standalone ln_rd variable for each year, but then I would still need to filter by id, which I do not know how to do.
Then I tried using these standalone variables to form an if clause.
df$lagln_rd_99 <- ifelse(df$year == 1999, df$ln_rd_91, NA)
But again I do not know how to keep 'id' fixed.
Any help would be greatly appreciated.
EDIT:
I grouped by id using dplyr. Can I then just sort my df by id and create a new variable that is ln_rd but shifted by one row?
It's still a bit unclear what to do when all three years are present in a group, but this might help.
-- edited -- to show the desired output.
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(avg91 = mean(c(ln_rd[year == 1991], ln_rd[year == 1999])),
         avg99 = mean(c(ln_rd[year == 1999], ln_rd[year == 2007])),
         avg91 = ifelse(year == 1991, avg91, NA),
         avg99 = ifelse(year == 2007, avg99, NA)) %>%
  ungroup()
# A tibble: 15 × 5
year id ln_rd avg91 avg99
<int> <int> <dbl> <dbl> <dbl>
1 1991 3505 3.38 3.09 NA
2 1999 3505 2.80 NA NA
3 1991 4584 1.45 1.34 NA
4 1999 4584 1.22 NA NA
5 1991 5709 1.90 2.13 NA
6 1999 5709 2.36 NA NA
7 2007 5709 3.11 NA 2.74
8 2007 9777 2.36 NA 2.36
9 1991 18729 4.82 5.07 NA
10 1999 18729 5.32 NA NA
11 2007 18729 5.53 NA 5.42
12 1991 20054 0.588 0.307 NA
13 1999 20054 0.0266 NA NA
14 1999 62169 1.91 NA NA
15 2007 62169 1.45 NA 1.68
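As for the EDIT's question above: yes. After arranging by id and year, dplyr::lag() shifts ln_rd down one row within each group, which yields the pairwise average directly. A minimal sketch under the same assumptions:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  mutate(avg_ln_rd = (ln_rd + lag(ln_rd)) / 2) %>%  # NA for each id's first year
  ungroup()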

Finding the grouping variable for which a variable takes more than one unique value

In DATA below, how can I find the unique study_id for which the variable scale takes on more than one unique value?
The expected answer should be Li (scale for Li has both Other and MBTI). How can I find it via base R or dplyr code?
m="
study_id year es_id r se n pub_type context ed_setting age_grp L1 L2 prof scale outcome
Dreyer 1992 130 0 0.0574 305 DocDisse~ Foreign~ CollegeUni~ Adult Afri~ Engl~ NA Other Listen~
Dreyer 1992 131 0.04 0.0574 305 DocDisse~ Foreign~ CollegeUni~ Adult Afri~ Engl~ NA Other Writing
Dreyer 1992 132 -0.03 0.0574 305 DocDisse~ Foreign~ CollegeUni~ Adult Afri~ Engl~ NA Other Reading
Dreyer 1992 133 0 0.0574 305 DocDisse~ Foreign~ CollegeUni~ Adult Afri~ Engl~ NA Other Overall
Ghapanchi 2011 89 0.31 0.0806 141 JournalA~ Foreign~ CollegeUni~ Adult Pers~ Engl~ NA Other Overall
Hassan 2001 177 0.25 0.117 71 NA Foreign~ CollegeUni~ NA Arab~ Engl~ NA Other Speaki~
Kralova 2012 137 0.0252 0.117 75 JournalA~ Foreign~ CollegeUni~ Adult Slov~ Engl~ Inte~ Other Speaki~
Li 2009 55 -0.04 0.132 59 JournalA~ Foreign~ CollegeUni~ Adult Chin~ Engl~ NA Other Grammar
Li 2009 56 0.355 0.124 59 JournalA~ Foreign~ CollegeUni~ Adult Chin~ Engl~ NA Other Pragma~
Li 2003 57 0.039 0.0735 187 JournalA~ Foreign~ CollegeUni~ Multip~ Chin~ Engl~ NA MBTI Overall
"
DATA <- read.table(text = m, header = TRUE)
Here's a way in dplyr as well as base R -
The idea is to select the study_id groups that contain more than one unique scale value, then keep the distinct study_id.
library(dplyr)
DATA %>%
  group_by(study_id) %>%
  dplyr::filter(n_distinct(scale) > 1) %>%
  ungroup() %>%
  distinct(study_id)
# study_id
# <chr>
#1 Li
Base R -
unique(subset(DATA, ave(scale, study_id,
FUN = function(x) length(unique(x))) > 1, select = study_id))
# study_id
#8 Li
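For completeness, a data.table sketch of the same idea (uniqueN counts the distinct values per group):
library(data.table)
as.data.table(DATA)[, .(n_scale = uniqueN(scale)), by = study_id][n_scale > 1, study_id]
# [1] "Li"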

Processing a data.frame that needs ordering and cumulative day counts

With the small reproducible example below, I'd like to identify a dplyr approach that arrives at the data.frame shown at the end of this note. The dplyr output should first ensure that the data.frame is sorted by date (note that the dates 1999-04-13 and 1999-03-12 are out of order) and then "accumulate", within each wy grouping (wy = "water year"; Oct 1 to Sep 30), the number of days on which Q is above a threshold of 3.0.
dat <- read.table(text="
Date wy Q
1997-01-01 1997 9.82
1997-02-01 1997 3.51
1997-02-02 1997 9.35
1997-10-04 1998 0.93
1997-11-01 1998 1.66
1997-12-02 1998 0.81
1998-04-03 1998 5.65
1998-05-05 1998 7.82
1998-07-05 1998 6.33
1998-09-06 1998 0.55
1998-09-07 1998 4.54
1998-10-09 1999 6.50
1998-12-31 1999 2.17
1999-01-01 1999 5.67
1999-04-13 1999 5.66
1999-03-12 1999 4.67
1999-06-05 1999 3.34
1999-09-30 1999 1.99
1999-11-06 2000 5.75
2000-03-04 2000 6.28
2000-06-07 2000 0.81
2000-07-06 2000 9.66
2000-09-09 2000 9.08
2000-09-21 2000 6.72", header=TRUE)
dat$Date <- as.Date(dat$Date)
mdat <- dat %>%
  group_by(wy) %>%
  filter(Q > 3) %>%
  ?
Desired results:
Date wy Q abvThreshCum
1997-01-01 1997 9.82 1
1997-02-01 1997 3.51 2
1997-02-02 1997 9.35 3
1997-10-04 1998 0.93 0
1997-11-01 1998 1.66 0
1997-12-02 1998 0.81 0
1998-04-03 1998 5.65 1
1998-05-05 1998 7.82 2
1998-07-05 1998 6.33 3
1998-09-06 1998 0.55 3
1998-09-07 1998 4.54 4
1998-10-09 1999 6.50 1
1998-12-31 1999 2.17 1
1999-01-01 1999 5.67 2
1999-03-12 1999 4.67 3
1999-04-13 1999 5.66 4
1999-06-05 1999 3.34 5
1999-09-30 1999 1.99 5
1999-11-06 2000 5.75 1
2000-03-04 2000 6.28 2
2000-06-07 2000 0.81 2
2000-07-06 2000 9.66 3
2000-09-09 2000 9.08 4
2000-09-21 2000 6.72 5
library(dplyr)
dat %>%
  arrange(Date) %>%
  group_by(wy) %>%
  mutate(abv = cumsum(Q > 3)) %>%
  ungroup()
# # A tibble: 24 x 4
# Date wy Q abv
# <date> <int> <dbl> <int>
# 1 1997-01-01 1997 9.82 1
# 2 1997-02-01 1997 3.51 2
# 3 1997-02-02 1997 9.35 3
# 4 1997-10-04 1998 0.93 0
# 5 1997-11-01 1998 1.66 0
# 6 1997-12-02 1998 0.81 0
# 7 1998-04-03 1998 5.65 1
# 8 1998-05-05 1998 7.82 2
# 9 1998-07-05 1998 6.33 3
# 10 1998-09-06 1998 0.55 3
# # ... with 14 more rows
data.table approach
library(data.table)
setDT(dat, key = "Date")[, abvThreshCum := cumsum(Q > 3), by = .(wy)]
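Note that key = "Date" sorts the table in place, playing the role of arrange(Date) in the dplyr version. The formerly out-of-order 1999 rows can be checked afterwards:
dat[wy == 1999, .(Date, Q, abvThreshCum)]  # 1999-03-12 now precedes 1999-04-13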

Impute missing data of a single variable using the slope of a linear regression of another variable in R

Here is an extract of my dataset (df8), which contains time series from 2000 to 2018 for 194 countries.
iso3 year anc4 median
<chr> <dbl> <dbl> <dbl>
1 BIH 2000 NA 0.739
2 BIH 2001 NA 0.746
3 BIH 2002 NA 0.763
4 BIH 2003 NA 0.778
5 BIH 2004 NA 0.842
6 BIH 2005 NA 0.801
7 BIH 2006 NA 0.819
8 BIH 2007 NA 0.841
9 BIH 2008 NA 0.845
10 BIH 2009 NA 0.840
11 BIH 2010 0.842 0.856
12 BIH 2011 NA 0.873
13 BIH 2012 NA 0.867
14 BIH 2013 NA 0.889
15 BIH 2014 NA 0.879
16 BIH 2015 NA 0.883
17 BIH 2016 NA 0.854
18 BIH 2017 NA 0.891
19 BIH 2018 NA 0.920
20 BWA 2000 NA 0.739
21 BWA 2001 NA 0.746
22 BWA 2002 NA 0.763
23 BWA 2003 NA 0.778
24 BWA 2004 NA 0.842
25 BWA 2005 NA 0.801
26 BWA 2006 0.733 0.819
27 BWA 2007 NA 0.841
28 BWA 2008 NA 0.845
29 BWA 2009 NA 0.840
30 BWA 2010 NA 0.856
31 BWA 2011 NA 0.873
32 BWA 2012 NA 0.867
33 BWA 2013 NA 0.889
34 BWA 2014 NA 0.879
35 BWA 2015 NA 0.883
36 BWA 2016 NA 0.854
37 BWA 2017 NA 0.891
38 BWA 2018 NA 0.920
What I would like to do is impute the missing data for the variable anc4, using the slope of a linear regression fit on the regional medians (median). I would like to do this at the country level, since countries do not all belong to the same region.
This is what I have tried:
df_model <- df8
predictions <- vector()
for (i in unique(df_model$iso3)) {
  temp <- df_model[df_model[, 2] == i, ]
  predictions <- c(predictions,
                   predict(lm(median ~ year, temp),
                           df8[is.na(df8$anc4) & df8$iso3 == i, ]))
}
df8[is.na(df8$anc4),]$anc4 <- predictions
I took the code I had been using to impute missing anc4 data from a linear regression on the observed anc4 points and tried to adapt it to the medians, but it did not quite work!
Thank you so much!
Your last comment made your question clear: you get the slope from the linear regression on the medians and you get the intercept from the only non-missing value.
However, there was a rather serious flaw in your code: you should never grow a vector inside a for loop. Use *apply functions, or even better the *map functions from the purrr package. If you have a very good reason to use a for loop, at least preallocate the result vector.
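For illustration, a toy sketch of the difference (placeholder computation, not your model):
# Growing: R copies the whole vector on every iteration
grow <- c()
for (i in 1:5) grow <- c(grow, i^2)
# Preallocating: allocate once, fill by index
fill <- numeric(5)
for (i in 1:5) fill[i] <- i^2
identical(grow, fill)  # TRUE, but preallocation scales far better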
Since you get the intercept from outside the model, you cannot use predict here. Fortunately, it is rather simple to predict manually when using a linear model.
Here is my solution using the dplyr syntax. If you are not familiar with it, I urge you to read up on it.
x <- df_model %>%
  group_by(iso3) %>%
  mutate(
    slope = lm(median ~ year)$coefficients["year"],
    intercept = anc4[!is.na(anc4)] - slope * year[!is.na(anc4)],
    anc4_imput = intercept + year * slope,
    anc4_error = anc4 - anc4_imput
  )
x
#> # A tibble: 38 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2000 NA 0.739 0.00844 -16.1 0.758 NA
#> 2 BIH 2001 NA 0.746 0.00844 -16.1 0.766 NA
#> 3 BIH 2002 NA 0.763 0.00844 -16.1 0.774 NA
#> 4 BIH 2003 NA 0.778 0.00844 -16.1 0.783 NA
#> 5 BIH 2004 NA 0.842 0.00844 -16.1 0.791 NA
#> 6 BIH 2005 NA 0.801 0.00844 -16.1 0.800 NA
#> 7 BIH 2006 NA 0.819 0.00844 -16.1 0.808 NA
#> 8 BIH 2007 NA 0.841 0.00844 -16.1 0.817 NA
#> 9 BIH 2008 NA 0.845 0.00844 -16.1 0.825 NA
#> 10 BIH 2009 NA 0.84 0.00844 -16.1 0.834 NA
#> # ... with 28 more rows
#error is negligible
x %>% filter(!is.na(anc4))
#> # A tibble: 2 x 8
#> # Groups: iso3 [2]
#> iso3 year anc4 median slope intercept anc4_imput anc4_error
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 BIH 2010 0.842 0.856 0.00844 -16.1 0.842 1.22e-15
#> 2 BWA 2006 10.7 10.8 0.00844 -6.20 10.7 0.
#Created on 2020-06-12 by the reprex package (v0.3.0)

Count highest value in data frame in R

I have the data frame (DF) below. I need to count, for each station, in how many years it recorded the maximum avg. temp, the minimum avg. temp, and the maximum total precipitation.
In each row of DF, every year is followed by its max avg. temp, min avg. temp, and total precipitation. For example, if the highest max avg. temperature in 1985 was recorded at station 1, that counts as one for station 1, and so on.
Any suggestion or help is appreciated.
Thanks.
DF:
St_name Met_data
station1 1985 15.33 4.33 780.1, 1986 12.7 2.18 505.3, 1987 17.76 6.33 793.6, 1988 17.35 4.53 541, 1989 15.65 3.98 793.7, 1990 16.9 5.96 1169.4, 1991 16.42 5.26 790.6, 1992 14.99 5.04 932.6, 1993 13.96 4.75 1420.7, 1994 14.96 3.79 668.8, 1995 15 3.67 952.9, 1996 13.77 2.4 808.5, 1997 14.69 3.26 773.5, 1998 17.22 6.25 1126.4, 1999 16.35 4.32 921.9, 2000 14.55 2.83 893.9, 2001 15.71 4.33 1118.8, 2002 15.61 3.96 1000.4, 2003 14.83 2.84 911.7, 2004 14.9 4 965.1, 2005 16.16 4.7 647.7, 2006 16.18 5.14 800.8, 2007 15.52 4.15 890.3, 2008 14.35 2.91 1271.9, 2009 14.4 3.77 1343.8, 2010 15.32 4.57 1145.4, 2011 15.41 4.54 857.3, 2012 17.39 5.4 745, 2013 15.26 3.51 811.4, 2014 13.8 2.37 986.3
station2 1985 19.27 7.81 1465.5, 1986 20.37 8.81 1201.3, 1987 20.95 8.72 949.2, 1988 20.03 7.53 1104.6, 1989 19.11 7.42 1050.1, 1990 20.53 8.76 1486.2, 1991 20.21 9.53 1164.4, 1992 19.55 8.51 913.6, 1993 18.7 8.24 1485.1, 1994 19.43 8.42 1171.7, 1995 19.62 7.41 1084.9, 1996 19.01 6.29 1212.4, 1997 18.85 6.76 1243.2, 1998 21 8.27 1261.1, 1999 21.28 7.99 1122.4, 2000 19.99 7.74 1242.7, 2001 20.13 7.59 1305.8, 2002 20.13 7.69 1563, 2003 19.48 6.52 1237.1, 2004 19.94 7.42 1174.8, 2005 20.53 8.05 1140.5, 2006 20.16 7.18 1542, 2007 21.44 7.91 1167.8, 2008 17.6 5.51 1653.8, 2009 20.63 9.06 1326, 2010 21.31 8.7 1024.8, 2011 21.21 9.96 1847.6, 2012 22.22 9.39 782.5, 2013 20.46 9.29 770.7
station3 1985 14.43 2.97 951.6, 1986 15.41 3.37 415.6, 1987 15.08 4.34 1110, 1988 16.19 3.33 787.6, 1989 14.77 2.19 796.8, 1990 16.28 4.59 1213.6, 1991 16.72 4.67 907.4, 1992 14.74 4.18 935.6, 1993 15.22 5.06 903.1, 1994 15.46 2.79 907.5, 1995 15.34 4.21 1001.1, 1996 14.46 2.49 1204.5, 1997 14.95 2.95 819, 1998 17.5 5.3 1078.6, 1999 16.73 3.24 901.9, 2000 15.81 2.7 931.4, 2001 16.68 4.09 968.7, 2002 16.48 6.41 762.2, 2003 15.47 4.99 999.6, 2004 15.32 5.31 875.7, 2005 16.16 5.91 593.2, 2006 16.06 6.3 997.2, 2007 15.87 5.71 946, 2008 14.46 4.1 1128.1, 2009 14.26 4.38 1146.1, 2010 15.92 4.79 1037.6, 2011 15.25 5.47 1045.8, 2012 17.47 6.43 659.2, 2013 14.25 4 1092.9, 2014 13.26 2.98 1039.4
.
.
Output:
St_name max_T_count min_T_count precip_count
station1 1 0 0
station2 0 2 0
station3 1 1 1
.
.
You should at least make an effort to organize your data in a spreadsheet before posting. The first four lines in the code below are just for tidying your data. I am also not sure what you want for precip_count, but at least you can work that out based on this solution.
library(tidyverse)
df %>%
  separate_rows(Met_data, sep = ",") %>%
  mutate(Met_data = trimws(Met_data)) %>%
  separate(Met_data, sep = " ",
           into = c("year", "max_avg", "min_avg", "total_avg")) %>%
  group_by(year) %>%
  mutate(max_T_count = as.integer(max_avg == max(max_avg)),
         min_T_count = as.integer(min_avg == min(min_avg)),
         precip_count = as.integer(total_avg == max(total_avg))) %>%
  ungroup() %>%
  group_by(St_name) %>%
  summarise_at(vars(ends_with("count")), sum)
%>% is the magrittr package pipe operator.
separate_rows separates the comma-delimited entries of the Met_data column into new rows.
trimws trims the extra whitespace around the resulting strings. This is necessary so that separate can split the values exactly at the blanks (see the toy example after this list).
separate splits Met_data at blanks and assigns the pieces the given column names.
group_by specifies by which grouping the aggregation is going to be done.
mutate creates new columns.
summarise_at makes summaries on specified columns with specified functions.
That's a handful of functions. I advise you to read the documentation for each by typing ?function, replacing function with each of the names above. Or you can use help(), e.g. `help("%>%", package = "magrittr")`.
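As a quick illustration of the first two steps, here is separate_rows() on a toy frame (not your DF). Note the leading blank it leaves behind, which is why trimws is needed before separate:
library(tidyr)
toy <- data.frame(id = "a", vals = "1985 15.33, 1986 12.7")
separate_rows(toy, vals, sep = ",")
# id vals
# a  "1985 15.33"
# a  " 1986 12.7"   <- leading blank would break separate(sep = " ")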
Here is the output.
# A tibble: 3 x 4
# St_name max_T_count min_T_count precip_count
# <fct> <int> <int> <int>
# 1 station1 1 17 11
# 2 station2 29 0 5
# 3 station3 0 13 14
Here is the data.
df <- structure(list(St_name = structure(1:3, .Label = c("station1",
"station2", "station3"), class = "factor"), Met_data = structure(c(2L,
3L, 1L), .Label = c(" 1985 14.43 2.97 951.6, 1986 15.41 3.37 415.6, 1987 15.08 4.34 1110, 1988 16.19 3.33 787.6, 1989 14.77 2.19 796.8, 1990 16.28 4.59 1213.6, 1991 16.72 4.67 907.4, 1992 14.74 4.18 935.6, 1993 15.22 5.06 903.1, 1994 15.46 2.79 907.5, 1995 15.34 4.21 1001.1, 1996 14.46 2.49 1204.5, 1997 14.95 2.95 819, 1998 17.5 5.3 1078.6, 1999 16.73 3.24 901.9, 2000 15.81 2.7 931.4, 2001 16.68 4.09 968.7, 2002 16.48 6.41 762.2, 2003 15.47 4.99 999.6, 2004 15.32 5.31 875.7, 2005 16.16 5.91 593.2, 2006 16.06 6.3 997.2, 2007 15.87 5.71 946, 2008 14.46 4.1 1128.1, 2009 14.26 4.38 1146.1, 2010 15.92 4.79 1037.6, 2011 15.25 5.47 1045.8, 2012 17.47 6.43 659.2, 2013 14.25 4 1092.9, 2014 13.26 2.98 1039.4",
" 1985 15.33 4.33 780.1, 1986 12.7 2.18 505.3, 1987 17.76 6.33 793.6, 1988 17.35 4.53 541, 1989 15.65 3.98 793.7, 1990 16.9 5.96 1169.4, 1991 16.42 5.26 790.6, 1992 14.99 5.04 932.6, 1993 13.96 4.75 1420.7, 1994 14.96 3.79 668.8, 1995 15 3.67 952.9, 1996 13.77 2.4 808.5, 1997 14.69 3.26 773.5, 1998 17.22 6.25 1126.4, 1999 16.35 4.32 921.9, 2000 14.55 2.83 893.9, 2001 15.71 4.33 1118.8, 2002 15.61 3.96 1000.4, 2003 14.83 2.84 911.7, 2004 14.9 4 965.1, 2005 16.16 4.7 647.7, 2006 16.18 5.14 800.8, 2007 15.52 4.15 890.3, 2008 14.35 2.91 1271.9, 2009 14.4 3.77 1343.8, 2010 15.32 4.57 1145.4, 2011 15.41 4.54 857.3, 2012 17.39 5.4 745, 2013 15.26 3.51 811.4, 2014 13.8 2.37 986.3",
" 1985 19.27 7.81 1465.5, 1986 20.37 8.81 1201.3, 1987 20.95 8.72 949.2, 1988 20.03 7.53 1104.6, 1989 19.11 7.42 1050.1, 1990 20.53 8.76 1486.2, 1991 20.21 9.53 1164.4, 1992 19.55 8.51 913.6, 1993 18.7 8.24 1485.1, 1994 19.43 8.42 1171.7, 1995 19.62 7.41 1084.9, 1996 19.01 6.29 1212.4, 1997 18.85 6.76 1243.2, 1998 21 8.27 1261.1, 1999 21.28 7.99 1122.4, 2000 19.99 7.74 1242.7, 2001 20.13 7.59 1305.8, 2002 20.13 7.69 1563, 2003 19.48 6.52 1237.1, 2004 19.94 7.42 1174.8, 2005 20.53 8.05 1140.5, 2006 20.16 7.18 1542, 2007 21.44 7.91 1167.8, 2008 17.6 5.51 1653.8, 2009 20.63 9.06 1326, 2010 21.31 8.7 1024.8, 2011 21.21 9.96 1847.6, 2012 22.22 9.39 782.5, 2013 20.46 9.29 770.7"
), class = "factor")), .Names = c("St_name", "Met_data"), class = "data.frame", row.names = c(NA,
-3L))
