I have an annual mean timeseries dataset for 15 years, and I am trying to find the average change/increase/decrease in this timeseries.
The timeseries I have is spatial (average values for each grid-cell/pixel, years repeat).
How can I do this in R via dplyr?
Sample data
year = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean = c(24, 24.5, 25.8,25, 24.8, 25, 23.5, 23.8, 24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)
Code
library(tidyverse)
df = data.frame(year, Tmean)
change = df$year %>%
# Sort by year
arrange(year) %>%
mutate(Diff_change = Tmean - lag(Tmean), # Difference in Tmean between years
Rate_percent = (Diff_change / year)/Tmean * 100) # Percent change # **returns inf values**
Average_change = mean(change$Rate_percent, na.rm = TRUE)
To find the average: mean(). To find the differences or changes: diff()
So, to find the average change:
> avg_change <- mean(diff(Tmean))
> print(avg_change)
[1] 0.06666667
If you need that as a percentage, express each difference (an element minus the previous one) relative to the previous element, like so:
> pct_change <- Tmean[2:length(Tmean)] / Tmean[1:(length(Tmean)-1)] - 1
> avg_pct_change <- mean(pct_change) * 100
> print(avg_pct_change)
[1] 0.3101632
We can put those vectors into a data frame to use with dplyr (...if that's how you want to do it; this is straightforward with base R as well).
library(dplyr)
df <- data.frame(year, Tmean)
change <- df %>%
  arrange(year) %>%
  mutate(Diff_change = Tmean - lag(Tmean),               # Difference in Tmean from the previous row
         Rate_percent = Diff_change / lag(Tmean) * 100)  # Percent change relative to the previous row
# Dividing by (year - lag(year)) would return Inf here: years repeat, so that difference is often 0
Average_change <- mean(change$Rate_percent, na.rm = TRUE)
Results (with updated question data)
> change
year Tmean Diff_change Rate_percent
1 2005 24.0 NA NA
2 2005 24.5 0.5 2.0833333
3 2005 25.8 1.3 5.3061224
4 2005 25.0 -0.8 -3.1007752
5 2006 24.8 -0.2 -0.8000000
6 2006 25.0 0.2 0.8064516
7 2006 23.5 -1.5 -6.0000000
8 2006 23.8 0.3 1.2765957
9 2007 24.8 1.0 4.2016807
10 2007 25.0 0.2 0.8064516
11 2007 25.2 0.2 0.8000000
12 2007 25.8 0.6 2.3809524
13 2008 25.3 -0.5 -1.9379845
14 2008 25.6 0.3 1.1857708
15 2008 25.2 -0.4 -1.5625000
16 2008 25.0 -0.2 -0.7936508
> Average_change
[1] 0.3101632
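Because the years repeat (one row per grid cell), the lagged differences above compare neighbouring rows rather than consecutive years. If the goal is the average year-to-year change, one option is to average within each year first. A minimal base R sketch under that assumption (the names annual, yearly_diff and yearly_pct are introduced here for illustration only):

```r
year  <- c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006,
           2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean <- c(24, 24.5, 25.8, 25, 24.8, 25, 23.5, 23.8,
           24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)
df <- data.frame(year, Tmean)

# One mean per year (aggregate() returns the rows sorted by year)
annual <- aggregate(Tmean ~ year, data = df, FUN = mean)

# Absolute and percent change between consecutive annual means
yearly_diff <- diff(annual$Tmean)
yearly_pct  <- 100 * yearly_diff / head(annual$Tmean, -1)

avg_diff <- mean(yearly_diff)  # average change per year
avg_pct  <- mean(yearly_pct)   # average percent change per year
```

With the sample data, avg_diff comes out at 0.15 (in the units of Tmean) per year. A per-cell trend would need a grid-cell identifier, which the sample data does not include.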
Related
I am trying to calculate the % change by year in the following dataset, does anyone know if this is possible?
I have the difference but am unsure how we can change this into a percentage
diff(economy_df_by_year$gdp_per_capita)
df
year gdp
1998 8142.
1999 8248.
2000 8211.
2001 7926.
2002 8366.
2003 10122.
2004 11493.
2005 12443.
2006 13275.
2007 15284.
Assuming that gdp is the total value, you could do something like this:
library(tidyverse)
tribble(
~year, ~gdp,
1998, 8142,
1999, 8248,
2000, 8211,
2001, 7926,
2002, 8366,
2003, 10122,
2004, 11493,
2005, 12443,
2006, 13275,
2007, 15284
) -> df
df |>
mutate(pdiff = 100*(gdp - lag(gdp))/gdp)
#> # A tibble: 10 × 3
#> year gdp pdiff
#> <dbl> <dbl> <dbl>
#> 1 1998 8142 NA
#> 2 1999 8248 1.29
#> 3 2000 8211 -0.451
#> 4 2001 7926 -3.60
#> 5 2002 8366 5.26
#> 6 2003 10122 17.3
#> 7 2004 11493 11.9
#> 8 2005 12443 7.63
#> 9 2006 13275 6.27
#> 10 2007 15284 13.1
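This relies on the tidyverse. Note that the code above divides by the current year's gdp; to express the change relative to the previous year, divide by lag(gdp) instead.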
If gdp is the difference, you will need the total to get a percentage, if that is what you mean by change in percentage by year.
df$change <- NA
df$change[2:10] <- (df$gdp[2:10] - df$gdp[1:9]) / df$gdp[1:9]
This assigns the yearly GDP growth (as a fraction, not a percentage) to each row except the first, which remains NA. Indexing with df$gdp works whether df is a data frame or a tibble.
df$diff <- c(0,diff(df$gdp))
df$percentDiff <- 100*(c(0,(diff(df$gdp)))/(df$gdp - df$diff))
This is another possibility.
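A compact base R sketch of the previous-year convention, assuming gdp holds yearly totals (the names pct_change and growth are illustrative, not from the question):

```r
year <- 1998:2007
gdp  <- c(8142, 8248, 8211, 7926, 8366, 10122, 11493, 12443, 13275, 15284)

# Change from the previous year, as a percentage of the previous year's value.
# diff() returns n - 1 values, so the first year is padded with NA.
pct_change <- c(NA, 100 * diff(gdp) / head(gdp, -1))

growth <- data.frame(year, gdp, pct_change)
```

Padding with NA rather than 0 keeps the first year from looking like a year with no growth.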
I have data for every census tract in the country at three time points (2000, 2013, 2019) that is grouped by CBSA. I'm trying to create a new variable called cont_chg_pedu_colplus that is the difference between the 2013 and 2000 value for pedu_colplus. So, in my example below, I want to create a new column called cont_chg_pedu_colplus that returns the value of 3.0 (14.6 - 11.6). Ideally, each group of tracts would have the same value, since I'm only interested in the difference between time 1 and time 2.
tractid year CBSA_name pedu_colplus
<chr> <dbl> <chr> <dbl>
1 48059030101 2000 Abilene, TX 11.6
2 48059030101 2013 Abilene, TX 14.6
3 48059030101 2019 Abilene, TX 20.6
4 48059030102 2000 Abilene, TX 11.6
5 48059030102 2013 Abilene, TX 14.2
6 48059030102 2019 Abilene, TX 20.2
Below is the code I have so far. It throws the following error, I think because I'm subsetting on just one year (37 rows instead of the 111 in the dataset). I'd prefer not to make my data wide, because I've got a bunch of other data manipulations I have to do. I couldn't get lag to work.
gent_vars_prelim <- outcome_data %>%
mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000], na.rm = TRUE) %>%
glimpse()
Problem with mutate() input cont_chg_pedu_colplus.
x Input cont_chg_pedu_colplus can't be recycled to size 37.
ℹ Input cont_chg_pedu_colplus is pedu_colplus[year == 2013] - pedu_colplus[year == 2000].
ℹ Input cont_chg_pedu_colplus must be size 37 or 1, not 0.
ℹ The error occurred in group 1: CBSA_name = "Abilene, TX", year = 2000
Any thoughts? Thanks.
I'll assume that for each unique pair of tractid and CBSA_name, there are up to 3 entries for year (possible values 2000, 2013, or 2019) and no two years are the same for a given pair of tractid and CBSA_name.
First, we'll group the values in the data frame by tractid and CBSA_name. Each group will have up to 3 rows, one for each year. We do this with dplyr::group_by(tractid, CBSA_name).
Next, we'll force the group to have all 3 years. We do this with tidyr::complete(year = c(2000, 2013, 2019)) as you suggested in the comments. (This is better than my comment using filter(n() == 3), because we actually wouldn't care if only 2019 were missing, and we are able to preserve incomplete groups.)
Then, we can compute the result you're interested in: dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]). We just need to dplyr::ungroup() after this and we're done.
Final code:
gent_vars_prelim <- outcome_data %>%
dplyr::group_by(tractid, CBSA_name) %>%
tidyr::complete(year = c(2000, 2013, 2019)) %>%
dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]) %>%
dplyr::ungroup() %>%
glimpse()
Test case:
outcome_data <- data.frame(tractid = c(48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030102, 48059030102, 48059030102, 48059030103),
year = c(2000, 2013, 2019, 2000, 2013, 2019, 2000, 2013, 2019, 2000),
CBSA_name = c("Abilene, TX", "Abilene, TX", "Abilene, TX", "Austin, TX", "Austin, TX", "Austin, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX"),
pedu_colplus = c(11.6, 14.6, 20.6, 8.4, 9.0, 9.6, 11.6, 14.2, 20.2, 4.0))
Result:
> tibble(gent_vars_prelim)
# A tibble: 12 x 1
gent_vars_prelim$tractid $CBSA_name $year $pedu_colplus $cont_chg_pedu_colplus
<dbl> <fct> <dbl> <dbl> <dbl>
1 48059030101 Abilene, TX 2000 11.6 3
2 48059030101 Abilene, TX 2013 14.6 3
3 48059030101 Abilene, TX 2019 20.6 3
4 48059030101 Austin, TX 2000 8.4 0.600
5 48059030101 Austin, TX 2013 9 0.600
6 48059030101 Austin, TX 2019 9.6 0.600
7 48059030102 Abilene, TX 2000 11.6 2.60
8 48059030102 Abilene, TX 2013 14.2 2.60
9 48059030102 Abilene, TX 2019 20.2 2.60
10 48059030103 Abilene, TX 2000 4 NA
11 48059030103 Abilene, TX 2013 NA NA
12 48059030103 Abilene, TX 2019 NA NA
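If adding placeholder rows is not wanted, the same per-tract difference can be sketched in base R with a small lookup table that is merged back onto the long data. This is only an illustration under the same assumption of at most one row per tract and year; the names keys, v2000, v2013 and wide are introduced here, and the data is a reduced version of the test case with tractid stored as character.

```r
outcome_data <- data.frame(
  tractid = c("48059030101", "48059030101", "48059030101",
              "48059030102", "48059030102", "48059030102",
              "48059030103"),
  year = c(2000, 2013, 2019, 2000, 2013, 2019, 2000),
  CBSA_name = "Abilene, TX",
  pedu_colplus = c(11.6, 14.6, 20.6, 11.6, 14.2, 20.2, 4.0)
)

keys <- c("tractid", "CBSA_name")

# One row per tract for each of the two years of interest
v2000 <- outcome_data[outcome_data$year == 2000, c(keys, "pedu_colplus")]
v2013 <- outcome_data[outcome_data$year == 2013, c(keys, "pedu_colplus")]

# Temporary lookup table; all = TRUE keeps tracts missing one of the years (as NA)
wide <- merge(v2000, v2013, by = keys, all = TRUE,
              suffixes = c("_2000", "_2013"))
wide$cont_chg_pedu_colplus <- wide$pedu_colplus_2013 - wide$pedu_colplus_2000

# Attach the per-tract difference to every row of the long data
result <- merge(outcome_data, wide[c(keys, "cont_chg_pedu_colplus")],
                by = keys, all.x = TRUE)
```

The main data stays long, and the row count is unchanged: a tract without a 2013 value gets NA instead of extra rows.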
I have the following dataset (32000 entries) of water chemical compounds annual means organized by monitoring sites and sampling year:
data= data.frame(Site_ID=c(1, 1, 1, 2, 2, 2, 3, 3, 3), Year=c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005), AnnualMean=c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9))
Site_ID Year AnnualMean
1 1976 1.1
1 1977 1.2
1 1978 1.1
2 2004 2.1
2 2005 2.6
2 2006 3.1
3 2003 2.7
3 2004 2.6
3 2005 1.9
I would like to keep only the data from monitoring sites that have at least one measurement in 2005 in their time range. With the above dataset, the expected output dataset would be:
Site_ID Year AnnualMean
2 2004 2.1
2 2005 2.6
2 2006 3.1
3 2003 2.7
3 2004 2.6
3 2005 1.9
I am completely new to R and have been struggling with data manipulation, so thank you in advance!
With dplyr:
library(dplyr)
data %>%
group_by(Site_ID) %>%
filter(2005 %in% Year)
Here is a base R solution, using subset + ave:
dfout <- subset(data, as.logical(ave(Year, Site_ID, FUN = function(x) 2005 %in% x)))
such that
> dfout
Site_ID Year AnnualMean
4 2 2004 2.1
5 2 2005 2.6
6 2 2006 3.1
7 3 2003 2.7
8 3 2004 2.6
9 3 2005 1.9
An option with data.table
library(data.table)
setDT(data)[, .SD[2005 %in% Year], Site_ID]
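For readers new to R, the same selection can also be written as two explicit steps without any grouping helper. A minimal sketch using the question's sample data (sites_2005 is a name introduced here for illustration):

```r
data <- data.frame(
  Site_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Year       = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
  AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9)
)

# Step 1: which sites have a row for 2005?
sites_2005 <- unique(data$Site_ID[data$Year == 2005])

# Step 2: keep every row belonging to one of those sites
dfout <- data[data$Site_ID %in% sites_2005, ]
```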
I have a question about estimating a regression model in R. I have the following data (example):
Year XY
2002 5
2003 2
2004 4
2005 8
2006 3
2007 5
2008 10
the regression model I want to estimate is:
XY = B0 + Y2005 + Y2006 + Y2007 + Y2008 + e
Where Y2005,Y2006,Y2007,and Y2008 are yearly indicator variables that take the value of 1 for the year 2005, 2006, 2007, 2008 and 0 otherwise.
What I need to do is to compare the value of (XY) in 2005, 2006, 2007, and 2008 to the mean value of (XY) in the period of (2002-2004).
I hope you can help me to figure out this issue and thank you in advance for your help.
DF <- read.table(text = "Year XY
2002 5
2003 2
2004 4
2005 8
2006 3
2007 5
2008 10", header = TRUE)
DF$facYear <- DF$Year
DF$facYear[DF$facYear < 2005] <- "baseline"
DF$facYear <- factor(DF$facYear)
#make sure that baseline is used as intercept:
DF$facYear <- relevel(DF$facYear, "baseline")
fit <- lm(XY ~ facYear, data = DF)
summary(fit)
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 3.6667 0.8819 4.158 0.0533 .
#facYear2005 4.3333 1.7638 2.457 0.1333
#facYear2006 -0.6667 1.7638 -0.378 0.7418
#facYear2007 1.3333 1.7638 0.756 0.5286
#facYear2008 6.3333 1.7638 3.591 0.0696 .
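To see why this answers the question, note that with "baseline" as the reference level the intercept is the 2002-2004 mean and each year coefficient is that year's XY minus the baseline mean. A short sketch that checks this by hand (baseline_mean and manual are names introduced here; the factor is built with ifelse for brevity, which is a small variation on the code above):

```r
DF <- data.frame(Year = 2002:2008, XY = c(5, 2, 4, 8, 3, 5, 10))

# Years before 2005 share one level; "baseline" is made the reference level
DF$facYear <- relevel(factor(ifelse(DF$Year < 2005, "baseline", DF$Year)),
                      "baseline")

fit <- lm(XY ~ facYear, data = DF)

# Hand computation: mean of the baseline period and each later year's deviation
baseline_mean <- mean(DF$XY[DF$Year < 2005])
manual <- DF$XY[DF$Year >= 2005] - baseline_mean

# Side-by-side comparison of the model coefficients and the hand computation
comparison <- cbind(coefficient = coef(fit)[-1], manual = manual)
```

Each year level has a single observation, so those coefficients reproduce the observed differences exactly; the standard errors rest only on the spread within the three baseline years.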
I am very new to R. I have daily observations of temperature and PP for a 12-year period (6574 rows, 6 columns, some NA). I want to calculate, for example, the average from the 1st to the 10th day of January 2001, then from the 11th to the 20th, and finally from the 21st to the 31st, and so on for every month until December, for each year in the period mentioned above.
I also have a problem because February has 28 or 29 days (leap years).
This is how I read my file, which is a CSV, with read.table:
# READ CSV
setwd ("C:\\Users\\GVASQUEZ\\Documents\\ESTUDIO_PAMPAS\\R_sheet")
huancavelica<-read.table("huancavelica.csv",header = TRUE, sep = ",",
dec = ".", fileEncoding = "latin1", nrows = 6574 )
This is the output of my CSV file
Año Mes Dia PT101 TM102 TM103
1 1998 1 1 6.0 15.6 3.4
2 1998 1 2 8.0 14.4 3.2
3 1998 1 3 8.6 13.8 4.4
4 1998 1 4 5.6 14.6 4.6
5 1998 1 5 0.4 17.4 3.6
6 1998 1 6 3.4 17.4 4.4
7 1998 1 7 9.2 14.6 3.2
8 1998 1 8 2.2 16.8 2.8
9 1998 1 9 8.6 18.4 4.4
10 1998 1 10 6.2 15.0 3.6
. . . . . . .
With the data setup that you have, a fairly tried and true method should work:
# add 0 in front of single digit month variable to account for 1 and 10 sorting
huancavelica$MesChar <- ifelse(nchar(huancavelica$Mes)==1,
paste0("0",huancavelica$Mes), as.character(huancavelica$Mes))
# get time of month ID
huancavelica$timeMonth <- ifelse(huancavelica$Dia < 11, 1,
                                 ifelse(huancavelica$Dia > 20, 3, 2))
# get final ID
huancavelica$ID <- paste(huancavelica$Año, huancavelica$MesChar, huancavelica$timeMonth, sep=".")
# average stat
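# ave() treats extra arguments as grouping variables, so na.rm must be set inside FUN
huancavelica$myStat <- ave(huancavelica$PT101, huancavelica$ID, FUN = function(x) mean(x, na.rm = TRUE))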
We can try
library(data.table)
setDT(df1)[, Grp := (Dia - 1)%/%10+1, by = .(Ano, Mes)
][Grp>3, Grp := 3][,lapply(.SD, mean, na.rm=TRUE), by = .(Ano, Mes, Grp)]
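The integer-division grouping used above handles month length on its own: days 21 onward all land in block 3, whether the month ends on day 28, 29, 30 or 31. A base R sketch on made-up data covering a leap-year February (the data frame daily and its values are hypothetical, and the column names only mimic the question's):

```r
# Hypothetical daily data for January to March 2000 (2000 is a leap year)
dates <- seq(as.Date("2000-01-01"), as.Date("2000-03-31"), by = "day")
daily <- data.frame(
  Ano   = as.integer(format(dates, "%Y")),
  Mes   = as.integer(format(dates, "%m")),
  Dia   = as.integer(format(dates, "%d")),
  PT101 = seq_along(dates)  # made-up values, for illustration only
)

# Days 1-10 -> 1, 11-20 -> 2, 21 to end of month -> 3 (day 31 is capped at 3)
daily$Grp <- pmin((daily$Dia - 1) %/% 10 + 1, 3)

# Mean and number of days per year / month / ten-day block.
# aggregate()'s formula interface drops NA rows by default (na.action = na.omit).
means  <- aggregate(PT101 ~ Ano + Mes + Grp, data = daily, FUN = mean)
counts <- aggregate(PT101 ~ Ano + Mes + Grp, data = daily, FUN = length)
```

In counts, the third block of February 2000 holds 9 days and the third block of January holds 11, so no special handling of leap years is needed.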
It adds a bit more complexity, but you could cut each month into thirds and get the average for each third. For example:
library(dplyr)
library(lubridate)
# Fake data
set.seed(10)
df = data.frame(date=seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="1 day"),
value=rnorm(365))
# Cut months into thirds
df = df %>%
mutate(mon_yr = paste0(month(date, label=TRUE, abbr=TRUE) , " ", year(date))) %>%
group_by(mon_yr) %>%
mutate(cutMonth = cut(day(date),
breaks=c(0, round(1/3*n()), round(2/3*n()), n()),
labels=c("1st third","2nd third","3rd third")),
cutMonth = paste0(mon_yr, ", ", cutMonth)) %>%
ungroup %>%
mutate(cutMonth = factor(cutMonth, levels=unique(cutMonth)))
date value cutMonth
1 2015-01-01 0.01874617 Jan 2015, 1st third
2 2015-01-02 -0.18425254 Jan 2015, 1st third
3 2015-01-03 -1.37133055 Jan 2015, 1st third
...
363 2015-12-29 -1.3996571 Dec 2015, 3rd third
364 2015-12-30 -1.2877952 Dec 2015, 3rd third
365 2015-12-31 -0.9684155 Dec 2015, 3rd third
# Summarise to get average value for each 1/3 of a month
df.summary = df %>%
group_by(cutMonth) %>%
summarise(average.value = mean(value))
cutMonth average.value
1 Jan 2015, 1st third -0.49065685
2 Jan 2015, 2nd third 0.28178222
3 Jan 2015, 3rd third -1.03870698
4 Feb 2015, 1st third -0.45700203
5 Feb 2015, 2nd third -0.07577199
6 Feb 2015, 3rd third 0.33860882
7 Mar 2015, 1st third 0.12067388
...