Aggregating by fixed date range in R

Given a simplification of my dataset like:
df <- data.frame("ID" = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
                 "ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
                                  "pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
                 "Date" = c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                            "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                            "1989.03.01","1989.05.01","1989.07.01","1989.08.01",
                            "1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                            "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                            "1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
                 "NDVI" = c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
                            0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
                 check.names = FALSE, stringsAsFactors = FALSE)
I would like to obtain the mean of the NDVI values over a certain period of time, in this case by year. Note that in my real dataset I will need this for seasons as well, so the approach should be adaptable.
These means should:
Trim outliers: for example, the 25% highest and the 25% lowest values.
Be computed by class, in this case by the ID field.
So the output should look something like:
> desired_df
  ID ForestType Date meanNDVI
1  1        oak 1987    0.250
2  1        oak 1988    0.325
3  1        oak 1989    0.430
4  2       pine 1987    0.350
5  2       pine 1988    0.635
6  2       pine 1989    0.740
In this case, for example, 0.250 corresponds to the mean NDVI for 1987 and ID = 1: it is the mean of that year's 4 values after dropping the lowest and the highest.
Thanks a lot!

library(tidyverse)
library(lubridate)
df %>%
  mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
  group_by(ID, ForestType, Year = year(Date)) %>%
  filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
  summarise(meanNDVI = mean(NDVI))
Output
# A tibble: 6 x 4
# Groups:   ID, ForestType [2]
     ID ForestType  Year meanNDVI
  <dbl> <chr>      <dbl>    <dbl>
1     1 oak         1987    0.25
2     1 oak         1988    0.325
3     1 oak         1989    0.475
4     2 pine        1987    0.35
5     2 pine        1988    0.635
6     2 pine        1989    0.74
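Since the question also asks about seasons, the same pipeline can be adapted by grouping on a season label instead of (or in addition to) the year. A hedged sketch, assuming a simple northern-hemisphere month-to-season mapping and using mean(, trim = .25) for the trimming:

df %>%
  mutate(Date = as.Date(Date, format = "%Y.%m.%d"),
         # assumed mapping; note December is grouped with Jan/Feb of the same calendar year
         Season = case_when(month(Date) %in% c(12, 1, 2) ~ "winter",
                            month(Date) %in% 3:5         ~ "spring",
                            month(Date) %in% 6:8         ~ "summer",
                            TRUE                         ~ "autumn")) %>%
  group_by(ID, ForestType, Year = year(Date), Season) %>%
  summarise(meanNDVI = mean(NDVI, trim = 0.25))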

The classical base R approach uses aggregate; the year can be obtained with substr.
res <- with(df, aggregate(list(meanNDVI = NDVI),
                          by = list(ID = ID, ForestType = ForestType, date = substr(Date, 1, 4)),
                          FUN = mean))
res[order(res$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987   0.2875
# 3  1        oak 1988   0.3575
# 5  1        oak 1989   0.5150
# 2  2       pine 1987   0.3575
# 4  2       pine 1988   0.6425
# 6  2       pine 1989   0.7725
Trimmed version
Trimmed for 25% outliers on each side:
res2 <- with(df, aggregate(list(meanNDVI = NDVI),
                           by = list(ID = ID, ForestType = ForestType, date = substr(Date, 1, 4)),
                           FUN = mean, trim = .25))
res2[order(res2$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987    0.250
# 3  1        oak 1988    0.325
# 5  1        oak 1989    0.475
# 2  2       pine 1987    0.350
# 4  2       pine 1988    0.635
# 6  2       pine 1989    0.740
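To see what trim = .25 does here: with four values per group it removes exactly one value from each end before averaging, which is why these figures match the desired output. For example:

x <- c(0.1, 0.2, 0.3, 0.55)  # ID 1, year 1987
mean(x, trim = 0.25)         # 0.25
mean(sort(x)[2:3])           # identical: lowest and highest value dropped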

Using the data.table package, you could proceed as follows:
library(data.table)
setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]
#    ID ForestType year meanNDVI
# 1:  1        oak 1987    0.250
# 2:  1        oak 1988    0.325
# 3:  1        oak 1989    0.475
# 4:  2       pine 1987    0.350
# 5:  2       pine 1988    0.635
# 6:  2       pine 1989    0.740

Another option: you can set trim directly in mean().
library(tidyverse)
library(lubridate)
df %>%
  mutate(Date = ymd(Date) %>% year()) %>%
  group_by(ID, ForestType, Date) %>%
  summarise(mean = mean(NDVI, trim = 0.25, na.rm = TRUE))

Related

How to calculate the MSE in multiple rows of a data frame in R?

I have a data frame (datos) with eight columns and 2006 observations.
From these columns I want to calculate the MSE between Pcp_Estacion and Pcp_Chirps, using the mse() function, but per station and per month, so that I end up with a data frame holding one value for each month and each weather station. In the example below I calculate the MSE for my five weather stations for the month of July.
# load libraries
library(tidyverse)
library(dplyr)
library(Metrics)
library(MLmetrics)
# look at the first 10 rows
head(datos, 10)
    X Mes Year Estacion variable  n Pcp_Chirps Pcp_Estacion
1   1   1 1982    11024      Pcp 30      0.262        0.000
2   2   1 1982    11033      Pcp 31      0.190        0.045
3   3   1 1982    11141      Pcp 31      0.265        0.000
4   4   2 1982    11024      Pcp 28      0.317        0.286
5   5   2 1982    11033      Pcp 28      0.242        0.629
6   6   2 1982    11141      Pcp 28      0.351        0.500
7   7   3 1982    11024      Pcp 31      0.000        2.903
8   8   3 1982    11033      Pcp 31      0.148        0.000
9   9   3 1982    11141      Pcp 31      0.000        0.000
10 10   4 1982    11024      Pcp 30      0.543        0.800
# Calculate the July mse() for each weather station
mse_11024_7 <- filter(datos, Mes == 7, Estacion %in% c("11024"))
mse_11033_7 <- filter(datos, Mes == 7, Estacion %in% c("11033"))
mse_11060_7 <- filter(datos, Mes == 7, Estacion %in% c("11060"))
mse_11096_7 <- filter(datos, Mes == 7, Estacion %in% c("11096"))
mse_11141_7 <- filter(datos, Mes == 7, Estacion %in% c("11141"))
# check the result
mse(mse_11024_7$Pcp_Estacion, mse_11024_7$Pcp_Chirps)
mse(mse_11033_7$Pcp_Estacion, mse_11033_7$Pcp_Chirps)
mse(mse_11060_7$Pcp_Estacion, mse_11060_7$Pcp_Chirps)
mse(mse_11096_7$Pcp_Estacion, mse_11096_7$Pcp_Chirps)
mse(mse_11141_7$Pcp_Estacion, mse_11141_7$Pcp_Chirps)
Is there a faster way to do all this at once, for all months and weather stations?
Here is the example data:
https://drive.google.com/drive/folders/19h7u0GzGO1okjhO3RLREY0QKY8DOoTy-?usp=sharing
You’ll have a better chance of getting a helpful answer if you provide a minimal reproducible example. However, the following should give you what you need:
library(dplyr)
datos %>%
  filter(Mes == 7) %>%
  group_by(Estacion) %>%
  summarise(Mse = mse(Pcp_Estacion, Pcp_Chirps))
The approach is to partition (i.e. ‘group’) the data by values of Estacion before using summarise() to compute the mse for each group.
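To cover all months and stations at once, as the question asks, the filter can be dropped and the month added to the grouping. A sketch, keeping the question's own argument order mse(Pcp_Estacion, Pcp_Chirps):

library(dplyr)
library(Metrics)
datos %>%
  group_by(Estacion, Mes) %>%                     # one group per station-month pair
  summarise(Mse = mse(Pcp_Estacion, Pcp_Chirps))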

Inflation rate from the CPI for multiple countries, with R

I have to calculate the inflation rate from 2015 to 2019 using the CPI, which I have for each month across those years. This means calculating the percentage growth rate relative to the same month of the previous year.
I am asked to do this for several countries and then to compute or show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, lubridate packages, but I can't do the CPI conversion.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
UPDATE
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error:
data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%
  mutate((change=(x-lag(x,1))/lag(x,1)*100))
| Antigua_Barbuda | -1.55 |
|-----------------|-------|
| Argentina       |  1.03 |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019
We can use data.table methods
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
     id.var = c('date', 'year', 'period', 'periodName'))[,
  .(value = mean(value)), .(variable, year)][,
  adj_factor := value/value[year == 2014], by = variable][]
#           variable year     value adj_factor
# 1: Antigua_Barbuda 2014  96.40000  1.0000000
# 2: Antigua_Barbuda 2015  96.55833  1.0016424
# 3: Antigua_Barbuda 2016  96.08333  0.9967151
# 4: Antigua_Barbuda 2017  98.40833  1.0208333
# 5: Antigua_Barbuda 2018  99.62500  1.0334544
# 6: Antigua_Barbuda 2019 101.07500  1.0484959
# 7:       Argentina 2014  56.60000  1.0000000
# ..
You should read your data with header = TRUE, since the first row contains the column names. Then get your data into long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to perform the same steps as your attempt, i.e. divide all values by the value in the year 2014 for each country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%
  mutate(adj_factor = value/value[year == 2014])
#    name             year value adj_factor
#    <chr>           <int> <dbl>      <dbl>
#  1 Antigua_Barbuda  2014  96.4      1
#  2 Antigua_Barbuda  2015  96.6      1.00
#  3 Antigua_Barbuda  2016  96.1      0.997
#  4 Antigua_Barbuda  2017  98.4      1.02
#  5 Antigua_Barbuda  2018  99.6      1.03
#  6 Antigua_Barbuda  2019 101.       1.05
#  7 Argentina        2014  56.6      1
#  8 Argentina        2015  64.0      1.13
#  9 Argentina        2016  89.9      1.59
# 10 Argentina        2017 113.       2.00
# … with 20 more rows
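To get from the adjustment factor to the year-over-year inflation rate the question asks for, a hedged sketch along the lines of the teacher's hint, building on the same long format (the final summarise averages the yearly rates over the period for each country):

library(dplyr)
library(tidyr)
data %>%
  pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%                            # annual mean CPI
  mutate(change = (value - lag(value)) / lag(value) * 100) %>%  # year-over-year % change
  summarise(mean_inflation = mean(change, na.rm = TRUE))        # average over the period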

Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset; a dummy version of it can be created as below.
studentID <- 1:250
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min = 10000, max = 12000)
payment2019 <- runif(250, min = 11000, max = 13000)
payment2020 <- runif(250, min = 12000, max = 14000)
attendance2018 <- runif(250, min = 0.75, max = 1)
attendance2019 <- runif(250, min = 0.75, max = 1)
attendance2020 <- runif(250, min = 0.75, max = 1)
untidy_df <- data.frame(studentID, score2018, score2019, score2020,
                        payment2018, payment2019, payment2020,
                        attendance2018, attendance2019, attendance2020)
I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.
Thanks in advance!
With tidyr you can use pivot_longer:
library(tidyr)
untidy_df %>%
  pivot_longer(cols = -studentID,
               names_to = c(".value", "year"),
               names_pattern = "(\\w+)(\\d{4})")
Output
# A tibble: 750 x 5
   studentID year    score payment attendance
       <int> <chr>   <dbl>   <dbl>      <dbl>
 1         1 2018  0.432    10762.      0.786
 2         1 2019  0.948    11340.      0.909
 3         1 2020  0.122    12837.      0.944
 4         2 2018  0.422    11515.      0.950
 5         2 2019  0.0639   12968.      0.828
 6         2 2020  0.611    13645.      0.901
 7         3 2018  0.489    11281.      0.784
 8         3 2019  0.00337  12250.      0.753
 9         3 2020  0.711    12898.      0.803
10         4 2018  0.0596   10526.      0.842
Using base R:
tidy_df <- reshape(untidy_df, direction = "long", idvar = "studentID", varying = 2:10, sep = "")
head(tidy_df)
       studentID time      score  payment attendance
1.2018         1 2018 0.86743970 10995.45  0.9473540
2.2018         2 2018 0.53204701 11152.74  0.8167776
3.2018         3 2018 0.90072918 10631.06  0.9335316
4.2018         4 2018 0.89154492 11889.23  0.9098399
5.2018         5 2018 0.06320442 10973.20  0.8118909
6.2018         6 2018 0.67519166 11751.67  0.8328860
If you want "year" instead of the default "time", add timevar = "year".
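For example:

tidy_df <- reshape(untidy_df, direction = "long", idvar = "studentID",
                   varying = 2:10, sep = "", timevar = "year")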
We could try:
library(dplyr)
library(tidyr)
untidy_df %>%
  pivot_longer(cols = -studentID) %>%
  separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>%
  pivot_wider(names_from = measure, values_from = value)
Which returns:
   studentID year  score payment attendance
       <int> <chr> <dbl>   <dbl>      <dbl>
 1         1 2018  0.807  10179.      0.974
 2         1 2019  0.599  11601.      0.785
 3         1 2020  0.515  12347.      0.760
 4         2 2018  0.474  11154.      0.983
 5         2 2019  0.409  11682.      0.864
 6         2 2020  0.688  13756.      0.812
 7         3 2018  0.509  11746.      0.870
 8         3 2019  0.867  12851.      0.801
 9         3 2020  0.878  12710.      0.955
10         4 2018  0.621  11165.      0.975

How to subtract each Country's value by year

I have data for each country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made a data frame for each year of the reports. Now I don't know how to subtract the values for each year from each other, e.g. how did the happiness rank change from 2015 to 2016 and from 2016 to 2017? I'd like to make a new data frame of differences for each.
I was able to bind the tables on their common columns and started working on removing countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
I don't have a clear path to accomplish this yet, but the result would look something like:
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
  group_by(Country) %>%
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
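Putting those remarks together, a hedged sketch of the fuller pipeline (the filter step is optional):

library(dplyr)
rank_diff_df <- df %>%
  group_by(Country) %>%
  filter(n() == 3) %>%                  # keep countries present in all 3 years
  arrange(Year, .by_group = TRUE) %>%   # make sure years are ordered within country
  mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))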
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
  select(Year, Country, Happiness.Rank, Rank.Diff) %>%
  arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups:   Country [166]
   Year  Country     Happiness.Rank Rank.Diff
   <chr> <fct>                <int>     <int>
 1 2015  Afghanistan            153        NA
 2 2016  Afghanistan            154         1
 3 2017  Afghanistan            141       -13
 4 2015  Albania                 95        NA
 5 2016  Albania                109        14
 6 2017  Albania                109         0
 7 2015  Algeria                 68        NA
 8 2016  Algeria                 38       -30
 9 2017  Algeria                 53        15
10 2015  Angola                 137        NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
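For instance, a minimal ggplot2 sketch (the country subset is arbitrary):

library(ggplot2)
rank_diff_df %>%
  filter(Country %in% c("Afghanistan", "Albania", "Algeria"),  # arbitrary subset
         !is.na(Rank.Diff)) %>%
  ggplot(aes(x = Year, y = Rank.Diff, colour = Country, group = Country)) +
  geom_line() +
  geom_point()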
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
                           Y2015.2016 = df_wide$Happiness.Rank.y - df_wide$Happiness.Rank.x,
                           Y2016.2017 = df_wide$Happiness.Rank - df_wide$Happiness.Rank.y)
Which returns:
head(rank_diff_df, 10)
       Country Y2015.2016 Y2016.2017
1  Afghanistan          1        -13
2      Albania         14          0
3      Algeria        -30         15
4       Angola          4         -1
5    Argentina         -4         -2
6      Armenia         -6          0
7    Australia         -1          1
8      Austria         -1          1
9   Azerbaijan          1          4
10     Bahrain         -7         -1
Assuming the three datasets are present in your environment as data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns listed in the keepcols vector. Then arrange the data by Country and Year, group_by Country, keep only those countries present in all 3 years, and subtract each value from the previous row's value using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  filter(n() == 3) %>%
  mutate_at(-1, ~ . - lag(.)) # or: mutate_at(-1, ~ c(NA, diff(.)))
# A tibble: 438 x 10
# Groups:   Country [146]
#    Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
#    <chr>            <int>            <dbl>  <dbl>            <dbl>   <dbl>
#  1 Afghan…             NA          NA      NA            NA        NA
#  2 Afghan…              1           0.0624 -0.192        -0.130    -0.0698
#  3 Afghan…            -13           0.0192  0.471         0.00731  -0.0581
#  4 Albania             NA          NA      NA            NA        NA
#  5 Albania             14           0.0766 -0.303        -0.0832   -0.0387
#  6 Albania              0           0.0409  0.302         0.00109   0.0628
#  7 Algeria             NA          NA      NA            NA        NA
#  8 Algeria            -30           0.113  -0.245         0.00038  -0.0757
#  9 Algeria             15           0.0392  0.313        -0.000455  0.0233
# 10 Angola              NA          NA      NA            NA        NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
#   Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The first row for each Country is always NA; every remaining value is the difference from the previous year's value.

Finding weekly returns from daily returns company-wise

I have data which look something like this
co_code company_name co_stkdate dailylogreturn
      1            A 01-01-2000           0.76
      1            A 02-01-2000           0.75
.
.
.
      1            A 31-12-2019           0.54
      2            B 01-01-2000           0.98
      2            B 02-01-2000           0.45
.
.
And so on
I want to find weekly returns, i.e. the sum of the daily log returns over each week.
The output should look something like this:
co_code company_name co_stkdate weeklyreturns
      1            A 07-01-2000          1.34
      1            A 14-01-2000          0.95
.
.
.
      1            A 31-12-2019          0.54
      2            B 07-01-2000          0.98
      2            B 14-01-2000          0.45
I tried to apply functions from the quantmod package, but those only work on xts objects, and group_by() can't be used on xts objects. I therefore want to work with a plain data frame.
My code looks something like this:
library(dplyr)
# read the csv file
df <- read.csv("33339_1_120_20190405_165913_dat.csv")
# calculate daily log returns
df <- mutate(df, dailylogrtn = log(nse_returns)) %>% as.data.frame()
# format the date
df$co_stkdate <- as.Date(as.character(df$co_stkdate), format = "%d-%m-%Y")
Since we don't know on how many days of each week you have a dailylogreturn, and there might be NAs, I recommend grouping by week and year:
# sample data
df <- data.frame(co_stkdate = rep(seq.Date(from = as.Date("2000-01-07"), to = as.Date("2000-02-07"), by = 1), 2),
                 dailylogreturn = abs(round(rnorm(64, 1, 1), 2)),
                 company_name = rep(c("A", "B"), each = 32))
df %>%
  mutate(co_stkdate = as.POSIXct(co_stkdate),
         year = strftime(co_stkdate, "%Y"),
         week = strftime(co_stkdate, "%W")) %>%
  group_by(company_name, year, week) %>%
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))
# A tibble: 12 x 4
# Groups:   company_name, year [2]
   company_name year  week  weeklyreturns
   <fct>        <chr> <chr>         <dbl>
 1 A            2000  01             6.31
 2 A            2000  02             6.11
 3 A            2000  03             6.02
 4 A            2000  04             8.27
 5 A            2000  05             4.92
 6 A            2000  06             0.5
 7 B            2000  01             1.82
 8 B            2000  02             6.6
 9 B            2000  03             7.55
10 B            2000  04             7.63
11 B            2000  05             7.54
12 B            2000  06             1.03
Since I don't have sample data, I assume this should work:
df %>%
  group_by(company_name, group = ceiling(row_number() / 7)) %>%
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))
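Alternatively, if co_stkdate is kept as class Date, base R's cut() can bucket dates by calendar week directly, labelling each group with the week's start date (a sketch under that assumption):

df %>%
  group_by(company_name, week = as.Date(cut(co_stkdate, breaks = "week"))) %>%
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))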
