Shift a column of a dataframe by one per ID in R

I have a dataframe containing three columns, where the first is an ID, the second denotes a year and the third column is the value associated with the ID in that year:
df_in <- data.frame("ID" = c(1, 1, 1, 1, 1, 1,
                             2, 2, 2, 2,
                             3, 3, 3),
                    "yr" = c(2001, 2002, 2003, 2004, 2005, 2006,
                             2002, 2003, 2004, 2005,
                             2003, 2004, 2005),
                    "val" = c(1, 2, 3, 4, 5, 6,
                              4, 5, 6, 7,
                              7, 8, 9))
I would like to introduce a lag in my val column per ID: looking at ID == 1, for example, the value at yr == 2002 should be shifted to yr == 2001, the value at yr == 2003 to yr == 2002, and so on. This should be the case for all unique IDs.
The row corresponding to the last year (which now doesn't have a value due to the shift) should be deleted. We ultimately end up with
df_out <- data.frame("ID" = c(1, 1, 1, 1, 1,
                              2, 2, 2,
                              3, 3),
                     "yr" = c(2001, 2002, 2003, 2004, 2005,
                              2002, 2003, 2004,
                              2003, 2004),
                     "val" = c(2, 3, 4, 5, 6,
                               5, 6, 7,
                               8, 9))
Is there an easy way to do this in dplyr?

library(dplyr)

df_out <-
  df_in %>%
  group_by(ID) %>%
  mutate(yr = lag(yr)) %>%   # pair each val with the preceding year
  filter(!is.na(yr)) %>%     # drop each ID's first row, which has no preceding year
  ungroup()
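Equivalently, you can keep the years in place and pull each following year's value backwards with lead(); a minimal sketch of the same idea, assuming the df_in from the question:

library(dplyr)

df_out <-
  df_in %>%
  group_by(ID) %>%
  mutate(val = lead(val)) %>%   # take the next year's value for the current year
  filter(!is.na(val)) %>%       # drop each ID's last year, which has no successor
  ungroup()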

To get the requested result, you can also use do():
df_in %>%
  group_by(ID) %>%
  do(data.frame(yr = head(.$yr, -1L), val = tail(.$val, -1L)))
The result:
# A tibble: 10 x 3
# Groups:   ID [3]
      ID    yr   val
   <dbl> <dbl> <dbl>
 1  1.00  2001  2.00
 2  1.00  2002  3.00
 3  1.00  2003  4.00
 4  1.00  2004  5.00
 5  1.00  2005  6.00
 6  2.00  2002  5.00
 7  2.00  2003  6.00
 8  2.00  2004  7.00
 9  3.00  2003  8.00
10  3.00  2004  9.00

Related

Inflation rate with the CPI for multiple countries, with R

I have to calculate the inflation rate from 2015 to 2019. I have to do this with the CPI, which I have for each month during those years. This means I have to calculate the percentage growth rate relative to the same month of the previous year.
I am asked to do the calculation for several countries and then to calculate or show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, lubridate packages, but I can't do the CPI conversion.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
UPDATE
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error.
data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%
  mutate((change=(x-lag(x,1))/lag(x,1)*100))
| Antigua_Barbuda | -1.55 |
|-----------------|-------|
| Argentina       |  1.03 |
| Aruba           | -1.52 |
| Bahamas         | -1.56 |
| Barbados        | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019.
We can use data.table methods:
library(data.table)
# read, reshape to long, average by country-year, then index to 2014 within each country
melt(fread("https://pastebin.com/raw/6cetukKb"),
     id.var = c('date', 'year', 'period', 'periodName'))[,
  .(value = mean(value)), .(variable, year)][,
  adj_factor := value/value[year == 2014], variable][]
#           variable year     value adj_factor
# 1: Antigua_Barbuda 2014  96.40000  1.0000000
# 2: Antigua_Barbuda 2015  96.55833  1.0016424
# 3: Antigua_Barbuda 2016  96.08333  0.9967150
# 4: Antigua_Barbuda 2017  98.40833  1.0208333
# 5: Antigua_Barbuda 2018  99.62500  1.0334544
# 6: Antigua_Barbuda 2019 101.07500  1.0484959
# 7:       Argentina 2014  56.60000  1.0000000
# ..
You should read your data with header = TRUE since the first row contains the column names. Then get your data into long format, which makes the calculation easy.
After this you can perform whichever calculation you want. For example, to perform the same steps as your attempt, i.e. divide all the values by the value in the year 2014 for each country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)

data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  group_by(name, year) %>%
  summarise(value = mean(value)) %>%
  mutate(adj_factor = value/value[year == 2014])
#    name             year value adj_factor
#    <chr>           <int> <dbl>      <dbl>
#  1 Antigua_Barbuda  2014  96.4      1
#  2 Antigua_Barbuda  2015  96.6      1.00
#  3 Antigua_Barbuda  2016  96.1      0.997
#  4 Antigua_Barbuda  2017  98.4      1.02
#  5 Antigua_Barbuda  2018  99.6      1.03
#  6 Antigua_Barbuda  2019 101.       1.05
#  7 Argentina        2014  56.6      1
#  8 Argentina        2015  64.0      1.13
#  9 Argentina        2016  89.9      1.59
# 10 Argentina        2017 113.       2.00
# … with 20 more rows
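If you also need the year-over-year growth rate the hint points at (the percentage change against the same month of the previous year), a hedged sketch on the same long-format monthly data, assuming the date column sorts chronologically within each country:

library(dplyr)

data %>%
  tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
  arrange(name, date) %>%                 # chronological order within each country
  group_by(name) %>%
  mutate(change = (value - lag(value, 12)) / lag(value, 12) * 100) %>%  # vs same month last year
  filter(year >= 2015) %>%                # keep the requested period
  summarise(mean_inflation = mean(change, na.rm = TRUE))  # average over 2015-2019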

Mean temperature by month group in R

I am trying to calculate the mean temperature per month from daily records between 1988 and 2020 using the following code:
(Temperature_year_month <- na.omit(database_PE_na) %>%
  group_by(month) %>%
  summarise(mean_temp_monthYear = mean(Air.Temp.Mean)))
and I got the following results, which I checked in Excel and which seem correct:
# A tibble: 12 x 2
   month mean_temp_monthYear
   <dbl>               <dbl>
 1     1                11.4
 2     2                13.5
 3     3                17.2
 4     4                21.2
 5     5                26.0
 6     6                31.0
 7     7                33.3
 8     8                32.5
 9     9                29.1
10    10                22.4
11    11                15.4
12    12                10.7
However, when I do this only for the month of July (month = 7), I get a different result:
(Temperature_year_month <- na.omit(database_PE_na) %>%
  group_by(month = 7) %>%
  summarise(mean_temp_monthYear = mean(Air.Temp.Mean)))
  month mean_temp_monthYear
  <dbl>               <dbl>
1     7                22.0
Could someone explain to me why this happens?
We can use data.table methods:
library(data.table)
setDT(database_PE_na)[month == 7,
                      .(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))]
For comparison use == and not =. Inside group_by(), month = 7 creates a new column named month that equals 7 for every row, so all rows fall into a single group and you get the overall mean across all months.
If you want the mean of one month, use it in filter() instead of group_by().
mean() has an na.rm argument which can be set to TRUE to ignore NA values, instead of using na.omit() and removing complete rows.
Use:
library(dplyr)
Temperature_year_month <- database_PE_na %>%
  filter(month == 7) %>%
  summarise(mean_temp_monthYear = mean(Air.Temp.Mean, na.rm = TRUE))
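To see the difference concretely, here is a small self-contained sketch (toy data, not the asker's) showing that group_by(month = 7) collapses everything into one group while filter(month == 7) keeps only July:

library(dplyr)

# toy data: two months with clearly different temperatures
toy <- data.frame(month = c(7, 7, 8, 8), temp = c(30, 32, 10, 12))

# `month = 7` overwrites the month column with a constant,
# so every row lands in one group: the result is the overall mean
toy %>% group_by(month = 7) %>% summarise(mean_temp = mean(temp))  # 21

# filtering first gives the July-only mean
toy %>% filter(month == 7) %>% summarise(mean_temp = mean(temp))   # 31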

Aggregating by fixed date range in R

Given a simplification of my dataset like:
df <- data.frame("ID" = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
                 "ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
                                  "pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
                 "Date" = c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                            "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                            "1989.03.01","1989.05.01","1989.07.01","1989.08.01",
                            "1987.01.01","1987.06.01","1987.10.01","1987.11.01",
                            "1988.01.01","1988.03.01","1988.04.01","1988.06.01",
                            "1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
                 "NDVI" = c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
                            0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
                 check.names = FALSE, stringsAsFactors = FALSE)
I would like to obtain the means of NDVI values by a certain period of time, in this case by year. Take into account that in my real dataset I would need it for seasons, so it should be adaptable.
These means should consider:
Trimming outliers: for example 25% of the highest values and 25% of the lowest values.
They should be by class, in this case by the ID field.
So the output should look something like:
> desired_df
  ID ForestType Date meanNDVI
1  1        oak 1987    0.250
2  1        oak 1988    0.325
3  1        oak 1989    0.430
4  2       pine 1987    0.350
5  2       pine 1988    0.635
6  2       pine 1989    0.740
In this case, for example, 0.250 corresponds to the mean NDVI in 1987 for ID = 1, and it is the mean of the 4 values of that year with the lowest and the highest taken out.
Thanks a lot!
library(tidyverse)
library(lubridate)

df %>%
  mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
  group_by(ID, ForestType, Year = year(Date)) %>%
  filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
  summarise(meanNDVI = mean(NDVI))
Output
# A tibble: 6 x 4
# Groups:   ID, ForestType [2]
     ID ForestType  Year meanNDVI
  <dbl> <chr>      <dbl>    <dbl>
1     1 oak         1987    0.25
2     1 oak         1988    0.325
3     1 oak         1989    0.475
4     2 pine        1987    0.35
5     2 pine        1988    0.635
6     2 pine        1989    0.74
The classical base R approach uses aggregate(); the year can be obtained with substr().
res <- with(df, aggregate(list(meanNDVI = NDVI),
                          by = list(ID = ID, ForestType = ForestType, date = substr(Date, 1, 4)),
                          FUN = mean))
res[order(res$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987   0.2875
# 3  1        oak 1988   0.3575
# 5  1        oak 1989   0.5150
# 2  2       pine 1987   0.3575
# 4  2       pine 1988   0.6425
# 6  2       pine 1989   0.7725
Trimmed version
Trimmed for 25% outliers.
res2 <- with(df, aggregate(list(meanNDVI = NDVI),
                           by = list(ID = ID, ForestType = ForestType, date = substr(Date, 1, 4)),
                           FUN = mean, trim = .25))
res2[order(res2$ID), ]
#   ID ForestType date meanNDVI
# 1  1        oak 1987    0.250
# 3  1        oak 1988    0.325
# 5  1        oak 1989    0.475
# 2  2       pine 1987    0.350
# 4  2       pine 1988    0.635
# 6  2       pine 1989    0.740
Using the data.table package, you could proceed as follows:
library(data.table)
setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]
#    ID ForestType year meanNDVI
# 1:  1        oak 1987    0.250
# 2:  1        oak 1988    0.325
# 3:  1        oak 1989    0.475
# 4:  2       pine 1987    0.350
# 5:  2       pine 1988    0.635
# 6:  2       pine 1989    0.740
Another option: you can set trim in mean().
library(tidyverse)
library(lubridate)

df %>%
  mutate(Date = ymd(Date) %>% year()) %>%
  group_by(ID, ForestType, Date) %>%
  summarise(mean = mean(NDVI, trim = 0.25, na.rm = TRUE))
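For intuition, mean(x, trim = 0.25) sorts the values and discards the lowest 25% and the highest 25% before averaging, so with four values per year it drops exactly the minimum and the maximum:

# ID 1, year 1987: drop 0.1 and 0.55, then average the rest
mean(c(0.1, 0.2, 0.3, 0.55), trim = 0.25)
# [1] 0.25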

Finding weekly returns from daily returns company-wise

I have data which looks something like this:
co_code company_name co_stkdate dailylogreturn
      1            A 01-01-2000           0.76
      1            A 02-01-2000           0.75
.
.
.
      1            A 31-12-2019           0.54
      2            B 01-01-2000           0.98
      2            B 02-01-2000           0.45
.
.
and so on.
I want to find weekly returns, which equal the sum of the daily log returns over one week.
The output should look something like this:
co_code company_name co_stkdate weeklyreturns
      1            A 07-01-2000          1.34
      1            A 14-01-2000          0.95
.
.
.
      1            A 31-12-2019          0.54
      2            B 07-01-2000          0.98
      2            B 14-01-2000          0.45
I tried to apply functions from the quantmod package, but those are applicable only to xts objects. Another issue with xts objects is that group_by() can't be used. Thus, I want to work with a usual dataframe only.
My code looks something like this:
library(dplyr)

### Reading the csv file
df <- read.csv("33339_1_120_20190405_165913_dat.csv")

# Calculating daily log returns
df <- mutate(df, "dailylogrtn" = log(nse_returns)) %>% as.data.frame()

# Formatting the date
df$co_stkdate <- as.Date(as.character(df$co_stkdate), format = "%d-%m-%Y")
Since we don't know for how many days of each week you have a dailylogreturn, and there might be NAs, I recommend grouping by week and year:
# sample data
df <- data.frame(co_stkdate = rep(seq.Date(from = as.Date("2000-01-07"),
                                           to = as.Date("2000-02-07"), by = 1), 2),
                 dailylogreturn = abs(round(rnorm(64, 1, 1), 2)),
                 company_name = rep(c("A", "B"), each = 32))

df %>%
  mutate(co_stkdate = as.POSIXct(co_stkdate),
         year = strftime(co_stkdate, "%Y"),     # calendar year
         week = strftime(co_stkdate, "%W")) %>% # week number within the year
  group_by(company_name, year, week) %>%
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))
# A tibble: 12 x 4
# Groups:   company_name, year [2]
   company_name year  week  weeklyreturns
   <fct>        <chr> <chr>         <dbl>
 1 A            2000  01             6.31
 2 A            2000  02             6.11
 3 A            2000  03             6.02
 4 A            2000  04             8.27
 5 A            2000  05             4.92
 6 A            2000  06             0.5
 7 B            2000  01             1.82
 8 B            2000  02             6.6
 9 B            2000  03             7.55
10 B            2000  04             7.63
11 B            2000  05             7.54
12 B            2000  06             1.03
Since I don't have sample data, I assume this should also work, grouping every 7 consecutive rows:
df %>%
  group_by(group = ceiling(1:nrow(df) / 7)) %>%   # consecutive 7-row blocks
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))
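Another option, a minimal sketch assuming co_stkdate is already a Date as in the sample data above: collapse each date to its calendar week with lubridate::floor_date() and sum within company and week:

library(dplyr)
library(lubridate)

df %>%
  mutate(week = floor_date(co_stkdate, unit = "week")) %>%  # first day of each calendar week
  group_by(company_name, week) %>%
  summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))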

Calculating the average over a few months at the turn of the year, for 3 different indexes and 30 years

I do not have real date values. I have one column with Year and another with Month, and 3 more columns for 3 different indexes. There is one index value per month (so 12 months per year for 30 years; it is a lot of numbers). I would like to see the average value over a few months.
I need the information about these indexes to predict the pollen season in summer. So I would like to have an average over the winter months (like Dec-Jan-Feb-Mar) for NAO, and likewise for AO and SO (so 3 averages for 3 indexes). I would also like to get this value not only for one year but for all years. I think the complicated part is that the season spans the turn of the year, e.g. Dec 1988 - Jan 1989 - Feb 1989. If I succeed with this, I will try different combinations of months.
    Year Month   NAO    AO   SO
1   1988     1  1.02  0.26 -0.1
2   1988     2  0.76 -1.07 -0.4
3   1988     3 -0.17 -0.20  0.6
4   1988     4 -1.17 -0.56  0.1
5   1988     5  0.63 -0.85  0.9
6   1988     6  0.88  0.06  0.1
7   1988     7 -0.35 -0.14  1.0
8   1988     8  0.04  0.25  1.5
9   1988     9 -0.99  1.04  1.8
10  1988    10 -1.08  0.03  1.4
11  1988    11 -0.34 -0.03  1.7
12  1988    12  0.61  1.68  1.2
13  1989     1  1.17  3.11  1.5
14  1989     2  2.00  3.28  1.2
...
366 2018     6  1.09  0.38 -0.1
367 2018     7  1.39  0.61  0.2
368 2018     8  1.97  0.84 -0.3
index$Month <- as.character(index$Month)

# define function to compute average by consecutive season of interest/month_combination
compute_avg_season <- function(index, month_combination){
  index <- index %>%
    mutate(date = paste(Year, Month, "01", sep = "-")) %>%
    mutate(date = as.Date(date, "%Y-%b-%d")) %>%
    arrange(date) %>%
    mutate(winter_mths = ifelse(Month %in% month_combination, 1, NA))
  index <- setDT(index)[, id := rleid(winter_mths)] %>%
    filter(!is.na(winter_mths)) %>%
    group_by(id) %>%
    summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE)),
Error: unexpected ',' in:
"group_by(id)%>%
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE)),"
summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE),
+   mean_winter_AO = mean(AO, na.rm = TRUE),
+   mean_winter_SO = mean(SO, na.rm = TRUE))
Error in mean(NAO, na.rm = TRUE) : object 'NAO' not found
View(index)
Why do I get these errors?
The first error comes from the stray comma after summarise(...), which ends the pipe prematurely; the second happens because summarise() is then evaluated on its own, outside the pipe, where the column NAO is not in scope. I updated the answer based on the new insights from your comments:
# load libraries
library(dplyr)
library(data.table)

# pre-processing
index$Month <- as.character(index$Month)  # Month is a factor, make it character
colnames(index)[1] <- "Year"              # simplify the name of the Year column

# define a function to compute the average by consecutive season of interest/month_combination
# (do not modify this function)
compute_avg_season <- function(df, month_combination) {
  # mark the combination of months as 1, else NA
  df <- df %>%
    # correction for month MAY
    mutate(Month = replace(Month, Month == "MAI", "MAY")) %>%
    # create date
    mutate(date = paste(Year, Month, "01", sep = "-")) %>%
    mutate(date = as.Date(date, "%Y-%b-%d")) %>%
    # sort by date (you want the average over consecutive months: DEC, JAN, FEB, MAR)
    arrange(date) %>%
    mutate(winter_mths = ifelse(Month %in% month_combination, 1, NA))
  # add an index for each run of months of interest and compute the mean by index value
  df <- setDT(df)[, id := rleid(winter_mths)] %>%
    filter(!is.na(winter_mths)) %>%
    group_by(id) %>%
    summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE),
              mean_winter_AO = mean(AO, na.rm = TRUE),
              mean_winter_SO = mean(SO, na.rm = TRUE))
  return(df)
}

# use the above-defined function to compute mean values for the desired month combination:
# set the month combination
month_combination <- c("DEC", "JAN", "FEB", "MAR")
# compute mean values by month combination
compute_avg_season(index, month_combination)
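As an alternative to the rleid() bookkeeping, you can assign December to the following winter explicitly, so that Dec 1988 groups with Jan-Mar 1989; a hedged sketch, assuming Month holds the three-letter abbreviations used above and Year is numeric:

library(dplyr)

index %>%
  filter(Month %in% c("DEC", "JAN", "FEB", "MAR")) %>%
  # December belongs to the winter labelled with the following year
  mutate(season_year = ifelse(Month == "DEC", Year + 1, Year)) %>%
  group_by(season_year) %>%
  summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE),
            mean_winter_AO  = mean(AO,  na.rm = TRUE),
            mean_winter_SO  = mean(SO,  na.rm = TRUE))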
