I have to calculate the inflation rate from 2015 to 2019 using the CPI, which I have for each month of the period. This means calculating, for each month, the percentage growth rate relative to the same month of the previous year.
I have to do this for several countries and then calculate or show the average for the period 2015-2019.
This is my database:
data <- read.table("https://pastebin.com/raw/6cetukKb")
I have tried the quantmod, dplyr, lubridate packages, but I can't do the CPI conversion.
I tried this but I know it is not correct:
data$year <- year(data$date)
anual_cpi <- data %>% group_by(year) %>% summarize(cpi = mean(Argentina))
anual_cpi$adj_factor <- anual_cpi$cpi/anual_cpi$cpi[anual_cpi$year == 2014]
**UPDATE**
My teacher gave us a hint on how to get the result, but when I try to add it to the code, I get an error.
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate((change=(x-lag(x,1))/lag(x,1)*100))
| name | value |
|----------------- |------- |
| Antigua_Barbuda | -1.55 |
| Argentina | 1.03 |
| Aruba | -1.52 |
| Bahamas | -1.56 |
| Barbados | -1.38 |
where "value" corresponds to the average inflation for each country during the entire period 2015-2019
We can use data.table methods
library(data.table)
melt(fread("https://pastebin.com/raw/6cetukKb"),
id.vars = c('date', 'year', 'period', 'periodName'))[,
.(value = mean(value)), .(variable, year)][,
# divide by the 2014 value of the same country, hence by = variable
adj_factor := value/value[year == 2014], by = variable][]
# variable year value adj_factor
# 1: Antigua_Barbuda 2014 96.40000 1.0000000
# 2: Antigua_Barbuda 2015 96.55833 1.0016425
# 3: Antigua_Barbuda 2016 96.08333 0.9967151
# 4: Antigua_Barbuda 2017 98.40833 1.0208333
# 5: Antigua_Barbuda 2018 99.62500 1.0334544
# 6: Antigua_Barbuda 2019 101.07500 1.0484959
# 7: Argentina 2014 56.60000 1.0000000
# ..
You should read your data with header = TRUE, since the first row contains the column names. Then reshape the data to long format, which makes the calculation easier.
After this you can perform whichever calculation you want. For example, to repeat the steps from your attempt, i.e. divide each value by the 2014 value for the same country, you can do:
library(dplyr)
data <- read.table("https://pastebin.com/raw/6cetukKb", header = TRUE)
data %>%
tidyr::pivot_longer(cols = Antigua_Barbuda:Barbados) %>%
group_by(name, year) %>%
summarise(value = mean(value)) %>%
mutate(adj_factor = value/value[year == 2014])
# name year value adj_factor
# <chr> <int> <dbl> <dbl>
# 1 Antigua_Barbuda 2014 96.4 1
# 2 Antigua_Barbuda 2015 96.6 1.00
# 3 Antigua_Barbuda 2016 96.1 0.997
# 4 Antigua_Barbuda 2017 98.4 1.02
# 5 Antigua_Barbuda 2018 99.6 1.03
# 6 Antigua_Barbuda 2019 101. 1.05
# 7 Argentina 2014 56.6 1
# 8 Argentina 2015 64.0 1.13
# 9 Argentina 2016 89.9 1.59
#10 Argentina 2017 113. 2.00
# … with 20 more rows
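The adjustment factor above is not yet the inflation rate the question asks for. Below is a minimal sketch of the year-over-year calculation, assuming the data is already in long format with one row per country and month and no missing months. The country `Examplia`, the `month` column and all numbers are made up for illustration; they are not taken from the pastebin data.

```r
library(dplyr)

# Hypothetical toy data: three years of monthly CPI for one made-up country,
# growing by exactly 10% per year so the expected answer is known.
cpi <- data.frame(
  name  = "Examplia",
  year  = rep(2014:2016, each = 12),
  month = rep(1:12, times = 3)
)
cpi$value <- 100 * 1.1^(cpi$year - 2014)

yoy <- cpi %>%
  arrange(name, year, month) %>%
  group_by(name) %>%
  # lag(value, 12) is the CPI of the same month one year earlier;
  # this is only valid if every month is present (assumption).
  mutate(inflation = (value / lag(value, 12) - 1) * 100) %>%
  ungroup()

# Average year-over-year inflation per country over the years of interest
avg_inflation <- yoy %>%
  filter(year >= 2015) %>%
  group_by(name) %>%
  summarise(avg_inflation = mean(inflation), .groups = "drop")
```

If the teacher's hint is meant to work on annual means instead, the same idea would presumably be `mutate(change = (value - lag(value, 1)) / lag(value, 1) * 100)` after the `summarise()` step, with `value` in place of the undefined `x`.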
I have this data frame that is grouped by id_station, id_parameter, "zona", and its date.
id_station id_parameter zona year month day mediaDiaria sdDiaria Count
1 AJM CO SO 2019 1 1 0.281 0.181 21
2 AJM CO SO 2019 1 2 0.367 0.230 24
3 AJM CO SO 2019 1 3 0.371 0.160 24
4 AJM CO SO 2019 1 4 0.312 0.185 24
5 AJM CO SO 2019 1 5 0.296 0.168 24
6 AJM CO SO 2019 1 6 0.225 0.142 24
7 AJM CO SO 2019 1 7 0.281 0.0873 21
8 AJM CO SO 2019 1 8 0.388 0.236 24
9 AJM CO SO 2019 1 9 0.421 0.265 24
10 AJM CO SO 2019 1 10 0.225 0.103 24
What I want to do is keep only the rows from March 1, 2019 to February 29, 2020, which I treat as "Year 1". Afterwards, I want to count, per id_station, the number of rows (days) in Year 1 with Count > 18, and eliminate all rows from stations that have fewer than 275 such days.
I have tried the following with filter:
Year1in2019CO <- datosCO %>%
filter(year == 2019, month %in% c(3:12)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1in2020CO <- datosCO %>%
filter(year == 2020, month %in% c(1:2)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1CO <- bind_rows(Year1in2019CO, Year1in2020CO)
It does the job. But is there a way to do this while only creating one data frame, instead of 3?
And I have tried the following for the counting rows part:
YEAR1dfCO_2 <- Year1CO %>%
group_by(id_station) %>%
summarise(dws = sum(Count > 18))
And while it does give me what I need, I do not know how to eliminate all data from stations with less than 275 rows in Count (being > 18) in Year 1 from the original dataset (Year1CO).
Can you please help me?
This might work. First, filter the Year 1 rows using a constructed date, then keep only the stations that satisfy the condition you described.
library(tidyverse)
yr1 <- datosCO %>%
mutate(d = as.Date(paste(year, month, day, sep = "-"))) %>%
filter(between(d, as.Date("2019-03-01"), as.Date("2020-02-29"))) %>%
group_by(id_station, id_parameter, zona, d) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE),
sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
yr1 %>%
group_by(id_station) %>%
# keep stations with at least 275 days where Count > 18 (drops the rest)
filter(sum(Count > 18) >= 275) %>%
ungroup()
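To see how a grouped filter() keeps or drops whole stations, here is a small sketch on made-up data. The station names, the counts and the threshold of 3 days (instead of 275) are assumptions chosen only to keep the example short.

```r
library(dplyr)

# Hypothetical daily summary: station A has 3 days with Count > 18,
# station B has only 1.
toy <- data.frame(
  id_station = rep(c("A", "B"), each = 4),
  Count      = c(24, 24, 24, 10,
                 24,  5,  5,  5)
)

kept <- toy %>%
  group_by(id_station) %>%
  # sum(Count > 18) is a single number per station, so the condition is
  # TRUE or FALSE for the whole group and filter() keeps or drops all its rows.
  filter(sum(Count > 18) >= 3) %>%
  ungroup()
```

Here `kept` contains all four rows of station A, including the day with Count = 10, and none of station B.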
Given a simplification of my dataset like:
df <- data.frame("ID"= c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2),
"ForestType" = c("oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak","oak",
"pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine","pine"),
"Date"= c("1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01",
"1987.01.01","1987.06.01","1987.10.01","1987.11.01",
"1988.01.01","1988.03.01","1988.04.01","1988.06.01",
"1989.03.01","1989.05.01","1989.07.01","1989.08.01"),
"NDVI"= c(0.1,0.2,0.3,0.55,0.31,0.26,0.34,0.52,0.41,0.45,0.50,0.7,
0.2,0.3,0.4,0.53,0.52,0.54,0.78,0.73,0.72,0.71,0.76,0.9),
check.names = FALSE, stringsAsFactors = FALSE)
I would like to obtain the means of NDVI values by a certain period of time, in this case by year. Take into account that in my real dataset I would need it for seasons, so it should be adaptable.
These means should consider:
Trimming outliers: for example 25% of the highest values and 25% of the lowest values.
They should be by class, in this case by the ID field.
So the output should look something like:
> desired_df
ID ForestType Date meanNDVI
1 1 oak 1987 0.250
2 1 oak 1988 0.325
3 1 oak 1989 0.475
4 2 pine 1987 0.350
5 2 pine 1988 0.635
6 2 pine 1989 0.740
In this case, for example, 0.250 corresponds to the mean NDVI in 1987 for ID=1: it is the mean of the 4 values of that year after taking the lowest and the highest out.
Thanks a lot!
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = as.Date(Date, format = "%Y.%m.%d")) %>%
group_by(ID, ForestType, Year = year(Date)) %>%
filter(NDVI > quantile(NDVI, .25) & NDVI < quantile(NDVI, .75)) %>%
summarise(meanNDVI = mean(NDVI))
Output
# A tibble: 6 x 4
# Groups: ID, ForestType [2]
ID ForestType Year meanNDVI
<dbl> <chr> <dbl> <dbl>
1 1 oak 1987 0.25
2 1 oak 1988 0.325
3 1 oak 1989 0.475
4 2 pine 1987 0.35
5 2 pine 1988 0.635
6 2 pine 1989 0.74
The classical base R approach using aggregate. The year can be obtained using substr.
res <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean))
res[order(res$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.2875
# 3 1 oak 1988 0.3575
# 5 1 oak 1989 0.5150
# 2 2 pine 1987 0.3575
# 4 2 pine 1988 0.6425
# 6 2 pine 1989 0.7725
Trimmed version
Trimmed by 25% at each end to drop outliers.
res2 <- with(df, aggregate(list(meanNDVI=NDVI),
by=list(ID=ID, ForestType=ForestType, date=substr(Date, 1, 4)),
FUN=mean, trim=.25))
res2[order(res2$ID), ]
# ID ForestType date meanNDVI
# 1 1 oak 1987 0.250
# 3 1 oak 1988 0.325
# 5 1 oak 1989 0.475
# 2 2 pine 1987 0.350
# 4 2 pine 1988 0.635
# 6 2 pine 1989 0.740
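As a quick check of what trim = .25 does, the following base R sketch uses the four oak values for 1989 from the question and compares mean() with trimming against dropping the lowest and highest value by hand.

```r
# NDVI values for ID 1 (oak) in 1989, taken from the question's data
x <- c(0.41, 0.45, 0.50, 0.70)

# trim = 0.25 removes 25% of the observations from each end before averaging;
# with 4 values that is one value at the bottom and one at the top.
trimmed <- mean(x, trim = 0.25)

# The same thing done manually: sort, drop first and last, average the rest
manual <- mean(sort(x)[2:3])
```

Both `trimmed` and `manual` are the mean of 0.45 and 0.50.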
Using data.table package, you could proceed as follows:
library(data.table)
setDT(df)[, Date := as.Date(Date, format = "%Y.%m.%d")][]
df[, .(meanNDVI = base::mean(NDVI, trim = 0.25)), by = .(ID, ForestType, year = year(Date))]
# ID ForestType year meanNDVI
# 1: 1 oak 1987 0.250
# 2: 1 oak 1988 0.325
# 3: 1 oak 1989 0.475
# 4: 2 pine 1987 0.350
# 5: 2 pine 1988 0.635
# 6: 2 pine 1989 0.740
Another option. You can set trim in mean
library(tidyverse)
library(lubridate)
df %>%
mutate(Date = ymd(Date) %>% year()) %>%
group_by(ID, ForestType, Date) %>%
summarise(mean = mean(NDVI, trim = 0.25, na.rm = TRUE))
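The question also asks for something adaptable to seasons. One possible sketch in base R is to derive a grouping column from the month and reuse the trimmed mean. The month-to-season mapping below (northern hemisphere, meteorological seasons), the `Season` column and the sample values are assumptions for illustration only.

```r
# Hypothetical sample: one ID, two winter and four spring observations
ndvi <- data.frame(
  ID   = 1,
  Date = as.Date(c("1987-01-01", "1987-02-01", "1987-03-01",
                   "1987-04-01", "1987-05-01", "1987-05-15")),
  NDVI = c(0.1, 0.2, 0.2, 0.3, 0.4, 0.9)
)

# Lookup vector indexed by month number (1 = January, ..., 12 = December)
season_of_month <- c("winter", "winter", "spring", "spring", "spring", "summer",
                     "summer", "summer", "autumn", "autumn", "autumn", "winter")
ndvi$Season <- season_of_month[as.integer(format(ndvi$Date, "%m"))]

# Trimmed mean per ID and season; trim is passed through to mean()
res <- aggregate(NDVI ~ ID + Season, data = ndvi, FUN = mean, trim = 0.25)
```

Grouping by year and season together would need extra care, because December belongs to the winter that continues into the following year.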
I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
Don't have a clear path to accomplish what I want but it would look like
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment under the names data2015, data2016 and data2017, we can add a Year column to each and keep the columns listed in the keepcols vector. Then arrange the data by Country and Year, group by Country, keep only the countries that are present in all 3 years, and subtract the previous row's values using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value in the first row for each Country will always be NA; every other row holds the value minus the previous year's value for that country.
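The two variants mentioned above give the same result. A minimal sketch, using Afghanistan's three ranks from the output shown earlier:

```r
library(dplyr)

# Happiness ranks for Afghanistan in 2015, 2016 and 2017 (from the output above)
rank <- c(153, 154, 141)

# Variant 1: subtract the previous value; lag() puts NA in the first position
via_lag <- rank - lag(rank)

# Variant 2: diff() returns one element fewer, so pad with NA at the front
via_diff <- c(NA, diff(rank))
```

Both vectors are NA, 1, -13.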
I have data which look something like this
co_code company_name co_stkdate dailylogreturn
1 A 01-01-2000 0.76
1 A 02-01-2000 0.75
.
.
.
1 A 31-12-2019 0.54
2 B 01-01-2000 0.98
2 B 02-01-2000 0.45
.
.
And so on
I want to find weekly returns which is equal to sum of daily log return for one week.
output should look something like this
co_code company_name co_stkdate weeklyreturns
1 A 07-01-2000 1.34
1 A 14-01-2000 0.95
.
.
.
1 A 31-12-2019 0.54
2 B 07-01-2000 0.98
2 B 14-01-2000 0.45
I tried to apply functions in quantmod package but those functions are applicable to only xts objects. Another issue in xts objects is that function "group_by()" can't be used. Thus, I want to work in usual dataframe only.
Code look something like this
library(dplyr)
### Reading txt file
df <- read.csv("33339_1_120_20190405_165913_dat.csv")
Calculating daily log returns
df <- mutate(df, "dailylogrtn"=log(nse_returns)) %>% as.data.frame()
Formatting date
df$co_stkdate<- as.Date(as.character(df$co_stkdate), format="%d-%m-%Y")
Since we don't know on how many days of each week you have a dailylogreturn, and there might be NAs, I recommend grouping by week and year:
#sample data
df <- data.frame(co_stkdate = rep(seq.Date(from = as.Date("2000-01-07"), to = as.Date("2000-02-07"), by = 1), 2),
dailylogreturn = abs(round(rnorm(64, 1, 1), 2)),
company_name = rep(c("A", "B"), each = 32))
df %>%
mutate(co_stkdate = as.POSIXct(co_stkdate),
year = strftime(co_stkdate, "%Y"),
week = strftime(co_stkdate, "%W")) %>%
group_by(company_name, year, week) %>%
summarise(weeklyreturns = sum(dailylogreturn, na.rm = TRUE))
# A tibble: 12 x 4
# Groups: company_name, year [2]
company_name year week weeklyreturns
<fct> <chr> <chr> <dbl>
1 A 2000 01 6.31
2 A 2000 02 6.11
3 A 2000 03 6.02
4 A 2000 04 8.27
5 A 2000 05 4.92
6 A 2000 06 0.5
7 B 2000 01 1.82
8 B 2000 02 6.6
9 B 2000 03 7.55
10 B 2000 04 7.63
11 B 2000 05 7.54
12 B 2000 06 1.03
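The week grouping can also be sketched in base R without converting to POSIXct, since format() works directly on Date objects. The dates and returns below are made up; "%W" numbers weeks starting on Monday, which is only one of several possible week definitions.

```r
# Hypothetical daily log returns on four dates in early 2000
dates <- as.Date(c("2000-01-07", "2000-01-09", "2000-01-10", "2000-02-07"))
ret   <- c(0.5, 0.25, 0.25, 1)

yr <- format(dates, "%Y")  # four-digit year
wk <- format(dates, "%W")  # week of the year, Monday as first day of the week

# Sum the daily log returns within each year/week combination
weekly <- aggregate(ret, by = list(year = yr, week = wk), FUN = sum)
```

Friday 7 and Sunday 9 January fall in the same week, while Monday 10 January starts the next one, so `weekly` has three rows. Around New Year, "%W" can split a calendar week across two years; ISO weeks ("%V" with "%G") are an alternative where the platform supports them.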
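Since I don't have sample data, I assume this should work (it treats every 7 consecutive rows of a company as one week):
df %>%
group_by(company_name) %>%
mutate(group = ceiling(row_number() / 7)) %>%
group_by(company_name, group) %>%
summarise(weeklyreturns = sum(dailylogreturn))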
I have a dataframe containing three columns, where the first is an ID, the second denotes a year and the third column is the value associated with the ID in that year:
df_in <- data.frame("ID"=c(1,1,1,1,1,1,
2,2,2,2,
3,3,3),
"yr"=c(2001,2002,2003,2004,2005,2006,
2002,2003,2004,2005,
2003,2004,2005),
"val"=c(1,2,3,4,5,6,
4,5,6,7,
7,8,9))
I would like to introduce a lag in my val-column per ID, so looking at (e.g.) ID==1 then the value at yr==2002 should be shifted to yr==2001, yr==2003 to yr==2002 and so on. This should be the case for all unique ID's.
The row corresponding to the last year (that now doesn't have a value due to the shift) should be deleted. We ultimately end up with
df_out <- data.frame("ID"=c(1,1,1,1,1,
2,2,2,
3,3),
"yr"=c(2001,2002,2003,2004,2005,
2002,2003,2004,
2003,2004),
"val"=c(2,3,4,5,6,
5,6,7,
8,9))
Is there an easy way to do this in dplyr?
df_out <-
df_in %>%
group_by(ID) %>%
mutate(yr = lag(yr)) %>%
filter(!is.na(yr)) %>%
ungroup
To get the requested result, you can use do:
df_in %>%
group_by(ID) %>%
do(data.frame(yr = head(.$yr, -1L), val = tail(.$val, -1L)))
The result:
# A tibble: 10 x 3
# Groups: ID [3]
ID yr val
<dbl> <dbl> <dbl>
1 1.00 2001 2.00
2 1.00 2002 3.00
3 1.00 2003 4.00
4 1.00 2004 5.00
5 1.00 2005 6.00
6 2.00 2002 5.00
7 2.00 2003 6.00
8 2.00 2004 7.00
9 3.00 2003 8.00
10 3.00 2004 9.00
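Another way to express the same shift is to move val up with lead() instead of moving yr down with lag(). This is a sketch on the question's df_in (rebuilt compactly here), and it assumes val has no genuine missing values, because the filter would drop those rows too.

```r
library(dplyr)

# Same content as df_in from the question, written more compactly
df_in <- data.frame(
  ID  = c(rep(1, 6), rep(2, 4), rep(3, 3)),
  yr  = c(2001:2006, 2002:2005, 2003:2005),
  val = c(1:6, 4:7, 7:9)
)

df_lead <- df_in %>%
  group_by(ID) %>%
  # lead() pulls the next year's value into the current row, per ID
  mutate(val = lead(val)) %>%
  # the last year of each ID has no following value and is removed
  filter(!is.na(val)) %>%
  ungroup()
```

The result has the same 10 rows as df_out in the question.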