I have a long dataset in the following format:
Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
: : :
: : :
2019-06-30 Australia 57
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10
: : :
: : :
2019-06-30 Austria 21
I want to calculate a 90-day rolling standard deviation of the Score for each country. I have tried the rollapply function (package zoo) and roll_sd (package RcppRoll), but I could not get them to work for a groupwise standard deviation. Can anyone please suggest a way to calculate the rolling standard deviation by group?
Thanks!
In general grouping is done separately from the base operation in R so it is not that those functions can't be used for grouped data. It is just that you need to embed them within a grouping operation. Here we use ave to do the grouping and rollapplyr to perform the rolling sd.
Now, can we assume that at each point the last 90 days are the last 90 rows? Assuming yes, and using a window width of 2 rather than 90 so that the few rows of the posted data (shown reproducibly in the Note at the end) are enough:
library(zoo)
roll <- function(x) rollapplyr(x, 2, sd, fill = NA)
transform(DF, roll = ave(Score, Country, FUN = roll))
giving:
Date Country Score roll
1 1995-01-01 Australia 100 NA
2 1995-01-02 Australia 99 0.7071068
3 1995-01-03 Australia 85 9.8994949
4 1995-01-01 Austria 67 NA
5 1995-01-02 Austria 12 38.8908730
6 1995-01-03 Austria 10 1.4142136
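If the rows are not one per day, the "last 90 rows" assumption no longer holds. Below is a minimal base R sketch of a window defined by calendar days instead. The helper roll_sd_days and the column roll_days are hypothetical names made up for illustration, and the window is again shortened to 2 days so that the toy data from the Note can be used.

```r
# Toy data copied from the Note; any data frame with Date, Country, Score works
DF <- data.frame(
  Date    = as.Date(rep(c("1995-01-01", "1995-01-02", "1995-01-03"), 2)),
  Country = rep(c("Australia", "Austria"), each = 3),
  Score   = c(100, 99, 85, 67, 12, 10)
)

# Hypothetical helper: sd of all scores whose date lies in the trailing
# window (date[i] - days, date[i]]; NA when fewer than two rows qualify
roll_sd_days <- function(score, date, days) {
  vapply(seq_along(score), function(i) {
    in_window <- date > date[i] - days & date <= date[i]
    if (sum(in_window) < 2) NA_real_ else sd(score[in_window])
  }, numeric(1))
}

# Apply the helper separately to the rows of each country
DF$roll_days <- NA_real_
for (idx in split(seq_len(nrow(DF)), DF$Country)) {
  DF$roll_days[idx] <- roll_sd_days(DF$Score[idx], DF$Date[idx], days = 2)
}
DF
```

This compares every row with every other row of the same country, so it is quadratic in the group size. That is probably acceptable for a few thousand rows per country but not a replacement for rollapplyr when the data really is regularly spaced.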
Wide form approach
Another approach is to convert the data to wide form and then perform the rolling operation:
library(zoo)
z <- read.zoo(DF, split = "Country")
zr <- rollapplyr(z, 2, sd, fill = NA)
zr
giving this zoo series:
Australia Austria
1995-01-01 NA NA
1995-01-02 0.7071068 38.890873
1995-01-03 9.8994949 1.414214
You can then just leave it as a zoo series in order to take advantage of the other time series functions in that package or can convert it back to a data frame using fortify.zoo(zr) or fortify.zoo(zr, melt = TRUE, names = names(DF)) depending on what you need.
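As a sketch of the conversion back to long form, assuming the zoo package is installed: the column names passed to names are my own choice for illustration, and the toy data repeats the rows from the Note.

```r
library(zoo)

# Toy data copied from the Note
DF <- data.frame(
  Date    = as.Date(rep(c("1995-01-01", "1995-01-02", "1995-01-03"), 2)),
  Country = rep(c("Australia", "Austria"), each = 3),
  Score   = c(100, 99, 85, 67, 12, 10)
)

z  <- read.zoo(DF, split = "Country")   # wide zoo series, one column per country
zr <- rollapplyr(z, 2, sd, fill = NA)   # rolling sd down each column

# melt = TRUE stacks the country columns into one long data frame;
# names sets the index, series and value column names in that order
long <- fortify.zoo(zr, melt = TRUE, names = c("Date", "Country", "roll"))
long
```

The long result has one row per date and country, which is usually the shape ggplot2 or a later merge with DF expects.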
Note
The input used in reproducible form.
Lines <- "Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)
Related
I am given a dataset of the following form
year<-rep(c(1990:1999),each=10)
age<-rep(50:59, 10)
cat1<-rep(c("A","B","C","D","E"),each=100)
value<-rnorm(10*10*5)
value[c(3,51,100,340,441)]<-0
df<-data.frame(year,age,cat1,value)
year age cat1 value
1 1990 50 A -0.7941799
2 1990 51 A 0.1592270
3 1990 52 A 0.0000000
4 1990 53 A 1.9222384
5 1990 54 A 0.3922259
6 1990 55 A -1.2671957
I would now like to replace any zeroes in the "value" column by the average, taken over the column "cat1", of the non-zero entries of "value" for the corresponding year and age. For example, for year 1990 and age 52 the entry for cat1=A is zero; it should be replaced by the average of the non-zero entries of the remaining categories for this specific year and age.
As we have
df[df$year==1990 & df$age==52,]
year age cat1 value
3 1990 52 A 0.0000000
103 1990 52 B -1.1325446
203 1990 52 C -1.6136773
303 1990 52 D 0.5724360
403 1990 52 E 0.2795241
we would replace the entry 0 by
sum(df[df$year==1990 & df$age==52,4])/4
[1] -0.4735654
Is there a nice and clean way to do this generally?
library(data.table)
setDT(df)[value==0, value := NA,]
df[, value := replace(value, is.na(value), mean(value, na.rm = TRUE)) , by = .(year, age)]
Probably 99.9% of operations on tables can be decomposed into a few basic, fast and optimized ones: split, combine (for numeric data: sum, multiplication, etc.), filter, sort and join.
Here left_join from dplyr is the way to go.
Just create another data frame with the zeroes filtered out and value aggregated under the proper grouping. Then replace the zeroes with the values from the newly joined column.
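A minimal sketch of that join approach, assuming dplyr 1.0.0 or later is installed (for the .groups argument). The names means, mean_value and filled are made up for illustration, and the data is regenerated as in the question with an arbitrary seed.

```r
library(dplyr)

# Data in the shape given in the question (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(year  = rep(1990:1999, each = 10),
                 age   = rep(50:59, 10),
                 cat1  = rep(c("A", "B", "C", "D", "E"), each = 100),
                 value = rnorm(500))
df$value[c(3, 51, 100, 340, 441)] <- 0

# Step 1: drop the zeroes and average what is left per (year, age)
means <- df %>%
  filter(value != 0) %>%
  group_by(year, age) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# Step 2: join the averages back and use them wherever value is zero
filled <- df %>%
  left_join(means, by = c("year", "age")) %>%
  mutate(value = ifelse(value == 0, mean_value, value)) %>%
  select(-mean_value)

filled[filled$year == 1990 & filled$age == 52, ]
```

left_join keeps the row order of df, so filled lines up row by row with the original data.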
I have a large dataframe (AT_df) with many years for many countries, but no annual totals. The initial dataset has already been slimmed down to Pollutant_name (x1="CO2"), I dropped all subcategories, and to one country.
I am preparing this data to afterwards run ggplot2, but for this I need to add a row for each year with the total of the categories (=1-6).
The data looks like this (excerpt):
x y x1 x2 x4 x6
1553 1993 0.00000 CO2 Austria 6 6 - Other Sector
1554 2006 0.00000 CO2 Austria 6 6 - Other Sector
1555 2015 0.00000 CO2 Austria 6 6 - Other Sector
2243 1998 12.07760 CO2 Austria 5 5 - Waste management
2400 1992 11.12720 CO2 Austria 5 5 - Waste management
2401 1995 11.11040 CO2 Austria 5 5 - Waste management
2402 2006 10.26000 CO2 Austria 5 5 - Waste management
2489 1998 0.00000 CO2 Austria 6 6 - Other Sector
I would like to insert a row which is labelled (x6= aggregate) and sums up the values for y (emissions) under the condition of x= year xyz & x2=country_xyz.
Basically something like this
sum(AT_df, x4 %in% c("1", "2", "3", "4", "5", "6") & x ="yearxyz" &
x2="Austria").
This then should be inserted into the dataframe FOR EACH YEAR (16 years in total)
While I have tried some things I've read on stackoverflow, such as:
rbind(AT_df, data.frame(x1='Aggregate', y = sum(AT_df$y)))
... I was not able to write any correctly working code
Thanks in any case and for any sort of help.
You could first prepare a data frame with summary data in the same shape as your AT_df and afterwards combine the two. There are many ways to do this in R. Here I am using the dplyr package. Since the sample data is not enough to fully show this, I am also creating some artificial data first. After that, one has to do the following steps:
Name all the columns that should be retained when summarising (function group_by).
Summarise a column and assign the output to a column (function summarise).
Add a column for the now missing variable(s) (function mutate).
Combine the resulting data frame with the original one (function union_all).
The final filter is only used to show some representative data.
set.seed(42)
df <- expand.grid(year = 1993:2015,
pollutant = "CO2",
country = LETTERS,
sector = 1L:6L)
df$amount <- runif(nrow(df), 0, 15)
library("dplyr")
df %>%
group_by(year, pollutant, country) %>%
summarise(amount = sum(amount)) %>%
mutate(sector = -1L) %>%
union_all(df) %>%
filter(country == "A" & year == 1996)
#> # A tibble: 7 x 5
#> # Groups: year, pollutant [1]
#> year pollutant country amount sector
#> <int> <fct> <fct> <dbl> <int>
#> 1 1996 CO2 A 41.5 -1
#> 2 1996 CO2 A 12.5 1
#> 3 1996 CO2 A 4.24 2
#> 4 1996 CO2 A 6.70 3
#> 5 1996 CO2 A 1.88 4
#> 6 1996 CO2 A 9.40 5
#> 7 1996 CO2 A 6.82 6
I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2.0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1 You're getting this error duplicate identifiers for rows likely because of spread. spread wants to make N columns of your N unique values but it needs to know which unique row to place those values. If you have duplicate value-combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add a unique row identifier before calling spread: data %>% mutate(row=row_number()) %>% spread(...).
Error 2 You're getting this error sum not meaningful for factors likely because of summarise_all. summarise_all will operate on all columns but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)); the backticks are needed because the column names start with a digit.
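Both fixes can be tried on a tiny made-up data frame. This is only a sketch, assuming dplyr and tidyr are installed; dup, wide and totals are hypothetical names and the numbers are invented.

```r
library(dplyr)
library(tidyr)

# Two rows share the same CountryName and Year, so spread() alone
# cannot identify a unique output row for each of them
dup <- data.frame(CountryName = c("UK", "UK", "US"),
                  Year        = c(2014, 2014, 2015),
                  Orders      = c(13, 1, 5))

# Fix for error 1: a row id makes every input row uniquely identifiable
wide <- dup %>%
  mutate(row = row_number()) %>%
  spread(Year, Orders)

# Fix for error 2: sum only the numeric year columns, by name
totals <- wide %>%
  group_by(CountryName) %>%
  summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE),
            `2015_Sum` = sum(`2015`, na.rm = TRUE))
totals
```

Summing before spreading, as in the other answer, avoids both problems at once, so this route is mainly useful for understanding where the errors come from.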
I did quite some searching on how to simplify the code for the problem below but was not successful. I assume that with some kind of apply-magic one could speed things up a little, but so far I still have difficulties with these kinds of functions.
I have a data.frame data, structured as follows:
year iso3c gdpppc elec solid liquid heat
2010 USA 1567 1063 1118 835 616
2015 USA 1571 NA NA NA NA
2020 USA 1579 NA NA NA NA
... USA ... NA NA NA NA
2100 USA 3568 NA NA NA NA
2010 ARG 256 145 91 85 37
2015 ARG 261 NA NA NA NA
2020 ARG 270 NA NA NA NA
... ARG ... NA NA NA NA
2100 ARG 632 NA NA NA NA
As you can see, I have a historical starting value for 2010 and a complete scenario for gdppc up to 2100. I want to let values for elec, solid, liquid and heat grow according to some elasticity with respect to the development of gdppc, but separately for each country (coded in iso3c).
I have the elasticities defined in a separate data.frame parameters:
item value
elec 0.5
liquid 0.2
solid -0.1
heat 0.1
So far I am using a nested for loop:
for (e in 1:length(levels(parameters$item))) {
  # column name of the current energy carrier, as a plain string
  item <- as.character(parameters[e, "item"])
  for (c in 1:length(levels(data$iso3c))) {
    # one country, one variable
    tmp <- subset(data, select=c("year", "iso3c", "gdppc", item), subset=(iso3c == levels(data$iso3c)[c]))
    # 2010 value times the cumulative growth factor 1 + elasticity * gdppc growth
    tmp[tmp$year %in% seq(2015, 2100, 5), item] <-
      tmp[tmp$year == 2010, item] *
      cumprod((1 + (tmp[tmp$year %in% seq(2015, 2100, 5), "gdppc"] /
        tmp[tmp$year %in% seq(2010, 2095, 5), "gdppc"] - 1) * parameters[e, "value"]))
    # write the projected values back into the full data set
    data[data$iso3c == levels(data$iso3c)[c] & data$year %in% seq(2015, 2100, 5), item] <- tmp[tmp$year > 2010, item]
  }
}
The outer loop loops over the columns and the inner one over the countries. The inner loop runs for every country (I have 180+ countries). First, a subset containing data on one single country and on the variable of interest is selected. Then I let the respective variable grow with a certain elasticity to growth in gdppc and finally put the subset back into place in data.
I have already tried to let the outer loop run in parallel using foreach but was not successful in recombining the results. Since I have to run similar calculations quite often I would be very grateful for any help.
Thanks
Here's one way. Note I renamed your parameters data.frame to p
library(data.table)
library(reshape2)
dt <- data.table(data)
dt.melt = melt(dt,id=1:3)
dt.melt[,value:=as.numeric(value)] # coerce value column to numeric
dt.melt[,value:=head(value,1)+(gdpppc-head(gdpppc,1))*p[p$item==variable,]$value,
by="iso3c,variable"]
result <- dcast(dt.melt,iso3c+year+gdpppc~variable)
result
# iso3c year gdpppc elec solid liquid heat
# 1 ARG 2010 256 145.0 91.0 85.0 37.0
# 2 ARG 2015 261 147.5 90.5 86.0 37.5
# 3 ARG 2020 270 152.0 89.6 87.8 38.4
# 4 ARG 2100 632 333.0 53.4 160.2 74.6
# 5 USA 2010 1567 1063.0 1118.0 835.0 616.0
# 6 USA 2015 1571 1065.0 1117.6 835.8 616.4
# 7 USA 2020 1579 1069.0 1116.8 837.4 617.2
# 8 USA 2100 3568 2063.5 917.9 1235.2 816.1
The basic idea is to use the melt(...) function to reshape your original data into "long" format, where the values in the four columns solid, liquid, elec, and heat are all in one column, value, and the column variable indicates which metric value refers to. Now, using data tables, you can fill in the values easily. Then, reshape the result back into wide format using dcast(...).
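For comparison, here is a base R sketch of the compounding rule from the question's loop (start value times the cumulative product of 1 + elasticity × gdppc growth), written without the inner subset-and-write-back step. The toy data, the helper grow and the column name gdppc (as spelled in the question's code) are assumptions for illustration, and only elec is shown.

```r
# Hypothetical toy data in the shape described in the question
data <- data.frame(
  year  = rep(c(2010, 2015, 2020), 2),
  iso3c = rep(c("USA", "ARG"), each = 3),
  gdppc = c(1567, 1571, 1579, 256, 261, 270),
  elec  = c(1063, NA, NA, 145, NA, NA)
)
elasticity <- 0.5  # value for elec in the parameters table

# Hypothetical helper: start value times the cumulative product of
# (1 + elasticity * period-on-period gdppc growth)
grow <- function(start, gdp, eps) {
  growth <- c(0, gdp[-1] / gdp[-length(gdp)] - 1)  # 0 for the base year
  start * cumprod(1 + eps * growth)
}

# Apply per country; rows are assumed to be sorted by year within a country
for (idx in split(seq_len(nrow(data)), data$iso3c)) {
  data$elec[idx] <- grow(data$elec[idx][1], data$gdppc[idx], elasticity)
}
data
```

The same helper could be called once per column in parameters, which would replace the outer loop's body with a single line per energy carrier.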
Hi I am new to R and have a question. I have a data.frame (df) containing about 30 different types of statistics from years 1960-2012 for about 100 different countries. Here is an example of what it looks like:
Country Statistic.Type 1960 1961 1962 1963 ... 2012
__________________________________________________________________________________
1 Albania Death Rate 10 21 13 24 25
2 Albania Birth Rate 7 15 6 10 9
3 Albania Life Expectancy 8 12 10 7 20
4 Albania Population 10 30 27 18 13
5 Brazil Death Rate 14 20 22 13 18
6 Brazil Birth Rate ...
7 Brazil Life Expectancy ...
8 Brazil Population ...
9 Cambodia Death Rate ...
10 Cambodia Birth Rate ... etc...
Note that there are 55 columns in total and the values in each of the 53 year columns are made up for the purposes of this question.
I need help writing a function which takes as inputs the country and statistic type and returns a new data.frame with 2 columns which shows the year and value in each year for a given country and statistic type. For example, if I input country=Brazil and statistic.type=Death Rate into the function, the new data.frame should look like:
Year Value
_____________________
1 1960 14
2 1961 20
3 1962 22
...
51 2012 18
I have no idea on how to do this, if anyone can give me any ideas/code/packages to install then that would be very helpful.
Thank you so much!
If df is your data.frame, all you need is this:
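f <- function(country, statistic.type, data=df)
{
  values <- data[data$Country==country & data$Statistic.Type==statistic.type,-(1:2)]
  # values is a single row of year columns; flatten it and pair it with the year names
  data.frame(Year=names(data)[-(1:2)], Value=unlist(values, use.names=FALSE))
}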
Use it as
f(country="Brazil", statistic.type="Death Rate")
You will probably have to do some split operation on the total data set to have country individual datasets.
https://stat.ethz.ch/pipermail/r-help/2008-February/155328.html
Then use the melt function on each subset of the data. In your case, adapted from
http://www.statmethods.net/management/reshape.html, where mydata is the already split data:
# example of melt function; the two identifier columns stay, the year columns are stacked
library(reshape)
mdata <- melt(mydata, id=c("Country", "Statistic.Type"))
That is it.
You could just combine subset with stack, with maybe a gsub in there to leave only the numbers in your column of years:
df <- expand.grid(
"country" = c("A", "B"),
"statistic" = c("c", "d", "e", "f"),
stringsAsFactors = FALSE)
df$year1980 <- rnorm(8)
df$year1990 <- rnorm(8)
df$year2000 <- rnorm(8)
getYears <- function(input, cntry, stat) {
x <- subset(input, country == cntry & stat == statistic,
select = -c(country, statistic))
x <- stack(x)[,c("ind", "values")]
x$ind <- gsub("\\D", "", x$ind)
x
}
getYears(df, "A", "c")
ind values
1 1980 1.1421309
2 1990 1.0777974
3 2000 -0.2010913