Calculate 7 day average in r

Calculate 7 day average in r - r

Date
rate
7_Day_rate_avg
1967-07-01
12.5
N/a
1967-07-02
12.5
N/a
1967-07-03
6
N/a
1967-07-04
8
N/a
1967-07-05
4
N/a
1967-07-06
2
N/a
1967-07-07
11.5
avg
1967-07-08
12.1
avg
1967-07-09
10
avg
1967-07-10
12.0
avg
1967-07-11
11.1
avg
1967-07-12
10
avg
I'm trying to calculate 7 day rate average using the "rate" column in r using rolling mean function, but I am getting a lot of errors. Not sure where to start. I want the final output to look like 7_Day_rate_avg column
library(zoo)
rollmean(rate, date, 7)

Assuming data frame dat shown reproducibly in the Note at the end, there are several problems:
the library statement must be separated from the next statement by a newline or semicolon
rate is a column of a data frame, not an R variable
the arguments to rollmean are wrong
the alignment was not specified. Since you need right alignment use rollmeanr with an r on the end.
fill=NA is needed to specify that you want to fill the first 6 values with NA values.
1) Putting all this together we have:
library(zoo)
transform(dat, avg7 = rollmeanr(rate, 7, fill = NA))
giving this data frame:
date rate avg7
1 1967-07-01 12.5 NA
2 1967-07-02 12.5 NA
3 1967-07-03 6.0 NA
4 1967-07-04 8.0 NA
5 1967-07-05 4.0 NA
6 1967-07-06 2.0 NA
7 1967-07-07 11.5 8.071429
8 1967-07-08 12.1 8.014286
9 1967-07-09 10.0 7.657143
10 1967-07-10 12.0 8.514286
11 1967-07-11 11.1 8.957143
12 1967-07-12 10.0 9.814286
2) Alternately convert dat to a zoo object and then cbind it to the rolling mean. In this case we don't need fill= since zoo objects are automatically aligned.
library(zoo)
rate <- read.zoo(dat)
cbind(rate, avg7 = rollmeanr(rate, 7))
giving this zoo object:
rate avg7
1967-07-01 12.5 NA
1967-07-02 12.5 NA
1967-07-03 6.0 NA
1967-07-04 8.0 NA
1967-07-05 4.0 NA
1967-07-06 2.0 NA
1967-07-07 11.5 8.071429
1967-07-08 12.1 8.014286
1967-07-09 10.0 7.657143
1967-07-10 12.0 8.514286
1967-07-11 11.1 8.957143
1967-07-12 10.0 9.814286
Note
dat in reproducible form is:
dat <- structure(list(date = c("1967-07-01", "1967-07-02", "1967-07-03",
"1967-07-04", "1967-07-05", "1967-07-06", "1967-07-07", "1967-07-08",
"1967-07-09", "1967-07-10", "1967-07-11", "1967-07-12"), rate = c(12.5,
12.5, 6, 8, 4, 2, 11.5, 12.1, 10, 12, 11.1, 10)), row.names = c(NA,
-12L), class = "data.frame")

Related

Getting weather data for multiple stations conditional for specific dates in R

I have the following problem:
For an analysis of weather effects on volunteers observing nature (animals, plants etc.) for a citizen science web page, I need to match the daily observations with the weather information of the nearest weather station. I'm using rdwd (for data of German weather service) and already managed to combine each observation location with the nearest weather station. So I now have a data frame (my_df_example) like this with 100 rows:
ID Date lat long Station_id Stationname
1317186439 2019-05-03 47.77411 9.540569 4094 Weingarten, Kr. Ravensburg
-2117439060 2019-05-19 48.87217 9.396229 10510 Winterbach/Remstal
-630183789 2019-04-30 48.86810 9.285427 4928 Stuttgart (Schnarrenberg)
-390672435 2019-05-10 50.71187 8.706279 1639 Giessen/Wettenberg
262182713 2019-05-01 50.82548 8.892961 3164 Coelbe, Kr. Marburg-Biedenkopf
-373270631 2019-05-24 51.61666 7.950153 5480 Werl
with dput(my_df_example):
structure(list(ID = c(1317186439L, -2117439060L, -630183789L, -390672435L, 262182713L, -373270631L,...
Datum = structure(c(1556841600, 1558224000, 1556582400, 1557446400, 1556668800, 1558656000, 1558224000, 1557532800,..., class = c("POSIXct", "POSIXt"), tzone = "UTC"),
lat = c(47.7741093721703, 48.8721672952686, 48.8681024146134, 50.7118683229165, 50.8254843786222, 51.6166575725419, 48.7357007677785,...
long = c(9.54056899481679, 9.3962287902832, 9.28542673587799, 8.70627880096436, 8.89296054840088, 7.95015335083008, 11.3105964660645,...
Stations_id = c(4094L, 10510L, 4928L, 1639L, 3164L, 5480L, 3484L,...
Stationsname = c("Weingarten, Kr. Ravensburg", "Winterbach/Remstal", "Stuttgart (Schnarrenberg)", "Giessen/Wettenberg", "Coelbe, Kr. Marburg-Biedenkopf", "Werl",...
row.names = c("58501", "89910", "69539", "24379", "45331", "77191", "50028",
class = "data.frame")
What I need to do now is get the weather information for each station on that specific date. I'm trying to use the rdwd package in R to do so.
I tried two options so far, that both didn't work out.
Option 1:
urls <- selectDWD(name=my_df_final$Stationsname, res="daily", var="kl", per="historical", outvec=TRUE)
kl <- dataDWD(urls[1:100])
That gives me a list of 100 lists. Each list of the 100 includes the weather data for every recorded day of a certain station. So I would need to filter the data from those lists so that the date matches the dates in my_df_example. I don't know how to extract info from a list inside a list though.
Option 2:
stat <- my_df_example$Stationname
link <- selectDWD(c(stat), res="daily", var="kl", per="hist")
file <- dataDWD(link, read=FALSE)
clim <- readDWD(file, varnames=TRUE)
The problem here is, that dataDWD doesn't work for lists. And since "link" includes multiple Station names it is not just a vector.
I don't really know if one of these options is the right way at all or if an alternative would make more sense.
Thank you for any advice you can give.

I would suggest a data.table solution:
library(data.table)
full = rbindlist(kl) # Convert list to one huge DF
setDT(my_df_final) # Convert your df to DT
new_df <- merge(my_df_final, full, by.x = c("ID", "Datum"), by.y = c("STATIONS_ID", "MESS_DATUM"), all.x = T) # Merge full and your df
new_df
ID Datum lat long Stations_id Stationsname QN_3 FX FM QN_4 RSK RSKF SDK
1: 1639 2019-05-10 50.71187 8.706279 1639 Giessen/Wettenberg 10 9.1 3.3 3 9.3 6 4.000
2: 3164 2019-05-01 50.82548 8.892961 3164 Coelbe, Kr. Marburg-Biedenkopf NA NA NA 3 0.0 0 NA
3: 4094 2019-05-03 47.77411 9.540569 4094 Weingarten, Kr. Ravensburg 10 6.4 2.2 3 5.2 4 0.000
4: 4928 2019-04-30 48.86810 9.285427 4928 Stuttgart (Schnarrenberg) 10 7.9 2.7 3 0.0 6 3.583
5: 10510 2019-05-19 48.87217 9.396229 10510 Winterbach/Remstal 10 11.3 1.8 NA NA NA NA
SHK_TAG NM VPM PM TMK UPM TXK TNK TGK eor
1: NA 6.6 10.2 985.16 11.1 78.21 15.9 7.7 5.9 eor
2: NA NA 9.7 NA 12.3 71.00 20.0 3.2 1.4 eor
3: NA NA 10.0 NA 8.7 88.92 11.6 5.3 3.0 eor
4: 0 4.9 9.3 981.55 10.5 75.58 15.3 7.3 3.7 eor
5: NA NA NA NA NA NA NA NA NA eor
(should also work in base R, but is certainly faster this way)

According to your problem:
What I need to do now is get the weather information for each station on that specific date.
Then, once you have your list of lists (kl) then you can subset from this "meta"-list the information that you are looking for this way:
query <- lapply(kl, function(x) {
x[which((as.Date(x$MESS_DATUM) %in% as.Date(my_final_df$Date)) &
(x$STATIONS_ID %in% my_final_df$Station_id)), ]
})
x represents the object kl passed to the function definition. The %in% operator, as its letters indicate, will look for the elements in common between $MESS_DATUM and $Date variables and (&) also for the matches between STATIONS_ID and Station_id . which() ensures that no logical surprises occur while subsetting the data and as.Date() returns a common date format for both data frames.
After performing the extraction, you have to collapse the information into a single data frame. Since all the columns in all the lists inside the meta-list are the same, you can use do.call() + rbind() directly. Like:
query <- do.call(rbind,query)
To avoid messy rownames, call:
rownames(query) <- NULL
Then, to see the station names in the query data set, merge the query with my_final_df:
colnames(query)[1] <- "Station_id" # the key needs to have the samen name in both data frames
query <- merge(query,my_final_df, by = "Station_id", all = TRUE)
The final result looks like this:
Station_id MESS_DATUM QN_3 FX FM QN_4 RSK RSKF SDK SHK_TAG NM VPM PM TMK UPM TXK TNK TGK eor ID Date
2 1639 2019-05-01 10 7.1 2.0 3 0.0 0 11.383 NA 0.3 9.0 991.15 12.6 65.67 20.6 3.3 -0.4 eor -390672435 2019-05-10
7 3164 2019-04-30 NA NA NA 3 0.0 0 NA 0 NA 8.9 NA 12.3 64.92 18.7 5.4 3.4 eor 262182713 2019-05-01
16 4094 2019-05-10 10 10.3 3.4 3 5.7 4 5.933 NA NA 10.4 NA 11.9 76.04 16.8 8.5 6.8 eor 1317186439 2019-05-03
21 4928 2019-05-03 10 10.0 3.2 3 0.4 6 3.183 NA 7.5 9.0 973.66 10.4 72.38 14.2 7.8 7.3 eor -630183789 2019-04-30
29 5480 2019-05-19 10 11.0 1.8 3 1.0 6 5.000 NA 7.2 13.0 995.10 14.0 82.38 21.8 6.8 5.2 eor -373270631 2019-05-24
36 10510 2019-05-24 10 5.9 1.4 NA NA NA NA NA NA NA NA NA NA NA NA NA eor -2117439060 2019-05-19
lat long Stationname
2 50.71187 8.706279 Giessen/Wettenberg
7 50.82548 8.892961 Coelbe, Kr. Marburg-Biedenkopf
16 47.77411 9.540569 Weingarten, Kr. Ravensburg
21 48.86810 9.285427 Stuttgart (Schnarrenberg)
29 51.61666 7.950153 Werl
36 48.87217 9.396229 Winterbach/Remstal
This data set matches the dates and station's ids and names you first provided in the my_df_example.
Provided more time, maybe someone will tell us how to solve this with tidyverse notation, because I suspect it is even more straightforward to do the subsetting-extraction algorithm with this package.

Divide values in data frame according to date range in another data frame

I have two dataframes. Dataframe data one has two columns: one contains ymd dates, and the other values:
date value
1 2009-10-23 1100
2 2009-05-01 5000
3 2010-01-13 3050
4 2010-07-24 2700
5 2009-06-16 2600
My second dataframe (named factors) also has two columns: another ymd date, and a coefficient. Here, for each month of each year, I always have two specific dates: the 1st and 15th of each month. This is how the data frame looks (I only added some dates on this minimal example, but there shouldn't be any 'jumps': I have continued data in a 10-year period):
date coeff
1 2009-05-01 2.00
2 2009-05-15 3.00
3 2009-06-01 2.50
4 2009-06-15 4.00
5 2009-10-01 3.65
6 2009-10-15 4.80
7 2010-01-01 2.40
8 2010-01-15 1.90
9 2010-07-01 5.20
10 2010-07-15 4.30
The dataframes are ready to use on this fiddle: http://rextester.com/MOIY96065
My problem
I need to create a new column in dataframe 1 (named data) where this column is data$value / factors$coeff following a condition: it must use the coeff with the previous closest date value.
For example: date$value[1] should be divided by factors$coeff[6] (the value on October 15th), but date$value[2] should be divided by factors$coeff[1] (the value on May 1st).
My factors dataframe is ordered by date. I've been using lubridate to parse the dates from string type, but I don't know how can I make this work.

You can use findInterval() to get the indices for selecting the correct rows
from factors:
(i <- findInterval(date$date, factors$date))
#> [1] 6 1 7 10 4
date$value / factors$coeff[i]
#> [1] 229.1667 2500.0000 1270.8333 627.9070 650.0000
Created on 2018-08-09 by the reprex package (v0.2.0.9000).
Data:
date <- structure(list(date = structure(c(14540, 14365, 14622, 14814,
14411), class = "Date"), value = c(1100, 5000, 3050, 2700, 2600
)), row.names = c(NA, -5L), class = "data.frame")
factors <- structure(list(date = structure(c(14365, 14379, 14396, 14410,
14518, 14532, 14610, 14624, 14791, 14805), class = "Date"), coeff = c(2,
3, 2.5, 4, 3.65, 4.8, 2.4, 1.9, 5.2, 4.3)), row.names = c(NA,
-10L), class = "data.frame")

Adapted form #Frank answer's here
d <- function(x,y) {
diff <- as.numeric(x-y)
diff <- which.min(diff[diff>=0])
}
indx <- sapply(df$date, function(x) d(x,df1$date))
df_final <- cbind(df,df1[indx,,drop=FALSE])
df_final$result <- df_final$value/df_final$coeff
date value date coeff result
1 2009-10-23 1100 2009-10-15 4.8 229.1667
2 2009-05-01 5000 2009-05-01 2.0 2500.0000
3 2010-01-13 3050 2010-01-01 2.4 1270.8333
4 2010-07-24 2700 2010-07-15 4.3 627.9070
5 2009-06-16 2600 2009-06-15 4.0 650.0000
data
df<-read.table(text=" date value
1 2009-10-23 1100
2 2009-05-01 5000
3 2010-01-13 3050
4 2010-07-24 2700
5 2009-06-16 2600
",header=TRUE)
df1<-read.table(text=" date coeff
1 2009-05-01 2.00
2 2009-05-15 3.00
3 2009-06-01 2.50
4 2009-06-15 4.00
5 2009-10-01 3.65
6 2009-10-15 4.80
7 2010-01-01 2.40
8 2010-01-15 1.90
9 2010-07-01 5.20
10 2010-07-15 4.30
",header=TRUE)

Converting Monthly Data to Daily in R

I have a data.frame df that has monthly data:
Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1
I want there to be data on every day in the month (and I will assume Value does not change during each month) since I will be merging this into a different table that has monthly data.
I want the output to look like this:
Date Value
2008-01-02 3.5
2008-01-03 3.5
2008-01-04 3.5
2008-01-05 3.5
2008-01-06 3.5
2008-01-07 3.5
2008-01-08 3.5
2008-01-09 3.5
2008-01-10 3.5
2008-01-11 3.5
2008-01-12 3.5
2008-01-13 3.5
2008-01-14 3.5
2008-01-15 3.5
2008-01-16 3.5
2008-01-17 3.5
2008-01-18 3.5
2008-01-19 3.5
2008-01-20 3.5
2008-01-21 3.5
2008-01-22 3.5
2008-01-23 3.5
2008-01-24 3.5
2008-01-25 3.5
2008-01-26 3.5
2008-01-27 3.5
2008-01-28 3.5
2008-01-29 3.5
2008-01-30 3.5
2008-01-31 3.5
2008-02-01 9.5
I have tried to.daily but my call:
df <- to.daily(df$Date)
returns
Error in to.period(x, "days", name = name, ...) : ‘x’ contains no data

Not sure if i understood perfectly but i think something like this may work.
First, i define the monthly data table
library(data.table)
DT_month=data.table(Date=as.Date(c("2008-01-01","2008-02-01","2008-03-01","2008-05-01","2008-07-01"))
,Value=c(3.5,9.5,0.1,5,8))
Then, you have to do the following
DT_month[,Month:=month(Date)]
DT_month[,Year:=year(Date)]
start_date=min(DT_month$Date)
end_date=max(DT_month$Date)
DT_daily=data.table(Date=seq.Date(start_date,end_date,by="day"))
DT_daily[,Month:=month(Date)]
DT_daily[,Year:=year(Date)]
DT_daily[,Value:=-100]
for( i in unique(DT_daily$Year)){
for( j in unique(DT_daily$Month)){
if(length(DT_month[Year==i & Month== j,Value])!=0){
DT_daily[Year==i & Month== j,Value:=DT_month[Year==i & Month== j,Value]]
}
}
}
Basically, the code will define the month and year of each monthly value in separate columns.
Then, it will create a vector of daily data using the minimum and maximum dates in your monthly data, and will create two separate columns for year and month for the daily data as well.
Finally, it goes through every combination of year and months of data filling the daily values with the monthly ones. In case there is no data for certain combination of month and year, it will show a -100.
Please let me know if it works.

An option using tidyr::expand expand a row between 1st day of month to last day of month. The lubridate::floor_date can provide 1st day of month and lubridate::ceiling_date() - days(1) will provide last day of month.
library(tidyverse)
library(lubridate)
df %>% mutate(Date = ymd(Date)) %>%
group_by(Date) %>%
expand(Date = seq(floor_date(Date, unit = "month"),
ceiling_date(Date, unit="month")-days(1), by="day"), Value) %>%
as.data.frame()
# Date Value
# 1 2008-01-01 3.5
# 2 2008-01-02 3.5
# 3 2008-01-03 3.5
# 4 2008-01-04 3.5
# 5 2008-01-05 3.5
#.....so on
# 32 2008-02-01 9.5
# 33 2008-02-02 9.5
# 34 2008-02-03 9.5
# 35 2008-02-04 9.5
# 36 2008-02-05 9.5
#.....so on
# 85 2008-03-25 0.1
# 86 2008-03-26 0.1
# 87 2008-03-27 0.1
# 88 2008-03-28 0.1
# 89 2008-03-29 0.1
# 90 2008-03-30 0.1
# 91 2008-03-31 0.1
Data:
df <- read.table(text =
"Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
header = TRUE, stringsAsFactors = FALSE)

to.daily can only be applied to xts/zooobjects and can only convert to a LOWER frequency. i.e. from daily to monthly, but not the other way round.
One easy way to accomplish what you want is converting df to an xts object:
df.xts <- xts(df$Value,order.by = df$Date)
And merge, like so:
na.locf(merge(df.xts, foo=zoo(NA, order.by=seq(start(df.xts), end(df.xts),
"day",drop=F)))[, 1])
df.xts
2018-01-01 3.5
2018-01-02 3.5
2018-01-03 3.5
2018-01-04 3.5
2018-01-05 3.5
2018-01-06 3.5
2018-01-07 3.5
….
2018-01-27 3.5
2018-01-28 3.5
2018-01-29 3.5
2018-01-30 3.5
2018-01-31 3.5
2018-02-01 9.5
2018-02-02 9.5
2018-02-03 9.5
2018-02-04 9.5
2018-02-05 9.5
2018-02-06 9.5
2018-02-07 9.5
2018-02-08 9.5
….
2018-02-27 9.5
2018-02-28 9.5
2018-03-01 0.1
If you want to adjust the price continuously over the course of a month use na.spline in place of na.locf.

Maybe not an efficient one but with base R we can do
do.call("rbind", lapply(1:nrow(df), function(i)
data.frame(Date = seq(df$Date[i],
(seq(df$Date[i],length=2,by="months") - 1)[2], by = "1 days"),
value = df$Value[i])))
We basically generate a sequence of dates from start_date to the last day of that month which is calculated by
seq(df$Date[i],length=2,by="months") - 1)[2]
and repeat the same value for all the dates and put them in the data frame.
We get a list of dataframe and then we can rbind them using do.call.

Another way:
library(lubridate)
d <- read.table(text = "Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
stringsAsFactors = FALSE, header = TRUE)
Dates <- seq(from = min(as.Date(d$Date)),
to = ceiling_date(max(as.Date(d$Date)), "month") - days(1),
by = "1 days")
data.frame(Date = Dates,
Value = setNames(d$Value, d$Date)[format(Dates, format = "%Y-%m-01")])

Calculate average of month and replace values of other column

I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is the column which contains value based on some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace pvar value if its value less than the average value for that particular month.
For example,
for month 4,
Average value of pvar is 9.3 ((8.4+2.4+12+14.4)/4).
Then replace all the value in pvar which is less than avg for month 4 that is (8.4 &2.4).
Pvar value would be 9.3,9.3,12,14.4
I need to do this for all the values in pvar.

A base R solution would be to use ave. Note that we first need to convert the date column to actual date in order to extract the month (strsplit or regex can also do it but I prefer to have it set as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: Create an extra grouping variable that will put all the months you need to alter the values in the same group and replace from there, i.e.
library(dplyr)
cbind(df, pvar) %>%
group_by(grp = cumsum(!month %in% c(4, 5))+1, month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
ungroup() %>%
select(-grp)
Second: Filter the months you need, do the calculations. Then filter the months you don't need, create again the pvar but without changing anything (necessary for binding the rows) and bind the rows, i.e.
bind_rows(
cbind(df, pvar) %>%
filter(month %in% c(4, 5)) %>%
group_by(month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
cbind(df, pvar) %>%
filter(!month %in% c(4, 5))
)
Both the above give,
vdate month col1 pvar
<fct> <dbl> <dbl> <dbl>
1 12-04-2015 4. 12.0 12.0
2 13-04-2015 4. 12.4 12.4
3 14-04-2015 4. 14.3 14.3
4 15-04-2015 4. 3.00 10.4
5 12-05-2015 5. 5.30 5.30
6 13-05-2015 5. 1.80 4.80
7 14-05-2015 5. 7.60 7.60
8 15-05-2015 5. 4.50 4.80
9 12-06-2015 6. 7.60 7.60
10 13-06-2015 6. 10.7 10.7
11 14-06-2015 6. 12.0 12.0
12 15-06-2015 6. 15.7 15.7

A dplyr based solution could be :
#Additional condition has been added to check if month != 6
cbind(df, pvar) %>%
group_by(month) %>%
mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)

Create several new derived variables from existing variables in data.frame

In R I have a data.frame that has several variables that have been measured monthly over several years. I would like to derive the monthly average (using all years) for each variable. Ideally these new variables would all be together in a new data.frame (carrying over the ID), below I am simply adding the new variable to the data.frame. The only way I know how to do this at the moment (below) seems quite laborious, and I was hoping there might be a smarter way to do this in R, that would not require typing out each month and variable as I did below.
# Example data.frame with only two years, two month, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...

Here's a solution using data.table development version v1.8.11 (which has melt and cast methods implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(melt(dt, id="ID")[, sum(value)/.N, list(ID,
gsub("Y.*$", "", variable))], ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
You can just cbind this to your original data.
Note that sum is a primitive where as mean is S3 generic. Therefore, using sum(.)/length(.) is better (as if there are too many groupings, dispatching the right method with mean for every group could be quite a time-consuming operation). .N is a special variable in data.table that directly gives you the length of the group.

Here is a solution using reshape2 that is more automated when you have lots of data and uses regular expressions to extract the variable name and the month. This solution will give you a nice summary table.
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable colum
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
Or to be compatible with the other solutions, and return the table by ID as well...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0

This is pretty straight forward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This gives you your original data.frame with the following four columns at the end:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0

You can use Reshape from package {splitstackshape} and then use plyr package or data.table or base R to perform mean.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Calculate 7 day average in r - r

Related

Getting weather data for multiple stations conditional for specific dates in R

Divide values in data frame according to date range in another data frame

Converting Monthly Data to Daily in R

Calculate average of month and replace values of other column

Create several new derived variables from existing variables in data.frame

Categories

Resources