Here is an example of my data:
Date Prec aggregated by week (output)
1/1/1950 3.11E+00 4.08E+00
1/2/1950 3.25E+00 9.64E+00
1/3/1950 4.81E+00 1.15E+01
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00
I have a long precipitation time series and I want to aggregate it as follows (the expected output is in the third column; I calculated it in Excel).
If I aggregate weekly:
output in 1st cell = average prec from day 1 to day 7.
output in 2nd cell = average prec from day 8 to day 14.
output in 3rd cell = average prec from day 15 to day 21.
If I aggregate by 3 days:
output in 1st cell = average of day 1 to day 3.
output in 2nd cell = average of day 4 to day 6.
I will provide the function with the "prec" and "time step" inputs. I tried loops and lubridate, POSIXct, and some other functions, but I can't figure out how to produce the output in the third column.
One piece of code I came up with ran without error, but my output is not correct.
Where dat is my data set.
library(zoo)
library(xts)  # apply.weekly() comes from the xts package
tt <- as.POSIXct(paste(dat$Date), format = "%m/%d/%Y")  # convert the date format
datZoo <- zoo(dat[, -c(1, 3)], tt)
weekly <- apply.weekly(datZoo, mean)
prec_NLCD <- data.frame(weekly)
I also wanted to write it in the form of a function. Your suggestions will be helpful.
Assuming the data shown reproducibly in the Note at the end, create the weekly means, zm, and then merge them with z.
(It would seem to make more sense to merge the means at the point that they are calculated, i.e. merge(z, zm) in place of the line marked ##, but for consistency with the output shown in the question they are put at the head of the data below.)
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
zm <- rollapplyr(z, 7, by = 7, mean)
merge(z, zm = zoo(coredata(zm), head(time(z), length(zm)))) ##
giving:
z zm
1950-01-01 3.11 4.081429
1950-01-02 3.25 9.642857
1950-01-03 4.81 11.517143
1950-01-04 7.07 NA
1950-01-05 4.25 NA
1950-01-06 3.11 NA
1950-01-07 2.97 NA
1950-01-08 2.83 NA
1950-01-09 2.72 NA
1950-01-10 2.72 NA
1950-01-11 2.60 NA
1950-01-12 2.83 NA
1950-01-13 17.00 NA
1950-01-14 36.80 NA
1950-01-15 42.40 NA
1950-01-16 17.00 NA
1950-01-17 7.07 NA
1950-01-18 3.96 NA
1950-01-19 3.54 NA
1950-01-20 3.40 NA
1950-01-21 3.25 NA
Note:
Lines <- "Date Prec
1/1/1950 3.11E+00
1/2/1950 3.25E+00
1/3/1950 4.81E+00
1/4/1950 7.07E+00
1/5/1950 4.25E+00
1/6/1950 3.11E+00
1/7/1950 2.97E+00
1/8/1950 2.83E+00
1/9/1950 2.72E+00
1/10/1950 2.72E+00
1/11/1950 2.60E+00
1/12/1950 2.83E+00
1/13/1950 1.70E+01
1/14/1950 3.68E+01
1/15/1950 4.24E+01
1/16/1950 1.70E+01
1/17/1950 7.07E+00
1/18/1950 3.96E+00
1/19/1950 3.54E+00
1/20/1950 3.40E+00
1/21/1950 3.25E+00"
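The question also asks for a function taking the precipitation series and a time step. Building on the rollapplyr() call above, a minimal sketch (the function name agg_prec and its arguments are my own, not from the question):

```r
library(zoo)

# Non-overlapping block means: step = 7 gives weekly means,
# step = 3 gives 3-day means, matching the layout asked for.
agg_prec <- function(prec, step) {
  rollapplyr(prec, step, by = step, mean)
}

# Days 1-7 of the sample data
z <- zoo(c(3.11, 3.25, 4.81, 7.07, 4.25, 3.11, 2.97),
         as.Date("1950-01-01") + 0:6)
agg_prec(z, 7)  # one value: the mean of days 1-7, 4.081429
```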
Related
myts <- ts(mydata[,2:4], start = c(1981, 1), frequency = 4)
The above line of code is clear to me except for the part mydata[,2:4]. I know that mydata[,2] tells R to read from the 2nd column; however, what does 2:4 stand for?
The data looks like this:
usnim
1 2002-01-01 4.08
2 2002-04-01 4.10
3 2002-07-01 4.06
4 2002-10-01 4.04
Could you provide an example of when to use 2:4?
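2:4 is just shorthand for the integer vector c(2, 3, 4), so mydata[, 2:4] selects columns 2 through 4 at once. A small made-up example (the extra columns gdp and cpi are invented for illustration, since the data shown has only one series):

```r
# Hypothetical data frame: a date column plus three quarterly series
mydata <- data.frame(date  = c("2002-01-01", "2002-04-01", "2002-07-01"),
                     usnim = c(4.08, 4.10, 4.06),
                     gdp   = c(1.2, 1.3, 1.4),
                     cpi   = c(2.1, 2.2, 2.3))
2:4            # the vector c(2, 3, 4)
mydata[, 2:4]  # the usnim, gdp and cpi columns together
# A multivariate quarterly time series with one column per selected series:
myts <- ts(mydata[, 2:4], start = c(2002, 1), frequency = 4)
ncol(myts)     # 3
```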
I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
library(tibble)
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
'rate' = c(7.50, 6.50, 5.54),
'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol=2, dimnames=list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from tibble package (Adding a column between two columns in a data.frame) does not work because it only accepts an atomic vector (hence if I have 10 lags I will have to call add_column 10 times). I could use apply in my rate_Lags matrix. Then, however, I lose the dimnames from my rate_Lags matrix.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
You may be overlooking the following:
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
cbind(df, rate_Lags)
} else {
cbind(select(df, 1:I), rate_Lags, select(df, (I+1):ncol(df)))
}
#         date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50           NA           NA    92
# 2 02/01/2000 6.50         7.50           NA    94
# 3 03/01/2000 5.54         5.54          7.5    96
Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M){
nms_DF <- colnames(DF)
nms_M <- colnames(M)
inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
cbind(DF[seq_len(inx)], M, DF[ seq_along(nms_DF)[-seq_len(inx)] ])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, use append to insert 'rate_Lags' at the desired position, and then reconvert the list to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with tidyverse
library(tidyverse)
rate_Lags %>%
as_tibble %>%
append(unclass(df), ., after = i1) %>%
bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at the percent change from 1 period before. I'm looking for a way to fill in NAs no matter how many lags I use, so that the percent-change vector is equal in length to the foo vector.
Any help would be much appreciated!
You could use your own modification of diff:
mydiff <- function(data, diff){
c(diff(data, lag = diff), rep(NA, diff))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
Let's say you have an array with the numbers 1 to 10 arranged in matrix form, where the matrix has 5 rows and 2 columns, and the 2nd column is to be assigned NA.
First, make a 5*2 matrix of the elements 1:10:
Array_test = array(c(1:10), dim = c(5, 2, 1))
Array_test
Array_test[, 2, ] = NA  # assign NA to the 2nd column
Array_test
# Similarly, to make only one element of the entire matrix NA,
# say the 4th row, 2nd column:
Array_test[4, 2, ] = NA
I have a dataframe (df) with the values (V) of different stocks at different dates (t). I would like to get a new df with the profitability for each time period.
Profitability is: ln(Vi_t / Vi_t-1)
where:
ln is the natural logarithm
Vi_t is the Value of the stock i at the date t
Vi_t-1 the value of the same stock at the date before
This is the output of df[1:3, 1:10]:
date SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
1 01/08/88 1507.5 3.63 4.98 159.20 15.62 14.64 4.01 4.59 11.33
2 01/09/88 1467.4 3.69 4.97 161.55 15.69 14.40 4.06 4.87 11.05
3 01/10/88 1538.0 3.27 5.47 173.72 16.02 14.72 4.14 5.05 11.94
Specifically, instead of 1467.4 at [2, "SMI"] I want the profitability which is ln(1467.4/1507.5) and the same for all the rest of the values in the dataframe.
As I am new to R I am stuck. I was thinking of using something like mapply, and create the transformation function myself.
Any help is highly appreciated.
This will compute the profitabilities (assuming the data is in a data.frame called d):
(d2 <- log(as.matrix(d[-1, -1]) / as.matrix(d[-nrow(d), -1])))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#1 -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#2 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
Then, you can add the dates back in, if you want (d2 is a matrix, so build a data.frame rather than assigning with $):
d2 <- data.frame(date = d$date[-1], d2)
Alternatively, you could use an apply based approach:
(d2 <- apply(d[-1], 2, function(x) diff(log(x))))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#[1,] -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#[2,] 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
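For completeness, the same per-column log return can also be written with dplyr, padding with NA so the result keeps the original number of rows (a sketch; across() assumes dplyr >= 1.0):

```r
library(dplyr)

# First three columns of the data shown in the question
d <- data.frame(date = c("01/08/88", "01/09/88", "01/10/88"),
                SMI  = c(1507.5, 1467.4, 1538.0),
                Bond = c(3.63, 3.69, 3.27))
# ln(V_t / V_t-1) for every column except date; the first row becomes NA
d2 <- d %>% mutate(across(-date, ~ c(NA, diff(log(.x)))))
d2$SMI  # NA -0.02696052  0.04699074
```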
This is a very simple question, but I haven't been able to find a definitive answer, so I thought I would ask it. I use the plm package for dealing with panel data. I am attempting to use the lag function to lag a variable FORWARD in time (the default is to retrieve the value from the previous period, and I want the value from the NEXT). I found a number of old articles/questions (circa 2009) suggesting that this is possible by using k=-1 as an argument. However, when I attempt this, I get an error.
Sample code:
library(plm)
df<-as.data.frame(matrix(c(1,1,1,2,2,3,20101231,20111231,20121231,20111231,20121231,20121231,50,60,70,120,130,210),nrow=6,ncol=3))
names(df)<-c("individual","date","data")
df$date<-as.Date(as.character(df$date),format="%Y%m%d")
df.plm<-pdata.frame(df,index=c("individual","date"))
Lagging:
lag(df.plm$data,0)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
50 60 70 120 130 210
lag(df.plm$data,1)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
NA 50 60 NA 120 NA
lag(df.plm$data,-1)
##returns
Error in rep(1, ak) : invalid 'times' argument
I've also read that plm.data has replaced pdata.frame for some applications in plm. However, plm.data doesn't seem to work with the lag function at all:
df.plm<-plm.data(df,indexes=c("individual","date"))
lag(df.plm$data,1)
##returns
[1] 50 60 70 120 130 210
attr(,"tsp")
[1] 0 5 1
I would appreciate any help. If anyone has another suggestion for a package to use for lagging, I'm all ears. However, I do love plm because it automagically deals with lagging across multiple individuals and skips gaps in the time series.
EDIT 2: Lagging forward (= leading values) is implemented in plm CRAN releases >= 1.6-4.
The functions are lead() and lag() (the latter with a negative integer produces leading values).
Take care with any other attached packages that use the same function names. To be sure, you can refer to the function by its full namespace, e.g., plm::lead.
Examples from ?plm::lead:
# First, create a pdata.frame
data("EmplUK", package = "plm")
Em <- pdata.frame(EmplUK)
# Then extract a series, which becomes additionally a pseries
z <- Em$output
class(z)
# compute negative lags (= leading values)
lag(z, -1)
lead(z, 1) # same as line above
identical(lead(z, 1), lag(z, -1)) # TRUE
The collapse package in CRAN has a C++ based function flag and also associated lag/lead operators L and F. It supports continuous sequences of lags/leads (positive and negative n values), and plm pseries and pdata.frame classes. Performance: 100x faster than plm and 10x faster than data.table (the fastest in R at the time of writing). Example:
library(collapse)
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
head(flag(pwlddev$LIFEEX, -1:1)) # A sequence of lags and leads
F1 -- L1
ABW-1960 66.074 65.662 NA
ABW-1961 66.444 66.074 65.662
ABW-1962 66.787 66.444 66.074
ABW-1963 67.113 66.787 66.444
ABW-1964 67.435 67.113 66.787
ABW-1965 67.762 67.435 67.113
head(L(pwlddev$LIFEEX, -1:1)) # Same as above
head(L(pwlddev, -1:1, cols = 9:12)) # Computing on columns 9 through 12
iso3c year F1.PCGDP PCGDP L1.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX F1.GINI GINI L1.GINI
ABW-1960 ABW 1960 NA NA NA 66.074 65.662 NA NA NA NA
ABW-1961 ABW 1961 NA NA NA 66.444 66.074 65.662 NA NA NA
ABW-1962 ABW 1962 NA NA NA 66.787 66.444 66.074 NA NA NA
ABW-1963 ABW 1963 NA NA NA 67.113 66.787 66.444 NA NA NA
ABW-1964 ABW 1964 NA NA NA 67.435 67.113 66.787 NA NA NA
ABW-1965 ABW 1965 NA NA NA 67.762 67.435 67.113 NA NA NA
F1.ODA ODA L1.ODA
ABW-1960 NA NA NA
ABW-1961 NA NA NA
ABW-1962 NA NA NA
ABW-1963 NA NA NA
ABW-1964 NA NA NA
ABW-1965 NA NA NA
library(microbenchmark)
library(data.table)
microbenchmark(plm_class = flag(pwlddev),
ad_hoc = flag(wlddev, g = wlddev$iso3c, t = wlddev$year),
data.table = qDT(wlddev)[, shift(.SD), by = iso3c])
Unit: microseconds
expr min lq mean median uq max neval cld
plm_class 462.313 512.5145 1044.839 551.562 637.6875 15913.17 100 a
ad_hoc 443.124 519.6550 1127.363 559.817 701.0545 34174.05 100 a
data.table 7477.316 8070.3785 10126.471 8682.184 10397.1115 33575.18 100 b
I had this same problem and couldn't find a good solution in plm or any other package. ddply was tempting (e.g. s5 = ddply(df, .(country,year), transform, lag=lag(df[, "value-to-lag"], lag=3))), but I couldn't get the NAs in my lagged column to line up properly for lags other than one.
I wrote a brute force solution that iterates over the dataframe row-by-row and populates the lagged column with the appropriate value. It's horrendously slow (437.33s for my 13000x130 dataframe vs. 0.012s for turning it into a pdata.frame and using lag) but it got the job done for me. I thought I would share it here because I couldn't find much information elsewhere on the internet.
In the function below:
df is your dataframe. The function returns df with a new column containing the forward values.
group is the column name of the grouping variable for your panel data. For example, I had longitudinal data on multiple countries, and I used "Country.Name" here.
x is the column you want to generate lagged values from, e.g. "GDP"
forwardx is the (new) column that will contain the forward lags, e.g. "GDP.next.year".
lag is the number of periods into the future. For example, if your data were taken in annual intervals, using lag=5 would set forwardx to the value of x five years later.
add_forward_lag <- function(df, group, x, forwardx, lag) {
for (i in 1:(nrow(df)-lag)) {
if (as.character(df[i, group]) == as.character(df[i+lag, group])) {
# put forward observation in forwardx
df[i, forwardx] <- df[i+lag, x]
}
else {
# end of group, no forward observation
df[i, forwardx] <- NA
}
}
# last elem(s) in forwardx are NA
for (j in ((nrow(df)-lag+1):nrow(df))) {
df[j, forwardx] <- NA
}
return(df)
}
See sample output using the built-in DNase dataset. This doesn't make sense in the context of the dataset, but it lets you see what the columns do.
data(DNase)  # DNase is a built-in dataset, not a package
add_forward_lag(DNase, "Run", "density", "lagged_density", 3)
Grouped Data: density ~ conc | Run
Run conc density lagged_density
1 1 0.04882812 0.017 0.124
2 1 0.04882812 0.018 0.206
3 1 0.19531250 0.121 0.215
4 1 0.19531250 0.124 0.377
5 1 0.39062500 0.206 0.374
6 1 0.39062500 0.215 0.614
7 1 0.78125000 0.377 0.609
8 1 0.78125000 0.374 1.019
9 1 1.56250000 0.614 1.001
10 1 1.56250000 0.609 1.334
11 1 3.12500000 1.019 1.364
12 1 3.12500000 1.001 1.730
13 1 6.25000000 1.334 1.710
14 1 6.25000000 1.364 NA
15 1 12.50000000 1.730 NA
16 1 12.50000000 1.710 NA
17 2 0.04882812 0.045 0.123
18 2 0.04882812 0.050 0.225
19 2 0.19531250 0.137 0.207
Given how long this takes, you may want to use a different approach: backwards-lag all of your other variables.
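For comparison, the same forward lag can be computed without the row-by-row loop using dplyr's lead() within groups, which should be dramatically faster on a 13000x130 data frame (a sketch reproducing the DNase example above):

```r
library(dplyr)

data(DNase)
DNase_lead <- DNase %>%
  group_by(Run) %>%                               # restart at each panel group
  mutate(lagged_density = lead(density, 3)) %>%   # value from 3 rows ahead; NA at group end
  ungroup()
head(DNase_lead$lagged_density, 4)  # 0.124 0.206 0.215 0.377
```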