Removing NA columns in xts - r

I have an xts in the following format
a b c d e f ......
2011-01-03 11.40 NA 23.12 0.23 123.11 NA ......
2011-01-04 11.49 NA 23.15 1.11 111.11 NA ......
2011-01-05 NA NA 23.11 1.23 142.32 NA ......
2011-01-06 11.64 NA 39.01 NA 124.21 NA ......
2011-01-07 13.84 NA 12.12 1.53 152.12 NA ......
Is there a function I can apply to generate a new xts or data.frame missing the columns containing only NA?
The position of the columns with the NAs isn't static, so removing those columns by name or position isn't possible.

Suppose DF is your data.frame:
DF[, -which(sapply(DF, function(x) sum(is.na(x))) == nrow(DF))]
a c d e
2011-01-03 11.40 23.12 0.23 123.11
2011-01-04 11.49 23.15 1.11 111.11
2011-01-05 NA 23.11 1.23 142.32
2011-01-06 11.64 39.01 NA 124.21
2011-01-07 13.84 12.12 1.53 152.12

@Jiber's solution works, but it might give you unexpected results if there are no columns with all NA. For example:
# sample data
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
# Jiber's solution, when no columns have all missing values
DF <- as.data.frame(x)
DF[, -which(sapply(DF, function(x) sum(is.na(x))) == nrow(DF))]
# data frame with 0 columns and 180 rows
Here's a solution that works whether or not there are columns that have all missing values:
# keep only the columns that are not entirely NA
y <- x[, !apply(is.na(x), 2, all)]
# make two columns all-NA to demonstrate
x$High <- NA
x$Close <- NA
z <- x[, !apply(is.na(x), 2, all)]

Try this (note it has the same empty-which() caveat as above when no column is all NA):
dataframe[, -which(apply(is.na(dataframe), 2, all))]

This seems simpler:
DF[, colSums(is.na(DF)) < nrow(DF)]
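Since an xts object is a matrix underneath, the colSums() test should also work directly on the xts from above, with no data.frame conversion (a small sketch, assuming the x defined earlier):
x[, colSums(is.na(x)) < nrow(x)]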

Related

Update dt columns based on named list

Let's say, I have the following my_dt datatable:
neutrons  spectrum  geography
    2.30      -1.2       KIEL
    2.54      -1.6       KIEL
    2.56      -0.9       JUNG
    2.31      -0.3        ANT
Also I have the following named list (my_list):
> my_list
$particles
[1] "neutrons"
$station
[1] NA
$energy
[1] "spectrum"
$area
[1] "geography"
$gamma
[1] NA
The values of this list correspond to column names in my dataset (where such a column exists; where it is absent, the value is NA).
Based on my dataset and this list, I need to check which columns exist in my_dt and rename them (based on the names of my_list); for the NA values I need to create columns filled with NAs.
So, I want to obtain the following dataset:
>final_dt
particles  station  energy  area  gamma
     2.30       NA    -1.2  KIEL     NA
     2.54       NA    -1.6  KIEL     NA
     2.56       NA    -0.9  JUNG     NA
     2.31       NA    -0.3   ANT     NA
I tried to implement this using apply-family functions, but so far I can't obtain exactly what I want.
So, I would be grateful for any help!
data.table using lapply
library(data.table)
setDT(my_dt)
setDT(my_list)
final_dt <- setnames(
  my_list[, lapply(.SD, function(x) {
    if (x %in% colnames(my_dt)) my_dt[, x, with = FALSE] else NA
  })],
  names(my_list))
final_dt
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
base R using sapply
setDF(my_dt)
setDF(my_list)
data.frame(sapply(my_list, function(x) if (!is.na(x)) my_dt[, x] else NA))
particles station energy area gamma
1 2.30 NA -1.2 KIEL NA
2 2.54 NA -1.6 KIEL NA
3 2.56 NA -0.9 JUNG NA
4 2.31 NA -0.3 ANT NA
Data
my_dt <- structure(list(neutrons = c(2.3, 2.54, 2.56, 2.31), spectrum = c(-1.2,
-1.6, -0.9, -0.3), geography = c("KIEL", "KIEL", "JUNG", "ANT"
)), class = "data.frame", row.names = c(NA, -4L))
my_list <- list(particles = "neutrons", station = NA, energy = "spectrum",
area = "geography", gamma = NA)
This may not meet your needs, but since I came up with it separately, I thought I would share it just in case. You can use setnames to rename the columns based on my_list. After that, add the missing column names with values of NA. Finally, you can use setcolorder to reorder based on your list if desired.
library(data.table)
my_vec <- unlist(my_list)
setnames(my_dt, names(my_vec[match(names(my_dt), my_vec)]))
my_dt[, (setdiff(names(my_vec), names(my_dt))) := NA]
setcolorder(my_dt, names(my_vec))
my_dt
Output
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
I wrote some simple code that should do the job for you:
l = list(c = 'cc', a = 'aa', b = NA) # replace this with your my_list
dt = data.frame(aa = 1:3, cc = 2:4)  # replace this with my_dt
dtl = data.frame(l)
# rename dt's columns to the names of the matching l entries
names(dt) = names(l)[match(names(dt), unlist(l))]
# cross-merge in the l entries that have no matching column (all NA)
m = merge(dt, dtl[!is.element(names(dtl), names(dt))])
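Applied to the question's my_dt and my_list, the same recipe would look something like this (a sketch; the final column order can be restored with names(my_list)):
dt = my_dt
dtl = data.frame(my_list)   # one-row frame holding the mapping
names(dt) = names(my_list)[match(names(dt), unlist(my_list))]
m = merge(dt, dtl[!is.element(names(dtl), names(dt))])
m = m[names(my_list)]       # reorder columns to match my_list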

populate values from another data frame based on predefined set of columns

I have two data frames. The first one look like that:
df1 <- data.frame(Hugo_Symbol=c("CDKN2A", "JUN", "IRS2","MTOR",
"NRAS"),
A183=c(-0.19,NA,2.01,0.4,1.23),
A185=c(0.11,2.45,NA,NA,1.67),
A186=c(1.19,NA,2.41,0.78,1.93),
A187=c(2.78,NA,NA,0.7,2.23),
A188=c(NA,NA,NA,2.4,1.23))
head(df1)
Hugo_Symbol A183 A185 A186 A187 A188
1 CDKN2A -0.19 0.11 1.19 2.78 NA
2 JUN NA 2.45 NA NA NA
3 IRS2 2.01 NA 2.41 NA NA
4 MTOR 0.40 NA 0.78 0.70 2.40
5 NRAS 1.23 1.67 1.93 2.23 1.23
The second data frame is smaller and has empty values:
df2 <- data.frame(Hugo_Symbol=c("CDKN2A", "IRS2", "NRAS"),
A183=c(0, 0, 0),
A187=c(0, 0, 0),
A188=c(0, 0, 0))
head(df2)
Hugo_Symbol A183 A187 A188
1 CDKN2A 0 0 0
2 IRS2 0 0 0
3 NRAS 0 0 0
I would like to populate the second data frame with values from the first data frame. The final result will look like that:
Hugo_Symbol A183 A187 A188
1 CDKN2A -0.19 2.78 NA
2 IRS2 2.01 NA NA
3 NRAS 1.23 2.23 1.23
I tried the cbind() and merge() functions, but they do not work on data with different numbers of rows and columns.
I would appreciate any help!
Thank you!
Olha
I don't quite follow the logic of your expected output (I suspect a typo in it), but I think you want the following:
matchedRowInds <- match(df2$Hugo_Symbol,df1$Hugo_Symbol)
matchedColInds <- match(colnames(df2),colnames(df1))
newdf <- df1[matchedRowInds,matchedColInds]
# > newdf
# Hugo_Symbol A183 A187 A188
# 1 CDKN2A -0.19 2.78 NA
# 3 IRS2 2.01 NA NA
# 5 NRAS 1.23 2.23 1.23
Idea: get the rows of the bigger data frame that are present in the smaller one; do the same with the columns.
You can use semi_join from dplyr (your final table has unexpected values); my version:
library(dplyr)
df3 <- df1 %>%
  semi_join(df2, by = "Hugo_Symbol") %>%
  select(Hugo_Symbol, A183, A187, A188)
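If the columns of df2 are not known in advance, the select can be generalized with tidyselect's all_of() (assuming a reasonably recent dplyr):
df3 <- df1 %>%
  semi_join(df2, by = "Hugo_Symbol") %>%
  select(all_of(names(df2)))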
Here is a data.table approach... are you sure the desired output in your question is correct? It seems to me that IRS2 / A188 should be NA and not 2.23?
library( data.table )
#make them both data.tables
setDT(df1); setDT(df2)
#find the common columns
comcols <- intersect( names(df1[,-1]), names(df2[,-1]) )
#create a data.table syntax for an update join on the common columns
expr <- paste0("df2[ df1, `:=` (",
               paste0(comcols, " = i.", comcols, collapse = " ,"),
               " ), on = .(Hugo_Symbol) ]")
eval(parse(text=expr))
df2
# Hugo_Symbol A183 A187 A188
# 1: CDKN2A -0.19 2.78 NA
# 2: IRS2 2.01 NA NA
# 3: NRAS 1.23 2.23 1.23
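As an aside, the same update join can be written without eval(parse(...)) by assigning to a character vector of column names and pulling the i.-prefixed join columns with mget (a sketch using the comcols defined above):
df2[df1, (comcols) := mget(paste0("i.", comcols)), on = .(Hugo_Symbol)]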

How to group by and fill NA with closest not NA in R dataframe column with condition on another column

I have a data frame of blood test marker results, and I want to fill in the NAs by the following criteria:
For each group of ID (TIME is in ascending order): if a marker value is NA, fill it with the closest non-NA value in that group (past or future), but only if the time difference is less than 14 days.
Here is an example of my data:
df<-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ID is the patient.
The TIME is the time of the blood test.
The others are the markers.
The only way I could do it is with loops which I try to avoid as much as possible.
I expect the output to be:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
CA.19.9 and CA.125 are filled with the previous value (10 days before).
NSE is filled with the previous value (11 days).
CA.72.4 is not filled, since the 1.32 measured at time -48 is 49 days away from the next measurement.
I bet there is a much simpler, vectorized solution but the following works.
fill_NA <- function(DF){
  sp <- split(DF, DF$ID)   # split the argument DF, not the global df
  sp <- lapply(sp, function(DF){
    d <- diff(DF$TIME)
    i_diff <- c(FALSE, d < 14)
    res <- sapply(DF[-(1:2)], function(X){
      inx <- i_diff & is.na(X)
      if(any(inx)){
        inx <- which(inx)
        last_change <- -1
        for(i in inx){
          if(i > last_change + 1){
            if(i == 1){
              X[i] <- X[i + 1]
            }else{
              X[i] <- X[i - 1]
            }
            last_change <- i
          }
        }
      }
      X
    })
    cbind(DF[1:2], res)
  })
  res <- do.call(rbind, sp)
  row.names(res) <- NULL
  res
}
fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
Yes, you can have a vectorized solution. First let us consider the case in which you only impute using the future value. You need to create a few auxiliary variables:
a variable that tells you whether the next observation belongs to the same id (so it can be used to impute),
a variable that tells you whether the next observation is less than 14 days apart from the current one.
These do not depend on the specific variable you want to impute. For each variable to be imputed you will also need a variable that tells you whether the next value is missing.
Then you can vectorize the following logic: when the next observation has the same id, is less than 14 days from the current one, and is not missing, copy its value into the current one.
Things get more complicated when you need to decide whether to use the past or the future value, but the logic is the same. The code is below; it is a bit long, but you can simplify it. I just wanted to be clear about what it does.
Hope this helps
x <-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
### these are the columns we want to input
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
  stop("\nlast observation usef is not missing\n")
### now we get into column specific operations.
for (col in cols.to.impute){
  ### we will store the results in x$imputed, and copy into x[,col] at the end
  x$imputed <- x[,col]
  ### x$usef needs to be modified depending on the specific column, so we define a local version of it
  x$usef.local <- x$usef
  ### if a variable is not missing there is no point in looking at usef.local, so we make it missing
  x$usef.local[!is.na(x[,col])] <- NA
  ### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
  ### make it 0. but a value of 0 does not mean we can use the previous observation because that may
  ### be missing too. so first we make usef 0 and next we check the previous observation and if that
  ### is missing too we make usef missing
  x$previous.value <- c(NA, x[1:(nrow(x)-1), col])
  x$next.value <- c(x[2:nrow(x), col], NA)
  x$next.missing <- is.na(x$next.value)
  x$previous.missing <- is.na(x$previous.value)
  x$usef.local[x$next.missing & x$usef.local == 1] <- 0
  x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
  ### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 1] <- TRUE
  x$imputed[tmp] <- x$next.value[tmp]
  tmp <- rep(FALSE, nrow(x))
  tmp[x$usef.local == 0] <- TRUE
  x$imputed[tmp] <- x$previous.value[tmp]
  ### copy to column
  x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA

Rolling correlation with id and date

I have some data that has a name, a date, and two factors (x, y), and I would like to calculate a rolling correlation between them:
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
I would like to add a column named "Correl" that, for each date, takes the correlation of x and y over the previous 5 periods. However, when the name changes, I would like it to give NAs instead.
As you can see below, when the data changes from ABC to XYZ, the correlation is NA for the first 4 periods; once there are 5 data points, the correlation begins again.
name date x y Correl
ABC 1/1/2013 -3.59 -5.13 NA
ABC 1/2/2013 -8.69 4.22 NA
ABC 1/3/2013 2.80 -0.59 NA
ABC 1/4/2013 0.54 5.06 NA
ABC 1/5/2013 1.13 3.49 -0.03
ABC 1/6/2013 0.52 5.16 -0.38
ABC 1/7/2013 -0.24 -5.40 0.08
ABC 1/8/2013 3.26 -2.75 -0.16
ABC 1/9/2013 1.33 5.94 -0.04
ABC 1/10/2013 2.24 1.14 -0.01
ABC 1/11/2013 0.01 9.87 -0.24
ABC 1/12/2013 2.29 1.28 -0.99
ABC 1/13/2013 1.03 -6.30 -0.41
ABC 1/14/2013 0.62 4.82 -0.47
ABC 1/15/2013 1.08 -1.17 -0.50
ABC 1/16/2013 2.43 8.86 0.45
ABC 1/17/2013 -3.43 9.38 -0.35
ABC 1/18/2013 -5.73 7.59 -0.38
ABC 1/19/2013 1.77 3.13 -0.44
ABC 1/20/2013 -0.97 -0.77 -0.24
XYZ 1/1/2013 2.12 10.22 NA
XYZ 1/2/2013 -0.81 0.22 NA
XYZ 1/3/2013 -1.55 -2.25 NA
XYZ 1/4/2013 -4.53 3.63 NA
XYZ 1/5/2013 2.95 -1.51 0.13
XYZ 1/6/2013 6.76 24.16 0.69
XYZ 1/7/2013 3.33 7.31 0.66
XYZ 1/8/2013 -1.47 -4.23 0.67
XYZ 1/9/2013 3.89 -0.43 0.81
XYZ 1/10/2013 5.63 17.95 0.86
XYZ 1/11/2013 3.29 -7.09 0.63
XYZ 1/12/2013 6.03 -9.03 0.29
XYZ 1/13/2013 -5.63 6.96 -0.19
XYZ 1/14/2013 1.70 13.59 -0.18
XYZ 1/15/2013 -1.19 -16.79 -0.29
XYZ 1/16/2013 4.76 4.91 -0.11
XYZ 1/17/2013 9.02 25.16 0.57
XYZ 1/18/2013 4.56 6.48 0.84
XYZ 1/19/2013 5.30 11.81 0.99
XYZ 1/20/2013 -0.60 3.38 0.84
UPDATE: I have tried all of your suggestions and have run into problems using the actual data. I have attached a subset of the data below:
https://www.dropbox.com/s/6k4xhwuinlu0p1f/TEST_SUBSET.csv?dl=0
I cannot get this to work. I've tried removing the NA's, renaming the rows, reading the data in differently, formatting the date differently. Nothing is working for me. Can you see if what you are running is working for this dataset? Thank you very much folks!
Apply ave to the row indexes of df to process by name and use rollapplyr to perform the rolling computations. Note that i is a vector of indexes:
library(zoo)
corx <- function(x) cor(x[, 1], x[, 2])
df$Correl <- ave(1:nrow(df), df$name, FUN = function(i)
  rollapplyr(df[i, c("x", "y")], 5, corx, by.column = FALSE, fill = NA))
Update: changed rollapply to rollapplyr to be consistent with the output shown in the question. If you want centred correlations, change it back to rollapply.
This is a little late to the party, but the below is a pretty compact solution with dplyr and rollapply from the zoo package.
library(dplyr)
library(zoo)
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
df <- df %>%
  group_by(name) %>%
  arrange(date) %>%
  do({
    correl <- rollapply(.[-(1:2)], width = 5, function(a) cor(a[,1], a[,2]),
                        by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })
which returns...
> df
Source: local data frame [40 x 5]
Groups: name
name date x y correl
1 ABC 2013-01-01 -0.61707785 -0.7299461 NA
2 ABC 2013-01-02 1.35353618 9.1314743 NA
3 ABC 2013-01-03 2.60815932 0.2511828 NA
4 ABC 2013-01-04 -2.89619789 -1.2586655 NA
5 ABC 2013-01-05 2.23750886 4.6616034 0.52013407
6 ABC 2013-01-06 -1.97573999 3.6800832 0.37575664
7 ABC 2013-01-07 1.70360813 2.2621718 0.32390612
8 ABC 2013-01-08 0.02017797 2.5088032 0.64020507
9 ABC 2013-01-09 0.96263256 1.6711756 -0.00557611
10 ABC 2013-01-10 -0.62400803 5.2011656 -0.66040650
.. ... ... ... ... ...
checking that the other group responds correctly...
> df %>%
+ filter(name=="XYZ")
Source: local data frame [20 x 5]
Groups: name
name date x y correl
1 XYZ 2013-01-01 3.4199729 5.0866361 NA
2 XYZ 2013-01-02 4.7326297 -5.4613465 NA
3 XYZ 2013-01-03 3.8983329 11.1635903 NA
4 XYZ 2013-01-04 1.5235936 3.9077184 NA
5 XYZ 2013-01-05 -5.4885373 7.8961020 -0.3755766
6 XYZ 2013-01-06 0.2311371 2.0157046 -0.3754510
7 XYZ 2013-01-07 2.6903306 -3.2940181 -0.1808097
8 XYZ 2013-01-08 -0.2584268 3.6047800 -0.8457930
9 XYZ 2013-01-09 -0.2897795 2.1029431 -0.9526992
10 XYZ 2013-01-10 5.9571558 18.5810947 0.7025559
11 XYZ 2013-01-11 -7.5250647 -8.0858699 0.7949917
12 XYZ 2013-01-12 2.8438336 -8.4072829 0.6563161
13 XYZ 2013-01-13 7.2295030 -0.1236801 0.5383666
14 XYZ 2013-01-14 -0.7579570 -0.2830291 0.5542751
15 XYZ 2013-01-15 4.3116507 -6.5291051 0.3894343
16 XYZ 2013-01-16 1.4334510 0.5957465 -0.1480032
17 XYZ 2013-01-17 -2.6444881 6.1261976 -0.6183805
18 XYZ 2013-01-18 0.8517223 0.5587499 -0.9243050
19 XYZ 2013-01-19 6.2140131 -3.0944259 -0.8939475
20 XYZ 2013-01-20 11.2871086 -0.1187153 -0.6845300
Hope this helps!
FOLLOW UP
I just ran the following on your actual data set:
library(dplyr)
library(zoo)
import <- read.csv("TEST_SUBSET.CSV", header=TRUE, stringsAsFactors = FALSE)
str(head(import))
import_df <- import %>%
  group_by(id) %>%
  arrange(asof_dt) %>%
  do({
    correl <- rollapply(.[-(1:2)], width = 5, function(a) cor(a[,1], a[,2]),
                        by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })
import_df
and received the following:
> import_df
Source: local data frame [15,365 x 5]
Groups: id
id asof_dt x y correl
1 DC1123 1/10/1990 -0.003773632 NA NA
2 DC1123 1/10/1991 0.014034992 NA NA
3 DC1123 1/10/1992 -0.004109765 NA NA
4 DC1123 1/10/1994 0.006369326 0.012176085 NA
5 DC1123 1/10/1995 0.014900600 0.001241080 NA
6 DC1123 1/10/1996 0.005763689 -0.013112491 NA
7 DC1123 1/10/1997 0.006949765 0.010737034 NA
8 DC1123 1/10/2000 0.044052805 0.003346296 0.02724175
9 DC1123 1/10/2001 0.009452785 0.017582638 0.01362101
10 DC1123 1/10/2002 -0.018876970 0.004346372 0.01343657
.. ... ... ... ... ...
so it seems to be working.
The cor function only returns a value once it has 5 input points, which doesn't happen until row 8.
Here is a solution using base R; note that it requires the data set to be sorted by name and date, in that order.
dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)
rollcorr <- function(df, lag = 4) {
  out <- numeric(nrow(df) - lag)
  for (i in seq_along(out)) {
    window <- i:(i + lag)
    out[i] <- cor(df$x[window], df$y[window])
  }
  out <- c(rep(NA, lag), out)
  return(out)
}
df$Correl <- do.call(c, by(df[, -1], df[, 1], rollcorr))

Dividing specific values between two arrays

I am a novice R user working with a data set of 40,000 rows and 300 columns. I have found a solution for what I would like to do, but my machine takes over an hour to run my code, and I feel an expert could help me with a quicker solution (I can do this in Excel in half the time). I will post my solution at the end.
What I would like to do is the following:
Compute the average value for each column NY1 to NYn based on the value of the YYYYMMbucket column.
Divide each original value by its YYYYMMbucket average.
Here is a sample of my original data set:
YYYYMMbucket NY1 NY2 NY3 NY4
1 200701.3 0.309 NA 20.719 16260
2 200701.3 0.265 NA 19.482 15138
3 200701.3 0.239 NA 19.168 14418
4 200701.3 0.225 NA 19.106 14046
5 200701.3 0.223 NA 19.211 14040
6 200701.3 0.234 NA 19.621 14718
7 200701.3 0.270 NA 20.522 15780
8 200701.3 0.298 NA 22.284 16662
9 200701.2 0.330 NA 23.420 16914
10 200701.2 0.354 NA 23.805 17310
11 200701.2 0.388 NA 24.095 17448
12 200701.2 0.367 NA 23.954 17640
13 200701.2 0.355 NA 23.255 17748
14 200701.2 0.346 NA 22.731 17544
15 200701.2 0.347 NA 22.445 17472
16 200701.2 0.366 NA 21.945 17634
17 200701.2 0.408 NA 22.683 18876
18 200701.2 0.478 NA 23.189 21498
19 200701.2 0.550 NA 23.785 22284
20 200701.2 0.601 NA 24.515 22368
This is what my averages look like:
YYYYMMbucket NY1M NY2M
1 200701.1 0.4424574 NA
2 200701.2 0.4530000 NA
3 200701.3 0.2936935 NA
4 200702.1 0.4624063 NA
5 200702.2 0.4785937 NA
6 200702.3 0.3091161 NA
7 200703.1 0.4159687 NA
8 200703.2 0.4491875 NA
9 200703.3 0.2840081 NA
10 200704.1 0.4279137 NA
How I would like my final output to look:
NY1avgs NY2avgs NY3avgs
1 1.052117 NA 0.7560868
2 0.9023011 NA 0.7109456
3 0.8137734 NA 0.699487
4 0.7661047 NA 0.6972245
5 0.7592949 NA 0.7010562
6 0.7967489 NA 0.7160181
7 0.9193256 NA 0.7488978
8 1.014663 NA 0.8131974
9 0.7284768 NA 0.857904
Here's how I did it:
First I used "plyr" to compute my averages, simple enough:
test <- ddply(prf.delete2b,. (YYYYMMbucket), summarise,
NY1M = mean(NY1), NY2M = mean(NY2) ... ...))
Then I used a series of the following:
x <- c(1:40893)
lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]
NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))
NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))
NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)
I also tried a variant of the above by saying:
NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))
lapply(x, NY1Fun)
Each variant of NYnFun takes a good 20 seconds to run, so doing this 300 times takes much too long. Can anyone recommend an alternative to what I posted, or point out any novice mistakes I've made?
Here is the customary data.table approach, which works pretty fast.
# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
bucket = sample(letters, N, replace = T),
NY1 = runif(N),
NY2 = runif(N),
NY3 = runif(N),
NY4 = runif(N)
)
# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans = mydt[,lapply(.SD, scale_x), by = 'bucket']
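One caveat: inside the by = 'bucket' grouping, ave(x) with no grouping argument is just the group mean repeated, and it has no NA handling. If your real columns contain NAs (like NY2), an NA-aware variant may be what you want (a sketch, assuming you want NAs ignored in the mean):
scale_x = function(x) x / mean(x, na.rm = TRUE)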
How about:
test2 <- merge(prf.delete2b, test, all.x = TRUE)
test2[2:ncol(prf.delete2b)] / test2[(ncol(prf.delete2b) + 1):ncol(test2)]
In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.
myFun <- function(x, groupVar) {
x / ave(x, groupVar, FUN=function(y) mean(y, na.rm=TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
lapply(prf.delete2b[-1], myFun, groupVar=prf.delete2b[1]))
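As a quick sanity check, myFun can be run on a hypothetical two-bucket slice built from the question's first rows (toy is illustrative, not the real data):
toy <- data.frame(YYYYMMbucket = c(200701.3, 200701.3, 200701.2),
                  NY1 = c(0.309, 0.265, 0.330))
myFun(toy$NY1, toy[1])   # each value divided by its bucket mean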
