R time series interpolation and extrapolation of a specific value

I have daily values for 11 different yield curves, that is, time series for 11 yield maturities (1yr, 2yr, 3yr, 4yr, 5yr, 7yr, 10yr, 15yr, 20yr, 25yr, 30yr) over the same period. Some of the yields on some days are missing (NAs) and I'd like to extrapolate their values from the values of the other yields on the same day. This should be done by first linearly interpolating the available yields on a given day, and then extrapolating the missing yields on that day, using the maturity (1yr, 2yr, etc.) as the weight.
For example, given the following data set, I'd like to extrapolate the daily value for the 5yr yield based on an interpolation of all the yields available on the same day:
Date 1 2 3 4 5 7 10 15 20 25 30
7/4/2007 9.642 9.278 8.899 NA NA 8.399 8.241 8.183 8.117 NA NA
7/5/2007 9.669 9.302 8.931 NA NA 8.44 8.287 8.231 8.118 NA NA
7/6/2007 9.698 9.331 8.961 NA NA 8.437 8.295 8.243 8.13 NA NA
7/9/2007 9.678 9.306 8.941 NA NA 8.409 8.269 8.214 8.092 NA NA
7/10/2007 9.65 9.283 8.915 NA NA 8.385 8.243 8.185 8.065 NA NA
7/11/2007 9.7 9.342 8.976 NA NA 8.445 8.306 8.249 8.138 NA NA
7/12/2007 9.703 9.348 8.975 NA NA 8.448 8.303 8.245 8.152 NA NA
7/13/2007 9.69 9.334 8.965 NA NA 8.439 8.294 8.24 8.145 NA NA
7/16/2007 9.683 9.325 8.964 NA NA 8.442 8.299 8.244 8.158 NA NA
7/17/2007 9.712 9.359 8.987 NA NA 8.481 8.33 8.277 8.192 NA NA
7/18/2007 9.746 9.394 9.018 NA NA 8.509 8.363 8.311 8.22 NA NA
...
Does anyone have suggestions on how to do it?
Thanks.

This is one way to build a linear model for each Date based on the available info and use it to predict/estimate the value at maturity 5.
Run the process step by step to see how it works, and check the estimates to make sure they make sense.
dt = read.table(text=
"Date 1 2 3 4 5 7 10 15 20 25 30
7/4/2007 9.642 9.278 8.899 NA NA 8.399 8.241 8.183 8.117 NA NA
7/5/2007 9.669 9.302 8.931 NA NA 8.44 8.287 8.231 8.118 NA NA
7/6/2007 9.698 9.331 8.961 NA NA 8.437 8.295 8.243 8.13 NA NA
7/9/2007 9.678 9.306 8.941 NA NA 8.409 8.269 8.214 8.092 NA NA
7/10/2007 9.65 9.283 8.915 NA NA 8.385 8.243 8.185 8.065 NA NA
7/11/2007 9.7 9.342 8.976 NA NA 8.445 8.306 8.249 8.138 NA NA
7/12/2007 9.703 9.348 8.975 NA NA 8.448 8.303 8.245 8.152 NA NA
7/13/2007 9.69 9.334 8.965 NA NA 8.439 8.294 8.24 8.145 NA NA
7/16/2007 9.683 9.325 8.964 NA NA 8.442 8.299 8.244 8.158 NA NA
7/17/2007 9.712 9.359 8.987 NA NA 8.481 8.33 8.277 8.192 NA NA
7/18/2007 9.746 9.394 9.018 NA NA 8.509 8.363 8.311 8.22 NA NA", header=T)
library(dplyr)
library(tidyr)
dt %>%
gather(time, value, -Date) %>% # reshape dataset
filter(!is.na(value)) %>% # ignore NA values
mutate(time = as.numeric(gsub("X","",time))) %>% # get rid of the X created by importing data
group_by(Date) %>% # for each date
do({model = lm(value~time, data=.) # build a linear model
data.frame(pred = predict(model, data.frame(time=5)))}) # use model to predict at time = 5
# Source: local data frame [11 x 2]
# Groups: Date [11]
#
# Date pred
# (fctr) (dbl)
# 1 7/10/2007 8.920932
# 2 7/11/2007 8.979601
# 3 7/12/2007 8.981383
# 4 7/13/2007 8.970571
# 5 7/16/2007 8.968542
# 6 7/17/2007 8.999584
# 7 7/18/2007 9.032026
# 8 7/4/2007 8.917645
# 9 7/5/2007 8.950605
# 10 7/6/2007 8.970669
# 11 7/9/2007 8.946661
I'm not suggesting that the linear model is the best fit, as I didn't spend time checking that. You could use a quadratic model instead of a linear one, which might give you a better estimate.
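A minimal sketch of the quadratic version, reusing dt and the packages loaded above (poly(time, 2) adds the squared term):
dt %>%
gather(time, value, -Date) %>% # reshape dataset
filter(!is.na(value)) %>% # ignore NA values
mutate(time = as.numeric(gsub("X","",time))) %>% # drop the X created on import
group_by(Date) %>% # for each date
do({model = lm(value ~ poly(time, 2), data=.) # quadratic fit in maturity
data.frame(pred = predict(model, data.frame(time=5)))}) # predict at time = 5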
In case you want to check the model output and get info about the models you built and used for each Date you can do this:
library(dplyr)
library(tidyr)
library(broom)
dt %>%
gather(time, value, -Date) %>% # reshape dataset
filter(!is.na(value)) %>% # ignore NA values
mutate(time = as.numeric(gsub("X","",time))) %>% # get rid of the X created by importing data
group_by(Date) %>% # for each date
do({model = lm(value~time, data=.) # build a linear model
tidy(model)}) # check model output
# Source: local data frame [22 x 6]
# Groups: Date [11]
#
# Date term estimate std.error statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 7/10/2007 (Intercept) 9.29495818 0.19895389 46.719158 8.485928e-08
# 2 7/10/2007 time -0.07480530 0.01875160 -3.989275 1.043399e-02
# 3 7/11/2007 (Intercept) 9.34942937 0.19823019 47.164509 8.093526e-08
# 4 7/11/2007 time -0.07396561 0.01868339 -3.958897 1.075469e-02
# 5 7/12/2007 (Intercept) 9.35001022 0.20037595 46.662337 8.537618e-08
# 6 7/12/2007 time -0.07372537 0.01888563 -3.903781 1.136592e-02
# 7 7/13/2007 (Intercept) 9.33730855 0.19974786 46.745476 8.462114e-08
# 8 7/13/2007 time -0.07334758 0.01882643 -3.895989 1.145551e-02
# 9 7/16/2007 (Intercept) 9.33045446 0.19856561 46.989276 8.245272e-08
# 10 7/16/2007 time -0.07238243 0.01871501 -3.867615 1.178869e-02
# .. ... ... ... ... ... ...

Related

unable to generate matrix in desired format

I need to make a function that creates a matrix across two groups and provides group totals at the row and column level, along with the matrix results.
Inputs for the function -> df,
credit breaks,
Rate_Cutpoints 1..6.
The output needs to be the final table provided below.
My data frame:
credit <- c(10,20,30,40,10,30,50,70,90,100,25,45,67,87,98,54,34,56,78,23,45,56,12)
rate <- c(1,2,3,4,1,3,5,7,9,10,2,4,6,8,9,5,3,5,7,2,4,5,1)
Marks <- c(9,3,5,6,7,8,9,1,3,10,4,5,6,7,5,4,8,3,5,6,7,8,9)
Points <- c(1,2,3,4,5,6,7,8,9,10,2,3,4,4,5,7,8,3,4,5,6,7,8)
Scale <- c(1,2,3,4,5,6,7,8,9,10,2,3,4,4,5,7,8,3,4,5,6,7,8)
Category <- c('book', 'pen', 'textbook', 'pencil_case','book', 'pen', 'textbook', 'pencil_case','book', 'pen', 'textbook', 'pencil_case','book','pen' ,'pen', 'textbook', 'pencil_case','book', 'pen', 'textbook', 'pencil_case','book', 'pencil_case')
# Join the variables to create a data frame
df <- data.frame(credit,rate,Marks,Points,Scale,Category)
My inputs:
credit_breaks<-c(0,15,30,45,65,75,1000)
Rate_Cutpoints1<-c(0,1,2,5,7,9,10)
Rate_Cutpoints2<-c(0,3,4,7,8,9,10)
Rate_Cutpoints3<-c(0,1,5,6,8,9,10)
Rate_Cutpoints4<-c(0,1,3,6,7,9,10)
Rate_Cutpoints5<-c(0,2,3,4,8,9,10)
Rate_Cutpoints6<-c(0,3,4,5,7,9,10)
My code basically first makes a credit band column from the credit breaks provided as input, then uses that to make another column of rate bands based on the rate breaks,
calculates a few metrics, and then reports them:
df1<-df %>% mutate(Credit_Band = cut(credit,include.lowest = TRUE,right=TRUE,
breaks =credit_breaks ,labels = FALSE))
df2<-df1 %>%
group_by(credit) %>%
mutate(New_Band =
(ifelse(Credit_Band==1, (cut(rate, Rate_Cutpoints1 ,labels = FALSE)),
ifelse(Credit_Band==2, (cut(rate, Rate_Cutpoints2 ,labels = FALSE)),
ifelse(Credit_Band==3, (cut(rate, Rate_Cutpoints3 ,labels = FALSE)),
ifelse(Credit_Band==4, (cut(rate, Rate_Cutpoints4 ,labels = FALSE)),
ifelse(Credit_Band==5, (cut(rate, Rate_Cutpoints5,labels = FALSE)),
ifelse(Credit_Band==6, (cut(rate, Rate_Cutpoints6,labels = FALSE)),
NA))))))))
df2<-as.data.frame(df2)
summary_results<-df2%>%
group_by(Credit_Band,New_Band)%>%
dplyr::summarize(dist = n()/nrow(df2),
count =n(),
avg_marks= sum(Marks, na.rm=TRUE),
sum_points = sum(Points,na.rm = TRUE),
sum_scale = sum(Scale,na.rm = TRUE))
summary_results$final<-summary_results$avg_marks/summary_results$sum_points
results<-reshape2::dcast(data = summary_results,formula = Credit_Band~New_Band,
value.var = "final")
My resulting output is a cross tab across the credit and rate bands.
The code below then calculates the totals across the credit and rate bands:
total_rows_value=df2%>% group_by(New_Band)%>%
dplyr::summarize(sum_points = sum(Points ,na.rm = TRUE),
avg_marks= sum(Marks, na.rm=TRUE),
)
total_rows_value$final<-total_rows_value$avg_marks/total_rows_value$sum_points
total_cols_vals=df2%>% group_by(Credit_Band)%>%
dplyr::summarize(sum_points = sum(Points ,na.rm = TRUE),
avg_marks= sum(Marks, na.rm=TRUE),
)
total_cols_vals$final<-total_cols_vals$avg_marks/total_cols_vals$sum_points
Now the outputs above need to be combined to generate the matrix below as the final desired output:
Credit_Band 1 2 3 4 5 6 TotalCols
1 1.78 NA NA NA NA NA 1.79
2 1.44 NA NA NA NA NA 1.44
3 NA 1.23 NA NA NA NA 1.24
4 NA NA 1 NA NA NA 1
5 NA NA NA 0.58 NA NA 0.58
6 NA NA NA 1.25 0.83 1 0.93
Total_R 1.59 1.24 1 0.75 0.83 1
One way to get there is to append the per-rate-band totals as an extra row and the per-credit-band totals as an extra column onto the cross-tab body:
(results_body <- results[,-1])
(results_rownames <- results[,1])
(fin <- cbind(
rbind(results_body,total_rows_value$final),
totcol = c(total_cols_vals$final,NA)))
rownames(fin) <-c(results_rownames,"Total_R")
> round(fin,2)
1 2 3 4 5 6 totcol
1 1.79 NA NA NA NA NA 1.79
2 1.44 NA NA NA NA NA 1.44
3 NA 1.24 NA NA NA NA 1.24
4 NA NA 1 NA NA NA 1.00
5 NA NA NA 0.58 NA NA 0.58
6 NA NA NA 1.25 0.83 1 0.94
Total_R 1.59 1.24 1 0.75 0.83 1 NA

PCA analysis with triangular matrix

I was trying to use PCA to analyze my data, but it ends like this:
> head(MEGA)
# A tibble: 6 × 86
...1 A2S10A16T18 K3N10E14 Q3H6G8K14 G4L8D14 W2G16Q17C18 H15K16 E3V9D10W14
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A2S10A… NA NA NA NA NA NA NA
2 K3N10E… 0.462 NA NA NA NA NA NA
3 Q3H6G8… 0.727 0.357 NA NA NA NA NA
4 G4L8D14 0.583 0.357 0.357 NA NA NA NA
5 W2G16Q… 0.357 0.583 0.727 0.583 NA NA NA
6 H15K16 0.357 0.357 0.462 0.357 0.357 NA NA
> prcomp(MEGA)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Can you help me with this?
I am a freshman in bioinformatics, thank you so much.
Here are the issues you need to solve to be able to compute your PCA (MDS):
There can't be any missing values, so you need to "mirror" your lower triangular matrix into the upper triangle.
The table must be a matrix, with rownames and colnames, and only numeric columns.
Here's how I'd solve these problems in a reproducible example:
library(tidyverse)
#-- Making a fake reproducible example
dist_mat <- dist(matrix(rnorm(1000), ncol = 100), method = "euclidean") %>%
as.matrix()
dist_mat[upper.tri(dist_mat)] <- NA
mega <- dist_mat %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var = "...1") %>%
tibble()
#-- Creating the matrix
mega_mat <- mega %>%
as.data.frame() %>%
column_to_rownames("...1")
#-- Mirroring lower on upper triangular
mega_mat[upper.tri(mega_mat)] <- t(mega_mat)[upper.tri(mega_mat)]
#-- Computing PCA
prcomp(mega_mat)
#> Standard deviations (1, .., p=10):
#> [1] 5.758257e+00 5.621893e+00 5.289312e+00 5.089903e+00 4.739766e+00
#> [6] 4.494360e+00 4.458136e+00 4.317035e+00 3.989503e+00 8.301307e-16
#>
#> Rotation (n x k) = (10 x 10):
#> PC1 PC2 PC3 PC4 PC5 PC6
#> 1 0.35741259 -0.005146882 0.39400567 -0.47952260 0.02795010 0.14613179
#> 2 -0.12992630 -0.199989467 -0.02870089 -0.63737579 -0.08742113 -0.29440106
#> 3 0.02392526 -0.478265115 -0.10158414 0.44137508 0.16895264 -0.13301094
#> 4 0.53242350 -0.208730464 -0.34645287 0.14446233 -0.41603543 0.05018439
#> 5 -0.45456984 0.282095797 0.16492701 0.17112439 -0.67212158 0.17606230
#> 6 -0.30035787 -0.300319956 -0.20283830 -0.06251480 0.09505941 -0.23999350
#> 7 0.22910453 0.474792604 0.30265124 0.26661252 0.15541762 -0.65105608
#> 8 -0.46965024 -0.101554817 0.08307490 0.01429205 0.22627747 -0.10163731
#> 9 -0.03485274 0.491621247 -0.54080984 -0.08671785 0.41166874 0.33337416
#> 10 0.01430624 -0.201168669 0.50425001 0.19008080 0.29040466 0.48767389
#> PC7 PC8 PC9 PC10
#> 1 0.21547630 -0.11053044 0.54974532 0.3225596
#> 2 -0.38982043 -0.24769290 -0.41060997 0.2445915
#> 3 -0.24227704 -0.55651526 0.31106933 0.2327022
#> 4 0.25267026 0.15142687 -0.28808603 0.4320741
#> 5 -0.15625939 -0.10455639 0.15870825 0.3376700
#> 6 -0.12305942 0.69359210 0.36761203 0.2766879
#> 7 -0.09611608 0.07981677 -0.08463128 0.2976625
#> 8 0.73091422 -0.16044081 -0.22278636 0.3014421
#> 9 -0.13964194 -0.10677135 0.04031700 0.3794254
#> 10 -0.27664992 0.24140792 -0.36148755 0.2850979
Created on 2022-04-28 by the reprex package (v2.0.1)

Issue with a multiple regression model in R

First let me apologize, but I'm a biologist starting out in the world of bioinformatics and therefore in R programming and statistics.
I have to do an analysis with a multiple linear regression model on the data (Penta) from library(mvdalab).
I have to try different models, including the PLS model, which is the model normally used for this data set (https://rdrr.io/cran/mvdalab/f/README.md).
However, they ask us to fit more models to the data and I'm very lost, as the data always seems to give me errors:
1) Normal multiple regression model:
> mod2<-mod1<-lm(Penta1$log.RAI~.,Penta1)
> summary(mod2)
Call:
lm(formula = Penta1$log.RAI ~ ., data = Penta1)
Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.000e-01 NA NA NA
Obs.NameAAWAA 8.500e-01 NA NA NA
Obs.NameAAYAA 5.600e-01 NA NA NA
Obs.NameEKWAP 1.400e+00 NA NA NA
Obs.NameFEAAK 4.000e-01 NA NA NA
Obs.NameFSPFR 7.400e-01 NA NA NA
Obs.NameGEAAK -4.200e-01 NA NA NA
Obs.NameLEAAK 5.000e-01 NA NA NA
Obs.NamePGFSP 1.000e+00 NA NA NA
Obs.NameRKWAP 2.080e+00 NA NA NA
Obs.NameRYLPT 5.000e-01 NA NA NA
Obs.NameVAAAK 1.114e-15 NA NA NA
Obs.NameVAAWK 3.300e-01 NA NA NA
Obs.NameVAWAA 1.530e+00 NA NA NA
Obs.NameVAWAK 1.550e+00 NA NA NA
Obs.NameVEAAK 6.100e-01 NA NA NA
Obs.NameVEAAP 2.800e-01 NA NA NA
Obs.NameVEASK 3.000e-01 NA NA NA
Obs.NameVEFAK 1.670e+00 NA NA NA
Obs.NameVEGGK -9.000e-01 NA NA NA
Obs.NameVEHAK 1.630e+00 NA NA NA
Obs.NameVELAK 6.900e-01 NA NA NA
Obs.NameVESAK 3.800e-01 NA NA NA
Obs.NameVESSK 1.000e-01 NA NA NA
Obs.NameVEWAK 2.830e+00 NA NA NA
Obs.NameVEWVK 1.810e+00 NA NA NA
Obs.NameVKAAK 2.100e-01 NA NA NA
Obs.NameVKWAA 1.810e+00 NA NA NA
Obs.NameVKWAP 2.450e+00 NA NA NA
Obs.NameVWAAK 1.400e-01 NA NA NA
S1 NA NA NA NA
L1 NA NA NA NA
P1 NA NA NA NA
S2 NA NA NA NA
L2 NA NA NA NA
P2 NA NA NA NA
S3 NA NA NA NA
L3 NA NA NA NA
P3 NA NA NA NA
S4 NA NA NA NA
L4 NA NA NA NA
P4 NA NA NA NA
S5 NA NA NA NA
L5 NA NA NA NA
P5 NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 29 and 0 DF, p-value: NA
2) Study the reduced model provided by the stepwise method. The aim is to compare the RMSE of the reduced model and the complete model for the training group and for the test group.
step(lm(log.RAI~.,data = penta),direction = "backward")
Error in step(lm(log.RAI ~ ., data = penta), direction = "backward") :
AIC is -infinity for this model, so 'step' cannot proceed
3) Find the best model by the AIC criterion and by the adjusted R².
4) PLS model --> the one that fits the data, following: https://rdrr.io/cran/mvdalab/f/README.md
5) Also study it with the ridge regression method, using the lm.ridge() function or similar.
6) Finally, we will study the LASSO method with the lars() function from the lars package.
I'm super lost about why the data frame gives those errors and also about how to develop the analysis. Any help with any of the parts would be much appreciated.
Kind regards
OK, after reading the vignette: Penta is data from drug discovery and the first column is a unique identifier. To do regression or downstream analysis you need to exclude this column. For the steps below, I simply use Penta[,-1] as the input data.
For the first part, this works:
library(mvdalab)
data(Penta)
summary(lm(log.RAI~.,data = Penta[,-1]))
Call:
lm(formula = log.RAI ~ ., data = Penta[, -1])
Residuals:
Min 1Q Median 3Q Max
-0.39269 -0.12958 -0.05101 0.07261 0.63414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.80263 0.92495 -0.868 0.40016
S1 -0.09783 0.03895 -2.512 0.02489 *
L1 0.03236 0.04973 0.651 0.52576
P1 -0.10795 0.08521 -1.267 0.22587
S2 0.08670 0.04428 1.958 0.07043 .
Second part for AIC is ok as well:
step(lm(log.RAI~.,data = Penta[,-1]),direction="backward")
Start: AIC=-57.16
log.RAI ~ S1 + L1 + P1 + S2 + L2 + P2 + S3 + L3 + P3 + S4 + L4 +
P4 + S5 + L5 + P5
Df Sum of Sq RSS AIC
- P3 1 0.00150 1.5374 -59.132
- L4 1 0.00420 1.5401 -59.080
If you want to select a model by AIC, the step() call above works. For adjusted R², there are packages that do this; a sketch with leaps is below.
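For instance, assuming the leaps package is installed, regsubsets() runs a best-subset search and reports adjusted R² for each model size:
library(leaps)
subsets <- regsubsets(log.RAI ~ ., data = Penta[, -1], nvmax = 15) # exhaustive best-subset search
subset_summary <- summary(subsets)
which.max(subset_summary$adjr2) # model size with the highest adjusted R^2
coef(subsets, which.max(subset_summary$adjr2)) # coefficients of that model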
For lm.ridge, do the same:
library(MASS)
fit=lm.ridge(log.RAI~.,data = Penta[,-1])
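With the default lambda = 0 that call is essentially the ordinary least-squares fit; a more typical sketch fits a grid of penalties (the grid below is an arbitrary choice) and picks lambda, e.g. by GCV:
fit_grid <- lm.ridge(log.RAI ~ ., data = Penta[, -1], lambda = seq(0, 10, 0.1)) # ridge over a grid of penalties
MASS::select(fit_grid) # prints the HKB, L-W and smallest-GCV choices of lambda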
For lasso with lars, you need to have the predictors and the response in a matrix, so let's do:
library(lars)
data = as.matrix(Penta[,-1])
fit = lars(x=data[,-ncol(data)],y=data[,"log.RAI"],type="lasso")
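From there you can inspect the whole lasso path and extract coefficients at a chosen point on it; the fraction 0.5 below is only an illustrative choice:
plot(fit) # coefficient paths along the lasso sequence
coef(fit, s = 0.5, mode = "fraction") # coefficients half-way along the L1 path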

How to group by and fill NA with closest not NA in R dataframe column with condition on another column

I have a data frame of blood test marker results and I want to fill in the NAs by the following criteria:
For each ID group (TIME is in ascending order), if a marker value is NA, fill it with the closest non-NA value in that group (past or future), but only if the time difference is less than 14 days.
This is an example of my data:
df<-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ID is the patient.
The TIME is the time of the blood test.
The others are the markers.
The only way I could do it is with loops which I try to avoid as much as possible.
I expect the output to be:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
CA.19.9 and CA.125 are filled with the previous value (10 days before).
NSE is filled with the previous value (11 days).
CA.72.4 is not filled, since its only value (1.32, at TIME -48) is 49 days from the next measurement.
I bet there is a much simpler, vectorized solution but the following works.
# Split by ID; for each marker column, fill an NA from a neighbouring value
# when the time gap to the previous observation is under 14 days.
fill_NA <- function(DF){
  sp <- split(DF, DF$ID)
  sp <- lapply(sp, function(DF){
    d <- diff(DF$TIME)
    i_diff <- c(FALSE, d < 14)
    res <- sapply(DF[-(1:2)], function(X){
      inx <- i_diff & is.na(X)
      if(any(inx)){
        inx <- which(inx)
        last_change <- -1
        for(i in inx){
          if(i > last_change + 1){
            if(i == 1){
              X[i] <- X[i + 1]
            }else{
              X[i] <- X[i - 1]
            }
            last_change <- i
          }
        }
      }
      X
    })
    cbind(DF[1:2], res)
  })
  res <- do.call(rbind, sp)
  row.names(res) <- NULL
  res
}
fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
Yes, you can have a vectorized solution. First, let us consider the case in which you only impute using the future value. You need to create a few auxiliary variables:
a variable that tells you whether the next observation belongs to the same ID (so it can be used to impute),
a variable that tells you whether the next observation is less than 14 days apart from the current one.
These do not depend on the specific variable you want to impute. For each variable to be imputed you will also need a variable that tells you whether the next value is missing.
Then you can vectorize the following logic: when the next observation has the same ID, is less than 14 days from the current one, and is not missing, copy its value into the current one.
Things get more complicated when you need to decide whether to use the past or the future value, but the logic is the same. The code is below; it is a bit long, but you can simplify it. I just wanted to be clear about what it does.
Hope this helps
x <-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
### these are the columns we want to input
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
stop("\nlast observation usef is not missing\n")
### now we get into column specific operations.
for (col in cols.to.impute){
### we will store the results in x$imputed, and copy into c[,col] at the end
x$imputed <- x[,col]
### x$usef needs to be modified depending on the specific column, so we define a local version of it
x$usef.local <- x$usef
### if a variable is not missing no point in looking at usef.local, so we make it missing
x$usef.local[!is.na(x[,col])] <- NA
### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
### make it 0. but a value of 0 does not mean we can use the previous observation because that may
### be missing too. so first we make usef 0 and next we check the previous observation and if that
### is missing too we make usef missing
x$previous.value <- c(NA,x[1:(nrow(x)-1),col])
x$next.value <- c(x[2:nrow(x),col],NA)
x$next.missing <- is.na(x$next.value)
x$previous.missing <- is.na(x$previous.value)
x$usef.local[x$next.missing & x$usef.local == 1] <- 0
x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 1] <- TRUE
x$imputed[tmp] <- x$next.value[tmp]
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 0] <- TRUE
x$imputed[tmp] <- x$previous.value[tmp]
### copy to column
x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA

Dividing specific values between two arrays

I am a novice R user trying to work with a data set of 40,000 rows and 300 columns. I have found a solution for what I would like to do; however, my machine takes over an hour to run my code, and I feel an expert could help me with a quicker solution (I can do this in Excel in half the time). I will post my solution at the end.
What I would like to do is the following:
Compute the average value for each column NY1 to NYn based on the value of the YYYYMMbucket column.
Divide each original value by its YYYYMMbucket average value.
Here is sample of my original data set:
YYYYMMbucket NY1 NY2 NY3 NY4
1 200701.3 0.309 NA 20.719 16260
2 200701.3 0.265 NA 19.482 15138
3 200701.3 0.239 NA 19.168 14418
4 200701.3 0.225 NA 19.106 14046
5 200701.3 0.223 NA 19.211 14040
6 200701.3 0.234 NA 19.621 14718
7 200701.3 0.270 NA 20.522 15780
8 200701.3 0.298 NA 22.284 16662
9 200701.2 0.330 NA 23.420 16914
10 200701.2 0.354 NA 23.805 17310
11 200701.2 0.388 NA 24.095 17448
12 200701.2 0.367 NA 23.954 17640
13 200701.2 0.355 NA 23.255 17748
14 200701.2 0.346 NA 22.731 17544
15 200701.2 0.347 NA 22.445 17472
16 200701.2 0.366 NA 21.945 17634
17 200701.2 0.408 NA 22.683 18876
18 200701.2 0.478 NA 23.189 21498
19 200701.2 0.550 NA 23.785 22284
20 200701.2 0.601 NA 24.515 22368
This is what my averages look like:
YYYYMMbucket NY1M NY2M
1 200701.1 0.4424574 NA
2 200701.2 0.4530000 NA
3 200701.3 0.2936935 NA
4 200702.1 0.4624063 NA
5 200702.2 0.4785937 NA
6 200702.3 0.3091161 NA
7 200703.1 0.4159687 NA
8 200703.2 0.4491875 NA
9 200703.3 0.2840081 NA
10 200704.1 0.4279137 NA
How I would like my final output to look:
NY1avgs NY2avgs NY3avgs
1 1.052117 NA 0.7560868
2 0.9023011 NA 0.7109456
3 0.8137734 NA 0.699487
4 0.7661047 NA 0.6972245
5 0.7592949 NA 0.7010562
6 0.7967489 NA 0.7160181
7 0.9193256 NA 0.7488978
8 1.014663 NA 0.8131974
9 0.7284768 NA 0.857904
Here's how I did it:
First I used "plyr" to compute my averages, simple enough:
test <- ddply(prf.delete2b, .(YYYYMMbucket), summarise,
NY1M = mean(NY1), NY2M = mean(NY2) ... ...))
Then used a series of the following:
x <- c(1:40893)
lookv <- function(x,ltab,rcol=2) ltab[max(which(ltab[,1]<=x)),rcol]
NY1Fun <- function(x) (prf.delete2b$NY1[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,2))
NY2Fun <- function(x) (prf.delete2b$NY2[x] / lookv((prf.delete2b$YYYYMMbucket[x]),test,3))
NY1Avgs <- lapply(x, NY1Fun)
NY2Avgs <- lapply(x, NY2Fun)
I also tried a variant of the above by saying:
NY1Fun <- function(x) (prf.delete2b$NY1[x] / subset(test, YYYYMMbucket == prf.delete2b$YYYYMMbucket[x], select =c(NY1M)))
lapply(x, NY1Fun)
Each variant of NYnFun takes a good 20 seconds to run so doing this 300 times takes much too long. Can anyone recommend any alternative to what I posted or point out any novice mistakes I've made?
Here is the customary data.table approach, which works pretty fast.
# CREATE DUMMY DATA
N = 1000
mydf = data.frame(
bucket = sample(letters, N, replace = T),
NY1 = runif(N),
NY2 = runif(N),
NY3 = runif(N),
NY4 = runif(N)
)
# SCALE COLUMNS BY AVG
library(data.table)
scale_x = function(x) x/ave(x)
mydt = data.table(mydf)
ans = mydt[,lapply(.SD, scale_x), by = 'bucket']
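As a quick sanity check on the result above, the scaled columns should average to roughly one within each bucket:
ans[, lapply(.SD, mean), by = 'bucket'] # every scaled column should average ~1 per bucket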
How about:
test2 <- merge(prf.delete2b, test, all.x=TRUE)
test2[2:ncol(prf.delete2b)]/test2[(ncol(prf.delete2b)+1):ncol(test2)]
After the merge, the first block of columns holds the original NY values and the second block holds the matching bucket means, so dividing them column-wise gives the desired ratios.
In this case, I would use ave instead of ddply because ave returns a vector the same length as its input. ave only accepts a vector, so you need to use lapply to loop over the columns of your data.frame.
myFun <- function(x, groupVar) {
x / ave(x, groupVar, FUN=function(y) mean(y, na.rm=TRUE))
}
relToMeans <- data.frame(prf.delete2b[1],
lapply(prf.delete2b[-1], myFun, groupVar=prf.delete2b[1]))
