PCA analysis with a triangular matrix in R

I was trying to use PCA to analyze my data, but it ends like this:
> head(MEGA)
# A tibble: 6 × 86
...1 A2S10A16T18 K3N10E14 Q3H6G8K14 G4L8D14 W2G16Q17C18 H15K16 E3V9D10W14
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A2S10A… NA NA NA NA NA NA NA
2 K3N10E… 0.462 NA NA NA NA NA NA
3 Q3H6G8… 0.727 0.357 NA NA NA NA NA
4 G4L8D14 0.583 0.357 0.357 NA NA NA NA
5 W2G16Q… 0.357 0.583 0.727 0.583 NA NA NA
6 H15K16 0.357 0.357 0.462 0.357 0.357 NA NA
> prcomp(MEGA)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Can you help me with this?
I am a freshman in bioinformatics, thank you so much.

Here are the issues you need to solve before you can compute your PCA (MDS):
There can't be any missing values, so you need to "mirror" your lower triangular matrix into the upper triangle.
The table must be a matrix with row names and column names, containing only numeric values.
Here's how I'd solve these problems in a reproducible example:
library(tidyverse)
#-- Making a fake reproducible example
set.seed(1)  # fix the RNG so the example is reproducible
dist_mat <- dist(matrix(rnorm(1000), ncol = 100), method = "euclidean") %>%
  as.matrix()
dist_mat[upper.tri(dist_mat)] <- NA
mega <- dist_mat %>%
  as.data.frame() %>%
  rownames_to_column(var = "...1") %>%
  as_tibble()
#-- Creating the matrix
mega_mat <- mega %>%
  as.data.frame() %>%
  column_to_rownames("...1") %>%
  as.matrix()
#-- Mirroring the lower triangle onto the upper triangle
# (indexing the transpose keeps row/column pairs aligned; assigning
# lower.tri() values directly to upper.tri() positions would scramble them)
mega_mat[upper.tri(mega_mat)] <- t(mega_mat)[upper.tri(mega_mat)]
#-- Computing PCA
prcomp(mega_mat)
#> Standard deviations (1, .., p=10):
#> [1] 5.758257e+00 5.621893e+00 5.289312e+00 5.089903e+00 4.739766e+00
#> [6] 4.494360e+00 4.458136e+00 4.317035e+00 3.989503e+00 8.301307e-16
#>
#> Rotation (n x k) = (10 x 10):
#> PC1 PC2 PC3 PC4 PC5 PC6
#> 1 0.35741259 -0.005146882 0.39400567 -0.47952260 0.02795010 0.14613179
#> 2 -0.12992630 -0.199989467 -0.02870089 -0.63737579 -0.08742113 -0.29440106
#> 3 0.02392526 -0.478265115 -0.10158414 0.44137508 0.16895264 -0.13301094
#> 4 0.53242350 -0.208730464 -0.34645287 0.14446233 -0.41603543 0.05018439
#> 5 -0.45456984 0.282095797 0.16492701 0.17112439 -0.67212158 0.17606230
#> 6 -0.30035787 -0.300319956 -0.20283830 -0.06251480 0.09505941 -0.23999350
#> 7 0.22910453 0.474792604 0.30265124 0.26661252 0.15541762 -0.65105608
#> 8 -0.46965024 -0.101554817 0.08307490 0.01429205 0.22627747 -0.10163731
#> 9 -0.03485274 0.491621247 -0.54080984 -0.08671785 0.41166874 0.33337416
#> 10 0.01430624 -0.201168669 0.50425001 0.19008080 0.29040466 0.48767389
#> PC7 PC8 PC9 PC10
#> 1 0.21547630 -0.11053044 0.54974532 0.3225596
#> 2 -0.38982043 -0.24769290 -0.41060997 0.2445915
#> 3 -0.24227704 -0.55651526 0.31106933 0.2327022
#> 4 0.25267026 0.15142687 -0.28808603 0.4320741
#> 5 -0.15625939 -0.10455639 0.15870825 0.3376700
#> 6 -0.12305942 0.69359210 0.36761203 0.2766879
#> 7 -0.09611608 0.07981677 -0.08463128 0.2976625
#> 8 0.73091422 -0.16044081 -0.22278636 0.3014421
#> 9 -0.13964194 -0.10677135 0.04031700 0.3794254
#> 10 -0.27664992 0.24140792 -0.36148755 0.2850979
Created on 2022-04-28 by the reprex package (v2.0.1)
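Since the table really holds pairwise distances, classical multidimensional scaling via cmdscale() is arguably the more natural route than prcomp(). A minimal sketch, assuming the mirrored mega_mat from above; the diagonal is reset to 0 in case it holds NAs, as in the original data:
#-- Alternative sketch: classical MDS on the distance matrix
mds_mat <- as.matrix(mega_mat)
diag(mds_mat) <- 0                            # dist diagonals are 0 by definition
mds_fit <- cmdscale(as.dist(mds_mat), k = 2)  # coordinates in 2 dimensions
head(mds_fit)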

Related

How does a dataset have to be structured to perform an ANOVA test in R?

I have three columns, one per group, with numeric values. I want to analyze them with an ANOVA test, but the examples I found have the different groups in one column and the corresponding values in a second column. I wonder whether it is necessary to reorder my data like that, or whether there is a method I can use with the columns I currently have. (A screenshot was attached here.)
Thanks!
You can convert a wide table with many columns into a long table with just two columns, a key (group) and a value (response), by pivoting the data:
library(tidyverse)
# create example data
set.seed(1337)
data <- tibble(
  VIH  = runif(100),
  VIH2 = runif(100),
  VIH3 = runif(100)
)
data
#> # A tibble: 100 × 3
#> VIH VIH2 VIH3
#> <dbl> <dbl> <dbl>
#> 1 0.576 0.485 0.583
#> 2 0.565 0.495 0.108
#> 3 0.0740 0.868 0.350
#> 4 0.454 0.833 0.324
#> 5 0.373 0.242 0.915
#> 6 0.331 0.0694 0.0790
#> 7 0.948 0.130 0.563
#> 8 0.281 0.122 0.287
#> 9 0.245 0.270 0.419
#> 10 0.146 0.488 0.838
#> # … with 90 more rows
data %>%
  pivot_longer(everything()) %>%
  aov(value ~ name, data = .)
#> Call:
#> aov(formula = value ~ name, data = .)
#>
#> Terms:
#> name Residuals
#> Sum of Squares 0.124558 25.171730
#> Deg. of Freedom 2 297
#>
#> Residual standard error: 0.2911242
#> Estimated effects may be unbalanced
Created on 2022-05-10 by the reprex package (v2.0.0)
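The default print method shown above omits the F statistic. As a small follow-up sketch, assuming the same pivoted data, calling summary() on the fitted aov object yields the familiar ANOVA table with the F value and p-value:
fit <- data %>%
  pivot_longer(everything()) %>%
  aov(value ~ name, data = .)
summary(fit)  # ANOVA table with F statistic and p-value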

How to index the features() function when iterating over a list of data frames with the map() function in R?

Plotting my soil compaction data gives a convex-up curve. I need to determine the maximum y-value and the x-value which produces that maximum.
The 'features' package fits a smooth spline to the data and returns the features of the spline, including the y-maximum and critical x-value. I am having difficulty iterating the features() function over multiple samples, which are contained in a tidy list.
It seems that the features package is having trouble indexing to the data. The code works fine when I use data for only one sample, but when I try to use the dot placeholder and square brackets it loses track of the data.
Below is the code showing how this process works correctly for one sample, but not for an iteration.
#load packages
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.6.3
#> Warning: package 'forcats' was built under R version 3.6.3
library(features)
#> Warning: package 'features' was built under R version 3.6.3
#> Loading required package: lokern
#> Warning: package 'lokern' was built under R version 3.6.3
# generate example data
df <- tibble(
  sample = rep(LETTERS[1:3], each = 4),
  w = c(seq(0.08, 0.12, by = 0.0125),
        seq(0.09, 0.13, by = 0.0125),
        seq(0.10, 0.14, by = 0.0125)),
  d = c(1.86, 1.88, 1.88, 1.87,
        1.90, 1.92, 1.92, 1.91,
        1.96, 1.98, 1.98, 1.97)
)
df
#> # A tibble: 12 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 A 0.08 1.86
#> 2 A 0.0925 1.88
#> 3 A 0.105 1.88
#> 4 A 0.118 1.87
#> 5 B 0.09 1.9
#> 6 B 0.102 1.92
#> 7 B 0.115 1.92
#> 8 B 0.128 1.91
#> 9 C 0.1 1.96
#> 10 C 0.112 1.98
#> 11 C 0.125 1.98
#> 12 C 0.138 1.97
# use the 'features' package to fit a smooth spline and extract the spline features,
# including local y-maximum and critical point along x-axis.
# This works fine for one sample at a time:
sample1_data <- df %>% filter(sample == 'A')
sample1_features <- features(x = sample1_data$w,
                             y = sample1_data$d,
                             smoother = "smooth.spline")
sample1_features
#> $f
#> fmean fmin fmax fsd noise
#> 1.880000e+00 1.860000e+00 1.880000e+00 1.000000e-02 0.000000e+00
#> snr d1min d1max fwiggle ncpts
#> 2.707108e+11 -9.100000e-01 1.970000e+00 9.349000e+01 1.000000e+00
#>
#> $cpts
#> [1] 0.1
#>
#> $curvature
#> [1] -121.03
#>
#> $outliers
#> [1] NA
#>
#> attr(,"fits")
#> attr(,"fits")$x
#> [1] 0.0800 0.0925 0.1050 0.1175
#>
#> attr(,"fits")$y
#> [1] 1.86 1.88 1.88 1.87
#>
#> attr(,"fits")$fn
#> [1] 1.86 1.88 1.88 1.87
#>
#> attr(,"fits")$d1
#> [1] 1.9732965 0.8533784 -0.5868100 -0.9061384
#>
#> attr(,"fits")$d2
#> [1] 4.588832e-03 -1.791915e+02 -5.123866e+01 1.461069e-01
#>
#> attr(,"class")
#> [1] "features"
# But when attempting to use the pipe and the map() function
# to iterate over a list containing data for multiple samples,
# using the typical map() placeholder dot will not index to the
# list element/columns that are being passed to .f
df_split <- split(df, f= df[['sample']])
df_split
#> $A
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 A 0.08 1.86
#> 2 A 0.0925 1.88
#> 3 A 0.105 1.88
#> 4 A 0.118 1.87
#>
#> $B
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 B 0.09 1.9
#> 2 B 0.102 1.92
#> 3 B 0.115 1.92
#> 4 B 0.128 1.91
#>
#> $C
#> # A tibble: 4 x 3
#> sample w d
#> <chr> <dbl> <dbl>
#> 1 C 0.1 1.96
#> 2 C 0.112 1.98
#> 3 C 0.125 1.98
#> 4 C 0.138 1.97
df_split %>% map(.f = features, x = .[['w']], y= .[['d']], smoother = "smooth.spline")
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> Error in seq.default(min(x), max(x), length = max(npts, length(x))): 'from' must be a finite number
Created on 2020-04-04 by the reprex package (v0.3.0)
You could use group_split() to split the data by sample, then use map() to apply the features() function to each subset of the data.
library(features)
library(dplyr)
library(purrr)
list_model <- df %>%
  group_split(sample) %>%
  map(~ features(x = .x$w, y = .x$d, smoother = "smooth.spline"))
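Each element of list_model is then a "features" object like the single-sample output shown above. A hedged sketch for pulling out, say, the critical x-values: group_split() returns an unnamed list, so the sample names are reattached here on the assumption that the groups come back in sorted key order:
list_model <- set_names(list_model, sort(unique(df$sample)))
map(list_model, "cpts")  # critical point(s) along x for each sample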

Using emmeans for lmer

I've been trying to calculate marginal means for my lmer and glmer models in R. I found the emmeans function and have been trying to understand it and apply it to my model. Getting the means for an interaction seems hard, so I'm starting with just additive predictors, but the function doesn't work the way it's presented in the examples (e.g. here: https://cran.r-project.org/web/packages/emmeans/vignettes/sophisticated.html)
emmeans(Oats.lmer, "nitro")
nitro emmean SE df lower.CL upper.CL
0.0 78.89207 7.294379 7.78 61.98930 95.79484
0.2 97.03425 7.136271 7.19 80.25029 113.81822
0.4 114.19816 7.136186 7.19 97.41454 130.98179
0.6 124.06857 7.070235 6.95 107.32795 140.80919
What I'm getting is:
emmeans(model2, "VariableA")
VariableA emmean SE df lower.CL upper.CL
0.4657459 2649.742 120.8955 19.07 2396.768 2902.715
There is only one line, and the variable is averaged instead of being split into 0 and 1 (which are the values in the dataset; maybe the problem is that it's categorical?).
The model I'm running is:
model2 <- lmer(rt ~ variableA + variableB + (1 | participant) + (1 | sequence/item),
               data = memoryData, REML = FALSE)
EDIT: The data file is quite big and I wasn't sure how to extract useful information from it, but here is the structure:
> str(memoryData)
'data.frame': 3168 obs. of 123 variables:
$ participant : int 10 10 10 10 10 10 10 10 10 10 ...
$ variableA : int 1 1 1 1 1 1 1 1 1 1 ...
$ variableB : int 1 1 1 1 1 1 1 1 1 1 ...
$ sequence: int 1 1 1 1 1 1 1 1 1 1 ...
$ item : int 25 26 27 28 29 30 31 32 33 34 ...
$ accuracy : int 1 1 1 1 1 1 0 1 1 1 ...
$ rt : num 1720 1628 1728 2247 1247 ...
Why is the function not working for me?
And as a further question, is there a way to get these means when I include interaction between variables A and B?
EDIT 2: OK, it did work when I changed the variable to a factor; I guess my method of doing it was incorrect. But I'm still not sure how to calculate the means when there is an interaction, because with this method R says "NOTE: Results may be misleading due to involvement in interactions".
To see marginal means of interactions, add all variables of the interaction term to emmeans(); use the at argument if you want the marginal means at specific levels of the interaction terms.
Here are some examples: first the average effect of the interaction, then marginal effects at different levels of the interaction term. The latter is more useful for visualization.
library(ggeffects)
library(lme4)
library(emmeans)
data("sleepstudy")
sleepstudy$inter <- sample(1:5, size = nrow(sleepstudy), replace = T)
m <- lmer(Reaction ~ Days * inter + (1 + Days | Subject), data = sleepstudy)
# average marginal effect of interaction
emmeans(m, c("Days", "inter"))
#> Days inter emmean SE df lower.CL upper.CL
#> 4.5 2.994444 298.3427 8.84715 16.98 279.6752 317.0101
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
# marginal effects at different levels of interactions -
# useful for plotting
ggpredict(m, c("Days [3,5,7]", "inter"))
#>
#> # Predicted values of Reaction
#> # x = Days
#>
#> # inter = 1
#> x predicted std.error conf.low conf.high
#> 3 279.349 8.108 263.458 295.240
#> 5 304.839 9.818 285.597 324.082
#> 7 330.330 12.358 306.109 354.551
#>
#> # inter = 2
#> x predicted std.error conf.low conf.high
#> 3 280.970 7.624 266.028 295.912
#> 5 304.216 9.492 285.613 322.819
#> 7 327.462 11.899 304.140 350.784
#>
#> # inter = 3
#> x predicted std.error conf.low conf.high
#> 3 282.591 7.446 267.997 297.185
#> 5 303.593 9.384 285.200 321.985
#> 7 324.594 11.751 301.562 347.626
#>
#> # inter = 4
#> x predicted std.error conf.low conf.high
#> 3 284.212 7.596 269.325 299.100
#> 5 302.969 9.502 284.345 321.594
#> 7 321.726 11.925 298.353 345.099
#>
#> # inter = 5
#> x predicted std.error conf.low conf.high
#> 3 285.834 8.055 270.046 301.621
#> 5 302.346 9.839 283.062 321.630
#> 7 318.858 12.408 294.540 343.177
#>
#> Adjusted for:
#> * Subject = 308
emmeans(m, c("Days", "inter"), at = list(Days = c(3, 5, 7), inter = 1:5))
#> Days inter emmean SE df lower.CL upper.CL
#> 3 1 279.3488 8.132335 23.60 262.5493 296.1483
#> 5 1 304.8394 9.824196 20.31 284.3662 325.3125
#> 7 1 330.3300 12.366296 20.69 304.5895 356.0704
#> 3 2 280.9700 7.630745 18.60 264.9754 296.9646
#> 5 2 304.2160 9.493225 17.77 284.2529 324.1791
#> 7 2 327.4621 11.901431 17.84 302.4420 352.4822
#> 3 3 282.5912 7.445982 16.96 266.8786 298.3038
#> 5 3 303.5927 9.383978 16.98 283.7927 323.3927
#> 7 3 324.5942 11.751239 16.98 299.7988 349.3896
#> 3 4 284.2124 7.601185 18.34 268.2639 300.1609
#> 5 4 302.9694 9.504102 17.85 282.9900 322.9487
#> 7 4 321.7263 11.927612 17.99 296.6666 346.7860
#> 3 5 285.8336 8.076779 23.02 269.1264 302.5409
#> 5 5 302.3460 9.845207 20.48 281.8399 322.8521
#> 7 5 318.8584 12.416642 21.02 293.0380 344.6788
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
And a plotting example:
ggpredict(m, c("Days", "inter [1,3,5]")) %>% plot()
You say that "changing the variable to factor doesn't help", but I would think it would (as documented in the emmeans FAQ):
md <- transform(memoryData,
                variableA = factor(variableA),
                variableB = factor(variableB))
model2 <- lmer(rt ~ variableA + variableB +
                 (1 | participant) + (1 | sequence/item), data = md, REML = FALSE)
emmeans(model2, ~ variableA)
emmeans(model2, ~ variableB)
emmeans(model2, ~ variableA + variableB)
If this really doesn't work, then we need a reproducible example ...
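For the interaction question in EDIT 2, here is a hedged sketch building on the factor-coded md above (model3 is a hypothetical name, not from the original post): fit the interaction model, ask emmeans() for the cell means, and use pairs() for pairwise contrasts of one factor within the other:
model3 <- lmer(rt ~ variableA * variableB +
                 (1 | participant) + (1 | sequence/item), data = md, REML = FALSE)
emmeans(model3, ~ variableA * variableB)         # one mean per A x B cell
pairs(emmeans(model3, ~ variableA | variableB))  # contrasts of A within each level of B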

Calculate predicted model results by iterating through variables

I have several models fit to predict an outcome y = x1 + x2 + ... + x22. That's a fair number of predictors and a fair number of models. My customers want to know the marginal impact of each x on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy-paste, which is slow and error-prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? A reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library (rms)
ltrans <- function(l1) {
  newvar <- exp(l1) / (exp(l1) + 1)
  return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = 1:1500, state = rep(mystates, 500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref, x2 = x2ref, x3 = x3ref) {
  ltrans(-2.1802256 - 0.01114239*x1 + 0.050319692*x2 - 0.00079289232*x3 +
           7.6508189e-10*pmax(x3 - 7.4686271, 0)^3 - 9.0897627e-10*pmax(x3 - 217.97865, 0)^3 +
           1.4389439e-10*pmax(x3 - 1337.2538, 0)^3)
}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
  group_by(varname) %>%
  mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
select(idno, varname, marginaleffect) %>%
mutate(varname = substr(varname, 1, 2)) %>%
all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
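A hedged sketch of that reconfiguration (dothemath2 is a hypothetical wrapper, not part of the original code): it routes value to whichever argument varname names, so the grouped mutate no longer needs invoke():
dothemath2 <- function(varname, value) {
  # build a one-element named list, e.g. list(x1 = value), and call dothemath
  do.call(dothemath, setNames(list(value), varname))
}
longdata <- longdata %>%
  group_by(varname) %>%
  mutate(marginaleffect = dothemath2(varname[1], value) / referent)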

R time series interpolation, and extrapolation of a specific value

I have daily values for 11 different yield curves, i.e. time series for 11 yield maturities (1yr, 2yr, 3yr, 4yr, 5yr, 7yr, 10yr, 15yr, 20yr, 25yr, 30yr) over the same period. Some of the yields on some days are missing (NAs), and I'd like to extrapolate their values from the values of the other yields on the same day. This should be done by a first linear interpolation of the available yields on a given day, and a subsequent extrapolation of the missing yields on the same day, using the maturity (1yr, 2yr, etc.) as the weight.
For example, I have the following data set, and I'd like to extrapolate the daily value of the 5yr yield based on an interpolation of all available yields on the same day:
Date 1 2 3 4 5 7 10 15 20 25 30
7/4/2007 9.642 9.278 8.899 NA NA 8.399 8.241 8.183 8.117 NA NA
7/5/2007 9.669 9.302 8.931 NA NA 8.44 8.287 8.231 8.118 NA NA
7/6/2007 9.698 9.331 8.961 NA NA 8.437 8.295 8.243 8.13 NA NA
7/9/2007 9.678 9.306 8.941 NA NA 8.409 8.269 8.214 8.092 NA NA
7/10/2007 9.65 9.283 8.915 NA NA 8.385 8.243 8.185 8.065 NA NA
7/11/2007 9.7 9.342 8.976 NA NA 8.445 8.306 8.249 8.138 NA NA
7/12/2007 9.703 9.348 8.975 NA NA 8.448 8.303 8.245 8.152 NA NA
7/13/2007 9.69 9.334 8.965 NA NA 8.439 8.294 8.24 8.145 NA NA
7/16/2007 9.683 9.325 8.964 NA NA 8.442 8.299 8.244 8.158 NA NA
7/17/2007 9.712 9.359 8.987 NA NA 8.481 8.33 8.277 8.192 NA NA
7/18/2007 9.746 9.394 9.018 NA NA 8.509 8.363 8.311 8.22 NA NA
...
Does anyone have suggestions on how to do it?
Thanks.
This is one way to build a linear model for each Date, based on the available info you have, and use it to predict/estimate the value at year 5.
Run the process step by step to see how it works, and check the estimations to make sure they make sense.
dt = read.table(text=
"Date 1 2 3 4 5 7 10 15 20 25 30
7/4/2007 9.642 9.278 8.899 NA NA 8.399 8.241 8.183 8.117 NA NA
7/5/2007 9.669 9.302 8.931 NA NA 8.44 8.287 8.231 8.118 NA NA
7/6/2007 9.698 9.331 8.961 NA NA 8.437 8.295 8.243 8.13 NA NA
7/9/2007 9.678 9.306 8.941 NA NA 8.409 8.269 8.214 8.092 NA NA
7/10/2007 9.65 9.283 8.915 NA NA 8.385 8.243 8.185 8.065 NA NA
7/11/2007 9.7 9.342 8.976 NA NA 8.445 8.306 8.249 8.138 NA NA
7/12/2007 9.703 9.348 8.975 NA NA 8.448 8.303 8.245 8.152 NA NA
7/13/2007 9.69 9.334 8.965 NA NA 8.439 8.294 8.24 8.145 NA NA
7/16/2007 9.683 9.325 8.964 NA NA 8.442 8.299 8.244 8.158 NA NA
7/17/2007 9.712 9.359 8.987 NA NA 8.481 8.33 8.277 8.192 NA NA
7/18/2007 9.746 9.394 9.018 NA NA 8.509 8.363 8.311 8.22 NA NA", header=T)
library(dplyr)
library(tidyr)
dt %>%
  gather(time, value, -Date) %>%                      # reshape dataset
  filter(!is.na(value)) %>%                           # ignore NA values
  mutate(time = as.numeric(gsub("X", "", time))) %>%  # get rid of the X created by importing data
  group_by(Date) %>%                                  # for each date
  do({model = lm(value ~ time, data = .)              # build a linear model
      data.frame(pred = predict(model, data.frame(time = 5)))})  # use model to predict at time = 5
# Source: local data frame [11 x 2]
# Groups: Date [11]
#
# Date pred
# (fctr) (dbl)
# 1 7/10/2007 8.920932
# 2 7/11/2007 8.979601
# 3 7/12/2007 8.981383
# 4 7/13/2007 8.970571
# 5 7/16/2007 8.968542
# 6 7/17/2007 8.999584
# 7 7/18/2007 9.032026
# 8 7/4/2007 8.917645
# 9 7/5/2007 8.950605
# 10 7/6/2007 8.970669
# 11 7/9/2007 8.946661
I'm not suggesting that the linear model is the best fit, as I didn't spend time checking that. But you can use a quadratic model instead of a linear one, which might give you a better estimate.
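A hedged sketch of that quadratic variant: only the model formula inside the do() block changes, e.g. using an orthogonal polynomial via poly():
dt %>%
  gather(time, value, -Date) %>%
  filter(!is.na(value)) %>%
  mutate(time = as.numeric(gsub("X", "", time))) %>%
  group_by(Date) %>%
  do({model = lm(value ~ poly(time, 2), data = .)   # quadratic fit
      data.frame(pred = predict(model, data.frame(time = 5)))})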
In case you want to check the model output and get info about the models you built and used for each Date you can do this:
library(dplyr)
library(tidyr)
library(broom)
dt %>%
  gather(time, value, -Date) %>%                      # reshape dataset
  filter(!is.na(value)) %>%                           # ignore NA values
  mutate(time = as.numeric(gsub("X", "", time))) %>%  # get rid of the X created by importing data
  group_by(Date) %>%                                  # for each date
  do({model = lm(value ~ time, data = .)              # build a linear model
      tidy(model)})                                   # check model output
# Source: local data frame [22 x 6]
# Groups: Date [11]
#
# Date term estimate std.error statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 7/10/2007 (Intercept) 9.29495818 0.19895389 46.719158 8.485928e-08
# 2 7/10/2007 time -0.07480530 0.01875160 -3.989275 1.043399e-02
# 3 7/11/2007 (Intercept) 9.34942937 0.19823019 47.164509 8.093526e-08
# 4 7/11/2007 time -0.07396561 0.01868339 -3.958897 1.075469e-02
# 5 7/12/2007 (Intercept) 9.35001022 0.20037595 46.662337 8.537618e-08
# 6 7/12/2007 time -0.07372537 0.01888563 -3.903781 1.136592e-02
# 7 7/13/2007 (Intercept) 9.33730855 0.19974786 46.745476 8.462114e-08
# 8 7/13/2007 time -0.07334758 0.01882643 -3.895989 1.145551e-02
# 9 7/16/2007 (Intercept) 9.33045446 0.19856561 46.989276 8.245272e-08
# 10 7/16/2007 time -0.07238243 0.01871501 -3.867615 1.178869e-02
# .. ... ... ... ... ... ...
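To judge overall fit per Date, in the spirit of the "check the estimations" advice above, broom::glance() returns one-row fit summaries (R squared, sigma, etc.) for each model. A minimal sketch under the same pipeline:
dt %>%
  gather(time, value, -Date) %>%
  filter(!is.na(value)) %>%
  mutate(time = as.numeric(gsub("X", "", time))) %>%
  group_by(Date) %>%
  do({model = lm(value ~ time, data = .)
      glance(model)})   # one row of fit statistics per Date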
