Compare PCs to data with lsfit() - r

I have a data frame with 2000 observations (rows) and 600 variables (columns). See reproducible example:
# build 600 columns of 2000 shuffled values each
# (named lst to avoid masking the built-in list() function)
lst <- list()
for (i in 1:600) {
  lst[[i]] <- sample(seq(0, 0.6, length.out = 2000))
}
df <- as.data.frame(do.call(cbind, lst))
I want to perform PCA on the variables and then use lsfit() to compare the fit between the principal components and the data (as well as some other data, which is left out here). My first issue is that when I perform PCA on the data set as it is, my principal components have length 2000. I would expect them to have length 600. However, this is resolved by transposing the data frame.
pc_model <- prcomp(df, center = FALSE, rank. = 3)
pcs <- pc_model$x                     # wrong length (2000 rows), why?
df_trans <- as.data.frame(t(df))
pc_model2 <- prcomp(df_trans, center = FALSE, rank. = 3)
pcs2 <- pc_model2$x                   # correct length (600 rows), why?
My next issue is that when I try to use lsfit() to compare my 2000 observations to the principal components, I get all sorts of complaints:
fit <- lsfit(df_trans, pcs2) # Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables
fit2 <- lsfit(df, pcs2) # Error in complete.cases(x, y, wt) : not all arguments have the same length
fit3 <- lsfit(df[1,], pcs2[,1]) # Error in complete.cases(x, y, wt) : not all arguments have the same length
With the transposed data frame, lsfit() complains that I have too many variables. With the non-transposed data frame, it argues that the arguments don't have the same length, even when I only feed it one row from df (length 600) and one column from pcs2 (length 600). How do I get the least-squares fits between my PCs and my 2000 observations?

First, pc_model$x is just the coordinates of the observations in the new space defined by the axes (PC1, PC2, PC3), so you'll have as many rows as there are observations, i.e. 2000 rows for 2000 observations.
lsfit(X, Y) tries to fit the model Y = Xb + e, where Y and e are (N,M) matrices, X is an (N,K) matrix, and b is a (K,M) matrix. K is the number of variables used in the estimation (K = the number of columns in the original X matrix, plus 1 if you want an intercept coefficient, which is the default). You also need N >= K for the regression to be computable.
Running fit2 <- lsfit(df, pcs) gives correct output, as both conditions are verified: the two arguments have the same number of rows, and N = 2000 >= K = 601.
The error Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables is caused by df_trans having 2000 columns (2001 variables including the intercept) while pcs2 has only 600 rows. Selecting at most 599 columns circumvents the error: lsfit(df_trans[, 1:599], pcs2).
The error not all arguments have the same length is raised by the complete.cases() call inside lsfit(), because df and pcs2 have different numbers of rows; this error is thrown before lsfit() ever reaches its own check for mismatched row counts.
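Putting the dimension rules together, a minimal sketch of two dimensionally consistent calls, using pcs from the untransposed fit above (which matrix plays the role of X depends on the direction you want to regress):
# Both arguments must share N = 2000 rows; either direction is then well-posed:
fit2 <- lsfit(df, pcs)    # X = the 600 variables (+ intercept), Y = the 3 PCs
fit3 <- lsfit(pcs, df)    # X = the 3 PCs (+ intercept), Y = the 600 variables
dim(fit3$coefficients)    # 4 x 600: intercept plus 3 PC coefficients per variable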

Related

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data <- select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace = TRUE)
Random_sample_cats <- cbind(Random_sample_data, cat)
correlation <- function(c) {
  c <- cor.test(x, y, method = "kendall")
  return(c)
}
b <- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate() to run it n times. Here's a reproducible example on built-in data:
library(dplyr)  # for sample_n()
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
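If you want both the coefficient and the p-value (the question asks to save both), one sketch, assuming dplyr is loaded, is to return a named vector so that replicate simplifies to a matrix, then transpose it to get one row per replication; the file name in the commented write.csv line is just an illustration:
result <- t(replicate(n = 1000, expr = {
  newdata <- sample_n(mtcars, 10)
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  c(tau = unname(ct$estimate), p.value = ct$p.value)
}))
head(result)   # one row per replication: tau and p-value
# write.csv(result, "correlations.csv", row.names = FALSE)  # hypothetical file name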

error : for loop - replacement has length zero

I am new to R and am trying to do coursework about factor analysis with it.
I have two data sets, FundReturn (120 rows, 14 columns) and Factors (120 rows, 30 columns). I want to run a one-factor regression for every possible pair of factor and fund, starting with the first 60 observations. With the parameters estimated, I want to calculate the predicted value for the 61st fund return from the 61st value of the factor. Then the estimation window is expanded by one observation, new parameters are estimated with the updated sample, the predicted value for the 62nd fund return is calculated, and so forth. In total, 60 predictions will be made and stored in Predictions = array(1, dim = c(60, 30, 14)), so I can compare them with the realized values.
The following is the code I used, which produced this error:
Error in Predictions[p, fa, fu] <- coeff[1, p, fa, fu] + coeff[2, p, fa, :
replacement has length zero
Can anyone spot the problem? Your help is greatly appreciated.
Predictions <- array(1, dim = c(60, 30, 14))
coeff <- array(1, dim = c(3, 60, 30, 14))
v1 <- 1:30
v2 <- 1:60
v3 <- 1:14
for (fu in v3) {
  for (fa in v1) {
    for (p in v2) {
      y1 <- FundReturn[1:(59 + p), fu]
      x1 <- Factors[1:(59 + p), fa]
      Model <- lm(y1 ~ x1 + lag(y1))
      coeff[1:3, p, fa, fu] <- Model[["coefficients"]]
      Predictions[p, fa, fu] <- coeff[1, p, fa, fu] + coeff[2, p, fa, fu] * Factors[60 + p, fa] + coeff[3, p, fa, fu] * FundReturn[59 + p, fu]
    }
  }
}
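For context (no answer is quoted here), R raises this error whenever the right-hand side of an indexed assignment has length zero. A minimal reproduction of the error class, plus the arithmetic quirk that usually produces the zero-length value:
x <- 1:3
x[1] <- numeric(0)
# Error in x[1] <- numeric(0) : replacement has length zero

# Arithmetic with a zero-length operand silently yields a zero-length result,
# so any term evaluating to numeric(0) on the right-hand side triggers it:
1 + numeric(0)
# numeric(0)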

predict function in R for a matrix

So, I have 2 datasets, training and test. The training dataset is a 926x9 matrix. The first 8 columns represent the feature vector x and the last column represents the single-valued output y. The test dataset is a 103x8 matrix. I am looking to perform linear regression on these.
trainData <- read.table("./traindata.txt")
X <- as.matrix(trainData[,1:8])
Y <- as.matrix(trainData[,9])
relation <- lm(Y~X)
testData <- read.table("./testinputs.txt")
testX <- as.matrix(testData[,1:8])
testOutputForY <- predict(relation, newdata = data.frame(X = testX))
The warning message I get is 'newdata' had 103 rows but variables found have 926 rows. I am not sure what changes need to be made to get it working.
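No answer is quoted for this one, but the usual remedy (the same name-matching behaviour discussed in the predict() questions below) is to fit on a data frame so that predict() can match newdata columns by name. A minimal sketch, assuming both files are read without headers so read.table assigns the default V1..V9 and V1..V8 names:
trainData <- read.table("./traindata.txt")                # columns named V1..V9 by default
relation <- lm(V9 ~ ., data = trainData)                  # V9 is the response, V1..V8 the features
testData <- read.table("./testinputs.txt")                # columns named V1..V8, matching training
testOutputForY <- predict(relation, newdata = testData)   # 103 predictions, one per test row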

predict.lm after regression with missing data in Y

I don't understand how to generate predicted values from a linear regression using the predict.lm command when some values of the dependent variable Y are missing, even though no independent X observation is missing. Algebraically, this isn't a problem, but I don't know an efficient way to do it in R. Take for example this fake data frame and regression model. I attempt to assign predictions in the source data frame but am unable to do so because of the one missing Y value: I get an error.
# Create a fake dataframe
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(100,200,300,400,NA,600,700,800,900,100)
df <- as.data.frame(cbind(x,y))
# Regress X and Y
model<-lm(y~x+1)
summary(model)
# Attempt to generate predictions in source dataframe but am unable to.
df$y_ip <- predict.lm(model)
Error in `$<-.data.frame`(`*tmp*`, y_ip, value = c(221.............
replacement has 9 rows, data has 10
I got around this problem by generating the predictions using algebra, df$y <- B0 + B1*df$x, or by calling the coefficients of the model, df$y <- summary(model)$coefficients[1, 1] + summary(model)$coefficients[2, 1]*df$x; however, I am now working with a big model with hundreds of coefficients, and these methods are no longer practical. I'd like to know how to do it using the predict function.
Thank you in advance for your assistance!
There is built-in functionality for this in R (though not necessarily obvious): the na.action argument and the na.exclude function (see ?na.exclude). With this option set, predict() (and similar downstream processing functions) will automatically fill in NA values in the relevant spots.
Set up data:
df <- data.frame(x=1:10,y=100*(1:10))
df$y[5] <- NA
Fit model: default na.action is na.omit, which simply removes non-complete cases.
mod1 <- lm(y~x+1,data=df)
predict(mod1)
## 1 2 3 4 6 7 8 9 10
## 100 200 300 400 600 700 800 900 1000
na.exclude removes non-complete cases before fitting, but then restores them (filled with NA) in predicted vectors:
mod2 <- update(mod1,na.action=na.exclude)
predict(mod2)
## 1 2 3 4 5 6 7 8 9 10
## 100 200 300 400 NA 600 700 800 900 1000
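Because the na.exclude predictions line up row-for-row with the original data, the assignment that failed in the question now works directly:
df$y_ip <- predict(mod2)   # length 10 with NA in row 5, so it matches nrow(df)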
Actually, you are not using the predict.lm function correctly.
Either way, you have to pass the model itself as the first argument (here model), with or without new data. Without new data, it will only predict on the training data, thus excluding your NA row, and you need this workaround to fit the predictions back into the initial data.frame:
df$y_ip[!is.na(df$y)] <- predict.lm(model)
Or explicitly specify some new data. Since the new x has one more row than the training x, the missing row gets an actual prediction:
df$y_ip <- predict.lm(model, newdata = df)
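Note that the two workarounds differ in row 5: the first leaves it NA (nothing is assigned there), while the second produces a real prediction because x[5] is observed. A quick check:
v1 <- rep(NA_real_, nrow(df))
v1[!is.na(df$y)] <- predict.lm(model)
v2 <- predict.lm(model, newdata = df)
v1[5]   # NA: this row was never assigned
v2[5]   # a real prediction, computed from the observed x[5]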

How do I produce a set of predictions based on a new set of data using predict in R? [duplicate]

I'm struggling to understand how the predict function works and can be used with different sample data. For instance the following code...
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)
mySample <- my[sample(nrow(my), 100),]
predict(fit, mySample)
I would have understood if it returned 100 y predictions based on the sample. But it returns 1,000 rows with the warning message:
'newdata' had 100 rows but variables found have 1000 rows
How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob so apologise in advance if I am asking stupid questions.
It's never a good idea to use the $ operator when using the formula syntax (and most of the time it's completely unnecessary). This is especially true when you are trying to make predictions, because the predict() function works hard to exactly match up column names and data types. So rather than
fit <- lm(my$y ~ my$x)
use
fit <- lm(y ~ x, my)
So a complete example would be
set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]
head(predict(fit, mySample))
# 694 278 298 825 366 980
# 0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087 0.09627245
A couple of things are wrong with the code: you are overwriting the sample function with a variable named sample; you want something like mysample <- sample(my$x, 100). It's nothing to do with predict. From my limited understanding, data frames are 'lists of columns', so sampling my means sampling from its columns rather than from the 1000 rows of column x. By using my$x you are referring to the column (in the data frame), which is a list of rows.
In other words, you are sampling from a list of columns (which has only a few elements), but you actually want to sample from the rows of column x.
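To make the distinction concrete, here is a small sketch of the samplings involved (values of one column, whole rows, and columns):
set.seed(1)
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)
vals <- sample(my$x, 100)             # 100 values drawn from column x
rows <- my[sample(nrow(my), 100), ]   # 100 complete rows (x and y kept together)
# sample(my, ...) would sample from my's columns, of which there are only two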
Is this what you want?
library(caret)
my <- data.frame(x = rnorm(1000))
my$y <- 0.5 * my$x + 0.5 * rnorm(1000)
## Divide the data into train and test sets
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index, ]
lmfit <- train(y ~ x, method = "lm", data = train, trControl = trainControl(method = "cv"))
lmpredict <- predict(lmfit, test)
This is an in-sample prediction. For a pseudo out-of-sample prediction (forecasting one step ahead), you just need to lag the independent variable by one period, e.g. Lag(x).
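Lag() with a capital L comes from an add-on package such as Hmisc or quantmod; a base-R sketch of the same one-step-ahead idea, assuming the goal is to predict y from the previous period's x:
my$x_lag <- c(NA, head(my$x, -1))     # row t now carries x from period t-1
lagfit <- lm(y ~ x_lag, data = my)    # the first row (NA lag) is dropped automatically
one_step <- predict(lagfit, newdata = data.frame(x_lag = tail(my$x, 1)))  # forecast for t+1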
