predict function in R for a matrix

So, I have 2 datasets, training and test. The training dataset is a 926x9 matrix. The first 8 columns represent the feature vector x and the last column represents the single-valued output y. The test dataset is a 103x8 matrix. I am looking to perform linear regression on these.
trainData <- read.table("./traindata.txt")
X <- as.matrix(trainData[,1:8])
Y <- as.matrix(trainData[,9])
relation <- lm(Y~X)
testData <- read.table("./testinputs.txt")
testX <- as.matrix(testData[,1:8])
testOutputForY <- predict(relation, newdata = data.frame(X = testX))
The warning message I get is 'newdata' had 103 rows but variables found have 926 rows. I am not sure what changes need to be made to get it working.
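The usual cause is that lm(Y ~ X) with a matrix X records variable names that predict() cannot match against newdata. A minimal sketch of the standard fix, assuming the same two files as above (the column names x1..x8 are invented here for illustration): fit the model on a data frame so that the training and test columns share names.
trainData <- read.table("./traindata.txt")
colnames(trainData) <- c(paste0("x", 1:8), "y")  # name the 8 features and the response
relation <- lm(y ~ ., data = trainData)
testData <- read.table("./testinputs.txt")
colnames(testData) <- paste0("x", 1:8)           # must match the training feature names
testOutputForY <- predict(relation, newdata = testData)  # 103 predictions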

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models whose choices of important predictors I would like to compare over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2", and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt to solve this is to extract the top variables in each iterated model, put them into a list, and then find the mean of the top variables over X iterations of the loop. However, my issue is that I still see variability in the top 10 most used predictors, so I cannot manually alter the variable names, as each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis, and because the bootstrapping creates new models on every run.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 candidate variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # Bootstrap resample and split
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg ~ ., train)[, -1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg ~ ., test)[, -1]
  ytest <- df[-ind, 1]
  # Fit one lasso model per loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store coefficients per loop, dropping the intercept
  coef_las <- coef(model, s = 0.2)[-1, ]
  # Keep all nonzero coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now, if you were to rerun the code chunk above, you would notice that the top 3 variables change; if I had to rename these variables, that would be difficult to do while they are not constant and change on every run. Any suggestions on how I could approach this?
You can use the function set.seed() to ensure your sample will return the same values each time. For example:
set.seed(123)
When I add this to the above code and then run it twice, the following is returned both times:
wt carb hp
98 89 86
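Note that the seed needs to be set once, before the loop, so that the whole sequence of bootstrap samples is reproduced on every run. A minimal sketch of the placement (loop body as in the question):
set.seed(123)  # fix the RNG state before any sampling happens
topvar <- list()
for (i in 1:100) {
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  # ... rest of the loop unchanged
}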

Compare PCs to data with lsfit()

I have a data frame with 2000 observations (rows) and 600 variables (columns). See reproducible example:
cols <- list()  # avoid naming the variable "list", which shadows base::list
for(i in 1:600){
  cols[[i]] <- sample(seq(0, 0.6, l = 2000))
}
df <- as.data.frame(do.call(cbind, cols))
I want to perform PCA on the variables and then use lsfit to compare the fit between the principal components and the data (as well as some other data, but this is left out here). My first issue is that when I perform PCA on the data set as it is, my principal components have length 2000. I would expect them to have length 600. However, this is resolved by transposing the data frame.
pc_model <- prcomp(df, center=F, rank=3)
pcs <- pc_model$x # wrong length, why?
df_trans <- as.data.frame(t(df))
pc_model2 <- prcomp(df_trans, center=F, rank=3)
pcs2 <- pc_model2$x # correct length, why?
My next issue is that when I try to use lsfit() to compare my 2000 observations to the principal components, I get all sorts of complaints:
fit <- lsfit(df_trans, pcs2) # Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables
fit2 <- lsfit(df, pcs2) # Error in complete.cases(x, y, wt) : not all arguments have the same length
fit3 <- lsfit(df[1,], pcs2[,1]) # Error in complete.cases(x, y, wt) : not all arguments have the same length
With the transposed data frame, lsfit() complains that I have too many variables. With the non-transposed data frame, it argues that the arguments don't have the same length, even when I only feed it one row from df (length 600) and one column from pcs2 (length 600). How do I get the least-squares fits between my PCs and my 2000 observations?
First, pc_model$x is just the coordinates of the observations in the new space defined by the axes (PC1, PC2, PC3), so you'll have as many rows as there are observations, i.e. 2000 rows for 2000 observations.
lsfit(X, Y) tries to fit the model Y = Xb + e, where Y and e are (N,M) matrices, X is an (N,K) matrix, and b is a (K,M) matrix. K is the number of variables used in the estimation (K = the number of columns in the original X matrix, plus 1 if you estimate the intercept coefficient, which is the default). N >= K is also required for this regression to be computable.
Running fit2 <- lsfit(df, pcs) will give correct output, as the conditions are verified, i.e. the same number of rows and N = 2000 >= K = 601.
The error Error in lsfit(df_trans, pcs2) : only 600 cases, but 2001 variables is caused by df_trans having 2000 columns (2001 variables including the intercept) while pcs2 has only 600 rows. Selecting the first 599 columns circumvents the error: lsfit(df_trans[, 1:599], pcs2)
The error not all arguments have the same length comes from the complete.cases call inside lsfit: because df and pcs2 have different numbers of rows, this error is thrown before lsfit reaches its own check on mismatched row counts.
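Putting it together, a minimal sketch of a working call, reusing df and pc_model from the question, regressing the three PC scores on the 600 original variables:
pcs <- pc_model$x        # 2000 x 3 matrix of scores, one row per observation
fit <- lsfit(df, pcs)    # X: 2000 x 600 (+ intercept), Y: 2000 x 3, so N = 2000 >= K = 601
dim(fit$coefficients)    # 601 x 3: one column of coefficients per PC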

Why does SVM work when using the comma delimited form but not the formula form? R

So I have a data set of nrow = 218, and I'm going through [this](https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/) example ([git here](https://github.com/nagdevAmruthnath)). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).
When I get to the part where pred <- predict(model_svm, test), if I convert pred into a data frame, there are 163 rows instead of 55 (when using the formula form of the svm call). Is this normal because it used 163 rows to train? Or should it only have 55 rows, since I'm using the test set to test?
When I use the 'formula' form of svm, I have issues with the number of rows in the predict function:
model_svm <- svm(trainlabel ~ as.matrix(train) )
But when I use the 'traditional' form, predict on the test data works fine:
model_svm <- svm(as.matrix(train), trainlabel)
Any idea why this is?
Some fake data:
featuredata_all <- matrix(rexp(218 * 23, rate = .1), ncol = 23)  # 218 rows, 23 columns
Some of the code:
library(data.table)
pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate
ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"
## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)
train <- ft[train_ind,1:22] #163 reads
test <- ft[-train_ind,1:22] #55 reads
trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels
#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)
#Use the predictions on the data
pred <- predict(model_svm, test)
You are correct: the formula way is giving you the number of results for training, when pred should give you the number of results for testing. I think the problem is that you're writing your formula with as.matrix(). If you look at the results of your pred, you'll see there are actually a bunch of NAs.
Here's the correct way to use the formula:
library(caret)  # for createDataPartition
library(e1071)  # for svm
#Create training and testing sets
set.seed(123)
intrain <- createDataPartition(y = beaver2$activ, p = 0.8, list = FALSE, times = 1)
train <- beaver2[intrain, ]  #80 rows, 4 variables
test <- beaver2[-intrain, ]  #20 rows, 4 variables
svm_beaver2 <- svm(activ ~ ., data = train)
pred <- predict(svm_beaver2, test)  #20 responses, the same length as the test set
Your outcome just has to be a factor. So even if it is a string, you can convert it to a factor with train$outcome <- as.factor(train$outcome), and then you can use the formula above.
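Applied to the question's own objects, a minimal sketch (assuming train, test, and trainlabel as built in the question): keep the label in the training data frame as a factor and give the formula plain column names instead of an as.matrix() call.
library(e1071)
train_df <- cbind(train, Cluster = as.factor(trainlabel))  # 163 rows, label as a factor column
model_svm <- svm(Cluster ~ ., data = train_df)
pred <- predict(model_svm, test)  # 55 predictions, one per test row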

naive bayes error in R: subscript out of bounds

I'm trying to classify 94 speech texts.
Since naiveBayes cannot work well if categories in the training set do not exist in the test set, I randomized the split and confirmed it.
There was no problem with the categories.
But the classifier didn't work on the test set.
The code and the error message follow:
Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]
# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]
# Isolate classifier
cl <- Df.dtm[, "category"]
> summary(cl[train])
dip eds ind pols
23 8 3 13
# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]
#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels = c(0, 1), labels = c("No", "Yes"))
  y
}
#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)
#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )
This classifier worked well:
user system elapsed
0.45 0.00 0.46
#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))
However, prediction failed:
Error in [.default(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2
Consider the following:
# Indices of training observations (rows).
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
# Indices of whatever is left over from the previous sample -- again, rows --
# that still remain inside of Df.dtm, with the notation as follows:
test <- Df.dtm[-train, ]
After clearing up what the sample returned (row indices) and how the test set should be sliced (rows or columns need to be established at this point), I would tweak the apply() call with the necessary margin argument: passing 2 applies the function over each column, while passing 1 applies it over each row. Again, depending on how you want your sample (rows or columns), we can tweak this either way.
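A minimal sketch of the row-wise split the answer describes, reusing Df.dtm, convert_count, and cl from the question (whether this alone resolves the original error depends on the data):
library(e1071)
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * 0.5))  # row indices
modeldata <- Df.dtm[, !colnames(Df.dtm) %in% "category"]
train.cc <- apply(modeldata[train, ], 2, convert_count)     # margin 2: over columns
test.cc <- apply(modeldata[-train, ], 2, convert_count)     # complementary rows
classifier <- naiveBayes(train.cc, cl[train], laplace = 1)
pred <- predict(classifier, newdata = test.cc)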

How do I produce a set of predictions based on a new set of data using predict in R? [duplicate]

This question already has answers here: Predict() - Maybe I'm not understanding it (4 answers). Closed 6 years ago.
I'm struggling to understand how the predict function works and how it can be used with different sample data. For instance, the following code...
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(my$y ~ my$x)
mySample <- my[sample(nrow(my), 100),]
predict(fit, mySample)
I would expect this to return 100 y predictions based on the sample. But it returns 1,000 rows, with the warning message:
'newdata' had 100 rows but variables found have 1000 rows
How do I produce a set of predictions based on a new set of data using predict? Or am I using the wrong function? I am a noob, so apologies in advance if I am asking a stupid question.
It's never a good idea to use the $ symbol when using formula syntax (and most of the time it's completely unnecessary). This is especially true when you are trying to make predictions, because the predict() function works hard to exactly match up column names and data types. So rather than
fit <- lm(my$y ~ my$x)
use
fit <- lm(y ~ x, my)
So a complete example would be
set.seed(15) # for reproducibility
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
fit <- lm(y ~ x, my)
mySample <- my[sample(1:nrow(my), 100),]
head(predict(fit, mySample))
# 694 278 298 825 366 980
# 0.43593108 -0.67936324 -0.42168723 -0.04982095 -0.72499087 0.09627245
A couple of things are wrong with the code: you are overwriting the sample function with your variable named sample. You want something like mysample <- sample(my$x, 100) ... it's nothing to do with predict. From my limited understanding, data frames are 'lists of columns', so sampling my means creating 100 samples of the (1000-row) column x. By using my$x you are now referring to the column (in the data frame), which is a list of rows.
In other words, you are sampling from a list of columns (which only has a single element), but you actually want to sample from a list of the rows in column x.
Is this what you want?
library(caret)
my <- data.frame(x=rnorm(1000))
my$y <- 0.5*my$x+0.5*rnorm(1000)
## Divide data into train and test set
Index <- createDataPartition(my$y, p = 0.8, list = FALSE, times = 1)
train <- my[Index, ]
test <- my[-Index, ]
lmfit <- train(y ~ x, method = "lm", data = train, trControl = trainControl(method = "cv"))
lmpredict <- predict(lmfit, test)
This is an in-sample prediction. For a pseudo out-of-sample prediction (forecasting one step ahead), you just need to lag the independent variable by one period:
Lag(x)
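Lag() here presumably comes from a package such as Hmisc or quantmod. A dependency-free sketch of the same one-step-ahead setup in base R (x_lag1 is a name invented for this illustration):
my$x_lag1 <- c(NA, head(my$x, -1))    # shift x down one period
fit_lag <- lm(y ~ x_lag1, data = my)  # y_t regressed on x_{t-1}; the NA row is dropped
head(predict(fit_lag, newdata = my))  # one-step-ahead style predictions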
