I am trying to use the missForest package to impute missing data in a fairly large dataset. Most of my variables are categorical with many levels. When I run missForest, it imputes decimal values and sometimes even negative values. Obviously, I'm doing something wrong. Here is my process:
FIRST TRY: Entering predictor data directly. I got decimal values imputed into my dataset. I know that missForest only takes matrices, but I'm not sure how to force it to recognize which columns are factors. Someone on another post recommended dummy coding, so I tried that next, with the same results. Code is below.
SECOND TRY: Dummy coding each predictor (so time consuming) and then running that.
homt_sub_dummy<-homt_sub[c("Psyprob.yes", "Psyprob.no","SUB2.2.0", "SUB2.2.1", "SUB2.2.2", "SUB2.2.3", "SUB2.2.4", "SUB2.2.5", "SUB2.2.6", "SUB2.2.7","Freq1.1", "Freq1.2", "Freq1.3", "Freq1.4","FRSTUSE1.0", "FRSTUSE1.1", "FRSTUSE1.2", "FRSTUSE1.3", "FRSTUSE1.4", "FRSTUSE1.5", "FRSTUSE1.6","FRSTUSE1.7", "FRSTUSE1.8", "FRSTUSE1.9", "FRSTUSE1.10", "FRSTUSE1.11","Freq2.1", "Freq2.2", "Freq2.3", "Freq2.4","AGEcont","Gender_male", "Gender_female", "Race2.0", "Race2.1", "Race2.2", "Arrests.0", "Arrests.1", "Arrests.2")]
homt_dummy_matrix<-data.matrix(homt_sub_dummy, rownames.force = NA)
homt_dummp.imp <- missForest(homt_dummy_matrix, verbose= TRUE, maxiter = 3, ntree = 20)
homt_dummy.imp.df<-as.data.frame(homt_dummp.imp$ximp)
View(homt_dummy.imp.df)
This is a chunk of the data.frame I saved with the imputed values
Any help would be appreciated. I'm pretty new to imputation. I wanted to compare results of MICE with this but I just can't seem to get missForest to work!!!
You can use the as.factor function to convert the columns you want into factors. For example:
cleveland_t <- transform(cleveland,V2=as.factor(V2),V3 = as.factor(V3),V6 = as.factor(V6),V7=as.factor(V7),V9 = as.factor(V9),V11=as.factor(V11),V12 = as.factor(V12),V13= as.factor(V13),V14=as.factor(V14))
Then use sapply to check the class of each column.
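The conversion and the class check can be sketched on a toy data frame. The data and column names below are hypothetical stand-ins, not the asker's dataset. The last step shows why going through data.matrix() loses the factor information: it turns factors into their integer codes, so anything downstream sees plain numbers.

```r
# Hypothetical toy data standing in for the real columns
df <- data.frame(
  Gender  = c("male", "female", NA, "male"),
  Arrests = c(0, 2, 1, NA),
  AGEcont = c(34.5, NA, 51.2, 28.0)
)

# Mark the categorical columns as factors; AGEcont stays numeric
df <- transform(df, Gender = as.factor(Gender), Arrests = as.factor(Arrests))

# Check the class of every column
col_classes <- sapply(df, class)
print(col_classes)

# data.matrix() replaces factors with their integer codes,
# so the result is a purely numeric matrix
m <- data.matrix(df)
print(is.numeric(m))
```

Keeping the data as a data frame with factor columns, rather than a numeric matrix, is what preserves the categorical information.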
Ok, I have a data frame with 250 observations of 9 variables. For simplicity, let's just label them A - I
I've done all the standard stuff (converting things to int or factor, creating the data partition, test and train sets, etc).
What I want to do is use columns A and B, and predict column E. I don't want to use the entire set of nine columns, just these three when I make my prediction.
I tried only using the limited columns in the prediction, like this:
myPred <- predict(rfModel, newdata=myData)
where rfModel is my model, and myData only contains the two fields I want to use, as a dataframe. Unfortunately, I get the following error:
Error in predict.randomForest(rfModel, newdata = myData) :
variables in the training data missing in newdata
Honestly, I'm very new to R, and I'm not even sure this is feasible. I think the data that I'm collecting (the nine fields) are important to use for "training", but I can't figure out how to make a prediction using just the "resultant" field (in this case field E) and the other two fields (A and B), and keeping the other important data.
Any advice is greatly appreciated. I can post some of the code if necessary.
I'm just trying to learn more about things like this.
I assume you used the randomForest package:
library(randomForest)
model <- randomForest(E ~ A + B, data = train)  # only A and B are used as predictors
pred <- predict(model, newdata = test)
As you can see in this example, only columns A and B are used to build the model; the others are left out of model building (but not removed from the dataset). If you want to include all of them, use E ~ . instead. This also means that if you build your model on all columns, those columns must be present in the test set too; predict won't work without them. If the test data has only columns A and B, the model has to be built on those two columns.
Hope it helped
As I mentioned in my comment above, perhaps you should be building your model using only the A and B columns. If you can't/don't want to do this, then one workaround perhaps would be to simply use the median values for the other columns when calling predict. Something like this:
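# Name each median column so predict() can match it to the training variables
myData <- cbind(data[, c("A", "B")], C = median(data$C), D = median(data$D), E = median(data$E),
F = median(data$F), G = median(data$G), H = median(data$H), I = median(data$I))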
myPred <- predict(rfModel, newdata=myData)
This would allow you to use your current model, built with 9 predictors. Of course, you would be assuming average behavior for all predictors except for A and B, which might not behave too differently from a model built solely on A and B.
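The two options (fill the unused predictors with medians, or refit on A and B only) can be compared in a small sketch. To keep it runnable with base R alone, lm is swapped in for randomForest here; the data, the column set, and all object names are hypothetical. The predict() behaviour being illustrated is the same: newdata must contain every variable the model was trained on.

```r
set.seed(1)

# Hypothetical training data: predictors A-D, response E
train <- data.frame(A = rnorm(50), B = rnorm(50), C = rnorm(50), D = rnorm(50))
train$E <- 2 * train$A - train$B + 0.5 * train$C + rnorm(50, sd = 0.1)

# New observations where only A and B are known
new_ab <- data.frame(A = c(0.1, -0.3), B = c(1.2, 0.4))

# Option 1: keep the full model and hold the other predictors at their medians
full_model <- lm(E ~ ., data = train)
new_full   <- cbind(new_ab, C = median(train$C), D = median(train$D))
pred_full  <- predict(full_model, newdata = new_full)

# Option 2: fit on A and B only, then predict directly
ab_model <- lm(E ~ A + B, data = train)
pred_ab  <- predict(ab_model, newdata = new_ab)

print(cbind(pred_full, pred_ab))
```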
I am trying to do an ANOVA analysis in R on a data set with one within factor and one between factor. The data is from an experiment to test the similarity of two testing methods. Each subject was tested with Method 1 and Method 2 (the within factor) and was in one of 4 different groups (the between factor). I have tried using aov, Anova (in the car package), and ezANOVA. I am getting wrong values for every method I try. I am not sure where my mistake is, whether it's a lack of understanding of R or of the ANOVA itself. I included the code that I feel should be working. I have tried a ton of variations of this, hoping to stumble on the answer. This set of data is balanced, but I have a lot of similar data sets and many are unbalanced. Thanks for any help you can provide.
library(car)
library(ez)
#set up data
sample_data <- data.frame(Subject=rep(1:20,2),Method=rep(c('Method1','Method2'),each=20),Level=rep(rep(c('Level1','Level2','Level3','Level4'),each=5),2))
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)
sample_data[, 'Subject'] <- as.factor(sample_data[, 'Subject'])
#Set the contrasts if needed to run type 3 sums of squares for unbalanced data
#options(contrasts=c("contr.sum","contr.poly"))
#With the aov method; as I understand it, this 'should' work
anova_aov <- aov(Result ~ Method*Level + Error(Subject/Method),data=sample_data)
print(summary(anova_aov))
#ezAnova method,
anova_ez = ezANOVA(data=sample_data, wid=Subject, dv = Result, within = Method, between=Level, detailed = TRUE, type=3)
print(anova_ez)
Also, the values I should be getting as output by SAS
SAS Anova
Actually, your R code is correct in both cases. Running these data through SPSS yielded the same result. SAS, like SPSS, seems to require that the levels of the within factor appear in separate columns. You will end up with 20 rows instead of 40. An arrangement like the one below might give you the desired result in SAS:
Subject Level Method1 Method2
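One way to get from the long layout in the question to this wide layout is base R's reshape(). This is a minimal sketch that rebuilds the question's sample_data; the final renaming step is only there to match the column names shown above.

```r
# Rebuild the long-format data from the question (40 rows)
sample_data <- data.frame(
  Subject = rep(1:20, 2),
  Method  = rep(c("Method1", "Method2"), each = 20),
  Level   = rep(rep(c("Level1", "Level2", "Level3", "Level4"), each = 5), 2)
)
sample_data$Result <- c(4.76,5.03,4.97,4.70,5.03,6.43,6.44,6.43,6.39,6.40,
                        5.31,4.54,5.07,4.99,4.79,4.93,5.36,4.81,4.71,5.06,
                        4.72,5.10,4.99,4.61,5.10,6.45,6.62,6.37,6.42,6.43,
                        5.22,4.72,5.03,4.98,4.59,5.06,5.29,4.87,4.81,5.07)

# One row per subject, one Result column per level of the within factor
wide <- reshape(sample_data,
                idvar     = c("Subject", "Level"),
                timevar   = "Method",
                direction = "wide")

# reshape() names the new columns Result.Method1 / Result.Method2
names(wide) <- sub("^Result\\.", "", names(wide))
print(head(wide))
```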
Hi guys, I would like to ask whether there is any way to do rolling window regressions for multiple return series at once, with the dependent variables in one data frame and the regressors in another. I am trying to combine the rollapply and sapply functions to do this. So far I can't seem to make it work.
For those with a finance background: what I am trying to do is compute the regressor for Fama-MacBeth regressions, with a rolling window that is rolled forward by 1 month to update the regressor. This differs from the original 1973 Fama-MacBeth procedure, which rolls the estimation period forward by 4 years.
I attached a link to the .csv files needed for the sample script below. They contain daily price data from Yahoo Finance, so that you can better see what I am trying to do.
Here are some CSV files for the script; just put them in your R working directory and run this script.
library(xts)
library(quantmod)
library(lmtest)
library(sandwich)
library(MASS)
library(tseries)
data.AMZN<-read.csv("AMZN.csv",header=TRUE)
date<-as.Date(data.AMZN$Date,format="%Y-%m-%d")
data.AMZN<-cbind(date, data.AMZN[,-1])
data.AMZN<-data.AMZN[order(data.AMZN$date),]
data.AMZN<-xts(data.AMZN[,2:7],order.by=data.AMZN[,1])
names(data.AMZN)<-
paste(c("AMZN.Open","AMZN.High","AMZN.Low",
"AMZN.Close","AMZN.Volume","AMZN.Adjusted"))
data.AMZN[c(1:3,nrow(data.AMZN)),]
data.YHOO<-read.csv("YHOO.csv",header=TRUE)
date<-as.Date(data.YHOO$Date,format="%Y-%m-%d")
data.YHOO<-cbind(date, data.YHOO[,-1])
data.YHOO<-data.YHOO[order(data.YHOO$date),]
data.YHOO<-xts(data.YHOO[,2:7],order.by=data.YHOO[,1])
names(data.YHOO)<-
paste(c("YHOO.Open","YHOO.High","YHOO.Low",
"YHOO.Close","YHOO.Volume","YHOO.Adjusted"))
data.YHOO[c(1:3,nrow(data.YHOO)),]
data.mkt<-read.csv("GSPC.csv",header=TRUE)
date<-as.Date(data.mkt$Date,format="%Y-%m-%d")
data.mkt<-cbind(date, data.mkt[,-1])
data.mkt<-data.mkt[order(data.mkt$date),]
data.mkt<-xts(data.mkt[,2:7],order.by=data.mkt[,1])
names(data.mkt)[1:6]<-
paste(c("GSPC.Open","GSPC.High","GSPC.Low",
"GSPC.Close","GSPC.Volume","GSPC.Adjusted"))
data.mkt[c(1:3,nrow(data.mkt))]
rets<-diff(log(data.AMZN$AMZN.Adjusted))
rets$YHOO<-diff(log(data.YHOO$YHOO.Adjusted))
names(rets)[1]<-"AMZN"
mktrets<-diff(log(data.mkt$GSPC.Adjusted))
names(mktrets)[1]<- "GSPC"
rets<-rets[-1,]
rets.df = as.data.frame(rets)
mktrets<-mktrets[-1,]
mktrets.df = as.data.frame(mktrets)
# Combining this function: a 252-day rolling window linear regression
# for a single asset as dependent variable and another as regressor, in the same data frame
coeffs<-rollapply(rets,
width=252,
FUN=function(X)
{
roll.reg=lm(AMZN~YHOO,#YHOO is supposed to be GSPC, just an illustration.
data=as.data.frame(X))
return(summary(roll.reg)$coef)
},
by.column=FALSE)
# with this function: it runs linear regressions for multiple assets against a regressor
# in a different data frame at once and puts the results in a matrix.
Coefficients = sapply(1:ncol(rets),function(x) {
summary(lm(rets[,x]~mktrets[,1]))$coefficients
}
)
#I need to do the rolling regressions with different data frames because,
#in the real application, I need to assign a unique and specific regressor to
#each dependent variable
Perhaps it's too much to ask, but I really need to do this. Any suggestions on how to do this, or on any other way to do it, will be very much appreciated.
Thanks guys.
If you need all coefficients you can modify the function in your rollapply (Edit: the result needs to be a vector):
coeffs<-rollapply(1:nrow(rets),
width=252,
FUN=function(i) #i=1:252
{
yrets=rets.df[i,]
xmktrets=mktrets.df[i,]
Coefficients =do.call("cbind",lapply(1:ncol(yrets),function(y) { #y=2
t(summary(lm(yrets[,y]~xmktrets))$coefficients)
} ))
rs=c()
for(j in 1:4)rs<-c(rs,Coefficients[j,])
c(rs,Data=as.Date(index(rets)[max(i)],"%y-%m-%d"))
},
by.column=FALSE)
Then you could extract information from coeffs:
#the betas
colnames(rets)
plot.zoo(coeffs[,c(1,2)],col=2:3,main=colnames(rets)[1]) #"AMZN"
plot.zoo(coeffs[,c(3,4)],col=3:4,main=colnames(rets)[2]) #"YHOO"
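For the case raised in the question, where each asset has its own regressor, the same idea can be sketched in base R without rollapply by looping over window start positions. Everything below is illustrative: the data are simulated, and regs (one regressor column per asset, matched by column name) and width are hypothetical names, not part of the original script.

```r
set.seed(42)
n     <- 60   # number of observations (simulated)
width <- 20   # rolling window length; 252 in the original script

# Simulated returns and one asset-specific regressor per return series
rets <- data.frame(AMZN = rnorm(n, 0, 0.02), YHOO = rnorm(n, 0, 0.02))
regs <- data.frame(AMZN = rnorm(n, 0, 0.01), YHOO = rnorm(n, 0, 0.01))

starts <- 1:(n - width + 1)

# For every window, regress each asset on its own regressor and keep the slope
betas <- t(sapply(starts, function(s) {
  idx <- s:(s + width - 1)
  sapply(names(rets), function(a) {
    unname(coef(lm(rets[idx, a] ~ regs[idx, a]))[2])
  })
}))

print(head(betas))
```

Each row of betas corresponds to one window and each column to one asset. Matching on column names keeps every dependent variable paired with its own regressor.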
I am just starting to pick up R to help with a new retailing project, and although I can punch in some basic functions, I am looking for a way to do some sales comparisons more efficiently. The following is a condensed example.
I would like to compare the means for total purchases by six different types of customers (noted using the factor MemberType with 6 levels, one for each type of rewards membership enrollment).
Although I can certainly do something like this:
>m<-t.test(TotalPurchase[MemberType=='2'],TotalPurchase[MemberType=='4'])
for each pair, my objective here is to avoid running the test for each pair of factor levels manually.
At this early stage I do not understand conceptually how to go about this. Is it possible to use the function across a vector of unique factor levels, e.g.
>tp<-data.frame(levels(MemberType))
? If so, are there any insights on how/whether to construct a nested for-loop something like
>for(i in tp) function(i){
>##something like tt<-t.test(TotalPurchase[MemberType==i],##
>##+t.test(TotalPurchase[MemberType==i])##
>+}
with an additional layer? I have also monkeyed around with the 'apply' family of functions but am stumped by 1)the need for two inputs into the two-sample t.test
and by 2)the indexing syntax in the for() and lapply() arguments that tell R what vector of values to use in the t-test.
Any specific help on this problem or polite guidance on my formatting in R (or in Stack Overflow) will be greatly appreciated by this novice. Thanks!
The most straightforward way is to use pairwise.t.test() in the stats package.
But you need to be clear on what type of multiple-test adjustment
you'd like to use to control your family wise error rate. So, it's
really a statistics question and not a programming question. Do you
have a preference between Bonferroni or others?
You're also unclear on whether you're using pooled variance or not.
Finally, your data is unclear.
If you have a data frame with the variables purchases (the measurement variable), MemberType (the customer category), and ItemType (the item category), and you want Bonferroni corrections and unpooled SD, this will work for the example item type == "d":
df <- data.frame(purchases= rnorm(100, 50, 20),
MemberType= factor(sample(c("a", "b", "c"), 100, replace= TRUE)),
ItemType= factor(sample(c("d","e","f"), 100, replace=TRUE)))
df2 <- df[df$ItemType == "d", ]
pairwise.t.test(df2$purchases, df2$MemberType, p.adj= "bonf", pool.sd= FALSE)
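For the looping approach the question asks about, one option is to generate the pairs of factor levels with combn() and apply t.test() to each pair. This is a sketch on simulated data, with hypothetical names (pairs, pvals, adj); the Bonferroni adjustment is applied afterwards with p.adjust().

```r
set.seed(123)

# Simulated data: three member types with 30 purchases each
df <- data.frame(
  purchases  = rnorm(90, 50, 20),
  MemberType = factor(rep(c("a", "b", "c"), each = 30))
)

# All unordered pairs of factor levels: (a, b), (a, c), (b, c)
pairs <- combn(levels(df$MemberType), 2, simplify = FALSE)

# Welch two-sample t-test for each pair; keep the p-value
pvals <- sapply(pairs, function(p) {
  t.test(df$purchases[df$MemberType == p[1]],
         df$purchases[df$MemberType == p[2]])$p.value
})
names(pvals) <- sapply(pairs, paste, collapse = " vs ")

# Adjust for multiple comparisons
adj <- p.adjust(pvals, method = "bonferroni")
print(adj)
```

The unadjusted p-values here should match what pairwise.t.test() reports with pool.sd = FALSE, so the loop is mainly useful when you need the full t.test objects (estimates, confidence intervals) for each pair.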
Please provide a complete description of your problem and I can update this solution as needed.