I am trying to fit values in my algorithm so that I can predict next month's number. I am getting a "No data for variable" error even though I have clearly defined the objects that I am putting into the equation.
I've tried placing them in vectors so that one vector can be used as a training data set to predict the new values. The current script has worked for me on a different dataset, but for some reason it isn't working here.
The data is small, so I was wondering if that has anything to do with it. The data is:
Month  io  obs  Units Sold
   12  in    1         114
    1  in    2          29
    2  in    3         105
    3  in    4          30
    4  in    5
I'm trying to predict Units Sold with the code below
matt<-TEST1
isdf<-matt[matt$month<=3,]
isdf<-na.omit(isdf)
osdf<-matt[matt$Units.Sold==4,]
lmfit<-lm(Units.Sold~obs+Month,data=isdf,na.action=na.omit)
predict(lmFit,osdf[1,1])
I am expecting to be able to place lmfit in predict and get an output.
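A minimal sketch of one possible fix, assuming the error comes from the mismatched object names (lmfit vs lmFit, month vs Month) and from passing a single cell instead of a data frame to predict(); the row to predict is assumed to be the one with Month equal to 4:
matt <- TEST1
isdf <- na.omit(matt[matt$Month <= 3, ])   # in-sample rows (note the capital M in Month)
osdf <- matt[matt$Month == 4, ]            # assumed out-of-sample row, the one with no Units.Sold
lmfit <- lm(Units.Sold ~ obs + Month, data = isdf)
predict(lmfit, newdata = osdf)             # newdata must be a data frame containing obs and Month
With only three in-sample rows the model is fit exactly (zero residual degrees of freedom), so any prediction should be treated with caution.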
I have been writing code to make it easier to perform a discriminant analysis using the lda function. But there is one step I cannot solve: introducing the name of the categorical column into the code. Imagine we have the following table (called smoke), in which the column Factor represents the groups (in our case, smoker and nsmok).
smoke
Factor Lung Heart Blood
1 smoker 7 22 15
2 smoker 8 21 12
3 nsmok 22 9 5
This is the code I have been preparing. Please look at the XXXX's in the code (they appear twice). I want the name of the categorical column to be filled in automatically, instead of writing it directly in both places.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it, but the code still produces errors when run.
Alternatively, I thought that assigning the name directly in this way:
Column_Factor-> Factor
and then writing Column_Factor in the two places in the code would solve it, but it doesn't.
Any ideas?
You could do something like this:
library(MASS)
# gets the name of the factor column (this assumes there is exactly one factor column)
Column_Factor <- names(Smoke)[sapply(Smoke, is.factor)]
# builds the formula by pasting the column name onto the right-hand side
lda <- lda(as.formula(paste(Column_Factor, "~.", sep = "")), data = Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p <- predict(lda)
# selects the grouping column using the variable
Tabla <- table(Smoke[, Column_Factor], lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
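If you prefer not to paste strings, base R's reformulate() builds the same formula from the detected column name; a small equivalent sketch (lda_fit is just a name chosen here to avoid reusing lda):
form <- reformulate(".", response = Column_Factor)   # produces Factor ~ .
lda_fit <- lda(form, data = Smoke)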
I have a training data set which has 700 records. I prepared the model using the C5.0 function with this data.
library(C50)
abc_model <- C5.0(abc_train[-5], abc_train$resultval)
I have test data, which has 5000 records.
I am using the predict function to make predictions on these 5000 records.
abc_Test <- read.csv("FullData.csv", quote="")
abc_pred <- predict(abc_model, abc_test)
This is giving me predictions for ONLY 700 records, not all 5000.
How can I make it predict all 5000?
When the train data is larger than the test data, the result is fine: I get predictions for all records, and I can combine the test data with the results and write the output to a .CSV. But when the train data is smaller than the test data, not all records get predicted.
x <- data.frame(abc_test, abc_pred)
Any input on how to overcome this problem? I am not an expert in R. Any suggestions will help me a lot.
Thanks Richard.
Below is my train data (a few records):
Id       Value1  Value2  Country        Result
20835        63       1  United States  yes
3911156      60      12  Romania        no
39321        10       3  United States  no
29425        80       9  Australia      no
Below is my test data, again just a few records:
Id       Value1  Value2  Country
3942587     114      12  United States
3968314      25      13  Sweden
3973205      83      10  Russian Federation
17318       159       9  Russian Federation
I am trying to find the Result value and append it to my test data. But, like I described, I am getting the Result for only 700 records, not all 5000.
You should try this:
str(abc_train)
str(abc_test)
# tabulate the factor levels present in the training predictors...
lapply(abc_train[ names(abc_train) != "Result" ], table)
# ...and compare them with the levels present in the test data
lapply(abc_test, table)
Then you will probably find that some of the levels for some of the variables in abc_test were not present in abc_train, so estimates could not be produced for those rows. I'm guessing you expected the numeric values to be handled as though a regression had been done, but that won't happen in any prediction function if those columns are factors, and perhaps never, depending on the function's behavior. Looking at C50::C5.0.default, it appears there may be no regression option for variables.
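If unseen factor levels do turn out to be the problem, one possible workaround (a sketch, assuming the train and test data frames share column names) is to re-code each factor column of the test set to the level set seen in training, so that unseen levels become NA rather than unknown levels:
# re-level every factor column in the test set to the levels seen in training;
# values whose level was never seen in training become NA
for (col in intersect(names(abc_train), names(abc_test))) {
  if (is.factor(abc_train[[col]])) {
    abc_test[[col]] <- factor(abc_test[[col]], levels = levels(abc_train[[col]]))
  }
}
abc_pred <- predict(abc_model, abc_test)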
I have data showing when an animal came to a survey station (example csv file here). The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function from the circular package, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this thread about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with these data is to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information-theoretic method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12, an operation that is not defined for objects of that class:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So it seems that you cannot use data of this class as input for the circular package; I could not find any mention of POSIXlt data in its examples. Maybe you need to specify the timestamps simply as a number, not as a POSIXlt object.
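For example (a sketch, assuming the hour of day expressed as a decimal number is an acceptable representation of activity time), you could convert the timestamps to a numeric hour and build the circular response from that instead of from the POSIXct column:
# numeric hour of day, e.g. 14:23 becomes roughly 14.38
data$HourNum <- as.numeric(format(data$DateTime, "%H")) +
                as.numeric(format(data$DateTime, "%M")) / 60
y <- circular(data$HourNum, units = "hours", template = "clock24", rotation = "clock")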
Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to run Fisher's exact test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge, as explained here (I think; I can't follow it completely). As I use both R and Stata, I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() up to a maximum of 1000 (but it may take a day or more before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out in both programs. So the question is: how can I do this calculation in either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories,
i.e. apply Fisher's test to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata?
Edit: whilst this can be done in Stata, it will not work for my dataset as I have too many categories. Stata goes through endless iterations, and even being left for a day or more does not produce a solution.
My question is really: can R do this, and can it do it quickly?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
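Worth noting for the larger 4 x 5 table in the rewritten question: fisher.test() also accepts a workspace argument for the exact network algorithm and a simulate.p.value option that estimates the p-value by Monte Carlo simulation instead of exact enumeration, for example:
fisher.test(Job, workspace = 2e7)                   # more workspace for the exact computation
fisher.test(Job, simulate.p.value = TRUE, B = 1e5)  # Monte Carlo p-value with 100,000 replicates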
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway, where the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables, and the very first example there of Fisher's exact test underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!
I have observed nurses during 400 episodes of care and recorded the sequence of surface contacts in each.
I categorised the surfaces into 5 groups (1:5) and calculated the probability of touching any one of the groups (PDF):
PDF=[ 0.255202629 0.186199343 0.104052574 0.201533406 0.253012048]
I then predicted 1000 sequences using:
for i=1:1000 % 1000 different nurses
seq(i,1:end)=randsample(1:5,max(observed_seq_length),'true',PDF);
end
eg.
seq = 1 5 2 3 4 2 5 5 2 5
stairs(1:max(observed_seq_length),seq)
hold all
I'd like to compare my empirical sequences with my predicted ones. What would you suggest is the best strategy or property to look at?
Regards,
EDIT: I added r as a tag because this question may fall more easily under that category due to its nature, rather than because of the MATLAB code.
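Since the question is tagged r, here is one hedged comparison that could be run in R rather than MATLAB: independent sampling from the PDF ignores any ordering in the data, so the surface-to-surface transition counts of an observed sequence versus a simulated one are one property worth looking at (the sequence below is just the ten-step example from the post, standing in for a real observed sequence):
PDF <- c(0.255202629, 0.186199343, 0.104052574, 0.201533406, 0.253012048)
obs_seq <- c(1, 5, 2, 3, 4, 2, 5, 5, 2, 5)                      # stand-in for a real observed sequence
sim_seq <- sample(1:5, length(obs_seq), replace = TRUE, prob = PDF)
# transition counts: rows are the current surface group, columns the next one
transition_table <- function(s) table(from = head(s, -1), to = tail(s, -1))
transition_table(obs_seq)
transition_table(sim_seq)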