Length mismatch with model from Machine Learning MDA package - r

Can someone help me even phrase what I am trying to do? (I am new to this.)
I am trying out Machine Learning in R now that I have it nailed in Matlab. R is just a passion of mine at the moment.
Data:
> head(newzap1209, n=5)
buoy_douglas avgtopsum avgstdwin1 stddiff2
1 3 -12.097720 410.4747 410.6323
2 2 -10.462240 260.7213 263.2085
3 2 -11.539432 357.1802 362.3258
4 2 -9.524074 234.8285 234.8571
5 3 -11.498597 356.4736 359.4485
Code:
library(mda)
fit<-mda(buoy_douglas~.,data=newzap1209)
summary(fit)
predictions<-predict(fit,newzap1209[,2:4])
table(predictions,newzap1209$buoy_douglas)
Error message:
Error in table(predictions, newzap1209$buoy_douglas) : all arguments must have the same length
Everything works except the table!
Same goes for the confusion matrix.

The error says that predictions and newzap1209$buoy_douglas have different lengths. That should not happen, since the predictions were generated from newzap1209[,2:4], which has one row per observation.
Check the length of each object and work out why they differ.
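A minimal sketch of that check, using a small made-up data frame in place of newzap1209 and a hand-built factor in place of the predict() output. The idea that rows with missing values explain the mismatch is only an assumption here; the check itself applies whatever the cause turns out to be.

```r
# Toy stand-in for newzap1209 (values are invented for illustration)
toy <- data.frame(
  buoy_douglas = c(3, 2, 2, 2, 3),
  avgtopsum    = c(-12.1, -10.5, NA, -9.5, -11.5)
)

# Hypothetical stand-in for the predict() output, one element short
preds <- factor(c(3, 2, 2, 3))

# Step 1: compare the two lengths that table() complained about
length(preds)
length(toy$buoy_douglas)

# Step 2: count rows with missing values, one possible cause of dropped rows
sum(!complete.cases(toy))

# Step 3: if incomplete rows are the cause, tabulate against complete rows only
ok  <- complete.cases(toy)
tab <- table(predicted = preds, observed = toy$buoy_douglas[ok])
tab
```

If the two lengths already agree on the real data, the problem lies elsewhere, for example in the class of the object that predict() returns.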

Related

I am having trouble trying to do the statistical analysis part for my bio class

This is my first time asking a question here. I am working on a lab that was due a while ago, though I was able to get an extension, and I am no longer sure what I am doing. I have to do a statistical analysis using one of four tests: correlation, linear regression, t-test, or ANOVA.
Right now I am stuck just getting my dataset to read into R in a wide format; what it currently looks like is shown in the attached image. I have managed the bare minimum, which is getting the file read, but my lesson says the data needs to be in wide format, and it does not appear to be. I know I need to run an ANOVA because more than two categories are being tested, but I do not know how to change the variable names, nor how to run the test while the data is not being read the way I want. Any suggestions would be helpful! Thanks.
edit: here's my code
# Statistical Data for Lab 2: Measuring Diversity
Lab2 <- read.csv2('Lab2Measure.csv')
Lab2_wide <- Lab2
to which it gives me the following output:
> X.x1.y1.z1.x2.y2.z2
> 1 1,4,80,10,4,100,0
> 2 2,5,90,5,6,90,5
> 3 3,3,100,20,5,90,0
> 4 4,6,60,5,6,57,0
> 5 5,8,70,3,6,95,2
> 6 6,5,95,6,5,25,0
> 7 7,5,80,15,3,90,10
> 8 8,3,75,20,4,80,0
> 9 9,5,70,25,3,85,10
> 10 10,7,95,5,6,97,2
> 11 11,6,90,2,5,90,0.5
> 12 12,5,70,1,5,75,5
> 13 13,3,60,15,3,97,1
> 14 14,4,90,10,2,70,0
> 15 15,3,85,8,3,98,1
> 16 16,2,96,17,8,90,5
> 17 17,5,70,20,5,98,1
> 18 18,3,40,10,4,80,9
> 19 19,3,80,15,4,95,0
> 20 20,1,90,2,2,92,0
> 21 21,2,75,7,2,96,5
but please refer to the photo provided to understand my woes
When you see a column name like:
X.x1.y1.z1.x2.y2.z2
It means you didn't give the correct separator to the read function:
The default separator for read.table is whitespace, and the default for read.csv2 is a semicolon. read.csv and read.csv2 are wrappers around read.table with different defaults, so the sep parameter can be used with any of them. Since your file is comma-separated, the simplest fix is the one user 20650 suggests:
Lab2 <- read.csv("Lab2Measure.csv")
Rather than posting images of datasets or results, you should copy the text from the console and paste it into the question.
The default value for the separator in read.csv2() is ; but it seems that the separator in your data is ,. So you should add sep="," to the code to make it work correctly.
Lab2 <- read.csv2('Lab2Measure.csv', sep=",")
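The difference between the two readers can be reproduced without the file. The sketch below feeds a few of the rows shown above to both functions through the text argument; the header line is an assumption reconstructed from the column name X.x1.y1.z1.x2.y2.z2.

```r
# A few rows from the question; the header line is a reconstruction
csv_text <- ",x1,y1,z1,x2,y2,z2
1,4,80,10,4,100,0
2,5,90,5,6,90,5
11,6,90,2,5,90,0.5"

# read.csv2 splits on ";" so each line ends up in a single column
wrong <- read.csv2(text = csv_text)
names(wrong)

# read.csv splits on "," and yields one column per variable
right <- read.csv(text = csv_text)
names(right)
str(right)
```

Note that read.csv2 also defaults to dec = ",", so for a file with comma separators and decimal points, read.csv is probably the safer choice over read.csv2 with sep = ",".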

How to fix linear model fitting error in S-plus

I am trying to fit a model so that I can predict next month's number. I am getting a "No data for variable" error even though I have clearly defined the objects that I am putting into the equation.
I've tried to place them in vectors so that one vector could be used as a training data set to predict the new values. The current script has worked for me on a different dataset, but for some reason it isn't working here.
The data is small so I was wondering if that has anything to do with it. The data is:
Month io obs Units Sold
12 in 1 114
1 in 2 29
2 in 3 105
3 in 4 30
4 in 5
I'm trying to predict Units Sold with the code below
matt<-TEST1
isdf<-matt[matt$month<=3,]
isdf<-na.omit(isdf)
osdf<-matt[matt$Units.Sold==4,]
lmfit<-lm(Units.Sold~obs+Month,data=isdf,na.action=na.omit)
predict(lmFit,osdf[1,1])
I am expecting to be able to place lmfit in predict and get an output.
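A possible starting point, written in R rather than S-plus and so only a sketch. It assumes the error comes from inconsistent names: the code uses both month and Month, and both lmfit and lmFit, and names are case-sensitive. It also assumes the row to predict is the one with a missing Units.Sold, and it passes predict() a data frame holding the predictor columns rather than a single cell. The TEST1 data frame below is rebuilt by hand from the table in the question.

```r
# Rebuild the small data set from the question (Units.Sold missing for month 4)
TEST1 <- data.frame(
  Month      = c(12, 1, 2, 3, 4),
  io         = "in",
  obs        = 1:5,
  Units.Sold = c(114, 29, 105, 30, NA)
)

# In-sample rows: those with an observed response
isdf <- TEST1[!is.na(TEST1$Units.Sold), ]

# Out-of-sample row: the one still to be predicted
osdf <- TEST1[is.na(TEST1$Units.Sold), ]

# Use one spelling for every name: Month, Units.Sold, lmfit
lmfit <- lm(Units.Sold ~ obs + Month, data = isdf)

# newdata must be a data frame containing obs and Month
pred <- predict(lmfit, newdata = osdf)
pred
```

With four observations and three coefficients the fit has a single residual degree of freedom, so the prediction should be treated with caution.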

R: Stanford CoreNLP returning NAs for getSentiment

I have the following text data:
I always prefer old-school guy. I have a PhD degree in science. I am
really not interested in finding someone with the same background,
otherwise life is gonna be boring.
I am trying to extract the sentiment scores of the above text, but all I get is NAs.
dating3 = annotateString(bio)
bio.emo = getSentiment(dating3)
id sentimentValue sentiment
1 1 NA NA
2 2 NA NA
3 3 NA NA
I do not know why this is occurring; I googled around but did not find any relevant answers. Meanwhile, when I tried the sample data provided within the coreNLP package
getSentiment(annoHp)
id sentimentValue sentiment
1 1 4 Verypositive
It gives me an answer, so I don't know why this is happening. Would greatly appreciate if anyone can offer some insight.
Hopefully you have already found this by now, but for you and anyone else: this is a known bug that is fixed in the GitHub version. See https://github.com/statsmaths/coreNLP/issues/9

Circular-linear regression with covariates in R

I have data showing when an animal came to a survey station (example csv file here). The first few lines of data look like this:
Site_ID DateTime HourOfDay MinTemp LunarPhase Habitat
F1 6/12/2013 14:01:00 14 -1 0 river
F1 6/12/2013 14:23:00 14 -1 0 river
F2 6/13/2013 1:21:00 1 3 1 upland
F2 6/14/2013 1:33:00 1 4 2 upland
F3 6/14/2013 1:48:00 1 4 2 river
F3 6/15/2013 11:08:00 11 0 0 river
I would like to perform a circular-linear regression in R to determine peak activity times. The dependent variable could be DateTime or HourOfDay, whichever is easier. I would like to incorporate the covariates Site_ID (random effect), plus MinTemp, LunarPhase, and Habitat into a mixed-effects model.
I have tried using the lm.circular function of program circular, and have the following code:
data<-read.csv("StackOverflowExampleData.csv")
data$DateTime<-as.POSIXct(as.character(data$DateTime), format = "%m/%d/%Y %H:%M:%S")
data$LunarPhase<-as.factor(data$LunarPhase)
str(data)
library(circular)
y<-data$DateTime
y<-circular(y, units ="hours",template = "clock24",rotation = "clock")
x<-data[,c(1,4,5,6)]
lm.circular(y=y, x=x, init=c(1,1,1,1), type='c-l', verbose=TRUE)
I keep getting the error:
Error in Ops.POSIXt(x, 12) : '/' not defined for "POSIXt" objects
Apparently this is a known bug, but I was confused by this thread about it and could not determine an appropriate work-around. Suggestions?
Also, my ultimate goal with this data was to run a circular-linear version of a glm, and then test several models against one another using AIC or some other information theoretics method. The model I'm seeking would be a circular-linear version of something like this:
glmer(HourOfDay~MinTemp+LunarPhase+Habitat+(1|Site_ID),family=binomial,data=data)
Perhaps this is an inappropriate application of the circular package. If so, I'm open to other suggestions of models and/or graphics that would investigate peak activity using the data and covariates.
Note: I did search for related discussions and found this somewhat relevant thread, but it was never answered, did not request a solution in R, and was of a different scope.
The specific problem is caused by conversion.circular. There, a POSIXlt object is divided by 12. This is an operation that has a non-defined outcome:
> as.POSIXlt('2005-07-16') / 2
Error in Ops.POSIXt(as.POSIXlt("2005-07-16"), 2) :
'/' not defined for "POSIXt" objects
So, it seems that you cannot use data of this class as input for the circular package. I could not find any mention of POSIXlt data in the examples. Maybe you need to specify the timestamps simply as a number, not as a POSIXlt object.
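One way to follow that suggestion is to reduce each timestamp to a decimal hour of day before building the circular object. The sketch below uses three timestamps copied from the question; the tz = "UTC" setting is an assumption, and the circular() call is guarded so the conversion step runs even where the package is not installed.

```r
# Three timestamps from the question, parsed as in the original code
stamps <- as.POSIXct(
  c("6/12/2013 14:01:00", "6/13/2013 1:21:00", "6/15/2013 11:08:00"),
  format = "%m/%d/%Y %H:%M:%S", tz = "UTC"
)

# Convert to plain numbers: hour of day with minutes and seconds as fractions
lt    <- as.POSIXlt(stamps)
hours <- lt$hour + lt$min / 60 + lt$sec / 3600
hours

# Numeric hours can be handed to circular(); POSIXt objects cannot
if (requireNamespace("circular", quietly = TRUE)) {
  y <- circular::circular(hours, units = "hours", template = "clock24")
  print(y)
}
```

This only addresses the POSIXt error. Whether lm.circular can accommodate the factor covariates and the random effect for Site_ID is a separate question.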

Fisher test more than 2 groups

Major Edit:
I decided to rewrite this question since my original was poorly put. I will leave the original question below to maintain a record. Basically, I need to do Fisher's Test on tables as big as 4 x 5 with around 200 observations. It turns out that this is often a major computational challenge as explained here (I think, I can't follow it completely). As I use both R and Stata I will frame the question for both with some made-up data.
Stata:
tabi 1 13 3 27 46 \ 25 0 2 5 3 \ 22 2 0 3 0 \ 19 34 3 8 1 , exact(10)
You can increase exact() to 1000 max (but it will take maybe a day before returning an error).
R:
Job <- matrix(c(1,13,3,27,46, 25,0,2,5,3, 22,2,0,3,0, 19,34,3,8,1), 4, 5,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", ">40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "exstatic")))
fisher.test(Job)
For me, at least, it errors out on both programs. So the question is how to do this calculation on either Stata or R?
Original Question:
I have Stata and R to play with.
I have a dataset with various categorical variables, some of which have multiple categories.
Therefore I'd like to do Fisher's exact test with more than 2 x 2 categories
i.e. apply Fisher's to a 2 x 6 table or a 4 x 4 table.
Can this be done with either R or Stata ?
Edit: whilst this can be done in Stata - it will not work for my dataset as I have too many categories. Stata goes through endless iterations and even being left for a day or more does not produce a solution.
My question is really - can R do this, and can it do it quickly ?
Have you studied the documentation of R function fisher.test? Quoting from help("fisher.test"):
For 2 by 2 cases, p-values are obtained directly using the (central or
non-central) hypergeometric distribution. Otherwise, computations are
based on a C version of the FORTRAN subroutine FEXACT which implements
the network developed by Mehta and Patel (1986) and improved by
Clarkson, Fan and Joe (1993).
This is an example given in the documentation:
Job <- matrix(c(1,2,1,0, 3,3,6,1, 10,10,14,9, 6,7,12,11), 4, 4,
dimnames = list(income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")))
fisher.test(Job)
# Fisher's Exact Test for Count Data
#
# data: Job
# p-value = 0.7827
# alternative hypothesis: two.sided
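For a larger table such as the 4 x 5 one in the edited question, where the exact network algorithm may run out of workspace, fisher.test offers two documented arguments: workspace, which enlarges the memory available to the algorithm, and simulate.p.value (with B replicates), which replaces the exact computation with a Monte Carlo estimate. The sketch below uses the second option. Using byrow = TRUE is an assumption made so that the rows match the Stata tabi command, and the result is an approximation rather than an exact p-value.

```r
# The 4 x 5 table from the question, filled row by row to mirror the Stata input
Job <- matrix(c( 1, 13, 3, 27, 46,
                25,  0, 2,  5,  3,
                22,  2, 0,  3,  0,
                19, 34, 3,  8,  1),
              nrow = 4, ncol = 5, byrow = TRUE,
              dimnames = list(
                income = c("< 15k", "15-25k", "25-40k", "> 40k"),
                satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS", "Ecstatic")))

# Monte Carlo p-value; the seed only makes the run reproducible
set.seed(1)
mc <- fisher.test(Job, simulate.p.value = TRUE, B = 10000)
mc$p.value

# Alternative (not run here): enlarge the workspace for the exact algorithm
# fisher.test(Job, workspace = 2e8)
```

The smallest p-value the simulation can report is 1 / (B + 1), so B should be increased if finer resolution is needed.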
As far as Stata is concerned, your original statement was totally incorrect. search fisher leads quickly to help tabulate twoway, and the help for the exact option explains that it may be applied to r x c as well as to 2 x 2 tables. The very first example of Fisher's exact test in the same place underlines that Stata is not limited to 2 x 2 tables.
It's a minimal expectation anywhere on this site that you try to read basic documentation. Please!

Resources