I am trying to carry out a chi-square test to see if there is a significant difference in disease proportions between regions, but I end up with an error in R. Any suggestions on how to correct it?
data:
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92
d=read.csv(file.choose(),header=T)
attach(d)
chisq.test(d)
Error in chisq.test(d) :
all entries of 'x' must be nonnegative and finite
Your problem must be somewhere upstream of the chi-squared test, i.e. the data are getting mangled somehow when being read in.
d <- read.table(header=TRUE,text="
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92")
However you read the data, results should look like this:
str(d)
## 'data.frame': 2 obs. of 8 variables:
## $ E : int 11 28
## $ NE: int 37 73
## ... etc.
chisq.test(d)
## Pearson's Chi-squared test
## data: d
## X-squared = 3.3405, df = 7, p-value = 0.8518
(attach() is not necessary, and usually actually harmful/confusing ...)
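One plausible way the data get mangled (an assumption, since the CSV isn't shown): the row labels "Cases"/"Non.cases" end up as a data column, or stray blank rows/columns introduce NAs, either of which chisq.test() rejects. A sketch of reading the table so the labels become row names rather than a column; here the CSV content is inlined via text= for reproducibility:

```r
# When the header has one fewer field than the data rows, read.csv()
# treats the first column as row names, so d is purely numeric:
d <- read.csv(text = "E,NE,NW,SE,SW,EM,WM,YH
Cases,11,37,54,30,114,44,31,39
Non.cases,28,73,116,68,211,80,78,92")
chisq.test(d)
# X-squared = 3.3405, df = 7, p-value = 0.8518 (as in the answer above)
```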
So these are the survey results. I have tried pairwise testing (pairwise.wilcox.test) on these results, collected in Spring and Autumn at these sites, but I can't get a specific p-value as to which site has the most influence.
This is the error message I keep getting. My dataset isn't even, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid the table out wrong for seeing how much site influences the results between Spring and Autumn:
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
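A likely cause of both errors is the arguments: pairwise.wilcox.test(x, g) expects a numeric response x and a grouping factor g, not two numeric columns. A hedged sketch, assuming the goal is to compare sites within a season; the data frame here is a small illustrative subset of the table above, reshaped to long format so that a response and a grouping variable both exist:

```r
# Small subset of the Site/Autumn/Spring table from the question:
MWU <- data.frame(
  Site   = rep(c("Stokes Bay", "Gosport Wall", "No Mans Land"), each = 3),
  Autumn = c(25, 54, 31, 213, 24, 54, 76, 66, 229),
  Spring = c(6, 6, 0, 16, 19, 60, 25, 68, 103)
)
# Wide -> long: one Count column plus a Season indicator
long <- reshape(MWU, direction = "long",
                varying = c("Autumn", "Spring"), v.names = "Count",
                timevar = "Season", times = c("Autumn", "Spring"))
# Which sites differ in their autumn counts? Numeric response first,
# grouping factor second:
aut <- long[long$Season == "Autumn", ]
pairwise.wilcox.test(aut$Count, aut$Site, p.adjust.method = "BH")
```

With tied counts the exact p-values can't be computed and R will warn, but the pairwise comparison table is still returned.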
What is the problem with the following R code? I get an error:
nonlinear <- function(G,Q,T) {
Y=G+Q*X^T
}
Model <- nls(nonlinear, start = list(G=0.4467, Q=-0.0020537, T=1), data=sample1)
Error: object of type 'closure' is not subsettable
Taking the data from your other question "Nonlinear modelling starting values" and the code from @Roland, this works:
sample1 <- read.table(header=TRUE, text=
"X Y Z
135 -0.171292376 85
91 0.273954718 54
171 -0.288513438 107
88 -0.17363066 54
59 -1.770852012 50
1 0 37
1 0 32
1 0.301029996 36
2 -0.301029996 39
1 1.041392685 30
11 -0.087150176 42
9 0.577236408 20
34 -0.355387658 28
15 0.329058719 17
32 -0.182930683 24
21 0.196294645 21
33 0.114954516 91
43 -0.042403849 111
39 -0.290034611 88
20 -0.522878746 76
6 -0.301029995 108
3 0.477121254 78
9 0 63
9 0.492915522 51
28 -0.243038048 88
16 -0.028028724 17
15 -0.875061263 29
2 -0.301029996 44
1 0 52
1 1.531478917 65")
nonlinear<-function(X,G,Q,T) G+Q*X^T
nls(Y ~ nonlinear(X,G,Q,T), start=list(G=-0.4, Q=0.2, T=-1), data=sample1)
Depending on the data, I had to change the starting values!
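For reference, the original error arises because nls() was handed the function object itself instead of a formula; trying to subset a function ("closure") is what triggers "object of type 'closure' is not subsettable". Once the formula interface is used, the fit can be inspected with the usual extractors. A self-contained sketch on simulated data (not the sample1 above), with starting values chosen near the simulated truth:

```r
set.seed(1)
sim <- data.frame(X = 1:50)
sim$Y <- 0.5 + 0.2 * sim$X^0.8 + rnorm(50, sd = 0.1)

# nls() takes a formula, not a function object:
fit <- nls(Y ~ G + Q * X^T, data = sim,
           start = list(G = 0.4, Q = 0.25, T = 0.9))
coef(fit)                               # estimates of G, Q, T
predict(fit, newdata = data.frame(X = 10))
```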
I am using the createFolds function in R to create folds, which returns a successful result. But when I use a loop to perform some calculation on each fold, I get the error below.
Code is:
set.seed(1000)
k <- 10
folds <- createFolds(train_data,k=k,list = TRUE, returnTrain = FALSE)
str(folds)
This is giving output as:
List of 10
$ Fold01: int [1:18687] 1 8 10 21 22 25 26 29 34 35 ...
$ Fold02: int [1:18685] 5 11 14 32 40 46 50 52 56 58 ...
$ Fold03: int [1:18685] 16 20 39 47 49 77 78 83 84 86 ...
$ Fold04: int [1:18685] 3 15 30 38 41 44 51 53 54 55 ...
$ Fold05: int [1:18685] 7 9 17 18 23 37 42 67 75 79 ...
$ Fold06: int [1:18686] 6 31 36 48 72 74 90 113 114 121 ...
$ Fold07: int [1:18686] 2 33 59 61 100 103 109 123 137 161 ...
$ Fold08: int [1:18685] 24 64 68 87 88 101 110 130 141 152 ...
$ Fold09: int [1:18684] 4 27 28 66 70 85 97 105 112 148 ...
$ Fold10: int [1:18684] 12 13 19 43 65 91 94 108 134 138 ...
However, the code below gives me an error.
for( i in 1:k ){
testData <- train_data[folds[[i]], ]
trainData <- train_data[(-folds[[i]]), ]
}
Error is:
> for( i in 1:k ){
+ testData <- train_data[folds[[i]], ]
+ trainData <- train_data[(-folds[[i]]), ]
+ }
Error in train_data[folds[[i]], ] : subscript out of bounds
I tried different seed values, but I get the same error.
Any help is appreciated.
Thank you!
As per my understanding, your problem arises because you are using the whole data frame train_data to create the folds. Folds should be generated from the samples, i.e. the rows of the dataset.
For instance:
data(spam) # from package kernlab
dim(spam) #has 4601 rows/samples
folds <- createFolds(y=spam$type, k=10, list=T, returnTrain = T)
# Here only one column, spam$type, is used
# and indeed
max(unlist(folds)) #4601
#and these can be used as row indices
head( spam[folds[[4]], ] )
Using the whole data frame is very similar to using a matrix. Such a matrix is first converted to a vector, so a 5x10 matrix becomes a 50-element vector, and the values in folds then correspond to the indices of that vector. If you try to use those values as row indices for your data frame, they will overshoot:
r <- 8
c <- 10
m0 <- matrix(rnorm(r*c), r, c)
features<-apply(m0, c(1,2), function(x) sample(c(0,1),1))
features
folds<-createFolds(features,4)
folds
max(unlist(folds))
m0[folds[[2]],] # Error in m0[folds[[2]], ] : subscript out of bounds
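The fix, then, is to pass createFolds() a single vector with one entry per row, e.g. the outcome column, as in the spam example above. To keep this sketch self-contained without caret installed, the fold indices here are generated with base R's sample()/split(), which produces the same kind of list that createFolds(train_data$y, k) returns:

```r
set.seed(1000)
train_data <- data.frame(x = rnorm(100), y = rnorm(100))
k <- 10

# One fold label per ROW, so indices can never exceed nrow():
fold_id <- sample(rep(1:k, length.out = nrow(train_data)))
folds <- split(seq_len(nrow(train_data)), fold_id)

max(unlist(folds))  # equals nrow(train_data), no overshoot
for (i in seq_along(folds)) {
  testData  <- train_data[folds[[i]], ]
  trainData <- train_data[-folds[[i]], ]
}
```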
I tried to build a predictive model in R using a decision tree, with this code:
library(rpart)
library(caret)
DataYesNo<-read.csv('DataYesNo.csv',header=T)
worktrain<- sample(1:50,40)
worktest <- setdiff(1:50,worktrain)
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "ICUtransfer"
tree<- rpart(ICUtransfer~Temperature+RespiratoryRate+HeartRate+SystolicBP+OxygenSaturations,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
fitted <- predict(tree, DataYesNo[worktest, c(input,target)])
cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer)
print(cmatrix)
tree
plot(tree)
text(tree)
I got an error at cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer):
"$ operator is invalid for atomic vectors"
Please help me solve this.
Regards,
DataYesNo[worktest,]
Temperature RespiratoryRate HeartRate SystolicBP OxygenSaturations ICUtransfer
11 36.3 26 65 140 97 no
15 37.3 20 80 129 99 no
21 36.9 20 72 154 95 no
26 36.0 28 56 199 97 no
30 36.9 20 72 150 96 no
34 36.6 16 97 118 95 yes
36 36.0 20 77 145 97 yes
38 36.0 20 77 145 97 yes
43 36.3 28 98 116 95 yes
47 36.0 20 77 145 97 yes
I tried this line:
cmatrix <- confusionMatrix(fitted, DataYesNo[worktest,]$ICUtransfer)
but I got this error: Error in confusionMatrix.default(fitted, DataYesNo[worktest, ]$ICUtransfer) :
the data and reference factors must have the same number of levels
Can anyone help?
You're getting that error because worktest doesn't have any factor called ICUtransfer. worktest is just a numeric vector of indices, and thus has no factors. You want the subset of your data corresponding to the worktest indices.
It's impossible to know what exactly needs to be done, because I can't see into the data structures you're using.
Instead of worktest$ICUtransfer, try using DataYesNo[worktest, target], i.e. the ICUtransfer column of the test rows.
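As for the follow-up error ("the data and reference factors must have the same number of levels"): predict() on an rpart classification tree returns a matrix of class probabilities by default, so the first argument was not a factor at all. Requesting type = "class" makes both arguments factors with matching levels. A self-contained sketch on simulated data (variable names here are illustrative, not from the original dataset):

```r
library(rpart)  # ships with R
set.seed(1)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(60, sd = 0.3) > 0, "yes", "no"))
tr <- sample(1:60, 45)

fit <- rpart(y ~ x1 + x2, data = d[tr, ], method = "class")
# Without type = "class", predict() returns a probability matrix;
# with it, a factor that confusionMatrix()/table() can compare:
pred <- predict(fit, d[-tr, ], type = "class")
table(pred, d$y[-tr])   # base-R confusion matrix
```

caret's confusionMatrix(pred, d$y[-tr]) would work equally well here; table() just avoids the extra dependency.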
In R, let's say we have a vector
area = c(rep(c(26:30), 5), rep(c(500:504), 5), rep(c(550:554), 5), rep(c(76:80), 5)) and another vector yield = c(1:100).
Now, say I want to index like so:
> yield[area==27]
[1] 2 7 12 17 22
> yield[area==501]
[1] 27 32 37 42 47
No problem, right? But weird things start happening when I try to index using c(A, B) (and weirder still when I try c(min:max)...):
> yield[area==c(27,501)]
[1] 7 17 32 42
What I'm expecting is, of course, all the instances from the two examples above, not just some odd subset of them. It works when I use the OR operator (|):
> yield[area==27 | area==501]
[1] 2 7 12 17 22 27 32 37 42 47
But what if I'm working with a range? Say I want to index by the range 27:503. In my real example there are a lot more data points and ranges, so please don't suggest I do it by hand, which would essentially mean:
yield[area==27 | area==28 | area==29 | ... | area==303 | ... | area==500 | area==501]
There must be a better way...
You want to use %in%. (With ==, the shorter vector c(27, 501) is recycled along area, so each element is compared against 27 and 501 alternately; that is why you got that odd subset.) Also notice that c(27:503) and 27:503 yield the same object.
> yield[area %in% 27:503]
[1] 2 3 4 5 7 8 9 10 12 13 14 15 17
[14] 18 19 20 22 23 24 25 26 27 28 29 31 32
[27] 33 34 36 37 38 39 41 42 43 44 46 47 48
[40] 49 76 77 78 79 80 81 82 83 84 85 86 87
[53] 88 89 90 91 92 93 94 95 96 97 98 99 100
Why not use subset?
subset(yield, area > 26 & area < 504) ## for indexes
subset(area, area > 26 & area < 504) ## for values
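The %in% approach also extends naturally to several disjoint ranges, which subset's single comparison chain does not. A sketch using the vectors defined in the question:

```r
area  <- c(rep(26:30, 5), rep(500:504, 5), rep(550:554, 5), rep(76:80, 5))
yield <- 1:100

# Any combination of ranges and single values in one index vector:
wanted <- c(27:29, 500:502, 551)
yield[area %in% wanted]
```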