I tried to build a predictive model in R using decision tree through this code:
library(rpart)
library(caret)
DataYesNo<-read.csv('DataYesNo.csv',header=T)
worktrain<- sample(1:50,40)
worktest <- setdiff(1:50,worktrain)
M <- ncol(DataYesNo)
input <- names(DataYesNo)[1:(M-1)]
target <- "ICUtransfer"
tree<- rpart(ICUtransfer~Temperature+RespiratoryRate+HeartRate+SystolicBP+OxygenSaturations,
data=DataYesNo[worktrain, c(input,target)],
method="class",
parms=list(split="information"),
control=rpart.control(usesurrogate=0, maxsurrogate=0))
fitted <- predict(tree, DataYesNo[worktest, c(input,target)])
cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer)
print(cmatrix)
tree
plot(tree)
text(tree)
I got error at : cmatrix <- confusionMatrix(fitted, worktest$ICUtransfer)
"$ operator is invalid for atomic vectors "
please help me to solve this?
Regards,
DataYesNo[worktest,]
Temperature RespiratoryRate HeartRate SystolicBP OxygenSaturations ICUtransfer
11 36.3 26 65 140 97 no
15 37.3 20 80 129 99 no
21 36.9 20 72 154 95 no
26 36.0 28 56 199 97 no
30 36.9 20 72 150 96 no
34 36.6 16 97 118 95 yes
36 36.0 20 77 145 97 yes
38 36.0 20 77 145 97 yes
43 36.3 28 98 116 95 yes
47 36.0 20 77 145 97 yes
I tried this line:
cmatrix <- confusionMatrix(fitted, DataYesNo[worktest,]$ICUtransfer)
but I got this error: Error in confusionMatrix.default(fitted, DataYesNo[worktest, ]$ICUtransfer) :
the data and reference factors must have the same number of levels
please anyone can help me?
You're getting that error because worktest doesn't have any factor called ICUtransfer. worktest is just a numeric vector of indices, and thus has no factors. You want the subset of your data corresponding to the worktest indices.
It's impossible to know what exactly needs to be done, because I can't see into the data structures you're using.
Instead of worktest$ICUtransfer try using DataYesNo[worktest, c(input,target)].
Related
So these are the survey results. I have tried to do pairwise testing (pairwise.wilcox.test) for these results collected in Spring and Autumn for these sites. But I can't get a specific P -value as to which site has the most influence.
This is the error message I keep getting. My dataset isn't even, ie there were some of the sites that were not surveyed in Spring which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure if I have laid it out in the table wrong to see how much site influences the results between Spring and Autumn
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring**
#V = 1066, p-value = 8.127e-08**
#alternative hypothesis: true location shift is not equal to 0******
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
I am using createFolds function in R to create folds which is returning successful result. But when I am using loop to perform some calculation on each fold I am getting below error.
Code is:
set.seed(1000)
k <- 10
folds <- createFolds(train_data,k=k,list = TRUE, returnTrain = FALSE)
str(folds)
This is giving output as:
List of 10
$ Fold01: int [1:18687] 1 8 10 21 22 25 26 29 34 35 ...
$ Fold02: int [1:18685] 5 11 14 32 40 46 50 52 56 58 ...
$ Fold03: int [1:18685] 16 20 39 47 49 77 78 83 84 86 ...
$ Fold04: int [1:18685] 3 15 30 38 41 44 51 53 54 55 ...
$ Fold05: int [1:18685] 7 9 17 18 23 37 42 67 75 79 ...
$ Fold06: int [1:18686] 6 31 36 48 72 74 90 113 114 121 ...
$ Fold07: int [1:18686] 2 33 59 61 100 103 109 123 137 161 ...
$ Fold08: int [1:18685] 24 64 68 87 88 101 110 130 141 152 ...
$ Fold09: int [1:18684] 4 27 28 66 70 85 97 105 112 148 ...
$ Fold10: int [1:18684] 12 13 19 43 65 91 94 108 134 138 ...
However below code is giving me error
for( i in 1:k ){
testData <- train_data[folds[[i]], ]
trainData <- train_data[(-folds[[i]]), ]
}
Error is:
> for( i in 1:k ){
+ testData <- train_data[folds[[i]], ]
+ trainData <- train_data[(-folds[[i]]), ]
+ }
Error in train_data[folds[[i]], ] : subscript out of bounds
I tried with different seed values but I am getting same error.
Any help is appreciated.
Thank you!
As per my understanding, your problem is arising because you are using the whole dataframe train_data to create folds. K-folds can be generated for samples, ie, rows of the dataset.
For instance:
data(spam) # from package kernlab
dim(spam) #has 4601 rows/samples
folds <- createFolds(y=spam$type, k=10, list=T, returnTrain = T)
# Here, only one column , spam$type, is used
# and indeed
max(unlist(folds)) #4601
#and these can be used as row indices
head( spam[folds[[4]], ] )
Using the whole dataframe is very similar to using a matrix. Such a matrix will first be converted to a vector. Thus a 5x10 matrix will actually be converted to 50 element vector and the values in folds will be corresponding to the indices of this vector. If you try to then use these values as row indices for your dataframe, they will overshoot
r <- 8
c <- 10
m0 <- matrix(rnorm(r*c), r, c)
features<-apply(m0, c(1,2), function(x) sample(c(0,1),1))
features
folds<-createFolds(features,4)
folds
max(unlist(folds))
m0[folds[[2]],] # Error in m0[folds[[2]], ] : subscript out of bounds
First of all, sorry for any mistakes regarding my post, I'm new to this site.
I´m getting started with R now and I´m trying to do some analysis with time series data.
So, I got a times series at hand and already loaded it into R.
I can also plot this times series and add labels to the axes and so on. So far so good.
My problem: When I plot the time series, R would set the range of values on the y-axis to the interval of [0:170] approximately.
This is somehow strange, since the times series contains the daily EUR/USD exchange rates for this year. That means the values are in a range of about 1.05 to 1.2.
The relative values are correct.
If the plot shows a maximum around day 40, the corresponding value in the data set appears to be a maximum.
But it is around 1.4 and not 170.
I hope one can understand my problem.
I would like to have the y-axis on a scale from 1 to 1.2 for example.
The ylim=c(1, 1.2) command will scale the axis to that range but not the values.
It just ignores them.
Does anyone know how to adjust that?
I´d really appreciate it.
Thank you very much in advance.
Thanks a lot for the input so far.
The "critical code" is the following:
> FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> str(FRB)
'data.frame': 212 obs. of 2 variables:
$ Date: Factor w/ 212 levels "2015-01-01","2015-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Rate: Factor w/ 180 levels "1.0524","1.0575",..: 180 179 177 178 174 173 175 176 171 172 ...
> plot.ts(Rate)
The result of this last plot is the one shown above.
Changing the variable to numeric yields this:
> as.numeric(Rate)
[1] 180 179 177 178 174 173 175 176 171 172 170 166 180 167 169 160 123 128 150 140 132 128 138 165
[25] 161 163 136 134 134 129 159 158 180 156 140 155 151 142 131 148 104 100 96 104 65 53 27 24
[49] 13 3 8 1 2 7 10 9 21 42 36 50 39 33 23 15 19 29 51 54 26 23 11 6
[73] 4 12 5 16 20 18 17 14 22 30 34 49 92 89 98 83 92 141 125 110 81 109 151 149
[97] 162 143 85 69 77 61 180 30 32 38 52 37 78 127 120 73 105 126 131 106 122 119 107 112
[121] 157 137 152 96 93 99 87 94 86 70 71 180 67 43 66 58 84 57 55 47 35 25 26 41
[145] 31 48 48 75 63 59 38 60 46 44 28 40 45 52 62 101 82 74 68 60 64 102 144 168
[169] 159 154 108 91 98 118 111 72 76 180 95 90 117 139 131 116 130 133 145 103 79 88 115 97
[193] 106 113 89 102 121 102 119 114 124 148 180 153 164 161 147 135 146 141 80 56
So, it remains unchanged. This is very strange. The data excerpt shows that "Rate" takes on values between 1.1 and 1.5 approximately, so really not the values that are shown above. :/
The data set can be found under this link:
https://www.dropbox.com/s/ndxstdl1aae5glt/FRB_H10.csv?dl=0
It should be alright. I got it from the data base from the Federal Reserve System, so quite a decent source.
(Had to remove the link to the data excerpt because my reputation only allows for 2 links to be posted at a time. But the entire data set should be even better, I guess.
#BlankUsername
Thanks very much for the link. I got it working now using this code:
FRB <- read.csv("FRB_H10.csv", header=TRUE, sep=",")
> attach(FRB)
> as.numeric(paste(Rate))
[1] NA 1.2015 1.1918 1.1936 1.1820 1.1811 1.1830 1.1832 1.1779 1.1806 1.1598 1.1517 NA
[14] 1.1559 1.1584 1.1414 1.1279 1.1290 1.1370 1.1342 1.1308 1.1290 1.1337 1.1462 1.1418 1.1432
[27] 1.1330 1.1316 1.1316 1.1300 1.1410 1.1408 NA 1.1395 1.1342 1.1392 1.1372 1.1346 1.1307
[40] 1.1363 1.1212 1.1197 1.1190 1.1212 1.1070 1.1006 1.0855 1.0846 1.0707 1.0576 1.0615 1.0524
[53] 1.0575 1.0605 1.0643 1.0621 1.0792 1.0928 1.0908 1.0986 1.0919 1.0891 1.0818 1.0741 1.0768
[66] 1.0874 1.0990 1.1008 1.0850 1.0818 1.0671 1.0598 1.0582 1.0672 1.0596 1.0742 1.0780 1.0763
[79] 1.0758 1.0729 1.0803 1.0876 1.0892 1.0979 1.1174 1.1162 1.1194 1.1145 1.1174 1.1345 1.1283
[92] 1.1241 1.1142 1.1240 1.1372 1.1368 1.1428 1.1354 1.1151 1.1079 1.1126 1.1033 NA 1.0876
[105] 1.0888 1.0914 1.0994 1.0913 1.1130 1.1285 1.1271 1.1108 1.1232 1.1284 1.1307 1.1236 1.1278
[118] 1.1266 1.1238 1.1244 1.1404 1.1335 1.1378 1.1190 1.1178 1.1196 1.1156 1.1180 1.1154 1.1084
[131] 1.1090 NA 1.1076 1.0952 1.1072 1.1025 1.1150 1.1020 1.1015 1.0965 1.0898 1.0848 1.0850
[144] 1.0927 1.0884 1.0976 1.0976 1.1112 1.1055 1.1026 1.0914 1.1028 1.0962 1.0953 1.0868 1.0922
[157] 1.0958 1.0994 1.1042 1.1198 1.1144 1.1110 1.1078 1.1028 1.1061 1.1200 1.1356 1.1580 1.1410
[170] 1.1390 1.1239 1.1172 1.1194 1.1263 1.1242 1.1104 1.1117 NA 1.1182 1.1165 1.1262 1.1338
[183] 1.1307 1.1260 1.1304 1.1312 1.1358 1.1204 1.1133 1.1160 1.1252 1.1192 1.1236 1.1246 1.1162
[196] 1.1200 1.1276 1.1200 1.1266 1.1249 1.1282 1.1363 NA 1.1382 1.1437 1.1418 1.1360 1.1320
[209] 1.1359 1.1345 1.1140 1.1016
Warning message:
NAs introduced by coercion
> Rate <- cbind(paste(Rate))
> plot(Rate)
Warning message:
In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
> plot.ts(Rate, ylab="EUR/USD")
Despite the warning message, I get the following output (shown below). Like I intended to plot it.
Nevertheless, I do not really understand why it works the way it did. Why I have to use the paste() command and what it does exactly. I get the basic idea of what the classes do, but am very new to this whole world of R.
One thing I came to realize already is that R is such a powerful program. And yet confusing if you are a beginner. :D
I am try to carry out chi-square test to see if there is a significant difference in disease proportion between regions but I end up with error in R. Any suggestions on how to correct this error?
data:
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92
d=read.csv(file.choose(),header=T)
attach(d)
chisq.test(d)
Error in chisq.test(d) :
all entries of 'x' must be nonnegative and finite
Your problem must be somewhere upstream of the chi-squared test, i.e. the data are getting mangled somehow when being read in.
d <- read.table(header=TRUE,text="
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92")
However you read the data, results should look like this:
str(d)
## 'data.frame': 2 obs. of 8 variables:
## $ E : int 11 28
## $ NE: int 37 73
## ... etc.
chisq.test(d)
## Pearson's Chi-squared test
## data: d
## X-squared = 3.3405, df = 7, p-value = 0.8518
(attach() is not necessary, and usually actually harmful/confusing ...)
I am trying to solve the DSC(Differential scanning calorimetry) data with R but it seems that I ran into some troubles. All this used to be done in Origin or Qtiplot tediously in my lab.But I wonder if there is another way to do it in batch.But the result did not goes well. For example, maybe I have used the wrong colnames of my data.frame,the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
can not reach my data.
So below is the full description of my purpose, thank you in advance!
the DSC data is like this(I store the CSV file in my GoogleDrive Link ) :
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
in fact I have merge the data into a data.frame and hope I can adjust it and do something further.
the command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
so actually dat is a data.frame with 1163 obs. of 24 variables.
T1,T2,T3.....T12 means temperature that the samples were tested of DSC although in the same interval they do differ a little due to the unstability of the machine.
And the colname along T1~T12 is Heat Flow of different heat treatment durations that records by the machine and ascast1~ascast5 means nothing done to the sample to check the accuracy of the machine.
Now I need to do something like the following:
for T1~T2 is in Celsius Degrees,I need to change them into Kelvin Degrees whichi means every data plus 273.16.
Two temperature is chosen to compare the result that is Ts=180.25,Te=240.45(all is discussed in Celsius Degrees and I have seen it Qtiplot to make sure). To be clear I list the two temperature and the first 6 columns data.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
That all Heat Flow in Ts should be the same that can be made 0 for convenience. So based on the different values Heat Flow of different times like 0.5min,1min,2min,4min,8min,10min,20min and ascas1~ascast5 all Heat Flow value should be minus the Heat Flow value in Ts.
And for Heat Flow in Te, the value should be adjust to make sure that all the Heat Flow data are the same in Te. The purpose is like the following, (1) calculate mean of the 12 heat flow data in Te. Let's use Hmean for the mean heat flow.So Hmean is the value that all Heat Flow should be. (2) for data in column 0.5min,I use col("0.5min") to denote, and the lineal transform formula is like the following:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, [0.05700000-(-0.01710000)] is done in step 2,but I write it for your reference. And this formula is used for different pair of T1~T12 and columns,like (T1,0.5min),(T2, 1min),(T3,1min).....all is 12 pairs.
Now we can plot the 12 pairs of data on the same plot with intervals from 180~240(also in Celsius Degrees) to magnify the details of differences between the different scans of DSC.
I have been stuck on this problems for 2 days , so I return to stackoverflow for help.
Thanks!
I am assuming that your question was right in the beginning where you got the following error,
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
As I could not find a question in the rest of the steps. They just seemed like a step by step procedure of an experiment.
To fix that error, the problem is the column name has a number in it so to use the column name in the way you want (to reference a column), you should use "`", accent mark, symbol.
>dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = F)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
Based on comments adding more information,
You can add a constant to add to alternate columns in the following manner,
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize, we know that the columns of a dataframe can be referenced with a vector of numbers/column names. Most operations in R are vectorized. You can use column names or numbers based on the pattern you are looking for.
For example, I change the name of my first two columns and want to access just those I do this,
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
#Reference all column names with "Y" in it. You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20