I get an error when I run the script below.
Script:
`# Analyze another data set with row / col data using SpATS
library(plantbreeding) # load library
data(dataset)
head (dataset)
str(dataset)
dataset$genotypes<-as.factor(dataset$genotypes)
m1 <- SpATS(response = "yield", spatial = ~ SAP(columns, rows), genotype = "genotypes", data = dataset)
plot(m1, all.in.one = TRUE) # see all plots in a common plot`
Dataset is arranged as follows.
Variables are as follows.
> str(dataset)
tibble [87 x 4] (S3: tbl_df/tbl/data.frame)
 $ genotypes: Factor w/ 87 levels "1","2","3","4",..: 66 51 64 62 77 2 86 69 21 74 ...
 $ Columns  : num [1:87] 47 47 47 47 47 47 47 47 47 47 ...
 $ Rows     : num [1:87] 55 57 59 61 63 65 67 69 71 73 ...
 $ yield    : num [1:87] 235 NA 152 119 146 ...
This is the error I get:
Error in `[.data.frame`(data, , model.terms) : undefined columns selected
I have tried removing the quotes around the variable names, but that doesn't solve the problem.
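One detail visible in the str() output above: the data frame's columns are named Columns and Rows, while the SpATS call refers to columns and rows. R names are case-sensitive, so this mismatch alone would be enough to produce "undefined columns selected". The sketch below reproduces the situation in base R with a small made-up data frame; the values and the model_terms vector are illustrative assumptions, not SpATS internals.

```r
# Hypothetical stand-in for the real data; only the column names matter here.
dataset <- data.frame(
  genotypes = factor(c("66", "51", "64")),
  Columns   = c(47, 47, 47),
  Rows      = c(55, 57, 59),
  yield     = c(235, NA, 152)
)

# Names as they are written in the model call (lower-case columns/rows).
model_terms <- c("yield", "genotypes", "columns", "rows")

# Which requested names are absent from the data? Names are case-sensitive.
missing_terms <- setdiff(model_terms, names(dataset))
print(missing_terms)

# Selecting a non-existent column raises an error; capture its message.
msg <- tryCatch(
  {
    dataset[, model_terms]
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# One possible fix: make the names match, e.g. by lower-casing them.
names(dataset) <- tolower(names(dataset))
fixed <- dataset[, model_terms]
```

Alternatively, the data can stay as it is and the call can use the capitalised names, i.e. SAP(Columns, Rows).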
Prior to running a randomForest model, I load my data and sort variables into categorical and numerical so the model can process it.
Data as first loaded from the .csv file looks like this:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : int 1 1 1 1 0 0 0 0 1 0 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : Factor w/ 200 levels "#N/A","1690",..: 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : Factor w/ 138 levels "#N/A","100","101",..: 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : int 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : Factor w/ 189 levels "#N/A","10029",..: 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame, 3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 3660 152 15 7159.5
2 1 135.17 3660 150 15 7159.5
3 1 137.25 3660 150 15 7159.5
I then attempt to sort the variables in the following way:
##Sort numerical and categorical values
options(digits = 5)
cols <- c("VarX")
for (i in cols) {
DataFrame[,i] = as.factor(DataFrame[,i])
}
cols2 <- c("Var1", "Var2", "Var3", "Var4", "Var5")
for (i in cols2) {
DataFrame[,i] = as.numeric(DataFrame[,i])
}
However, this does something strange and undesirable to the data:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 2 1 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : num 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : num 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : num 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : num 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame,3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 190 44 15 87
2 1 135.17 190 43 15 87
3 1 137.25 190 43 15 87
Also, while not shown in the above excerpt, it turns all NA values into 1, which, depending on the data, can skew the results.
Q: What would be the correct way to process the data so that there is no corruption of the data, while ensuring that it can be used by the randomForest package?
Use as.numeric(as.character(variable_name)) to convert a factor column to a numeric column. Calling as.numeric directly on a factor returns the internal integer level codes rather than the values shown, so the original information is lost.
The Warning section of ?factor says:
The interpretation of a factor depends on both the codes and the
"levels" attribute. Be careful only to compare factors with the same
set of levels (in the same order). In particular, as.numeric applied
to a factor is meaningless, and may happen by implicit coercion. To
transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient
than as.numeric(as.character(f)).
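Instead of for loops, you can also convert these columns in one step. Note that sapply returns a matrix rather than a data frame; assigning the result of lapply back into the columns keeps the data frame intact:
df[colms_to_be_converted] <- lapply(df[colms_to_be_converted], function(x) as.numeric(as.character(x)))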
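To make the difference concrete, here is a small sketch with a made-up factor that, like Var2 above, has "#N/A" as its first level. The vector contents and the two-column data frame are illustrative assumptions, not the asker's actual data.

```r
# Hypothetical factor resembling Var2: numbers stored as text,
# with "#N/A" as the first level.
f <- factor(c("3660", "#N/A", "3660", "7159.5"),
            levels = c("#N/A", "3660", "7159.5"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)
print(codes)

# Converting through character recovers the values; "#N/A" becomes NA.
# The coercion warning is expected here, so it is suppressed.
values <- suppressWarnings(as.numeric(as.character(f)))
print(values)

# The variant recommended in ?factor gives the same result.
values2 <- suppressWarnings(as.numeric(levels(f))[f])

# Applied to several columns at once, keeping the data frame structure.
DataFrame <- data.frame(
  Var2 = f,
  Var5 = factor(c("7159.5", "7159.5", "#N/A", "10029"))
)
cols2 <- c("Var2", "Var5")
DataFrame[cols2] <- lapply(DataFrame[cols2], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})
str(DataFrame)
```

This also explains why the missing values turned into 1: "#N/A" is level 1, so its code is 1. If re-reading the file is an option, passing na.strings = "#N/A" to read.csv should mark those cells as NA so that the columns are read as numeric in the first place.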
I am using the createFolds function in R to create folds, and it returns a result without complaint. But when I loop over the folds to perform some calculation on each one, I get the error below.
The code is:
set.seed(1000)
k <- 10
folds <- createFolds(train_data,k=k,list = TRUE, returnTrain = FALSE)
str(folds)
This gives the following output:
List of 10
$ Fold01: int [1:18687] 1 8 10 21 22 25 26 29 34 35 ...
$ Fold02: int [1:18685] 5 11 14 32 40 46 50 52 56 58 ...
$ Fold03: int [1:18685] 16 20 39 47 49 77 78 83 84 86 ...
$ Fold04: int [1:18685] 3 15 30 38 41 44 51 53 54 55 ...
$ Fold05: int [1:18685] 7 9 17 18 23 37 42 67 75 79 ...
$ Fold06: int [1:18686] 6 31 36 48 72 74 90 113 114 121 ...
$ Fold07: int [1:18686] 2 33 59 61 100 103 109 123 137 161 ...
$ Fold08: int [1:18685] 24 64 68 87 88 101 110 130 141 152 ...
$ Fold09: int [1:18684] 4 27 28 66 70 85 97 105 112 148 ...
$ Fold10: int [1:18684] 12 13 19 43 65 91 94 108 134 138 ...
However, the code below gives me an error:
for( i in 1:k ){
testData <- train_data[folds[[i]], ]
trainData <- train_data[(-folds[[i]]), ]
}
Error is:
> for( i in 1:k ){
+ testData <- train_data[folds[[i]], ]
+ trainData <- train_data[(-folds[[i]]), ]
+ }
Error in train_data[folds[[i]], ] : subscript out of bounds
I tried different seed values, but I get the same error.
Any help is appreciated.
Thank you!
As far as I can tell, the problem arises because you pass the whole data frame train_data to createFolds. Folds are generated over samples, i.e. the rows of the dataset, so createFolds expects a single vector of outcomes with one element per row.
For instance:
library(caret)   # createFolds()
library(kernlab) # spam data set
data(spam)
dim(spam) # 4601 rows/samples
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
# Here only one column, spam$type, is used,
# and indeed
max(unlist(folds)) # 4601
# so the values can be used as row indices
head(spam[folds[[4]], ])
Passing the whole data frame behaves much like passing a matrix, which is first converted to a vector. A 5x10 matrix, for example, becomes a 50-element vector, and the values in folds are indices into that vector. If you then use these values as row indices for the original object, they overshoot the number of rows:
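# createFolds() is from the caret package
n_row <- 8
n_col <- 10
m0 <- matrix(rnorm(n_row * n_col), n_row, n_col)
features <- apply(m0, c(1, 2), function(x) sample(c(0, 1), 1))
features
folds <- createFolds(features, 4)
folds
max(unlist(folds)) # can be as large as 80, but m0 has only 8 rows
m0[folds[[2]], ] # Error in m0[folds[[2]], ] : subscript out of bounds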
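For the loop in the question, the key point is that the fold indices must range over 1..nrow(train_data). The sketch below illustrates this in base R only, without caret: it swaps createFolds for a plain random assignment of rows to folds, so it does not stratify on the outcome the way createFolds does. The train_data here is a small made-up data frame.

```r
set.seed(1000)

# Hypothetical stand-in for the real train_data.
train_data <- data.frame(x = rnorm(25), y = rnorm(25))

k <- 10
n <- nrow(train_data)

# Assign each row to one of k folds at random, with near-equal fold sizes.
fold_id <- sample(rep_len(seq_len(k), n))

# List of k integer vectors holding row indices, one vector per fold.
folds <- split(seq_len(n), fold_id)
str(folds)

# Every index is a valid row number, so the subsetting no longer fails.
sizes_ok <- logical(k)
for (i in seq_len(k)) {
  testData  <- train_data[folds[[i]], ]
  trainData <- train_data[-folds[[i]], ]
  sizes_ok[i] <- nrow(testData) + nrow(trainData) == n
}
```

With caret itself, the equivalent change would be to pass the outcome column (one value per row) as y instead of the whole data frame, as in the spam example above.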
I am trying to carry out a chi-square test to see whether there is a significant difference in disease proportion between regions, but I end up with an error in R. Any suggestions on how to correct this error?
data:
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92
d=read.csv(file.choose(),header=T)
attach(d)
chisq.test(d)
Error in chisq.test(d) :
all entries of 'x' must be nonnegative and finite
Your problem must be somewhere upstream of the chi-squared test, i.e. the data are getting mangled somehow when being read in.
d <- read.table(header=TRUE,text="
E NE NW SE SW EM WM YH
Cases 11 37 54 30 114 44 31 39
Non.cases 28 73 116 68 211 80 78 92")
However you read the data, results should look like this:
str(d)
## 'data.frame': 2 obs. of 8 variables:
## $ E : int 11 28
## $ NE: int 37 73
## ... etc.
chisq.test(d)
## Pearson's Chi-squared test
## data: d
## X-squared = 3.3405, df = 7, p-value = 0.8518
(attach() is not necessary, and usually actually harmful/confusing ...)
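One way the data can get mangled is when the row labels (Cases, Non.cases) are read as an ordinary column rather than as row names. The sketch below assumes the CSV file keeps those labels in its first column; the csv_text string is a made-up stand-in for the real file, which is not shown in the question.

```r
# Hypothetical CSV content with the row labels in the first column.
csv_text <- ",E,NE,NW,SE,SW,EM,WM,YH
Cases,11,37,54,30,114,44,31,39
Non.cases,28,73,116,68,211,80,78,92"

# row.names = 1 moves the label column into the row names,
# leaving only numeric count columns in the data frame.
d <- read.csv(text = csv_text, header = TRUE, row.names = 1)
str(d)

# All columns are now numeric counts, so the test can run.
res <- chisq.test(d)
print(res)
```

Checking str(d) before calling chisq.test is a quick way to confirm that every column is numeric and free of NA values.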
I have 2142 rows and 9 columns in my data frame. When I call head(df),
the data frame appears fine, something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable? Storage Unit Order Number
2209 NEZ0037-76 FreezerWorks NEZ0037 BoxPos 1 N 76
2210 NEZ0037-77 FreezerWorks NEZ0037 BoxPos 1 N 77
2211 NEZ0037-78 FreezerWorks NEZ0037 BoxPos 1 N 78
2212 NEZ0037-79 FreezerWorks NEZ0037 BoxPos 1 N 79
2213 NEZ0037-80 FreezerWorks NEZ0037 BoxPos 1 N 80
2214 NEZ0037-81 FreezerWorks NEZ0037 BoxPos 1 N 81
Description Storage.Label
2209 I4
2210 I5
2211 I6
2212 I7
2213 I8
2214 I9
However, when I call write.csv or write.table, I get an incoherent output. Something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable
1 NEZ0011 FreezerWorks NEZ0011 Box-9X9 81 Y
39 40 41 42 43 44 45
80 81 "Box-9X9 NEZ0014" 1 2 3 4
38 39 40 41 42 43 44
79 80 81 "Box-9X9 NEZ0017" 1 2 3
37 38 39 40 41 42 43
78 79 80 81 "Box-9X9 NEZ0020" 1 2
36 37 38 39 40 41 42
77 78 79 80 81 "Box-9X9 NEZ0023" 1
35 36 37 38 39 40 41
76 77 78 79 80 81 "Box-9X9 NEZ0026"
Calling sapply(df, class) reveals that all columns in the data frame are [1] "factor"
except for $Storage.Level which is [1] "data.table" "data.frame". When I called unlist on $Storage.Level, the output is better but it changes the value in the column. I also tried
df <- data.frame(df, stringsAsFactors=FALSE) without success. Also data.frame(lapply(df, factor)) as suggested in the thread here and as.data.frame in the thread here did not work. Is there a way to unlist $Storage.Level without tampering with the values in the column? Or maybe there is a way to change from level "data.table" "data.frame" to factor and output the data safely.
R version 3.0.3 (2014-03-06)
It sounds like you have something like this:
library(data.table)
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.table(df)
str(df)
# 'data.frame': 2 obs. of 3 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC:Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# ..$ A: int 1 2
# ..$ C: int 3 4
# ..- attr(*, ".internal.selfref")=<externalptr>
sapply(df, class)
# $A
# [1] "integer"
#
# $C
# [1] "integer"
#
# $AC
# [1] "data.table" "data.frame"
If that's the case, you will have trouble writing to a csv file.
Try first calling do.call(data.frame, your_data_frame) to see if that sufficiently "flattens" your data.frame, as it does with this example.
str(do.call(data.frame, df))
# 'data.frame': 2 obs. of 4 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC.A: int 1 2
# $ AC.C: int 3 4
You should be able to write this to a csv file without any problems.
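The full round trip can be sketched as follows. To keep the example free of extra packages, it nests a plain data.frame inside a column instead of a data.table (an assumption made for illustration; the flattening step is the same do.call(data.frame, ...) call as above), then writes the result to a temporary file and reads it back.

```r
# A data frame with a nested data.frame column, standing in for the
# data.table column from the question.
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.frame(A = 1:2, C = 3:4)
sapply(df, class)

# Flatten: each nested column becomes an ordinary column named AC.<name>.
flat <- do.call(data.frame, df)
str(flat)

# Write to a temporary CSV file and read it back to check the result.
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
back <- read.csv(out_file)
print(back)
```

If the flattened names are not wanted, they can be changed with names(flat) <- ... before writing.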
I have a data.table in R from which I want to throw away the first and the last n rows. I want to apply some filtering first and then truncate the result. I know I can do it this way:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
e2[100:(nrow(e2)-100)]
Is there a possibility of doing this in one line? I thought of something like:
example[row1%%2==0][100:-100]
This of course does not work, but is there a simpler solution that does not require an additional variable?
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
n = 5
str(example[!rownames(example) %in%
c( head(rownames(example), n), tail(rownames(example), n)), ])
Classes ‘data.table’ and 'data.frame': 990 obs. of 2 variables:
$ row1: num 6 7 8 9 10 11 12 13 14 15 ...
$ row2: num 17 20 23 26 29 32 35 38 41 44 ...
- attr(*, ".internal.selfref")=<externalptr>
Added a one-liner version with the selection criterion
str(
(res <- example[row1 %% 2 == 0])[ n:( nrow(res)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
And further added this version that does not use an intermediate named value
str(
example[row1 %% 2 == 0][n:(sum( row1 %% 2==0)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
In this case you know the name of one column (row1) that exists, so using length(<any column>) returns the number of rows within the unnamed temporary data.table:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
ans1 = e2[100:(nrow(e2)-100)]
ans2 = example[row1%%2==0][100:(length(row1)-100)]
identical(ans1,ans2)
[1] TRUE
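One thing to watch in the index-range versions: a range such as n:(nrow(x) - n) keeps row n, so it drops n - 1 rows from the start and n rows from the end. If exactly n rows should go from each end, negative arguments to head and tail are a compact alternative. The sketch below uses a plain data.frame so that it runs without data.table; that substitution is an assumption made for illustration only.

```r
# Same shape as the example above, built as a base data.frame.
example <- data.frame(row1 = seq(1, 1000, 1), row2 = seq(2, 3000, 3))
n <- 5

# Filter first, then drop the first n rows (tail with -n)
# and the last n rows (head with -n), all in one expression.
trimmed <- head(tail(example[example$row1 %% 2 == 0, ], -n), -n)

nrow(trimmed)            # 500 even rows minus 5 at each end
range(trimmed$row1)      # first and last remaining values of row1
```

With data.table, the special symbol .N can stand in for the row count inside the brackets, so something like example[row1 %% 2 == 0][(n + 1):(.N - n)] should give the same rows without a temporary variable.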