Before fitting a randomForest model, I load my data and convert the variables to categorical or numerical types so the model can process them.
The data, as first loaded from the .csv file, looks like this:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : int 1 1 1 1 0 0 0 0 1 0 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : Factor w/ 200 levels "#N/A","1690",..: 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : Factor w/ 138 levels "#N/A","100","101",..: 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : int 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : Factor w/ 189 levels "#N/A","10029",..: 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame, 3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 3660 152 15 7159.5
2 1 135.17 3660 150 15 7159.5
3 1 137.25 3660 150 15 7159.5
I then attempt to sort the variables in the following way:
##Sort numerical and categorical values
options(digits = 5)
cols <- c("VarX")
for (i in cols) {
DataFrame[,i] = as.factor(DataFrame[,i])
}
cols2 <- c("Var1", "Var2", "Var3", "Var4", "Var5")
for (i in cols2) {
DataFrame[,i] = as.numeric(DataFrame[,i])
}
However, this does something strange and undesirable to the data:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 2 1 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : num 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : num 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : num 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : num 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame,3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 190 44 15 87
2 1 135.17 190 43 15 87
3 1 137.25 190 43 15 87
Also, while not shown in the above excerpt, it turns all NA values into 1, which, depending on the data, can skew the results.
Q: What would be the correct way to process the data so that there is no corruption of the data, while ensuring that it can be used by the randomForest package?
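Use as.numeric(as.character(variable_name)) to convert a factor column to a numeric column. Calling as.numeric directly on a factor returns the internal level codes rather than the original values, which is why 3660 became 190 and why the "#N/A" entries (level 1 of each factor) became 1.
The Warning section of ?factor says: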
The interpretation of a factor depends on both the codes and the
"levels" attribute. Be careful only to compare factors with the same
set of levels (in the same order). In particular, as.numeric applied
to a factor is meaningless, and may happen by implicit coercion. To
transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient
than as.numeric(as.character(f)).
Instead of for loops you can also use sapply to convert these columns to numeric. Note that sapply returns a matrix, so assign the result back into the data frame columns:
df[colms_to_be_converted] <- sapply(df[colms_to_be_converted], function(x) as.numeric(as.character(x)))
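To make the difference concrete, here is a small sketch on a made-up factor. The values are borrowed from the head() output in the question; f, codes, vals, vals2, csv_text and df are hypothetical names used only for illustration. It also shows the na.strings argument of read.csv, which, assuming the "#N/A" strings are what turned the columns into factors in the first place, may avoid the problem at load time:

```r
# Toy factor that mimics a numeric column polluted by "#N/A" strings
f <- factor(c("3660", "#N/A", "7159.5", "3660"))

# Wrong: returns the internal integer codes of the levels
codes <- as.numeric(f)

# Right: go through the labels; "#N/A" becomes a real NA
# (suppressWarnings hides the expected coercion warning)
vals <- suppressWarnings(as.numeric(as.character(f)))

# Equivalent form recommended in ?factor
vals2 <- suppressWarnings(as.numeric(levels(f))[f])

# Alternative: declare "#N/A" as missing while reading,
# so the column is numeric from the start
csv_text <- "Var2,Var4\n3660,15\n#N/A,15\n7159.5,16"
df <- read.csv(text = csv_text, na.strings = "#N/A")
str(df)
```

With real NA values in place, the missing entries can then be handled explicitly (for example with na.action or imputation) before the data is passed to randomForest.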
I am having an issue converting columns to factors in data read from a CSV file. Here is a summary of the data frame:
> 'data.frame': 303 obs. of 12 variables:
> PLOT : int 19 177 54 114 41 48 142 134 160 267 ...
> RANGE : int 2 12 4 8 3 4 10 9 11 18 ...
> ROW : int 4 12 9 9 11 3 7 14 10 12 ...
> REP : int 1 1 1 1 1 1 1 1 1 1 ...
> ENTRY : Factor w/ 184 levels "","17_YMG_0293",..: 40 40 77 82 87 88 102 103 103 6 ...
> PLOT_ID : Factor w/ 301 levels "","18_HZG_OvOv_001",..: 20 178 55 115 42 49 143 135 161 268 ...
> Shatter : num 9 9 9 9 9 9 9 9 9 8 ...
> Chaff.Color : Factor w/ 4 levels "","*Blank ones are segregating in color",..: 3 4 3 4 4 4 3 4 4 3 ...
> Heading_d.from.Jan.1: int 138 139 137 133 135 135 133 137 135 136 ...
> Height_cm : int 74 73 77 76 74 79 78 73 76 70 ...
> Plot.weight..kg. : num 0.26 0.18 0.19 0.14 0.33 0.19 0.13 0.11 0.24 0.18 ...
But I get this error:
HAYSData$Rep<-as.factor(HAYSData$Rep)
Error in `$<-.data.frame`(`*tmp*`, Rep, value = integer(0)) :
replacement has 0 rows, data has 303
I get the same type of error for Entry, Range, and Rows. I am not sure why: when I look at length(Entry), for example, I get 300. I even tested changing factor to numeric, but it does not help.
I don't have any NA in my data, and each category is in its own column.
I don't know if something is wrong with my CSV. I have used this same script with another CSV and had no issues in this part of the script.
Can someone please help me?
R is case-sensitive: the columns are named REP, ENTRY, RANGE and ROW, not Rep, Entry, Range and Rows. Try:
HAYSData$REP <- as.factor(HAYSData$REP)
HAYSData$ENTRY <- as.factor(HAYSData$ENTRY)
HAYSData$RANGE <- as.factor(HAYSData$RANGE)
HAYSData$ROW <- as.factor(HAYSData$ROW)
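A minimal sketch of why the original call failed, using a tiny stand-in for HAYSData (the two-row data frame and the cols vector are made up for illustration). Asking for a column with the wrong case returns NULL, and converting NULL gives a zero-length factor, hence "replacement has 0 rows":

```r
# Small stand-in for the real data frame
HAYSData <- data.frame(PLOT = c(19, 177), REP = c(1, 1), RANGE = c(2, 12))

# Check the exact spelling of the column names first
names(HAYSData)

# A wrongly-cased name does not match any column
is.null(HAYSData$Rep)

# Convert several columns at once, using the names exactly as they appear
cols <- c("REP", "RANGE")
HAYSData[cols] <- lapply(HAYSData[cols], as.factor)
str(HAYSData)
```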
I have a problem: in my data frame there are people who are hypertensive but don't use medication, and people who use medication but have "normal" blood pressure.
To handle that, I created a list of all medications in the Brazilian Guideline of Hypertension. It worked, but it produced NA values both for people who use antihypertensive medication and for people who didn't report any medication use. Therefore, if I use complete.cases, I exclude healthy people as well as sick people.
Here I import the data from an SPSS file that contains the drugs people reported in the questionnaire:
library(memisc)
setwd("C:/Users/Rafael/Documents/RStudio")
Med<- as.data.set(spss.system.file("medicamentos_fase4a_pro_saude.sav"))
Med <- Med[c(2,5)]
Med <- as.data.frame(Med)
names(Med)[names(Med) == 'quest'] <- 'Quest'
View(Med)
Medication List
ListedMeds <- c("diuréticos", "carvedilol", "olmesartana", "tiazídicos", "clortalidona", "hidroclorotiazida", "indapamida", "bumetamida", "furosemida", "piretanida", "amilorida ", "espironolactona", "triantereno ", "antihipertensivo", "alfametildopa", "clonidina", "guanabenzo", "moxonidina", "doxazosina", "prazosina",...)
for(m in ListedMeds){ Med = Med[ !grepl(m, Med$med_rec), ] }
library(plyr) # plyr is used because people who reported more than one medication appear in several rows, one row per medication
Med <- ddply(Med, .(Quest), summarize, Rem = paste (med_rec, collapse = ", "))
Merging Med (the data frame with medications and questionnaire number) with my data frame of blood pressure results:
DFPA <- merge (DFPA, Med, by = "Quest", all = TRUE)
DFPA <- subset(DFPA, select = c(Quest, PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
Excluding NA values:
DFPA <- DFPA[complete.cases(DFPA), ]
DFPA <- subset(DFPA, select = c(Quest,PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
I know this last step doesn't achieve what I want, because I'm excluding everyone who has an NA, and that can be a healthy or a sick person. So I want to know how to exclude only the people who match the listed medications.
ps: The list "ListedMeds" contains medications from people who said they use some medication on a regular basis. This cohort has 4000 people; I've excluded some based on certain parameters, leaving 2854 people. When I merge Med with DFPA, the number becomes 3011; however, many of these people only have information in the column Rem and are NA in the other columns.
ps2: Would it be possible to create a new data frame with the people who were excluded from DFPA because they said they use antihypertensive medication? I thought I had solved the problem, but more than 1000 people were excluded, and I think this number is wrong.
> str(DFPA)
'data.frame': 2854 obs. of 11 variables:
$ Quest : Factor w/ 3041 levels "0001","0002",..: 1 2 3 4 5 6 7 8 10 11 ...
$ PASM : num 116 128 107 112 103 122 112 99 123 120 ...
$ PADM : num 64 86 58 73 69 84 72 62 73 77 ...
$ PAM : num 81 100 74 86 80 97 85 74 90 91 ...
$ PP : num 52 42 49 39 34 38 40 37 50 43 ...
$ Age : num 60 52 53 47 44 61 54 54 33 55 ...
$ Color : Factor w/ 3 levels "B","P","PD": 1 1 1 3 3 1 3 1 1 3 ...
$ Educ : Factor w/ 3 levels "1º","2º","3º": 2 3 3 3 3 2 3 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
$ FEtária: Ord.factor w/ 4 levels "A"<"B"<"C"<"D": 4 3 3 3 2 4 3 3 1 3 ...
$ HAS : Ord.factor w/ 4 levels "N"<"P"<"H1"<"H2": 1 2 1 1 1 2 1 1 2 2 ...
> str(Med)
List of 2
$ Quest: chr [1:2189] "2" "3" "4" "5" ...
$ Rem : chr [1:2189] "cloreto de sódio, dimenidrato, escopolamina,
fitoterápico, omeprazol, ramipril+anlodipino, sertralina" "colágeno,
dipirona, vitamina e suplemento mineral" "homeopatia" "vitamina e suplemento
mineral" ...
Sample:
> mysample
Quest PASM PADM PAM PP Age Color Educ Sex FEtária HAS
133 0133 130 84 99 46 56 PD 1º M C P
1641 1685 146 84 105 62 57 PD 1º M C H1
482 0483 122 78 93 44 64 P 2º F D P
2260 2305 118 78 91 40 54 P 3º F C N
1140 1184 114 70 85 44 63 B 2º M D N
1527 1571 168 98 121 70 56 P 2º M C H2
941 0983 116 73 87 43 65 PD 2º M D N
506 0507 134 90 105 44 60 B 3º M D P
2676 2722 100 60 73 40 50 B 3º M C N
326 0327 106 78 87 28 66 P 2º F D N
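One possible way to approach this, sketched on made-up data: flag the questionnaire IDs whose reported medication matches the list, then split DFPA on that flag instead of relying on complete.cases. The toy rows, the shortened ListedMeds, and the names pattern, flagged, excluded and kept are all hypothetical. The sketch also assumes that Quest is stored in the same format in both data frames; in the str() output above, DFPA uses zero-padded IDs ("0001") while Med uses plain ones ("2"), so the IDs would probably need to be normalised first.

```r
# Toy medication records: one row per person and medication
Med <- data.frame(
  Quest   = c("1", "1", "2", "3"),
  med_rec = c("omeprazol", "furosemida", "dipirona", "clonidina"),
  stringsAsFactors = FALSE
)

# Toy blood pressure data; person "4" reported no medication at all
DFPA <- data.frame(
  Quest = c("1", "2", "3", "4"),
  PASM  = c(130, 116, 146, 120),
  stringsAsFactors = FALSE
)

# Shortened, illustrative medication list
ListedMeds <- c("furosemida", "clonidina")

# One regular expression that matches any listed medication
pattern <- paste(ListedMeds, collapse = "|")

# IDs of people who reported at least one listed medication
flagged <- unique(Med$Quest[grepl(pattern, Med$med_rec)])

# People excluded because of antihypertensive medication
excluded <- DFPA[DFPA$Quest %in% flagged, ]

# Everyone else, including people with no medication record
kept <- DFPA[!DFPA$Quest %in% flagged, ]
```

Because nothing is merged, people without any medication record stay in kept rather than turning into NA rows, and nrow(excluded) gives a direct count to check against the expected number of medicated participants.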
I am trying to get summary statistics for my data set. The dataset contains cereal yield values for different countries over a number of years. I want to get the summary statistics for each year, and then transpose the dataset and get the summary statistics for each country.
For some reason I am not getting summary statistics, just a list of some of the values and their counts.
I would appreciate any help with this issue.
Below is a sample of my dataset:
row.names YR1990 YR1991 YR1992
3 1200.6 1160 1097.7
4 320.9 417.4 397
5 2794.3 2071.8 2269.2
6 2216.4 1594 2315.3
7 2232.32 2666.1 3057.3
10 2380.9 1833.3 1722.2
These are the results I get from the summary() function:
summary(CerialData)
YR1990 YR1991 YR1992 YR1993 YR1994
1000 : 1 1000 : 1 943.2 : 2 1000 : 1 1040.03: 1
1003.9 : 1 1043.19: 1 1000 : 1 1055.77: 1 1041.1 : 1
1026.7 : 1 1050.3 : 1 1021.2 : 1 1083.3 : 1 1091.6 : 1
1028.5 : 1 1055.3 : 1 1042.1 : 1 1109.3 : 1 1100 : 1
1033.2 : 1 1094 : 1 1069.7 : 1 1135.5 : 1 1111.1 : 1
1036.8 : 1 1108.3 : 1 1072.3 : 1 1153 : 1 1132.2 : 1
(Other):158 (Other):158 (Other):157 (Other):158 (Other):158
str(CerialData)
'data.frame': 164 obs. of 20 variables:
$ YR1990: Factor w/ 188 levels "","..","0","1000",..: 19 116 103 80 81 85 46 153 26 177 ...
$ YR1991: Factor w/ 191 levels "","..","0","1000",..: 14 141 66 38 93 53 40 154 28 181 ...
$ YR1992: Factor w/ 207 levels "","..","0","1000",..: 10 151 95 96 134 49 67 165 28 197 ...
$ YR1993: Factor w/ 194 levels "","..","0","1000",..: 8 97 99 178 107 35 62 153 23 182 ...
$ YR1994: Factor w/ 214 levels "","..","0","1040.03",..: 11 133 107 74 127 53 15 171 17 207 ...
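The str() output suggests why summary() only counts values: every column is a factor, and the levels include "" and "..", which look like placeholders for missing data. A minimal sketch under that assumption, on a made-up three-row stand-in for CerialData (clean and by_country are hypothetical names):

```r
# Toy stand-in: numeric values stored as factors, with "" and ".." placeholders
CerialData <- data.frame(
  YR1990 = c("1200.6", "..", "2794.3"),
  YR1991 = c("1160", "417.4", ""),
  stringsAsFactors = TRUE
)

# Convert every column: labels -> character -> numeric, placeholders -> NA
clean <- as.data.frame(lapply(CerialData, function(x) {
  x <- as.character(x)
  x[x %in% c("", "..")] <- NA
  as.numeric(x)
}))

# Summary statistics per year (one column per year)
summary(clean)

# Transpose so that each column is a country, then summarise per country
by_country <- as.data.frame(t(clean))
summary(by_country)
```

If the file is read with read.csv, passing na.strings = c("", "..") at load time should have the same effect without the conversion step.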
I have a dataset:
> k
EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504
The variables are as follows. As per the first answer by joran, unused levels can be dropped with droplevels, so there is no need to worry about the 898 levels. The illustrative k shown here is the complete dataset, obtained from k <- d1[1:10, 3:4], where d1 is the original dataset.
> str(k)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 243 NA NA NA 243 NA NA 243 243 NA
$ FATALITIES: num 583 158 116 114 99 90 75 74 67 57
$ INJURIES : num 0 1150 785 597 0 ...
I'm trying to overwrite the WIND factor:
> k[k$EVTYPE==factor("WIND"), ]$EVTYPE <- factor("AFDAF")
> k[k$EVTYPE=="WIND", ]$EVTYPE <- factor("AFDAF")
But both commands give me error messages: level sets of factors are different or invalid factor level, NA generated.
How should I do this?
Try this instead:
k <- droplevels(d1[1:10, 3:5])
Factors (as per the documentation) are simply a vector of integer codes and then a simple vector of labels for each code. These are called the "levels". The levels are an attribute, and persist with your data even when subsetting.
This is a feature, since for many statistical procedures it is vital to keep track of all the possible values that variable could have, even if they don't appear in the actual data.
Some people find this irritating and run R using options(stringsAsFactors = FALSE) (since R 4.0.0 this is the default).
To simply change the levels, you can do something like this:
d <- read.table(text = " EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504",header = TRUE,sep = "",stringsAsFactors = TRUE)
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "HEAT","WIND": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
> levels(d$EVTYPE) <- c('A','B')
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "A","B": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
Or to just change one:
levels(d$EVTYPE)[2] <- 'C'
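Renaming by position works, but it relies on knowing where the level sits. A small sketch that drops unused levels and then renames a level by its name, on made-up data (the toy d1 and the "FLOOD" row are invented for illustration):

```r
# Toy version of the original data with one extra event type
d1 <- data.frame(
  EVTYPE     = factor(c("HEAT", "WIND", "FLOOD", "WIND")),
  FATALITIES = c(583, 158, 20, 116)
)

# Subsetting keeps all levels, even those no longer present
k <- d1[d1$EVTYPE != "FLOOD", ]
nlevels(k$EVTYPE)

# Drop the unused levels
k <- droplevels(k)

# Rename the level called "WIND", wherever it sits in the level vector
levels(k$EVTYPE)[levels(k$EVTYPE) == "WIND"] <- "AFDAF"
k
```

Assigning to levels() changes the labels in one step, so there is no need to overwrite individual cells, which is what triggered the "invalid factor level" warning in the question.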
I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with the library function.
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model: it is a factor with 263 levels, far above the limit of 32. I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
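If it is not obvious which column is responsible, the number of levels per column can be checked directly. A minimal sketch on made-up data (the toy data frame and the names n_levels and too_many are hypothetical; the limit of 32 comes from the error message):

```r
# Toy data: Name has 40 levels, league has 2, salary is numeric
data <- data.frame(
  Name   = factor(paste0("player", 1:40)),
  league = factor(rep(c("A", "N"), 20)),
  salary = seq(100, 490, by = 10)
)

# Number of levels per column (0 for non-factor columns)
n_levels <- sapply(data, nlevels)
n_levels

# Factor columns above the limit reported in the error message
too_many <- names(n_levels[n_levels > 32])

# Drop them before fitting, e.g. tree(salary ~ ., train)
train <- data[, !names(data) %in% too_many, drop = FALSE]
names(train)
```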
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE, wts
= TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a
regression tree will be fitted or a factor, when a classification tree
is produced. The right-hand-side should be a series of numeric or
factor variables separated by +; there should be no interaction terms.
Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comments, the actual error refers to the predictors in the data variable. If the offending column holds numerical data, it can be converted to numeric to solve the issue; if it has to be a factor, you will need another model or to reduce the number of levels. You can also convert these factors to a numerical format by hand (for example, by defining as many binary features as you have levels), but this can make your input space much larger.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably causing the 32 levels error). You should remove from the data variable all the columns that should not be used for building a prediction. I do not know the exact aim here (there is no information about it in the question), so I can only guess that you are trying to predict a person's salary based on his/her stats. In that case you should remove from the input data: players' names, players' teams and obviously salaries (as predicting X using X is not a good idea ;)).