I have a big problem: in my data frame there are people who are hypertensive but don't use medication, and people who use medication but have "normal" blood pressure.
To handle that, I created a list of all the medications in the Brazilian Guideline of Hypertension. It worked, but I ended up with NA values both for people who use antihypertensive medication and for people who didn't report using any medication, so if I use complete.cases I exclude healthy people and sick people alike.
Here I import the data from an SPSS file that contains the drugs people reported in the questionnaire:
library(memisc)
setwd("C:/Users/Rafael/Documents/RStudio")
Med<- as.data.set(spss.system.file("medicamentos_fase4a_pro_saude.sav"))
Med <- Med[c(2,5)]
Med <- as.data.frame(Med)
names(Med)[names(Med) == 'quest'] <- 'Quest'
View(Med)
Medication List
ListedMeds <- c("diuréticos", "carvedilol", "olmesartana", "tiazídicos", "clortalidona", "hidroclorotiazida", "indapamida", "bumetamida", "furosemida", "piretanida", "amilorida ", "espironolactona", "triantereno ", "antihipertensivo", "alfametildopa", "clonidina", "guanabenzo", "moxonidina", "doxazosina", "prazosina",...)
for(m in ListedMeds){ Med = Med[ !grepl(m, Med$med_rec), ] }
library(plyr) # I use plyr because people who reported more than one medication were duplicated in the data frame: there was one row per medication for the same person
Med <- ddply(Med, .(Quest), summarize, Rem = paste (med_rec, collapse = ", "))
Merging Med (the data frame with the medications and questionnaire numbers) with DFPA (my data frame with the blood pressure results):
DFPA <- merge (DFPA, Med, by = "Quest", all = TRUE)
DFPA <- subset(DFPA, select = c(Quest, PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
Excluding NA values:
DFPA <- DFPA[complete.cases(DFPA), ]
DFPA <- subset(DFPA, select = c(Quest, PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
I know that I'm not really accomplishing anything in the end, because I'm excluding everyone who has an NA, and that can be a healthy person or a sick person. So I want to know how to exclude all the people who match the listed medications.
PS: The list "ListedMeds" contains medications reported by people who said they use some medication on a regular basis. This cohort has 4000 people; I excluded some of them based on certain parameters, leaving 2854. When I merge Med with DFPA, the number becomes 3011, but many of these people only have information in the Rem column and are NA in the other columns.
PS2: Would it be possible to create a new data frame with the people who were excluded from DFPA because they said they use antihypertensive medication? I thought I had solved the problem, but more than 1000 people were excluded, and I think that number is wrong.
> str(DFPA)
'data.frame': 2854 obs. of 11 variables:
$ Quest : Factor w/ 3041 levels "0001","0002",..: 1 2 3 4 5 6 7 8 10 11 ...
$ PASM : num 116 128 107 112 103 122 112 99 123 120 ...
$ PADM : num 64 86 58 73 69 84 72 62 73 77 ...
$ PAM : num 81 100 74 86 80 97 85 74 90 91 ...
$ PP : num 52 42 49 39 34 38 40 37 50 43 ...
$ Age : num 60 52 53 47 44 61 54 54 33 55 ...
$ Color : Factor w/ 3 levels "B","P","PD": 1 1 1 3 3 1 3 1 1 3 ...
$ Educ : Factor w/ 3 levels "1º","2º","3º": 2 3 3 3 3 2 3 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
$ FEtária: Ord.factor w/ 4 levels "A"<"B"<"C"<"D": 4 3 3 3 2 4 3 3 1 3 ...
$ HAS : Ord.factor w/ 4 levels "N"<"P"<"H1"<"H2": 1 2 1 1 1 2 1 1 2 2 ...
> str(Med)
List of 2
$ Quest: chr [1:2189] "2" "3" "4" "5" ...
$ Rem : chr [1:2189] "cloreto de sódio, dimenidrato, escopolamina, fitoterápico, omeprazol, ramipril+anlodipino, sertralina" "colágeno, dipirona, vitamina e suplemento mineral" "homeopatia" "vitamina e suplemento mineral" ...
Sample:
> mysample
Quest PASM PADM PAM PP Age Color Educ Sex FEtária HAS
133 0133 130 84 99 46 56 PD 1º M C P
1641 1685 146 84 105 62 57 PD 1º M C H1
482 0483 122 78 93 44 64 P 2º F D P
2260 2305 118 78 91 40 54 P 3º F C N
1140 1184 114 70 85 44 63 B 2º M D N
1527 1571 168 98 121 70 56 P 2º M C H2
941 0983 116 73 87 43 65 PD 2º M D N
506 0507 134 90 105 44 60 B 3º M D P
2676 2722 100 60 73 40 50 B 3º M C N
326 0327 106 78 87 28 66 P 2º F D N
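A minimal sketch of one way to approach this, using made-up toy data rather than the real files. Two things in the question look relevant, though both are assumptions on my part: the str() output suggests Quest is zero-padded in DFPA ("0001") but not in Med ("2"), which would stop merge from matching rows and would explain the extra NA-filled rows; and the grepl loop only drops the matching medication rows, so a person who also reported another drug stays in Med. The sketch instead collects the IDs of everyone who reported a listed drug and splits DFPA by those IDs, without merging at all.

```r
# Toy stand-ins for the real data (values are invented for illustration)
Med <- data.frame(
  Quest   = c("1", "1", "2", "3"),
  med_rec = c("hidroclorotiazida", "omeprazol", "dipirona", "furosemida"),
  stringsAsFactors = FALSE
)
DFPA <- data.frame(
  Quest = c("0001", "0002", "0003", "0004"),
  PASM  = c(116, 128, 107, 112),
  stringsAsFactors = FALSE
)
ListedMeds <- c("hidroclorotiazida", "furosemida", "amilorida ")

# Assumption: IDs are numeric strings; pad to four digits so both tables agree
Med$Quest <- sprintf("%04d", as.integer(Med$Quest))

# One regular expression that matches any listed drug
# (trimws removes stray trailing spaces such as in "amilorida ")
pattern <- paste(trimws(ListedMeds), collapse = "|")

# IDs of everyone who reported at least one listed drug
users <- unique(Med$Quest[grepl(pattern, Med$med_rec, ignore.case = TRUE)])

excluded <- DFPA[DFPA$Quest %in% users, ]   # reported an antihypertensive
kept     <- DFPA[!DFPA$Quest %in% users, ]  # everyone else, including people with no medication record
```

Because nobody is dropped for having an NA, people who reported no medication stay in kept, and excluded can be inspected to check whether the number of removed people is plausible.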
I have a data frame that looks a bit like this:
Type Size `Jul-17` `Aug-17` `Sep-17`
1 A Large 35 24 80
2 B Medium 81 13 38
3 C Small 30 64 45
4 D Large 97 68 65
5 E Medium 31 69 33
6 F Small 84 74 12
I use the ddply function a lot, and instead of summing the three columns together like below...
result <- ddply(Example, .(Type), (summarize),
Q3sum = sum(`Jul-17`, `Aug-17`, `Sep-17`))
I'd like to be able to reference a single variable that contains those three columns and call it "Q3". Is there a way to do this that will still allow the data to work with ddply? I've tried setting the three columns to a single variable using Q3<- c(`Jul-17`, `Aug-17`, `Sep-17`), but it doesn't seem to work.
Any suggestions would be greatly appreciated.
Reproducible data frame:
read.table(check.names = FALSE, text="Type Size Jul-17 Aug-17 Sep-17
A Large 35 24 80
B Medium 81 13 38
C Small 30 64 45
D Large 97 68 65
E Medium 31 69 33
F Small 84 74 12", header=TRUE, stringsAsFactors=FALSE) -> xdf
xdf
## Type Size Jul-17 Aug-17 Sep-17
## 1 A Large 35 24 80
## 2 B Medium 81 13 38
## 3 C Small 30 64 45
## 4 D Large 97 68 65
## 5 E Medium 31 69 33
## 6 F Small 84 74 12
If you just want the sum of the columns into one Q3 column:
xdf$Q3 <- rowSums(xdf[,3:5])
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3
## 1 A Large 35 24 80 139
## 2 B Medium 81 13 38 132
## 3 C Small 30 64 45 139
## 4 D Large 97 68 65 230
## 5 E Medium 31 69 33 133
## 6 F Small 84 74 12 170
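If the goal is to refer to the three month columns through a single name, one option is to store the column names in a character vector and index with it. The sketch below uses only base R; q3_cols and by_size are names I made up, and aggregate stands in for ddply so the example has no package dependency.

```r
xdf <- read.table(check.names = FALSE, header = TRUE, stringsAsFactors = FALSE,
                  text = "Type Size Jul-17 Aug-17 Sep-17
A Large 35 24 80
B Medium 81 13 38
C Small 30 64 45
D Large 97 68 65
E Medium 31 69 33
F Small 84 74 12")

# Hypothetical helper: one variable that names the columns making up Q3
q3_cols <- c("Jul-17", "Aug-17", "Sep-17")

# Row-wise total over exactly those columns
xdf$Q3 <- unname(rowSums(xdf[, q3_cols]))

# Grouped summary on the new column (base R stand-in for ddply)
by_size <- aggregate(Q3 ~ Size, data = xdf, FUN = sum)
```

Once Q3 exists as an ordinary column, it can be used inside ddply's summarize like any other variable.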
If you want the 3 months making up "Q3" nested into one column:
xdf$q3_alt <- apply(xdf, 1, function(x) { list(as.numeric(x[3:5])) })
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3 q3_alt
## 1 A Large 35 24 80 139 35, 24, 80
## 2 B Medium 81 13 38 132 81, 13, 38
## 3 C Small 30 64 45 139 30, 64, 45
## 4 D Large 97 68 65 230 97, 68, 65
## 5 E Medium 31 69 33 133 31, 69, 33
## 6 F Small 84 74 12 170 84, 74, 12
str(xdf)
## 'data.frame': 6 obs. of 7 variables:
## $ Type : chr "A" "B" "C" "D" ...
## $ Size : chr "Large" "Medium" "Small" "Large" ...
## $ Jul-17: int 35 81 30 97 31 84
## $ Aug-17: int 24 13 64 68 69 74
## $ Sep-17: int 80 38 45 65 33 12
## $ Q3 : num 139 132 139 230 133 170
## $ q3_alt:List of 6
## ..$ :List of 1
## .. ..$ : num 35 24 80
## ..$ :List of 1
## .. ..$ : num 81 13 38
## ..$ :List of 1
## .. ..$ : num 30 64 45
## ..$ :List of 1
## .. ..$ : num 97 68 65
## ..$ :List of 1
## .. ..$ : num 31 69 33
## ..$ :List of 1
## .. ..$ : num 84 74 12
The solution is the gather function from tidyr. If you use dplyr, you can do it in one line of code.
> library(dplyr)
> library(tidyr)
> df%>%
+ gather(key = Q3,value = values,Jul_17:Sep_17)
type size Q3 values
1 1 A Large Jul_17 35
2 2 B Medium Jul_17 81
3 3 C Small Jul_17 30
4 4 D Large Jul_17 97
5 5 E Medium Jul_17 31
6 6 F Small Jul_17 84
7 1 A Large Aug_17 24
8 2 B Medium Aug_17 13
9 3 C Small Aug_17 64
10 4 D Large Aug_17 68
11 5 E Medium Aug_17 69
12 6 F Small Aug_17 74
13 1 A Large Sep_17 80
14 2 B Medium Sep_17 38
15 3 C Small Sep_17 45
16 4 D Large Sep_17 65
17 5 E Medium Sep_17 33
18 6 F Small Sep_17 12
It sounds to me like you want something along the lines of melt from the reshape2 package or gather from the tidyr package. They will make your data.frame longer, with all the Jul-17, Aug-17, and Sep-17 values in one column and another column declaring which month each data point came from.
Check out this nice primer on data tidying.
Prior to running a randomForest model, I load my data and sort variables into categorical and numerical so the model can process it.
Data as first loaded from the .csv file looks like this:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : int 1 1 1 1 0 0 0 0 1 0 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : Factor w/ 200 levels "#N/A","1690",..: 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : Factor w/ 138 levels "#N/A","100","101",..: 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : int 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : Factor w/ 189 levels "#N/A","10029",..: 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame, 3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 3660 152 15 7159.5
2 1 135.17 3660 150 15 7159.5
3 1 137.25 3660 150 15 7159.5
I then attempt to sort the variables in the following way:
##Sort numerical and categorical values
options(digits = 5)
cols <- c("VarX")
for (i in cols) {
DataFrame[,i] = as.factor(DataFrame[,i])
}
cols2 <- c("Var1", "Var2", "Var3", "Var4", "Var5")
for (i in cols2) {
DataFrame[,i] = as.numeric(DataFrame[,i])
}
However, this does something strange and undesirable to the data:
> str(DataFrame)
'data.frame': 1060 obs. of 6 variables:
$ VarX : Factor w/ 2 levels "0","1": 2 2 2 2 1 1 1 1 2 1 ...
$ Var1 : num 127 135 137 138 138 ...
$ Var2 : num 190 190 190 191 191 191 189 185 183 181 ...
$ Var3 : num 44 43 43 43 43 43 43 43 43 42 ...
$ Var4 : num 15 15 15 15 15 16 16 16 16 16 ...
$ Var5 : num 87 87 87 87 87 85 85 85 85 85 ...
> head(DataFrame,3)
VarX Var1 Var2 Var3 Var4 Var5
1 1 126.58 190 44 15 87
2 1 135.17 190 43 15 87
3 1 137.25 190 43 15 87
Also, although it is not shown in the excerpt above, it turns all NA ("#N/A") values into 1, which, depending on the data, can skew the results.
Q: What would be the correct way to process the data so that there is no corruption of the data, while ensuring that it can be used by the randomForest package?
You should have used as.numeric(as.character(variable_name)) to convert a factor column to a numeric column; otherwise as.numeric returns the internal level codes and the original values are lost.
The documentation in ?factor says this in its Warning section:
The interpretation of a factor depends on both the codes and the
"levels" attribute. Be careful only to compare factors with the same
set of levels (in the same order). In particular, as.numeric applied
to a factor is meaningless, and may happen by implicit coercion. To
transform a factor f to approximately its original numeric values,
as.numeric(levels(f))[f] is recommended and slightly more efficient
than as.numeric(as.character(f)).
Instead of for loops, you can also use sapply to convert these columns to numeric, as below:
dfnew <- sapply(df[,colms_to_be_converted],function(x)as.numeric(as.character(x)))
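As a small illustration of the point above, here is a sketch on invented data (the column names mirror the question, the values do not). It uses lapply with assignment back into the data frame, since sapply returns a matrix rather than a data frame. It also shows an alternative that I am adding as an assumption about the file: declaring "#N/A" as a missing-value marker at read time with the na.strings argument, so the columns are never read as factors in the first place.

```r
df <- data.frame(
  VarX = c(1, 0, 1),
  Var2 = factor(c("3660", "#N/A", "7159.5")),
  Var4 = c(15L, 15L, 16L)
)

# What went wrong: this returns the internal level codes, not the values
codes <- as.integer(df$Var2)

# Convert via character so the printed values are what gets parsed
num_cols <- c("Var2", "Var4")
df[num_cols] <- lapply(df[num_cols], function(x) {
  # "#N/A" cannot be parsed, so it becomes a real NA (warning silenced on purpose)
  suppressWarnings(as.numeric(as.character(x)))
})

# The response stays categorical for classification
df$VarX <- as.factor(df$VarX)

# Alternative: treat "#N/A" as missing while reading the file
csv <- read.csv(text = "Var2\n3660\n#N/A\n7159.5", na.strings = "#N/A")
```

Note that randomForest does not accept NA in predictors by default, so the resulting missing values still need to be handled, for example through its na.action argument.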
I have a dataset:
> k
EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504
The variables are as follows. As per the first answer by joran, unused levels can be dropped with droplevels, so don't worry about the 898 levels; the illustrative k I'm showing is the complete dataset, obtained from k <- d1[1:10, 3:4], where d1 is the original dataset.
> str(k)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 243 NA NA NA 243 NA NA 243 243 NA
$ FATALITIES: num 583 158 116 114 99 90 75 74 67 57
$ INJURIES : num 0 1150 785 597 0 ...
I'm trying to overwrite the WIND factor:
> k[k$EVTYPE==factor("WIND"), ]$EVTYPE <- factor("AFDAF")
> k[k$EVTYPE=="WIND", ]$EVTYPE <- factor("AFDAF")
But both commands give me error messages: "level sets of factors are different" or "invalid factor level, NA generated".
How should I do this?
Try this instead:
k <- droplevels(d1[1:10, 3:5])
Factors (as per the documentation) are simply a vector of integer codes plus a vector of labels, one for each code. These labels are called the "levels". The levels are an attribute, and they persist with your data even when subsetting.
This is a feature, since for many statistical procedures it is vital to keep track of all the possible values a variable could take, even if they don't appear in the actual data.
Some people find this irritating and run R using options(stringsAsFactors = FALSE).
To simply change the levels, you can do something like this:
d <- read.table(text = " EVTYPE FATALITIES INJURIES
198704 HEAT 583 0
862634 WIND 158 1150
68670 WIND 116 785
148852 WIND 114 597
355128 HEAT 99 0
67884 WIND 90 1228
46309 WIND 75 270
371112 HEAT 74 135
230927 HEAT 67 0
78567 WIND 57 504",header = TRUE,sep = "",stringsAsFactors = TRUE)
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "HEAT","WIND": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
> levels(d$EVTYPE) <- c('A','B')
> str(d)
'data.frame': 10 obs. of 3 variables:
$ EVTYPE : Factor w/ 2 levels "A","B": 1 2 2 2 1 2 2 1 1 2
$ FATALITIES: int 583 158 116 114 99 90 75 74 67 57
$ INJURIES : int 0 1150 785 597 0 1228 270 135 0 504
Or to just change one:
levels(d$EVTYPE)[2] <- 'C'
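To change a level by its name rather than by its position, which is closer to what the question was attempting, the same levels<- replacement can be indexed with a logical comparison. A minimal sketch on a cut-down version of the data:

```r
d <- data.frame(
  EVTYPE     = factor(c("HEAT", "WIND", "WIND", "HEAT")),
  FATALITIES = c(583, 158, 116, 99)
)

# Rename the level called "WIND"; every row carrying that level follows along
levels(d$EVTYPE)[levels(d$EVTYPE) == "WIND"] <- "AFDAF"
```

This avoids both error messages from the question, because no new factor with a different level set is assigned into the column.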
I have 2142 rows and 9 columns in my data frame. When I call head(df),
the data frame appears fine, something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable? Storage Unit Order Number
2209 NEZ0037-76 FreezerWorks NEZ0037 BoxPos 1 N 76
2210 NEZ0037-77 FreezerWorks NEZ0037 BoxPos 1 N 77
2211 NEZ0037-78 FreezerWorks NEZ0037 BoxPos 1 N 78
2212 NEZ0037-79 FreezerWorks NEZ0037 BoxPos 1 N 79
2213 NEZ0037-80 FreezerWorks NEZ0037 BoxPos 1 N 80
2214 NEZ0037-81 FreezerWorks NEZ0037 BoxPos 1 N 81
Description Storage.Label
2209 I4
2210 I5
2211 I6
2212 I7
2213 I8
2214 I9
However, when I call write.csv or write.table, I get an incoherent output. Something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable
1 NEZ0011 FreezerWorks NEZ0011 Box-9X9 81 Y
39 40 41 42 43 44 45
80 81 "Box-9X9 NEZ0014" 1 2 3 4
38 39 40 41 42 43 44
79 80 81 "Box-9X9 NEZ0017" 1 2 3
37 38 39 40 41 42 43
78 79 80 81 "Box-9X9 NEZ0020" 1 2
36 37 38 39 40 41 42
77 78 79 80 81 "Box-9X9 NEZ0023" 1
35 36 37 38 39 40 41
76 77 78 79 80 81 "Box-9X9 NEZ0026"
Calling sapply(df, class) reveals that all columns in the data frame are [1] "factor" except for $Storage.Level, which is [1] "data.table" "data.frame". When I called unlist on $Storage.Level, the output was better, but it changed the values in the column. I also tried df <- data.frame(df, stringsAsFactors=FALSE) without success. Also, data.frame(lapply(df, factor)) as suggested in the thread here and as.data.frame in the thread here did not work. Is there a way to unlist $Storage.Level without tampering with the values in the column? Or maybe there is a way to change the class from "data.table" "data.frame" to factor and write the data out safely?
R version 3.0.3 (2014-03-06)
It sounds like you have something like this:
library(data.table)
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.table(df)
str(df)
# 'data.frame': 2 obs. of 3 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC:Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# ..$ A: int 1 2
# ..$ C: int 3 4
# ..- attr(*, ".internal.selfref")=<externalptr>
sapply(df, class)
# $A
# [1] "integer"
#
# $C
# [1] "integer"
#
# $AC
# [1] "data.table" "data.frame"
If that's the case, you will have trouble writing to a csv file.
Try first calling do.call(data.frame, your_data_frame) to see if that sufficiently "flattens" your data.frame, as it does with this example.
str(do.call(data.frame, df))
# 'data.frame': 2 obs. of 4 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC.A: int 1 2
# $ AC.C: int 3 4
You should be able to write this to a csv file without any problems.
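To make the last step concrete, here is a sketch of the full round trip. It nests a plain data.frame instead of a data.table so that it runs with base R only; that substitution, the column names X and Y, and the temporary file are my own choices for illustration.

```r
df <- data.frame(A = 1:2, C = 3:4)
# A nested data.frame column, standing in for the nested data.table
df$AC <- data.frame(X = c("a", "b"), Y = 5:6)

# Flatten: each nested column becomes a top-level column named AC.<name>
flat <- do.call(data.frame, df)

# Write and read back to confirm the file is rectangular
tmp <- tempfile(fileext = ".csv")
write.csv(flat, tmp, row.names = FALSE)
back <- read.csv(tmp)
```

After flattening, names(flat) is "A", "C", "AC.X", "AC.Y", and the file read back has the same columns and row count.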
I have a csv file (298 rows and 24 columns) and I want to create a decision tree to predict the column "salary". I have installed the tree package and loaded it with the library function.
But when I try to create the decision tree:
model<-tree(salary~.,data)
I get the following error:
Error in tree(salary ~ ., data) :
factor predictors must have at most 32 levels
What is wrong with that? Data is as follows:
Name bat hit homeruns runs
1 Alan Ashby 315 81 7 24
2 Alvin Davis 479 130 18 66
3 Andre Dawson 496 141 20 65
...
team position putout assists errors
1 Hou. C 632 43 10
2 Sea. 1B 880 82 14
3 Mon. RF 200 11 3
salary league87 team87
1 475 N Hou.
2 480 A Sea.
3 500 N Chi.
And this is the output of str(data):
'data.frame': 263 obs. of 24 variables:
$ Name : Factor w/ 263 levels "Al Newman","Alan Ashby",..: 2 7 8 10 6 1 13 11 9 3 ...
$ bat : int 315 479 496 321 594 185 298 323 401 574 ...
$ hit : int 81 130 141 87 169 37 73 81 92 159 ...
$ homeruns : int 7 18 20 10 4 1 0 6 17 21 ...
$ runs : int 24 66 65 39 74 23 24 26 49 107 ...
$ runs.batted : int 38 72 78 42 51 8 24 32 66 75 ...
$ walks : int 39 76 37 30 35 21 7 8 65 59 ...
$ years.in.major.leagues : int 14 3 11 2 11 2 3 2 13 10 ...
$ bats.during.career : int 3449 1624 5628 396 4408 214 509 341 5206 4631 ...
$ hits.during.career : int 835 457 1575 101 1133 42 108 86 1332 1300 ...
$ homeruns.during.career : int 69 63 225 12 19 1 0 6 253 90 ...
$ runs.during.career : int 321 224 828 48 501 30 41 32 784 702 ...
$ runs.batted.during.career: int 414 266 838 46 336 9 37 34 890 504 ...
$ walks.during.career : int 375 263 354 33 194 24 12 8 866 488 ...
$ league : Factor w/ 2 levels "A","N": 2 1 2 2 1 2 1 2 1 1 ...
$ division : Factor w/ 2 levels "E","W": 2 2 1 1 2 1 2 2 1 1 ...
$ team : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 14 14 16 14 10 1 7 8 ...
$ position : Factor w/ 23 levels "1B","1O","23",..: 10 1 20 1 22 4 22 22 13 22 ...
$ putout : int 632 880 200 805 282 76 121 143 0 238 ...
$ assists : int 43 82 11 40 421 127 283 290 0 445 ...
$ errors : int 10 14 3 4 25 7 9 19 0 22 ...
$ salary : num 475 480 500 91.5 750 ...
$ league87 : Factor w/ 2 levels "A","N": 2 1 2 2 1 1 1 2 1 1 ...
$ team87 : Factor w/ 24 levels "Atl.","Bal.",..: 9 21 5 14 16 13 10 1 7 8 ...
The issue is almost certainly that you're including the Name variable in your model, as it has too many factor levels (263, against the limit of 32). I would also remove it from a methodological standpoint, but this probably isn't the place for that discussion. Try:
train <- data
train$Name <- NULL
model<-tree(salary~.,train)
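More generally, the offending columns can be found by counting factor levels instead of being picked by hand. The sketch below uses invented data and only base R; the threshold of 32 comes from the error message, while n_levels and too_many are hypothetical helper names.

```r
# Invented data: Name has one level per row, team has only two
data <- data.frame(
  Name   = factor(paste("Player", 1:40)),
  team   = factor(rep(c("Hou.", "Sea."), 20)),
  hit    = 1:40,
  salary = seq(100, 490, by = 10)
)

# Number of levels per column (0 for non-factor columns)
n_levels <- sapply(data, function(x) if (is.factor(x)) nlevels(x) else 0L)

# Factor predictors above the limit reported in the error message
too_many <- names(n_levels)[n_levels > 32]

# Training set without those columns
train <- data[, setdiff(names(data), too_many), drop = FALSE]
```

Here too_many contains only "Name", so train keeps team, hit and salary and can be passed to tree(salary ~ ., train).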
It seems that your salary is a factor vector, while you are trying to perform a regression, so it should be a numeric vector. Simply convert your salary to numeric, and it should work just fine. For more details, read the library's help:
http://cran.r-project.org/web/packages/tree/tree.pdf
Usage
tree(formula, data, weights, subset, na.action = na.pass,
control = tree.control(nobs, ...), method = "recursive.partition",
split = c("deviance", "gini"), model = FALSE, x = FALSE, y = TRUE,
wts = TRUE, ...)
Arguments
formula A formula expression. The left-hand-side (response) should be either a numerical vector when a
regression tree will be fitted or a factor, when a classification tree
is produced. The right-hand-side should be a series of numeric or
factor variables separated by +; there should be no interaction terms.
Both . and - are allowed: regression trees can have offset terms.
(...)
Depending on what exactly is stored in your salary variable, the conversion can be more or less tricky, but this should generally work:
salary = as.numeric(levels(salary))[salary]
EDIT
As pointed out in the comment, the actual error refers to the predictors in the data variable: if a predictor holds numerical data, it can be converted to numeric to solve the issue; if it has to be a factor, you will need another model or fewer levels. You can also convert these factors to a numerical format by hand (for example, by defining as many binary features as there are levels), but this can greatly enlarge your input space.
EDIT2
It seems that you first have to decide what you are trying to model. You are trying to predict salary, but based on what? Your data appears to consist of players' records, so their names are certainly the wrong type of data to use for this prediction (in particular, they are probably what causes the 32 levels error). You should remove from the data variable all the columns that should not be used for building a prediction. I do not know the exact aim here (there is no information about it in the question), so I can only guess that you are trying to predict a person's salary from his or her stats. In that case you should remove from the input data the players' names, the players' teams and, obviously, the salaries (as predicting X using X is not a good idea ;)).