How to Interpret "Levels" in Random Forest using R/Rattle - r

I am brand new at using R/Rattle and am having difficulty understanding how to interpret the last line of this code output. Here is the function call along with its output:
> head(weatherRF$model$predicted, 10)
336 342 94 304 227 173 265 44 230 245
No No No No No No No No No No
Levels: No Yes
This code is implementing a weather data set in which we are trying to get predictions for "RainTomorrow". I understand that this function calls for the predictions for the first 10 observations of the data set. What I do NOT understand is what the last line ("Levels: No Yes") means in the output.

The result is a factor variable.
The "Levels:" line lists the values the factor is permitted to take, here No and Yes, whether or not each value actually appears among the printed elements.
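A minimal sketch of the same behaviour, using a small made-up factor (the name `pred` is only for illustration and is not part of the Rattle model object):

```r
# A factor with two permitted values, of which only "No" occurs
pred <- factor(c("No", "No", "No"), levels = c("No", "Yes"))

print(pred)          # the printed output ends with a "Levels:" line
levels(pred)         # the permitted values of the factor
counts <- table(pred) # counts per level, including levels that never occur
counts
```

Printing any factor appends the "Levels:" line, so it describes the variable as a whole rather than the ten predictions shown.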

Related

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two week windows (14 rows) from start to finish sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21 and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a wilcox rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of I would love to be enlightened. I did try roll apply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated to a frustrated grad student :) and if I did not provide enough information please let me know.
Consider lapply over a sequence of window start rows, stepping by 7 through the 365 days of the year (the last day is excluded so the final group is not a single day). Each call returns a one-row data frame holding the Wilcoxon test p-value and a week label. The list items are then row-bound into a single, final data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
  par.sub=rollingpar[i:(i+13), ]
  par.sub=melt(par.sub)
  par.sub=na.omit(par.sub)
  par.sub$variable=as.factor(par.sub$variable)
  data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
             p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
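Because the question mentions a data frame of 139 rows, window starts that run past the last row would index rows that do not exist and return only NA values. A minimal sketch of one way to cap the windows at the available rows is shown here. It uses made-up data in place of the real `rollingpar`, and base R's `stack()` in place of `melt()` as an assumption to keep the example free of extra packages:

```r
# Made-up stand-in for the real data: 139 daily PAR values per year
set.seed(1)
rollingpar <- data.frame(par14 = rnorm(139, 1300, 100),
                         par15 = rnorm(139, 1250, 100))
rollingpar$par14[c(2, 4, 6)] <- NaN   # a few missing days, as in the question

# Start rows of 14-row windows sliding by 7, keeping only complete windows
starts <- seq(1, nrow(rollingpar) - 13, by = 7)

pvals <- vapply(starts, function(i) {
  win  <- rollingpar[i:(i + 13), ]
  long <- na.omit(stack(win))          # columns: values (PAR), ind (year)
  wilcox.test(values ~ ind, data = long)$p.value
}, numeric(1))

wilcoxdf <- data.frame(start_row = starts,
                       end_row   = starts + 13,
                       p.value   = pvals)
head(wilcoxdf)
```

Rows left over after the last complete window are simply not tested in this sketch; whether to include a shorter final window is a design choice.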

Single line user input in R

I want to read numeric values from the user in R. The values will all be entered on one line. readline() does read the values but returns them as character, so I cannot do statistical operations on them, whereas scan() doesn't seem to take multiple numeric values on one line. Please help.
Sample Input
630 135 146 233 144 498 729 120 511 670
Can you suggest a way to prompt the user for these values and store them as numeric, so that I can perform basic statistical operations on them?
as.numeric(unlist(strsplit(readline()," ")))
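The one-liner above can be wrapped in a small helper. This is a sketch only: `parse_numbers` is a hypothetical name, and splitting on any run of whitespace (rather than a single space) is an assumption made so that repeated spaces or tabs do not produce NA values:

```r
# Hypothetical helper: turn one line of space-separated text into numbers
parse_numbers <- function(line) {
  as.numeric(strsplit(trimws(line), "[[:space:]]+")[[1]])
}

# In an interactive session the line would come from the user:
#   x <- parse_numbers(readline("Enter values: "))
# Here the sample input from the question is used directly.
x <- parse_numbers("630 135 146 233 144 498 729 120 511 670")
mean(x)
sd(x)
```

Note that readline() only waits for input in an interactive session.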

R: Plots of subset still include excluded attributes, how do I draw a plot without them?

I am trying to draw a boxplot in R:
I have a dataset with 70 attributes:
The format is
patient number medical_speciality number_of_procedures
111 Ortho 21
232 Emergency 16
878 Pediatrics 20
981 OBGYN 31
232 Care of Elderly 15
211 Ortho 32
238 Care of Elderly 11
219 Care of Elderly 6
189 Emergency 67
323 Emergency 23
189 Pediatrics 1
289 Ortho 34
I have been trying to get a subset to only include emergency, pediatrics in a boxplot (there are 10000+ datapoints in reality)
I thought that I could just do this:
newdata<-subset(olddata[ms$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
plot(newdata)
Since if I do a summary of newdata, all it has is the pediatrics and emergency results. But when it comes to plotting it still includes the ortho, OBGYN, care of elderly in the x axis with no boxplot.
I presume that there is a way to do this in ggplot by doing
ggplot(newdata, aes(x=medical_speciality, y=num_of_procedures, fill=cond)) + geom_boxplot()
but this gives me the error:
Don't know how to automatically pick scale for object of type data.frame.
Defaulting to continuous
Error: Aesthetics must either be length one, or the same length as the dataProblems:cond
Can someone help me out?
I believe your problem comes from the fact that the column medical_speciality is a factor.
So, even though you subset your data the right way, you still get all the levels (including "Ortho", "OBGYN", etc...).
You can get rid of them by using the function droplevels:
newdata<-subset(olddata[ms$medical_specialty=='emergency'|olddata$medical_specialty=='pediatrics',])
newdata <- droplevels(newdata) ## THIS IS THE NEW ADDITION
plot(newdata)
Does this help?
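A small self-contained sketch of the effect, using a few made-up rows in the format from the question (the column names follow the table header shown there; the values are illustrative only):

```r
# Made-up data in the question's format; the specialty column is a factor
olddata <- data.frame(
  medical_speciality   = factor(c("Ortho", "Emergency", "Pediatrics",
                                  "OBGYN", "Emergency", "Pediatrics")),
  number_of_procedures = c(21, 16, 20, 31, 67, 1)
)

keep    <- olddata$medical_speciality %in% c("Emergency", "Pediatrics")
newdata <- olddata[keep, ]

before <- levels(newdata$medical_speciality)  # unused levels are still present
newdata <- droplevels(newdata)
after  <- levels(newdata$medical_speciality)  # only the levels actually in use

before
after

# The boxplot now has one box per remaining level and no empty slots
boxplot(number_of_procedures ~ medical_speciality, data = newdata)
```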

Object not found error

Firstly I need to explain that I've had MINIMAL training on R and have 0 knowledge of coding languages or programmes like R so please excuse me if I ask silly questions or don't understand something basic.
Also, I have tried to look at past topics/answers on this but I'm having a hard time relating the answers to my data so I apologise if this question has already been answered.
Basically I have a data set and I'm trying to find the mean of two variables (Peak flow before a walk in the cold, and peak flow after a walk in the cold) in this set. This is the entire code I've used so far:
drugs <- read.table(file = "C:\\Users\\Becky\\My Documents\\Asthmadata.txt", header = TRUE)
drugs
str(drugs)
mean.Asthmadata <- tapply (Asthmadata$trial1, list(Asthmadata$PEFR1), mean)
mean.Asthmadata
It works fine until the mean.Asthmadata. The data comes up in R just fine with the other codes but when I get to the mean and do the mean.Asthmadata [...] code, I keep getting the same error: "object 'mean.Asthmadata' not found"
My friend used the same code I did and it worked for him so I'm confused. Am I doing something wrong?
Thanks
EDIT:
@BenBolker
This is my data set
trial1 PEFR1 trial2 PEFR2
Before 310 After 299
Before 242 After 201
Before 340 After 232
Before 388 After 312
Before 294 After 221
Before 251 After 256
Before 391 After 327
Before 401 After 331
Before 287 After 231
And here's all the code I've used:
drugs <- read.table(file = "C:\\Users\\Becky\\My Documents\\Asthmadata.txt", header = TRUE)
drugs
str(drugs)
mean.drugs <- tapply (drugs$trial1, list(drugs$PEFR1), mean)
mean.drugs
The R version I have has two versions: i386 3.1.3, and x64 3.1.3 – I've tried both but neither seem to do what I want. I'm also using Windows 7 Home Premium 64bit. Hope I've included everything you need and I apologise if my formatting is off – I can't quite figure out how to format properly on here yet.
And the error I'm getting NOW is: “Error in split.default(X, group) : first argument must be a vector” when running the code Roland kindly provided. So I'm getting a different error each time I try it – it must be something I'm doing wrong.
Hope I've formatted that all correctly and included everything you need. Thanks :)
drugs <- read.table(header=TRUE,text="
trial1 PEFR1 trial2 PEFR2
Before 310 After 299
Before 242 After 201
Before 340 After 232
Before 388 After 312
Before 294 After 221
Before 251 After 256
Before 391 After 327
Before 401 After 331
Before 287 After 231")
In the current format you can calculate the mean before and after just by doing
mean(drugs$PEFR1)
and
mean(drugs$PEFR2)
What you may have had in mind was this shape:
drugs2 <- with(drugs,
data.frame(trial=c(as.character(trial1),
as.character(trial2)),
PEFR=c(PEFR1,PEFR2)))
(I used with() for convenience; it's a way to temporarily attach a data frame so you can refer directly to the variables therein.)
There's a bit of a pitfall in combining trial1 and trial2: as factors they get coerced to their numeric codes, which are all 1s in both cases, unless you use as.character().
Also, you had the order of the variable to aggregate and the variable to group by backwards (you want to aggregate PEFR by trial, not the other way around):
mean.drugs <- with(drugs2,
tapply (PEFR, list(trial), mean))
## After Before
## 267.7778 322.6667
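As for the original "object not found" message, a likely reading (an assumption, since the full console output is not shown) is that the data frame was stored under the name drugs, so no object called Asthmadata ever existed; the tapply() call failed, and mean.Asthmadata was therefore never created. A minimal sketch of that situation, meant to be run in a fresh session:

```r
drugs <- read.table(header = TRUE, text = "
trial1 PEFR1 trial2 PEFR2
Before 310 After 299
Before 242 After 201
Before 340 After 232")

# The file name does not become an object name; only 'drugs' was assigned
has_file_name_object <- exists("Asthmadata")
has_drugs            <- exists("drugs")
has_file_name_object
has_drugs

# Referring to the object that does exist works as expected
col_means <- colMeans(drugs[, c("PEFR1", "PEFR2")])
col_means
```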

Data dictionary packing in R

I am thinking of writing a data dictionary function in R which, taking a data frame as an argument, will do the following:
1) Create a text file which:
a. Summarises the data frame by listing the number of variables by class, number of observations, number of complete observations … etc
b. For each variable, summarise the key facts about that variable: mean, min, max, mode, number of missing observations … etc
2) Creates a pdf containing a histogram for each numeric or integer variable and a bar chart for each attribute variable.
The basic idea is to create a data dictionary of a data frame with one function.
My question is: is there a package which already does this? And if not, do people think this would be a useful function?
Thanks
There are a variety of describe functions in various packages. The one I am most familiar with is Hmisc::describe. Here's its description from its help page:
" This function determines whether the variable is character, factor, category, binary, discrete numeric, and continuous numeric, and prints a concise statistical summary according to each. A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed. A frequency table is printed for any non-binary variable if it has no more than 20 unique values. For any variable with at least 20 unique values, the 5 lowest and highest values are printed."
And an example of the output:
Hmisc::describe(work2[, c("CHOLEST","HDL")])
work2[, c("CHOLEST", "HDL")]
2 Variables 5325006 Observations
----------------------------------------------------------------------------------
CHOLEST
n missing unique Mean .05 .10 .25 .50 .75 .90
4410307 914699 689 199.4 141 152 172 196 223 250
.95
268
lowest : 0 10 19 20 31, highest: 1102 1204 1213 1219 1234
----------------------------------------------------------------------------------
HDL
n missing unique Mean .05 .10 .25 .50 .75 .90
4410298 914708 258 54.2 32 36 43 52 63 75
.95
83
lowest : -11.0 0.0 0.2 1.0 2.0, highest: 241.0 243.0 248.0 272.0 275.0
----------------------------------------------------------------------------------
Furthermore, on your point about getting histograms, the Hmisc::latex method for a describe object will produce histograms interleaved in the output illustrated above. (You do need to have a functioning LaTeX installation to take advantage of this.) I'm pretty sure you can find an illustration of the output either on Harrell's website or in the Amazon "Look Inside" presentation of his book "Regression Modeling Strategies". The book has a ton of useful material regarding data analysis.
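If none of the existing describe functions fits, the per-variable summary described in the question can be roughed out in base R. This is a minimal sketch only: `data_dictionary` is a hypothetical name, the choice of columns is an assumption, and the text file and pdf steps from the question are left out:

```r
# Hypothetical helper: one row of key facts per variable of a data frame
data_dictionary <- function(df) {
  rows <- lapply(names(df), function(nm) {
    x   <- df[[nm]]
    num <- is.numeric(x)
    data.frame(
      variable  = nm,
      class     = class(x)[1],
      n_missing = sum(is.na(x)),
      n_unique  = length(unique(x[!is.na(x)])),
      mean      = if (num) mean(x, na.rm = TRUE) else NA_real_,
      min       = if (num) min(x, na.rm = TRUE) else NA_real_,
      max       = if (num) max(x, na.rm = TRUE) else NA_real_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Usage on a built-in data set
dd <- data_dictionary(iris)
dd
```

The result is an ordinary data frame, so it could be written out with write.table() or extended with further columns as needed.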

Resources