Exclude zero values from a ggplot barplot? - r

Does anyone know if it is possible to exclude zero values from a bar plot in ggplot?
I have a dataset that contains proportions as follows:
X5employf prop X5employff
1 increase 0.02272727
2 increase 0.59090909 1
3 increase 0.02272727 1 and 8
4 increase 0.02272727 2
5 increase 0.34090909 3
6 increase 0.00000000 4
7 increase 0.00000000 5
8 increase 0.00000000 6
9 increase 0.00000000 6 and 7
10 increase 0.00000000 6 and 7
11 increase 0.00000000 7
12 increase 0.00000000 8
13 decrease 0.00000000
14 decrease 0.00000000 1
15 decrease 0.00000000 1 and 8
16 decrease 0.00000000 2
17 decrease 0.00000000 3
18 decrease 0.10000000 4
19 decrease 0.50000000 5
20 decrease 0.20000000 6
21 decrease 0.00000000 6 and 7
22 decrease 0.00000000 6 and 7
23 decrease 0.10000000 7
24 decrease 0.10000000 8
25 same 0.00000000
26 same 0.00000000 1
27 same 0.00000000 1 and 8
28 same 0.00000000 2
29 same 0.00000000 3
30 same 0.21052632 4
31 same 0.31578947 5
32 same 0.26315789 6
33 same 0.15789474 6 and 7
34 same 0.00000000 6 and 7
35 same 0.05263158 7
36 same 0.00000000 8
As you can see, the 'prop' column contains a lot of zero values. I am producing a faceted bar plot with the 'X5employf' column as the facet, but because of the zero values I end up with a lot of empty space on my plot (see below). Is there a way of forcing ggplot not to plot the zero values? It's not a case of dropping unused factor levels, as these are not NA values but 0s. Any ideas?

For your plot, simply use which to specify that you only want to use the subset of the dataframe containing non-zero proportions. This way you don't have to modify your original dataframe. Then, specify "free_x" in your scales argument within facet_grid to get rid of your empty space in your faceted plot.
plot <- ggplot(df[which(df$prop > 0), ], aes(X5employff, prop)) +
  geom_bar(aes(fill = X5employff), stat = "identity") +
  facet_grid(~ X5employf, scales = "free_x") +
  theme_bw()
plot
Note that I replaced the blank fields with "blank" for the sake of quick import into R from Excel.
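Equivalently, subset() does the same row filtering without which(), again leaving the original data frame untouched. A small base-R sketch, using a toy stand-in for the real data:

```r
# toy stand-in for the real data; only the columns that matter here
df <- data.frame(
  X5employf  = c("increase", "increase", "decrease"),
  prop       = c(0.02, 0.00, 0.10),
  X5employff = c("1", "2", "4")
)

# keep only rows with a non-zero proportion; equivalent to df[which(df$prop > 0), ]
df_nonzero <- subset(df, prop > 0)
```

The subset() form is often easier to read inside a ggplot() call, though which() behaves identically here.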

I'm unsure whether there is a way to set ignored values in ggplot. However, you could consider simply recoding the 0s in that column to NA:
df$prop[df$prop == 0] <- NA

Related

How to turn an rpart object into a dendrogram? (as.dendrogram.rpart?)

I would like a way to turn an rpart tree object into a nested list of lists (a dendrogram). Ideally, the attributes in each node will include the information in the rpart object (impurity, variable and rule that is used for splitting, the number of observations funneled to that node, etc.).
Looking at the rpart$frame object, it is not clear to me how to read it. Any suggestions?
Tiny example:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$frame
var n wt dev yval complexity ncompete nsurrogate yval2.V1 yval2.V2 yval2.V3 yval2.V4 yval2.V5 yval2.nodeprob
1 Start 81 81 17 1 0.17647059 2 1 1.00000000 64.00000000 17.00000000 0.79012346 0.20987654 1.00000000
2 Start 62 62 6 1 0.01960784 2 2 1.00000000 56.00000000 6.00000000 0.90322581 0.09677419 0.76543210
4 <leaf> 29 29 0 1 0.01000000 0 0 1.00000000 29.00000000 0.00000000 1.00000000 0.00000000 0.35802469
5 Age 33 33 6 1 0.01960784 2 2 1.00000000 27.00000000 6.00000000 0.81818182 0.18181818 0.40740741
10 <leaf> 12 12 0 1 0.01000000 0 0 1.00000000 12.00000000 0.00000000 1.00000000 0.00000000 0.14814815
11 Age 21 21 6 1 0.01960784 2 0 1.00000000 15.00000000 6.00000000 0.71428571 0.28571429 0.25925926
22 <leaf> 14 14 2 1 0.01000000 0 0 1.00000000 12.00000000 2.00000000 0.85714286 0.14285714 0.17283951
23 <leaf> 7 7 3 2 0.01000000 0 0 2.00000000 3.00000000 4.00000000 0.42857143 0.57142857 0.08641975
3 <leaf> 19 19 8 2 0.01000000 0 0 2.00000000 8.00000000 11.00000000 0.42105263 0.57894737 0.23456790
(the function ggdendro:::dendro_data.rpart might be helpful somehow, but I couldn't get it to really solve the problem)
Here is a GitHub gist with the function rpart2dendro for converting an object of class "rpart" to a dendrogram. Note that branches are not weighted in the output object, but it should be fairly straightforward to recursively modify the "height" attributes of the dendrogram to get proportional branch lengths. The Kyphosis example is provided at the bottom.
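As background for reading fit$frame: the row names are rpart's node numbers, and the children of node k (when they exist) are nodes 2k and 2k + 1, so the tree topology can be recovered from the row names alone. A minimal sketch with the Kyphosis fit from the question:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# row names of fit$frame are the node numbers of the (implicit) binary tree
nodes   <- as.integer(rownames(fit$frame))
is_leaf <- fit$frame$var == "<leaf>"

# children of node k, when present, are nodes 2k and 2k + 1
children <- lapply(nodes, function(k) intersect(c(2 * k, 2 * k + 1), nodes))
```

This parent/child arithmetic is what a recursive rpart-to-dendrogram conversion walks over.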

Ada in R giving me single classification

I am using the function ada in R, and I'm having a little difficulty. I have training data that looks like this
V13 V15 V17 V19
1 0.017241379 0.471264368 0.01449275 0.24637681
2 0.255813953 0.011627907 0.06849315 0.05479452
3 0.040000000 0.400000000 0.06000000 0.10000000
4 0.500000000 0.000000000 0.05128205 0.00000000
5 0.102040816 0.367346939 0.05769231 0.19230769
6 0.561403509 0.105263158 0.11111111 0.00000000
7 0.300813008 0.048780488 0.12222222 0.03333333
8 0.000000000 0.714285714 0.14285714 0.07142857
9 0.328947368 0.013157895 0.01492537 0.00000000
10 0.536585366 0.060975610 0.16071429 0.03571429
11 0.338461538 0.030769231 0.11764706 0.03921569
12 0.033898305 0.322033898 0.11764706 0.21568627
This is what I have stored in the variable
matrix.x
Then I have the response variables y
y
1 1
2 -1
3 1
4 -1
5 1
6 -1
7 -1
8 1
9 -1
10 -1
11 -1
12 1
I simply run the following:
ada.obj = ada(matrix.x, y)
And then
ada.pred = predict(ada.obj, matrix.x)
And for some reason, I get a matrix with all 1s or all -1s. What am I doing wrong? Ideally, I want the ada.pred to spit out the exact classifications of the training data.
Thanks.
Also, how would I go about using the AdaBoost.M1 function in the caret package of R?
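For the caret part, a hedged sketch (assuming the caret package and the adabag backend that its "AdaBoost.M1" method wraps are installed, and matrix.x / y as above; note the response must be a factor, which is also worth checking in the plain ada() call):

```r
library(caret)

# caret's "AdaBoost.M1" method wraps adabag::boosting; response must be a factor
dat <- data.frame(matrix.x, y = factor(y))
ada.caret <- train(y ~ ., data = dat, method = "AdaBoost.M1")
predict(ada.caret, newdata = dat)
```

With only 12 rows, expect resampling warnings; this sketch shows the call shape, not a tuned model.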

Changing the scale of x-axis using scale_x_continuous

I am creating a scatter plot using ggplot2. The default gives me an x axis that has every value from 0 to 30. I'd prefer to have it go by 5s or something like that. I have been trying to use scale_x_continuous(), but I get this:
Error: Discrete value supplied to continuous scale
Here is the code that I am trying to work with:
Daily.Average.plot <- ggplot(data = Daily.average, aes(factor(Day), Mass)) +
  geom_point(aes(color = factor(Temp))) +
  scale_x_continuous(breaks = seq(0, 30, 5))
Daily.Average.plot
When I run this without the scale_x_continuous I get a graph that looks fine with no errors, just the incorrect x axis. All of the columns in the data set are numeric when I check str(), if that matters. Do I have an error in my code, or should I be using something different to change the scale?
Here is a sample of my data set:
N Day Mass Temp
1 1 0.00000000 5
2 2 0.00000000 5
3 3 0.07692308 5
4 4 0.07692308 5
5 5 0.07692308 5
6 6 0.15384615 5
7 7 0.15384615 5
8 8 0.23076923 5
9 9 0.38461538 5
10 10 0.46153846 5
11 1 0.00000000 10
12 2 0.00000000 10
13 3 0.00000000 10
14 4 0.09090909 10
15 5 0.09090909 10
16 6 0.54545455 10
17 7 0.54545455 10
18 8 0.63636364 10
19 9 0.90909091 10
20 10 1.36363636 10
21 1 0.00000000 15
22 2 0.07692308 15
23 3 0.61538462 15
24 4 0.76923077 15
25 5 0.76923077 15
26 6 1.23076923 15
27 7 1.69230769 15
28 8 2.07692308 15
29 9 2.46153846 15
30 10 3.07692308 15
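The error above most likely comes from factor(Day): wrapping Day in factor() makes the x axis discrete, and scale_x_continuous() then refuses it. A hedged sketch of the likely fix, using a tiny stand-in for Daily.average (Day left numeric, as str() suggests it is):

```r
library(ggplot2)

# tiny stand-in for Daily.average, matching the structure shown above
Daily.average <- data.frame(Day  = rep(1:10, 2),
                            Mass = runif(20),
                            Temp = rep(c(5, 10), each = 10))

# keep Day numeric so the x axis stays continuous; only the color is a factor
p <- ggplot(Daily.average, aes(Day, Mass)) +
  geom_point(aes(color = factor(Temp))) +
  scale_x_continuous(breaks = seq(0, 30, 5))
```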

How to use variable names as arguments

For a homework assignment, I wrote a function that performs forward stepwise regression. It takes 3 arguments: the dependent variable, a list of potential independent variables, and the data frame in which these variables are found. Currently all of my inputs except the data frame, including the list of independent variables, are strings.
Many built-in functions, as well as functions from high-profile packages, allow for variable inputs that are not strings. Which way is best-practice and why? If non-string is best practice, how can I implement this considering that one of the arguments is a list of variables in the data frame, not a single variable?
Personally I don't see any problem with using strings if it accomplishes what you need it to. If you want, you could rewrite your function to take a formula as input rather than strings to designate independent and dependent variables. In this case your function calls would look like this:
fitmodel(x ~ y + z, data)
rather than this:
fitmodel("x", list("y", "z"), data)
Using formulas would allow you to specify simple algebraic combinations of variables to use in your regression, like x ~ y + log(z). If you go this route, then you can build the data frame specified by the formula by calling model.frame and then use this new data frame to run your algorithm. For example:
> df<-data.frame(x=1:10,y=10:1,z=sqrt(1:10))
> model.frame(x ~ y + z,df)
x y z
1 1 10 1.000000
2 2 9 1.414214
3 3 8 1.732051
4 4 7 2.000000
5 5 6 2.236068
6 6 5 2.449490
7 7 4 2.645751
8 8 3 2.828427
9 9 2 3.000000
10 10 1 3.162278
> model.frame(x ~ y + z + I(x^2) + log(z) + I(x*y),df)
x y z I(x^2) log(z) I(x * y)
1 1 10 1.000000 1 0.0000000 10
2 2 9 1.414214 4 0.3465736 18
3 3 8 1.732051 9 0.5493061 24
4 4 7 2.000000 16 0.6931472 28
5 5 6 2.236068 25 0.8047190 30
6 6 5 2.449490 36 0.8958797 30
7 7 4 2.645751 49 0.9729551 28
8 8 3 2.828427 64 1.0397208 24
9 9 2 3.000000 81 1.0986123 18
10 10 1 3.162278 100 1.1512925 10
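Once you have the model frame, base R can also split it into the response and the design matrix for the regression step itself; a small sketch using only base stats functions:

```r
df <- data.frame(x = 1:10, y = 10:1, z = sqrt(1:10))
mf <- model.frame(x ~ y + z, df)

resp  <- model.response(mf)           # the dependent variable column
preds <- model.matrix(x ~ y + z, mf)  # design matrix, intercept included
```

These are the same pieces lm() extracts internally, so a custom stepwise routine built on them will accept the same formulas lm() does.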

R Create Column as Running Average of Another Column

I want to create a column in R that is simply the average of all previous values of another column. For Example:
D
X
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I would like D$Y to be the prior average of D$X; that is, D$Y is the average of all previous observations of D$X. I know how to do this using a for loop moving through every row, but is there a more efficient way?
I have a large dataset and hardware not up to that task!
You can generate cumulative means of a vector like this:
set.seed(123)
x<-sample(20)
x
## [1] 6 15 8 16 17 1 18 12 7 20 10 5 11 9 19 13 14 4 3 2
xmeans<-cumsum(x)/1:length(x)
xmeans
## [1] 6.000000 10.500000 9.666667 11.250000 12.400000 10.500000 11.571429
## [8] 11.625000 11.111111 12.000000 11.818182 11.250000 11.230769 11.071429
## [15] 11.600000 11.687500 11.823529 11.388889 10.947368 10.500000
So D$Y<-cumsum(D$X)/1:nrow(D) should work.
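One detail worth flagging: cumsum(x)/seq_along(x) is the running mean including the current row. If D$Y should be the average of strictly previous rows, shift the result by one (base R sketch):

```r
x <- 1:10
run_mean   <- cumsum(x) / seq_along(x)   # mean of rows 1..i (includes current row)
prior_mean <- c(NA, head(run_mean, -1))  # mean of rows 1..(i-1); NA for row 1
```

Both forms are vectorized, so either avoids the row-by-row loop on a large dataset.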
