ggplot with data frame columns - r

I am totally lost with ggplot. I've tried various solutions, but none worked. Using the numbers below, I want to create a line graph with three lines, one each for df$c, df$d, and df$e, with df$a on the x-axis and the cumulative probability on the y-axis, where 95 = 100%.
a b c d e
1 0 18 0.047368421 0.036842105 0.005263158
2 1 20 0.047368421 0.036842105 0.010526316
13 2 26 0.052631579 0.031578947 0.026315789
20 3 35 0.084210526 0.036842105 0.031578947
22 4 41 0.068421053 0.052631579 0.047368421
24 5 88 0.131578947 0.068421053 0.131578947
26 7 90 0.131578947 0.068421053 0.136842105
27 8 93 0.126315789 0.068421053 0.147368421
28 9 96 0.126315789 0.073684211 0.152631579
3 10 115 0.105263158 0.078947368 0.210526316
4 11 116 0.105263158 0.084210526 0.210526316
5 12 120 0.094736842 0.084210526 0.226315789
6 13 128 0.105263158 0.073684211 0.247368421
7 14 129 0.100000000 0.073684211 0.252631579
8 15 154 0.031578947 0.042105263 0.368421053
9 16 155 0.031578947 0.036842105 0.373684211
10 17 158 0.036842105 0.036842105 0.378947368
11 18 161 0.036842105 0.031578947 0.389473684
12 19 163 0.026315789 0.031578947 0.400000000
14 20 169 0.026315789 0.021052632 0.421052632
15 21 171 0.015789474 0.021052632 0.431578947
16 22 174 0.010526316 0.021052632 0.442105263
17 24 176 0.010526316 0.021052632 0.447368421
18 25 186 0.005263158 0.005263158 0.484210526
19 26 187 0.005263158 0.000000000 0.489473684
21 35 188 0.005263158 0.005263158 0.489473684
23 40 189 0.005263158 0.000000000 0.494736842
25 60 190 0.000000000 0.000000000 0.500000000
I was somewhat successful using base R:
plot(df$a, df$c, type="l",col="red")
lines(df$a, df$d, col="green")
lines(df$a, df$e, col="blue")

You first need to melt your data so that one column designates which variable each value comes from (call it variable) and another column holds the actual value (call it value). Study the example below to see what happens to the variables from the original data.frame that you want to keep constant.
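library(reshape2)
library(ggplot2)

# xy is the question's data frame (df); a stays as the id column
xymelt <- melt(xy, id.vars = "a")

# All series in one panel, one color per variable
ggplot(xymelt, aes(x = a, y = value, color = variable)) +
  theme_bw() +
  geom_line()

# One panel per variable
ggplot(xymelt, aes(x = a, y = value)) +
  theme_bw() +
  geom_line() +
  facet_wrap(~ variable)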
This code also draws the column called "b" from your data, which you did not ask for. You can remove it before melting, after melting, or just before plotting, or simply plot it.
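If only c, d and e should be drawn, one option is to name them explicitly when melting. The following is a minimal sketch under assumptions: it retypes the first four rows of the question's data so that it runs on its own, and it relies on the measure.vars argument of reshape2's melt() to leave b out.

```r
library(reshape2)
library(ggplot2)

# First four rows of the question's data, retyped so the sketch is self-contained
xy <- data.frame(
  a = c(0, 1, 2, 3),
  b = c(18, 20, 26, 35),
  c = c(0.047368421, 0.047368421, 0.052631579, 0.084210526),
  d = c(0.036842105, 0.036842105, 0.031578947, 0.036842105),
  e = c(0.005263158, 0.010526316, 0.026315789, 0.031578947)
)

# Melt only the three series of interest; b is never turned into a "variable" level
xymelt <- melt(xy, id.vars = "a", measure.vars = c("c", "d", "e"))

# Same plot as in the answer, now with three lines instead of four
p <- ggplot(xymelt, aes(x = a, y = value, color = variable)) +
  theme_bw() +
  geom_line()
```

Printing p draws the plot. Dropping the column up front with xy[, c("a", "c", "d", "e")] should give the same long table.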

Related

Creating boxplot based on some conditions

Data given are a sample of cholesterol levels taken from 24 hospital employees who were on a standard American diet and who agreed to adopt a vegetarian diet for 1 month. Serum-cholesterol measurements were made before adopting the diet and 1 month after.
Subject Before After Difference
1 1 195 146 49
2 2 145 155 -10
3 3 205 178 27
4 4 159 146 13
5 5 244 208 36
6 6 166 147 19
7 7 250 202 48
8 8 236 215 21
9 9 192 184 8
10 10 224 208 16
11 11 238 206 32
12 12 197 169 28
13 13 169 182 -13
14 14 158 127 31
15 15 151 149 2
16 16 197 178 19
17 17 180 161 19
18 18 222 187 35
19 19 168 176 -8
20 20 168 145 23
21 21 167 154 13
22 22 161 153 8
23 23 178 137 41
24 24 137 125 12
Now here is the question I am trying to answer. Some investigators believe that the effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. If you split the data according to whether baseline cholesterol is above or below the median, can you comment descriptively on this issue?
I am thinking of creating a boxplot based on two categories, and I wish to use dplyr for the data manipulation. I will create a new column based on whether Before is less than or greater than the median of Before, giving a new character vector with "high" for high Before cholesterol and "low" for low Before cholesterol. Then I will draw a boxplot of Difference grouped by that new column. Here is my code; the original data set is called df2.
library(dplyr)
library(ggplot2)

df2 %>%
  mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
  group_by(new_col) %>%
  ggplot(aes(x = new_col, y = Difference)) +
  geom_boxplot()
And following is the boxplot I get
So, based on this, I conclude that investigators are right and effects of diet on cholesterol are more evident in people with high rather than low cholesterol levels. I want to know if this can be done more effectively.
This is more of a statistics question than a programming question, so it belongs on stats.stackexchange rather than Stack Overflow.
Anyway, categorizing a variable at its median is not a recommended way of visualizing associations, because it discards a lot of information. You can read about this in this very good article by Peter Flom.
It is better to keep all the points and apply some spline or smoothing algorithm.
For instance, you could consider something like this:
ggplot(df2, aes(x = Before, y = Difference)) +
  geom_point() +
  geom_smooth()
Here, the relationship is clearly visible, and all the information is kept.
If you really have to generate subgroups, you could also try something like this:
df2 %>%
  mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
  ggplot(aes(x = Before, y = Difference, group = new_col, color = new_col)) +
  geom_point() +
  geom_smooth(span = 3) # try some other values here
However, using the median is still not a very good idea, especially with so few data points. You might want to assess the functional form of the relationship, but that would need a specific question on stats.stackexchange.com.
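For the descriptive comment itself, a small numeric summary can sit next to the plots. This is only a sketch, assuming the data frame is called df2 as in the question (retyped here so the code runs on its own). The names group_summary and rho are made up for illustration, and the Spearman correlation is one possible choice of summary that uses every point, not something the posts above prescribe.

```r
library(dplyr)

# The 24 Before/Difference pairs from the question, retyped so the sketch is self-contained
df2 <- data.frame(
  Subject = 1:24,
  Before = c(195, 145, 205, 159, 244, 166, 250, 236, 192, 224, 238, 197,
             169, 158, 151, 197, 180, 222, 168, 168, 167, 161, 178, 137),
  Difference = c(49, -10, 27, 13, 36, 19, 48, 21, 8, 16, 32, 28,
                 -13, 31, 2, 19, 19, 35, -8, 23, 13, 8, 41, 12)
)

# Per-subgroup summary of the change, using the same median split as the question
group_summary <- df2 %>%
  mutate(new_col = if_else(Before < median(Before), "low", "high")) %>%
  group_by(new_col) %>%
  summarise(n = n(),
            mean_diff = mean(Difference),
            median_diff = median(Difference),
            sd_diff = sd(Difference))

# Rank correlation between baseline and change, computed on all 24 points
rho <- cor(df2$Before, df2$Difference, method = "spearman")
```

Printing group_summary and rho gives numbers to quote in a write-up. Which summary is appropriate is the statistical question that the answer above suggests taking to stats.stackexchange.com.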
Not really an answer, but more of a different approach to visualising the data:
library( data.table )
library( ggplot2 )
DT.melt <- melt( DT, id.vars = "Subject", measure.vars = c( "Before", "After" ) )
ggplot() +
  geom_line( data = DT.melt,
             aes( x = variable, y = value, group = Subject ) ) +
  geom_line( data = DT.melt[, .(mean = mean(value)), by = variable ],
             aes( x = variable, y = mean, group = 1 ), color = "red", size = 2 ) +
  labs( x = "", y = "" )
sample data used
DT <- fread(" Subject Before After Difference
1 195 146 49
2 145 155 -10
3 205 178 27
4 159 146 13
5 244 208 36
6 166 147 19
7 250 202 48
8 236 215 21
9 192 184 8
10 224 208 16
11 238 206 32
12 197 169 28
13 169 182 -13
14 158 127 31
15 151 149 2
16 197 178 19
17 180 161 19
18 222 187 35
19 168 176 -8
20 168 145 23
21 167 154 13
22 161 153 8
23 178 137 41
24 137 125 12")

How to use NSE inside fct_reorder() in ggplot2

I would like to know how to use NSE (Non-Standard Evaluation) expression in fct_reorder() in ggplot2 to replicate charts for different data frames.
This is an example of data frame that I use to draw a chart:
travel_time_br30 travel_time_br30_int time_reduction shift not_shift total
1 0-30 0 10 2780 3268 6048
2 0-30 0 20 2779 3269 6048
3 0-30 0 30 2984 3064 6048
4 0-30 0 40 3211 2837 6048
5 30-60 30 10 2139 2007 4146
6 30-60 30 20 2159 1987 4146
7 30-60 30 30 2363 1783 4146
8 30-60 30 40 2478 1668 4146
9 60-90 60 10 764 658 1422
10 60-90 60 20 721 701 1422
11 60-90 60 30 782 640 1422
12 60-90 60 40 801 621 1422
13 90-120 90 10 296 224 520
14 90-120 90 20 302 218 520
15 90-120 90 30 317 203 520
16 90-120 90 40 314 206 520
17 120-150 120 10 12 10 22
18 120-150 120 20 10 12 22
19 120-150 120 30 10 12 22
20 120-150 120 40 13 9 22
21 150-180 150 10 35 21 56
22 150-180 150 20 40 16 56
23 150-180 150 30 40 16 56
24 150-180 150 40 35 21 56
share
1 45.96561
2 45.94907
3 49.33862
4 53.09193
5 51.59190
6 52.07429
7 56.99469
8 59.76845
9 53.72714
10 50.70323
11 54.99297
12 56.32911
13 56.92308
14 58.07692
15 60.96154
16 60.38462
17 54.54545
18 45.45455
19 45.45455
20 59.09091
21 62.50000
22 71.42857
23 71.42857
24 62.50000
These are the scripts to draw a chart from above data frame:
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"
# x.var (the x-axis column name) is not defined in the question
test %>% ggplot(., aes_(x = as.name(x.var), y = as.name("share"), group = as.name(g.var))) +
  geom_line(size = 1.4, aes(
    color = fct_reorder(travel_time_br30, order(travel_time_br30_int))))
As I have several data frames which have different fields, such as access_time_br30 and access_time_br30_int instead of travel_time_br30 and travel_time_br30_int, I set two variables (g.var and go.var) so that I can easily replicate multiple charts with the same scripts.
As I need to reorder the factor groups numerically, in particular ordering travel_time_br30 by travel_time_br30_int, I am using fct_reorder() inside ggplot(., aes_(...)). However, if I use aes_ with fct_reorder() in geom_line(), as in the following script, it returns the error: Error: `f` must be a factor (or character vector).
geom_line(size=1.4, aes_(color=fct_reorder(as.name(g.var),order(as.name(go.var)))))
fct_reorder() does not seem to have an NSE version like fct_reorder_().
Is it impossible to use both aes_ and fct_reorder() in a sequence of scripts or are there any other solutions?
Based on my novice working knowledge of tidy-eval, you could transform your factor order in mutate() before passing the data into ggplot() and achieve your result.
Sorry, I couldn't easily read in your table above because of the line wrapping, so I made a new example based on mtcars that I think captures your intent (let me know if it doesn't).
library(dplyr)
library(forcats)
library(ggplot2)
library(rlang)

mtcars2 <- mutate(mtcars,
                  gear_int = 6 - gear,
                  gear_intrev = rev(gear_int)) %>%
  mutate_at(vars(cyl, gear), as.factor)

gg_reorder <- function(data, col_var, col_order) {
  eq_var <- sym(col_var) # sym is flexible and my novice preference
  eq_ord <- sym(col_order)
  data %>%
    mutate(!!quo_name(eq_var) := fct_reorder(!!eq_var, !!eq_ord)) %>%
    ggplot(aes_(~mpg, ~hp, color = eq_var)) +
    geom_line()
}
And now put it to use plotting...
gg_reorder(mtcars2, "gear", "gear_int")
gg_reorder(mtcars2, "gear", "gear_intrev")
I didn't specify all of the aes_() variables as strings but you could pass those as text and use the as.name() pattern. If you want more tidy-eval patterns Edwin Thoen wrote up a bunch of common cases.
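Another route, sketched here under assumptions, is the .data pronoun that ggplot2 (3.0.0 and later) accepts inside aes(), which lets column names held in strings be used without aes_(). The data frame below retypes six rows of the question's table. x.var is set to "time_reduction" purely as a guess, since the question never defines it. Note that the numeric column itself is passed to fct_reorder() rather than order() of it, because fct_reorder() sorts the levels by a summary (the median by default) of that second argument.

```r
library(ggplot2)
library(forcats)

# Six rows of the question's data (rows 1, 2, 5, 6, 17, 18), retyped for a runnable sketch
test <- data.frame(
  travel_time_br30 = c("0-30", "0-30", "30-60", "30-60", "120-150", "120-150"),
  travel_time_br30_int = c(0, 0, 30, 30, 120, 120),
  time_reduction = c(10, 20, 10, 20, 10, 20),
  share = c(45.96561, 45.94907, 51.59190, 52.07429, 54.54545, 45.45455)
)

x.var <- "time_reduction"        # assumption: not defined in the question
g.var <- "travel_time_br30"
go.var <- "travel_time_br30_int"

# Outside ggplot, the same reordering can be checked directly on the columns
reordered <- fct_reorder(test[[g.var]], test[[go.var]])

# .data[[...]] looks a column up by the string stored in the variable
p <- ggplot(test, aes(x = .data[[x.var]], y = share, group = .data[[g.var]])) +
  geom_line(aes(color = fct_reorder(.data[[g.var]], .data[[go.var]]))) +
  labs(color = g.var)
```

Here levels(reordered) puts "120-150" last instead of in its alphabetical position between "0-30" and "30-60". Swapping g.var and go.var for access_time_br30 and access_time_br30_int should reuse the same plotting code on the other data frames.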

Plot histogram by first sorting data and then dividing x values into bins in R

I have a dataset in a given format:
USER.ID avgfrequency
1 3 3.7821782
2 7 14.7500000
3 9 13.4761905
4 13 5.1967213
5 16 6.7812500
6 26 41.7500000
7 49 13.6666667
8 50 7.0000000
9 51 1.0000000
10 52 17.7500000
11 69 4.5000000
12 75 9.9500000
13 91 84.2000000
14 98 8.0185185
15 138 14.2000000
16 139 34.7500000
17 149 7.6666667
18 155 35.3333333
19 167 24.0000000
20 170 7.3529412
21 171 4.4210526
22 175 6.5781250
23 176 19.2857143
24 177 10.4864865
25 178 28.0000000
26 180 4.8461538
27 183 25.5000000
28 184 13.0000000
29 210 32.0000000
30 215 13.4615385
31 220 11.3611111
32 223 26.2500000
I want to first sort the dataset by avgfrequency and then I want to plot count of USER.ID's that fall under different bin categories.
I want to divide avgfrequency into different bin categories of width 10.
I am trying to sort data using:
user_avgfrequency <- user_avgfrequency[order(user_avgfrequency[,1]), ]
but getting an error.
df <- data.frame(
  USER.ID = c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223),
  avgfrequency = c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000)
)
# Bin edges every 10 units, up to the first multiple of 10 at or above the maximum
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, 10)
# One color per bin
cols <- colorRampPalette(c('blue', 'green', 'red'))(length(breaks) - 1)
hist(df$avgfrequency, breaks, col = cols, axes = FALSE, xlab = 'Average Frequency', ylab = 'Count')
axis(1, breaks)
axis(2, 0:max(tabulate(cut(df$avgfrequency, breaks))))
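The histogram call does the binning internally, so no explicit sort is needed. If the sorted table and the per-bin counts are wanted as well, the following base R sketch shows one way, reusing the same data and the same width-10 breaks. The names df_sorted and bin_counts are illustrative only. The sort uses order() on the avgfrequency column, whereas the question's attempt ordered by column 1, which in the printed data is USER.ID.

```r
df <- data.frame(
  USER.ID = c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223),
  avgfrequency = c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000)
)

# Sort rows by the avgfrequency column, smallest first
df_sorted <- df[order(df$avgfrequency), ]

# Same bin edges as the hist() call: 0, 10, ..., 90
breaks <- seq(0, ceiling(max(df_sorted$avgfrequency) / 10) * 10, by = 10)

# Label each row with its bin, e.g. "(0,10]", then count USER.IDs per bin
df_sorted$bin <- cut(df_sorted$avgfrequency, breaks = breaks)
bin_counts <- table(df_sorted$bin)

# Bar chart of the counts; empty bins are kept because cut() returns a factor
barplot(bin_counts, xlab = "Average Frequency", ylab = "Count", las = 2)
```

Because cut() defaults to right-closed intervals, a value of exactly 10 would land in "(0,10]"; none of the values here sit on a boundary.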

Plot multiple lines from dataframe in R

I have some data in a single dataframe. It represents several days' worth of data broken down by age within each day. What I'm looking to do is plot the Value (data points) for each age (y axis) by day (x axis). The frame is set up like this:
Age day Value
1 13 15 139
2 14 15 198
3 15 15 287
4 16 15 404
5 17 15 439
6 18 15 323
7 19 15 255
8 13 16 135
9 14 16 202
10 15 16 309
11 16 16 380
12 17 16 451
13 18 16 366
14 19 16 256
15 13 17 117
16 14 17 208
17 15 17 303
18 16 17 392
19 17 17 410
20 18 17 359
21 19 17 246
Thus, 13 would plot from 139 to 135 to 117 over the three day period. I'm trying to use ggplot2, and am having trouble with the syntax. The end result should plot lines with different color by age.
So far I've tried this:
ggplot(d, aes(x=day, y=Age, color=Value, group=Age)) + geom_line()
But this yields an empty plot and this error message: geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
What am I missing?
Not quite sure by your wording what you're after...
I think it's this...
ggplot(df, aes(day, Value, group=factor(Age), color=factor(Age))) + geom_line()
This plots day against Value, with a separate line for each Age. Is that what you want?
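A self-contained version of that call, as a sketch: the data frame is retyped from the question under its original name d, and n_lines is a made-up helper variable that counts the groups ggplot2 actually builds. Compared with the question's attempt, Value moves to y, while Age is used (as a factor) for both grouping and color.

```r
library(ggplot2)

# The question's 21 rows: ages 13 to 19 on each of days 15, 16 and 17
d <- data.frame(
  Age = rep(13:19, times = 3),
  day = rep(15:17, each = 7),
  Value = c(139, 198, 287, 404, 439, 323, 255,
            135, 202, 309, 380, 451, 366, 256,
            117, 208, 303, 392, 410, 359, 246)
)

# One line per age: day on x, Value on y, Age as a discrete color
p <- ggplot(d, aes(x = day, y = Value, group = factor(Age), color = factor(Age))) +
  geom_line() +
  labs(color = "Age")

# Number of distinct line groups in the built plot
n_lines <- length(unique(ggplot_build(p)$data[[1]]$group))
```

With this data n_lines should equal the number of distinct ages, so each age gets its own three-point line.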

ggplot:boxplot:presentations

I have the following output:
t1 t2 res
103 19 28.66667
222 49 28.66667
140 36 28.66667
102 33 24.66667
88 37 24.66667
38 22 24.66667
34 19 36.00000
102 25 36.00000
506 25 36.00000
73 9 39.00000
55 17 39.00000
34 17 39.00000
20 22 38.33333
50 67 38.33333
30 19 38.33333
27 15 34.00000
40 21 34.00000
35 16 34.00000
34 17 37.00000
22 29 37.00000
12 30 37.00000
25 39 26.33333
20 53 26.33333
22 20 26.33333
I have plotted boxplots of both t1 and t2 on the y-axis against res on the x-axis, after reshaping the data with melt. My question is how to choose the fill color for each of them, and whether it is possible to change the fill to a grid or hatched pattern so that, if I print the graph in black and white, I can still tell the t1 and t2 boxplots apart.
Below is my code. It generates different colors automatically, but I want to be able to choose them:
ggplot(df_melted, aes(x = factor(res), y =value, fill=variable)) +
geom_boxplot(las=1,varwidth=T,border="black",col="red",medlwd=3,whiskcol="black",staplecol="blue",top=T)+
coord_cartesian(ylim = c(0, 200))
Note: df_melted is the data after applying melt command.
scale_fill_grey and theme_bw could be what you're after.
Try this:
ggplot(df_melted, aes(x = factor(res), y = value, fill = variable)) +
  geom_boxplot() +
  scale_fill_grey(start = .5, end = .9) +
  theme_bw()
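To pick the fills by hand rather than take the grey ramp, scale_fill_manual() accepts a named vector of colors. The sketch below makes some assumptions: df_melted is rebuilt from the first six rows of the question's data, and the two fills (white and a dark grey) are arbitrary choices meant to stay distinguishable in a black-and-white print. It does not cover pattern or hatched fills.

```r
library(ggplot2)

# First six rows of the question's data, already in long ("melted") form
df_melted <- data.frame(
  res = rep(c(28.66667, 24.66667), each = 3, times = 2),
  variable = rep(c("t1", "t2"), each = 6),
  value = c(103, 222, 140, 102, 88, 38,   # t1
            19, 49, 36, 33, 37, 22)       # t2
)

# Named vector: one hand-picked fill per level of `variable`
fill_colours <- c(t1 = "white", t2 = "grey40")

p <- ggplot(df_melted, aes(x = factor(res), y = value, fill = variable)) +
  geom_boxplot(colour = "black") +
  scale_fill_manual(values = fill_colours) +
  theme_bw()
```

Any valid R color names or hex codes can go into fill_colours, as long as the names match the values in the variable column.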
