frequency plot using ggplot hangs or not showing plot


I have a dataframe with 880,000 rows and 2 columns ('width', 'group') in the following form:
width group
20 a
25 a
20 a
25 a
35 b
40 c
20 d
25 d
I want to draw a frequency polygon for each of the four groups in the same figure, but so far I have been unsuccessful.
df1 = cbind(ceiling(rnorm(20, 30,5)), 'a')
df2 = cbind(ceiling(rnorm(40, 80,10)), 'b')
df3 = cbind(ceiling(rnorm(30, 50,8)), 'c')
df4 = cbind(ceiling(rnorm(35, 30,7)), 'd')
dfrm = rbind(df1,rbind(df2,rbind(df3,df4)))
colnames(dfrm)=c('width', 'group')
dfrm = as.data.frame(dfrm)
qplot(width, data = dfrm, geom="freqpoly", binwidth = 100) #not showing any plot
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 1000) #create more than four plots
I need to draw something similar to the following:
http://had.co.nz/ggplot2/graphics/996ae62d750dfccac8805fa0c87168cc.png
Or
http://had.co.nz/ggplot2/graphics/55078149a733dd1a0b42a57faf847036.png

There are a couple of problems. First, the way you have created dfrm, width is a factor.
> str(dfrm)
'data.frame': 125 obs. of 2 variables:
$ width: Factor w/ 60 levels "106","20","21",..: 7 7 17 10 9 9 6 7 17 4 ...
$ group: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 1 1 1 1 1 1 ...
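This is because cbind creates a matrix, and all elements of a matrix must share one type; since one of the inputs is a character string, the result is a character matrix. Converting that matrix to a data.frame then turns the columns into factors (in R versions before 4.0.0, where stringsAsFactors defaults to TRUE; in later versions they stay character, which is still not numeric). This can be fixed with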
dfrm$width <- as.numeric(as.character(dfrm$width))
or better, not making matrices to begin with
df1 = data.frame(width=ceiling(rnorm(20, 30,5)), group='a')
df2 = data.frame(width=ceiling(rnorm(40, 80,10)), group='b')
df3 = data.frame(width=ceiling(rnorm(30, 50,8)), group='c')
df4 = data.frame(width=ceiling(rnorm(35, 30,7)), group='d')
dfrm = rbind(df1,df2,df3,df4)
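To see the coercion in isolation, here is a minimal sketch. The three values are made up for illustration, and the factor is built explicitly so the result does not depend on the stringsAsFactors default of the R version in use:

```r
# cbind() on numbers and a string yields a character matrix
m <- cbind(c(20, 25, 106), "a")
typeof(m)                      # "character"

# Build the kind of factor column that the question's str() output shows
f <- factor(m[, 1])
levels(f)                      # "106" "20" "25" (sorted as text, not as numbers)

as.numeric(f)                  # 2 3 1: the internal level codes, not the widths
as.numeric(as.character(f))    # 20 25 106: the actual widths
```

Going through as.character first is what makes the conversion return the values rather than the level codes.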
Once width is numeric, the original plotting call is enough to make a graph
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 1000)
Though it looks like there is only one line, there are actually 4, all on top of each other. You only see the last one drawn (group "d"). This points out the second problem: your binwidth is way too large for this data.
ggplot(dfrm, aes(width, ..density.., colour = group)) +
geom_freqpoly(binwidth = 10)
geom_freqpoly does not appear to have a fill aesthetic, though.
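For newer ggplot2 versions, the same plot can be sketched with after_stat(), which was introduced in ggplot2 3.3.0 and replaces the ..density.. notation. The seed and the binwidth of 5 are my own choices, not from the question; comparing the range of the data with the candidate binwidth is a quick way to check that the bins are not far too wide.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dfrm <- rbind(
  data.frame(width = ceiling(rnorm(20, 30, 5)),  group = "a"),
  data.frame(width = ceiling(rnorm(40, 80, 10)), group = "b"),
  data.frame(width = ceiling(rnorm(30, 50, 8)),  group = "c"),
  data.frame(width = ceiling(rnorm(35, 30, 7)),  group = "d")
)

# Spread of the data: a binwidth of 1000 would put everything in one bin
diff(range(dfrm$width))

# One frequency polygon per group, scaled to density
p <- ggplot(dfrm, aes(width, after_stat(density), colour = group)) +
  geom_freqpoly(binwidth = 5)
p
```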

Related

how to make a dot plot based on each column and highlight the beginning and the end

I have a data like this
df<- structure(list(Number = 1:23, Value1 = c(0.054830335, 1.19531842,
3.27820329, 1.03530176, 5.77430976, 3.72944, -0.683513395, 0.029550239,
2.487922644, 0.533448117, 0.098825565, -1.089022938, 2.301631235,
-0.095666867, -1.359480317, -1.359480317, 1.089441628, 3.307589929,
4.67838434, 3.562761178, 2.630726653, 1.795107015, 2.616255192
), Value2 = c(-0.296874921, 1.491747294, 2.951219257, 1.258677675,
-8.68096591, 3.361029751, -1.824459195, -1.445827538, 1.889631269,
-15.47774216, 3.085461276, -1.078286963, 0.948056999, -2.109354753,
-1.36703068, -1.36703068, 1.074642842, 2.945589842, 3.757911793,
2.765225717, 2.44452491, 1.784451022, 1.158493893)), class = "data.frame", row.names = c(NA,
-23L))
I am trying to make a dot plot with Value1 versus Number in one color and Value2 versus Number in another. I then want to show the first 5 values and the last 5 values at a bigger size.
I tried to plot it like this
df$Number <- factor(df$Number, levels = paste0("D", 1:23), ordered = TRUE)
ggplot(df, aes(x=Value1, y=Value2, color= Number)) +
geom_text()+
theme_classic()
I can plot one of them like this
ggplot(data = df, aes(x = Number, y = Value1))+
geom_point()
When it comes to adding the second one to the same plot, I am not sure how to proceed.
I can put them together in this way
# wide to long format
plotDf <- gather(df, Group, Myvalue, -1)
# plot
ggplot(plotDf, aes(Number, Myvalue, col = Group)) +
geom_point()
I still don't know how to show the first 5 values in bigger size and last 5 values in bigger size
The first 5 and the last 5 I mean these ones
df
Number Value1 Value2
1 1 0.05483034 -0.2968749
2 2 1.19531842 1.4917473
3 3 3.27820329 2.9512193
4 4 1.03530176 1.2586777
5 5 5.77430976 -8.6809659
6 6 3.72944000 3.3610298
7 7 -0.68351339 -1.8244592
8 8 0.02955024 -1.4458275
9 9 2.48792264 1.8896313
10 10 0.53344812 -15.4777422
11 11 0.09882557 3.0854613
12 12 -1.08902294 -1.0782870
13 13 2.30163123 0.9480570
14 14 -0.09566687 -2.1093548
15 15 -1.35948032 -1.3670307
16 16 -1.35948032 -1.3670307
17 17 1.08944163 1.0746428
18 18 3.30758993 2.9455898
19 19 4.67838434 3.7579118
20 20 3.56276118 2.7652257
21 21 2.63072665 2.4445249
22 22 1.79510701 1.7844510
23 23 2.61625519 1.1584939
These are the first 5
1 1 0.05483034 -0.2968749
2 2 1.19531842 1.4917473
3 3 3.27820329 2.9512193
4 4 1.03530176 1.2586777
5 5 5.77430976 -8.6809659
and these are the last 5
19 19 4.67838434 3.7579118
20 20 3.56276118 2.7652257
21 21 2.63072665 2.4445249
22 22 1.79510701 1.7844510
23 23 2.61625519 1.1584939

Using the original data (without factor):
ggplot(df, aes(Number, Value1, size = (Number <= 5 | Number > 18))) +
geom_point() +
geom_point(aes(y=Value2)) +
scale_size_manual(name = NULL, values = c("TRUE" = 2, "FALSE" = 0.5)) +
scale_x_continuous(breaks = function(z) do.call(seq, as.list(round(z,0))))
Because size= is determined by a logical condition, the manual values must be named with the character versions of the values observed, so the logicals TRUE and FALSE become "TRUE" and "FALSE". My choice of 2 and 0.5 is arbitrary.
Feel free to name the legend better with name="some name" if desired. If you want no legend (which makes sense), you can use
... +
scale_size_manual(guide = "none", values = c("TRUE" = 2, "FALSE" = 0.5))
instead.
Another alternative, in case you want to distinguish the dots by which Value# column they come from, is to melt the data into a long format before plotting.
ggplot(reshape2::melt(df, "Number"),
aes(Number, value, color = variable,
size = (Number <= 5 | Number > 18))) +
geom_point() +
scale_size_manual(guide = "none", values = c("TRUE" = 2, "FALSE" = 0.5))
One can use tidyr::pivot_longer or data.table::melt with similar results, see Reshaping data.frame from wide to long format.
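A minimal sketch of the tidyr::pivot_longer variant follows. The random values stand in for the question's data, and the column names variable, value, and highlight are my own choices. The highlight flag is computed from the number of rows, so it does not hard-code 18.

```r
library(ggplot2)
library(tidyr)

set.seed(1)  # stand-in data with the same shape as the question's df
df <- data.frame(Number = 1:23, Value1 = rnorm(23), Value2 = rnorm(23))

# Wide to long: one row per (Number, variable) pair
long <- pivot_longer(df, cols = c(Value1, Value2),
                     names_to = "variable", values_to = "value")

# TRUE for the first 5 and the last 5 values of Number
long$highlight <- long$Number <= 5 | long$Number > max(long$Number) - 5

p <- ggplot(long, aes(Number, value, colour = variable, size = highlight)) +
  geom_point() +
  scale_size_manual(guide = "none", values = c("TRUE" = 2, "FALSE" = 0.5))
p
```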

Error in ggplot2 when using both fill and group parameters in geom_bar

There seems to be a problem with R's ggplot2 library when I include both the fill and group parameters in a bar plot (geom_bar()). I've already tried looking for answers for several hours but couldn't find one that would help. This is actually my first post here.
To give a little background, I have a dataframe named smokement (short for smoke and mental health), a categorical variable named smoke100 (smoked in the past 100 days?) with "Yes" and "No", and another categorical variable named misnervs (frequency of feelings of nervousness) with 5 possible values: "All", "Most", "Some", "A little", and "None."
When I run this code, I get this result:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
However, the result I want is to have all grouped bar plots display their respective proportions. By reading a bit of "R for Data Science" book I found out that I need to include y = ..prop.. and group = 1 in aes() to achieve it:
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1)) +
facet_wrap(~misnervs, nrow = 1)
Finally, I try to use the fill = smoke100 parameter in aes() to display this categorical variable in color, just like I did on the first code. But when I add this fill parameter, it doesn't work! The code runs, but it shows exactly the same output as the second code, as if the fill parameter this time was somehow ignored!
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., group = 1, fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)
Does anyone have an idea of why this happens, and how to solve it? My end goal is to display each value of smoke100 (the "Yes" and "No" bars) with colors and a legend at the right, just like on the first graph, while having each grouping level of "misnervs" display their respective proportions of smoke100 ("Yes", "No") levels, just like on the second graph.
EDIT:
> dim(smokement)
[1] 35471 6
> str(smokement)
'data.frame': 35471 obs. of 6 variables:
$ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 1 1 1 ...
$ misnervs: Factor w/ 5 levels "All","Most","Some",..: 3 4 5 4 1 5 3 3 5 5 ...
$ mishopls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 5 5 5 5 5 5 5 ...
$ misrstls: Factor w/ 5 levels "All","Most","Some",..: 3 5 5 3 1 5 3 5 1 5 ...
$ misdeprd: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 4 5 5 5 5 5 ...
$ miswtles: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 5 5 5 5 5 5 ...
> head(smokement)
smoke100 misnervs mishopls misrstls misdeprd miswtles
1 Yes Some Some Some None None
2 No A little None None None None
3 Yes None None None None None
4 No A little None Some None None
5 Yes All None All A little None
6 Yes None None None None None
As for the output without group = 1
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100)) +
facet_wrap(~misnervs, nrow = 1)

Besides the solution offered here, the GGally package includes a stat_prop, which introduces a new by aesthetic to specify how the proportions should be calculated:
library(GGally)
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = smoke100, by = misnervs), stat = "prop") +
facet_wrap(~misnervs, nrow = 1)
And just for reference, the same could be achieved without GGally by setting fill=factor(..x..):
ggplot(data = smokement) +
geom_bar(aes(x = smoke100, y = ..prop.., fill = factor(..x..), group = 1)) +
facet_wrap(~misnervs, nrow = 1)
DATA
misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <-
data.frame(
smoke100 = sample(c("Yes", "No"), 100, replace = TRUE),
misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)
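As a hedged check of the fill = factor(..x..) approach, the sketch below uses the after_stat() spelling (ggplot2 3.3.0 or later) on the sample data and inspects the computed bar heights with layer_data(). One caveat I would expect: the legend is labelled with the x positions (1, 2) rather than "No" and "Yes", so the labels may need to be set by hand.

```r
library(ggplot2)

misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <- data.frame(
  smoke100 = sample(c("Yes", "No"), 100, replace = TRUE),
  misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)

# group = 1 makes prop a share within each facet; fill is taken from the
# computed x position, so it is available after the stat has run
p <- ggplot(smokement) +
  geom_bar(aes(x = smoke100, y = after_stat(prop),
               fill = factor(after_stat(x)), group = 1)) +
  facet_wrap(~misnervs, nrow = 1)

# Bar heights per facet should add up to 1
built <- layer_data(p)
panel_totals <- tapply(built$y, built$PANEL, sum)
panel_totals
```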

I wasn't able to get what you wanted by tweaking your call to geom_bar*, but I think this gives you what you are looking for. As you didn't provide your input dataset (for understandable reasons), I've used the diamonds tibble in my code. The changes you need to make should be obvious.
*: I'm sure it can be done: I just wasn't able to work it out.
The idea behind my solution is to pre-compute the proportions you want to plot before the call to ggplot.
group_modify takes a grouped tibble and applies the specified function to each group in turn, before returning the modified (grouped) tibble.
library(dplyr)
library(ggplot2)
diamonds %>%
  group_by(cut) %>%
  group_modify(
    function(.x, .y)
      .x %>%
        group_by(color) %>%
        summarise(Prop = n() / nrow(.))  # share of each color within this cut
  ) %>%
  ggplot() +
  geom_col(aes(x = color, y = Prop, fill = color)) +
  facet_wrap(~cut)
Note the switch from geom_bar to geom_col: geom_bar uses row counts, geom_col uses values in the data.
As a rough-and-ready QC, here's the equivalent of your code that produces the "all grey" plot:
diamonds %>%
ggplot() +
geom_bar(aes(x=color, y=..prop.., fill=color, group=1)) +
facet_wrap(~cut)
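The same pre-computation idea can be sketched in base R for the smokement variables using prop.table(). The data here is the random sample data given earlier on this page, not the asker's real survey, and the names tab and props are my own.

```r
library(ggplot2)

misnervs <- c("All", "Most", "Some", "A little", "None")
set.seed(123)
smokement <- data.frame(
  smoke100 = sample(c("Yes", "No"), 100, replace = TRUE),
  misnervs = factor(sample(misnervs, 100, replace = TRUE), levels = misnervs)
)

# Share of each smoke100 level within each misnervs level (columns sum to 1)
tab <- prop.table(table(smokement$smoke100, smokement$misnervs), margin = 2)
props <- as.data.frame(tab, responseName = "prop")
names(props)[1:2] <- c("smoke100", "misnervs")

# geom_col plots the pre-computed values, so fill works as usual
p <- ggplot(props, aes(smoke100, prop, fill = smoke100)) +
  geom_col() +
  facet_wrap(~misnervs, nrow = 1)
p
```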

2d plot with 3rd variable as color in RStudio

I have a dataset as CSV with three columns:
timestamp (e.g. 2018/12/15)
keyword (e.g. "hello")
count (e.g. 7)
I want one plot where all the lines of the same keyword are connected with each other and timestamp is on the X- and count is on the Y- axis. I would like each keyword to have a different color for its line and the line being labeled with the keyword.
The CSV has only ~30.000 rows and R runs on a dedicated machine. Performance can be ignored.
I tried various approaches with mathplot and ggplot in this forum, but didn't get it to work with my own data.
What is the easiest solution to do this in R?
Thanks!
EDIT:
I tried customizing Roman's code and tried the following:
csvdata <- read.csv("c:/mydataset.csv", header=TRUE, sep=",")
time <- csvdata$timestamp
count <- csvdata$count
keyword <- csvdata$keyword
time <- rep(time)
xy <- data.frame(time, word = c(keyword), count, lambda = 5)
library(ggplot2)
ggplot(xy, aes(x = time, y = count, color = keyword)) +
theme_bw() +
scale_color_brewer(palette = "Set1") + # choose appropriate palette
geom_line()
This creates a correct canvas, but no points/lines in it...
DATA:
head(csvdata)
keyword count timestamp
1 non-distinct-word 3 2018/08/09
2 non-distinct-word 2 2018/08/10
3 non-distinct-word 3 2018/08/11
str(csvdata)
'data.frame': 121 obs. of 3 variables:
$ keyword : Factor w/ 10 levels "non-distinct-word",..: 5 5 5 5 5 5 5 5 5 5 ...
$ count : int 3 2 3 1 6 6 2 3 2 1 ...
$ timestamp: Factor w/ 103 levels "2018/08/09","2018/08/10",..: 1 2 3 4 5 6 7 8 9 10 ...

Something like this?
# Generate some data. This is the part poster of the question normally provides.
today <- as.Date(Sys.time())
time <- rep(seq.Date(from = today, to = today + 30, by = "day"), each = 2)
xy <- data.frame(time, word = c("hello", "world"), count = rpois(length(time), lambda = 5))
library(ggplot2)
ggplot(xy, aes(x = time, y = count, color = word)) +
theme_bw() +
scale_color_brewer(palette = "Set1") + # choose appropriate palette
geom_line()
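For data read from a CSV as in the question's EDIT, one likely reason for the empty canvas is that timestamp is a factor: with a discrete x, geom_line() treats every timestamp/keyword combination as its own group, so there is nothing to connect. A minimal sketch, assuming the yyyy/mm/dd format shown in the question and using a small made-up stand-in for the file:

```r
library(ggplot2)

# Stand-in for read.csv(); timestamp deliberately arrives as a factor
csvdata <- data.frame(
  keyword   = rep(c("hello", "world"), each = 5),
  count     = c(3, 2, 3, 1, 6, 6, 2, 3, 2, 1),
  timestamp = factor(rep(c("2018/08/09", "2018/08/10", "2018/08/11",
                           "2018/08/12", "2018/08/13"), times = 2))
)

# Convert to Date so the x axis is continuous
csvdata$timestamp <- as.Date(as.character(csvdata$timestamp),
                             format = "%Y/%m/%d")

# One line per keyword; group is stated explicitly to make the intent clear
p <- ggplot(csvdata, aes(x = timestamp, y = count,
                         colour = keyword, group = keyword)) +
  theme_bw() +
  geom_line()
p
```

Note that with 10 keywords the "Set1" palette would run out of colours, since it provides at most 9.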

R - reshaped data from wide to long format, now want to use created timevar as factor

I am working with longitudinal data to assess the utilization of a policy over 13 months. In order to get some bar plots with the different months on my x-axis, I converted my data from wide format to long format.
So now, my dataset looks like this
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
I thought that after reshaping I could easily use my newly created "month" variable as a factor and plot some graphs. However, it does not work and tells me it's a list or an atomic vector. Transforming it into a factor did not work either, and I really need it as a factor.
Does anybody know how to turn it into a factor?
Thank you very much for your help!
EDIT.
The OP's graph code was posted in a comment. Here it is.
library(ggplot2)
ggplot(data, aes(x = hours, y = month)) + geom_density() + labs(title = 'Distribution of hours')

# Loading ggplot2
library(ggplot2)
# Placing example in dataframe
data <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
# Converting month to factor
data$month <- factor(data$month, levels = 1:13) # the question covers 13 months; labels default to the levels
# Plotting grouping by id
ggplot(data, aes(x = month, y = hours, group = id, color = factor(id))) + geom_line()
# Plotting hour density by month
ggplot(data, aes(hours, color = month)) + geom_density()

The problem seems to be in the aes. geom_density only needs an x value; if you think about it a little, y doesn't make sense. You want the density of the x values, so the vertical axis shows the values of that density, not some other values present in the dataset.
First, read in the data.
Indirekte_long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)
Now graph it.
library(ggplot2)
g <- ggplot(Indirekte_long, aes(hours))
g + geom_density() + labs(title = 'Distribution of hours')
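Since the question also asks for bar plots with the months on the x-axis, here is a minimal sketch under the assumption that the mean of hours per month is the quantity of interest (the question does not say which summary is wanted). It reuses the example rows from above; the names long and means are my own.

```r
library(ggplot2)

long <- read.table(text = "
id month hours
1 1 13
1 2 16
1 3 20
2 1 0
2 2 0
2 3 10
", header = TRUE)

# factor() on the column turns month into a discrete variable
long$month <- factor(long$month)
is.factor(long$month)

# Mean hours per month, then one bar per month
means <- aggregate(hours ~ month, data = long, FUN = mean)
p <- ggplot(means, aes(x = month, y = hours)) +
  geom_col()
p
```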

Stacked bar plot with percentages in separate columns

I am attempting to draw a stacked bar plot with the following data, using either ggplot2 or the barplot function in R. I have failed with both.
str(ISCE_LENGUAJE5_APE_DEC)
'data.frame': 50 obs. of 5 variables:
$ Nombre : Factor w/ 49 levels "C.E. DE BORAUDO",..: 6 5 25 21 16 7 27 45 24 38 ...
$ v2014_5L_porNivInsu: int 100 93 73 67 67 65 63 60 59 54 ...
$ v2014_5L_porNivMini: int 0 7 22 26 32 32 37 26 34 35 ...
$ v2014_5L_porNivSati: int 0 0 4 6 2 3 0 12 6 10 ...
$ v2014_5L_porNivAvan: int 0 0 1 2 0 0 0 2 1 2 ...
The integers are percentage values: the sum of the v2014... columns for each observation is 100.
I have attempted to use ggplot2, but I only manage to plot one of the variables, not the stacked bar with all four.
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre, y= v2014_5L_porNivInsu)) + geom_bar(stat="identity")
I can't figure out how to pass the values for all four columns to the y parameter.
If I only pass x, I get an error:
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre)) + geom_bar(stat="identity")
Error in exists(name, envir = env, mode = mode) :
argument "env" is missing, with no default
I found this answer, but don't understand the data transformations used. Thank you for any help provided.

ggplot2 works with data expressed in "long" format. The function melt from package reshape2 is your friend.
Because you did not provide a reproducible example, I generated some data.
v2014 <- data.frame(v2014_5L_porNivInsu = sample(1:100, 50, replace = TRUE),
v2014_5L_porNivMini = sample(1:50, 50, replace = TRUE),
v2014_5L_porNivSati = sample(0:10, 50, replace = TRUE),
v2014_5L_porNivAvan = sample(0:2, 50, replace = TRUE))
v2014_prop <- t(apply(v2014, 1, function(x) {x / sum(x) * 100}))
ISCE_LENGUAJE5_APE_DEC <- data.frame(Nombre = factor(sample(1:100, 50)),
v2014_prop)
You first express your table in long format using melt.
library(reshape2)
gg <- melt(ISCE_LENGUAJE5_APE_DEC, id = "Nombre")
See what your new table, gg, looks like.
str(gg)
head(gg)
In your ggplot call, you use the data.frame gg. The x-axis is Nombre and the y-axis is value, i.e. the proportions, segmented by fill colours taken from the variable column. Thanks to the melt function, that column holds the v2014_... names as factor levels instead of as column headers.
library(ggplot2)
ggplot(gg, aes(x = Nombre, y = value, fill = variable)) +
geom_bar(stat = "identity")
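A quick sanity check on the long table can confirm that every bar will stack to 100. This is a small sketch with made-up data: the shortened column names, the school_ labels, and the seed are illustrative only.

```r
library(reshape2)

set.seed(42)  # arbitrary seed for a reproducible toy table
raw <- data.frame(Insu = sample(1:100, 5),
                  Mini = sample(1:50, 5),
                  Sati = sample(0:10, 5),
                  Avan = sample(0:2, 5, replace = TRUE))

# Rescale each row so the four levels sum to 100
pct  <- as.data.frame(t(apply(raw, 1, function(x) x / sum(x) * 100)))
wide <- data.frame(Nombre = factor(paste0("school_", 1:5)), pct)

# Wide to long: one row per (Nombre, level) pair
long <- melt(wide, id.vars = "Nombre")

# Each Nombre should total 100 across the four levels
totals <- tapply(long$value, long$Nombre, sum)
totals
```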
