I have the following data.frame:
sample <- data.frame(day=c(1,2,5,10,12,12,14))
sample.table <- as.data.frame(table(sample$day))
Now what I'd like to do is graph the day against the count of days, so something like:
require(ggplot2)
qplot(Var1, Freq, data=sample.table)
I realized though that Var1 really really really wants to be a factor. This works fine for a small number of days, but is terrible when days becomes much larger because the graph becomes unreadable. If I change it to a numeric or integer, then instead of plotting day on the x-axis, it plots the count of day, e.g. 1,2,3,4,5,6,7.
What can I do so that if I have, say 5000 days, it is still visible well?
This is because when you use table you get a vector with names (which are characters), and when you convert to data.frame these get converted to factors with the default settings.
You could avoid this by using your original data and getting ggplot2 to count the data:
qplot(day, ..count.., data=sample, stat="bin", binwidth=1)
or just use a histogram,
qplot(day, data=sample, geom="histogram", binwidth=1)
Note that you can adjust the binwidth argument to count in larger groups.
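On a recent ggplot2 release, where qplot() is deprecated, the same idea can be sketched with ggplot() directly. This is only a minimal sketch: the data frame is renamed days_df (a hypothetical name, chosen to avoid masking base R's sample() function), and the binwidth of 5 is an arbitrary value used to illustrate counting in larger groups.

```r
library(ggplot2)

# Same values as in the question, under a hypothetical name.
days_df <- data.frame(day = c(1, 2, 5, 10, 12, 12, 14))

# ggplot2 does the counting; day stays numeric, so the x-axis is continuous.
p_daily <- ggplot(days_df, aes(x = day)) +
  geom_histogram(binwidth = 1)

# A wider binwidth groups several days into one bar.
p_grouped <- ggplot(days_df, aes(x = day)) +
  geom_histogram(binwidth = 5)

print(p_daily)
```

Because the x-axis is continuous, ggplot2 picks a readable set of axis breaks on its own, which is what matters once there are thousands of days.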
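Figured out a hack for this: convert the factor to character first and then to integer, so you get the day values rather than the factor's level codes.
sample.table$Var1 <- as.integer(as.character(sample.table$Var1))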
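To see why the detour through as.character() matters, here is a small sketch using the question's values. The names days_df and day_counts are assumptions made for this illustration, not names from the original post.

```r
# Same values as in the question, under hypothetical names.
days_df <- data.frame(day = c(1, 2, 5, 10, 12, 12, 14))
day_counts <- as.data.frame(table(days_df$day))

# Var1 is a factor whose labels are the day values.
is.factor(day_counts$Var1)

# Direct conversion returns the internal level codes: 1, 2, 3, ...
codes <- as.integer(day_counts$Var1)

# Going through character recovers the actual day values.
values <- as.integer(as.character(day_counts$Var1))

print(codes)
print(values)
```

The first vector is what produces the 1, 2, 3, ... axis described in the question; the second is what you want to plot.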
I have a barplot where one entry is so much larger than my other entries that it makes it difficult to do interesting analysis on the other, smaller-valued data points.
plt <- ggplot(dffd[dffd$Month==i & dffd$UniqueCarrier!="AA",],aes(x=UniqueCarrier,y=1,fill=DepDelay))+
geom_col()+
coord_flip()+
scale_fill_gradientn(breaks=late_breaks,labels=late_breaks,limits=c(0,150),colours=c('black','yellow','orange','red','darkred'))
When I remove it I get back to an interesting degree of interpretation but now I'm tossing out upwards of half the data and arguably the most important one to explore.
I was wondering if there is a way to set an interval on my bar plot, say 500 in this case, after which another column starts for the same entry right under it and the bar continues building up. In this example, WN would split into three bars of length 500, 500 and ~400, stacked one below the other under the single WN label (ideally with one tick for all three). Since I have a couple of other disproportionately large entries, doing this as a layer during plotting is of great interest to me.
Typically, when you have such disproportionate values in your data set, you should either put your values on a log scale (or use some other transformation) or zoom in on the plot using coord_cartesian. I think you probably could hack your way around and create the desired plot, but it's going to be quite misleading in terms of visualisation and analysis.
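As a rough illustration of those two options, here is a sketch on made-up counts (the flights data frame and its numbers are hypothetical, not the asker's data). Points are used for the log-scale version because bars anchored at zero do not sit well on a log axis.

```r
library(ggplot2)

# Hypothetical counts: one carrier dwarfs the others.
flights <- data.frame(
  carrier = c("WN", "DL", "UA", "B6"),
  n       = c(1400, 180, 150, 60)
)

# Option 1: zoom in with coord_cartesian(). No rows are dropped;
# the WN bar simply runs off the top of the visible range.
p_zoom <- ggplot(flights, aes(x = carrier, y = n)) +
  geom_col() +
  coord_cartesian(ylim = c(0, 250))

# Option 2: put the values on a log scale.
p_log <- ggplot(flights, aes(x = carrier, y = n)) +
  geom_point(size = 3) +
  scale_y_log10()

print(p_zoom)
```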
EDIT:
Based on your comments, I have a rather hacky solution. The data you pasted was not directly usable (part of the dput output was missing and there was no DepDelay column), so I improvised.
The idea is to create an extra tag column based on the UniqueCarrier column and the max amount you want.
library(dplyr)  # provides %>%, filter, group_by, mutate and n()
df2 <- df %>%
filter(UniqueCarrier != "AA" & Month == i) %>%
group_by(UniqueCarrier) %>%
mutate(tag = paste(UniqueCarrier, rep(seq(1, n()%/%500+1), each=500), sep="_")[1:n()])
This adds a tag column that basically says how many columns you'll have in each category.
plt <- ggplot(df2, aes(x=tag, y=1, fill=DepDelay)) +
geom_col() +
coord_flip() +
scale_fill_gradientn(breaks=late_breaks, labels=late_breaks,
limits=c(0,150),
colours=c('black','yellow','orange','red','darkred')) +
scale_x_discrete(labels=stringr::str_replace(sort(unique(df2$tag)), "_[0-9]+$", ""))  # strip the whole numeric suffix, including multi-digit ones
plt
In the image above, I've used CarrierDelay with a break interval of 100. You can see that the WN label then repeats; there are ways to remove the extra ones (for example, some more creative replacements in the scale_x_discrete labels).
If you want the columns to be ordered differently, just replace seq(1, n()%/%500+1) with seq(n()%/%500+1, 1).
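The tag construction can be checked in isolation with base R. In this sketch the group size of 1400 and the chunk size of 500 are illustrative values, and chunk_b is an equivalent arithmetic form that is not part of the original answer.

```r
# Hypothetical group size and chunk length, for illustration only.
n_rows <- 1400
chunk_size <- 500

# Construction used in the answer: repeat 1, 2, 3, ... in blocks, then trim to n_rows.
chunk_a <- rep(seq(1, n_rows %/% chunk_size + 1), each = chunk_size)[1:n_rows]

# Equivalent arithmetic form: integer-divide the zero-based row index.
chunk_b <- (seq_len(n_rows) - 1) %/% chunk_size + 1

# Build the tags for one carrier and count the rows under each tag.
tags <- paste("WN", chunk_a, sep = "_")
print(table(tags))
```

With these values the rows fall into three tags of 500, 500 and 400 rows, which is the split asked for in the question.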
I'm having some trouble with qplot in R. I am trying to plot data from a data frame. When I execute the command below the plot gets bunched up on the left side (see the image below). The data frame only has 963 rows so I don't think size is the issue, but I can use the same command on a smaller data frame and it looks fine. Any ideas?
library(ggplot2)
qplot(x=variable,
y=value,
data=data,
color=Classification,
main="Average MapQ Scores")
Or similarly:
ggplot(data = data, aes(x = variable, y = value, color = Classification)) +
geom_point()
Your column value is likely a factor, when it should be a numeric. This causes each categorical value of value to be given its own entry on the y-axis, thus producing the effect you've noticed.
You should coerce it to numeric:
data$value <- as.numeric(as.character(data$value))
Note that there is probably a good reason it has been interpreted as a factor and not a numeric, possibly because it has some entries that are not pure numeric values (maybe 1,000 or 1000 m or some other character entry among the numbers). The consequence of the coercion may be a loss of information, so be warned or cleanse the data thoroughly.
Also, you appear to have the same problem on the x-axis.
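Before overwriting the column, it can help to see which entries would be lost. The sketch below uses made-up values (the value vector is hypothetical) and shows one possible cleanup for thousands separators; other stray text would need its own handling.

```r
# Hypothetical values, mimicking a column that was read in as a factor.
value <- factor(c("10", "25.5", "1,000", "1000 m"))

# Entries that are not purely numeric become NA (R also emits a warning).
converted <- suppressWarnings(as.numeric(as.character(value)))

# List the offending raw entries before deciding how to clean them.
bad_entries <- as.character(value)[is.na(converted)]
print(bad_entries)

# One possible cleanup: strip thousands separators, then convert again.
cleaned <- suppressWarnings(as.numeric(gsub(",", "", as.character(value))))
print(cleaned)
```

Here "1,000" is recovered after cleaning, while "1000 m" still becomes NA and would need a decision about units.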
I have a data set in which a coordinate can be repeated several times.
I want to make a hexbinplot displaying the maximum number of times a coordinate is repeated within that bin. I am using R and I would prefer to make it with ggplot so the graph is consistent with other graphs in the same report.
Minimum working example (the bins display the count not the max):
library(ggplot2)
library(data.table)
set.seed(41)
dat<-data.table(x=sample(seq(-10,10,1),1000,replace=TRUE),
y=sample(seq(-10,10,1),1000,replace=TRUE))
dat[,.N,by=c("x","y")][,max(N)]
# No bin should be over 9
p1 <- ggplot(dat,aes(x=x,y=y))+stat_binhex(bins=10)
p1
I believe the approach should be related to this question:
calculating percentages for bins in ggplot2 stat_binhex but I am not sure how to adapt it to my case.
Also, I am concerned about this issue ggplot2: ..count.. not working with stat_bin_hex anymore as it can make my objective harder than what I initially thought.
Is it possible to make the bins display the maximum number of times a point is repeated?
I think, after playing with the data a bit more, I now understand. Each bin in the plot represents multiple points, e.g., (9,9); (9,10); (10,9); (10,10) are all in a single bin in the plot. I should point out that this is the expected behavior, and it is unclear to me why you do not want to do it this way. Instead, you seem to want to display the value of just one of those points (e.g. 9,9).
I don't think you will be able to do this directly in a call to geom_hex or stat_hexbin, as those functions are trying to faithfully represent all of the data. In fact, they are not necessarily expecting discrete coordinates like you have at all -- they work equally well on continuous data.
For your purpose, if you want finer control, you may want to instead use geom_tile and count the values yourself, e.g. (using the %$% and %>% pipes from magrittr):
library(magrittr)  # provides %$% and %>%
countedData <-
dat %$%
table(x,y) %>%
as.data.frame()
ggplot(countedData
, aes(x = x
, y = y
, fill = Freq)) +
geom_tile()
and you might play with the representation a bit from there, but it would at least display each of the separate coordinates more faithfully.
Alternatively, you could filter your raw data to only include the points that are the maximum within a bin. That would require you to match the binning, but could at least be an option.
For completeness, here is how to adapt the stat_summary_hex solution that @Jon Nagra (OP) linked to. Note that there are a few additional steps, so I don't think that this is quite a duplicate. Specifically, the table step above is required to generate something that can be used as a z for the summaries, and then you need to convert x and y back from factors to the original scale.
ggplot(countedData
, aes(x = as.numeric(as.character(x))
, y = as.numeric(as.character(y))
, z = Freq)) +
stat_summary_hex(fun = max, bins = 10
, col = "white")
Of note, I still think that geom_tile may be more useful, even if it is not quite as flashy.
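Since the question already uses data.table, the counting step could also be sketched there, which keeps x and y numeric and avoids the conversion back from factors. This is an alternative to the table step rather than the approach used above; the names counted and p_max are assumptions. One difference to be aware of: coordinates that never occur are simply absent here, whereas table() reports them with a count of zero.

```r
library(ggplot2)
library(data.table)

# Same simulated data as in the question.
set.seed(41)
dat <- data.table(x = sample(seq(-10, 10, 1), 1000, replace = TRUE),
                  y = sample(seq(-10, 10, 1), 1000, replace = TRUE))

# Count repeats per coordinate; x and y stay numeric.
counted <- dat[, .N, by = .(x, y)]

# Summarise each hexagon by the largest count it contains.
# stat_summary_hex() needs the hexbin package to be installed.
p_max <- ggplot(counted, aes(x = x, y = y, z = N)) +
  stat_summary_hex(fun = max, bins = 10)
```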
This is my first post, so go easy. Up until now (the past ~5 years?) I've been able to either tweak my R code the right way or find an answer on this or various other sites. Trust me when I say that I've looked for an answer!
I have a working script to create the attached boxplot in basic R.
http://i.stack.imgur.com/NaATo.jpg
This is fine, but I really just want to "jazz" it up in ggplot, for vain reasons.
I've looked at the following questions and they are close, but not complete:
Why does a boxplot in ggplot requires axis x and y?
How do you draw a boxplot without specifying x axis?
My data is basically like "mtcars" if all the numerical variables were on the same scale.
All I want to do is plot each variable on the same boxplot, like the basic R boxplot I made above. My y axis is the same continuous scale (0 to 1) for each box and the x axis simply labels each month plus a yearly average (think all the mtcars values the same on the y axis and the x axis is each vehicle model). Each box of my data represents 75 observations (kind of like if mtcars had 75 different vehicle models), again all the boxes are on the same scale.
What am I missing?
Though I don't think mtcars makes a great example for this, here it is:
First, we make the data (hopefully) more similar to yours by using a column instead of rownames.
mt = mtcars
mt$car = row.names(mtcars)
Then we reshape to long format:
mt_long = reshape2::melt(mt, id.vars = "car")
Then the plot is easy:
library(ggplot2)
ggplot(mt_long, aes(x = variable, y = value)) +
geom_boxplot()
Using ggplot all but requires data in "long" format rather than "wide" format. If you want something to be mapped to a graphical dimension (x-axis, y-axis, color, shape, etc.), then it should be a column in your data. Luckily, it's usually quite easy to get data in the right format with reshape2::melt or tidyr::gather. I'd recommend reading the Tidy Data paper for more on this topic.
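In current tidyr, gather() has been superseded by pivot_longer(), so the reshape step might look like the sketch below. The names_to and values_to values are chosen here to mirror what reshape2::melt produces. One difference from melt: the variable column comes out as character rather than factor, so the boxes are ordered alphabetically instead of in column order.

```r
library(tidyr)
library(ggplot2)

# Same setup as above: move the row names into a regular column.
mt <- mtcars
mt$car <- row.names(mtcars)

# One row per (car, variable) pair.
mt_long <- pivot_longer(mt, cols = -car,
                        names_to = "variable", values_to = "value")

p_box <- ggplot(mt_long, aes(x = variable, y = value)) +
  geom_boxplot()

print(p_box)
```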