So I'm having trouble creating a dot plot/bar graph of this data set I have. My data set looks like this. I want an output that looks like this. However, geom_bar() through ggplot will only give me counts, and won't take the individual decimal values from the table. I've tried using Plotly as well, but it doesn't seem to scale well to plots with multiple players.
I've already set up a larger data frame with 200+ variables. I'm trying to make something that can search for specific players in that data frame, and then create a plot from it. Consequently, I'm ideally looking for something that can easily handle 5-10 different series.
Any help would be greatly appreciated.
Thanks!
This is pretty straightforward: the key is to get your data from its current wide format into the long format that is more useful for plotting in R, and to use geom_point rather than geom_bar.
First, some reproducible example data (you should include something like this in any future question you post here; it makes it much easier for others to help you):
library(ggplot2)
library(reshape2)
dataset <- data.frame(
PlayerName = letters[1:6],
IsolationPossG = runif(6),
HandoffPossG = runif(6),
OffScreenPossG = runif(6)
)
This is your current data, in the wide format:
dataset
PlayerName IsolationPossG HandoffPossG OffScreenPossG
1 a 0.78184751 0.939183520 0.74461784
2 b 0.06557433 0.745699149 0.96540299
3 c 0.21105745 0.753534811 0.02977973
4 d 0.41271918 0.555475622 0.18317886
5 e 0.38153149 0.246292074 0.74862310
6 f 0.89946318 0.008412111 0.53195933
Now we convert to the long format:
molten <- melt(
dataset,
id.vars = "PlayerName",
measure.vars = c("IsolationPossG", "HandoffPossG", "OffScreenPossG")
)
Here is the long format, much more useful for plotting in R:
head(molten)
PlayerName variable value
1 a IsolationPossG 0.78184751
2 b IsolationPossG 0.06557433
3 c IsolationPossG 0.21105745
4 d IsolationPossG 0.41271918
5 e IsolationPossG 0.38153149
6 f IsolationPossG 0.89946318
Here's how to plot it:
ggplot(molten, aes(x = variable, y = value, colour = PlayerName)) +
geom_point(size = 4) +
theme_bw() +
theme(legend.position="bottom",legend.direction="horizontal")
Which gives:
h/t how to have multiple labels in ggplot2 for bubble plot
If you want the shape of the data point to vary by name, as your example image shows (but it seems rather excessive to have the player name variable on two of the plot's aesthetics):
ggplot(molten, aes(x = variable, y = value, shape = PlayerName, colour = PlayerName)) +
geom_point(size = 4) +
theme_bw() +
theme(legend.position="bottom",legend.direction="horizontal")
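To tie this back to the goal of pulling a handful of players out of a larger data frame, here is a minimal sketch. It assumes the wide table has a PlayerName column as in the example data above; the helper name players_long() and its arguments are made up for illustration and are not part of any package.

```r
library(ggplot2)
library(reshape2)

# Hypothetical helper: keep only the requested players and measures,
# then reshape from wide to long so ggplot can use it.
players_long <- function(data, players, measures) {
  keep <- data[data$PlayerName %in% players, c("PlayerName", measures)]
  melt(keep, id.vars = "PlayerName", measure.vars = measures)
}

# Same shape as the example data above (random values)
set.seed(1)
dataset <- data.frame(
  PlayerName = letters[1:6],
  IsolationPossG = runif(6),
  HandoffPossG = runif(6),
  OffScreenPossG = runif(6)
)

molten_subset <- players_long(
  dataset,
  players = c("a", "c", "e"),
  measures = c("IsolationPossG", "HandoffPossG")
)

# One point per player and measure, coloured by player
p <- ggplot(molten_subset, aes(x = variable, y = value, colour = PlayerName)) +
  geom_point(size = 4) +
  theme_bw() +
  theme(legend.position = "bottom", legend.direction = "horizontal")
```

Because the filtering happens before melt(), the same call should work unchanged for 5-10 players; only the players vector changes.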
Related
I'm currently trying to develop a surface plot that examines the results of the below data frame. I want to plot the increasing values of noise on the x-axis and the increasing values of mu on the y-axis, with the point estimate values on the z-axis. After looking at ggplot2 and ggplotly, it's not clear how I would plot each of these columns in surface or 3D plot.
df <- "mu noise0 noise1 noise2 noise3 noise4 noise5
1 1 0.000000 0.9549526 0.8908646 0.919630 1.034607
2 2 1.952901 1.9622004 2.0317115 1.919011 1.645479
3 3 2.997467 0.5292921 2.8592976 3.034377 3.014647
4 4 3.998339 4.0042379 3.9938346 4.013196 3.977212
5 5 5.001337 4.9939060 4.9917115 4.997186 5.009082
6 6 6.001987 5.9929932 5.9882173 6.015318 6.007156
7 7 6.997924 6.9962483 7.0118066 6.182577 7.009172
8 8 8.000022 7.9981131 8.0010066 8.005220 8.024569
9 9 9.004437 9.0066182 8.9667536 8.978415 8.988935
10 10 10.006595 9.9987245 9.9949733 9.993018 10.000646"
Thanks in advance.
Here's one way using geom_tile(). First, you will want to get your data frame into more of a Tidy format, where the goal is to have columns:
mu: nothing changes here
noise: need to combine your "noise0", "noise1", ... columns together, and
z: serves as the value of the noise and we will apply the fill= aesthetic using this column.
To do that, I'm using gather() and separate() from tidyr, together with select() and the %>% pipe from dplyr, but there are other ways (melt() or pivot_longer() get you that too). I'm also adding some code to pull out just the number portion of the "noise" column names and convert it to an integer, so that both the x and y axes are numeric:
library(dplyr)
library(tidyr)
# assumes that df is your data as data.frame
df <- df %>% gather(key="noise", value="z", -mu)
df <- df %>% separate(col = "noise", into=c('x', "noise"), sep=5) %>% select(-x)
df$noise <- as.integer(df$noise)
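For comparison, the same reshaping can probably be done in a single pivot_longer() call. This is only a sketch on a small made-up table (df_wide and its values are placeholders, not the data from the question); it assumes a tidyr version that supports names_prefix and names_transform (1.1.0 or later).

```r
library(tidyr)

# Toy wide table with the same layout: one mu column, several noiseN columns
df_wide <- data.frame(
  mu = 1:3,
  noise0 = c(1.0, 2.0, 3.0),
  noise1 = c(0.9, 2.1, 3.0)
)

df_long <- pivot_longer(
  df_wide,
  cols = starts_with("noise"),                  # every noiseN column
  names_to = "noise",                           # column names go here...
  names_prefix = "noise",                       # ...with the "noise" prefix stripped
  names_transform = list(noise = as.integer),   # "0", "1" -> 0L, 1L
  values_to = "z"
)
```

The result has the columns mu, noise and z, which is what the geom_tile() call expects.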
Here's an example of how you could plot it, but aesthetics are up to you. I decided to also include geom_text() to show the actual values of df$z so that we can see better what's going on. Also, I'm using rainbow because "it's pretty" - you may want to choose a more appropriate quantitative comparison scale from the RColorBrewer package.
ggplot(df, aes(x=noise, y=mu, fill=z)) + theme_bw() +
geom_tile() +
geom_text(aes(label=round(z, 2))) +
scale_fill_gradientn(colors = rainbow(5))
EDIT: To answer OP's follow up, yes, you can also showcase this via plotly. Here's a direct transition:
library(plotly)
p <- plot_ly(
df, x= ~noise, y= ~mu, z= ~z,
type='mesh3d', intensity = ~z,
colors= colorRamp(rainbow(5))
)
p
Static image here:
A much more informative way to show this particular data is to look at how df$z deviates from df$mu: create df$delta_z and plot that instead (you can also plot it via ggplot() + geom_tile() as above):
df$delta_z <- df$z - df$mu
p1 <- plot_ly(
df, x= ~noise, y= ~mu, z= ~delta_z,
type='mesh3d', intensity = ~delta_z,
colors= colorRamp(rainbow(5))
)
Giving you this (static image here):
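The ggplot() + geom_tile() variant of the delta_z plot could look roughly like the sketch below. The data frame toy and its values are invented for illustration; the diverging scale_fill_gradient2() is my choice rather than something from the original answer, picked so that zero deviation sits at the neutral midpoint colour.

```r
library(ggplot2)

# Toy long-format data: every combination of mu and noise level
set.seed(42)
toy <- expand.grid(mu = 1:3, noise = 0:2)
deviation <- rnorm(nrow(toy), sd = 0.1)   # made-up estimation error
toy$z <- toy$mu + deviation
toy$delta_z <- toy$z - toy$mu             # deviation of the estimate from mu

p_delta <- ggplot(toy, aes(x = noise, y = mu, fill = delta_z)) +
  geom_tile() +
  geom_text(aes(label = round(delta_z, 2))) +
  scale_fill_gradient2(midpoint = 0) +    # diverging scale centred on 0
  theme_bw()
```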
ggplot accepts data in the long format, which means that you need to melt your dataset using, for example, a function from the reshape2 package:
dfLong = melt(df,
id.vars = "mu",
variable.name = "noise",
value.name = "meas")
The resulting column noise contains entries such as noise0, noise1, etc. You can extract the numbers and convert to a numeric column:
dfLong$noise = with(dfLong, as.numeric(gsub("noise", "", noise)))
This converts your data to:
mu noise meas
1 1 0 1.0000000
2 2 0 2.0000000
3 3 0 3.0000000
...
As per ggplot documentation:
ggplot2 can not draw true 3D surfaces, but you can use geom_contour(), geom_contour_filled(), and geom_tile() to visualise 3D surfaces in 2D.
So, for example:
ggplot(dfLong,
aes(x = noise,
y = mu,
fill = meas)) +
geom_tile() +
scale_fill_gradientn(colours = terrain.colors(10))
Produces:
I picked up R recently and was trying out some code for data visualization. For practice, I created a small data frame to plot the data and understand the result.
First I tried plotting a simple vector, like temperature over a week, and the barplot function worked like a charm.
Later I moved on to plotting simple tabular data: the marks of students in two subjects, as shown below:
stuname sub1 sub2
st1 rocket 95 70
st2 Ash 58 85
I used the code below to create the data frame:
plotdata=data.frame("stuname"=c("rocket","Ash"),
"sub1"=c(95,58),
"sub2"=c(70,85),
row.names = c("st1","st2"))
I am using the code below to plot the data:
barplot(as.matrix(plotdata[ ,2:3]), xlab = "Stu", ylab = "marks", beside = TRUE)
I think the requirement is basic enough so I have not moved to ggplot yet.
This is what I'm getting:
This is what I was expecting:
I mean, this is how we would usually want to plot it: we can keep adding rows and the plot keeps growing, and a single figure shows all the marks for a particular student.
Separate just the numeric values and transpose them so that they will plot in the order you want. Note that if you transpose without separating the numeric values, they may be converted to character.
barplot(height = t(plotdata[c("sub1", "sub2")]),
names.arg = plotdata$stuname,
beside = TRUE)
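A quick way to see why the numeric columns have to be separated before transposing is to compare the two matrices directly. This sketch reuses the plotdata frame from the question; the names m_all and m_num are just for illustration.

```r
plotdata <- data.frame(
  stuname = c("rocket", "Ash"),
  sub1 = c(95, 58),
  sub2 = c(70, 85),
  row.names = c("st1", "st2")
)

# Transposing everything: the character column forces the whole matrix to character
m_all <- t(plotdata)

# Transposing only the marks: stays numeric, one column per student
m_num <- t(plotdata[c("sub1", "sub2")])

is.character(m_all)  # TRUE
is.numeric(m_num)    # TRUE
dim(m_num)           # 2 rows (subjects) by 2 columns (students)
```

With beside = TRUE, barplot() draws one group of bars per column of the matrix, so having students as columns is what gives one group per student.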
I would still recommend using ggplot, as it takes care of so many things for you:
library(reshape2)
library(ggplot2)
#Convert to long format
d = melt(plotdata, id.vars = "stuname")
ggplot(data = d,
mapping = aes(x = stuname, y = value, fill = variable)) +
geom_col(position = position_dodge())
I have been trying to plot one column of a data frame I created against another. The first column, "Time", holds daily dates (format YYYY-MM-DD), and the second column, "data1", holds the precipitation magnitude as a numeric value.
This data is taken from an Excel file, "St Lucia3", which has a total of 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
library(readxl)
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Precip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Plotting "data1" against "Time" with ggplot gives:
Can someone please explain to me why there is an "unusual kink" at the right end of the graph, even though there are no such values in the column "data1"?
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad() from the padr package we can restore the missing dates and assign them NA values, so that nothing is plotted in the regions of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format=
"%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: you can see we are missing a date, but still there is no gap between position 2 and 3, thus plotting versus indexing you would not see a gap.
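If you are not sure whether (or where) dates are missing in the real data, one simple check is to look at the differences between consecutive dates. A minimal sketch on the same four example dates; the variable names are only for illustration:

```r
dates <- as.Date(
  as.character(c(19810501, 19810502, 19810504, 19810505)),
  format = "%Y%m%d"
)

# diff() gives the number of days between consecutive observations;
# anything larger than 1 marks a gap
gap_after <- which(diff(dates) > 1)

dates[gap_after]      # last observed date before each gap
dates[gap_after + 1]  # first observed date after each gap
```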
So let's add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from the zoo package.
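As a small illustration of what na.locf() does ("last observation carried forward"), here it is applied to a vector with the same values as the padded Data column above:

```r
library(zoo)

rain <- c(1, 2, NA, 3, 4)

# Each NA is replaced by the most recent non-missing value before it
filled <- na.locf(rain)
filled
```

Note that by default na.locf() drops leading NAs, since there is nothing to carry forward; passing na.rm = FALSE keeps them so the result has the same length as the input.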
Here is a reproducible example - change the names to match your data.
# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
ggplot(dd, aes(t, y)) +
geom_point() +
geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall.
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.
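The sequence-plus-join step can likely also be written with complete() from tidyr, which adds rows for the missing combinations and fills the other columns with NA. This is a sketch with fixed dates and placeholder y values so the result is predictable; it is an alternative to the left_join() approach, not what the answer above used.

```r
library(tidyr)

# Nine observations with a gap between day 5 and day 30
dd <- data.frame(
  t = as.Date("2020-01-01") + c(0:5, 30:32),
  y = (1:9) / 10
)

# Expand t to every day between the first and last observation;
# y becomes NA on the days that were not in the original data
dd_complete <- complete(dd, t = seq(min(t), max(t), by = "day"))

nrow(dd_complete)        # one row per calendar day in the range
sum(is.na(dd_complete$y))  # number of days that were missing
```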
Here's a facsimile of my data:
d1 <- data.frame(
e=rnorm(3000,10,10)
)
d2 <- data.frame(
e=rnorm(2000,30,30)
)
So, I got around the problem of plotting two different density distributions from two very different datasets on the same graph by doing this:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2)
But when I try to manually add a legend, like so:
ggplot() +
geom_density(aes(x=e),fill="red",data=d1) +
geom_density(aes(x=e),fill="blue",data=d2) +
scale_fill_manual(name="Data", values = c("XXXXX" = "red","YYYYY" = "blue"))
Nothing happens. Does anybody know what's going wrong? I thought I could actually manually add legends if need be.
Generally ggplot works best when your data is in a single data.frame and in long format. In your case we therefore want to combine the data from both data.frames. For this simple example, we just concatenate the data into a long variable called d and use an additional column id to indicate to which dataset that value belongs.
d.f <- data.frame(id = rep(c("XXXXX", "YYYYY"), c(3000, 2000)),
d = c(d1$e, d2$e))
More complex data manipulations can be done using packages such as reshape2 and tidyr. I find this cheat sheet often useful. Then when we plot we map fill to id, and ggplot will take care of the legend automatically.
ggplot(d.f, aes(x = d, fill = id)) +
geom_density()
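As for why the manual legend in the question did nothing: fill = "red" outside aes() is a fixed setting, not a mapping, so there is no fill scale for scale_fill_manual() to act on. If you would rather keep the two data frames separate, a common workaround is to map fill to a constant label inside aes(). A sketch using the labels from the question (the alpha value is my addition so both densities stay visible):

```r
library(ggplot2)

set.seed(1)
d1 <- data.frame(e = rnorm(3000, 10, 10))
d2 <- data.frame(e = rnorm(2000, 30, 30))

p <- ggplot() +
  # fill is mapped to a label inside aes(), so it becomes part of the fill scale
  geom_density(aes(x = e, fill = "XXXXX"), data = d1, alpha = 0.5) +
  geom_density(aes(x = e, fill = "YYYYY"), data = d2, alpha = 0.5) +
  # the manual scale now has labels to translate into colours
  scale_fill_manual(name = "Data", values = c("XXXXX" = "red", "YYYYY" = "blue"))
```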
I'm running a Monte Carlo simulation and the output is in the form:
> d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
> d
iter k1 k2
1 0.2 0.3
2 0.6 0.4
The plots I want to generate are:
plot(d$iter, d$k1)
plot(density(d$k1))
I know how to produce the equivalent plots with ggplot2 by first converting to a long data frame:
new_d = data.frame(iter=rep(d$iter, 2),
k = c(d$k1, d$k2),
label = rep(c('k1', 'k2'), each=2))
Then plotting is easy. However, the number of iterations can be very large and the number of k's can also be large. This means messing about with a very large data frame.
Is there any way I can avoid creating this new data frame?
Thanks
Short answer is "no," you can't avoid creating a data frame. ggplot requires the data to be in a data frame. If you use qplot, you can give it separate vectors for x and y, but internally, it's still creating a data frame out of the parameters you pass in.
I agree with juba's suggestion -- learn to use the reshape function, or better yet the reshape package with melt/cast functions. Once you get fast with putting your data in long format, creating amazing ggplot graphs becomes one step closer!
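For reference, the hand-built new_d from the question corresponds to a single melt() call, which also scales to any number of k columns. A sketch with simulated values standing in for the real Monte Carlo output (the column names label and k follow the question; the 100 iterations are arbitrary):

```r
library(reshape2)
library(ggplot2)

# Stand-in for the simulation output: one row per iteration, one column per k
set.seed(1)
d <- data.frame(iter = 1:100, k1 = runif(100), k2 = runif(100))

# Every column except iter is stacked into (label, k) pairs
long_d <- melt(d, id.vars = "iter", variable.name = "label", value.name = "k")

# Trace of each parameter over iterations, and its density
p_trace <- ggplot(long_d, aes(x = iter, y = k, colour = label)) + geom_point()
p_dens  <- ggplot(long_d, aes(x = k, colour = label)) + geom_density()
```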
Yes, it is possible for you to avoid creating a data frame: just give an empty argument list to the base layer, ggplot(). Here is a complete example based on your code:
library(ggplot2)
d = data.frame(iter=seq(1, 2), k1 = c(0.2, 0.6), k2=c(0.3, 0.4))
# desired plots:
# plot(d$iter, d$k1)
# plot(density(d$k1))
ggplot() + geom_point(aes(x = d$iter, y = d$k1))
# there is not enough data for a good density plot,
# but this is how you would do it:
ggplot() + geom_density(aes(d$k1))
Note that although this allows for you not to create a data frame, a data frame might still be created internally. See, e.g., the following extract from ?geom_point:
All objects will be fortified to produce a data frame.
You can use the reshape function to transform your data frame to "long" format. Maybe it is a bit faster than your code?
R> reshape(d, direction="long",varying=list(c("k1","k2")),v.names="k",times=c("k1","k2"))
iter time k id
1.k1 1 k1 0.2 1
2.k1 2 k1 0.6 2
1.k2 1 k2 0.3 1
2.k2 2 k2 0.4 2
So just to add to the previous answers. With qplot you could do
p <- qplot(y=d$k2, x=d$k1)
and then from there building it further, e.g. with
p + theme_bw()
But I agree - melt/cast is generally the way forward.
Just pass NULL as the data frame, and define the necessary aesthetics using the data vectors. Quick example:
library(MASS)
library(tidyverse)
library(ranger)
rf <- ranger(medv ~ ., data = Boston, importance = "impurity")
rf$variable.importance
ggplot(NULL, aes(x = fct_reorder(names(rf$variable.importance), rf$variable.importance),
y = rf$variable.importance)) +
geom_col(fill = "navy blue", alpha = 0.7) +
coord_flip() +
labs(x = "Predictor", y = "Importance", title = "Random Forest") +
theme_bw()