I have data that looks like this
year species number.of.seed dist.to.forest
2006 0 -5
2006 Bridelia_speciosa 3 0
2006 0 5
2006 Carapa 5 10
2006 0 15
I have created a bar chart that shows, for each year, the number of different species found in seed traps at each distance from the forest, which looks like this:
But I would like to use geom = "dotplot" and have a single dot representing each species I have counted: basically exactly the same as the bar chart, but with 24 dots instead of the first bar in year 2006, 23 dots instead of the second bar, and so on. When I use geom = "dotplot" I just can't get it to work the way I want. I can get a single dot at 24 or 23, but not 24 dots. I have tried a number of solutions to similar problems on SO, but nothing is working. Many thanks in advance.
my code:
dat1<-read.csv(file="clean_06_14_inc_dist1.csv")
diversity<-function(years){
results<-data.frame()
dat2<-subset(dat1, dat1$year %in% years)
for (j in years){
for (i in seq(-5, 50, 5)){
dat3<-subset(dat2, dat2$year == j & dat2$dist.to.forest == i)
a<-length(unique(dat3$species))
new_row<-data.frame(year = j, dist.to.forest = i, number.of.species = a)
results<-rbind(results, new_row)
}}
print(results)
qplot(x = dist.to.forest, y =number.of.species, data = results, facets = .~year, geom = "bar", stat = "identity")
}
diversity(2006:2008)
I think your problem is that you are trying to draw a dotplot with both an x and a y value, as in your bar chart, whereas I believe dotplots are meant to be used like histograms, taking only one variable.
So, if I'm not mistaken, if you make your dataframe distinct in the variables you are interested in (since you wanted unique number of species), you can plot it straight away, basically something like
dat2 = unique(dat1[,c("year","species", "dist.to.forest")])
qplot(x=dist.to.forest, data=dat2, facets=.~year, geom="dotplot")
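A minimal sketch of the de-duplication idea above, using a few made-up rows (the values are illustrative and not taken from the real CSV). It uses ggplot() with geom_dotplot() instead of qplot(), and binwidth = 5 is an assumption chosen to match the 5-unit spacing of dist.to.forest.

```r
library(ggplot2)

# Toy data (hypothetical): the same species can appear several times per trap
dat1 <- data.frame(
  year = c(2006, 2006, 2006, 2006, 2007, 2007),
  species = c("Bridelia_speciosa", "Bridelia_speciosa", "Carapa", "Carapa",
              "Carapa", "Bridelia_speciosa"),
  number.of.seed = c(3, 1, 5, 2, 4, 1),
  dist.to.forest = c(0, 0, 0, 10, 10, 10)
)

# One row per year/species/distance combination, so each dot is one species
dat2 <- unique(dat1[, c("year", "species", "dist.to.forest")])

# Only x is mapped; the dotplot stacks one dot per remaining row
p <- ggplot(dat2, aes(x = dist.to.forest)) +
  geom_dotplot(binwidth = 5) +
  facet_grid(. ~ year)
```

Calling print(p) draws the plot. If the real data contain rows with an empty species field, as the sample in the question suggests, those rows would probably need to be dropped first so that they are not drawn as a species.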
On a side note, I think you may be making this charting more complicated than it needs to be, and you may want to look into dplyr, which makes this kind of data manipulation a breeze.
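library(dplyr)
library(ggplot2)
dat1 = read.csv(file="clean_06_14_inc_dist1.csv")
# %>% replaces the old %.% operator, which is no longer part of dplyr
dat2 = dat1 %>%
  filter(year %in% 2006:2008) %>%
  group_by(year, dist.to.forest) %>%
  summarise(number.of.species = n_distinct(species))
# geom_col() draws bars whose heights are the precomputed counts
ggplot(dat2, aes(x=dist.to.forest, y=number.of.species)) + geom_col() + facet_grid(.~year)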
Disclaimer: Since you did not provide any sample data, this code is just off the top of my head and may contain errors.
Related
I have a dataframe consisting of car models in countries with an associated value which looks like this
Car Country Value
Audi A6 US 23
Audi A6 UK 12
Audi A6 DE 19
BMW X5 UK 8
BMW X5 DE 5
etc
Now, I want to make a histogram of the Values column and I also want the colour of the bars indicating whether there are a large amount of Audi A6 models in this bar for example.
I know how to make a histogram using ggplot:
qplot(beta_0jk[data$Value],
geom="histogram", fill=I("lightblue"))
But does someone know how I can let the colour depend on the Car or Country columns in this dataframe? Or does someone know a different way than a histogram for visualizing this?
First of all, I would seriously recommend looking up the cheat sheets for R, which are very conveniently placed here.
I'm personally more used to writing the full version of the ggplot function, because it is clearer once you get more familiar with this library.
Problem
First you need to understand the idea behind histograms: a histogram is useful when you don't already have a value for each group and want to count how often (or how densely) some characteristic occurs. In your case you already have the values in your data frame, so you just need simple dots to represent them.
That is easy to do with some understanding of ggplot.
Aesthetics
When you use the ggplot() function, it takes some basic arguments.
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
The data you provide is just the whole beta_0jk data frame. The mapping ties plot elements to your columns, so you need to specify them:
x - something to group your values by; I would say you want "Car" here, to specify the model
y - this should be clear: "Value" is the variable you measure, so you choose it to represent the y-axis value
col - this is again a grouping, but it works differently from x: it gives a different colour to every group you specify. To use it, you have to make sure your column is a factor
Implementation
# Every call is namespace-qualified, so this also works without library(ggplot2)
ggplot2::ggplot(beta_0jk, ggplot2::aes(
  x = Car,
  y = Value,
  col = Country)
) + ggplot2::geom_jitter()
Start from this and use the ggplot2 cheat sheet to get to your desired result because, to be honest, I don't know exactly what you want to show. I also recommend looking up the dplyr and tidyr libraries.
Is this what you are looking for? To have all bars the same width I had to pad the data with an extra row, since there is no Country == 'US' when Car == 'BMW X5'. The data preparation pipe %>% was entirely inspired by this answer.
library(tidyverse)
library(ggplot2)
data %>%
spread(key = Car, value = Value, fill = NA) %>%
gather(key = Car, value = Value, -Country) %>%
ggplot(aes(x = Car, y = Value, fill = Country)) +
geom_col(position = position_dodge())
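As a possible alternative to the spread()/gather() round trip, the same padding could be sketched with tidyr::complete(), which adds a row for every missing combination of the listed columns. The data frame below simply retypes the rows from the question; using complete() here is my suggestion and not part of the original answer.

```r
library(tidyr)

# Rows copied from the question's example table
data <- data.frame(
  Car = c("Audi A6", "Audi A6", "Audi A6", "BMW X5", "BMW X5"),
  Country = c("US", "UK", "DE", "UK", "DE"),
  Value = c(23, 12, 19, 8, 5)
)

# Add a row for every absent Car/Country pair; Value is NA in the new rows
filled <- complete(data, Car, Country)
```

The filled data frame can then go straight into the same ggplot() + geom_col(position = position_dodge()) call.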
Data.
data <- read.table(text = "
Car Country Value
'Audi A6' US 23
'Audi A6' UK 12
'Audi A6' DE 19
'BMW X5' UK 8
'BMW X5' DE 5
", header = TRUE)
I have a decadal time series from 1700 to 1900 (21 time slices), and for each decade I have 7 categories that each represent a quantity; see here
As you can see, only 5 of the decades actually have data.
I can plot a nice little stacked area chart in R, with the help of this very nice example, which retains only the 5 time slices that have data.
My problem is that I want an x-axis that retains all 21 time slices but still plots a stacked area chart using only the 5 time slices that have data. The idea is that each stacked area is still plotted against the correct year but simply connects to the next available point, say 10 ticks further along the x-axis, ignoring the missing data in between. I can achieve something similar in Excel, but I don't like it.
My reasoning is that I want to plot lines on top of the stacked area that are much more complete, for example from 1700 to 1850, or from 1800 to 1900, for visual comparison purposes.
This post suggests how to connect dots in a line chart when you want to ignore NAs, but it doesn't work for me in this instance.
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
df
thanks a lot
If you wish to convert your year to a factor, you could do something along the lines of the code below:
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
It will generate the chart below:
I wasn't sure whether you are interested in mapping all of the X variables. I assumed that this is the case, so I reshaped your data. It is presumably wiser not to convert Year to a factor. The code below:
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
# Leave it as int.
# df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
would generate a much more meaningful chart:
Potentially, if you decide to use years as factors you may group them and have one category for a number of missing years so the x-axis is more readable. I would say it's a matter of presentation to great extent.
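One way to get the behaviour asked for in the question could be sketched as follows: drop the NA rows so the areas connect straight from one available decade to the next, and force the numeric x-axis to cover the full range. The seed, the tick spacing and the use of tidyr::pivot_longer() in place of melt() are my own assumptions, so treat this as a starting point and not a definitive answer.

```r
library(tidyr)
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame(Year = b, replicate(7, sample(1:21)))
rows <- c(2:10, 11:15, 17, 19, 21)
df[rows, 2:8] <- NA

# Long format, then keep only the decades that actually have data
long <- pivot_longer(df, -Year, names_to = "variable", values_to = "value")
long <- long[!is.na(long$value), ]

# Year stays numeric; the axis limits are set to span all 21 decades
p <- ggplot(long, aes(Year, value, fill = variable)) +
  geom_area(position = "stack") +
  scale_x_continuous(limits = c(1700, 1900), breaks = seq(1700, 1900, 50))
```

Because Year remains numeric, additional geom_line() layers built from more complete series should share the same axis when added to p.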
I have been trying to plot a line plot with ggplot.
My data looks something like this:
I04 F04 I05 F05 I06 F06
CAT 3 12 2 6 6 20
DOG 0 0 0 0 0 0
BIEBER 1 0 0 1 0 0
and can be found here.
Basically, we have a certain number of CATs (or other creatures) initially in a year (this is I04), and a certain number of CATs at the end of the year (this is F04). This goes on for some time.
I can plot something like this fairly simply using the code below, and get this:
This is fantastic, but doesn't work very well for me. After all, I have these starting and ending inventories for each year, so I am interested in seeing how the initial values (I04, I05, I06) change over time. For each animal, I would like to create two different lines, one for the initial quantity and one for the final quantity (F04, F05, F06). It seems to me that I now have to consider two factors.
This is really difficult given the way my data is set up. I'm not sure how to tell ggplot that all the I prefixed years are one factor, and all the F prefixed years are another factor. When the dataframe gets melted, it's too late. I'm not sure how to control this situation.
Any advice on how I can separate these values or perhaps another, better way to tackle this situation?
Here is the code I have:
library(ggplot2)
library(reshape2)
DF <- read.csv("mydata.csv", stringsAsFactors=FALSE)
## cleaning up, converting factors to numeric, etc
text_names <- data.frame(as.character(DF$animals))
names(text_names) <- c("animals")
numeric_cols <- DF[, -c(1)]
numeric_cols <- sapply(numeric_cols, as.numeric)
plot_me <- data.frame(cbind(text_names, numeric_cols))
plot_me$animals <- as.factor(plot_me$animals)
meltedDF <- melt(plot_me)
p <- ggplot()
p <- p + geom_line(aes(seq(1:36), meltedDF$value, group=meltedDF$animals, color=meltedDF$animals))
p
Using your original data from the link:
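library(ggplot2)
# mydata is the data frame read from the linked file; sep = "" splits names such as I04 into "I" and time 4
nd <- reshape(mydata, idvar = "animals", direction = "long", varying = names(mydata)[-1], sep = "")
# Initial inventories over time, one line per animal
ggplot(nd, aes(x = time, y = I, group = animals, colour = animals)) + geom_line() + ggtitle("Development of initial inventories")
# Final inventories over time, one line per animal
ggplot(nd, aes(x = time, y = F, group = animals, colour = animals)) + geom_line() + ggtitle("Development of final inventories")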
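To see what the reshape() call produces, here is a small self-contained sketch that retypes the three sample rows from the question. The column name animals is an assumption based on the question's own code, since the sample table does not show a header for the first column.

```r
# Sample rows from the question; the column name `animals` is an assumption
mydata <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# sep = "" lets reshape() split each name between its letter and its digits
nd <- reshape(mydata, idvar = "animals", direction = "long",
              varying = names(mydata)[-1], sep = "")

# nd has one row per animal and year, with columns animals, time, I and F
cat_rows <- nd[nd$animals == "CAT", ]
```

Each animal now has one row per year, with the initial and final quantities side by side in the I and F columns.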
I think from a data analyst perspective the following approach might provide better insight.
For each animal we visualize the initial and the final quantity in a separate panel. Moreover, each subplot has its own y scale because the values of the different animal types are radically different. Like this, differences within and across animal types are easier to spot.
Given the current structure of your data, we do not need two different factors. After the gather call the indicator column includes data like I04, F04, etc. We just need to separate the first character from the rest resulting in two columns type and time. We can use type as the argument for color in the ggplot call. time provides a unified x-axis across all animal types.
library(tidyr)
library(dplyr)
library(ggplot2)
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep = 1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = type)) +
geom_line() +
facet_grid(animals ~ ., scales = "free_y")
Of course, you might also do it the other way round, namely using a subplot for the initial and the final quantities like this:
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep=1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = animals)) +
geom_line() +
facet_grid(type ~ ., scales = "free_y")
But as described above, I would not recommend that because the y scale varies too much across animal types.
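In newer tidyr releases gather() has been superseded by pivot_longer(), which can split the column names in the same step. The sketch below is an assumed modern equivalent of the gather()/separate() pipeline above, using the sample rows from the question; the regular expression in names_pattern is my own choice.

```r
library(tidyr)

# Sample rows from the question; the column name `animals` is an assumption
data <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# Split names such as "I04" into type = "I" and time = "04" while reshaping
long <- pivot_longer(
  data, -animals,
  names_to = c("type", "time"),
  names_pattern = "([IF])(\\d+)",
  values_to = "value"
)
long$time <- as.numeric(long$time)
```

The resulting long data frame has the same type, time and value columns that the ggplot() calls above expect.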
I've had a good look around this site and others on how to set the hjust and vjust according to a value in a particular column. The following shows how the data is structured (but is a simplified subset of many entries for many years):
YearStart <- c(2001,2002,2003,2001,2002,2003)
Team <- c("MU","MU","MU","MC","MC","MC")
Attendance <- c(67586,67601,67640,33058,34564,46834)
Position <- c(3,1,3,1,9,16)
offset <-c()
df <- data.frame(YearStart,Team,Attendance,Position)
so
> head(df)
YearStart Team Attendance Position
1 2001 MU 67586 3
2 2002 MU 67601 1
3 2003 MU 67640 3
4 2001 MC 33058 1
5 2002 MC 34564 9
6 2003 MC 46834 16
What I would like to achieve is a vjust value based on the Team. In the following, MU would be vjust=1 and MC would be vjust=-1, so I can control where the data label is placed according to the data group it is associated with.
I've tried to hack around a couple of examples that use a function containing a lookup table (it's not a straight ifelse as I have many values for Team) but I can't seem to pass a string to the function through the aes method along these lines:
lut <- list(MU=1,MC=-1)
vj <-function(x){lut[[x]]}
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(vjust = vj(Team) ) )
print(p)
The following is pseudo(ish)code which applies the labels twice to each group in each location above and below the points.
p=ggplot(df, aes(YearStart, Attendance, label=Position, group=Team))+
geom_point()+
geom_text(aes(Team="MU"), vjust=1)+
geom_text(aes(Team="MC"), vjust=-1)
print(p)
I've tried several other strategies for this and I can't tell whether I'm trying this from the wrong direction or I'm just missing a very trivial piece of ggplot syntax. I've accomplished a stop-gap solution by labelling them manually in Excel but that's not sustainable :-)
To specify an aesthetic, that aesthetic should be a column in your data.frame.
(Notice also that your lookup function should have single brackets, not double.)
And a final thought: vjust and hjust are strictly only defined between [0, 1] for left/bottom and right/top justification. In practise, however, it is usually possible to extend this. I find that settings of (-0.2, 1.2) work quite well, in most cases.
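# A named numeric vector keeps the lookup result numeric (a list would give a list column)
lut <- c(MU=-0.2, MC=1.2)
# Index by name: as.character() stops a factor Team from being used as integer codes
vj <- function(x) unname(lut[as.character(x)])
df$offset <- vj(df$Team)
library(ggplot2)
ggplot(df, aes(YearStart, Attendance, label=Position, group=Team)) +
  geom_point(aes(colour=Team)) +
  geom_text(aes(vjust = offset))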
I'm struggling to get the right ordering of variables in a graph I made with ggplot2 in R.
Suppose I have a dataframe such as:
set.seed(1234)
my_df<- data.frame(matrix(0,8,4))
names(my_df) <- c("year", "variable", "value", "vartype")
my_df$year <- rep(2006:2007)
my_df$variable <- c(rep("VX",2),rep("VB",2),rep("VZ",2),rep("VD",2))
my_df$value <- runif(8, 5,10)
my_df$vartype<- c(rep("TA",4), rep("TB",4))
which yields the following table:
year variable value vartype
1 2006 VX 5.568517 TA
2 2007 VX 8.111497 TA
3 2006 VB 8.046374 TA
4 2007 VB 8.116897 TA
5 2006 VZ 9.304577 TB
6 2007 VZ 8.201553 TB
7 2006 VD 5.047479 TB
8 2007 VD 6.162753 TB
There are four variables (VX, VB, VZ and VD), belonging to two groups of variable types, (TA and TB).
I would like to plot the values as horizontal bars on the y axis, ordered vertically first by variable groups and then by variable names, faceted by year, with values on the x axis and fill colour corresponding to variable group.
(i.e. in this simplified example, the order should be, top to bottom, VB, VX, VD, VZ)
1) My first attempt has been to try the following:
ggplot(my_df,
aes(x=variable, y=value, fill=vartype, order=vartype)) +
# adding or removing the aesthetic "order=vartype" doesn't change anything
geom_bar() +
facet_grid(. ~ year) +
coord_flip()
However, the variables are listed in reverse alphabetical order, but not by vartype : the order=vartype aesthetic is ignored.
2) Following an answer to a similar question I posted yesterday, i tried the following, based on the post Order Bars in ggplot2 bar graph :
my_df$variable <- factor(
my_df$variable,
levels=rev(sort(unique(my_df$variable))),
ordered=TRUE
)
This approach does get the variables in vertical alphabetical order in the plot, but ignores the fact that the variables should be ordered first by variable groups (with TA-variables on top and TB-variables below).
3) The following gives the same as 2 (above):
my_df$vartype <- factor(
my_df$vartype,
levels=sort(unique(my_df$vartype)),
ordered=TRUE
)
... which has the same issues as the first approach (variables listed in reverse alphabetical order, groups ignored)
4) Another approach, based on the original answer to Order Bars in ggplot2 bar graph, also gives the same plot as 2, above
my_df <- within(my_df,
vartype <- factor(vartype,
levels=names(sort(table(vartype),
decreasing=TRUE)))
)
I'm puzzled by the fact that, despite several approaches, the aesthetic order=vartype is ignored. Still, it seems to work in an unrelated problem: http://learnr.wordpress.com/2010/03/23/ggplot2-changing-the-default-order-of-legend-labels-and-stacking-of-data/
I hope that the problem is clear and welcome any suggestions.
Matteo
I posted a similar question yesterday but, unfortunately, I made several mistakes when describing the problem and providing a reproducible example.
I've listened to several suggestions since, thoroughly searched Stack Overflow for similar questions and applied, to the best of my knowledge, every suggested combination of solutions, to no avail.
I'm posting the question again hoping to be able to solve my issue and, hopefully, be helpful to others.
This has little to do with ggplot, but is instead a question about generating an ordering of variables to use to reorder the levels of a factor. Here is your data, implemented using the various functions to better effect:
set.seed(1234)
df2 <- data.frame(year = rep(2006:2007),
variable = rep(c("VX","VB","VZ","VD"), each = 2),
value = runif(8, 5,10),
vartype = rep(c("TA","TB"), each = 4))
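Note that this way variable and vartype are factors, at least in R versions before 4.0.0, where data.frame() converts strings to factors by default; from R 4.0.0 onwards, add stringsAsFactors = TRUE to the data.frame() call to get the same result. If they aren't factors, ggplot() will coerce them and then you get left with alphabetical ordering. I have said this before and will no doubt say it again: get your data into the correct format first before you start plotting / doing data analysis.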
You want the following ordering:
> with(df2, order(vartype, variable))
[1] 3 4 1 2 7 8 5 6
where you should note that we get the ordering by vartype first and only then by variable within the levels of vartype. If we use this to reorder the levels of variable we get:
> with(df2, reorder(variable, order(vartype, variable)))
[1] VX VX VB VB VZ VZ VD VD
attr(,"scores")
VB VD VX VZ
1.5 5.5 3.5 7.5
Levels: VB VX VD VZ
(ignore the attr(,"scores") bit and focus on the Levels). This has the right ordering, but ggplot() will draw them bottom to top and you wanted top to bottom. I'm not sufficiently familiar with ggplot() to know if this can be controlled, so we will also need to reverse the ordering using decreasing = TRUE in the call to order().
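The same level ordering can also be built explicitly, which may be easier to follow than passing order() to reorder(). This is only a sketch of an alternative, not the method used in this answer: it sorts the rows by group and then by name, takes the variable names in order of first appearance, and reverses them because the flipped plot draws the first level at the bottom. The values are rounded from the table in the question.

```r
# Data as in the question, with values rounded from the printed table
df2 <- data.frame(
  year = rep(2006:2007, times = 4),
  variable = rep(c("VX", "VB", "VZ", "VD"), each = 2),
  value = c(5.6, 8.1, 8.0, 8.1, 9.3, 8.2, 5.0, 6.2),
  vartype = rep(c("TA", "TB"), each = 4),
  stringsAsFactors = FALSE
)

# Variable names sorted by group first, then by name within each group
top_to_bottom <- unique(df2$variable[order(df2$vartype, df2$variable)])

# The first level is drawn at the bottom after coord_flip(), so reverse
df2$variable <- factor(df2$variable, levels = rev(top_to_bottom))
```

With these levels the flipped bar chart should list VB, VX, VD, VZ from top to bottom, as requested in the question.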
Putting this all together we have:
## reorder `variable` on `variable` within `vartype`
df3 <- transform(df2, variable = reorder(variable, order(vartype, variable,
decreasing = TRUE)))
Which when used with your plotting code:
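ggplot(df3, aes(x=variable, y=value, fill=vartype)) +
  # stat = "identity" is needed in current ggplot2 because y is mapped to precomputed values
  geom_bar(stat = "identity") +
  facet_grid(. ~ year) +
  coord_flip()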
produces this: