I need to look at relative change in 2 groups of data which have very different scales.
I therefore think the way forward is to set each group's first value to 100% and express every later value as a proportion of it. I can then create a line chart to show the relative movement.
I would call this an index chart so may have missed existing questions.
However I don't know how to set my data up in R to do this.
My aggregated data are below. I want each group's 1999 value to be 100% and the subsequent years to be a percentage of that.
> Totals
year fips Emissions
1 1999 06037 6109.6900
2 2002 06037 7188.6802
3 2005 06037 7304.1149
4 2008 06037 6421.0170
5 1999 24510 403.7700
6 2002 24510 192.0078
7 2005 24510 185.4144
8 2008 24510 138.2402
I'm probably going to want to add a bar chart behind it to show weighting too as relative change is much more dramatic for smaller data. Tips on that are appreciated too but I've not searched for that yet as the above is the primary issue IMO.
Appreciate your help.
James
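For example, with dplyr (assuming df1 holds your Totals data; ind is a ratio, so multiply by 100 if you want percentages):
library(dplyr)
dat <-
  df1 %>%
  group_by(fips) %>%
  # first() uses the first row of each group, so rows must be sorted by year
  mutate(ind = Emissions / first(Emissions))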
And using ggplot2 to plot a line chart:
library(ggplot2)
ggplot(dat, aes(x = year, y = ind, color = as.factor(fips))) +
geom_line()
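If dplyr is not available, the same index can be sketched in base R with ave(). This is only an illustration: the Totals data frame is rebuilt here from the values in the question, the column name index is made up for this example, and the index is scaled so that 1999 equals 100.

```r
# Rebuild the aggregated data from the question (fips kept as character
# so the leading zero in "06037" survives).
Totals <- data.frame(
  year = rep(c(1999, 2002, 2005, 2008), times = 2),
  fips = rep(c("06037", "24510"), each = 4),
  Emissions = c(6109.6900, 7188.6802, 7304.1149, 6421.0170,
                403.7700, 192.0078, 185.4144, 138.2402)
)

# Sort so that the first row within each fips is the base year.
Totals <- Totals[order(Totals$fips, Totals$year), ]

# ave() applies FUN within each fips group and returns a vector
# aligned with the original rows.
Totals$index <- ave(Totals$Emissions, Totals$fips,
                    FUN = function(x) 100 * x / x[1])

print(Totals)
```

The index column can then be mapped to y in the ggplot call above in place of ind.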
I have a dataset (df) that looks like this:
EIN Year Cat Fund
1 16 2005 A 9784.490
2 16 2006 A 10020.720
3 16 2007 A 9232.796
4 15 2008 B 8567.893
5 15 2009 B 10292.670
6 17 2010 C 9274.589
The data has relatively large dimensions (around 300k observations), which makes plotting a potentially slow process. I would like to plot the variable Fund for each year, by the identifier EIN. Based on this post I have tried the following code:
library(ggplot2)
ggplot(df, mapping = aes(x = Year, y = Fund)) +
geom_line(aes(linetype = as.factor(EIN)))
Here are my questions:
This code becomes pretty slow given the large number of observations that I have. Do you suggest any alternatives that could speed up the process?
Since I have a huge number of EINs, the legend ends up taking all the space available for the graph, so I would like to get rid of it, but so far without success. I tried adding + guides(fill=FALSE) at the end, but it did not work. Any advice?
If I wanted to either subset or color code my plot by Cat, what would be the best way to do it?
Thanks a lot for your help!
You can get rid of the legend using:
+ theme(legend.position = 'none')
To subset (facet) your plot, especially if there aren't too many categories, use facet_wrap:
+ facet_wrap(~Cat)
To colour instead, put colour = Cat inside your aes() call.
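Putting these pieces together, a minimal sketch might look like the following. It assumes the column names from the question and uses only the six sample rows. Mapping EIN to group rather than linetype is a suggestion of this sketch, not part of the answer above; it draws one line per EIN without creating a legend entry for each one. The original guides(fill=FALSE) most likely had no effect because the legend came from the linetype aesthetic, not from fill.

```r
library(ggplot2)

# Sample rows copied from the question.
df <- data.frame(
  EIN  = c(16, 16, 16, 15, 15, 17),
  Year = c(2005, 2006, 2007, 2008, 2009, 2010),
  Cat  = c("A", "A", "A", "B", "B", "C"),
  Fund = c(9784.490, 10020.720, 9232.796, 8567.893, 10292.670, 9274.589)
)

# group = EIN separates the lines; colour = Cat colour-codes them by category.
p <- ggplot(df, aes(x = Year, y = Fund, group = EIN, colour = Cat)) +
  geom_line() +
  facet_wrap(~Cat)  # one panel per category

# Drop the legend entirely if it is still too large.
p_nolegend <- p + theme(legend.position = "none")
```

Printing p or p_nolegend draws the chart. Whether this is noticeably faster on 300k rows is not something this sketch tests.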
I'm fairly new to R and I've been having trouble with a plot.
I'm trying to create a line plot with:
$YEAR on the X axis
$METRIC on the Y axis
a different-colored line for each country (meaning, a total of 3 lines on the same plot)
$COUNTRY is a factor with 3 levels
COUNTRY YEAR METRIC
USA 2000 14.874
USA 2001 15.492
USA 2002 13.091
USA 2003 14.717
CAN 1999 15.031
CAN 2000 14.343
CAN 2001 12.972
CAN 2002 13.216
SWE 1999 14.771
SWE 2000 17.033
SWE 2001 15.932
SWE 2002 14.516
SWE 2003 15.655
When I create the plot with
plot(df$YEAR, df$METRIC, col=df$COUNTRY, type="p")
I get a plot with points for each (x,y) combination and different color for each level of the factor $COUNTRY
However, when I try to get a line for each country, with
plot(df$YEAR, df$METRIC, col=df$COUNTRY, type="l")
I get one non-stopping line, that starts with the 4 observations of "USA" and then goes back to the first year of the next country ("CAN").
Can anyone explain why this is happening?
Is it possible to create this plot using only the pre-built functions?
Thank you in advance for any assistance.
Other than my comments above, here is a basic base implementation. If initially your $COUNTRY is a factor (is.factor(df$COUNTRY)), then you can skip the creation of ctryfctr and change the lines call to lines(..., col=x$COUNTRY[1]):
df$ctryfctr <- factor(df$COUNTRY)
plot(NA, xlim=range(df$YEAR), ylim=range(df$METRIC))
for (x in split(df, df$COUNTRY)) lines(x$YEAR, x$METRIC, col=x$ctryfctr[1])
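As for why the single call fails: type="l" simply joins the points in row order, and a line drawn by one plot() call takes a single colour, so the three countries end up as one continuous line. A self-contained version of the loop above, with the data from the question typed in and a legend added, could look like this. The name by_country and the use of integer colour codes are choices made for this sketch only.

```r
# Data copied from the question; COUNTRY is a factor with 3 levels.
df <- data.frame(
  COUNTRY = factor(rep(c("USA", "CAN", "SWE"), times = c(4, 4, 5))),
  YEAR    = c(2000:2003, 1999:2002, 1999:2003),
  METRIC  = c(14.874, 15.492, 13.091, 14.717,
              15.031, 14.343, 12.972, 13.216,
              14.771, 17.033, 15.932, 14.516, 15.655)
)

# One data frame per country, in factor-level order.
by_country <- split(df, df$COUNTRY)

# Empty plot with axes wide enough for every country.
plot(NA, xlim = range(df$YEAR), ylim = range(df$METRIC),
     xlab = "YEAR", ylab = "METRIC")

# Draw each country as its own line with its own colour.
for (i in seq_along(by_country)) {
  lines(by_country[[i]]$YEAR, by_country[[i]]$METRIC, col = i)
}

legend("topright", legend = names(by_country),
       col = seq_along(by_country), lty = 1)
```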
Since you seem to mix up some concepts, I thought it would be helpful to clarify things a bit.
R's base plotting functions are great for quick sketching without prior knowledge, but more complicated plots are easier to define with the ggplot2 package. You can install it with install.packages("ggplot2"). With ggplot2 you can group the lines as you already tried, and as r2evans already pointed out.
library(ggplot2)
ggplot(df) + geom_line(aes(YEAR, METRIC, group=COUNTRY, color=COUNTRY))
So, you tell ggplot that you are using df as your data. You define the x and y axes for geom_line inside aes(). With group= you define the grouping variable, and with color= you specify that each line uses a different color.
Hope that you have a great time with R and ggplot2!
I created a bar graph in ggplot to show how counts in column scheme changed over time (i.e. from 2001 to 2016).
The x-axis is the year and the y-axis shows the frequencies (I used fill= to split the counts by scheme).
The data set consists of two columns (year and scheme) filled with character values:
year scheme
2016 yes
2016 yes
2016 yes
2016 yes
2015 yes
2015 yes
2014 yes
2013 yes
....
2006 no
2006 no
2006 no
2006 no
2005 no
2005 no
2004 no
2003 no
2002 no
2002 no
2001 no
2001 no
My code:
a <- ggplot(s) +
stat_bin(aes(x=year, fill=scheme, group=scheme), geom="bar", position = "dodge",bins=30)
b <- a + scale_x_continuous(breaks = c(2001:2016), labels = factor(2001:2016))
c <- b + theme(axis.text.x=element_text(size = 10, colour = "black"))
The graph:
The problem I have is that the bars are shifted in the graph for no apparent reason. You can see it by comparing the bars with the year labels on the x-axis: some bars are moved too far to the left (e.g. 2007) and others too far to the right (e.g. 2002).
I have no clue why this happened or how I can fix it. Any suggestions are very welcome.
Use binwidth = 1 instead of bins = 30. When you specify there should be 30 bins, you're asking for the years to be broken into the segments whose endpoints are sequential values in seq(2001, 2016, length.out = 30).
All the weird gaps are from the bins which didn't include a whole number.
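The effect can be checked in base R without drawing anything. The sketch below follows the description above and treats seq(2001, 2016, length.out = 30) as the bin edges; ggplot2's actual bin placement may differ in detail, and the variable names are made up for this example.

```r
years <- 2001:2016  # the whole-number years that appear in the data

# 30 equally spaced edges give 29 bins, each about 0.52 years wide.
breaks_30 <- seq(2001, 2016, length.out = 30)
counts_30 <- table(cut(years, breaks = breaks_30, include.lowest = TRUE))

# Edges at the half-years give 16 bins, each 1 year wide.
breaks_1 <- seq(2000.5, 2016.5, by = 1)
counts_1 <- table(cut(years, breaks = breaks_1))

sum(counts_30 == 0)  # number of narrow bins that contain no whole year
all(counts_1 == 1)   # every 1-year bin contains exactly one year
```

Because the narrow bins are shorter than a year, 13 of the 29 can never receive any data, and the occupied ones are not centred on the year labels.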
I am using xyplot in lattice to make a plot that shows temperature change over time alongside count data. I am not sure whether ggplot2 would be better. My data is arranged like this:
Year (1998 1998 1999 2000 2001 2001 2002)
Low (2.777778 8.333330 10.555556 4.444444 26.388889 15.555556 12.500000)
Geese (2 14 10 16 7 10 15)
State (Arkansas California California California California Florida California)
I am stuck at this part of the code:
xyplot(c(geese,low)~year,subset=state=="California", par.settings=bwtheme, auto.key=TRUE)
The plot shows geese and low (temperature) with the same type of point, and if I add a line there is no separation between the two. Any help with this would be appreciated.
To plot multiple series on the same plot, use + rather than c() to specify multiple y values. For example
xyplot(geese + low ~year, subset=state=="California", auto.key=TRUE, type="b")
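That will produce a single panel with geese and low drawn as separate series (the original image is not included here).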
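For a version that can be run as is, the sample values from the question can be put into a data frame and passed through the data argument. The data frame name birds and the lower-case column names are assumptions made for this sketch, chosen to match the names used in the xyplot call.

```r
library(lattice)

# Sample values copied from the question.
birds <- data.frame(
  year  = c(1998, 1998, 1999, 2000, 2001, 2001, 2002),
  low   = c(2.777778, 8.333330, 10.555556, 4.444444,
            26.388889, 15.555556, 12.500000),
  geese = c(2, 14, 10, 16, 7, 10, 15),
  state = c("Arkansas", "California", "California", "California",
            "California", "Florida", "California")
)

# geese + low on the left-hand side draws two series in one panel;
# type = "b" adds both points and lines, auto.key adds a legend.
p <- xyplot(geese + low ~ year, data = birds,
            subset = state == "California",
            type = "b", auto.key = TRUE)
print(p)
```

Both series share one y-axis here, so the temperature and the counts are only comparable in shape, not in units.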
I am making a visualization that involves factors, ratios, and countries. There are about 15 factors, and I am trying to use small multiples to create a large graph where the X and Y axes are the factors, i.e., roughly:
Population
Num of Cars
Num of houses
Num of Houses Num of Cars Population
Each intersection would be a plot of the values for each country (so the plot at the intersection of # of Cars and # of Houses would be # of houses vs. # of cars, etc.). I currently have a data frame with the column headers country, factors, and ratios. I've tried a few methods (facet_grid, facet_wrap, etc.) but just can't get any output: when I run the script, a blank screen pops up. I haven't been able to figure out how to search for the type of small-multiples plot I'm trying to create. I'm also brand new to R and have been stuck for a great many hours.
Any advice?
Edited: More information
some sample data:
factor country year ratio
1 LiteracyRate Afghanistan 2000 0.3622047
2 PostSecondarySchoolAgePopulation Afghanistan 2011 0.9272919
3 PrePrimaryEducationSchoolAgePopulation Afghanistan 2012 0.9397506
4 PrimaryEducationSchoolAgePopulation Afghanistan 2009 0.9344603
5 SecondarySchoolAgePopulation Afghanistan 2008 0.9301103
(I have this data for every country, and more factors than shown, also)
code that has been most successful so far:
try <- read.table(".../temper.csv", header = TRUE, sep = ",")
remr <- ggplot(try, aes(factor, ratio)) + geom_point()
remr + facet_grid(factor ~ factor)
Graph produced: http://www.flickr.com/photos/94273266#N05/11411186776/