Plot large panel data in R by category

I have a dataset (df) that looks like this:
  EIN Year Cat      Fund
1  16 2005   A  9784.490
2  16 2006   A 10020.720
3  16 2007   A  9232.796
4  15 2008   B  8567.893
5  15 2009   B 10292.670
6  17 2010   C  9274.589
The data has relatively large dimensions (around 300k observations), which makes plotting a potentially slow process. I would like to plot the variable Fund for each year, by the identifier EIN. Based on this post I have tried the following code:
library(ggplot2)
ggplot(df, mapping = aes(x = Year, y = Fund)) +
geom_line(aes(linetype = as.factor(EIN)))
Here are my questions:
This code becomes pretty slow given the large number of observations that I have. Can you suggest any alternatives that could speed up the process?
Since I have a huge number of EINs, the legend ends up taking all the space available for the graph, so I would like to get rid of it, but so far without success. I tried adding + guides(fill=FALSE) at the end, but it did not work. Any advice?
If I wanted to either subset or color code my plot by Cat, what would be the best way to do it?
Thanks a lot for your help!

The legend comes from the linetype aesthetic, not fill, which is why guides(fill=FALSE) had no effect. You can get rid of the legend using:
+ theme(legend.position = 'none')
To subset (facet) your plot, especially if there aren't too many categories, use facet_wrap:
+ facet_wrap(~Cat)
To colour instead, put colour = Cat inside your aes() call.
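Putting these pieces together, a minimal sketch might look like the following. The small data frame is a hypothetical stand-in for the asker's df (the extra 2009 row for EIN 17 is made up), and mapping EIN to group rather than linetype is an assumption of mine, not part of the answer above: group draws one line per EIN without creating a legend entry for each one.

```r
library(ggplot2)

# Hypothetical stand-in for the ~300k-row df in the question
df <- data.frame(
  EIN  = c(16, 16, 16, 15, 15, 17, 17),
  Year = c(2005, 2006, 2007, 2008, 2009, 2009, 2010),
  Cat  = c("A", "A", "A", "B", "B", "C", "C"),
  Fund = c(9784.490, 10020.720, 9232.796, 8567.893, 10292.670,
           9100.000, 9274.589)
)

# One line per EIN (group), coloured by Cat, one panel per Cat, no legend
p <- ggplot(df, aes(x = Year, y = Fund, group = EIN, colour = Cat)) +
  geom_line() +
  facet_wrap(~Cat) +
  theme(legend.position = "none")

# Call print(p) to draw the plot
```

Whether this is noticeably faster on the full data is something you would have to time yourself.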

Related

Stacked barplot histogram in R

I would like to make a histogram for my data but I would also like to visualize it in such a way that each category is coloured differently but stacked together.
This is what I'm trying to achieve: Stacked histogram from already summarized counts using ggplot2
but I'm unsure how to do it for my data set and my R skills are very much on the rusty side.
My data is formatted like this
Name Category Age Year
1 A 3 2017
2 B 6 2016
3 B 12 2017
4 B 8 2017
I'm only interested in Category B so I made a subset called catB. I would like the histogram to graph the frequency of the different ages, and I would like to colour the stacks based on year (in my data there are 5 year options).
I would appreciate any help! Thank you!
# Year must be discrete (a factor) for the bars to stack by colour
ggplot(catB, aes(x = Age, fill = factor(Year))) +
geom_histogram()
Another option is to add an explicit count column and plot it as stacked bars. In the example given each row counts once, so count = 1; check what the count should be in your real data:
catB <- cbind(catB, count=1)
ggplot(catB, aes(x=Age, y=count)) + geom_col(aes(fill=factor(Year)))
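If you prefer to work from already summarized counts, as in the linked question, one possible sketch is below. The catB rows are hypothetical, and using table() to do the counting is my assumption rather than something from the answers above.

```r
library(ggplot2)

# Hypothetical Category B subset
catB <- data.frame(
  Age  = c(6, 12, 8, 6, 8, 8),
  Year = c(2016, 2017, 2017, 2017, 2016, 2017)
)

# Count rows for every Age/Year combination (columns: Age, Year, Freq)
counts <- as.data.frame(table(Age = catB$Age, Year = catB$Year))

# table() returns factors; convert Age back to numbers for a numeric axis
counts$Age <- as.numeric(as.character(counts$Age))

# Year is already a factor here, so the bars stack by colour
p <- ggplot(counts, aes(x = Age, y = Freq, fill = Year)) +
  geom_col()
```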

Plotting stacked histogram with log scale in ggplot2

Note: I found a similar question, for which there was an answer explaining the problem. However, I'm looking for an answer, as opposed to a reason why it's difficult (which I fully understand).
I have data for which I want to create a histogram. This data has a count of 10000 for the bin [0, 200) and a count of 1 for several bins such as [30000, 30200). Both bins are important and need to be visible. For this, I can perform a histogram with the log1p scale.
contig_len <- read.table(data_file, header = FALSE, sep = ",", col.names=c("Length"))
ggplot(contig_len, aes(x = Length)) + geom_histogram(binwidth=200) +
scale_y_continuous(trans="log1p")
This works perfectly! But now, I want to categorise the items in the histogram, as follows:
ggplot(contig_len, aes(x = Length, fill = Prevalence)) +
geom_histogram(binwidth=200, alpha=0.5, position="stack") +
scale_y_continuous(trans = "log1p")
This doesn't work, however, as the stacking is performed without taking the log scale into account. Has anyone found a way around this problem? My data looks like this:
head(contig_len)
  Length      Prevalence
1    606 Repetitive (<5)
2    888  Non-Repetitive
3    192 Repetitive (<9)
4   9830  Non-Repetitive
5    506  Non-Repetitive
6    850  Non-Repetitive

ggplot doesn't plot the order of the data.frame

If I have a head(df) like:
   feature Comparison   Primary       diff key
1     work  15.441176 20.588235  5.1470588   1
2 employee  22.794118 19.117647 -3.6764706   2
3     good  11.029412 11.764706  0.7352941   3
4  improve   8.088235 10.294118  2.2058824   4
5   career   2.941176  8.823529  5.8823529   5
6  manager   2.941176  8.823529  5.8823529   6
and I'm trying to plot something with:
p = ggplot(x, aes(x = feature,size=8)) + geom_point(aes(y = Primary)) +
geom_point(aes(y=Comparison)) + coord_flip()
ggplotly(p)
Is there something I'm missing that causes p not to plot in the order of the data above? The first five on the plot are
work
train
time
skill
people
But according to the df, it should be work, employee, good, improve, career.
ggplot uses a factor's "levels" to decide the order in which categories appear in the plot. A character column is treated as if its levels were sorted alphabetically, so if you run levels(factor(x$feature)) in the console, I bet the list you see has the same order as what appears in the plot.
To have them show up in the order you want, override the levels of the feature column. Taking the levels from the data itself keeps the row order, and no value turns into NA because it was left out of a hand-typed list:
x$feature = factor(x$feature, levels = unique(x$feature))
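To see the effect of the levels without plotting anything, here is a small sketch. The data frame is a hypothetical copy of the first six rows from the question, and the order-of-appearance trick with unique() is an assumption about what the asker wants (row order preserved).

```r
# Hypothetical data in the same shape as the question
x <- data.frame(
  feature = c("work", "employee", "good", "improve", "career", "manager"),
  Primary = c(20.588235, 19.117647, 11.764706, 10.294118, 8.823529, 8.823529),
  stringsAsFactors = FALSE
)

# Default behaviour: levels are sorted alphabetically
default_levels <- levels(factor(x$feature))

# Levels in order of first appearance keep the data frame's row order
x$feature <- factor(x$feature, levels = unique(x$feature))
ordered_levels <- levels(x$feature)
```

Note that with coord_flip() the first level is drawn at the bottom of the axis, so you may want rev(unique(x$feature)) if the first row should appear at the top.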

R - For loop only plots data from one filtered value, despite correctly calculating data frames for each filtered value

I'm trying to generate a series of bar charts, one for each of 7 provinces, based off a master data table. However, the software only plots data from one of the provinces -- province 4. When I export to PDF I get 7 of the same bar chart (with different titles).
The data is in this format (abbreviated for clarity)
   province      travelcat pc_pop
60        1   0 to 4 hours 0.6807
21        1   4 to 8 hours 0.1093
28        2   4 to 8 hours 0.0969
44        2 36 to 48 hours 0.0014
31        3 48 to 72 hours 0.0016
49        3     > 72 hours 0.0007
Weirdly, when I generate a filtered table prov_filter and print that, it shows the data exactly as I'd expect it, specific to each province. Similarly the province title province_number is assigned correctly in the resulting PDF printouts. So the filtering is happening...but the data isn't going into the plot.
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
province_number <- paste("Province",i)
prov_filter <- filter(slim_prov_TCR, province == i)
print(prov_filter)
plot <- ggplot(prov_filter, aes(x = prov_filter$travelcat, y = prov_filter$pc_pop))
+ theme(axis.text.x = element_text(angle=45, hjust=1))
+ scale_y_continuous(limits=c(0,1),labels = scales::percent)
+ ylab("% of provincial population") + xlab("Travel time to nearest medical facility")
+ ggtitle(province_number)
+ stat_summary(fun.y="identity",geom="bar")
filename=paste(province_number,".pdf",sep="")
province_list[[i]] = plot
print(plot)
}
I've done this before using similar code with no problems, but this time I've had serial problems, despite revising the filter code using multiple methods. I'm relatively new to R and statistics land in general so I'm probably mucking up something on the grammar side. Any and all help appreciated.
For reference purposes the final printout code is below
for (i in unique(slim_prov_TCR$province)) { # Another for loop, this time to save out the bar charts in province_list as PDFs
province_number <- paste("Province",i)
filename=paste(province_number,".pdf",sep="") # Make the file name for each PDF. The paste builds the name from the province, so each chart is named by province
pdf(filename,width=3.5,height=3.5) # PDF basic specifications. Modify the width and height here.
print(province_list[[i]])
dev.off()
}
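As highlighted by alistaire and Gregor, two things were confusing R: using $ inside aes() and having the + at the beginning of lines. aes() stores its expressions and evaluates them only when the plot is drawn, so prov_filter$travelcat was most likely looked up after the loop had finished, when prov_filter held a single province. A line that starts with + is parsed as a new statement, so those layers were never added to the plot. Fixing these two points did the trick. See the corrected code below.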
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
  province_number <- paste("Province",i)
  prov_filter <- filter(slim_prov_TCR, province == i)
  print(prov_filter)
  # Bare column names in aes(); each line ends with + so the chain continues
  plot <- ggplot(prov_filter, aes(x = travelcat, y = pc_pop)) +
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    scale_y_continuous(limits=c(0,1),labels = scales::percent) +
    ylab("% of provincial population") + xlab("Travel time to nearest medical facility") +
    ggtitle(province_number) +
    # fun.y was deprecated in ggplot2 3.3.0 in favour of fun
    stat_summary(fun="identity",geom="bar")
  filename=paste(province_number,".pdf",sep="")
  province_list[[i]] = plot
  print(plot)
}
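As an alternative to the for loop, here is a hedged sketch that builds the same kind of list with split() and lapply(). The data frame is a hypothetical stand-in for slim_prov_TCR, and geom_col() is swapped in for the stat_summary() call on the assumption that there is one row per travel category within each province.

```r
library(ggplot2)

# Hypothetical stand-in for slim_prov_TCR
slim_prov_TCR <- data.frame(
  province  = c(1, 1, 2, 2, 3, 3),
  travelcat = c("0 to 4 hours", "4 to 8 hours", "4 to 8 hours",
                "36 to 48 hours", "48 to 72 hours", "> 72 hours"),
  pc_pop    = c(0.6807, 0.1093, 0.0969, 0.0014, 0.0016, 0.0007)
)

# split() gives one data frame per province
by_province <- split(slim_prov_TCR, slim_prov_TCR$province)

# Build one plot per province; columns are referenced by bare name,
# so each plot uses the data frame it was given
province_list <- lapply(names(by_province), function(prov) {
  ggplot(by_province[[prov]], aes(x = travelcat, y = pc_pop)) +
    geom_col() +
    ggtitle(paste("Province", prov))
})
names(province_list) <- names(by_province)
```

Because the list is named by province, province_list[["2"]] retrieves the chart for province 2 regardless of how the provinces are numbered.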

How do you plot two vectors on x-axis and another on y-axis in ggplot2

I am trying to plot two vectors with different values, but equal length on the same graph as follows:
a<-23.33:52.33
b<-33.33:62.33
days<-1:30
df<-data.frame(a,b,days)
a b days
1 23.33 33.33 1
2 24.33 34.33 2
3 25.33 35.33 3
4 26.33 36.33 4
5 27.33 37.33 5
etc..
I am trying to use ggplot2 to plot a and b on the x-axis and the days on the y-axis. However, I can't figure out how to do it. I am able to plot them individually and combine the graphs, but I want just one graph with both a and b vectors (different colors) on the x-axis and number of days on the y-axis.
What I have so far:
X<-ggplot(df, aes(x=a,y=days)) + geom_line(color="red")
Y<-ggplot(df, aes(x=b,y=days)) + geom_line(color="blue")
Is there any way to define the x-axis for both a and b vectors? I have also tried using the melt long function, but got stuck afterwards.
Any help is much appreciated. Thank you
I think the best way to do it is by melting the data (as you mentioned), especially if you are going to add more vectors. This is the code:
library(reshape2)
library(ggplot2)
a<-23:52
b<-33:62
days<-1:30
df<-data.frame(x=a,y=b,days)
df_molten=melt(df,id.vars="days")
ggplot(df_molten) + geom_line(aes(x=value,y=days,color=variable))
You can also change the colors manually via scale_color_manual.
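A minimal sketch of the manual colour scale is below. To keep it free of extra packages, the long format is built with base R instead of reshape2::melt; that substitution, and the choice of red for a and blue for b (taken from the question), are assumptions for illustration.

```r
library(ggplot2)

a <- 23:52
b <- 33:62
days <- 1:30

# Long format built with base R (stands in for reshape2::melt here)
df_long <- data.frame(
  days     = rep(days, times = 2),
  variable = rep(c("a", "b"), each = length(days)),
  value    = c(a, b)
)

# Named vector maps each level of `variable` to a colour
p <- ggplot(df_long, aes(x = value, y = days, color = variable)) +
  geom_line() +
  scale_color_manual(values = c(a = "red", b = "blue"))
```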
A simpler solution is to use only ggplot2, with one geom_line() per vector. The following code will work in your case; it uses bare column names inside aes() and puts a and b on the x-axis, as you asked:
a<-23.33:52.33
b<-33.33:62.33
days<-1:30
df<-data.frame(a,b,days)
ggplot(data = df)+
geom_line(aes(x = a,y = days), color = "blue")+
geom_line(aes(x = b,y = days), color = "red")
I added the colors, you might want to use them to differentiate between your variables.
