facet_wrap get count of column - r

There's a little problem that I'm not able to solve. In my dataset I have three columns (pluginUserID, type, timestamp) and I want to create a ggplot with facet wrap for every pluginUserID. My dataset looks like this, just with more users.
pluginUserID type timestamp
3 follow 2015-03-23
3 follow 2015-03-27
43 follow 2015-04-28
So in the next step I wanted to create a ggplot with a facet wrap, so my code looks like this.
timeline.plot <- ggplot(
timeline.follow.data,
aes(x=timeline.follow.data$timestamp, y=timeline.follow.data$type)
) + geom_bar(stat = "identity") +
facet_wrap(~timeline.follow.data$pluginUserID) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)
When I view my plot, it looks like this.
As you can see, on the y axis there's no unit to read and that's what I want to do. I want to visualise the number of follows per day and per pluginUser. And on the y axis should be a unit.

Looking at your dataset, I would do one thing before visualizing it: count.
library(dplyr)
timeline.follow.data <- timeline.follow.data %>%
  count(pluginUserID, type, timestamp)
If your data looks like this:
pluginUserID type timestamp
3 follow 2015-03-23
3 follow 2015-03-27
3 follow 2015-03-27
43 follow 2015-04-28
43 follow 2015-04-28
then after the count function it looks like this:
pluginUserID type timestamp n
3 follow 2015-03-23 1
3 follow 2015-03-27 2
43 follow 2015-04-28 2
and so on.
Then use the ggplot function, referring to the columns by name inside aes() and facet_wrap() rather than with $:
timeline.plot <- ggplot(
  timeline.follow.data,
  aes(x = timestamp, y = n)
) + geom_bar(stat = "identity") +
  facet_wrap(~pluginUserID) +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()
  )
n is what you wanted: the number of follows for a given user on a given day. Hope it helped :)
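For anyone who wants to run this end to end, here is a minimal sketch, assuming dplyr and ggplot2 are installed. The data frame is just the five example rows from above typed in by hand, and follow.counts is a name I made up so the raw data is not overwritten. geom_col() is used as a shorthand for geom_bar(stat = "identity").

```r
library(dplyr)
library(ggplot2)

# Example rows from above, typed in by hand (the real data has more users)
timeline.follow.data <- data.frame(
  pluginUserID = c(3, 3, 3, 43, 43),
  type         = "follow",
  timestamp    = as.Date(c("2015-03-23", "2015-03-27", "2015-03-27",
                           "2015-04-28", "2015-04-28"))
)

# One row per user/type/day; count() stores the number of rows in column n
follow.counts <- timeline.follow.data %>%
  count(pluginUserID, type, timestamp)

# Bare column names inside aes() and facet_wrap(), no $
timeline.plot <- ggplot(follow.counts, aes(x = timestamp, y = n)) +
  geom_col() +
  facet_wrap(~ pluginUserID) +
  labs(x = NULL, y = "Follows per day")
```

Calling print(timeline.plot) should then draw one panel per pluginUserID with the daily count on the y axis.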

Related

Plot large panel data in R by category

I have a dataset (df) that looks like this:
EIN Year Cat Fund
1 16 2005 A 9784.490
2 16 2006 A 10020.720
3 16 2007 A 9232.796
4 15 2008 B 8567.893
5 15 2009 B 10292.670
6 17 2010 C 9274.589
The data has relatively large dimensions (around 300k observations), which makes plotting a potentially slow process. I would like to plot the variable Fund for each year, by the identifier EIN. Based on this post I have tried the following code:
library(ggplot2)
ggplot(df, mapping = aes(x = Year, y = Fund)) +
geom_line(aes(linetype = as.factor(EIN)))
Here are my questions:
This code becomes pretty slow given the high amount of observations that I have. Do you suggest any alternatives that could speed up the process?
Since I have a huge number of EINs, the legend ends up taking all the space available for the graph, so I would like to get rid of it, but so far without success. I tried adding + guides(fill=FALSE) at the end, but it did not work. Any advice?
If I wanted to either subset or color code my plot by Cat, what would be the best way to do it?
Thanks a lot for your help!
You can get rid of the legend using:
+ theme(legend.position = 'none')
To subset (facet) your plot, especially if there aren't too many categories, use facet_wrap:
+ facet_wrap(~Cat)
To colour instead, put colour = Cat inside your aes() call.
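A rough sketch putting those pieces together, using the six example rows from the question as stand-in data. As far as I can tell, guides(fill=FALSE) had no effect because the legend comes from the linetype aesthetic, not from fill. Mapping EIN to group instead (my suggestion, not part of the original code) still draws one line per EIN but creates no legend entry for it.

```r
library(ggplot2)

# The six example rows from the question, typed in by hand
df <- data.frame(
  EIN  = c(16, 16, 16, 15, 15, 17),
  Year = c(2005, 2006, 2007, 2008, 2009, 2010),
  Cat  = c("A", "A", "A", "B", "B", "C"),
  Fund = c(9784.490, 10020.720, 9232.796, 8567.893, 10292.670, 9274.589)
)

# One line per EIN via group, coloured by Cat, legend switched off
p_colour <- ggplot(df, aes(x = Year, y = Fund, group = EIN, colour = Cat)) +
  geom_line() +
  theme(legend.position = "none")

# Same lines, but one panel per Cat instead of colours
p_facet <- ggplot(df, aes(x = Year, y = Fund, group = EIN)) +
  geom_line() +
  facet_wrap(~ Cat)
```

If you keep the linetype mapping from the question, theme(legend.position = "none") still hides the legend, since it removes all legends regardless of which aesthetic produced them.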

Chart showing current value and historic value relative to a range

I would like to recreate the following chart in R using ggplot. My data is in a table like the one below where, for each code (A, B, C etc.), I have a current value, a value 12M ago and the respective range (max, min) over the period.
My chart needs to show the current value in red, the value 12M ago in blue and then a line showing the max and min range.
I can produce this painstakingly in Excel using error bars, but I would like to reproduce it in R.
Any ideas on how I can do this using ggplot? Thanks.
Here's what I came up with, but just a note: please if you post your dataset, don't post an image, but instead post the result of dput(your.data.frame). The result of that is easily copy-pasted into the console in order to replicate your dataset, whereas I recreated your data frame manually. :/
A few points first regarding your data as is and the intended plot:
The red and blue hash marks used to indicate 12 months ago and today are not a geom I know of off the top of my head, so I'm using geom_point here to show them (easiest way). You can pick another geom if you wish to show them differently.
The ranges for high and low are already specified by those column names. I'll use those values for the required aesthetics in geom_errorbar.
You can use your data as is to plot and use two separate geom_point calls (one for "today" and one for "12M ago"), but that's going to make creating the legend more difficult than it needs to be, so the better option is to adjust the dataset to support having the legend created automatically. For that, we'll use the gather function from tidyr, being sure to just "gather together" the information in "today" and "12M ago" (my column name for that was different b/c you need to start with a letter in the data frame), but leave alone the columns for "high", "low", and the letters (called "category" in my dataframe).
Where df is the original data frame:
library(tidyr)
df1 <- df %>% gather(time, value, -category, -high, -low)
The new dataframe (df1) looks like this (18 observations total):
category high low time value
1 A 82 28 M12.ago 81
2 B 82 54 M12.ago 80
3 C 80 65 M12.ago 75
4 D 76 34 M12.ago 70
5 E 94 51 M12.ago 93
6 F 72 61 M12.ago 65
where "time" has "M12.ago" or "today".
For the plot, you apply category to x and value to y, and specify ymax and ymin with high and low, respectively for the geom_errorbar:
ggplot(df1, aes(x=category, y=value)) +
geom_errorbar(aes(ymin=low, ymax=high), width=0.2) + ylim(0,100) +
geom_point(aes(color=time), size=2) +
scale_color_manual(values=c('M12.ago'='blue', 'today'='red')) +
theme_bw() + labs(color="") + theme(legend.position='bottom')
Giving you this:
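In case the image does not load, here is a self-contained sketch that should reproduce a chart of this kind. Two assumptions to flag: the high, low and M12.ago numbers are copied from the table above, but the today values are made up for illustration, and I have swapped gather() for pivot_longer(), its newer tidyr equivalent (needs tidyr 1.0.0 or later).

```r
library(tidyr)
library(ggplot2)

# high, low and M12.ago come from the table above;
# the "today" values are invented purely for illustration
df <- data.frame(
  category = c("A", "B", "C", "D", "E", "F"),
  high     = c(82, 82, 80, 76, 94, 72),
  low      = c(28, 54, 65, 34, 51, 61),
  M12.ago  = c(81, 80, 75, 70, 93, 65),
  today    = c(40, 60, 70, 50, 80, 63)
)

# Stack M12.ago and today into a time/value pair; high and low are left alone
df1 <- pivot_longer(df, cols = c(M12.ago, today),
                    names_to = "time", values_to = "value")

p <- ggplot(df1, aes(x = category, y = value)) +
  geom_errorbar(aes(ymin = low, ymax = high), width = 0.2) +
  geom_point(aes(colour = time), size = 2) +
  scale_colour_manual(values = c(M12.ago = "blue", today = "red")) +
  ylim(0, 100) +
  theme_bw() +
  labs(colour = NULL) +
  theme(legend.position = "bottom")
```

Because both points come from one geom_point() call with colour mapped to time, the legend is built automatically.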

ggplot doesn't plot the order of the data.frame

If I have a head(df) like:
feature Comparison Primary diff key
1 work 15.441176 20.588235 5.1470588 1
2 employee 22.794118 19.117647 -3.6764706 2
3 good 11.029412 11.764706 0.7352941 3
4 improve 8.088235 10.294118 2.2058824 4
5 career 2.941176 8.823529 5.8823529 5
6 manager 2.941176 8.823529 5.8823529 6
and I'm trying to plot something with:
p = ggplot(x, aes(x = feature,size=8)) + geom_point(aes(y = Primary)) +
geom_point(aes(y=Comparison)) + coord_flip()
ggplotly(p)
Is there something I'm missing that causes p not to plot the order of the data above? the first five on the plot are
work
train
time
skill
people
But according to the df, it should be work, employee, good, improve, career.
There are these things called "levels" which ggplot uses to determine the order things should appear in the plot. If you ran levels(x$feature) in the console, then I bet the list you see has the same order as what appears in the plot.
To have them show up in the order you want, you just have to override the "levels" for the feature column. List every value that occurs in the column, because any value left out of levels becomes NA.
x$feature = factor(x$feature, levels = c("work",
"employee",
"good",
"improve",
"career",
"manager"))

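If the data frame is already sorted the way you want, you may not need to type the levels by hand. A small base R sketch, assuming the row order of x is the order you want; the data frame here is only the six head() rows from the question, and top_down is a name I made up.

```r
# The six rows shown by head() in the question (only two columns kept)
x <- data.frame(
  feature = c("work", "employee", "good", "improve", "career", "manager"),
  Primary = c(20.588235, 19.117647, 11.764706, 10.294118, 8.823529, 8.823529)
)

# unique() keeps values in order of first appearance,
# so the levels follow the row order of the data frame
x$feature <- factor(x$feature, levels = unique(x$feature))
levels(x$feature)

# With coord_flip() the first level is drawn at the bottom of the axis;
# reversing the levels puts the first row at the top instead
top_down <- factor(x$feature, levels = rev(levels(x$feature)))
```

This also avoids accidentally dropping a value, since every distinct feature ends up in the levels.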
R - For loop only plots data from one filtered value, despite correctly calculating data frames for each filtered value

I'm trying to generate a series of bar charts, one for each of 7 provinces, based off a master data table. However, the software only plots data from one of the provinces -- province 4. When I export to PDF I get 7 of the same bar chart (with different titles).
The data is in this format (abbreviated for clarity)
province travelcat pc_pop
60 1 0 to 4 hours 0.6807
21 1 4 to 8 hours 0.1093
28 2 4 to 8 hours 0.0969
44 2 36 to 48 hours 0.0014
31 3 48 to 72 hours 0.0016
49 3 > 72 hours 0.0007
Weirdly, when I generate a filtered table prov_filter and print that, it shows the data exactly as I'd expect it, specific to each province. Similarly the province title province_number is assigned correctly in the resulting PDF printouts. So the filtering is happening...but the data isn't going into the plot.
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
province_number <- paste("Province",i)
prov_filter <- filter(slim_prov_TCR, province == i)
print(prov_filter)
plot <- ggplot(prov_filter, aes(x = prov_filter$travelcat, y = prov_filter$pc_pop))
+ theme(axis.text.x = element_text(angle=45, hjust=1))
+ scale_y_continuous(limits=c(0,1),labels = scales::percent)
+ ylab("% of provincial population") + xlab("Travel time to nearest medical facility")
+ ggtitle(province_number)
+ stat_summary(fun.y="identity",geom="bar")
filename=paste(province_number,".pdf",sep="")
province_list[[i]] = plot
print(plot)
}
I've done this before using similar code with no problems, but this time I've had serial problems, despite revising the filter code using multiple methods. I'm relatively new to R and statistics land in general so I'm probably mucking up something on the grammar side. Any and all help appreciated.
For reference purposes the final printout code is below
for (i in unique(slim_prov_TCR$province)) { # Another for loop, this time to save out the bar charts in province_list as PDFs
province_number <- paste("Province",i)
filename=paste(province_number,".pdf",sep="") # Make the file name for each PDF. The paste builds the name from the province label, so each chart is named by province
pdf(filename,width=3.5,height=3.5) # PDF basic specifications. Modify the width and height here.
print(province_list[[i]])
dev.off()
}
As highlighted by alistaire and Gregor, two things were confusing R: using $ inside aes() and having the + at the beginning of lines instead of at the end. Fixing both did the trick. The corrected code is below.
province_list=list()
for (i in unique(slim_prov_TCR$province)) {
province_number <- paste("Province",i)
prov_filter <- filter(slim_prov_TCR, province == i)
print(prov_filter)
plot <- ggplot(prov_filter, aes(x = travelcat, y = pc_pop)) +
theme(axis.text.x = element_text(angle=45, hjust=1)) +
scale_y_continuous(limits=c(0,1),labels = scales::percent) +
ylab("% of provincial population") + xlab("Travel time to nearest medical facility") +
ggtitle(province_number) +
stat_summary(fun.y="identity",geom="bar")
filename=paste(province_number,".pdf",sep="")
province_list[[i]] = plot
print(plot)
}
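To see why the leading + matters, here is a tiny base R sketch with plain numbers instead of ggplot layers (a and b are just illustrative names). R stops reading a statement as soon as a line forms a complete expression, so a line that starts with + is treated as a new, separate expression.

```r
# "a <- 1" is already complete, so the statement ends here.
# The "+ 2" on the next line is evaluated on its own and thrown away.
a <- 1
+ 2

# Ending the line with "+" leaves the expression incomplete,
# so R keeps reading and includes the next line.
b <- 1 +
  2

print(c(a = a, b = b))
```

In the original loop the same thing happened: plot only ever held the bare ggplot(...) call, and none of the lines starting with + were added to it.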

How to make an interactive graph in R-studio

The data has 4 columns and roughly 600 rows. The data is twitter data collected using the twitteR package, and then summarized into a data frame. The summary is based on how many words from these libraries each tweet has, the tweets are given a score and then the summary is the number of tweets which get specific scores. So the columns are the two types of scores, the dates, and then the number of tweets with those scores.
Score1 Score2 Date Number
0 0 01/10/2015 50
0 1 01/10/2015 34
1 0 01/10/2015 10
...and so on
With dates and data that extend over a month, and the scores either way can go +/- 10 or so.
I'm trying to plot that kind of data using a bubble plot, with score1 on the x axis and score2 on the y axis, and the size of the bubble dependent on the number (how many tweets with those scores there were per day).
My problem is that I only know how to use ggplot.
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""), guide=FALSE) +
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0, 10)) +
scale_y_continuous(name="score2", limits=c(-10, 10)) +
geom_text(size=4) +
theme_bw()
and that just gives me the plot for all dates, and what I need is a good way to see how that data changes over time. I've looked into using sliders and selectors but I really have no idea what would be the best tool to use. I've tried subsetting the data based on date, which works nicely but ideally I could make some kind of interactive graph.
I really need some way to select certain days out of that data to plot so it doesn't all pile up on itself, but do it interactively so it can be presented.
Any help would be greatly appreciated, thank you.
It sounds like this won't completely satisfy your use case, but an extremely low-overhead way to add some interactivity to your plot would be to install.packages('plotly') and add the following line to your code:
# your original code
g <- ggplot(
twitterdata,
aes(x=score1, y=score2, size=number, label=""),
guide=FALSE)+
geom_point(colour="black", fill="red", shape=21) +
scale_size_area(max_size = 30) +
scale_x_continuous(name="score1", limits=c(0,10)) +
scale_y_continuous(name="score2", limits=c(-10,10)) +
geom_text(size=4) +
theme_bw()
# add this, after loading plotly
library(plotly)
gg <- ggplotly(g)
Details and demos: https://plot.ly/ggplot2/
As Eric suggested, if you want sliders and such you should check out shiny. Here's a demo combining shiny with plotly: https://plot.ly/r/shiny-tutorial/
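For the shiny route, a bare-bones sketch of a date selector driving the bubble plot might look like the following. Treat it as a starting point under some assumptions: the data frame is a stand-in (first three rows from the question, the rest made up), the dates are kept as plain strings, and bubbles_for_day, ui, server and app are names I chose. It needs the shiny and ggplot2 packages.

```r
library(shiny)
library(ggplot2)

# Stand-in data: first three rows are from the question, the last two are made up
twitterdata <- data.frame(
  Score1 = c(0, 0, 1, 0, 2),
  Score2 = c(0, 1, 0, -1, 1),
  Date   = c("01/10/2015", "01/10/2015", "01/10/2015",
             "02/10/2015", "02/10/2015"),
  Number = c(50, 34, 10, 41, 7)
)

# Plain helper so the filtering can be tried without launching the app
bubbles_for_day <- function(data, day) {
  data[data$Date == day, ]
}

ui <- fluidPage(
  selectInput("day", "Date", choices = unique(twitterdata$Date)),
  plotOutput("bubbles")
)

server <- function(input, output) {
  output$bubbles <- renderPlot({
    ggplot(bubbles_for_day(twitterdata, input$day),
           aes(x = Score1, y = Score2, size = Number)) +
      geom_point(colour = "black", fill = "red", shape = 21) +
      # Fixed limits keep bubble sizes comparable from one day to the next
      scale_size_area(max_size = 30,
                      limits = c(0, max(twitterdata$Number))) +
      scale_x_continuous(limits = c(-10, 10)) +
      scale_y_continuous(limits = c(-10, 10)) +
      theme_bw()
  })
}

app <- shinyApp(ui, server)

# Only launch the app when run interactively (e.g. from the RStudio console)
if (interactive()) runApp(app)
```

Swapping selectInput for sliderInput would give a slider instead of a drop-down, but the dates would first have to be converted with as.Date().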

Resources