Subset data frame based on top N most frequent values in variable - r

My objective is to create a simple density or barplot of a long dataframe which shows the relative frequency of nationalities in a course (MOOC). I just don't want all of the nationalities in there, just the top 10. I created this example df below + the ggplot2 code I use for plotting.
d=data.frame(course=sample(LETTERS[1:5], 500,replace=T),nationality=as.factor(sample(1:172,500,replace=T)))
mm <- ggplot(d, aes(x=nationality, colour=factor(course)))
mm + geom_bar() + theme_classic()
...but as said: I want a subset of the entire dataset based on frequency. The above shows all data.
PS. I added the ggplot2 code for context but also because maybe there is something within ggplot2 itself that would make this possible (I doubt it however).
EDIT 2014-12-11:
The current answers use ddplyr or table methods to arrive at the desired subset, but I wonder if there is not a more direct way to achieve the same.. I will let it stay for now, see if there are other ways.

Using dplyr functions count and top_n to get top-10 nationalities. Because top_n accounts for ties, the number of nationalities included in this example are more than 10 due to ties. arrange the counts, use factor and levels to set nationalities in descending order.
# top-10 nationalities
d2 <- d %>%
count(nationality) %>%
top_n(10) %>%
arrange(n, nationality) %>%
mutate(nationality = factor(nationality, levels = unique(nationality)))
d %>%
filter(nationality %in% d2$nationality) %>%
mutate(nationality = factor(nationality, levels = levels(d2$nationality))) %>%
ggplot(aes(x = nationality, fill = course)) +
geom_bar()

Here's an approach to select the top 10 nationalities. Note that multiple nationalities share the same frequency. Therefore, selecting the top 10 results in omitting some nationalities with the same frequency.
# calculate frequencies
tab <- table(d$nationality)
# sort
tab_s <- sort(tab)
# extract 10 most frequent nationalities
top10 <- tail(names(tab_s), 10)
# subset of data frame
d_s <- subset(d, nationality %in% top10)
# order factor levels
d_s$nationality <- factor(d_s$nationality, levels = rev(top10))
# plot
ggplot(d_s, aes(x = nationality, fill = as.factor(course))) +
geom_bar() +
theme_classic()
Note that I changed colour to fill since colour affects the colour of the border.

although the questions was raised some time ago, I propose two other solutions for the sake of completeness:
d_raw <- data.frame(
course = sample(LETTERS[1:5], 500, replace = T),
nationality = as.factor(sample(1:172, 500, replace=T))
)
One using fct_lump_n() from the forcats package and filter()
d1 <- d_raw %>%
mutate(nationality = fct_lump_n(
f = nationality,
n = 10,
ties.method = "first"
)) %>%
filter(nationality != "Other")
d1 %>% count(nationality, sort = TRUE)
ggplot(d1, aes(x = nationality, fill = course)) + # factor() is not needed here.
geom_bar() +
theme_classic()
fct_lump_n() summarises all nationalities except for the 10 most frequent ones to category "Other". Note that in fct_lump_n() argument ties.method = "first" is needed to really get only the first ten nationalities, not 11 or 12. All other nationalities are labeled "Other" even though they may appear just as often as the first ten nationalities.
Levels of nationality are only ordered alphabetically.
Another solution is using fct_infreq() from the forcats package, cur_group_id() and filter().
d2 <- d_raw %>%
group_by(nationality = fct_infreq(nationality)) %>%
filter(cur_group_id() <= 10) %>%
ungroup()
d2 %>% count(nationality, sort = TRUE)
ggplot(d2, aes(x = nationality, fill = course)) + # factor() is not needed here.
geom_bar() +
theme_classic()
cur_group_id() assigns a group ID to every nationality. To get started with the most frequent nationality we first need to order column nationality by its frequencies. Then we filter for the first ten group IDs aka the ten most frequent nationalities.
Levels of nationality are first ordered by n, then ordered alphabetically.
I used count() to verify the two data frames d1 and d2 look the same.
Both solutions have the advantage, that we don't need a second (temporary) data frame or temporary vectors.
I hope this helps someone in the future.

Related

How to aggregate count by grouped rows of multiple columns inside pipes?

I want to get the head of the count of grouped rows by multiple columns in ascending order for a plot.
I found some answers on the internet but nothing seems to work when I try to merge it with arrange and pipes.
df_Cleaned %>%
head(arrange(aggregate(df_Cleaned$Distance,
by = list(df_Cleaned$start_station_id, df_Cleaned$end_station_id),
FUN = nrow)))) %>%
ggplot(mapping = aes(x = ride_id, color = member_casual)) +
geom_bar()
it seems to have problems with df_Cleaned$ since it's required in front of each column.
I hope I understood your meaning correctly. If you want to group your data by the columns Distance, start_station_id, and end_station_id and then count how many values there are under each group and then take only the head of those values, then maybe the following code will help using tidyverse:
df_Cleaned %>%
group_by(Distance, start_station_id, end_station_id) %>%
count() %>%
head() %>%
In addition, it seems like you you are later trying to plot using a variable you did not group by, so either you add it to your group_by or choose a different variable to plot by.
We may use add_count to create a count column by 'start_station_id' and 'end_station_id', and sort it, then filter the first 6 unique values (head ) or last 6 (tail) of 'n' and plot on the subset of the data
library(dplyr)
library(ggplot2)
df_Cleaned %>%
add_count(start_station_id, end_station_id, sort = TRUE) %>%
filter(n %in% head(unique(n), 6)) %>%
ggplot(mapping = aes(x = ride_id, color = member_casual)) +
geom_bar()

R Function to Create Custom Data Frames from Larger Data Frame

Ok, So I found somewhat similar questions asked of this already, but I'm not quite getting it. So, here is my example. I have a very large table of data that has a basic setup like the small example data below. I will try to explain very clearly what I am wanting to do. I'm guessing maybe it's easier to do than I think, but I'm not really good at creating functions or for-loops at this point, and I'm guessing that's what I need. So here is the basic setup for my data.
test_year <- c(2019,2019,2019,2020,2020,2020,2021,2021,2021)
SN <- c(1001,1002,1003,1004,1005,1006,1007,1008,1009)
Owner <- c("Adam","Bob","Bob","Carl","Adam","Bob","Adam","Carl","Adam")
ObsA <- c(0,0,1,1,0,1,1,NA,1)
ObsB <- c(1,1,1,0,0,0,0,0,1)
ObsC <- c(0,0,0,0,1,1,0,0,0)
df <- data.frame(test_year, SN, Owner, ObsA, ObsB, ObsC)
From this, I need to be able to create smaller data frames by selecting individual observation columns. So if this were a small data set:
df_A <- df %>% select(test_year, SN, Owner, ObsA)
and then have a data frame for each of the other observations. And yes, it is easier to select the columns that I want versus the columns I don't want as most of the columns selected will be standard, and I just need to change which observation is picked out of over 40 in my real data.
From these smaller data frames, I will be doing numerous other operations including making multiple tables and graphs. As examples, the following are similar to the types of graphs I will make (with some additional formatting that is simple enough). Notice too in these graphs a title that is based on (though not identical to), the column selected.
df_A[is.na(df_A)] = 0
df_A
df_A %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = test_year, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_point()
df_A %>% group_by(Owner) %>%
summarize(n = n(), obs = sum(ObsA)) %>%
ggplot(aes(x = Owner, y = 100*obs/n)) +
ggtitle("Observation A") +
geom_bar(stat = "identity") +
coord_flip() +
scale_x_discrete()
As I said, additional analysis will also need to be done. So, I'm needing help figuring out how I can structure a function to do what it is I'm wanting to do. Thanks!
Here is a way to return a list of plots.
Split all the 'Obs' columns in a list of dataframes, use imap to pass dataframe along with the column name (to use it as title).
library(tidyverse)
common_cols <- 1:3
df[is.na(df)] = 0
list_plots <- df %>%
select(starts_with('Obs')) %>%
split.default(names(.)) %>%
imap(~{
tmp <- df[common_cols] %>% bind_cols(.x)
tmp %>% group_by(test_year) %>%
summarize(n = n(), obs = sum(.data[[.y]])) %>%
ggplot(aes(x = factor(test_year), y = 100*obs/n)) +
geom_point() +
labs(x = 'Year', y = 'ratio', title = .y)
})
Individual plots can be accessed by list_plots[[1]],list_plots[[2]] etc.

grouping & plotting by textual column value

I've got a (very) basic level of competency with R when working with numbers, but when it comes to manipulating data based on text values in columns I'm stuck. For example, if I want to plot meal frequency vs. day of week (is Tuesday really for tacos?) using the following data frame, how would I do that? I've seen suggestions of tapply, aggregate, colSums, and others, but those have all been for slightly different scenarios and nothing gives me what I'm looking for. Should I be looking at something other than R for this problem? My end goal is a graph with day of week on the X-axis, count on the Y-axis, and a line plot for each meal.
df <- data.frame(meal= c("tacos","spaghetti","burgers","tacos","spaghetti",
"spaghetti"), day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
This is as close as I've gotten, and, to be honest, I don't fully understand what it's doing:
tapply(df$day, df$meal, FUN = function(x) length(x))
It will summarize the meal counts, but a) it doesn't have column names (my understanding is that's due to tapply returning a vector), and b) it doesn't keep an association with the day of the week.
Edit: The melt() suggestion below works for this dataset, but it won't scale to the size I need. I was, however, able to get a working graph from the dataframe produced by the melt. If anybody runs across this in the future, try:
ggplot(new, aes(day, value, group=meal, col=meal)) +
geom_line() + geom_point() + scale_y_continuous(breaks = function(x)
unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
(The part after geom_point() is to force the Y-axis to only be integers, which is what makes sense in this case.)
I tried to cut this into smaller pieces so you can understand whats going on
library(tidyverse)
# defining the dataframes
df <- data.frame(meal = c("tacos","spaghetti","burgers","tacos","spaghetti","spaghetti"),
day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
# define a vector of days of week ( will be useful to display x axis in the correct order)
ordered_days =c("sunday","monday","tuesday","wednesday",
"thursday","friday",'saturday')
# count the number of meals per day of week
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
# a lot of combinations are missing, for example no burgers on monday
# so i am creating all combinations with count 0
fill_0 <- expand.grid(
meal=factor(unique(df$meal)),
day=factor(ordered_days),
n=0)
# append this fill_0 to df_count
# as some combinations already exist, group by again and sum n
# so only one row per (meal,day) combination
df_count <- rbind(df_count,fill_0) %>%
group_by(meal,day) %>%
summarise(n=sum(n)) %>%
mutate(day=factor(day,ordered=TRUE,
ordered_days))
# plot this by grouping by meal
ggplot(df_count,aes(x=day,y=n,group=meal,col=meal)) + geom_line()
The magic is here, courtesy of #fmarm:
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
The fill_0 and rbind bits also in the sample provided by #fmarm are necessary to keep from bombing out on unspecified combinations, but it's the line above that handles summing meals by day.

scatter plot against all groups for a long data frame

I am pretty sure something like this is already asked but I don't know how to search for it.
I often get data in a wide format like in my little example with 3 experiments (a-c). I normally convert to long format and convert the values by some function (here log2 as an example).
What I often want to do is to plot all experiments against each other and here I am looking for a handy solution. How can I convert my data frame to get facets for example with a~b, a~c and b~c...
So far I tidy::spread the data again and execute 3 times a ggplot command with the individual column names as x and y. Later I merge the individual graphs together.
Is there a more convenient way?
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
names=letters,
a=1:26,
b=1:13,
c=11:36
)
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value))
EDIT
Since I got a very useful answer from #hdkrgr I adapted a bit my code. The inner_join was a great trick which I can implement to automate my idea, what I still miss is a clever filter to get rid of the redundant data, since I don't want to plot c~c or b~a if I already plot a~b.
I solved this now by providing the pairings I want to do, but can anyone think ob a straight forward solution? I couldn't think of something which gives me the unique pairing.
my_pairs <- c('a vs. b', 'a vs. c', 'b vs. c')
df %>%
as_tibble() %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>%
inner_join(., ., by=c("names")) %>%
mutate(pairing=sprintf('%s vs. %s', experiment.x, experiment.y)) %>%
filter(pairing %in% my_pairs) %>%
ggplot(aes(log2.value.x, log2.value.y)) +
geom_point() +
facet_wrap( ~ pairing, labeller=label_both)
One way starting from long format would be to do a self-join on the long-data in order to get all combinations of two experiments in each row:
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>%
inner_join(., ., by=c("names")) %>%
ggplot(aes(log2.value.x, log2.value.y)) + geom_point() + facet_grid(experiment.y ~ experiment.x)
Edit: To avoid plotting redundant experiment-pairs, you can do:
df %>%
tidyr::gather(experiment, value, -names) %>%
mutate(log2.value=log2(value)) %>% inner_join(., ., by=c("names")) %>%
filter(experiment.x < experiment.y) %>%
ggplot(aes(log2.value.x, log2.value.y)) + geom_point() + facet_wrap(~experiment.y + experiment.x)
This is really interesting because it's actually more complex than it first seems. One thing that sticks out is getting unique pairs of experiments—it seems like you'd want a vs b but not necessarily b vs a as well. To do that, you need the unique set of experiment pairs.
Initially, I tried to work from your gathered data, but realized it might be simpler to start from the wide version. Take the names of the experiments from the column names—you can do this multiple ways, but I just took the strings that aren't "names"—and get the combinations of them. I pasted them together to make them a little easier to work with.
library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(
names=letters,
a=1:26,
b=1:13,
c=11:36
) %>%
as_tibble()
exp <- stringr::str_subset(names(df), "names", negate = T)
pairs <- combn(exp, 2, paste, simplify = F, collapse = ",") %>%
unlist()
pairs
#> [1] "a,b" "a,c" "b,c"
Then, for each pair, extract the associated column names, do a little tidyeval to select those columns, do the log2 transform that you had. I had to detour here to rename the columns with something I could refer back to—I think this isn't necessary, but I couldn't get my tidyeval working inside the ggplot aes. Someone else might have an idea on that. Then make your plot, and label the axes and title accordingly. That leaves you with a list of 3 plots.
plots <- purrr::map(pairs, function(pair) {
cols <- strsplit(pair, split = ",", fixed = T)[[1]]
df %>%
select(names, !!cols[1], !!cols[2]) %>%
mutate_at(vars(-names), log2) %>%
rename(exp1 = !!cols[1], exp2 = !!cols[2]) %>%
ggplot(aes(x = exp1, y = exp2)) +
geom_point() +
labs(x = cols[1], y = cols[2], title = pair)
})
Use your method of choice to put the plots together however you want. I went with cowplot, but I also like the patchwork package.
cowplot::plot_grid(plotlist = plots, nrow = 1)
This is probably not what you want, but if the purpose is to explore the correlation pattern between each variable, you may want to consider ggpairs from the GGally package. It provides not only scatter plots, but also correlation score and distribution.
library(GGally)
ggpairs(df[, c("a", "b", "c")])
You could start from creating all combinations via combnand then work your way through:
library(purrr)
t(combn(names(df)[-1], 2)) %>% ## get all combinations
as.data.frame(stringsAsFactors = FALSE) %>%
mutate(l = paste(V1, V2, sep = " vs. ")) %>%
pmap_dfr(function(V1, V2, l)
df %>%
select(one_of(c(V1, V2))) %>% ## select the elements given by the combination
mutate_all(log2) %>%
setNames(c("x", "y")) %>%
mutate(experiment = l)) %>%
ggplot(aes(x, y)) + geom_point() + facet_wrap(~experiment)

How to use ggplot to group and show top X categories?

I am trying to use use ggplot to plot production data by company and use the color of the point to designate year. The follwoing chart shows a example based on sample data:
However, often times my real data has 50-60 different comapnies wich makes the Company names on the Y axis to be tiglhtly grouped and not very asteticly pleaseing.
What is th easiest way to show data for only the top 5 companies information (ranked by 2011 quanties) and then show the rest aggregated and shown as "Other"?
Below is some sample data and the code I have used to create the sample chart:
# create some sample data
c=c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")
q=c(1,2,3,4,5,6,7,8,9,10)
y=c(2010)
df1=data.frame(Company=c, Quantity=q, Year=y)
q=c(3,4,7,8,5,14,7,13,2,1)
y=c(2011)
df2=data.frame(Company=c, Quantity=q, Year=y)
df=rbind(df1, df2)
# create plot
p=ggplot(data=df,aes(Quantity,Company))+
geom_point(aes(color=factor(Year)),size=4)
p
I started down the path of a brute force approach but thought there is probably a simple and elegent way to do this that I should learn. Any assistance would be greatly appreciated.
What about this:
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
ggplot (data = subset (df, Company %in% companies [1 : 5]),
aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
BTW: in order for the code to be called elegant, spend a few more spaces, they aren't that expensive...
See if this is what you want. It takes your df dataframe, and some of the ideas already suggested by #cbeleites. The steps are:
1.Select 2011 data and order the companies from highest to lowest on Quantity.
2.Split df into two bits: dftop which contians the data for the top 5; and dfother, which contains the aggregated data for the other companies (using ddply() from the plyr package).
3.Put the two dataframes together to give dfnew.
4.Set the order for which levels of Company are plotted: Top to bottom is highest to lowest, then "Other". The order is partly given by companies, plus "Other".
5.Plot as before.
library(ggplot2)
library(plyr)
# Step 1
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
# Step 2
dftop = subset(df, Company %in% companies [1:5])
dftop$Company = droplevels(dftop$Company)
dfother = ddply(subset(df, !(Company %in% companies [1:5])), .(Year), summarise, Quantity = sum(Quantity))
dfother$Company = "Other"
# Step 3
dfnew = rbind(dftop, dfother)
# Step 4
dfnew$Company = factor(dfnew$Company, levels = c("Other", rev(as.character(companies)[1:5])))
levels(dfnew$Company) # Check that the levels are in the correct order
# Step 5
p = ggplot (data = dfnew, aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
p
The code produces:

Resources