Sorting histogram plots within facet_wrap by skew - r

I have about 1K observations for each country, and I have used facet_wrap to display each country's geom_bar, but the panels come out in alphabetical order. I would like to cluster or order them by skew (the most positive-skew countries together, moving towards the normally distributed countries, then the negative-skew countries, ending with the most negatively skewed) without eyeballing which countries are similar to each other. I was thinking psych::describe() might be useful since it calculates skew, but I am having a hard time figuring out how to bring that information into the approach from a similar question.
Any suggestions would be helpful.

I can't go into too much detail without a reproducible example, but this would be my general approach. Use psych::describe() to create a vector of countries, country_order, sorted from most positive skew to least positive skew. Next, factor the country column in your dataset with country = factor(country, levels = country_order). When you use facet_wrap, the plots will be displayed in the same order as country_order.
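A minimal sketch of that approach, assuming a data frame df with columns Country and DV (the names used further down in this thread). The toy data and the hand-rolled skewness() helper are made up for illustration; the skew column reported by psych::describe() could be used instead.

```r
set.seed(42)
# Toy data (hypothetical): three countries with different skew
df <- data.frame(
  Country = rep(c("A", "B", "C"), each = 1000),
  DV = c(rnorm(1000),    # A: roughly symmetric
         -rexp(1000),    # B: left-skewed
         rexp(1000))     # C: right-skewed
)

# Simple moment-based sample skewness (illustrative helper, not from psych)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Skew per country, then country names from most positive to most negative
skew_by_country <- sapply(split(df$DV, df$Country), skewness)
country_order <- names(sort(skew_by_country, decreasing = TRUE))

# Re-level the factor; facet_wrap(~ Country) follows the level order
df$Country <- factor(df$Country, levels = country_order)
levels(df$Country)
```

With the levels set this way, the panels drawn by facet_wrap(~ Country) should run from the right-skewed country to the left-skewed one instead of alphabetically.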

After some troubleshooting, I found (what I think is) an efficient way of doing it:
library(tidyverse)
library(magrittr) # for %<>%
# describeBy() with mat = TRUE produces a matrix-like result that is easy to merge into df
skews <- psych::describeBy(df$DV, df$Country, mat = TRUE)
# Turn group1 into a factor; I also kept the country means
skews %<>% select(group1, mean, skew) %>% sjlabelled::as_factor(., group1)
# I was getting an error that my levels were inconsistent even though they were the same
# (group1 came from df$Country). I think this was because Country has Germany as its
# reference category, which threw off the alphabetical sort of group1, so I used dfrankow's answer.
combined <- sort(union(levels(df$Country), levels(skews$group1)))
df <- left_join(mutate(df, Country = factor(Country, levels = combined)),
                mutate(skews, Country = factor(group1, levels = combined))) %>%
  rename(`Country skew` = "skew", `Country mean` = "mean") %>%
  select(-group1)
df$`Country skew` <- round(df$`Country skew`, 2)
ggplot(df) +
  geom_bar(aes(x = DV, y = after_stat(prop))) +
  xlab("Scale axis text") + ylab("Proportion") +
  scale_x_continuous() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  ggtitle("DV distribution by country mean") +
  # Reordering inside facet_wrap() leaves the factor order that was important for my lm intact;
  # use `Country skew` instead of `Country mean` to order the panels by skew
  facet_wrap(~ fct_reorder(Country, `Country mean`), nrow = 2)

Related

How to create a stacked area chart in R from a csv with non-numerical data

I am trying to create a stacked area chart in R using data from this csv: https://raw.githubusercontent.com/fivethirtyeight/data/master/masculinity-survey/raw-responses.csv
(The above file is raw content, for better readability of the data look here: https://github.com/fivethirtyeight/data/blob/master/masculinity-survey/masculinity-survey.csv)
I am trying to create a percentage-based stacked area chart that is similar to this example: https://r-charts.com/en/evolution/percentage-stacked-area_files/figure-html/percentage-areaplot.png
The problem is that since I am working with non-numerical data only, it is a bit hard for me to get a proper graph.
My goal is to have the graph display the different age groups on the x-axis (column "age3" in the raw content) and the fill be the ethnicities (column "racethn4" in the raw content), while the y-axis is simply the percentage of the total answers in the survey (which of course goes up to 100).
I tried to do it the following way, but I'm not sure what the y value should be:
df <- read_csv("Path to csv")
ggplot(df, aes(x = df$age3, y = ???, fill = df$racethn4)) + geom_stream()
Any ideas on how to represent the plot as described?
I'm not too well versed in ggplot as I use other graphing packages but I gave this a shot. I don't believe you can use geom_area when x is a categorical variable. At least I did not have any luck trying that. So I used geom_col instead.
Here's two approaches for transforming the data. Using dplyr and data.table. Feel free to pick whichever is more natural for you.
You need to sum up the number of observations per group combo first and then get the percent total for the y values.
library(data.table)
library(ggplot2)
library(dplyr)
dat = fread("temp.csv") # from data.table::fread
# data.table way
dat_sub = dat[, .(age3 = as.factor(age3), racethn4 = as.factor(racethn4))][,.N, by = .(age3,racethn4)]
dat_sub[, tot := sum(N), by = age3][, perc := N/tot*100][order(age3)]
# dplyr way
dat_sub = dat %>%
select(age3, racethn4) %>%
group_by(age3, racethn4) %>%
summarise(n = n()) %>%
group_by(age3) %>%
mutate(tot = sum(n),
perc = n / tot * 100)
# using a stacked bar chart instead of stacked area
ggplot(dat_sub, aes(x = age3, y = perc, fill = racethn4)) +
geom_col()
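The count-then-normalise step can also be checked in base R. This is a small sketch on made-up data: only the column names age3 and racethn4 come from the survey file, and the values below are invented for illustration.

```r
# Toy stand-in for the two survey columns (values are invented)
dat <- data.frame(
  age3     = c("18 - 34", "18 - 34", "18 - 34", "35 - 64", "35 - 64", "65 and up"),
  racethn4 = c("White", "Black", "White", "Hispanic", "White", "White")
)

counts <- table(dat$age3, dat$racethn4)         # observations per group combo
perc   <- prop.table(counts, margin = 1) * 100  # share within each age group

# Long format, one row per (age3, racethn4) pair, usable with aes(x, y, fill)
dat_sub <- as.data.frame(perc, responseName = "perc")
names(dat_sub)[1:2] <- c("age3", "racethn4")
dat_sub
```

If the percentages are only needed for the plot, geom_bar(position = "fill") on the raw rows, with just x and fill mapped, should produce the same proportions without any pre-aggregation.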

Summarising data before plotting with geom_tile() renders different results

I have noticed that when plotting with ggplot2's geom_tile(), summarising the data before plotting renders a completely different result than when it is not pre-summarised. I don't understand why.
For a dataframe with three columns, year (character), state (character) and profit (numeric), consider the following examples:
# Plot straight away
data %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit))
# Summarise before plotting
data %>% group_by(year, state) %>% summarize(profit_mean = mean(profit)) %>%
ungroup() %>%
ggplot(aes(x=year, y=state)) + geom_tile(aes(fill=profit_mean))
These two examples render two different tile plots - the values are quite different. I thought that these two methods of plotting would be analogous and that ggplot2 would take a mean automatically - is that not so?
I tried reproducing this error on a smaller subset of data, but it didn't appear. What could be going on here?
OP, this was a very interesting question.
First, let's get this out of the way. It is clear that plotting the summary of your data plots just that: the summary. You are summarizing via mean, so what is plotted equals the mean of the values for each tile.
The actual question here is: If you have a dataset containing more than one value per tile, what is the result of plotting the "non-summarized" dataset?
User @akrun is correct: the default stat used for geom_tile is stat="identity", but it might not be clear exactly what that means. The documentation says it "leaves the data unchanged", but it's not clear what that implies here.
Illustrative Example Dataset
For purposes of demonstration, I'll create an illustrative dataset, which will answer the question very clearly. I'm creating two individual datasets, df1 and df2, which each contain 4 "tiles" of data. The difference between them is that the values for the tiles are different. I've included text labels on each tile for more clarity.
library(ggplot2)
library(cowplot)
df1 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(5,15,20,25)
)
df2 <- data.frame(
x=rep(paste("Test",1:2), 2),
y=rep(c("A", "B"), each=2),
value=c(10,5,25,15)
)
tile1 <- ggplot(df1, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df1")
tile2 <- ggplot(df2, aes(x,y, fill=value, label=value)) +
geom_tile() + geom_text() + labs(title="df2")
plot_grid(tile1, tile2)
Plotting the Combined Data Frame
Each of the data frames df1 and df2 contains only one value per tile, so in order to see how things change when we have more than one value per tile, we need to combine them into one so that each tile contains 2 values. In this example, we are going to combine them in two ways: first df1 then df2, and then df2 first followed by df1.
df12 <- rbind(df1, df2)
df21 <- rbind(df2, df1)
Now, if we plot each of those as before and compare, the reason for the discrepancy the OP posted should be quite obvious. I'm including the value for each tile for each originating dataset to make things super-clear.
tile12 <- ggplot(df12, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df1, then df2") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
tile21 <- ggplot(df21, aes(x,y, fill=value, label=value)) +
geom_tile() + labs(title="df2, then df1") +
geom_text(data=df1, aes(label=paste("df1:",value)), nudge_y=0.1) +
geom_text(data=df2, aes(label=paste("df2:",value)), nudge_y=-0.1)
plot_grid(tile12, tile21)
Note that the legend colorbar value does not change, so it's not doing an addition. Plus, since we know it's stat="identity", we know this should not be the case. When we use the dataset that contains first observations from df1, then observations from df2, the value plotted is the one from df2. When we use the dataset that contains observations first from df2, then from df1, the value plotted is the one from df1.
Given this piece of information, it can be clear that the value shown in geom_tile() when using stat="identity" (default argument) corresponds to the last observation for that particular tile represented in the data frame.
So, that's the reason why your plot looks odd OP. You can either summarize beforehand as you have done, or use stat_summary(geom="tile"... to do the transformation in one go within ggplot.
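To see the two candidate values per tile side by side without plotting, here is a small base R sketch that reuses df1 and df2 from above. It only computes the per-tile mean and the per-tile last row; that the last row is the one that ends up visible is the observation made in this answer, not something this code proves.

```r
df1 <- data.frame(
  x = rep(paste("Test", 1:2), 2),
  y = rep(c("A", "B"), each = 2),
  value = c(5, 15, 20, 25)
)
df2 <- data.frame(
  x = rep(paste("Test", 1:2), 2),
  y = rep(c("A", "B"), each = 2),
  value = c(10, 5, 25, 15)
)
df12 <- rbind(df1, df2)  # df1 rows first, then df2 rows

# What the pre-summarised plot shows: the mean per tile
tile_mean <- aggregate(value ~ x + y, data = df12, FUN = mean)

# The last row for each tile, in data frame order
tile_last <- aggregate(value ~ x + y, data = df12,
                       FUN = function(v) v[length(v)])

tiles <- merge(tile_mean, tile_last, by = c("x", "y"),
               suffixes = c("_mean", "_last"))
tiles
```

Swapping the rbind() order to rbind(df2, df1) changes value_last but not value_mean, which mirrors the difference between the two unsummarised plots.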

Graphing different variables in the same graph R- ggplot2

I have several datasets and my end goal is to make a graph from them, with each line representing the yearly variation for the given information. I finally joined and combined my data (which was in a per-month structure) into a table that just contains the yearly means for each item I want to graph (column depicting year and subsequent rows depicting yearly variation for 4 different elements).
I have one factor that is the year and 4 different variables that record yearly variations, so I would like to graph them in the same space. I had the idea to join the 4 columns into one by factor (collapse into one observation per row, with the year or factor in the subsequent row) but seem unable to do that. My thought is that this would give a structure to my y axis. I would like some advice, and to know whether my approach to the problem is effective. I am trying ggplot2, but it does not seem to work without a defined (or predefined range for the) y axis. Thanks
I would suggest the following approach. You have to reshape your data from wide to long, as in the next example. That way it is possible to see all the variables. As no data is provided, this solution is sketched using dummy data. You can also swap the lines for another geom, such as points:
library(tidyverse)
set.seed(123)
#Data
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
#Plot
df %>% pivot_longer(-year) %>%
ggplot(aes(x=factor(year),y=value,group=name,color=name))+
geom_line()+
theme_bw()
We could use melt from reshape2 without loading multiple other packages
library(reshape2)
library(ggplot2)
ggplot(melt(df, id.var = 'year'), aes(x = factor(year), y = value,
group = variable, color = variable)) +
geom_line()
Or with matplot from base R
matplot(df$year, as.matrix(df[-1]), type = 'l', xlab = 'year', ylab = 'value')
data
set.seed(123)
df <- data.frame(year=1990:2000,
v1=rnorm(11,2,1),
v2=rnorm(11,3,2),
v3=rnorm(11,4,1),
v4=rnorm(11,5,2))
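For readers unsure what "long" means here, a base R sketch of the same reshape using stack(), on the dummy data from above. The column names values and ind are what stack() produces; they play the role of value and name in the pivot_longer() version.

```r
set.seed(123)
df <- data.frame(year = 1990:2000,
                 v1 = rnorm(11, 2, 1),
                 v2 = rnorm(11, 3, 2),
                 v3 = rnorm(11, 4, 1),
                 v4 = rnorm(11, 5, 2))

# stack() puts v1..v4 under each other; repeat the years to match
long <- data.frame(year = rep(df$year, times = ncol(df) - 1),
                   stack(df[-1]))

head(long, 3)
table(long$ind)  # 11 rows per original column
```

Each row of long is now one (year, variable) observation, which is the shape that lets a single y aesthetic plus a group or color aesthetic draw all four lines.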

Reorder ggplot axis by one value, display labels from another

Situation is as follows:
I have many names and many corresponding codes for those names.
All different names have a unique code, but not all different codes have a unique name.
This has created an issue when plotting the data, as I need to group_by(code), and reorder(name,code) when plotting, but the codes are nonsense and I want to display the names. Since some codes share names, this creates a bit of an issue.
Example to illustrate below:
library(tidyverse)
set.seed(1)
# example df
df <- tibble("name" = c('apple','apple','pear','pear','pear',
'orange','banana','peach','pie','soda',
'pie','tie','beer','picnic','cigar'),
"code" = seq(1,15),
"value" = round(runif(15,0,100)))
df %>%
ggplot(aes(x=reorder(name,value)))+
geom_bar(aes(y=value),
stat='identity')+
coord_flip()+
ggtitle("The axis labels I want, but the order I don't")
df %>%
ggplot(aes(x=reorder(code,value)))+
geom_bar(aes(y=value),
stat='identity')+
coord_flip()+
ggtitle("The order I want, but the axis labels I don't")
Not quite sure how to get ggplot to keep the display and order of the second plot while being able to replace the axis labels with the names from the first plot.
What about using interaction() to bind name and code together, and then replacing the labels in scale_x_discrete with the appropriate ones, as follows:
df %>%
ggplot(aes(x=interaction(reorder(name, value),reorder(code,value))))+
geom_bar(aes(y=value),
stat='identity')+
scale_x_discrete(labels = function(x) sub("\\..*$","",x), name = "name")+
coord_flip()
Is this what you are looking for?
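The labels function passed to scale_x_discrete() receives the interaction levels, which look like name.code, and the regular expression drops everything from the first dot onwards. A quick check in base R on a few made-up rows (this assumes the names themselves contain no dot):

```r
name <- c("apple", "apple", "pear", "pie")
code <- 1:4

# interaction() joins the two with a dot, one level per name/code pair
lev <- levels(interaction(name, code, drop = TRUE))
lev

# Same function as used for `labels` above: strip the ".code" suffix
strip_code <- function(x) sub("\\..*$", "", x)
strip_code(lev)
```

Because the code stays part of the underlying level, two bars that share a name remain separate bars; only the displayed text is shortened.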

Controlling alpha in ggparcoord (from GGally package)

I am trying to build from a question similar to mine (and from which I borrowed the self-contained example and title inspiration). I am trying to apply transparency individually to each line of a ggparcoord or somehow add two layers of ggparcoord on top of the other. The detailed description of the problem and format of data I have for the solution to work is provided below.
I have a dataset with thousands of lines; let's call it x.
library(GGally)
x = data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
After clustering this data I also get a set of 5 lines, let's call this dataset y.
y = data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
In order to see the centroids y overlaying x I use the following code. First I add y to x such that the 5 rows are on the bottom of the final dataframe. This ensures ggparcoord will put them last and therefore stay on top of all the data:
df <- rbind(x,y)
Next I create a new column in df, following the advice in the question I referred to, so that I can color the centroids differently and tell them apart from the data:
df$cluster = "data"
df$cluster[(nrow(df)-4):(nrow(df))] <- "centroids"
Finally I plot it:
p <- ggparcoord(df, columns=1:4, groupColumn=5, scale="globalminmax", alphaLines = 0.99) + xlab("Sample") + ylab("log(Count)")
p + scale_colour_manual(values = c("data" = "grey","centroids" = "#94003C"))
The problem I am stuck on starts at this stage. On my original data, plotting x alone doesn't give much insight since it is a heavy load of lines (on this data, this is equivalent to running ggparcoord above on x instead of df).
By reducing alphaLines considerably (to 0.05), I can naturally see some clusters due to the overlapping of the lines (this is again running ggparcoord on x, with a reduced alphaLines).
It makes more sense to observe the centroids added to df on top of the second plot, not the first.
However, since everything is in a single dataframe, applying such a low value for alphaLines makes the centroid lines disappear. My only option is then to use ggparcoord (as provided above) on df without decreasing alphaLines.
My goal is to have the red lines (centroid lines) on top of the second figure with very low alpha. There are two ways I have thought of so far but couldn't get working:
(1) Is there any way to create a column on the dataframe, similar to what is done for the color, such that I can specify the alpha value for each line?
(2) I originally attempted to create two different ggparcoords and "sum them up" hoping to overlay but an error was raised.
The question may contain too much detail, but I thought this could motivate better the applicability of the answer to serve the interest of other readers.
The answer I am looking for would use the provided data variables in their current format and generate the plot I am looking for. Better ways to restructure the data are also welcome, but using the current structure is preferred.
In this case I think it is easier to just use ggplot and build the graph yourself. We make slight adjustments to how the data is represented (we put it in long format), and then we make the parallel coordinates plot. We can now map any aesthetic we like to cluster.
library(ggplot2)
library(dplyr)
library(tidyr)
# I start the same as you
x <- data.frame(a=runif(100,0,1),b=runif(100,0,1),c=runif(100,0,1),d=runif(100,0,1))
y <- data.frame(a=runif(5,0,1),b=runif(5,0,1),c=runif(5,0,1),d=runif(5,0,1))
# I find this an easier way to combine the two data.frames, and have an id column
df <- bind_rows(data = x, centroids = y, .id = 'cluster')
# We need to add id's, so we know which points to connect with a line
df$id <- 1:nrow(df)
# Put the data into long format
df2 <- gather(df, 'column', 'value', a:d)
# And plot:
ggplot(df2, aes(column, value, alpha = cluster, color = cluster, group = id)) +
geom_line() +
scale_colour_manual(values = c("data" = "grey", "centroids" = "#94003C")) +
scale_alpha_manual(values = c("data" = 0.2, "centroids" = 1)) +
theme_minimal()

Resources