How to melt a dataframe into multiple factors - r

I have been trying to plot a line plot with ggplot.
My data looks something like this:
I04 F04 I05 F05 I06 F06
CAT 3 12 2 6 6 20
DOG 0 0 0 0 0 0
BIEBER 1 0 0 1 0 0
and can be found here.
Basically, we have a certain number of CATs (or other creatures) initially in a year (this is I04), and a certain number of CATs at the end of the year (this is F04). This goes on for some time.
I can plot something like this fairly simply using the code below, and get this:
This is fantastic, but doesn't work very well for me. After all, I have the starting and ending inventory for each year, so I am interested in seeing how the initial values (I04, I05, I06) change over time. For each animal, I would like to create two different lines, one for the initial quantity and one for the final quantity (F04, F05, F06). It seems to me that I now have to consider two factors.
This is really difficult given the way my data is set up. I'm not sure how to tell ggplot that all the I prefixed years are one factor, and all the F prefixed years are another factor. When the dataframe gets melted, it's too late. I'm not sure how to control this situation.
Any advice on how I can separate these values or perhaps another, better way to tackle this situation?
Here is the code I have:
library(ggplot2)
library(reshape2)
DF <- read.csv("mydata.csv", stringsAsFactors=FALSE)
## cleaning up, converting factors to numeric, etc
text_names <- data.frame(as.character(DF$animals))
names(text_names) <- c("animals")
numeric_cols <- DF[, -c(1)]
numeric_cols <- sapply(numeric_cols, as.numeric)
plot_me <- data.frame(cbind(text_names, numeric_cols))
plot_me$animals <- as.factor(plot_me$animals)
meltedDF <- melt(plot_me)
p <- ggplot()
p <- p + geom_line(aes(seq(1:36), meltedDF$value, group=meltedDF$animals, color=meltedDF$animals))
p

Using your original data from the link:
nd <- reshape(mydata, idvar = "animals", direction = "long", varying = names(mydata)[-1], sep = "")
ggplot(nd, aes(x = time, y = I, group = animals, colour = animals)) + geom_line() + ggtitle("Development of initial inventories")
ggplot(nd, aes(x = time, y = F, group = animals, colour = animals)) + geom_line() + ggtitle("Development of final inventories")

I think from a data analyst perspective the following approach might provide better insight.
For each animal we visualize the initial and the final quantity in a separate panel. Moreover, each subplot has its own y scale because the values of the different animal types are radically different. Like this, differences within and across animal types are easier to spot.
Given the current structure of your data, we do not need two different factors. After the gather call the indicator column includes data like I04, F04, etc. We just need to separate the first character from the rest resulting in two columns type and time. We can use type as the argument for color in the ggplot call. time provides a unified x-axis across all animal types.
library(tidyr)
library(dplyr)
library(ggplot2)
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep = 1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = type)) +
geom_line() +
facet_grid(animals ~ ., scales = "free_y")
Of course, you might also do it the other way round, namely using a subplot for the initial and the final quantities like this:
data %>% gather(indicator, value, -animals) %>%
separate(indicator, c('type', 'time'), sep=1) %>%
mutate(
time = as.numeric(time)
) %>% ggplot(aes(time, value, color = animals)) +
geom_line() +
facet_grid(type ~ ., scales = "free_y")
But as described above, I would not recommend that because the y scale varies too much across animal types.
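As a hedged alternative to the gather/separate pipeline above, newer versions of tidyr can split the column names while reshaping, using pivot_longer(). The sketch below rebuilds the question's sample table by hand (the name inventory is made up for illustration) and assumes every column name is a single I or F followed by the year digits.

```r
library(tidyr)

# Toy data copied from the sample table in the question
inventory <- data.frame(
  animals = c("CAT", "DOG", "BIEBER"),
  I04 = c(3, 0, 1), F04 = c(12, 0, 0),
  I05 = c(2, 0, 0), F05 = c(6, 0, 1),
  I06 = c(6, 0, 0), F06 = c(20, 0, 0)
)

# Split each column name into a type (I or F) and a two-digit year
long <- pivot_longer(
  inventory,
  cols = -animals,
  names_to = c("type", "time"),
  names_pattern = "([IF])(\\d+)",
  values_to = "value"
)
long$time <- as.numeric(long$time)
```

The resulting long table has one row per animal, type and year, so type can be mapped to colour and animals to a facet, as in the plots above.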

Related

Why does my line plot (ggplot2) look vertical?

I am new to coding in R. When I used ggplot2 to make a line graph, I got vertical lines. This is my code:
all_trips_v2 %>%
group_by(Month_Name, member_casual) %>%
summarise(average_duration = mean(length_of_ride))%>%
ggplot(aes(x = Month_Name, y = average_duration)) + geom_line()
And I'm getting something like this:
This is a sample of my data:
(Not all the cells in the Month_Name is August, it's just sorted)
Any help will be greatly appreciated! Thank you.
I added a bit more code just for the sake of the example. The data I chose is probably not the best choice for displaying a proper time series.
I hope the ggplot features shown here will be beneficial for you in the future.
library(tidyverse)
library(lubridate)
mydat <- sample_frac(storms,.4)
# setting the month of interest as the current system's month
month_of_interest <- month(Sys.Date(),label = TRUE)
mydat %>% group_by(year,month) %>%
summarise(avg_pressure = mean(pressure)) %>%
mutate(month = month(month,label = TRUE),
current_month = month == month_of_interest) %>%
# the mutate code is just for my example.
ggplot(aes(x=year, y=avg_pressure,
color=current_month,
group=month,
size=current_month
))+geom_line(show.legend = FALSE)+
## From here its not really important,
## just ideas for your next plots
scale_color_manual(values=c("grey","red"))+
scale_size_manual(values = c(.4,1))+
ggtitle(paste("Average yearly pressure,\nwith special interest in",month_of_interest))+
theme_minimal()
## Most important is that you notice the group argument and also,
# in most cases you will want to color your different lines.
# I added a logical variable so only the month of interest will be colored,
# but that is not mandatory
You should add a grouping argument.
See further info here:
https://ggplot2.tidyverse.org/reference/aes_group_order.html
# Multiple groups with one aesthetic
p <- ggplot(nlme::Oxboys, aes(age, height))
# The default is not sufficient here. A single line tries to connect all
# the observations.
p + geom_line()
# To fix this, use the group aesthetic to map a different line for each
# subject.
p + geom_line(aes(group = Subject))
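Applied to the question, the same idea might look like the sketch below. The trips table is a hypothetical stand-in for the summarised data (the real all_trips_v2 is not available), and the numbers are invented. Because Month_Name is discrete, each month would otherwise form its own group, which is what produces the vertical lines; grouping by member_casual draws one line per rider type.

```r
library(ggplot2)

# Hypothetical summary table shaped like the output of summarise() in the question
trips <- data.frame(
  Month_Name = factor(rep(c("June", "July", "August"), times = 2),
                      levels = c("June", "July", "August")),
  member_casual = rep(c("member", "casual"), each = 3),
  average_duration = c(14, 15, 13, 30, 33, 29)
)

# One line per rider type instead of one vertical segment per month
p <- ggplot(trips, aes(x = Month_Name, y = average_duration,
                       group = member_casual, colour = member_casual)) +
  geom_line()
p
```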

Plotting two overlapping density curves using ggplot

I have a dataframe in R consisting of 104 columns, appearing as so:
id vcr1 vcr2 vcr3 sim_vcr1 sim_vcr2 sim_vcr3 sim_vcr4 sim_vcr5 sim_vcr6 sim_vcr7
1 2913 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
2 1260 0.003768704 3.1577108 -0.758378208 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
3 2912 -4.782992840 1.7631999 0.003768704 1.376937 -2.096857 6.903021 7.018855 6.135139 3.188382 6.905323
4 2914 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
5 2915 -1.311132669 0.8220594 2.372950077 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
6 1261 2.372950077 -0.7022792 -4.951318264 -4.194246 -1.460474 -9.101704 -6.663676 -5.364724 -2.717272 -3.682574
The "sim_vcr*" variables go all the way through sim_vcr100
I need two overlapping density curves contained within one plot, looking something like this (except here you see 5 instead of 2):
I need one of the density curves to consist of all values contained in columns vcr1, vcr2, and vcr3, and I need another density curve containing all values in all of the sim_vcr* columns (so 100 columns, sim_vcr1-sim_vcr100)
Because the two curves overlap, they need to be transparent, like in the attached image. I know that there is a pretty straightforward way to do this using the ggplot command, but I am having trouble with the syntax, as well as getting my data frame oriented correctly so that each histogram pulls from the proper columns.
Any help is much appreciated.
With df being the data you mentioned in your post, you can try the following. First split the data into separate data frames, then plot:
library(tidyverse)
# startsWith() is part of base R, so gdata is not needed
#Index
i1 <- which(startsWith(names(df), 'vcr'))
i2 <- which(startsWith(names(df), 'sim'))
#Isolate
df1 <- df[,c(1,i1)]
df2 <- df[,c(1,i2)]
#Melt
M1 <- pivot_longer(df1,cols = names(df1)[-1])
M2 <- pivot_longer(df2,cols = names(df2)[-1])
#Plot 1
ggplot(M1) + geom_density(aes(x=value,fill=name), alpha=.5)
#Plot 2
ggplot(M2) + geom_density(aes(x=value,fill=name), alpha=.5)
Update
Use next code for one plot:
#Unique plot
#Melt
M <- pivot_longer(df,cols = names(df)[-1])
#Mutate
M$var <- ifelse(startsWith(M$name,'vcr'),'vcr','sim_vcr')
#Plot 3
ggplot(M) + geom_density(aes(x=value,fill=var), alpha=.5)
Using the tidyr package, you can first convert your data to long format with the function pivot_longer as follows (the %<>% assignment pipe comes from magrittr):
df %<>% pivot_longer(cols = c(starts_with('vcr'), starts_with('sim_vcr')),
names_to = c('type'),
values_to = c('values'))
Then, using filter() from dplyr together with str_detect() from stringr, you can create a separate plot for each value type.
For vcr columns:
df %>%
filter(str_detect(type, '^vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above produces the following plot:
for sim_vcr columns:
df %>%
filter(str_detect(type, '^sim_vcr')) %>%
ggplot(.) +
geom_density(aes(x = values, fill = type), alpha = 0.5)
The above code produces the following plot:
Another simple way to subset and prepare your data for ggplot is with gather() from tidyr, which you can read more about. Here's how I do it, with df being the data frame you provided.
# Load tidyr to use gather(), and ggplot2 for plotting
library(tidyr)
library(ggplot2)
# Split off the vcr columns (columns 2 to 4) and gather them
df_vcr <- gather(data = df[,2:4])
#Gather the other columns in the dataframe
df_sim<- gather(data = df[,-c(1:4)])
#Plot the first
ggplot() +
geom_density(data = df_vcr,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
#Plot the second
ggplot() +
geom_density(data = df_sim,
mapping = aes(value, group = key, color = key, fill = key),
alpha = 0.5)
However I am a little unclear on what you mean by "all values in all of the sim_vcr* columns". Perhaps you want all of those values in one density curve? To do this, simply do not give ggplot any grouping info in the second case.
ggplot() + geom_density(data = df_sim,
mapping = aes(value),
fill = "grey50",
alpha = 0.5)
Notice that I can still specify the 'fill' for the curve outside of the aes() function; it is then applied to all curves instead of giving each group specified in 'key' a different color.
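If the goal is exactly two overlapping curves in one plot (all vcr values pooled against all sim_vcr values), one possible sketch is to stack both sets of columns into a single long table with a label column. The data below is simulated as a stand-in for the question's data frame, with only 5 sim_vcr columns instead of 100, and the names pooled and source are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the question's data: 3 vcr columns, 5 sim_vcr columns
df <- data.frame(id = 1:50,
                 vcr1 = rnorm(50), vcr2 = rnorm(50), vcr3 = rnorm(50))
for (i in 1:5) df[[paste0("sim_vcr", i)]] <- rnorm(50, mean = 1)

# Select the two column sets by name prefix
vcr_cols <- grepl("^vcr", names(df))
sim_cols <- grepl("^sim_vcr", names(df))

# Pool all values of each set into one column, labelled by source
pooled <- rbind(
  data.frame(source = "vcr", value = unlist(df[vcr_cols], use.names = FALSE)),
  data.frame(source = "sim_vcr", value = unlist(df[sim_cols], use.names = FALSE))
)

# Two semi-transparent density curves in one plot
p <- ggplot(pooled, aes(value, fill = source)) +
  geom_density(alpha = 0.5)
p
```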

R - How to create a seasonal plot - Different lines for years

I already asked the same question yesterday, but I didn't get any suggestions, so I decided to delete the old one and ask again with additional information.
So here again:
I have a dataframe like this:
Link to the original dataframe: https://megastore.uni-augsburg.de/get/JVu_V51GvQ/
Date DENI011
1 1993-01-01 9.946
2 1993-01-02 13.663
3 1993-01-03 6.502
4 1993-01-04 6.031
5 1993-01-05 15.241
6 1993-01-06 6.561
....
....
6569 2010-12-26 44.113
6570 2010-12-27 34.764
6571 2010-12-28 51.659
6572 2010-12-29 28.259
6573 2010-12-30 19.512
6574 2010-12-31 30.231
I want to create a plot that enables me to compare the monthly values in the DENI011 over the years. So I want to have something like this:
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Seasonal%20Plot
Jan-Dec on the x-scale, values on the y-scale and the years displayed by different colored lines.
I found several similar questions here, but nothing works for me. I tried to follow the instructions on the website with the example, but the problem is that I can't create a ts object.
Then I tried it this way:
Ref_Data$MonthN <- as.numeric(format(as.Date(Ref_Data$Date),"%m")) # Month's number
Ref_Data$YearN <- as.numeric(format(as.Date(Ref_Data$Date),"%Y"))
Ref_Data$Month <- months(as.Date(Ref_Data$Date), abbreviate=TRUE) # Month's abbr.
g <- ggplot(data = Ref_Data, aes(x = MonthN, y = DENI011, group = YearN, colour=YearN)) +
geom_line() +
scale_x_discrete(breaks = Ref_Data$MonthN, labels = Ref_Data$Month)
That also didn't work; the plot looks horrible. I don't need to put all the years from 1993 to 2010 in one plot. Actually, only a few years would be OK, for example 1998 to 2006.
Any suggestions on how to solve this?
As others have noted, in order to create a plot such as the one you used as an example, you'll have to aggregate your data first. However, it's also possible to retain daily data in a similar plot.
reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2018-02-11
library(tidyverse)
library(lubridate)
# Import the data
url <- "https://megastore.uni-augsburg.de/get/JVu_V51GvQ/"
raw <- read.table(url, stringsAsFactors = FALSE)
# Parse the dates, and use lower case names
df <- as_tibble(raw) %>%
rename_all(tolower) %>%
mutate(date = ymd(date))
One trick to achieve this would be to set the year component in your date variable to a constant, effectively collapsing the dates to a single year, and then controlling the axis labelling so that you don't include the constant year in the plot.
# Define the plot
p <- df %>%
mutate(
year = factor(year(date)), # use year to define separate curves
date = update(date, year = 1) # use a constant year for the x-axis
) %>%
ggplot(aes(date, deni011, color = year)) +
scale_x_date(date_breaks = "1 month", date_labels = "%b")
# Raw daily data
p + geom_line()
In this case though, your daily data are quite variable, so this is a bit of a mess. You could hone in on a single year to see the daily variation a bit better.
# Hone in on a single year
p + geom_line(aes(group = year), color = "black", alpha = 0.1) +
geom_line(data = function(x) filter(x, year == 2010), size = 1)
But ultimately, if you want to look at several years at a time, it's probably a good idea to present smoothed lines rather than raw daily values. Or, indeed, some monthly aggregate.
# Smoothed version
p + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess'
#> Warning: Removed 117 rows containing non-finite values (stat_smooth).
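The constant-year trick used above can also be sketched in base R, without lubridate. This version collapses every date onto the year 2000, which is an arbitrary choice made here because it is a leap year: a non-leap constant year cannot represent 29 February. The example dates are made up.

```r
# A few example dates, including a leap day
dates <- as.Date(c("1996-02-29", "2003-07-15", "2010-12-31"))

# Keep the original year as a factor to define separate curves
years <- factor(format(dates, "%Y"))

# Replace the year with a constant one so all curves share the same x-axis
collapsed <- as.Date(format(dates, "2000-%m-%d"))
```

The collapsed dates can then be used on the x-axis with scale_x_date(date_labels = "%b"), so the constant year never shows up in the labels.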
There are multiple values from one month, so when plotting your original data, you got multiple points in one month. Therefore, the line looks strange.
If you want to create something similar to the example you provided, you have to summarize your data by year and month. Below I calculated the mean for each year and month of your data. In addition, you need to convert your year and month to factors if you want to plot them as discrete variables.
library(dplyr)
Ref_Data2 <- Ref_Data %>%
group_by(MonthN, YearN, Month) %>%
summarize(DENI011 = mean(DENI011)) %>%
ungroup() %>%
# Convert the Month column to factor variable with levels from Jan to Dec
# Convert the YearN column to factor
mutate(Month = factor(Month, levels = unique(Month)),
YearN = as.factor(YearN))
g <- ggplot(data = Ref_Data2,
aes(x = Month, y = DENI011, group = YearN, colour = YearN)) +
geom_line()
g
If you don't want to add in library(dplyr), this is the base R code. Exact same strategy and results as www's answer.
dat <- read.delim("~/Downloads/df1.dat", sep = " ")
dat$Date <- as.Date(dat$Date)
dat$month <- factor(months(dat$Date, TRUE), levels = month.abb)
dat$year <- gsub("-.*", "", dat$Date)
month_summary <- aggregate(DENI011 ~ month + year, data = dat, mean)
ggplot(month_summary, aes(month, DENI011, color = year, group = year)) +
geom_path()

grouped barplot: order x-axis & keep constant bar width, in case of missing levels

Here is my script (example inspired from here and using the reorder option from here):
library(ggplot2)
Animals <- read.table(
header=TRUE, text='Category Reason Species
1 Decline Genuine 24
2 Improved Genuine 16
3 Improved Misclassified 85
4 Decline Misclassified 41
5 Decline Taxonomic 2
6 Improved Taxonomic 7
7 Decline Unclear 10
8 Improved Unclear 25
9 Improved Bla 10
10 Decline Hello 30')
fig <- ggplot(Animals, aes(x=reorder(Animals$Reason, -Animals$Species), y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge")
This gives the following output plot:
What I would like is to order my barplot only on condition 'Decline', and all the 'Improved' would not be inserted in the middle. Here is what I would like to get (after some svg editing):
So now the whole 'Decline' condition is sorted and the 'Improved' condition comes after. Besides, ideally, the bars would all have the same width, even if a condition is not represented for a value (e.g. "Bla" has no "Decline" value).
Any idea on how I could do that without having to play with SVG editors? Many thanks!
First let's fill your data.frame with missing combinations like this.
library(dplyr)
Animals2 <- expand.grid(Category=unique(Animals$Category), Reason=unique(Animals$Reason)) %>% data.frame %>% left_join(Animals)
Then you can create an ordering variable for the x-scale:
myorder <- Animals2 %>% filter(Category=="Decline") %>% arrange(desc(Species)) %>% .$Reason %>% as.character
And then plot:
ggplot(Animals2, aes(x=Reason, y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge") + scale_x_discrete(limits=myorder)
Define a new data frame with all combinations of Category and Reason, merge it with the Species values from the Animals data frame, and set the x-axis order with scale_x_discrete:
Animals3 <- expand.grid(Category=unique(Animals$Category),Reason=unique(Animals$Reason))
Animals3 <- merge(Animals3,Animals,by=c("Category","Reason"),all.x=TRUE)
Animals3[is.na(Animals3)] <- 0
Animals3 <- Animals3[order(Animals3$Category,-Animals3$Species),]
ggplot(Animals3, aes(x=Reason, y=Species, fill = Category)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_discrete(limits=as.character(Animals3[Animals3$Category=="Decline","Reason"]))
To achieve something like that I would adjust the data frame when working with ggplot. Add the missing categories with a value of zero.
Animals <- rbind(Animals,
data.frame(Category = c("Improved", "Decline"),
Reason = c("Hello", "Bla"),
Species = c(0,0)
)
)
Along the same lines as the answer from user Alex, a less manual way of adding the categories might be
d <- with(Animals, expand.grid(unique(Category), unique(Reason)))
names(d) <- names(Animals)[1:2]
Animals <- merge(d, Animals, all.x=TRUE)
Animals$Species[is.na(Animals$Species)] <- 0
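The fill-then-order approach from the answers above could also be sketched with tidyr::complete(), which adds the missing Category/Reason combinations in one step. This is only one possible variant; the names filled and decline_order are made up for illustration, and the data is retyped from the question.

```r
library(tidyr)
library(ggplot2)

# Data retyped from the question
Animals <- data.frame(
  Category = c("Decline", "Improved", "Improved", "Decline", "Decline",
               "Improved", "Decline", "Improved", "Improved", "Decline"),
  Reason = c("Genuine", "Genuine", "Misclassified", "Misclassified", "Taxonomic",
             "Taxonomic", "Unclear", "Unclear", "Bla", "Hello"),
  Species = c(24, 16, 85, 41, 2, 7, 10, 25, 10, 30)
)

# Add the missing combinations with a count of zero so every bar slot exists
filled <- complete(Animals, Category, Reason, fill = list(Species = 0))

# Order the x-axis by the 'Decline' counts only, largest first
decline <- subset(filled, Category == "Decline")
decline_order <- as.character(decline$Reason[order(-decline$Species)])

p <- ggplot(filled, aes(x = Reason, y = Species, fill = Category)) +
  geom_col(position = "dodge") +
  scale_x_discrete(limits = decline_order)
p
```

Because every Reason now has a row for both categories, all bars get the same width; the zero-height bars simply do not show.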

Plotting a line graph with multiple lines

I am trying to plot a line graph with multiple lines in different colors, but not having much luck. My data set consists of 10 states and the voting turnout rates for each state from 9 elections (so the states are listed in the left column, and each subsequent column is an election year from 1980-2012 with the voting turnout rate for each of the 10 states). I would like to have a graph with the year on the X axis and the voting turnout rate on the Y axis, with a line for each state.
I found this previous answer (Plotting multiple lines from a data frame in R) to a similar question but cannot seem to replicate it using my data. Any ideas/suggestions would be immensely appreciated!
Use tidyr::gather or reshape::melt to transform the data to a long form.
## Simulate data
d <- data.frame(state=letters[1:10],
'1980'=runif(10,0,100),
'1981'=runif(10,0,100),
'1982'=runif(10,0,100))
library(dplyr)
library(tidyr)
library(ggplot2)
## Transform to a long df
e <- d %>% gather(., key, value, -state) %>%
mutate(year = as.numeric(substr(as.character(key), 2, 5))) %>%
select(-key)
## Plot
ggplot(data=e,aes(x=year,y=value,color=state)) +
geom_point() +
geom_line()
Please include your data, or sample data, in your question so that we can answer your question directly and help you get to the root of the problem. Pasting your data is simplified by using dput().
Here's another solution to your problem, using scoa's sample data and the reshape2 package instead of the tidyr package:
# Sample data
d <- data.frame(state = letters[1:10],
'1980' = runif(10,0,100),
'1981' = runif(10,0,100),
'1982' = runif(10,0,100))
library(reshape2)
library(ggplot2)
# Melt data and remove X introduced into year name
melt.d <- melt(d, id = "state")
melt.d[["variable"]] <- gsub("X", "", melt.d[["variable"]])
# Plot melted data
ggplot(data = melt.d,
aes(x = variable,
y = value,
group = state,
color = state)) +
geom_point() +
geom_line()
Produces:
Note that I left out the as.numeric() conversion for year from scoa's example, and this is why the graph above does not include the extra x-axis ticks that scoa's does.
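Both answers have to strip the X that data.frame() prepends to the year names. A possible way to avoid that step, sketched below under the assumption that the real data has one column per election year, is to build or read the data with check.names = FALSE and reshape with tidyr::pivot_longer(). The turnout data here is simulated and the column name rate is made up.

```r
library(tidyr)
library(ggplot2)

set.seed(42)
# Simulated wide data; check.names = FALSE keeps "1980" instead of "X1980"
turnout <- data.frame(state = letters[1:10],
                      "1980" = runif(10, 0, 100),
                      "1981" = runif(10, 0, 100),
                      "1982" = runif(10, 0, 100),
                      check.names = FALSE)

# One row per state and year
long <- pivot_longer(turnout, cols = -state,
                     names_to = "year", values_to = "rate")
long$year <- as.integer(long$year)

# One coloured line per state
p <- ggplot(long, aes(x = year, y = rate, colour = state)) +
  geom_point() +
  geom_line()
p
```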
