I have data similar to the following:
a <- list("1999"=c(1:50), "2000"=rep(35, 20), "2001"=c(100, 101, 103))
I want to make a scatterplot where the x axis is the years (1999, 2000, 2001 as given by the list names) and the y axis is the value within each list. Is there an easy way to do this? I can achieve something close to what I want with a simple boxplot(a), but I'd like it to be a scatterplot rather than a boxplot.
You could create a data frame with the year in one column and the value in the other, and then plot the appropriate columns:
b <- data.frame(year=as.numeric(rep(names(a), sapply(a, length))), val=unlist(a))
plot(b)
This will build the data frame you want:
do.call(rbind,
lapply(names(a), function(year) data.frame(year = year, obs = a[[year]]))
)
Or break it up into two function calls (lapply and then do.call) to make it more understandable what's going on. It's pretty simple if you go through it. The lapply creates one dataframe per year where each year gets all the values for that year in the list. Now you have 3 dataframes. Then do.call binds these dataframes together.
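A minimal sketch of that two-step version, assuming the list a from the question. The names per_year and combined are made up for illustration:

```r
a <- list("1999" = 1:50, "2000" = rep(35, 20), "2001" = c(100, 101, 103))

# Step 1: one data frame per year, each holding that year's values
per_year <- lapply(names(a), function(year) {
  data.frame(year = year, obs = a[[year]])
})

# Step 2: stack the three data frames on top of each other
combined <- do.call(rbind, per_year)

# The list names are character strings, so convert them before plotting
combined$year <- as.numeric(combined$year)
plot(combined$year, combined$obs, xlab = "year", ylab = "value")
```

Inspecting per_year after step 1 (for example with str(per_year)) shows the three intermediate data frames before they are bound together.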
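An option using tibble/tidyr/ggplot2. Current versions of tidyr's unnest() expect a data frame, so the list is first turned into a two-column tibble with enframe():
library(ggplot2)
library(tibble)
library(tidyr)
library(dplyr)
# one row per year, with that year's values kept in a list-column named x
enframe(a, name = "year", value = "x") %>%
unnest(x) %>% # expand to one row per value
ggplot(aes(x = year, y = x)) +
geom_point(shape = 1)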
I have a df with dates in a column "månad", in the format 2020M07.
When I plot this, the x-axis gets very crowded. Instead of plotting one continuous line, I want one series per year, with only the month on the x-axis.
To do this I need a column in my df containing only the year, so I can group on that variable in ggplot, and another column containing only the month, to use as the x data in ggplot. How do I derive these from the "månad" column? Is there something like Excel's LEFT function that can be used with dplyr's mutate? If not, how else can I do this?
Maybe my idea isn't the best, so feel free to answer my suggested approach or to propose a better one!
Thanks!
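One possible approach, sketched here with base R's substr(), which plays roughly the role of Excel's LEFT/MID. The data frame below is a made-up stand-in: the column period is assumed to hold strings in the same 2020M07 format as "månad", and value stands in for the measured series.

```r
# Hypothetical stand-in data; `period` mimics the "månad" column
df <- data.frame(
  period = c("2019M11", "2019M12", "2020M01", "2020M02"),
  value  = c(10, 12, 11, 13)
)

# Characters 1-4 are the year (like LEFT(period, 4) in Excel)
df$year <- substr(df$period, 1, 4)

# Characters 6-7 are the month, after the literal "M"
df$month <- as.integer(substr(df$period, 6, 7))
```

The same substr() expressions can be used inside dplyr's mutate(). With the two new columns, a plot along the lines of ggplot(df, aes(x = month, y = value, colour = year, group = year)) + geom_line() should draw one series per year.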
EDIT: a dump of the df in its current state
Code:
library(pxweb)
library(tidyverse)
library(astsa)
library(forecast)
library(scales)
library(plotly)
library(zoo)
library(lubridate)
# PXWEB query
pxweb_query_list_BAS <-
list("Region"=c("22"),
"Kon" =c("1+2"),
"SNI2007" =c("A-U+US"),
"ContentsCode"=c("000005F3"))
# Download data
px_data_BAS <-
pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/AM/AM0210/AM0210B/ArbStDoNMN",
query = pxweb_query_list_BAS)
# Convert to data.frame
df_syss_natt <- as.data.frame(px_data_BAS, column.name.type = "text", variable.value.type = "text") %>%
rename(syss_natt = 'sysselsatta efter bostadens belägenhet') %>%
filter(månad >2020)
# Plot data
ggplot(df_syss_natt, aes(x=månad, y=syss_natt, group=1)) +
geom_point() +
geom_line(color="red")
Very crowded output with the current single series:
I would like to plot a geom_point() using ggplot2.
My tibble is shown below. It is a single row tibble, and I would like the x axis to be the row and the y axis to be the column names (used as a date)
So I tried this :
g <- ggplot(data, aes(x = data, y = str_remove_all(colnames(data), "X"))) +
geom_point()
g
But it does not work.
I had the idea of transposing my tibble, but I really do not know how to do that.
Thanks in advance for your help.
OP, you need to reorganize your data to follow Tidy Data guidelines. Luckily, there are a lot of ways to do this in R. I usually default to tidyr and dplyr via gather(): you "gather" all the columns together, the column names become the "key", and the observations (the row values) become the "value".
Since OP did not share their data in a format that is easy to work with, here's a reprex:
set.seed(1234)
df <- data.frame(NA)
for(i in LETTERS){ df[[i]]=sample(1:10, size=1) }
df <- df[,-1] # gives you data frame with 1 row and 26 columns
That gives you a data frame with 26 columns ("A" through "Z"), where the single observation in each column is a random integer from 1 to 10. Here's how to gather them together and use them to make a plot:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key=y, value=x) # result = data frame with 2 columns and 26 rows
ggplot(df, aes(x=x, y=y)) + geom_point()
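In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A rough equivalent of the reshaping step above, as a sketch that assumes a recent tidyr is installed; the one-row data frame is rebuilt here in a slightly different way from the reprex, and the name long is illustrative:

```r
library(tidyr)

# Build a 1-row, 26-column data frame similar to the reprex above
set.seed(1234)
df <- as.data.frame(as.list(setNames(sample(1:10, 26, replace = TRUE), LETTERS)))

# Column names go to `y`, the single row of values goes to `x`
long <- pivot_longer(df, cols = everything(), names_to = "y", values_to = "x")
```

The result has the same two-column shape as the gather() output, so the same ggplot(long, aes(x=x, y=y)) + geom_point() call should apply.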
I've got a (very) basic level of competency with R when working with numbers, but when it comes to manipulating data based on text values in columns I'm stuck. For example, if I want to plot meal frequency vs. day of week (is Tuesday really for tacos?) using the following data frame, how would I do that? I've seen suggestions of tapply, aggregate, colSums, and others, but those have all been for slightly different scenarios and nothing gives me what I'm looking for. Should I be looking at something other than R for this problem? My end goal is a graph with day of week on the X-axis, count on the Y-axis, and a line plot for each meal.
df <- data.frame(meal= c("tacos","spaghetti","burgers","tacos","spaghetti",
"spaghetti"), day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
This is as close as I've gotten, and, to be honest, I don't fully understand what it's doing:
tapply(df$day, df$meal, FUN = function(x) length(x))
It will summarize the meal counts, but a) it doesn't have column names (my understanding is that's due to tapply returning a vector), and b) it doesn't keep an association with the day of the week.
Edit: The melt() suggestion below works for this dataset, but it won't scale to the size I need. I was, however, able to get a working graph from the dataframe produced by the melt. If anybody runs across this in the future, try:
ggplot(new, aes(day, value, group=meal, col=meal)) +
geom_line() + geom_point() + scale_y_continuous(breaks = function(x)
unique(floor(pretty(seq(0, (max(x) + 1) * 1.1)))))
(The part after geom_point() is to force the Y-axis to only be integers, which is what makes sense in this case.)
I tried to cut this into smaller pieces so you can understand what's going on:
library(tidyverse)
# defining the dataframes
df <- data.frame(meal = c("tacos","spaghetti","burgers","tacos","spaghetti","spaghetti"),
day = c("monday","tuesday","wednesday","monday","tuesday","wednesday"))
# define a vector of days of week ( will be useful to display x axis in the correct order)
ordered_days =c("sunday","monday","tuesday","wednesday",
"thursday","friday",'saturday')
# count the number of meals per day of week
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
# many combinations are missing, for example no burgers on monday
# so create every combination with a count of 0
fill_0 <- expand.grid(
meal=factor(unique(df$meal)),
day=factor(ordered_days),
n=0)
# append this fill_0 to df_count
# as some combinations already exist, group by again and sum n
# so only one row per (meal,day) combination
df_count <- rbind(df_count,fill_0) %>%
group_by(meal,day) %>%
summarise(n=sum(n), .groups="drop") %>%
# name the levels argument explicitly so the day order is unambiguous
mutate(day=factor(day, levels=ordered_days, ordered=TRUE))
# plot this by grouping by meal
ggplot(df_count,aes(x=day,y=n,group=meal,col=meal)) + geom_line()
The magic is here, courtesy of @fmarm:
df_count <- df %>% group_by(meal,day) %>% count() %>% ungroup()
The fill_0 and rbind bits in the sample provided by @fmarm are also necessary, to keep the plot from breaking on combinations that never occur, but it's the line above that handles counting meals by day.
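As a possible base R alternative (a sketch, not taken from the answers above), table() cross-tabulates meal against day. If day is first made a factor with all seven levels, combinations that never occur come out as zero counts without a separate fill step. The names counts and counts_long are illustrative:

```r
df <- data.frame(
  meal = c("tacos", "spaghetti", "burgers", "tacos", "spaghetti", "spaghetti"),
  day  = c("monday", "tuesday", "wednesday", "monday", "tuesday", "wednesday")
)

ordered_days <- c("sunday", "monday", "tuesday", "wednesday",
                  "thursday", "friday", "saturday")

# Fixing the levels keeps days with no meals and sets the axis order
df$day <- factor(df$day, levels = ordered_days)

# Cross-tabulate: one cell per (meal, day) pair, zeros included
counts <- table(meal = df$meal, day = df$day)

# Long format with named columns, ready for ggplot
counts_long <- as.data.frame(counts, responseName = "n")
```

counts_long has columns meal, day and n, so it can be passed to the same kind of call as above, for example ggplot(counts_long, aes(x=day, y=n, group=meal, col=meal)) + geom_line().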
I have a year's worth of data spanning two calendar years. I want to plot boxplots for those data subset by month.
The plots will always be ordered alphabetically (if I use month names) or numerically (if I use month numbers). Neither suits my purpose.
In the example below, I want the months on the x-axis to start at June (2013) and end in May (2014).
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
boxplot(df$x ~ months(df$date), outline = FALSE)
I could probably generate a vector of the months in the order I need (e.g. months <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))).
Is there a more elegant way to do this? What am I missing?
Are you looking for something like this?
boxplot(df$x ~ reorder(format(df$date,'%b %y'),df$date), outline = FALSE)
I am using reorder to order the month labels according to the dates. I am also formatting the dates to drop the day part, since the boxplot is aggregated by month.
Edit:
If you want to skip the year part (but why? Personally I find this a little bit confusing):
boxplot(df$x ~ reorder(format(df$date,'%B'),df$date), outline = FALSE)
EDIT2: a ggplot2 solution:
Since you are in the marketing field and you are learning ggplot2 :)
library(ggplot2)
ggplot(df) +
geom_boxplot(aes(y=x,
x=reorder(format(date,'%B'),date), # inside aes(), refer to columns directly rather than via df$
fill=format(date,'%Y'))) +
xlab('Month') + guides(fill=guide_legend(title="Year")) +
theme_bw()
I had a similar problem where I wanted to order the plot January to December. This seems to be a common cause of vexation for people. Here is my solution:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
months <- month.name
boxplot(x~as.POSIXlt(date)$mon,names=months, outline = FALSE)
Found an answer here - use a factor, not a date:
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
# create an ordered factor
m <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))
df$months <- factor(months(df$date), levels = m)
# plot x axis as ordered
boxplot(df$x ~ df$months, outline = FALSE)
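A small variation, sketched under the assumption that the dates are already in chronological order (as they are in the example): the level order can be taken from the data itself with unique(), so the start and end dates do not have to be typed a second time. The names month_levels and month_factor are illustrative:

```r
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")

# unique() keeps first-appearance order, so sorted dates give June ... May
month_levels <- unique(months(date))

# Factor whose levels follow the order of the data, not the alphabet
month_factor <- factor(months(date), levels = month_levels)
```

If the dates were not sorted, unique(months(sort(date))) would be the safer form. Note that months() returns locale-dependent names, so the labels follow the session's language settings.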
I am trying to use ggplot to plot production data by company, using the color of the point to designate the year. The following chart shows an example based on sample data:
However, my real data often has 50-60 different companies, which makes the company names on the Y axis tightly grouped and not very aesthetically pleasing.
What is the easiest way to show data for only the top 5 companies (ranked by 2011 quantities), with the rest aggregated and shown as "Other"?
Below is some sample data and the code I have used to create the sample chart:
library(ggplot2)
# create some sample data
# (the vector is called comp to avoid reusing the name of base::c)
comp <- c("AAA","BBB","CCC","DDD","EEE","FFF","GGG","HHH","III","JJJ")
df1 <- data.frame(Company=comp, Quantity=c(1,2,3,4,5,6,7,8,9,10), Year=2010)
df2 <- data.frame(Company=comp, Quantity=c(3,4,7,8,5,14,7,13,2,1), Year=2011)
df <- rbind(df1, df2)
# create plot
p <- ggplot(data=df, aes(Quantity, Company)) +
geom_point(aes(color=factor(Year)), size=4)
p
I started down the path of a brute-force approach, but thought there is probably a simple and elegant way to do this that I should learn. Any assistance would be greatly appreciated.
What about this:
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
ggplot (data = subset (df, Company %in% companies [1 : 5]),
aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
BTW: in order for the code to be called elegant, spend a few more spaces, they aren't that expensive...
See if this is what you want. It takes your df data frame and some of the ideas already suggested by @cbeleites. The steps are:
1. Select the 2011 data and order the companies from highest to lowest on Quantity.
2. Split df into two bits: dftop, which contains the data for the top 5; and dfother, which contains the aggregated data for the other companies (using ddply() from the plyr package).
3. Put the two data frames together to give dfnew.
4. Set the order in which the levels of Company are plotted: top to bottom is highest to lowest, then "Other". The order is partly given by companies, plus "Other".
5. Plot as before.
library(ggplot2)
library(plyr)
# Step 1
df2011 <- subset (df, Year == 2011)
companies <- df2011$Company [order (df2011$Quantity, decreasing = TRUE)]
# Step 2
dftop = subset(df, Company %in% companies [1:5])
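dftop$Company = as.character(dftop$Company) # works for factor or character columns; droplevels() fails on character (the default since R 4.0)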
dfother = ddply(subset(df, !(Company %in% companies [1:5])), .(Year), summarise, Quantity = sum(Quantity))
dfother$Company = "Other"
# Step 3
dfnew = rbind(dftop, dfother)
# Step 4
dfnew$Company = factor(dfnew$Company, levels = c("Other", rev(as.character(companies)[1:5])))
levels(dfnew$Company) # Check that the levels are in the correct order
# Step 5
p = ggplot (data = dfnew, aes (Quantity, Company)) +
geom_point (aes (color = factor (Year)), size = 4)
p
The code produces:
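The same top-5-plus-"Other" aggregation can also be sketched without plyr, using base R's aggregate(). This is an alternative under the same sample data, not the method used in the answers above. The names c_names, q2011, top5 and Group are made up for illustration:

```r
# Sample data from the question
c_names <- c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG", "HHH", "III", "JJJ")
df <- rbind(
  data.frame(Company = c_names, Quantity = 1:10, Year = 2010),
  data.frame(Company = c_names, Quantity = c(3, 4, 7, 8, 5, 14, 7, 13, 2, 1), Year = 2011)
)

# Rank companies by their 2011 quantity and keep the first five
q2011 <- df[df$Year == 2011, ]
top5 <- as.character(q2011$Company[order(q2011$Quantity, decreasing = TRUE)][1:5])

# Relabel every company outside the top five as "Other"
df$Group <- ifelse(df$Company %in% top5, as.character(df$Company), "Other")

# Sum quantities per label and year; only the "Other" rows actually merge
dfnew <- aggregate(Quantity ~ Group + Year, data = df, FUN = sum)

# Plot order, bottom to top: "Other", then lowest to highest of the top five
dfnew$Group <- factor(dfnew$Group, levels = c("Other", rev(top5)))
```

dfnew can then be plotted as before, with Group in place of Company, e.g. ggplot(dfnew, aes(Quantity, Group)) + geom_point(aes(color = factor(Year)), size = 4).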