I have a df with dates in a column named "månad", in the format 2020M07.
When I plot this, the x-axis gets very crowded. Instead of plotting one continuous line, I want one series per year, with the x-axis containing only the month.
To do this I need a column in my df with only the year, so I can group on that variable in ggplot, and another column with only the month (as the x input in ggplot). How do I use the "månad" column to achieve this? Is there something like Excel's LEFT function that can be used with dplyr's mutate? If not, how else can I do this?
Maybe my idea isn't the best; feel free to answer both my suggested solution and, if you have one, a better approach!
Thanks!
EDIT: here is the code that produces the df in its current state:
library(pxweb)
library(tidyverse)
library(astsa)
library(forecast)
library(scales)
library(plotly)
library(zoo)
library(lubridate)
# PXWEB query
pxweb_query_list_BAS <-
  list("Region"       = c("22"),
       "Kon"          = c("1+2"),
       "SNI2007"      = c("A-U+US"),
       "ContentsCode" = c("000005F3"))
# Download data
px_data_BAS <-
  pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/AM/AM0210/AM0210B/ArbStDoNMN",
            query = pxweb_query_list_BAS)
# Convert to data.frame
df_syss_natt <- as.data.frame(px_data_BAS, column.name.type = "text", variable.value.type = "text") %>%
  rename(syss_natt = 'sysselsatta efter bostadens belägenhet') %>%
  filter(månad > 2020)
# Plot data
ggplot(df_syss_natt, aes(x = månad, y = syss_natt, group = 1)) +
  geom_point() +
  geom_line(color = "red")
The output with the current single series is very crowded.
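One possible way to do what the question asks: base R's substr() plays the role of Excel's LEFT/MID and works inside mutate(). The following is only a sketch, assuming every value has the form YYYYMmm. The data frame df_demo and its numbers are made up for illustration, and the month column is called period here to keep the example ASCII-only; in the real data you would use the "månad" column instead.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for df_syss_natt; 'period' plays the role of the original month column
df_demo <- data.frame(
  period    = c("2020M01", "2020M02", "2020M03", "2021M01", "2021M02", "2021M03"),
  syss_natt = c(100, 102, 101, 105, 107, 110)
)

df_split <- df_demo %>%
  mutate(
    year  = substr(period, 1, 4),              # like LEFT(period, 4) in Excel
    month = as.integer(substr(period, 6, 7))   # characters 6-7 hold the month number
  )

# One line per year, months 1-12 on the x-axis
p <- ggplot(df_split, aes(x = month, y = syss_natt, colour = year, group = year)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12)
p
```

Keeping year as a character column makes ggplot treat it as a discrete variable, so each year gets its own colour and legend entry.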
I am a bit stuck with some code. Of course I would appreciate a piece of code which sorts my dilemma, but I am also grateful for hints of how to sort that out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx)
The relevant part:
I extract a file from an Italian gas TSO's website:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different y-axes, because their ranges are very different. That is not really a problem as such, because I can achieve it with the melt function and ggplot. However, there are NAs in one column, and I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so ideally I would have hourly points on one chart and only one data point per day (at 16:00) on the second chart. I attached an unrelated example pic of the chart style I mean. In the attached chart, however, both charts have equally many data points, so it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
# Use na.strings: NAs appear to be encoded in several ways in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
                        sheet = "Storico_G+1", startRow = 1,
                        colNames = TRUE,
                        na.strings = c("NA","N.D.","N.D"))
# Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
  select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# The date column is in a format that ymd_h can parse
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
# To aggregate per day, convert date_hour to a Date and group by it
# (as.Date on a date-time takes no format string; its second argument is a time zone)
# count lets you check that there are 24 points per day
# Then feed the new columns into ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
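The question also asked for hourly values on one chart and only the 16:00 value of the SAS series on the other. A rough sketch of one way to do that follows. It rests on a few assumptions: synthetic data stands in for the downloaded file, the column names date_hour, immesso and sads follow the answer above, and the two stacked panels with independent y-axes (facet_wrap with scales = "free_y") are my own substitute for a true dual-axis chart.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Synthetic stand-in: three days of hourly data, sads only known at 16:00
set.seed(1)
demo <- data.frame(
  date_hour = seq(ymd_h("2017-01-01 00"), by = "hour", length.out = 72)
)
demo$immesso <- rnorm(72, mean = 200, sd = 10)
demo$sads    <- ifelse(hour(demo$date_hour) == 16, rnorm(72), NA)

# Hourly series: keep every row
hourly <- demo %>%
  transmute(date_hour, series = "immesso (hourly)", value = immesso)

# SAS series: keep only the non-missing 16:00 observation of each day
daily16 <- demo %>%
  filter(hour(date_hour) == 16, !is.na(sads)) %>%
  transmute(date_hour, series = "sads (16:00 only)", value = sads)

p <- bind_rows(hourly, daily16) %>%
  ggplot(aes(x = date_hour, y = value)) +
  geom_line() +
  geom_point(data = daily16) +                       # mark the single daily points
  facet_wrap(~series, ncol = 1, scales = "free_y")   # separate y-range per panel
p
```

Filtering out the NA rows before plotting means geom_line connects the daily 16:00 points directly instead of breaking at every missing hour.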
I have some data that I need to graph in R. There are two columns of data. The first is a series of years ranging from 2001 to 2011. The second column is a string. The strings can be anything. I need to make a multi-line graph (I was trying to use ggplot) where the number of occurrences of a string is on the y-axis and the year is on the x-axis.
I don't really have much of an idea where to start. This is what I had but I'm not sure if this is correct.
year <- data$year
# Idk how to get occurrences per year
# year_2001 <- data$string[data$year == 2001]
# would this work?
# ggplot + geom_line()
I know most of that is commented out but that's because I'm new to R. Any help or guidance is greatly appreciated. Thanks!
Here is one way to get it done.
library(ggplot2)
library(dplyr)
set.seed(272727)
data <- data.frame(year = sample(2001:2011, 100, replace = TRUE),
                   string = sample(letters[1:5], 100, replace = TRUE))
# this is what will be plotted
table(data$string, data$year)
dataSummary <- as.data.frame(xtabs(~year+string, data))
ggplot(dataSummary, aes(x = year, y = Freq, group = string, colour = string)) + geom_line()
Note my previous answer used dplyr, but it had an issue with year-string combinations that are zero length. See dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output.
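For completeness, a dplyr-based version can work around the zero-count issue by filling in the missing year-string combinations explicitly. This is a sketch, not the earlier answer itself: it assumes tidyr is available and uses complete() to add the absent combinations with a count of 0. The name dataSummary2 is made up here.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(272727)
data <- data.frame(year = sample(2001:2011, 100, replace = TRUE),
                   string = sample(letters[1:5], 100, replace = TRUE))

dataSummary2 <- data %>%
  count(year, string, name = "Freq") %>%                       # occurrences per year and string
  complete(year = 2001:2011, string, fill = list(Freq = 0L))   # add missing combinations as 0

p <- ggplot(dataSummary2, aes(x = year, y = Freq, group = string, colour = string)) +
  geom_line() +
  scale_x_continuous(breaks = 2001:2011)
p
```

Unlike the xtabs route, year stays numeric here, so the x-axis is continuous and the breaks are set by hand.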
Here is what I have:
A data frame which contains a date field, and a number of summary statistics.
Here's what I want:
I want a chart that allows me to compare the time series week over week, to see how the performance of the process this week compares to the previous one, for example.
What I have done so far:
##Get the week day name to display
summaryData$WeekDay <- format(summaryData$Date, format = '%A')
##Get the week number to differentiate the weeks
summaryData$Week <- format(summaryData$Date, format = '%V')
summaryData %>%
  ggvis(x = ~WeekDay, y = ~Referrers) %>%
  layer_lines(stroke = ~Week)
I expected it to create a chart with multiple coloured lines, each one representing a week in my data set. It does not do what I expect.
Try looking at the reshape2 package to convert your data so that it has a factor variable for each week, or split up the data with a dplyr::lag() command.
A general way of plotting multiple columns in ggvis is to use the following format:
summaryData %>%
  ggvis() %>%
  layer_lines(x = ~WeekDay, y = ~Referrers) %>%
  layer_lines(x = ~WeekDay, y = ~Other)
I hope this helps
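As an alternative illustration, here is a week-over-week sketch in ggplot2 rather than ggvis (a deliberate swap of plotting library). The data is made up; only the column names Date, Referrers and Week come from the question, and WeekDayNum is a helper column introduced here. The main idea is to put a numeric weekday on the x-axis so the days sort Monday to Sunday instead of alphabetically, and to group the lines by week.

```r
library(ggplot2)

# Made-up data: three full ISO weeks (2021-03-01 is a Monday)
set.seed(42)
summaryData <- data.frame(
  Date = seq(as.Date("2021-03-01"), by = "day", length.out = 21)
)
summaryData$Referrers <- round(runif(21, 50, 150))

# ISO week number, and ISO weekday number (1 = Monday ... 7 = Sunday)
summaryData$Week       <- format(summaryData$Date, "%V")
summaryData$WeekDayNum <- as.integer(format(summaryData$Date, "%u"))

p <- ggplot(summaryData,
            aes(x = WeekDayNum, y = Referrers, colour = Week, group = Week)) +
  geom_line() +
  scale_x_continuous(breaks = 1:7,
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
p
```

Hard-coding the day labels keeps the axis independent of the locale, which format(Date, "%A") is not.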
I have data similar to the following:
a <- list("1999"=c(1:50), "2000"=rep(35, 20), "2001"=c(100, 101, 103))
I want to make a scatterplot where the x axis is the years (1999, 2000, 2001 as given by the list names) and the y axis is the value within each list. Is there an easy way to do this? I can achieve something close to what I want with a simple boxplot(a), but I'd like it to be a scatterplot rather than a boxplot.
You could create a data frame with the year in one column and the value in the other, and then plot those columns:
b <- data.frame(year=as.numeric(rep(names(a), sapply(a, length))), val=unlist(a))
plot(b)
This will do what you want
do.call(rbind,
lapply(names(a), function(year) data.frame(year = year, obs = a[[year]]))
)
Or break it up into two function calls (lapply and then do.call) to make it more understandable what's going on. It's pretty simple if you go through it. The lapply creates one dataframe per year where each year gets all the values for that year in the list. Now you have 3 dataframes. Then do.call binds these dataframes together.
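The two-step version described above might look like this. It is a sketch using the list a from the question; the intermediate name per_year is made up, and converting the year to numeric is an addition so that plot() draws a scatterplot with a numeric x-axis.

```r
a <- list("1999" = c(1:50), "2000" = rep(35, 20), "2001" = c(100, 101, 103))

# Step 1: lapply builds one data frame per year
per_year <- lapply(names(a), function(year) {
  data.frame(year = as.numeric(year), obs = a[[year]])
})

# Step 2: do.call(rbind, ...) stacks the three data frames into one
b <- do.call(rbind, per_year)

plot(b$year, b$obs, xlab = "year", ylab = "value")
```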
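An option using tibble/tidyr/dplyr/ggplot2:
library(ggplot2)
library(tibble)
library(tidyr)
library(dplyr)
# a is a plain list, so first turn it into a two-column tibble (year, list-column x),
# then unnest the list-column to get one row per value
enframe(a, name = "year", value = "x") %>%
  unnest(x) %>%
  ggplot(aes(x = year, y = x)) +
  geom_point(shape = 1)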
I have a year's worth of data spanning two calendar years. I want to plot boxplots for those data subset by month.
The plots will always be ordered alphabetically (if I use month names) or numerically (if I use month numbers). Neither suits my purpose.
In the example below, I want the months on the x-axis to start at June (2013) and end in May (2014).
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
boxplot(df$x ~ months(df$date), outline = FALSE)
I could probably generate a vector of the months in the order I need (e.g. months <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month")))
Is there a more elegant way to do this? What am I missing?
Are you looking for something like this :
boxplot(df$x ~ reorder(format(df$date,'%b %y'),df$date), outline = FALSE)
I am using reorder to order the month labels by date. I also format the dates to drop the day part, since the boxplot is aggregated by month.
Edit :
If you want to skip year part ( but why ? personally I find this a little bit confusing):
boxplot(df$x ~ reorder(format(df$date,'%B'),df$date), outline = FALSE)
EDIT2 a ggplot2 solution:
Since you are in marketing field and you are learning ggplot2 :)
library(ggplot2)
ggplot(df) +
  geom_boxplot(aes(y = x,
                   x = reorder(format(date, '%B'), date),
                   fill = format(date, '%Y'))) +
  xlab('Month') + guides(fill = guide_legend(title = "Year")) +
  theme_bw()
I had a similar problem where I wanted to order the plot January to December. This seems to be a common cause of vexation for people, so here is my solution:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
months <- month.name
boxplot(x~as.POSIXlt(date)$mon,names=months, outline = FALSE)
Found an answer here - use a factor, not a date:
date <- seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "days")
set.seed(100)
x <- as.integer(abs(rnorm(365))*1000)
df <- data.frame(date, x)
# create a factor whose levels are in the desired order (June to May)
m <- months(seq.Date(as.Date("2013-06-01"), as.Date("2014-05-31"), "month"))
df$months <- factor(months(df$date), levels = m)
# plot with the x-axis in factor-level order
boxplot(df$x ~ df$months, outline = FALSE)