Related
I have a dataset looks like this
year china India United state ....
2020 30 40 50
2021 20 30 60
2022 34 20 40
....
I have 10 columns and more than 50 rows in this dataframe. I have to plot them in one graph to show the movement of different countries.
So I think line graph would be good for the purpose.But I don't know how should I do the visulisation.
I think I shuold change the dataframe format and then start visulisation. How should I do it?
Pivot (reshape from wide to long) then plot with groups.
dat <- structure(list(year = 2020:2022, China = c(30L, 20L, 34L), India = c(40L, 30L, 20L), UnitedStates = c(50L, 60L, 40L)), class = "data.frame", row.names = c(NA, -3L))
datlong <- reshape2::melt(dat, "year", variable.name = "country", value.name = "value")
datlong
# year country value
# 1 2020 China 30
# 2 2021 China 20
# 3 2022 China 34
# 4 2020 India 40
# 5 2021 India 30
# 6 2022 India 20
# 7 2020 UnitedStates 50
# 8 2021 UnitedStates 60
# 9 2022 UnitedStates 40
### or using tidyr::
tidyr::pivot_longer(dat, -year, names_to = "country", values_to = "value")
Once reshaped, just group= (and optionally color=) lines:
library(ggplot2)
ggplot(datlong, aes(year, value, color = country)) +
geom_line(aes(group = country))
If you have many more years, the decimal-years in the axis will likely smooth out. You can alternately control it by converting year to a Date-class and forcing the display with scale_x_date.
I have a data frame in R which looks like below
Model Month Demand Inventory
A Jan 10 20
B Feb 30 40
A Feb 40 60
I want the data frame to look
Jan Feb
A_Demand 10 40
A_Inventory 20 60
A_coverage
B_Demand 30
B_Inventory 40
B_coverage
A_coverage and B_Coverage will be calculated in excel using a formula. But the problem I need help with is to pivot the data frame from wide to long format (original format).
I tried to implement the solution from the linked duplicate but I am still having difficulty:
HD_dcast <- reshape(data,idvar = c("Model","Inventory","Demand"),
timevar = "Month", direction = "wide")
Here is a dput of my data:
data <- structure(list(Model = c("A", "B", "A"), Month = c("Jan", "Feb",
"Feb"), Demand = c(10L, 30L, 40L), Inventory = c(20L, 40L, 60L
)), class = "data.frame", row.names = c(NA, -3L))
Thanks
Here's an approach with dplyr and tidyr, two popular R packages for data manipulation:
library(dplyr)
library(tidyr)
data %>%
mutate(coverage = NA_real_) %>%
pivot_longer(-c(Model,Month), names_to = "Variable") %>%
pivot_wider(id_cols = c(Model, Variable), names_from = Month ) %>%
unite(Variable, c(Model,Variable), sep = "_")
## A tibble: 6 x 3
# Variable Jan Feb
# <chr> <dbl> <dbl>
#1 A_Demand 10 40
#2 A_Inventory 20 60
#3 A_coverage NA NA
#4 B_Demand NA 30
#5 B_Inventory NA 40
#6 B_coverage NA NA
I am using src_postgres to connect and dplyr::tbl function to fetch data from redshift database. I have applied some filters and top function to it using the dplyr itself. Now my data looks as below:
riid day hour
<dbl> <chr> <chr>
1 5542. "THURSDAY " 12
2 5862. "FRIDAY " 15
3 5982. "TUESDAY " 15
4 6022. WEDNESDAY 16
My final output should be as below:
riid MON TUES WED THUR FRI SAT SUN
5542 12
5862 15
5988 15
6022 16
I have tried spread. It throws the below error because of the class type:
Error in UseMethod("spread_") : no applicable method for 'spread_'
applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy',
'tbl')"
Since this is a really big table, I do not want to use dataframe as it takes a longer time.
I was able to use as below:
df_mon <- df2 %>% filter(day == 'MONDAY') %>% mutate(MONDAY = hour) %>% select(riid,MONDAY)
df_tue <- df2 %>% filter(day == 'TUESDAY') %>% mutate(TUESDAY = hour) %>% select(riid,TUESDAY)
df_wed <- df2 %>% filter(day == 'WEDNESDAY') %>% mutate(WEDNESDAY = hour) %>% select(riid,WEDNESDAY)
df_thu <- df2 %>% filter(day == 'THURSDAY') %>% mutate(THURSDAY = hour) %>% select(riid,THURSDAY)
df_fri <- df2 %>% filter(day == 'FRIDAY') %>% mutate(FRIDAY = hour) %>% select(riid,FRIDAY)
Is it possible to write all above in one statement?
Any help to transpose this in a faster manner is really appreciated.
EDIT
Adding the dput of the tbl object:
structure(list(src = structure(list(con = <S4 object of class structure("PostgreSQLConnection", package = "RPostgreSQL")>,
disco = <environment>), .Names = c("con", "disco"), class = c("src_dbi",
"src_sql", "src")), ops = structure(list(name = "select", x = structure(list(
name = "filter", x = structure(list(name = "filter", x = structure(list(
name = "group_by", x = structure(list(x = structure("SELECT riid,day,hour,sum(weightage) AS score FROM\n (SELECT riid,day,hour,\n POWER(2,(cast(datediff (seconds,convert_timezone('UTC','PKT',SYSDATE),TO_DATE(TO_CHAR(event_captured_dt,'mm/dd/yyyy hh24:mi:ss'),'mm/dd/yyyy hh24:mi:ss')) as decimal) / cast(7862400 as decimal))) AS weightage\n FROM (\n SELECT riid,convert_timezone('GMT','PKT',event_captured_dt) AS EVENT_CAPTURED_DT,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'DAY') AS day,\n TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'HH24') AS hour\n FROM Zameen_STO_DATA WHERE EVENT_CAPTURED_DT >= TO_DATE((sysdate -30),'yyyy-mm-dd') and LIST_ID = 4282\n )) group by riid,day,hour", class = c("sql",
"character")), vars = c("riid", "day", "hour", "score"
)), .Names = c("x", "vars"), class = c("op_base_remote",
"op_base", "op")), dots = structure(list(riid = riid,
day = day), .Names = c("riid", "day")), args = structure(list(
add = FALSE), .Names = "add")), .Names = c("name",
"x", "dots", "args"), class = c("op_group_by", "op_single",
"op")), dots = structure(list(~min_rank(desc(~score)) <=
1), .Names = ""), args = list()), .Names = c("name",
"x", "dots", "args"), class = c("op_filter", "op_single",
"op")), dots = structure(list(~row_number() == 1), .Names = ""),
args = list()), .Names = c("name", "x", "dots", "args"), class = c("op_filter",
"op_single", "op")), dots = structure(list(~riid, ~day, ~hour), class = "quosures", .Names = c("",
"", "")), args = list()), .Names = c("name", "x", "dots", "args"
), class = c("op_select", "op_single", "op"))), .Names = c("src",
"ops"), class = c("tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"))
I think what you're looking for is the ability to run the tidyr::spread() function against a remote source, or database. I have a PR for dbplyr that attempts to implement that here: https://github.com/tidyverse/dbplyr/pull/72, you can try it out by using: devtools::install_github("tidyverse/dbplyr", ref = devtools::github_pull(72)).
Use dcast from reshape2 package
> data
# A tibble: 4 x 3
riid day hour
<dbl> <chr> <dbl>
1 1.00 TH 12.0
2 2.00 FR 15.0
3 3.00 TU 15.0
4 4.00 WE 16.0
> dcast(data, riid~day, value.var = "hour")
riid FR TH TU WE
1 1 NA 12 NA NA
2 2 15 NA NA NA
3 3 NA NA 15 NA
4 4 NA NA NA 16
Further if you want to remove NA, then
> z <- dcast(data, riid~day, value.var = "hour")
> z[is.na(z)] <- ""
> z
riid FR TH TU WE
1 1 12
2 2 15
3 3 15
4 4 16
I tried to combine your multiple line attempts into one. Can you try this and let us know the outcome?
library(dplyr)
df %>%
rowwise() %>%
mutate(Mon = ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
Tue = ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
Wed = ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
Thu = ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
Fri = ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
Sat = ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
Sun = ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA)) %>%
select(-day, -hour)
Output is:
riid Mon Tue Wed Thu Fri Sat Sun
1 5542 NA NA NA 12 NA NA NA
2 5862 NA NA NA NA 15 NA NA
3 5982 NA 15 NA NA NA NA NA
4 6022 NA NA 16 NA NA NA NA
Sample data:
# A tibble: 4 x 3
riid day hour
* <dbl> <chr> <int>
1 5542 THURSDAY 12
2 5862 FRIDAY 15
3 5982 TUESDAY 15
4 6022 WEDNESDAY 16
Update:
Can you try below approach using data.table?
library(data.table)
dt <- setDT(df)[, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") :=
list(ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA))][, !c("day","hour"), with=F]
I perform forecast
w=read.csv("C:/Users/admin/Documents/aggrmonth.csv", sep=";",dec=",")
w
#create time series object
w=ts(w$new,frequency = 12,start=c(2015,1))
w
#timeplot
plot.ts(w)
#forecast for the next months
library("forecast")
m <- stats::HoltWinters(w)
test=forecast:::forecast.HoltWinters(m,h=4) #h is how much month do you want to predict
test
now i want get forecast for 4 months ahead.
From 01.2017-04.2017. I this know original values.
1-Jan-17 1020
1-Feb-17 800
1-Mar-17 1130
1-Apr-17 600
But i need get plot where displayed predicted values with CI are overlapped with original value.
Of course if i don't clearly exlplain, i attached the plot.
The green curve is the initial value of the series(my 4 months)
and green dotted line is predictied values are overlapped on original values.
Dashes on the predicted dotted curve are confidence intervals.
How to create such plot
w=
structure(list(yearMon = structure(c(9L, 7L, 15L, 1L, 17L, 13L,
11L, 3L, 23L, 21L, 19L, 5L, 10L, 8L, 16L, 2L, 18L, 14L, 12L,
4L, 24L, 22L, 20L, 6L), .Label = c("1-Apr-15", "1-Apr-16", "1-Aug-15",
"1-Aug-16", "1-Dec-15", "1-Dec-16", "1-Feb-15", "1-Feb-16", "1-Jan-15",
"1-Jan-16", "1-Jul-15", "1-Jul-16", "1-Jun-15", "1-Jun-16", "1-Mar-15",
"1-Mar-16", "1-May-15", "1-May-16", "1-Nov-15", "1-Nov-16", "1-Oct-15",
"1-Oct-16", "1-Sep-15", "1-Sep-16"), class = "factor"), new = c(8575L,
8215L, 16399L, 16415L, 15704L, 19805L, 17484L, 18116L, 19977L,
14439L, 9258L, 12259L, 4909L, 9539L, 8802L, 11253L, 11971L, 7838L,
2095L, 4157L, 3910L, 1306L, 3429L, 1390L)), .Names = c("yearMon",
"new"), class = "data.frame", row.names = c(NA, -24L))
We can use ggfortify to create a data frame then plot both timeseries with ggplot2
# Load required libraries
library(lubridate)
library(magrittr)
library(tidyverse)
library(scales)
library(forecast)
library(ggfortify)
w <- structure(list(yearMon = structure(c(9L, 7L, 15L, 1L, 17L, 13L,
11L, 3L, 23L, 21L, 19L, 5L, 10L, 8L, 16L, 2L, 18L, 14L, 12L,
4L, 24L, 22L, 20L, 6L), .Label = c("1-Apr-15", "1-Apr-16", "1-Aug-15",
"1-Aug-16", "1-Dec-15", "1-Dec-16", "1-Feb-15", "1-Feb-16", "1-Jan-15",
"1-Jan-16", "1-Jul-15", "1-Jul-16", "1-Jun-15", "1-Jun-16", "1-Mar-15",
"1-Mar-16", "1-May-15", "1-May-16", "1-Nov-15", "1-Nov-16", "1-Oct-15",
"1-Oct-16", "1-Sep-15", "1-Sep-16"), class = "factor"), new = c(8575L,
8215L, 16399L, 16415L, 15704L, 19805L, 17484L, 18116L, 19977L,
14439L, 9258L, 12259L, 4909L, 9539L, 8802L, 11253L, 11971L, 7838L,
2095L, 4157L, 3910L, 1306L, 3429L, 1390L)), .Names = c("yearMon",
"new"), class = "data.frame", row.names = c(NA, -24L))
# create time series object
w = ts(w$new, frequency = 12, start=c(2015, 1))
w
#> Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
#> 2015 8575 8215 16399 16415 15704 19805 17484 18116 19977 14439 9258
#> 2016 4909 9539 8802 11253 11971 7838 2095 4157 3910 1306 3429
#> Dec
#> 2015 12259
#> 2016 1390
# forecast for the next months
m <- stats::HoltWinters(w)
# h is how much month do you want to predict
pred = forecast:::forecast.HoltWinters(m, h=4)
pred
#> Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
#> Jan 2017 -5049.00381 -9644.003 -454.0045 -12076.449 1978.441
#> Feb 2017 37.44605 -5599.592 5674.4843 -8583.660 8658.552
#> Mar 2017 -256.41474 -6770.890 6258.0601 -10219.444 9706.615
#> Apr 2017 2593.09445 -4693.919 9880.1079 -8551.431 13737.620
# plot
plot(pred, include = 24, showgap = FALSE)
# Convert pred from list to data frame object
df1 <- fortify(pred) %>% as_tibble()
# Create Date column, remove Index column and rename other columns
df1 %<>%
mutate(Date = as.Date(Index, "%Y-%m-%d")) %>%
select(-Index) %>%
rename("Low95" = "Lo 95",
"Low80" = "Lo 80",
"High95" = "Hi 95",
"High80" = "Hi 80",
"Forecast" = "Point Forecast")
df1
#> # A tibble: 28 x 8
#> Data Fitted Forecast Low80 High80 Low95 High95 Date
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date>
#> 1 8575 NA NA NA NA NA NA 2015-01-01
#> 2 8215 NA NA NA NA NA NA 2015-02-01
#> 3 16399 NA NA NA NA NA NA 2015-03-01
#> 4 16415 NA NA NA NA NA NA 2015-04-01
#> 5 15704 NA NA NA NA NA NA 2015-05-01
#> 6 19805 NA NA NA NA NA NA 2015-06-01
#> 7 17484 NA NA NA NA NA NA 2015-07-01
#> 8 18116 NA NA NA NA NA NA 2015-08-01
#> 9 19977 NA NA NA NA NA NA 2015-09-01
#> 10 14439 NA NA NA NA NA NA 2015-10-01
#> # ... with 18 more rows
### Avoid the gap between data and forcast
# Find the last non missing NA values in obs then use that
# one to initialize all forecast columns
lastNonNAinData <- max(which(complete.cases(df1$Data)))
df1[lastNonNAinData,
!(colnames(df1) %in% c("Data", "Fitted", "Date"))] <- df1$Data[lastNonNAinData]
ggplot(df1, aes(x = Date)) +
geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
geom_point(aes(y = Data, colour = "Data"), size = 4) +
geom_line(aes(y = Data, group = 1, colour = "Data"),
linetype = "dotted", size = 0.75) +
geom_line(aes(y = Fitted, group = 2, colour = "Fitted"), size = 0.75) +
geom_line(aes(y = Forecast, group = 3, colour = "Forecast"), size = 0.75) +
scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
scale_fill_brewer(name = "Intervals") +
guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
theme_bw(base_size = 14)
Edit: To included known values from "2017-01-01" to "2017-04-01"
# Create new column which has known values
df1$Obs <- NA
df1$Obs[(nrow(df1)-3):(nrow(df1))] <- c(1020, 800, 1130, 600)
ggplot(df1, aes(x = Date)) +
geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
geom_point(aes(y = Data, colour = "Data"), size = 4) +
geom_line(aes(y = Data, group = 1, colour = "Data"),
linetype = "dotted", size = 0.75) +
geom_line(aes(y = Fitted, group = 2, colour = "Fitted"), size = 0.75) +
geom_line(aes(y = Forecast, group = 3, colour = "Forecast"), size = 0.75) +
scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
scale_fill_brewer(name = "Intervals") +
guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
theme_bw(base_size = 14) +
geom_line(aes(y = Obs, group = 4, colour = "Obs"), linetype = "dotted", size = 0.75)
Or put those values directly into Data column
df1$Data[(nrow(df1)-3):(nrow(df1))] <- c(1020, 800, 1130, 600)
ggplot(df1, aes(x = Date)) +
geom_ribbon(aes(ymin = Low95, ymax = High95, fill = "95%")) +
geom_ribbon(aes(ymin = Low80, ymax = High80, fill = "80%")) +
geom_point(aes(y = Data, colour = "Data"), size = 3) +
geom_line(aes(y = Data, group = 1, colour = "Data"),
linetype = "dotted", size = 0.75) +
geom_line(aes(y = Fitted, group = 2, colour = "Fitted"), size = 0.75) +
geom_line(aes(y = Forecast, group = 3, colour = "Forecast"), size = 0.75) +
scale_x_date(breaks = scales::pretty_breaks(), date_labels = "%b %y") +
scale_colour_brewer(name = "Legend", type = "qual", palette = "Dark2") +
scale_fill_brewer(name = "Intervals") +
guides(colour = guide_legend(order = 1), fill = guide_legend(order = 2)) +
theme_bw(base_size = 14)
Created on 2018-04-21 by the reprex package (v0.2.0).
I have a table with the following headers and example data
Lat Long Date Value.
30.497478 -87.880258 01/01/2016 10
30.497478 -87.880258 01/02/2016 15
30.497478 -87.880258 01/05/2016 20
33.284928 -85.803608 01/02/2016 10
33.284928 -85.803608 01/03/2016 15
33.284928 -85.803608 01/05/2016 20
I would like to average the value column on monthly basis for a particular location.
So example output would be
Lat Long Month Avg Value
30.497478 -87.880258 January 15
A solution using dplyr and lubridate.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Date = mdy(Date), Month = month(Date)) %>%
group_by(Lat, Long, Month) %>%
summarise(`Avg Value` = mean(Value))
dt2
# A tibble: 2 x 4
# Groups: Lat, Long [?]
Lat Long Month `Avg Value`
<dbl> <dbl> <dbl> <dbl>
1 30.49748 -87.88026 1 15
2 33.28493 -85.80361 1 15
You can try the following, but it first modifies the data frame adding an extra column, Month, using package zoo.
library(zoo)
dat$Month <- as.yearmon(as.Date(dat$Date, "%m/%d/%Y"))
aggregate(Value. ~ Lat + Long + Month, dat, mean)
# Lat Long Month Value.
#1 30.49748 -87.88026 jan 2016 15
#2 33.28493 -85.80361 jan 2016 15
If you don't want to change the original data, make a copy dat2 <- dat and change the copy.
DATA
dat <-
structure(list(Lat = c(30.497478, 30.497478, 30.497478, 33.284928,
33.284928, 33.284928), Long = c(-87.880258, -87.880258, -87.880258,
-85.803608, -85.803608, -85.803608), Date = structure(c(1L, 2L,
4L, 2L, 3L, 4L), .Label = c("01/01/2016", "01/02/2016", "01/03/2016",
"01/05/2016"), class = "factor"), Value. = c(10L, 15L, 20L, 10L,
15L, 20L)), .Names = c("Lat", "Long", "Date", "Value."), class = "data.frame", row.names = c(NA,
-6L))
EDIT.
If you want to compute several statistics, you can define a function that computes them and returns a named vector and call it in aggregate, like the following.
stat <- function(x){
c(Mean = mean(x), Median = median(x), SD = sd(x))
}
agg <- aggregate(Value. ~ Lat + Long + Month, dat, stat)
agg <- cbind(agg[1:3], as.data.frame(agg[[4]]))
agg
# Lat Long Month Mean Median SD
#1 30.49748 -87.88026 jan 2016 15 15 5
#2 33.28493 -85.80361 jan 2016 15 15 5