Reshape data in R?

I have mean daily data for different sites organized as shown in figure 1 in this folder.
However, I want to organize this data to look like figure 2 in the same folder.
Using this code, the data were reshaped, but the final values (reshpae_stage_R.csv) didn't match the original values.
Running the code a second time produced this error:
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA.January’
Could you please help me understand why the final values don't match the original values?
Thanks in advance.

Update:
Thanks to @aelwan for catching a bug; the updated code is below:
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
                             regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"),
                                  measure.vars = dateVars[-1],
                                  variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site,
                                   value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
                             as.Date(paste(Day, Month, Year, sep = "-"),
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
  reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
                 variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site)) +
  geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")
This leads to the picture below:
Original answer:
This example requires a fair amount of data munging. You basically have to note the repeating patterns in the data -- the data are site-year measurements organized as day x month tables.
Recipe:
Here is a recipe for creating the desired dataset:
1. Remove the rows & columns in the data that are redundant.
2. Extract the rows that identify the year and the site of the table using pattern matching (grep).
3. From the longer string, extract the year and site name using regular expressions (regexpr and regmatches); a short illustration on a made-up header string follows this list.
4. Find the starting row indices of the tables for each site-year combination and assign the site-year names just extracted to all rows that correspond to that site & year.
5. Now you can go ahead and reshape it into any shape you want. In the code below, the row identifiers are year, month and day, and the columns are the sites.
6. Some cleanup, and you are good to go.
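To make steps 2 and 3 concrete, here is a small sketch on a hypothetical header string (the exact wording of the header rows is an assumption here; the real text comes from column 1 of stage.csv):
# hypothetical header row, only to show how grepl/regexpr/regmatches are used
x <- "Daily means Stage (mm) at Manawatu at Weber Road, Year 1992"
grepl("^Daily means", stringr::str_trim(x))
# [1] TRUE
regmatches(x, regexpr("(?<=Year\\s)([0-9]+)", x, perl = TRUE))
# [1] "1992"
regmatches(x, regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", x, perl = TRUE))
# [1] "Manawatu at Weber Road"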
Code:
Here is code for the recipe above:
# load the required packages
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
dfSiteYearLong = dfSiteYear[rep.int(1:dim(dfSiteYear)[1], 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), measure.vars = dateVars[-1],
variable.name = "Month")
dfStageFinalWide = dcast(dfStageFinalLong, Day + Month + Year ~ Site, value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
as.Date(paste(Day, Month, Year, sep = "-"),
format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
geom_line() + theme_bw() + facet_wrap(~ Site, scale = "free_y")
Output:
Here is what the output looks like:
> head(dfStageFinalWide)
Day Month Year Kumeti at Te Rehunga Makakahi at Hamua Makuri at Tuscan Hills Manawatu at Hopelands Manawatu at Upper Gorge Manawatu at Weber Road Mangahao at Ballance
1 1 Jan 1990 454 NA 700 5133 NA NA NA
2 1 Jan 1991 1002 3643 1416 50 3597 1836 18160
3 1 Jan 1992 3490 34239 8922 3049 1221 417 NA
4 1 Jan 1993 404 NA 396 3408 NA 272 NA
5 1 Jan 1994 NA NA 3189 795 NA 2321 1889
6 1 Jan 1995 16548 1923 69862 4808 NA 6169 94
Mangapapa at Troup Rd Mangatainoka at Larsons Road Mangatainoka at Pahiatua Town Bridge Mangatainoka at Tararua Park Mangatoro at Mangahei Road Oruakeretaki at S.H.2 Napier
1 9406 2767 NA NA 6838 2831
2 4985 2479 823 1078 76 105
3 478 3665 1415 210 394 8247
4 6394 1298 NA 2668 3837 1878
5 14051 3561 NA 2645 807 NA
6 NA 1057 7029 4497 NA NA
Raparapawai at Jackson Rd Tamaki at Stephensons Tiraumea at Ngaturi
1 5189 50444 17951
2 345 416 3025
3 1364 5713 1710
4 3457 28078 8670
5 199 NA 292
6 NA NA 22774
And a picture to bring it all together.

Related

change repeating row value based on values on another column

Based on the data and code below, how can I change the value 1/0/1900 to values based on the corresponding row value in the year column?
Data + code:
df = structure(list(year = c("2005", "2004", "ORIG", "ORIG", "2000-2001",
"2000-2003"), date = c("1/0/1900", "1/0/1900", "1/0/1900", "1/0/1900",
"1/0/1900", "1/0/1900")), class = "data.frame", row.names = c(NA,
-6L))
desired = structure(list(year = c("2005", "2004", "ORIG", "ORIG", "2000-2001",
"2000-2001"), date = c("01/01/2005", "01/01/2004", "01/01/2005", "01/01/2005",
"01/01/2000", "01/01/2000")), class = "data.frame", row.names = c(NA,
-6L))
# Current approach replaces every `1/0/1900` with `01/01/2005`
df = df %>% mutate(date = ifelse(date == "1/0/1900",
                                 "01/01/2005",
                                 date))
This feels clunky, perhaps I'm missing something?
df %>%
  mutate(
    date = paste0(sub("/[^/]*$", "/", date),
                  substring(if_else(year == "ORIG", first(year), year), 1, 4))
  )
# year date
# 1 2005 1/0/2005
# 2 2004 1/0/2004
# 3 ORIG 1/0/2005
# 4 ORIG 1/0/2005
# 5 2000-2001 1/0/2000
# 6 2000-2003 1/0/2000
However, if you are finding /1900 dates in your data, that suggests that a database somewhere had a "null" that was converted into a zero-date like value. Perhaps it should be handled more directly at the data source?
Regex explanation:
/[^/]*$
^         literal '/'
 ^^^^^    any characters that are not the literal '/'
      ^   end of string
Essentially: remove the trailing run of non-'/' characters from the end of the string. We could also have used substring(date, 1, nchar(date) - 4) in place of sub(..):
df %>%
  mutate(
    date = paste0(substring(date, 1, nchar(date) - 4),
                  substring(if_else(year == "ORIG", first(year), year), 1, 4))
  )
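If you want to match the desired output exactly (a zero-padded 01/01/<year> rather than keeping the original 1/0/ prefix), here is a hedged sketch, assuming every placeholder date should become January 1 of the chosen year:
df %>%
  mutate(date = if_else(date == "1/0/1900",
                        paste0("01/01/", substring(if_else(year == "ORIG", first(year), year), 1, 4)),
                        date))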
We could use case_when to replace the last 4 digits of 'date' with 2005 where 'year' is 'ORIG', or otherwise with a substring of the year column:
library(dplyr)
library(stringr)
df %>%
  mutate(date = case_when(year == 'ORIG' ~ str_replace(date, '\\d{4}$', '2005'),
                          TRUE ~ str_replace(date, '\\d{4}$', substr(year, 1, 4))))
output
year date
1 2005 1/0/2005
2 2004 1/0/2004
3 ORIG 1/0/2005
4 ORIG 1/0/2005
5 2000-2001 1/0/2000
6 2000-2003 1/0/2000

Multiple grouping variables using kableExtra

I've got the following dataset:
tab <- tibble(year = c(2017, 2017, 2017, 2018, 2018, 2018),
              mth = c("Apr", "Apr", "Jun", "Jul", "Jul", "Sep"),
              var1 = 1:6,
              var2 = 10:15)
Is it possible to use kableExtra to generate a table of this data where there are two grouping variables, year and month? This would give:
          var1  var2
2017
  Apr
             1    10
             2    11
  Jun
             3    12
2018
  Jul
             4    13
             5    14
  Sep
             6    15
I've tried:
kable(tab[, 3:4]) %>% pack_rows(index = table(tab$year, tab$mth))
It works fine with one grouping variable, but it doesn't work for two grouping variables.
This tutorial has great examples and explains how to do this.
library(dplyr)
library(kableExtra)
kable(tab, align = "c", col.names = c("","",names(tab)[3:4])) %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T) %>%
collapse_rows(columns = 1:2, valign = "top")
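If you would rather stay with pack_rows(), a hedged workaround (a sketch only, and it gives a single combined grouping label rather than the nested two-level layout above) is to paste year and month together and build the index in row order with rle(), since table() sorts group labels alphabetically:
library(dplyr)
library(kableExtra)
# one combined grouping label per row
grp <- paste(tab$year, tab$mth)
# group sizes in the order the rows appear
runs <- rle(grp)
idx <- setNames(runs$lengths, runs$values)
kable(tab[, 3:4], align = "c") %>%
  kable_styling(full_width = FALSE) %>%
  pack_rows(index = idx)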

How to put/save all elements of a List into one Excel sheet in R?

I have a list (bbb) with 5 elements, one element per year (2010, 2011, ..., 2014):
The first one in the list is this:
> bbb[1]
$`2010`
Date Average
X2010.01.01 2010-01-01 2.079090e-03
X2010.01.02 2010-01-02 5.147627e-04
X2010.01.03 2010-01-03 2.997464e-04
X2010.01.04 2010-01-04 1.375538e-04
X2010.01.05 2010-01-05 1.332109e-04
The second one in the list is this:
> bbb[2]
$`2011`
Date Average
X2011.01.01 2011-01-01 1.546253e-03
X2011.01.02 2011-01-02 1.152864e-03
X2011.01.03 2011-01-03 1.752446e-03
X2011.01.04 2011-01-04 2.639658e-03
X2011.01.05 2011-01-05 5.231150e-03
X2011.01.06 2011-01-06 8.909878e-04
And so on.
Here is my question:
How can I save all of the list's elements in one sheet of an Excel file, to get something like this:
Your help would be highly appreciated.
You can do this using dcast.
bbb <- list(`2010` = data.frame(date = as.Date("2010-01-01") + 0:4,
                                avg = 1:5),
            `2011` = data.frame(date = as.Date("2011-01-01") + 0:5,
                                avg = 11:16),
            `2012` = data.frame(date = as.Date("2012-01-01") + 0:9,
                                avg = 21:30),
            `2013` = data.frame(date = as.Date("2013-01-01") + 0:7,
                                avg = 21:28))
df <- do.call("rbind", bbb)
df$year <- format(df$date, format = "%Y")
df$month_date <- format(df$date, format = "%b-%d")
library(data.table)
library(openxlsx)
df_dcast <- dcast(df, month_date~year, value.var = "avg")
write.xlsx(df_dcast, "example1.xlsx")
Or using spread
library(dplyr)
library(tidyr)
df2 <- df %>%
select(-date) %>%
spread(key = year, value = avg)
write.xlsx(df2, "example2.xlsx")
This isn't very pretty, but it's the best I could think of right now: take the data frames, loop through the list, and join them by date like this:
library(tidyverse)
library(lubridate)
bbb <- list(`2010` = tibble(date = c('01-01-2010', '01-02-2010', '01-03-2010', '01-04-2010', '01-05-2010'),
                            average = 11:15),
            `2011` = tibble(date = c('01-01-2011', '01-02-2011', '01-03-2011', '01-04-2011', '01-05-2011'),
                            average = 1:5),
            `2012` = tibble(date = c('01-01-2012', '01-02-2012', '01-03-2012', '01-04-2012', '01-05-2012'),
                            average = 6:10))
for (i in seq_along(bbb)) {
  if (i == 1) {
    df <- bbb[[i]] %>%
      mutate(
        date = paste(day(as.Date(date, format = '%m-%d-%Y')),
                     month(as.Date(date, format = '%m-%d-%Y'), label = TRUE),
                     sep = '-')
      )
    colnames(df) <- c('date', names(bbb[i])) # Assuming your list of dataframes has just 2 columns: date and average
  } else {
    join_df <- bbb[[i]] %>%
      mutate(
        date = paste(day(as.Date(date, format = '%m-%d-%Y')),
                     month(as.Date(date, format = '%m-%d-%Y'), label = TRUE),
                     sep = '-')
      )
    colnames(join_df) <- c('date', names(bbb[i]))
    df <- full_join(df, join_df, by = 'date')
  }
}
This loops through the list of dataframes and reformats the dates to Day-Month.
# A tibble: 5 x 4
date `2010` `2011` `2012`
<chr> <int> <int> <int>
1 1-Jan 11 1 6
2 2-Jan 12 2 7
3 3-Jan 13 3 8
4 4-Jan 14 4 9
5 5-Jan 15 5 10
You could then write that out with the write_xlsx function from the writexl package.
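For example (a minimal sketch; the file name is just a placeholder, and df is assumed to be the joined tibble produced by the loop above):
library(writexl)
# one sheet with a date column plus one column per year
write_xlsx(df, "bbb_one_sheet.xlsx")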

Calculate multiple moving calculations in one statement

I want to calculate all moving averages in one statement rather than repeating myself. Is this possible using quantmod or does it require some clever use of tidyeval and/or purrr?
library(tidyquant)
library(quantmod)
library(zoo)
tibble(date = as.Date('2018-01-01') + days(1:100),
       value = 100 + cumsum(rnorm(100))) %>%
  tq_mutate(mutate_fun = rollapply, select = "value", width = 10, FUN = mean, col_rename = "rm10") %>%
  tq_mutate(mutate_fun = rollapply, select = "value", width = 5, FUN = mean, col_rename = "rm5") %>%
  gather(series, value, -date) %>%
  ggplot(aes(date, value, color = series)) +
  geom_line()
Here is a solution using data.table's frollmean() function (data.table v1.12.0 or higher is required).
sample data
library(data.table)
library(lubridate)  # for days(); alternatively use + 1:100
set.seed(123)
dt <- data.table(date = as.Date('2018-01-01') + days(1:100),
                 value = 100 + cumsum(rnorm(100)))
code
# set the windows you want to roll over
windows <- c(5, 10)
# create an rm<window> column for each rolling mean
dt[, (paste0("rm", windows)) := lapply(windows, function(x) frollmean(value, x))]
output
head( dt, 15 )
# date value rm5 rm10
# 1: 2018-01-02 99.43952 NA NA
# 2: 2018-01-03 99.20935 NA NA
# 3: 2018-01-04 100.76806 NA NA
# 4: 2018-01-05 100.83856 NA NA
# 5: 2018-01-06 100.96785 100.2447 NA
# 6: 2018-01-07 102.68292 100.8933 NA
# 7: 2018-01-08 103.14383 101.6802 NA
# 8: 2018-01-09 101.87877 101.9024 NA
# 9: 2018-01-10 101.19192 101.9731 NA
# 10: 2018-01-11 100.74626 101.9287 101.0867
# 11: 2018-01-12 101.97034 101.7862 101.3398
# 12: 2018-01-13 102.33015 101.6235 101.6519
# 13: 2018-01-14 102.73092 101.7939 101.8482
# 14: 2018-01-15 102.84161 102.1239 102.0485
# 15: 2018-01-16 102.28577 102.4318 102.1802
plot
#plot molten data
library(ggplot2)
ggplot( data = melt(dt, id.vars = c("date") ),
aes(x = date, y = value, colour = variable)) +
geom_line()
update - grouped data
library(data.table)
library(ggplot2)
library(lubridate)  # for days()
set.seed(123)
# changed the sample data a bit, to get different values for grp = 1 and grp = 2
dt <- data.table(grp = rep(1:2, each = 100),
                 date = rep(as.Date('2018-01-01') + days(1:100), 2),
                 value = 100 + cumsum(rnorm(200)))
dt[, ( paste0( "rm", windows ) ) := lapply( windows, function(x) frollmean( value, x)), by = "grp" ]
ggplot( data = melt(dt, id.vars = c("date", "grp") ),
aes(x = date, y = value, colour = variable)) +
geom_line() +
facet_wrap(~grp, nrow = 1)
In this example I use the AAPL adjusted close price downloaded using the getSymbols function from quantmod (getSymbols("AAPL") assigns an xts object named AAPL to the workspace).
Let's say you want SMAs with the following lengths:
smaLength = c(30,35,40,46,53,61,70,81,93)
Now create the SMAs like so (the %>% pipe comes from magrittr):
lapply(smaLength, function(x) SMA(AAPL$AAPL.Adjusted, x)) %>% do.call(cbind, .) %>% tail()
result:
SMA SMA.1 SMA.2 SMA.3 SMA.4 SMA.5 SMA.6 SMA.7 SMA.8
2019-03-04 167.3703 165.2570 163.3706 162.1362 161.5904 162.9735 164.7770 169.3341 175.4143
2019-03-05 168.0162 165.9396 164.0682 162.5499 161.7934 162.8342 164.6408 168.9595 174.9418
2019-03-06 168.7454 166.6585 164.7488 162.9638 162.0062 162.8110 164.6165 168.6446 174.5135
2019-03-07 169.3866 167.2323 165.3086 163.3320 162.1409 162.7868 164.5661 168.2780 174.0284
2019-03-08 170.0820 167.7646 165.8150 163.6764 162.3807 162.8711 164.5855 167.8407 173.5334
2019-03-11 170.8092 168.4419 166.4589 164.1471 162.8097 163.0354 164.6573 167.4864 173.0806
Define the input, then lapply over the widths, creating a rolling mean for each width and merging them together. Finally, plot it.
library(ggplot2)
library(magrittr)
library(zoo)
set.seed(123)
w <- c(1, 5, 10)
zoo(100 * cumsum(rnorm(100)), as.Date("2018-01-01") + 1:100) %>%
lapply(w, rollmeanr, x = .) %>%
do.call("merge", .) %>%
setNames(w) %>%
autoplot(facet = NULL)
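Since the question also asks about purrr: here is a hedged sketch of the same lapply-over-widths idea with purrr, assuming the question's sample tibble has been saved as df with date and value columns:
library(dplyr)
library(purrr)
library(zoo)
widths <- c(5, 10)
# map_dfc builds one right-aligned rolling-mean column per width (rm5, rm10)
df %>%
  bind_cols(map_dfc(set_names(widths, paste0("rm", widths)),
                    ~ rollapplyr(df$value, .x, mean, fill = NA)))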

My original data is weekly data, how do I plot it as weekly data in r?

My data are originally weekly (example below). I find it difficult to work with them as a time series since dates are usually expected in dd-mm-yy form.
WEEK SALES
1: 29.2010 60.48
2: 30.2010 95.76
3: 31.2010 51.66
4: 32.2010 73.71
5: 33.2010 22.05
Thanks in advance!
We can convert the week number to a date using functions from the lubridate package, and then plot the date on the x-axis and SALES on the y-axis.
library(tidyverse)
library(lubridate)
dat2 <- dat %>%
separate(WEEK, into = c("WEEK", "YEAR"), convert = TRUE) %>%
mutate(Date = ymd("2010-01-01") + weeks(WEEK - 1))
ggplot(dat2, aes(x = Date, y = SALES)) +
geom_line()
DATA
dat <- read.table(text = " WEEK SALES
1 '29.2010' 60.48
2 '30.2010' 95.76
3 '31.2010' 51.66
4 '32.2010' 73.71
5 '33.2010' 22.05",
header = TRUE, stringsAsFactors = FALSE,
colClasses = c("character", "character", "numeric"))
UPDATE
If the data are from different years, we can use the following code.
dat2 <- dat %>%
separate(WEEK, into = c("WEEK", "YEAR"), convert = TRUE) %>%
mutate(Date = ymd(paste(YEAR, "01", "01", sep = "-")) + weeks(WEEK - 1))
DATA
dat <- read.table(text = " WEEK SALES
1 '29.2010' 60.48
2 '30.2010' 95.76
3 '31.2010' 51.66
4 '32.2010' 73.71
5 '33.2010' 22.05
6 '1.2011' 37.5
7 '2.2011' 45.2
8 '3.2011' 62.9",
header = TRUE, stringsAsFactors = FALSE,
colClasses = c("character", "character", "numeric"))
