Pivoting one column while keeping the rest in R [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I am quite new to coding in R, and I'm working on cleaning and transforming some data.
I have looked at different uses of reshape() and the reshape2 package's cast functions to help me, but I have not been able to succeed.
Basically, what I would like to do is move one column up to serve as column headers for the values.
This is my data:
#My data:
KEYFIGURE LOCID PRDID KEYFIGUREDATE KEYFIGUREVALUE
Sales 1001 A 2018-01-01 1
Promo 1001 A 2018-01-02 2
Disc 1001 A 2018-01-03 3
Sales 1001 B 2018-01-01 10
Promo 1001 B 2018-01-01 11
Disc 1002 B 2018-01-03 12
The result i would like to get:
LOCID PRDID KEYFIGUREDATE Sales Promo Disc
1001 A 2018-01-01 1 2
1001 A 2018-01-03 3
1001 B 2018-01-01 10 11
1002 B 2018-01-03 12
However, I am having quite some trouble figuring out how this is possible in a smart way with the reshape package.

You can do this in one line with tidyr::spread:
library(tidyr)
df %>%
spread(KEYFIGURE, KEYFIGUREVALUE)
LOCID PRDID KEYFIGUREDATE Disc Promo Sales
1 1001 A 2018-01-01 NA NA 1
2 1001 A 2018-01-02 NA 2 NA
3 1001 A 2018-01-03 3 NA NA
4 1001 B 2018-01-01 NA 11 10
5 1002 B 2018-01-03 12 NA NA
The way the function works is that you give it 2 variables in your dataset: the first is the variable to spread across multiple columns, while the second is the variable that sets the values to put in those cells.
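Note that spread() has since been superseded: in tidyr 1.0.0 and later the recommended equivalent is pivot_wider(). A minimal sketch under that assumption, rebuilding the question's data so it runs standalone:

```r
library(tidyr)

# Rebuild the example data from the question
df <- data.frame(
  KEYFIGURE = c("Sales", "Promo", "Disc", "Sales", "Promo", "Disc"),
  LOCID = c(1001, 1001, 1001, 1001, 1001, 1002),
  PRDID = c("A", "A", "A", "B", "B", "B"),
  KEYFIGUREDATE = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03",
                            "2018-01-01", "2018-01-01", "2018-01-03")),
  KEYFIGUREVALUE = c(1, 2, 3, 10, 11, 12)
)

# names_from: the column whose values become new column headers
# values_from: the column supplying the cell values
pivot_wider(df, names_from = KEYFIGURE, values_from = KEYFIGUREVALUE)
```

The two arguments map directly onto spread()'s key and value arguments.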

r data.table : lagging a date variable [duplicate]

This question already has answers here:
How to create a lag variable within each group?
(5 answers)
Closed 2 years ago.
I have data that looks similar to the following except with hundreds of IDs and thousands of observations:
ID date measles
1 2008-09-12 1
1 2008-10-25 NA
1 2009-01-12 1
1 2009-03-12 NA
1 2009-05-12 1
2 2010-05-17 NA
2 2010-06-12 NA
2 2010-07-02 1
2 2010-08-13 NA
I want to create a variable that stores the previous date for each ID, like the following:
ID date measles previous_date
1 2008-09-12 1 NA
1 2008-10-25 NA 2008-09-12
1 2009-01-12 1 2008-10-25
1 2009-03-12 NA 2009-01-12
1 2009-05-12 1 2009-03-12
2 2010-05-17 NA NA
2 2010-06-12 NA 2010-05-17
2 2010-07-02 1 2010-06-12
2 2010-08-13 NA 2010-07-02
This should be an extremely easy task, but I have been unsuccessful at getting a lag variable to work properly. I have tried a few methods, such as the following:
dt[, previous_date:=c(NA, current_date[-.N]), by=c("ID")]
dt[,previous_date:=current_date-shift(current_date,1,type="lag"),by=ID]
The code samples above either produce sporadic numbers in the previous_date variable or produce all NAs, and I'm not sure why. Is it because I'm using a date variable as opposed to an integer?
Is there a better way to accomplish this task that would work for a date variable?
We can just use shift on the 'date' column, grouped by 'ID'. By default the type is "lag".
library(data.table)
dt[, previous_date := shift(date), ID]
dt
# ID date measles previous_date
#1: 1 2008-09-12 1 <NA>
#2: 1 2008-10-25 NA 2008-09-12
#3: 1 2009-01-12 1 2008-10-25
#4: 1 2009-03-12 NA 2009-01-12
#5: 1 2009-05-12 1 2009-03-12
#6: 2 2010-05-17 NA <NA>
#7: 2 2010-06-12 NA 2010-05-17
#8: 2 2010-07-02 1 2010-06-12
#9: 2 2010-08-13 NA 2010-07-02
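If the eventual goal is the time between visits rather than the previous date itself, the same shift() call can feed a difference directly. A sketch, rebuilding the question's data (gap_days is a hypothetical column name added for illustration):

```r
library(data.table)

# Rebuild the example data from the question
dt <- data.table(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  date = as.Date(c("2008-09-12", "2008-10-25", "2009-01-12", "2009-03-12",
                   "2009-05-12", "2010-05-17", "2010-06-12", "2010-07-02",
                   "2010-08-13")),
  measles = c(1, NA, 1, NA, 1, NA, NA, 1, NA)
)

# previous date per ID, then the gap in whole days as a numeric column
dt[, previous_date := shift(date), by = ID]
dt[, gap_days := as.numeric(date - previous_date)]
```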

How to convert 3 hourly data into hourly data? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I have a set of data for several stations, with both 3-hourly and 1-hourly data frames. I have been able to sift through the data and separate it into 1-hour and 3-hour sets, but I want to convert the 3-hourly datasets into hourly data. I do not need to estimate the missing values in between the hours; I can fill those in as missing data. I just need a uniform data structure, since all the other data in the database I am using are already hourly except for these few stations.
I have included sample data showing the current dataset, the hourly dataset, the 3-hourly dataset, and the expected dataset.
Here is my best guess at what you want, solved with R and the tidyverse.
I have read in your data. After row-binding, we expand the data to include the missing time points and join back to the original data for the desired result.
library(tidyverse)
#read in the data
df1 = readxl::read_excel("df1.xlsx")
df2 = readxl::read_excel("df2.xlsx")
#fix names of one dataframe
names(df1) <- names(df2)
#create proper timestamps
df = bind_rows(df1,df2) %>%
mutate(ts = lubridate::ymd_hm(paste0(year, "-", month, "-", day, " ", hour,":00")))
#expand timestamps and station
expanded_ts <-
df %>%
tidyr::expand(ts, station)
#join for desired result
left_join(expanded_ts, df, by=c("ts", "station"))
# A tibble: 96 x 8
# ts station year month day hour T2 DP
# <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2014-08-01 00:00:00 450070 NA NA NA NA NA NA
# 2 2014-08-01 00:00:00 450110 2014 8 1 0 295 259
# 3 2014-08-01 00:00:00 450320 2014 8 1 0 295 259
# 4 2014-08-01 00:00:00 450390 2014 8 1 0 304 236
# 5 2014-08-01 01:00:00 450070 2014 8 1 1 320 250
# 6 2014-08-01 01:00:00 450110 2014 8 1 1 310 250
# 7 2014-08-01 01:00:00 450320 NA NA NA NA NA NA
# 8 2014-08-01 01:00:00 450390 NA NA NA NA NA NA
# 9 2014-08-01 02:00:00 450070 2014 8 1 2 330 250
#10 2014-08-01 02:00:00 450110 2014 8 1 2 320 250
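If instead you want each station filled to a regular hourly grid over its own date range (rather than the full cross of every timestamp with every station), tidyr::complete with an hourly sequence is an alternative. A sketch under that assumption, with a small made-up frame standing in for the combined data:

```r
library(dplyr)
library(tidyr)

# Hypothetical minimal frame standing in for the combined station data
df <- tibble(
  station = c(450070, 450070, 450110),
  ts = as.POSIXct(c("2014-08-01 00:00", "2014-08-01 03:00",
                    "2014-08-01 00:00"), tz = "UTC"),
  T2 = c(295, 320, 310)
)

# Per station, insert the missing hourly timestamps; T2 becomes NA there
df %>%
  group_by(station) %>%
  complete(ts = seq(min(ts), max(ts), by = "hour")) %>%
  ungroup()
```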

How to calculate moving average by specified grouping and deal with NAs

I have a data.table which needs a moving average to be calculated on the previous n days of data (let's use n=2 for simplicity, not incl. current day) for a specified grouping (ID1, ID2). The moving average should attempt to include the last 2 days of values for each ID1-ID2 pair. I would like to calculate moving average to handle NAs two separate ways:
1. Only calculate when there are 2 non-NA observations, otherwise avg should be NA (e.g. first 2 days within an ID1-ID2 will always have NAs).
2. Calculate the moving average based on any non-NA observations within the last 2 days (na.rm=TRUE ?).
I've tried the zoo package and various functions within it. I've settled on the following (I used shift() to exclude the current row from the average, and put the dates in reverse order to highlight that dates are not always ordered initially):
library(zoo)
library(data.table)
DATE = rev(rep(seq(as.Date("2018-01-01"),as.Date("2018-01-04"),"day"),4))
VALUE =seq(1,16,1)
VALUE[16] <- NA
ID1 = rep(c("A","B"),each=8)
ID2 = rep(1:2,2,each=4)
testdata = data.frame (DATE, ID1, ID2, VALUE)
setDT(testdata)[order(DATE), VALUE_AVG := shift(rollapplyr(VALUE, 2, mean,
na.rm=TRUE,fill = NA)), by = c("ID1", "ID2")]
I seem to have trouble grouping by multiple columns, and groupings where VALUE begins or ends with NA values also seem to cause issues. I'm open to any solution that makes sense within a data.table framework, especially frollmean (I need to update my versions of R and data.table). I don't know if I need to order the dates differently in conjunction with a specified alignment (e.g. "right").
I would hope my output would look something like the following except ordered by oldest date first per ID1-ID2 grouping:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-04 A 1 1 2.5
2: 2018-01-03 A 1 2 3.5
3: 2018-01-02 A 1 3 NA
4: 2018-01-01 A 1 4 NA
5: 2018-01-04 A 2 5 6.5
6: 2018-01-03 A 2 6 7.5
7: 2018-01-02 A 2 7 NA
8: 2018-01-01 A 2 8 NA
9: 2018-01-04 B 1 9 10.5
10: 2018-01-03 B 1 10 11.5
11: 2018-01-02 B 1 11 NA
12: 2018-01-01 B 1 12 NA
13: 2018-01-04 B 2 13 14.5
14: 2018-01-03 B 2 14 15.0
15: 2018-01-02 B 2 15 NA
16: 2018-01-01 B 2 NA NA
My code seems to roughly achieve the desired results for the sample data. Nevertheless, when running the same code on a large dataset for a 4-week average, where ID1 and ID2 are both integers, I get the following error:
Error in seq.default(start.at, NROW(data), by = by) :
wrong sign in 'by' argument
My results seem right for most ID1-ID2 combinations, but there are specific cases of ID1 where VALUE has leading and trailing NAs. I'm guessing this is causing the issue, although it didn't for the example above.
Using shift complicates this unnecessarily; rollapply can already handle that itself. In rollapplyr specify:
- a width of list(-seq(2)) to specify that it should act on offsets -1 and -2
- partial = TRUE to indicate that if there are fewer than 2 prior rows it should use whatever is there
- fill = NA to fill empty cells with NA
- na.rm = TRUE to remove any NAs and take the mean of the remaining cells; if the prior cells are all NA then mean gives NaN
To only consider situations where there are 2 prior non-NAs, giving NA otherwise, remove the partial = TRUE and na.rm = TRUE arguments.
First case
Take mean of non-NAs in prior 2 rows or fewer rows if fewer prior rows.
testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG :=
rollapplyr(VALUE, list(-seq(2)), mean, fill = NA, partial = TRUE, na.rm = TRUE),
by = c("ID1", "ID2")]
testdata
giving:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-01 A 1 4 NA
2: 2018-01-02 A 1 3 4.0
3: 2018-01-03 A 1 2 3.5
4: 2018-01-04 A 1 1 2.5
5: 2018-01-01 A 2 8 NA
6: 2018-01-02 A 2 7 8.0
7: 2018-01-03 A 2 6 7.5
8: 2018-01-04 A 2 5 6.5
9: 2018-01-01 B 1 12 NA
10: 2018-01-02 B 1 11 12.0
11: 2018-01-03 B 1 10 11.5
12: 2018-01-04 B 1 9 10.5
13: 2018-01-01 B 2 NA NA
14: 2018-01-02 B 2 15 NaN
15: 2018-01-03 B 2 14 15.0
16: 2018-01-04 B 2 13 14.5
Second case
NA if any of the prior 2 rows are NA or if there are fewer than 2 prior rows.
testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))
testdata[, VALUE_AVG :=
rollapplyr(VALUE, list(-seq(2)), mean, fill = NA),
by = c("ID1", "ID2")]
testdata
giving:
DATE ID1 ID2 VALUE VALUE_AVG
1: 2018-01-01 A 1 4 NA
2: 2018-01-02 A 1 3 NA
3: 2018-01-03 A 1 2 3.5
4: 2018-01-04 A 1 1 2.5
5: 2018-01-01 A 2 8 NA
6: 2018-01-02 A 2 7 NA
7: 2018-01-03 A 2 6 7.5
8: 2018-01-04 A 2 5 6.5
9: 2018-01-01 B 1 12 NA
10: 2018-01-02 B 1 11 NA
11: 2018-01-03 B 1 10 11.5
12: 2018-01-04 B 1 9 10.5
13: 2018-01-01 B 2 NA NA
14: 2018-01-02 B 2 15 NA
15: 2018-01-03 B 2 14 NA
16: 2018-01-04 B 2 13 14.5
Maybe something like:
setorder(setDT(testdata), ID1, ID2, DATE)
testdata[order(DATE), VALUE_AVG := shift(
  rollapplyr(VALUE, 2L,
             function(x) if (sum(!is.na(x)) > 0L) mean(x, na.rm = TRUE) else NA_real_,
             fill = NA_real_)
), by = c("ID1", "ID2")]
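With a recent data.table, frollmean plus shift can sketch the second case (exact 2-row prior window, NA whenever any value in the window is NA) without zoo. This assumes rows are ordered by DATE within each ID1-ID2 group, which the key below enforces:

```r
library(data.table)

# Rebuild the example data, keyed so rows sort by date within each group
DATE <- rev(rep(seq(as.Date("2018-01-01"), as.Date("2018-01-04"), "day"), 4))
VALUE <- seq(1, 16, 1)
VALUE[16] <- NA
ID1 <- rep(c("A", "B"), each = 8)
ID2 <- rep(1:2, 2, each = 4)
testdata <- data.table(DATE, ID1, ID2, VALUE, key = c("ID1", "ID2", "DATE"))

# frollmean(VALUE, 2) averages the current and previous row (right-aligned);
# shifting by 1 moves the window to the two *prior* rows only.
# NA propagates whenever either value in the window is NA.
testdata[, VALUE_AVG := shift(frollmean(VALUE, 2)), by = .(ID1, ID2)]
```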

R Max of Same Date, Previous Date, and Previous Hour Value

A couple of basic data manipulations. I searched with different wordings and couldn't find much.
I have data structured as below. In reality the hourly data is continuous, but I have included just 4 rows as an example.
start <- as.POSIXlt(c('2017-1-1 1:00','2017-1-1 2:00','2017-1-2 1:00','2017-1-2 2:00'))
values <- as.numeric(c(2,5,4,3))
df <- data.frame(start,values)
df
start values
1 2017-01-01 01:00:00 2
2 2017-01-01 02:00:00 5
3 2017-01-02 01:00:00 4
4 2017-01-02 02:00:00 3
I would like to add a couple columns that:
1) Show the max of the same day.
2) Show the max of the previous day.
3) Show the value of one previous hour.
The goal is to have an output like:
MaxValueDay <- as.numeric(c(5,5,4,4))
MaxValueYesterday <- as.numeric(c(NA,NA,5,5))
PreviousHourValue <- as.numeric(c(NA,2,NA,4))
df2 <- data.frame(start,values,MaxValueDay,MaxValueYesterday,PreviousHourValue)
df2
start values MaxValueDay MaxValueYesterday PreviousHourValue
1 2017-01-01 01:00:00 2 5 NA NA
2 2017-01-01 02:00:00 5 5 NA 2
3 2017-01-02 01:00:00 4 4 5 NA
4 2017-01-02 02:00:00 3 4 5 4
Any help would be greatly appreciated. Thanks
A solution using base within() together with the magrittr pipe and the lubridate package:
library(dplyr)
library(magrittr)
library(lubridate)
df %>%
within(MaxValueDay <- sapply(as.Date(start), function (x) max(df$values[which(x==as.Date(start))]))) %>%
within(MaxValueYesterday <- MaxValueDay[sapply(as.Date(start)-1, match, as.Date(start))]) %>%
within(PreviousHourValue <- values[sapply(start-hours(1), match, start)])
# start values MaxValueDay MaxValueYesterday PreviousHourValue
# 1 2017-01-01 01:00:00 2 5 NA NA
# 2 2017-01-01 02:00:00 5 5 NA 2
# 3 2017-01-02 01:00:00 4 4 5 NA
# 4 2017-01-02 02:00:00 3 4 5 4
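A group_by-based sketch of the same idea using dplyr verbs, with the column names from the question (the intermediate day column is added here for illustration; the timestamps are pinned to UTC so as.Date does not shift the day):

```r
library(dplyr)
library(lubridate)

# Rebuild the example data from the question
start <- as.POSIXct(c("2017-01-01 01:00", "2017-01-01 02:00",
                      "2017-01-02 01:00", "2017-01-02 02:00"), tz = "UTC")
df <- data.frame(start, values = c(2, 5, 4, 3))

df %>%
  mutate(day = as.Date(start)) %>%
  group_by(day) %>%
  mutate(MaxValueDay = max(values)) %>%   # max within the same day
  ungroup() %>%
  mutate(
    # match() finds the row one day / one hour earlier; NA when absent
    MaxValueYesterday = MaxValueDay[match(day - 1, day)],
    PreviousHourValue = values[match(start - hours(1), start)]
  )
```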

How to recreate the table by key?

I thought this could be a very easy question, but I am really a beginner with R.
I have a data.table with lots of rows, two columns of which can be set as the key. I want to reshape the table by that key.
For example, with the simple data below, the key is ID and Act, and we get a total of 4 groups.
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is a code to generate test data:
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
After the change, each column should represent a specific group defined by the key (ID and Act).
Below is the result:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
2015-01-01 21 12 5 4
2015-02-01 22 11 7 2
2015-03-01 NA NA NA 6
Thanks a lot!
What you are trying to do is not recreating the data.table, but reshaping it from a long format to a wide format. You can use dcast for this:
dcast(dd, ValueDate ~ ID + Act, value.var = "Volume")
which gives:
ValueDate 1_EUR 1_MAD 2_EUR 3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
If you want the numbers in the resulting columns to be preceded by "ID", you can use:
dcast(dd, ValueDate ~ paste0("ID",ID) + Act, value.var = "Volume")
which gives:
ValueDate ID1_EUR ID1_MAD ID2_EUR ID3_EUR
1: 2015-01-01 21 12 5 4
2: 2015-02-01 22 11 7 2
3: 2015-03-01 NA NA NA 6
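For completeness, the same reshape can be sketched with tidyr::pivot_wider (assuming tidyr 1.0.0+); names_from takes both key columns, and names_prefix supplies the leading "ID":

```r
library(data.table)
library(tidyr)

# Rebuild the example data from the question
dd <- data.table(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-03-01"),
  Act = c("EUR", "EUR", "MAD", "MAD", "EUR", "EUR", "EUR", "EUR", "EUR"),
  Volume = c(21, 22, 12, 11, 5, 7, 4, 2, 6)
)

# Glues ID and Act into one header per group, prefixed with "ID"
pivot_wider(dd, names_from = c(ID, Act), values_from = Volume,
            names_prefix = "ID", names_sep = "_")
```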

Resources