R add in/populate missing combinations dcast reshape2 table - r

This is my data table:
library(reshape2)
Name.1 <- c(rep("IVa",12),rep("VIa",10),rep("VIIb",3),rep("IVa",5))
qrt <- c(rep("Q1",6),rep("Q3",10),rep("Q4",3),rep("Q1",5),rep("Q1",3),rep("Q3",3))
variable <- c(rep("wtTonnes",30))
value <- c(201:230)
df <- data.frame(Name.1,qrt,variable,value)
df1 <- dcast(df, Name.1 ~ qrt, fun.aggregate=sum, value.var="value",margins=TRUE)
It gives me an output like this:
Name.1 Q1 Q3 Q4 (all)
IVa 1674 1944 0 3618
VIa 663 858 654 2175
VIIb 672 0 0 672
(all) 3009 2802 654 6465
The 'qrt' values Q1, Q3, Q4 represent quarters of the year. I would like the table to include the missing quarters and populate them with 0. Each year when I run the script there could be wtTonnes values for any combination of quarters, and I don't want to hard-code each time which quarters are missing.
In this case I would like it to look like:
Name.1 Q1 Q2 Q3 Q4 (all)
IVa 1674 0 1944 0 3618
VIa 663 0 858 654 2175
VIIb 672 0 0 0 672
(all) 3009 0 2802 654 6465
Is it possible to pass a list to the table, or to the raw data at any stage, to say which columns I want (i.e. there should always be Q1, Q2, Q3 and Q4), with dummy values if need be?

The following should give you the required output:
df$qrt <- factor(df$qrt, levels = c("Q1", "Q2", "Q3", "Q4"))
df1 <- dcast(df, Name.1 ~ qrt, fun.aggregate=sum, value.var="value", margins=TRUE, drop=FALSE)
First, I tell R that qrt is a factor with all four levels, including the level that does not occur in the data. Then I tell dcast not to drop unused combinations (drop = FALSE). This gives:
Name.1 Q1 Q2 Q3 Q4 (all)
1 IVa 1674 0 1944 0 3618
2 VIa 663 0 858 654 2175
3 VIIb 672 0 0 0 672
4 (all) 3009 0 2802 654 6465
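For comparison, the same idea (declare every quarter as a factor level, then keep the unused level) can be sketched in base R. This swaps in xtabs() and addmargins() instead of dcast, so it is an alternative rather than the answer's method. The data are those from the question; tab and with_totals are names chosen here for illustration.

```r
# Toy data from the question
Name.1 <- c(rep("IVa", 12), rep("VIa", 10), rep("VIIb", 3), rep("IVa", 5))
qrt <- c(rep("Q1", 6), rep("Q3", 10), rep("Q4", 3),
         rep("Q1", 5), rep("Q1", 3), rep("Q3", 3))
value <- 201:230
df <- data.frame(Name.1, qrt, value)

# Declare all four quarters as levels, including the unobserved Q2
df$qrt <- factor(df$qrt, levels = paste0("Q", 1:4))

# xtabs() keeps unused factor levels by default, so Q2 appears as a column of 0
tab <- xtabs(value ~ Name.1 + qrt, data = df)

# addmargins() appends a "Sum" row and column, similar to margins = TRUE
with_totals <- addmargins(tab)
print(with_totals)
```

Because the levels are set once on the raw data, nothing needs to be hard-coded per year: whichever quarters are absent still get a zero column.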

Related

R: fill out a new column based on several variables in another dataset

I have a first dataframe with 4 columns (ID, Year, X and Y)
Id Year X Y
1 2017 20_24
1 2016 45_49
2 2017 30_34
2 2014 20_24
4 2014 14_19
4 2015 20_24
I would like to fill out the Y column using another dataset.
The second dataset has the same variables ID and Year; its other columns are named after the items of column X in the first dataset.
Id Year 14_19 20_24 30_34 45_49
1 2017 123 122 5555 4444
1 2016 456 543 8888 333
1 2015 5644 0908 0987 5456
1 2014 5642 767 233 323
2 2017 123 123 5666 989
2 2016 456 876 55 45
2 2015 786 789 324 77
2 2014 633 543 334 34
3 2017 123 123 321 44
3 2016 456 345 45645 23
3 2015 876 4556 6554 23
So I would like Y to be filled in when ID, Year and the item in X match the corresponding column of the second dataset. How is this possible?
Thanks!
Try this dplyr and tidyr solution:
library(dplyr)
library(tidyr)
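result <- df2 %>%
  gather("X", "Y", -c("ID", "Year")) %>%
  right_join(df1, by = c("ID", "Year", "X"))
Or with pivot_longer(), pivoting every column except ID and Year:
result <- df2 %>%
  pivot_longer(cols = -c(ID, Year),
               names_to = "X",
               values_to = "Y") %>%
  right_join(df1, by = c("ID", "Year", "X"))
If df1 already contains an empty Y column, drop it first (for example with select(df1, -Y)); otherwise the join returns Y.x and Y.y instead of a single Y.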
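The reshape-then-join logic can also be sketched without any packages. This is a base R analogue of the approach above, not the answer's own code. The data frames below are a cut-down, hypothetical version of the question's tables (column names ID, Year, X as in the answer), and long and result are illustrative names.

```r
# First dataset: the rows whose Y we want to fill (ID 4 has no match in df2)
df1 <- data.frame(ID = c(1, 1, 2, 2, 4),
                  Year = c(2017, 2016, 2017, 2014, 2014),
                  X = c("20_24", "45_49", "30_34", "20_24", "14_19"))

# Second dataset: one column per age range (check.names keeps the names as-is)
df2 <- data.frame(ID = c(1, 1, 2, 2),
                  Year = c(2017, 2016, 2017, 2014),
                  `14_19` = c(123, 456, 123, 633),
                  `20_24` = c(122, 543, 123, 543),
                  `30_34` = c(5555, 8888, 5666, 334),
                  `45_49` = c(4444, 333, 989, 34),
                  check.names = FALSE)

# Wide to long: one row per (ID, Year, range) with its value in Y
ranges <- setdiff(names(df2), c("ID", "Year"))
long <- data.frame(ID = rep(df2$ID, times = length(ranges)),
                   Year = rep(df2$Year, times = length(ranges)),
                   X = rep(ranges, each = nrow(df2)),
                   Y = unlist(df2[ranges], use.names = FALSE))

# all.x = TRUE keeps every row of df1; unmatched rows get Y = NA
result <- merge(df1, long, by = c("ID", "Year", "X"), all.x = TRUE)
print(result)
```

Rows of df1 with no counterpart in df2 (ID 4 here) come back with Y set to NA, which mirrors what right_join() does in the answer.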

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2, dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <-
  dcast(
    setDT(long),
    id + visit.date + visit.id + bill.num ~ visit.dx,
    value.var = c(
      "dx.code",
      "FY"
    )
  )
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0's and 1's. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
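A note on the warning itself: dcast falls back to length whenever at least one combination of the formula variables covers more than one row, even if no complete row is duplicated. A minimal base R sketch for locating such combinations follows. The data are hypothetical (the fourth row is deliberately mislabelled as v1dx1 to trigger the situation), and keys and dup are illustrative names.

```r
# Hypothetical long data; row 4 repeats the (id, visit.dx) pair of row 3
long <- data.frame(id = c(1, 1, 2, 2, 3),
                   visit.dx = c("v1dx1", "v2dx1", "v1dx1", "v1dx1", "v1dx1"),
                   dx.code = c(409, 512, 488, 122, 923))

# The columns that define one output cell (left- and right-hand side of the formula)
keys <- long[c("id", "visit.dx")]

# Flag every row whose key combination occurs more than once
dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
print(long[dup, ])
```

If this returns any rows for the real data, those are the cells dcast would have to aggregate.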
The data.table package extends dcast to accept multiple value.var columns and provides rowid(), so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
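To see what rowid(id) contributes, here is a rough base R analogue, offered as a sketch rather than a replacement for the data.table call. It numbers the rows within each id with ave() and then widens with reshape(). Only three of the question's columns are used, and the seq column is a name introduced here.

```r
# Subset of the question's data
long <- data.frame(id = c(1, 1, 2, 2, 3),
                   visit.date = c("1/2/12", "3/4/12", "5/6/18", "5/6/18", "2/9/14"),
                   dx.code = c(409, 512, 488, 122, 923))

# Within-id counter (1, 2, ... per id), playing the role of rowid(id)
long$seq <- ave(seq_along(long$id), long$id, FUN = seq_along)

# One row per id; every other column is spread out by the counter
wide <- reshape(long, idvar = "id", timevar = "seq", direction = "wide")
print(wide)
```

The counter gives each row of an id a unique column suffix, which is why no aggregation function is needed.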

How to create a numerical variable from Quarterly time column - R

I have a dataset which currently looks like this:
Time Var1 Var2
2013 Q4 123 756
2013 Q4 657 987
2014 Q1 746 756
2014 Q1 66 999
2014 Q2 774 542
And I need to convert this categorical 'Time' variable into a numerical variable, something which may look potentially like this:
Time Var1 Var2 n.Time
2013 Q4 123 756 1
2013 Q4 657 987 1
2014 Q1 746 756 2
2014 Q1 66 999 2
2014 Q2 774 542 3
Or something similar which gives the 'Time' column a numerical value which is proportional.
I have tried the following:
df$n.Time <- as.yearqtr(df$Time)
But this just gives the same output as the 'Time' column instead of making it numerical.
Any help would be greatly appreciated
Would something like this work?
df$n.Time <- as.numeric(as.factor(df$Time))
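I think you are looking to split the Q part out of the Time column and then convert it to a numerical value. The following extracts the quarter label as a factor; wrap it in as.integer() to get integer codes:
df$n.Time <- as.factor(substr(as.character(df$Time),
                              regexpr("Q", df$Time), nchar(as.character(df$Time))))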
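If the numeric value should stay proportional to elapsed time even when some quarters are absent from the data, one option is to compute a running quarter index from the year and the quarter number. This is a sketch under that assumption and not taken from either answer. It uses the Time values from the question; year, quarter and idx are illustrative names.

```r
Time <- c("2013 Q4", "2013 Q4", "2014 Q1", "2014 Q1", "2014 Q2")

# Split "YYYY Qn" into its two numeric parts
year <- as.integer(substr(Time, 1, 4))
quarter <- as.integer(sub(".*Q", "", Time))

# Consecutive quarters differ by exactly 1, across year boundaries too
idx <- year * 4L + quarter

# Shift so that the earliest quarter in the data is 1
n.Time <- idx - min(idx) + 1L
print(n.Time)
```

Unlike factor codes, this index leaves a gap when a quarter is missing, so the spacing between values reflects the real time difference.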

Rolling multi regression in R data table

Say I have an R data.table DT which has a list of returns:
Date Return
2016-01-01 -0.01
2016-01-02 0.022
2016-01-03 0.1111
2016-01-04 -0.006
...
I want to do a rolling multi regression of the previous N observations of Return predicting the next Return over some window K. E.g. Over the last K = 120 days do a regression of the last N = 14 observations to predict the next observation. Once I have this regression I want to use the predict function to get a prediction for each row based on the regression. In pseudocode it would be something like:
DT[, Prediction := predict(lm(Return[prev K - N -1] ~ Return[N observations prev for each observation]), Return[N observations previous for this observation])]
To be clear, I want to do a multiple regression, so if N was 3 it would be:
lm(Return ~ Return[-1] + Return[-2] + Return[-3]) ## where the negatives are the prev rows
How do I write this (as efficiently as possible)?
Thanks
If I understand correctly you want a quarterly auto-regression.
There's a related thread on time-series with data.table here.
You can setup a rolling date in data.table like this (see the link above for more context):
#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]
setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")
Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.
First get your data setup:
require(forecast)
require(xts)
DT <- read.table(file("clipboard"))  # read the example data pasted to the clipboard
dput(DT) # the dput was too long to display here
DT[,1] <- as.POSIXct(strptime(DT[,1], "%m/%d/%Y"))
DT[,2] <- as.double(DT[,2])
dat <- xts(DT$V2, order.by = DT$V1)  # values indexed by date
x.ts <- to.quarterly(dat) # 120 days
dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571
Then you can do a rolling ARIMA model with or without re-estimation like this:
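h <- 1  # forecast horizon; not defined in the original, 1 is assumed to match ncol below
fit <- auto.arima(x.ts)
order <- arimaorder(fit)
n <- nrow(x.ts)  # assumed; the original took n from x before x was defined
fcmat <- matrix(0, nrow=n, ncol=h)
for(i in 1:n)
{
  # extend the estimation window by one quarter per iteration
  x <- window(x.ts, end=2017.99 + (i-1)/4)
  refit <- Arima(x, order=order[1:3], seasonal=order[4:6])
  fcmat[i,] <- forecast(refit, h=h)$mean
}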
Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/
You have to have the lags in columns anyway, so if I understand you correctly you can do something like this, say for a maximum lag of 3:
setkey(DT, Date)
lag_max <- 3
for (i in 1:lag_max) {
  # column lag<i> holds Return shifted down by i rows
  set(DT, NULL, paste0("lag", i), shift(DT[["Return"]], i, type = "lag"))
}
# na.exclude pads the fitted values with NA for the first lag_max rows
DT[, prediction := fitted(lm(Return ~ lag1 + lag2 + lag3, na.action = na.exclude))]
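The lag-column idea can be sketched in base R with embed(), which builds all N lags at once, followed by a simple loop for the rolling window of K rows described in the question. This is an illustrative sketch, not tuned for speed. The returns are made up with rnorm(), K = 20 and N = 3 are arbitrary, and lagged, pred and roll_pred are names introduced here.

```r
set.seed(1)
Return <- rnorm(50, sd = 0.02)  # made-up daily returns
N <- 3                          # number of lags used as predictors
K <- 20                         # rolling window length (rows per regression)

# embed() puts the current value in column 1 and lags 1..N in columns 2..N+1
E <- embed(Return, N + 1)
lagged <- data.frame(E)
names(lagged) <- c("Return", paste0("lag", 1:N))

# Single regression over all rows; pad so pred lines up with Return
fit <- lm(Return ~ ., data = lagged)
pred <- c(rep(NA, N), fitted(fit))

# Rolling version: fit on the previous K rows, predict the current row
roll_pred <- rep(NA_real_, nrow(lagged))
for (t in (K + 1):nrow(lagged)) {
  fit_t <- lm(Return ~ ., data = lagged[(t - K):(t - 1), ])
  roll_pred[t] <- predict(fit_t, newdata = lagged[t, ])
}
```

Each rolling prediction uses only rows before the one being predicted, so it is an out-of-sample forecast, whereas pred contains in-sample fitted values.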

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
needs an added column named "Rate", with values coming from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to a 'data.table' (setDT(A)), melt the second dataset, and join on 'Year' and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable = "Country", value = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names):
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
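For plain data frames, a loop-free lookup can also be sketched with base R matrix indexing: match() finds the row of B for each Year and the column of B for each Country, and the two index vectors are combined with cbind(). This is an alternative to the answers above, not a restatement of them, and it assumes A and B are ordinary data frames (not data.tables) with the question's values.

```r
A <- data.frame(Country = c("USA", "Canada", "Canada"),
                Year = c(1994, 1997, 1998),
                Number = c(455, 342, 987))
B <- data.frame(Year = 1993:1998,
                USA = c(21, 41, 56, 85, 65, 1),
                Canada = c(654, 321, 789, 123, 456, 999))

# Row of B for each Year in A, column of B for each Country in A
row_idx <- match(A$Year, B$Year)
col_idx <- match(A$Country, names(B))

# A two-column index matrix picks one cell per row of A; NA if nothing matches
A$Rate <- B[cbind(row_idx, col_idx)]
print(A)
```

This covers two criteria directly. With more than two criteria, the melt-and-merge answers above generalise more easily, since every extra criterion is just another join column.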
