cross sectional sub-sets in data.table - r

I have a data.table which contains multiple columns, which is well represented by the following:
DT <- data.table(date = as.IDate(rep(c("2012-10-17", "2012-10-18", "2012-10-19"), each=10)),
session = c(1,2,3), price = c(10, 11, 12,13,14),
volume = runif(30, min=10, max=1000))
I would like to extract a multiple column table which shows the volume traded at each price in a particular type of session -- with each column representing a date.
At present, i extract this data one date at a time using the following:
DT[session==1,][date=="2012-10-17", sum(volume), by=price]
and then bind the columns.
Is there a way of obtaining the end product (a table with each column referring to a particular date) without sticking all the single queries together -- as i'm currently doing?
thanks

Does the following do what you want.
A combination of reshape2 and data.table
library(reshape2)
.DT <- DT[,sum(volume),by = list(price,date,session)][, DATE := as.character(date)]
# reshape2 for casting to wide -- it doesn't seem to like IDate columns, hence
# the character DATE co
dcast(.DT, session + price ~ DATE, value.var = 'V1')
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439
6 2 10 NA 755.2650 998.7646
7 2 11 251.3691 695.0153 NA
8 2 12 791.6882 NA 275.4777
9 2 13 NA 111.7700 240.3329
10 2 14 230.6461 817.9438 NA
11 3 10 902.9220 NA 870.3641
12 3 11 NA 719.8441 963.1768
13 3 12 361.8612 563.9518 NA
14 3 13 393.6963 NA 718.7878
15 3 14 NA 871.4986 582.6158
If you just wanted session 1
dcast(.DT[session == 1L], session + price ~ DATE)
session price 2012-10-17 2012-10-18 2012-10-19
1 1 10 308.9528 592.7259 NA
2 1 11 649.7541 NA 816.3317
3 1 12 NA 502.2700 766.3128
4 1 13 424.8113 163.7651 NA
5 1 14 682.5043 NA 147.1439

Related

To apply mutate with an other line

I have a table and I would like to add a column that calculates the percentage compared to the previous line.
You have to do as calculation takes the line 1 divided by line 2 and on the line 2, you indicate the result
Example
month <- c(10,11,12,13,14,15)
sell <-c(258356,278958,287928,312254,316287,318999)
df <- data.frame(month, sell)
df %>% mutate(augmentation = sell[month]/sell[month+1])
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
dplyr
You can just use lag like this:
library(dplyr)
df %>%
mutate(resultat = lag(sell)/sell)
Output:
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
data.table
Another option is using shift:
library(data.table)
setDT(df)[, resultat:= shift(sell)/sell][]
Output:
month sell resultat
1: 10 258356 NA
2: 11 278958 0.9261466
3: 12 287928 0.9688464
4: 13 312254 0.9220955
5: 14 316287 0.9872489
6: 15 318999 0.9914984

Hold current value until non-null value occurs [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 5 years ago.
Hi I come from a background in SAS and I am relatively new to R. I am attempting to convert an existing SAS program into equivalent R code
I am unsure how to achieve the equivalent of SAS's "retain" and "by" Behavior in R
I have a dataframe with two columns first column is a date column and the second column is a numeric value.
The numeric column represents a result from lab test. The test is conducted semi-regularly so on some days there will be Null values in the data. The data is ordered by date and the dates are sequential.
i.e example data looks like this
Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA
I would like to create a third column which would contain the most recent result.
If Result column is Null it should be set to most recent previously non Null Result otherwise it should contain the Result value
My desired output would look like this:
Date Result My_var
2017/01/01 15 15
2017/01/02 NA 15
2017/01/03 NA 15
2017/01/04 12 12
2017/01/05 NA 12
2017/01/06 13 13
2017/01/07 11 11
2017/01/08 NA 11
In SAS I can achieve this with something like following code snippet:
data my_data;
retain My_var;
set input_data;
by date;
if Result not = . then
my_var = result;
run;
I am stumped as to how to do this in R I do not think R supports By group processing as in SAS - or at least I don't know how to set that as option.
I have naively tried:
my_data <- mutate(input_data, my_var = if(is.na(Result)) {lag(Result)} else {Result})
But I do not think that syntax is correct.
We can use na.locf function from the zoo package to fill in the missing values.
library(zoo)
dt$My_var <- na.locf(dt$Result)
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
Or the fill function from the tidyr package.
library(dplyr)
library(tidyr)
dt <- dt %>%
mutate(My_var = Result) %>%
fill(My_var)
dt
# Date Result My_var
# 1 2017/01/01 15 15
# 2 2017/01/02 NA 15
# 3 2017/01/03 NA 15
# 4 2017/01/04 12 12
# 5 2017/01/05 NA 12
# 6 2017/01/06 13 13
# 7 2017/01/07 11 11
# 8 2017/01/08 NA 11
DATA
dt <- read.table(text = "Date Result
2017/01/01 15
2017/01/02 NA
2017/01/03 NA
2017/01/04 12
2017/01/05 NA
2017/01/06 13
2017/01/07 11
2017/01/08 NA",
header = TRUE, stringsAsFactors = FALSE)

Finding newest data older than a specific date in R

I have a two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have a identification number (a variable we can call ”individual”) that occurs in both of the data.frames and each frame has information on when the data was collected, stored in a column that we can call ”some.date”.
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have a code that solves the problem, but it is too slow for the data that I have in mind (which has well over 20 000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ] # We only look at the individual in question.
subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]# Here we get all the rows that have data that are measured BEFORE timepoint.
row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"]) # This can be done, since we have already made sure that fromdatum < timepoint.
ifelse(length(row.index)!= 0, as.integer(rownames(subsetted.dataset[row.index,])), NA) # Then we output the row that had that information.
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
output <- rbind(output, cbind(dataset.new[, "individual"][i], find.previous(dataset.old, dataset.new[, "individual"][i], dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts on here stackoverflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
6 2 2016-01-28 2012-01-27 5
8 7 2016-03-03 NA NA
Edit:
I'm partial to data.table. The code would look something like:
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
dataset.merge <- merge(dataset.old, dataset.new, by = "individual", all.x = TRUE)
dataset.merge <- dataset.merge[, new.date > old.date]
dataset.merge[old.date == max(old.date) | is.na(old.date), by = individual]
We can skip the NA search by finding the minimum square root. The negative values will be coerced to missing for us:
dataset.old$rn <- 1:nrow(dataset.old)
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA

Calculate rolling average of simulated data series with data.table

I am simulating a price time series, where the time horizon basically is that each month consists of 20 working days and 12 months are one year. I now would like to calculate the rolling average of this price, always based on the first day of the month.
I do have a working solution, but would like to know if there's a more elegant or faster one.
dt.oil.price
Period Month Day.Month Oil.Price Oil.Supply Risk.Free.Interest
1: 1 1 1 39.4560000 NA 0.08642857
2: 2 1 2 3.7889460 NA 0.08642857
3: 3 1 3 51.0748751 NA 0.08642857
4: 4 1 4 60.6282853 NA 0.08642857
5: 5 1 5 35.7267224 NA 0.08642857
6: 6 1 6 26.1868977 NA 0.08642857
7: 7 1 7 32.6488136 NA 0.08642857
8: 8 1 8 42.6397549 NA 0.08642857
9: 9 1 9 18.8969991 NA 0.08642857
...
20: 20 1 20 8.8036135 NA 0.08642857
21: 21 2 1 2.5559526 NA 0.08642857
22: 22 2 2 24.3996401 NA 0.08642857
...
40: 40 2 20 41.2988566 NA 0.08642857
41: 41 3 1 20.8012327 NA 0.08642857
42: 42 3 2 70.5297726 NA 0.08642857
Just to give you an idea on the structure of the data. To create the above data structure with 60 periods:
set.seed(1);
dt.oil.price <- as.data.table(cbind( Period = 1:60,
Month = as.integer(rep(1:(60/20), each = 20))[1:60],
Oil.Price=rnorm(3*20,mean = 50, sd = 10)))
dt.oil.price[,"Day.Month" := rank(Period),by="Month"]
With the following code I can then select all first days of a month and calculate the mean of the oil price for these days:
dt.oil.price[ Day.Month == 1, mean(Oil.Price)]
In the next step I use another helper column "Num.Months" to rank the number of months accordingly, by
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)]
and with this I can then select only the last two months for the average calculation, by subsetting this
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)][Num.Months <= 2, Oil.Price]
A code snippet, which allows to calculate the mean without using an explicit helper column for the last three months:
dt.oil.price[Day.Month == 1 & Period <= 60, {Num.Months = rank(-Period); list("Period" = Period, "Month" = Month, "Oil.Price" = Oil.Price, "Num.Months" = Num.Months)}][Num.Months <=12, mean(Oil.Price)]
I hope my steps are all clear and it becomes also clear, what I would like to achieve. It is also possible to calculate the moving average dynamically by defining for example a period and then calculate the moving average for the last 12 months preceding that period. This can be achieved, by sub-setting the data.table only to periods smaller than the defined period and then calculating "Num.Months" for this data.table subset.

Identify and remove duplicates by a criteria in R

Hi I am puzzled with a problem concerning duplicates in R. I have looked around a lot and don't seem to find any help. I have a dataset like that
x = data.frame( id = c("A","A","A","A","A","A","A","B","B","B","B"),
StartDate = c("09/07/2006", "09/07/2006", "09/07/2006", "08/10/2006",
"08/10/2006", "09/04/2007", "02/03/2011","05/05/2005", "08/06/2009", "07/09/2009", "07/09/2009"),
EndDate = c("06/08/2006", "06/08/2006", "06/08/2006", "19/11/2006", "19/11/2006", "07/05/2007", "30/03/2011",
"02/06/2005", "06/07/2009", "05/10/2009", "05/10/2009"),
Group = c(1,1,1,2,2,3,4,2,3,4,4),
TestDate = c("09/06/2006", "08/09/2006", "08/10/2006", "08/09/2006", "08/10/2006", "NA", "02/03/2011",
"NA", "07/09/2009", "07/09/2009", "08/10/2009"),
Code = c(4,4,4858,4,4858,NA,4,NA, 795, 795, 4)
)
> x
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
2 A 09/07/2006 06/08/2006 1 08/09/2006 4
3 A 09/07/2006 06/08/2006 1 08/10/2006 4858
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
9 B 08/06/2009 06/07/2009 3 07/09/2009 795
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
So basically what I am trying to do is to identify duplicates in the TestDate variable by ID. For example dates 08/09/2006 and 08/10/2006 seem to be repeated in the same person but for different Group and I don't want the same Testdate to be in different Group by ID. The criteria to choose which TestDate to choose is to take the difference in days of TestDate with StartDate and EndDate for the different groups and then keep the one with the smallest difference in days. For example, about the date 08/10/2006 I would like to keep row 5 as the TestDate there is closer to the StartDate, than compared with the same differences in row 3. Eventually, I would like to get with a dataset like that
> xfinal
id StartDate EndDate Group TestDate Code
1 A 09/07/2006 06/08/2006 1 09/06/2006 4
4 A 08/10/2006 19/11/2006 2 08/09/2006 4
5 A 08/10/2006 19/11/2006 2 08/10/2006 4858
6 A 09/04/2007 07/05/2007 3 NA NA
7 A 02/03/2011 30/03/2011 4 02/03/2011 4
8 B 05/05/2005 02/06/2005 2 NA NA
10 B 07/09/2009 05/10/2009 4 07/09/2009 795
11 B 07/09/2009 05/10/2009 4 08/10/2009 4
Any help on that will be much appreciated. Thanks
x$StartDate <- as.Date(x$StartDate,format="%d/%m/%Y")
x$EndDate <- as.Date(x$EndDate,format="%d/%m/%Y")
x$TestDate <- as.Date(x$TestDate,format="%d/%m/%Y")
x$Diff <- difftime(x$EndDate,x$StartDate,"days")
x <- x[order(x$id,x$Diff),]
x <- x[!duplicated(x[,c("id","TestDate")]),]
x$Diff <- NULL
x

Resources