First element in a data.table aggregation - r

I have a data.table of tick data, which I want to aggregate into a one-second timeframe. While getting the max, min and last values is pretty straightforward:
data[, list(max(value), min(value), last(value)), by=time]
I am struggling to get the first data point corresponding to a given second's timestamp. There is nothing in the manual. Is there an easy way to do it, like, say, SQL's TOP?

I managed to find the solution: to get the first element, just subset that column's first value using [:
data[, list(value[1], max(value), min(value), last(value)), by=time]
Maybe it helps someone.
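For the record, here is a minimal self-contained sketch of the pattern (the toy table and integer-second timestamps are my own assumptions):
library(data.table)
ticks <- data.table(time  = c(1, 1, 1, 2, 2),
                    value = c(10, 12, 11, 20, 19))
# value[1] picks the first tick within each one-second group
ticks[, .(first = value[1], max = max(value),
          min = min(value), last = last(value)), by = time]
#    time first max min last
# 1:    1    10  12  10   11
# 2:    2    20  20  19   19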

It seems that first() is a valid aggregation function as well:
foo <- data.table(x = 1:10, y = 11:20)
     x  y
 1:  1 11
 2:  2 12
 3:  3 13
 4:  4 14
 5:  5 15
 6:  6 16
 7:  7 17
 8:  8 18
 9:  9 19
10: 10 20
foo[, .(first(x), last(x))]
   V1 V2
1:  1 10
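Note that first() and last() are exported by data.table itself (dplyr exports functions with the same names, so watch for masking when both packages are loaded). They also work per group; a quick sketch with a hypothetical grouping column:
foo[, grp := rep(1:2, each = 5)]   # add a toy grouping column
foo[, .(first(x), last(x)), by = grp]
#    grp V1 V2
# 1:   1  1  5
# 2:   2  6 10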

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column <- c("A","A","A","A","B","B","B","B")
Date_column <- c(20040131, 20041231, 20051231, 20061231, 20051231, 20061231, 20071231, 20081231)
Price_column <- c(12,13,17,19,35,38,39,41)
Data <- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
   ID_column Date_column Price_column
1:         A    20040131           12
2:         A    20041231           13
3:         A    20051231           17
4:         A    20061231           19
5:         B    20051231           35
6:         B    20061231           38
7:         B    20071231           39
8:         B    20081231           41
My next aim would be to convert the Date column, which is currently in numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the date column, as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
  ID_column Date_column Price_column
1         A        2004           12
2         A        2004           13
3         A        2005           17
4         A        2006           19
5         B        2005           35
6         B        2006           38
7         B        2007           39
8         B        2008           41
My ultimate goal is to employ the plm package for panel data regression, but when I use pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be tagged A, 2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then provide me with unique ID/Time pairs across the whole data set.
Therefore I was hoping someone could help me out with a loop or a package suggestion that keeps only the row with the newer/later observation within a year, whenever duplicates occur, and that also scales to larger data sets. I believe this involves a couple of conditional commands that I am having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the row with the "smaller" date would probably do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation, and mutating a new column that holds only the year, since you want the latest observation within each year.
library(dplyr)

Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)

Data %>% arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
  ID_column Date_column Price_column year
1         A  2004-12-31           13 2004
2         A  2005-12-31           17 2005
3         B  2005-12-31           35 2005
4         A  2006-12-31           19 2006
5         B  2006-12-31           38 2006
6         B  2007-12-31           39 2007
Since we arranged by the actual date in descending order, the rows that distinct() drops for each unique combination of ID and year are guaranteed to be the older ones. Reverse the arrangement to keep the oldest occurrence instead.
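An equivalent without the double arrange, assuming dplyr >= 1.0.0, is to keep the row with the maximum date within each ID/year group; a sketch:
Data %>%
  group_by(ID_column, year) %>%
  slice_max(Date_column, n = 1) %>%   # keep the latest observation per ID and year
  ungroup()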

I need to find the mean for the data with cells without values

I need to find the average price for each of the different weeks, and make a ggplot showing how the price develops over the year.
When you compute the mean, how do the empty cells affect it?
I have tried several things, including the melt() function, so that I only have 3 variables. The variables are factors, and value is what I want to take the mean of.
Company  variable       value
ns       Price week 24   1749
ns       Price week 24
ns       Price week 24   1599
ns       Price week 24
ns       Price week 24
ns       Price week 24    359
ns       Price week 24    460
I have more than 300K observations, and would love to have a small data.frame with only the Company and the mean price for each week. Right now I have every observation for each week, and I need the weekly means in order to use ggplot.
When I use the following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I get a warning message saying "argument is not numeric or logical: returning NA".
I am looking forward to getting your help!
Clean code from PavoDive's comment:
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better:
dt[, mean(value, na.rm = TRUE), by = .(Price, Week)]
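(One subtle difference: the filtering version drops a group entirely if all of its values are NA, whereas the na.rm = TRUE version keeps such a group and returns NaN for it.)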
Original answer:
This works using data.table. The first part filters out rows that don't have a number in value (comparing NA against a number yields NA, so those rows drop out). Next, .(MeanValues = mean(value)) computes the average of the value column. Finally, by defines how to group the rows.
Code:
dt[value > 0 | value < 1, .(MeanValues = mean(value)), by = c("Price", "Week")][]
Input:
dt <- data.table(Price = c("A","B","B","A","A","B","B","A"),
                 Week  = c(1,2,1,1,2,2,1,2),
                 value = c(3,7,2,NA,1,46,1,NA))
   Price Week value
1:     A    1     3
2:     B    2     7
3:     B    1     2
4:     A    1    NA
5:     A    2     1
6:     B    2    46
7:     B    1     1
8:     A    2    NA
Output:
   Price Week MeanValues
1:     A    1        3.0
2:     B    2       26.5
3:     B    1        1.5
4:     A    2        1.0
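From the weekly means to the plot the question asks about, a minimal sketch using the toy dt above (the choice of aesthetics is my assumption):
library(data.table)
library(ggplot2)
weekly <- dt[, .(MeanValues = mean(value, na.rm = TRUE)), by = .(Price, Week)]
ggplot(weekly, aes(x = Week, y = MeanValues, colour = Price)) +
  geom_line() +
  geom_point()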

R: Create a column of averages based upon groups of four rows

> head(df)
   person week target actual drop_out organization agency
1:    QJ1    1     30     19     TRUE           BB    LLC
2:    GJ2    1     30     18    FALSE           BB    LLC
3:    LJ3    1     30     22     TRUE           CC    BBR
4:    MJ4    1     30     24    FALSE           CC    BBR
5:    PJ5    1     35     55    FALSE           AA    FUN
6:    EJ6    1     35     50    FALSE           AA    FUN
There are around ~30 weeks in the dataset with a repeating Person ID each week.
I want to look at each person's values FOUR weeks at a time (so weeks 1-4, 5-8, 9-12, and so on). For each of these chunks, I want to add up all the "actual" values and divide by the sum of the "target" values. Then we could put that value in a column called "monthly percent."
As per Shape's recommendation I've created a month column like so
fullReshapedDT$month <- with(fullReshapedDT, ceiling(week/4))
Trying to figure out how to iterate over the month column and calculate averages now. Trying something like this, but it obviously doesn't work:
fullReshapedDT[,.(monthly_attendance = actual/target,by=.(person_id, month)]
Have you tried creating a group variable? It will allow you to group operations by the four-week period:
setDT(df1)[, grps := ceiling(week/4)                    # create 4-week groups
         ][, sum(actual)/sum(target), .(person, grps)   # grouped operations
         ][, grps := NULL][]                            # remove the helper column
#     person        V1
#  1:    QJ1 1.1076923
#  2:    GJ2 1.1128205
#  3:    LJ3 0.9948718
#  4:    MJ4 0.6333333
#  5:    PJ5 1.2410256
#  6:    EJ6 1.0263158
#  7:    QJ1 1.2108108
#  8:    GJ2 0.6378378
#  9:    LJ3 0.9891892
# 10:    MJ4 0.8564103
# 11:    PJ5 1.1729730
# 12:    EJ6 0.8666667
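If you would rather keep the month identifier than drop it, you can name the summary column and compute the grouping expression directly inside by; a sketch:
setDT(df1)[, .(monthly_percent = sum(actual)/sum(target)),
           by = .(person, month = ceiling(week/4))]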

Use dplyr to compute lagging difference

My data frame consists of three columns: state name, year, and the tax receipt for each year and each state. Below is an example for just one state.
  year RealTaxRevs
1 1971     8335046
2 1972     9624026
3 1973    10498935
4 1974    10052305
5 1975     8708381
6 1976     8911262
7 1977    10759032
I'd like to compute the change in tax receipt from one year to the next, for each state. I used the following code:
data %>% group_by(state) %>% summarise(diff(RealTaxRevs, lag = 1, differences = 1))
but it gives me "Error: expecting a single value".
Could anyone explain this error message, and help me do this correctly using dplyr? Thank you.
If you want to use a diff-like function, consider the zoo library, which provides an na.pad argument. Then you can write code like the following:
library(zoo)
diff(as.zoo(1:4), na.pad=T)
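which returns a zoo series of the same length as the input, padded with a leading NA (NA 1 1 1).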
In a data frame setting it would look like this:
library(dplyr)
dat <- data.frame(a = c(8335046, 9624026, 10498935, 10052305, 8708381, 8911262, 10759032))
dat %>% mutate(b = diff(as.zoo(a), na.pad = TRUE))
#          a        b
# 1  8335046       NA
# 2  9624026  1288980
# 3 10498935   874909
# 4 10052305  -446630
# 5  8708381 -1343924
# 6  8911262   202881
# 7 10759032  1847770
This way you can easily increase the number of lags without manually padding additional NAs:
dat %>% mutate(b2 = diff(as.zoo(a), lag = 2, na.pad = TRUE))
#          a       b2
# 1  8335046       NA
# 2  9624026       NA
# 3 10498935  2163889
# 4 10052305   428279
# 5  8708381 -1790554
# 6  8911262 -1141043
# 7 10759032  2050651
We can also use data.table; shift() gives the lagged values directly:
library(data.table)
setDT(data)[, Diffs := RealTaxRevs - shift(RealTaxRevs), by = state]
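For completeness, the same lagged difference in plain dplyr, without zoo (assuming the rows are ordered by year within each state):
library(dplyr)
data %>%
  group_by(state) %>%
  mutate(Diffs = RealTaxRevs - lag(RealTaxRevs)) %>%  # NA for each state's first year
  ungroup()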

R - How to sum a column based on date range? [duplicate]

Suppose I have df1 like this:
Date       Var1
01/01/2015    1
01/02/2015    4
...
07/24/2015    1
07/25/2015    6
07/26/2015   23
07/27/2015   15
Q1: Sum Var1 over the 3 days preceding 7/27/2015 (not including 7/27).
Q2: Sum Var1 over the 3 days preceding 7/25/2015 (which is not the last row); basically, I want to choose any day as the reference day and then compute a rolling sum.
As suggested in one of the comments in the link referenced by @SeñorO, with a little bit of work you can use zoo::rollsum:
library(zoo)
set.seed(42)
df <- data.frame(d = seq.POSIXt(as.POSIXct('2015-01-01'), as.POSIXct('2015-02-14'), by='days'),
                 x = sample(20, size=45, replace=TRUE))
k <- 3
df$sum3 <- c(0, cumsum(df$x[1:(k-1)]),
             head(zoo::rollsum(df$x, k=k), n=-1))
df
##            d  x sum3
## 1 2015-01-01 16    0
## 2 2015-01-02 12   16
## 3 2015-01-03 15   28
## 4 2015-01-04 15   43
## 5 2015-01-05 17   42
## 6 2015-01-06 10   47
## 7 2015-01-07 11   42
The 0, cumsum(...) part pre-populates the first two rows that rollsum does not cover (rollsum(x, k) returns a vector of length length(x)-k+1). The head(..., n=-1) discards the last element, because the nth entry should sum the previous 3 rows and not include its own row.
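For Q2 specifically, once the Date column is converted to a real Date you can sum relative to any reference day directly; a sketch assuming the MM/DD/YYYY format shown in the question:
df1$Date <- as.Date(df1$Date, format = "%m/%d/%Y")
ref <- as.Date("2015-07-25")
sum(df1$Var1[df1$Date >= ref - 3 & df1$Date < ref])  # previous 3 days, excluding ref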
