How to translate data.table code to collapse - r

I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real world examples.
Here's my data.table code:
library(data.table)
library(nycflights13)
data("flights")
flights_DT <- as.data.table(flights)
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
flights_DT[
j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
keyby = c(id_var, by)
][
i = order(-agg_val_var),
j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)),
keyby = by
][
j = .SD[2L],
keyby = by
][
order(-value_share)
]
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-11 by the reprex package (v1.0.0)
I managed to translate the first data.table call, but struggled later on.
It would be great to see how collapse would be used to handle this use case.

So on this, the first thing I'd like to note is that collapse is not, and probably never will be, a full-blown split-apply-combine computing tool like dplyr or data.table. Its focus is not on optimally executing arbitrary code expressions by groups, but on providing advanced and highly efficient grouped, weighted, time series and panel data computations through the broad range of C++-based statistical and data transformation functions it provides. I refer to the vignette on collapse and data.table for further clarity on these points, as well as integration examples.
Accordingly, I think it only makes sense to translate data.table code to collapse if (1) you've come up with an arcane data.table expression to do something statistically complex that data.table is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting), (2) you actually don't need the data.table object but would much rather work with vectors / matrices / data.frames / tibbles, (3) you want to write a statistical program and would much prefer standard-evaluation programming over NSE and data.table syntax, or (4) collapse is indeed substantially faster for your specific application.
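As a quick illustration of point (1): grouped statistics of this kind are single C++-backed function calls in collapse. This is just a sketch, with columns picked arbitrarily from the flights data:
library(collapse)
library(nycflights13)
# grouped median arrival delay and most frequent destination by carrier
fmedian(flights$arr_delay, g = flights$carrier)
fmode(flights$dest, g = flights$carrier)
# grouped (within-carrier) centering of arrival delays
head(fwithin(flights$arr_delay, g = flights$carrier))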
Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you three solutions, ranging from full non-standard evaluation to full standard-evaluation, base-R-style programming.
library(data.table)
library(nycflights13)
library(magrittr)
library(collapse)
data("flights")
flights_DT <- as.data.table(flights)
# Defining a function for the second aggregation
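# (with x sorted in decreasing order, this returns the share of the total contributed by the two largest values)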
myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]
# Solution 1: Non-standard evaluation
flights_DT %>%
fgroup_by(carrier, month, day) %>%
fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
roworder(month, day, -agg_val_var, na.last = NA) %>%
fgroup_by(month, day) %>%
fsummarise(value_share = myFUN(agg_val_var)) %>%
roworder(-value_share)
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE).
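A quick sketch to see that difference (using NA_real_ so the input is numeric):
sum(NA, na.rm = TRUE)  # 0
fsum(NA_real_)         # NA: with all-missing input fsum() returns NA, not 0
Now the hybrid example, which is probably closest to the code you provided: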
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
# Solution 2: Hybrid approach with standard eval and magrittr pipes
flights_DT %>%
get_vars(c(id_var, val_var, by)) %>%
ftransformv(val_var, abs) %>%
collapv(c(id_var, by), fsum) %>%
get_vars(c(by, val_var)) %>%
roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>%
collapv(by, myFUN) %>%
roworderv(val_var, decreasing = TRUE) %>%
frename(replace, names(.) == val_var, "value_share")
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:
# Solution 3: Programming
data <- get_vars(flights_DT, c(id_var, val_var, by))
data[[val_var]] <- abs(.subset2(data, val_var))
g <- GRP(data, c(id_var, by))
data <- add_vars(get_vars(g$groups, by),
fsum(get_vars(data, val_var), g, use.g.names = FALSE))
data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
g <- GRP(data, by)
data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
data <- roworderv(data, "value_share", decreasing = TRUE)
data
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
I refer you to the blog post on programming with collapse for a more interesting example of how this can benefit the development of statistical code.
Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a Windows machine.
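The wrappers are simply the code blocks above placed inside function bodies; a sketch (clp_Hybrid() and clp_Prog() wrap Solutions 2 and 3 in the same way):
DT <- function() {
  flights_DT[
    j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
    keyby = c(id_var, by)
  ][
    i = order(-agg_val_var),
    j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)),
    keyby = by
  ][
    j = .SD[2L],
    keyby = by
  ][
    order(-value_share)
  ]
}
clp_NSE <- function() {
  flights_DT %>%
    fgroup_by(carrier, month, day) %>%
    fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
    roworder(month, day, -agg_val_var, na.last = NA) %>%
    fgroup_by(month, day) %>%
    fsummarise(value_share = myFUN(agg_val_var)) %>%
    roworder(-value_share)
}
This checks equality: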
all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> TRUE
Now the benchmark:
library(microbenchmark)
microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073 100 b
#> clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140 28.16895 100 a
#> clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604 32.94648 100 a
#> clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935 39.24706 100 a
If you care about these milliseconds feel free to optimize, but for 340,000 obs all solutions are bloody fast.

Related

Time spent in each calendar year

I followed some individuals A and B from start to end
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
df
id start end
1 A 2015-01-01 2021-06-12
2 B 2013-01-01 2017-10-10
I would like to calculate the follow-up time for each calendar year. For example, I have 1 year for 2013 (from B), 1 year for 2014 (from B), 2 years for 2015 (from A and B) and so on.
I tried to treat year as an integer and count how many years each individual contributes but due to rounding errors the result is not plausible.
I tried
years<-NULL
for (i in 1:length(df$id)){
years<-c(years, as.character(as.Date(seq.Date(from = df$start[i], to = df$end[i], by = "day"))))
}
library(lubridate)
table(year(years))/365
2013 2014 2015 2016 2017 2018 2019 2020 2021
1.0000000 1.0000000 2.0000000 2.0054795 1.7753425 1.0000000 1.0000000 1.0027397 0.4465753
which is the answer I am trying to get but is computationally inefficient and very slow in large data. I am wondering is there any way to do this without the loop? Or do it more efficiently?
I'm now guessing that you actually don't want to round or truncate anything, so here's a solution that works and gives output similar to your method (correcting the 2016 value):
func <- function(st, ed) {
stopifnot(length(st) == 1, length(ed) == 1)
stL <- as.POSIXlt(st)
edL <- as.POSIXlt(ed)
start_year <- 1900 + stL$year
end_year <- 1900 + edL$year
start_eoy <- as.POSIXlt(paste0(start_year, "-12-31"))
end_eoy <- as.POSIXlt(paste0(end_year, "-12-31"))
firstyear <- (start_eoy$yday - stL$yday) / start_eoy$yday
lastyear <- edL$yday / end_eoy$yday
data.frame(
year = seq(start_year, end_year),
n = c(firstyear, rep(1, max(0, end_year - start_year - 1)), lastyear)
)
}
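As a quick check of the helper on a single individual (not part of the original approach), B's dates give one row per calendar year:
func(df$start[2], df$end[2])
# years 2013 through 2017, with n = 1 for the full years and roughly 0.775 for 2017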
base R
aggregate(n ~ year, data = do.call(rbind, Map(func, df$start, df$end)), FUN = sum)
# year n
# 1 2013 1.0000000
# 2 2014 1.0000000
# 3 2015 2.0000000
# 4 2016 2.0000000
# 5 2017 1.7747253
# 6 2018 1.0000000
# 7 2019 1.0000000
# 8 2020 1.0000000
# 9 2021 0.4450549
dplyr
library(dplyr)
df %>%
with(Map(func, start, end)) %>%
bind_rows() %>%
group_by(year) %>%
summarize(n = sum(n))
# # A tibble: 9 x 2
# year n
# <int> <dbl>
# 1 2013 1
# 2 2014 1
# 3 2015 2
# 4 2016 2
# 5 2017 1.77
# 6 2018 1
# 7 2019 1
# 8 2020 1
# 9 2021 0.445
Sounds like a job for a great package called lubridate. See example:
By the way, I assumed the dates are year-month-day, hence ymd. If not, you can use mdy (month-day-year) for the American date format.
df<-data.frame(id=c("A", "B"), start=as.Date(c("2015-01-01", "2013-01-01")), end=as.Date(c("2021-06-12", "2017-10-10")))
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tidyverse)
df %>%
mutate(across(start:end, ymd),
follow_up_years = interval(start, end)/years(1),
follow_up_months = interval(start, end)/months(1),
follow_up_days = interval(start, end)/days(1),
)
#> id start end follow_up_years follow_up_months follow_up_days
#> 1 A 2015-01-01 2021-06-12 6.443836 77.36667 2354
#> 2 B 2013-01-01 2017-10-10 4.772603 57.29032 1743
Created on 2021-10-28 by the reprex package (v2.0.1)
Edit
I think I understand. I guess we can also just use lubridate intervals:
df %>%
mutate(follow_up_2015 = interval(start, as_date("2015-01-01"))/years(1)) %>%
pull(follow_up_2015) %>%
sum()
#> [1] 2
Created on 2021-10-28 by the reprex package (v2.0.1)
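If you need the per-calendar-year totals from the question, one way to extend this is to clip each follow-up interval to every year and divide by years(1). This is only a sketch (the year range is hard-coded and the convention differs slightly from dividing day counts by 365):
library(tidyr)
df %>%
  crossing(year = 2013:2021) %>%
  mutate(y_start = ymd(paste0(year, "-01-01")),
         y_end = ymd(paste0(year, "-12-31")),
         overlap = pmax(interval(pmax(start, y_start), pmin(end, y_end)) / years(1), 0)) %>%
  group_by(year) %>%
  summarise(n = sum(overlap))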

What is an efficient programming way to find the closest time of a dataset to a reference (larger) dataset

I am searching for an efficient way to find the closest times of a small dataset (x) in comparison to a large dataset (a). The result has to be an index of the length of (a). I have already created a function which works very nicely, however, it is absolutely useless for large data as it takes days to process.
Here is my function:
function(x, a, which = TRUE, na.rm = FALSE) {
  if ("POSIXt" %in% class(x)) x <- as.numeric(x)
  if ("POSIXt" %in% class(a)) a <- as.numeric(a)
  sapply(a, function(y) DescTools::Closest(x, y, which = TRUE, na.rm = FALSE)[1])
}
Both datasets x and a are filtered and therefore have no consistent time stamp, but they are filtered according to the same requirements.
Vector a contains 20 Hz data with a length of 16,020,209 and x contains 30-second data with a length of 26,908.
Any suggestions are very much appreciated! Thank you :)
One can use a rolling join from data.table:
library(data.table)
set.seed(1) # reproducibility on Stack Overflow
DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
idx = seq_len(26908))
DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
idx_hz = seq_len(16020209))
DF_HZ[, x_hz := x + 0] # so we can check
DF_A[, x_a := x + 0] # so we can check
setkey(DF_A, x)
setkey(DF_HZ, x)
# The order(idx_hz) returns the result in the same order as
# DF_HZ; it is not necessary for the join itself.
DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
#> x idx x_a idx_hz x_hz
#> 1: 3072.021 7145 3072.0 1 3072.021
#> 2: 4506.369 10014 4506.5 2 4506.369
#> 3: 7206.883 15415 7207.0 3 7206.883
#> 4: 11718.574 24438 11718.5 4 11718.574
#> 5: 2213.328 5428 2213.5 5 2213.328
#> ---
#> 16020205: 10517.477 22036 10517.5 16020205 10517.477
#> 16020206: 11407.776 23817 11408.0 16020206 11407.776
#> 16020207: 12051.919 25105 12052.0 16020207 12051.919
#> 16020208: 3482.463 7966 3482.5 16020208 3482.463
#> 16020209: 817.366 2636 817.5 16020209 817.366
Created on 2020-11-11 by the reprex package (v0.3.0)
On my machine, the above (not including the creation of the dummy data) takes about 3 s.
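Mapped onto the question's actual objects, the same rolling join could look like this sketch (assuming x is the 30-second POSIXct vector and a the 20 Hz POSIXct vector, as described):
library(data.table)
DT_X <- data.table(time = x, idx = seq_along(x))    # small table
DT_HZ <- data.table(time = a, idx_hz = seq_along(a)) # large table
setkey(DT_X, time)
setkey(DT_HZ, time)
# for every element of 'a', the index of the closest element of 'x'
closest <- DT_X[DT_HZ, roll = "nearest"][order(idx_hz)]$idx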
I would use something like a SQL full join for this task since the second df is small, though it depends on your data size and RAM. Here is a simple example with test data:
library(dplyr)
# demo tibbles
tab1 <- tibble::tribble(
~time_1, ~VALUE_1,
"2020-11-01", 268L,
"2020-11-02", 479L,
"2020-11-03", 345L,
"2020-11-04", 567L,
"2020-11-05", 567L) %>%
dplyr::mutate(time_1 = as.Date(time_1))
tab2 <- tibble::tribble(
~time_2, ~VALUE_2,
"2020-11-01", 268L,
"2020-11-02", 479L) %>%
dplyr::mutate(time_2 = as.Date(time_2))
# calculations
tab1 %>%
dplyr::mutate(ID = dplyr::row_number()) %>% # Build ID from row number
dplyr::full_join(tab2, by = character()) %>%
dplyr::mutate(DIF = abs(time_1 - time_2)) %>%
dplyr::group_by(ID) %>%
dplyr::slice_min(order_by = DIF, n = 1)
time_1 VALUE_1 ID time_2 VALUE_2 DIF
<date> <int> <int> <date> <int> <drtn>
1 2020-11-01 268 1 2020-11-01 268 0 days
2 2020-11-02 479 2 2020-11-02 479 0 days
3 2020-11-03 345 3 2020-11-02 479 1 days
4 2020-11-04 567 4 2020-11-02 479 2 days
5 2020-11-05 567 5 2020-11-02 479 3 days
If size turns out to be a problem, you could split the large data.frame into smaller ones and then run it in a loop. In this case parallel processing would be a great option, since after splitting the large DF the calculations can be run independently.
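A rough sketch of that chunked approach (chunk size and the use of a plain lapply() are assumptions; parallel::mclapply() or the future/furrr packages could replace lapply() where available):
library(dplyr)
chunk_size <- 100000 # with the demo data this is a single chunk; on real data pick a size that fits in RAM
chunks <- split(tab1, ceiling(seq_len(nrow(tab1)) / chunk_size))
res <- lapply(chunks, function(ch) {
  ch %>%
    mutate(ID = row_number()) %>%          # note: ID restarts within each chunk
    full_join(tab2, by = character()) %>%  # cross join, as above
    mutate(DIF = abs(time_1 - time_2)) %>%
    group_by(ID) %>%
    slice_min(order_by = DIF, n = 1)
})
res <- bind_rows(res)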

R - Sample consecutive series of dates in time series without replacement?

I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
ymd("2015-07-03"),
by = "days"),
seq(ymd("2015-08-09"),
ymd("2015-10-01"),
by = "days"),
seq(ymd("2015-11-12"),
ymd("2016-01-03"),
by = "days")),
sample_id = 0L)
I.e.:
> date_data
# A tibble: 211 x 2
dates sample_id
<date> <int>
1 2015-03-22 0
2 2015-03-23 0
3 2015-03-24 0
4 2015-03-25 0
5 2015-03-26 0
6 2015-03-27 0
7 2015-03-28 0
8 2015-03-29 0
9 2015-03-30 0
10 2015-03-31 0
# … with 201 more rows
What I want to do is to take ten 10-day-long samples of continuous dates from within that time series without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10 because that falls completely within the dates column in my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column in date_data such as 1:10.
To be clear, my requirements are:
Each sample would be 10 consecutive days.
The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.
Each 10-day-long sample can't include any date that's not within date_data$dates.
At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 200 rows - 10 for each sample - where sample_id != 0).
I am aware of dplyr::sample_n() but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled...
What's a good way to do this? A for loop?!?! Or perhaps something with purrr? Thank you very much for your help.
UPDATE: Thanks to @gfgm's solution, I was reminded that performance is an important consideration. My real dataset is quite a bit larger, and in some cases I would want to take 20+ samples instead of just 10. Ideally the size of the sample can be changed as well, i.e. not necessarily 10 days long.
This is tricky, as you anticipated, because of the requirement of sampling without replacement. I have a working solution below which achieves a random sample and works fast on a problem of the scale given in your toy example. It should also be fine with more observations, but will get really really slow if you need to pick a lot of points relative to the sample size.
The basic premise is to pick n=10 points, generate the 10 vectors from these points forwards, and if the vectors overlap ditch them and pick again. This is simple and works fine given that 10*n << nrow(df). If you wanted to get 15 subvectors out of your 200 observations this would be a good deal slower.
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
ymd("2015-07-03"),
by = "days"),
seq(ymd("2015-08-09"),
ymd("2015-10-01"),
by = "days"),
seq(ymd("2015-11-12"),
ymd("2016-01-03"),
by = "days")),
sample_id = 0L)
# A function that picks n indices, projects them forward 10,
# and if any of the segments overlap resamples
pick_n_vec <- function(df, n = 10, out = 10) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
vecs <- lapply(points, function(i){i:(i+(out - 1))})
while (max(table(unlist(vecs))) > 1) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
vecs <- lapply(points, function(i){i:(i+(out - 1))})
}
vecs
}
# demonstrate
set.seed(42)
indices <- pick_n_vec(date_data)
for (i in 1:10) {
date_data$sample_id[indices[[i]]] <- i
}
date_data[indices[[1]], ]
#> # A tibble: 10 x 2
#> dates sample_id
#> <date> <int>
#> 1 2015-05-31 1
#> 2 2015-06-01 1
#> 3 2015-06-02 1
#> 4 2015-06-03 1
#> 5 2015-06-04 1
#> 6 2015-06-05 1
#> 7 2015-06-06 1
#> 8 2015-06-07 1
#> 9 2015-06-08 1
#> 10 2015-06-09 1
table(date_data$sample_id)
#>
#> 0 1 2 3 4 5 6 7 8 9 10
#> 111 10 10 10 10 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
marginally faster version
pick_n_vec2 <- function(df, n = 10, out = 10) {
points <- sample(nrow(df) - (out - 1), n, replace = F)
while (min(diff(sort(points))) < out) { # points must be at least 'out' apart so the segments cannot overlap
points <- sample(nrow(df) - (out - 1), n, replace = F)
}
lapply(points, function(i){i:(i+(out - 1))})
}
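Regarding the update in the question (more samples, different sample lengths): both functions already take the number of samples n and the sample length out as arguments, e.g. (a usage sketch):
date_data2 <- date_data
date_data2$sample_id <- 0L
set.seed(42)
indices15 <- pick_n_vec2(date_data2, n = 15, out = 7) # 15 samples of 7 consecutive days each
for (i in seq_along(indices15)) date_data2$sample_id[indices15[[i]]] <- i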

use model object, e.g. panelmodel, to flag data used

Is it possible in some way to use a fit object, specifically the regression object I get from a plm() model, to flag observations in the data used for the regression, indicating whether they were in fact used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious if there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second some data, this example is drawing heavily on Baltagi (2013), table 3.1, found in ?plm,
data("Grunfeld", package = "plm")
dta <- Grunfeld
now I create some semi-random missing values in my data object, dta
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using tidyverse,
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
the results, using stargazer,
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am speculating whether there's some vector in plm.reg I can (easily) use to create a flag in my original data, dta, indicating whether an observation was used or not, i.e. marking the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this, the new element is the vector plm.reg at the end, i.e.,
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried to use broom's augment(), but unfortunately it gave me the error message below where I had hoped it would create some flag,
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
as.data.frame %>%
rowid_to_column %>%
mutate(plm.reg = rowid %in% names(plm.reg$residuals))
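A quick sanity check (not from the original answer): plm stores one residual per observation actually used in the fit, so
length(plm.reg$residuals) # 178, matching "Observations" in the stargazer output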
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code; this is another Baltagi example from ?plm,
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>% mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
as_tibble() %>% select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file (assuming the fitted model object there is called p), the code should be,
Grunfeld %>% rowid_to_column %>%
mutate(plm.reg = rowid %in% names(p$residuals)) %>% as_tibble()

Skipping rows until row with a certain value

I need to to read a .txt file from an URL, but would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:
"
... (number of rows)
... (number of rows)
... (number of rows)
DATE VALUE
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
"
I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards (including "DATE // VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!
Best,
c.
Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(tidyverse)
library(stringr)
df <- data_frame(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.
Combining fread with unix tools:
> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
DATE VALUE
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
---
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
>
Using:
file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
'https://fred.stlouisfed.org/data/DGS10.txt',
'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1
# option 1
l <- Map(function(x,y) read.table(text = x, skip = y), x = text.list, y = skip.rows)
# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))
will get you a list of data.frames (option 1) or data.tables (option 2).
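As a side note not covered in the answers above (so take it as a sketch): fread() can also jump straight to the header by giving skip a string to match, and na.strings can turn the '.' placeholders into NA:
library(data.table)
dt <- fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt",
            skip = "DATE", na.strings = ".")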
