Skipping rows until row with a certain value - r

I need to read a .txt file from a URL, but would like to skip the rows until a row with a certain value. The URL is and the data takes the following form:
... (number of rows)
... (number of rows)
... (number of rows)
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards (including "DATE // VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!

Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(dplyr)
library(stringr)
df <- tibble(lines = readr::read_lines("")) %>%
  filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
  mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
         value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
  select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.

Combining fread with unix tools:
> fread("curl -s | sed -n -e '/^DATE.*VALUE/,$p'")
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792

library(data.table)  # only needed for option 2
file.names <- c('')  # one or more file paths or URLs
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1
# option 1
l <- Map(function(x,y) read.table(text = x, skip = y), x = text.list, y = skip.rows)
# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))
This gets you a list of data.frames (option 1) or data.tables (option 2).
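A self-contained sketch of the same grep-then-read idea may help when testing without network access. It uses a made-up character vector in place of the downloaded file; the preamble lines and the names raw_lines and header_row are assumptions for illustration, not the real data.

```r
# Hypothetical stand-in for readLines(<url>): some preamble, the header, then data
raw_lines <- c(
  "Title: some series",
  "Source: some agency",
  "",
  "DATE        VALUE",
  "1945-10-01  1245",
  "1946-01-01  .",
  "1946-10-01  1298"
)

# Position of the first line that looks like the header
header_row <- grep("^DATE\\s+VALUE", raw_lines)[1]

# Keep the header and everything after it; "." marks a missing value
dat <- read.table(text = raw_lines[header_row:length(raw_lines)],
                  header = TRUE, na.strings = ".",
                  stringsAsFactors = FALSE)
dat
```

Subsetting the lines before parsing keeps the header row as column names, which is what the question asks for; passing skip = header_row - 1 together with header = TRUE to read.table should have the same effect.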


Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load libraries
library(pdftools)
library(stringr)
# link to report
url <- ""
# read data through pdftools
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
                       table_number,
                       number_columns,
                       column_names,
                       first_row,
                       last_row) {
  data <- list_of_tables[table_number]
  data <- trimws(data)
  data <- strsplit(data, "\n")
  data <- data[[1]]
  data <- data[min(grep(first_row, data)):
               max(grep(last_row, data))]
  data <- str_split_fixed(data, " {2,}", number_columns)
  data <- data.frame(data)
  names(data) <- column_names
  data
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data,3,5,names,"Care fees","Basic and diluted")
And the output.
                                        6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1  Care fees                            210,187      194,603      398,206
2  Management fees                      59,746       50,959       105,552
3  Interest received                    364          42           41
4  Other income                         3,942        2,260        4,998
5  Total revenue                        274,239      247,864      508,797
7  Fair-value movement of
8  investment properties                3            261,346      285,143       745,885
9  Total income                         535,585      533,007      1,254,682
11 Operating expenses                   (265,148)    (225,380)    (466,238)
12 Depreciation and
13 amortisation expenses                (22,996)     (17,854)     (35,698)
14 Finance costs                        (19,355)     (15,250)     (30,664)
15 Impairment loss                      2            (10,784)     -             -
16 Total expenses                       (318,283)    (258,484)    (532,600)
18 Profit before income tax             217,302      274,523      722,082
19 Income tax (expense) / credit        (23,316)     6,944        (29,209)
20 Profit for the period                193,986      281,467      692,873
22 Earnings per share
23 Basic and diluted (cents per share)  38.8         56.3         138.6
How can I best circumvent this issue?
Many thanks in advance!
readr::read_fwf() is meant for fixed-width files, but it also performs pretty well on text from pdftools once the header and footer rows are removed, even if it has to guess the column widths (those can be specified too).
library(dplyr, warn.conflicts = F)
library(pdftools)
url <- ""
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
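# re-connect strings in first column if values were split between rows; the condition is an assumed reconstruction (previous row has a label but no values)
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))  # drop the label-only rows that were merged into the next row
}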
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
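If you would rather stay with the split-on-two-or-more-spaces approach from the question, one option is to read each line from the right: the three value fields are always last, and a note, when present, sits between the label and the values. The sketch below is base R only and runs on two made-up lines shaped like the report; the names parts, rows and tab are illustrative, and it assumes every data line ends with exactly three value fields.

```r
# Two hypothetical lines in the layout produced by pdf_text()
lines <- c(
  "Care fees                        210,187    194,603    398,206",
  "Impairment loss           2     (10,784)          -          -"
)

# Split on runs of two or more spaces, as in the question
parts <- strsplit(trimws(lines), " {2,}")

# Five fields means a note is present; four fields means it is not
rows <- lapply(parts, function(p) {
  n <- length(p)
  data.frame(label = p[1],
             note  = if (n == 5) p[2] else NA_character_,
             v1 = p[n - 2], v2 = p[n - 1], v3 = p[n],
             stringsAsFactors = FALSE)
})
tab <- do.call(rbind, rows)
tab
```

Label-only lines such as "Fair-value movement of" have a single field and would need to be merged into the following row first, as the read_fwf() answer does.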

Aggregate character string into vector in R

I have a data table test:
I want to aggregate the unique key values by id into a vector using the data.table package.
Expected output:
"2365", "3709"
"6734", "1908", "4523"
So this should work like the SQL array_agg function.
I tried:
res <- test[, list(key_array = paste(unique(key), collapse = ", ")), by = "id"]
but I get just a string. I need to be able to find the length of each vector and operate on its individual elements (find the intersection of two vectors, for example).
1. Base R
This is an aggregate one-liner.
x <- 'id key
1 2365
1 2365
1 3709
2 6734
2 1908
2 4523'
test <- read.table(textConnection(x), header = TRUE)
aggregate(key ~ id, test, \(x) c(unique(x)))
#> id key
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
But if user @Chris's comment is right, then the right solution is as follows.
aggregate(key ~ id, test, \(x) paste(unique(x), collapse = ", "))
Note that both c(unique(x)) and as.character(c(unique(x))) will output a list column, so the latter solution is right anyway.
2. Package data.table
Once again a one-liner.
The output is a list column, with each list member a character vector. To keep the keys as integers, use list(unique(key)) instead.
library(data.table)
res <- setDT(test)[, .(key_array = list(as.character(unique(key)))), by = id]
res
#> id key_array
#> 1: 1 2365,3709
#> 2: 2 6734,1908,4523
str(res)
#> Classes 'data.table' and 'data.frame': 2 obs. of 2 variables:
#> $ id : int 1 2
#> $ key_array:List of 2
#> ..$ : chr "2365" "3709"
#> ..$ : chr "6734" "1908" "4523"
#> - attr(*, ".internal.selfref")=<externalptr>
Created on 2022-06-14 by the reprex package (v2.0.1)
Then, to access the vectors, use two extractors: one to extract the column and another to extract each vector.
res$key_array[[1]]
#> [1] "2365" "3709"
res$key_array[[2]]
#> [1] "6734" "1908" "4523"
Created on 2022-06-14 by the reprex package (v2.0.1)
3. dplyr solution
Group by id and collapse the unique keys into a single string.
library(dplyr)
test %>%
group_by(id) %>%
summarise(key_array = paste(unique(key), collapse = ", "))
#> # A tibble: 2 × 2
#> id key_array
#> <int> <chr>
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
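Since the question asks for vectors whose lengths and intersections can be computed, here is a small base R sketch of that follow-up step. It rebuilds the example data with key stored as character (an assumption) and keeps the result as a plain named list called key_list, a name chosen for illustration.

```r
# Example data from the question, with keys kept as character
test <- data.frame(id  = c(1, 1, 1, 2, 2, 2),
                   key = c("2365", "2365", "3709", "6734", "1908", "4523"),
                   stringsAsFactors = FALSE)

# One character vector of unique keys per id, stored in a named list
key_list <- lapply(split(test$key, test$id), unique)

# Length of each vector
lengths(key_list)

# Elements shared by the vectors for id 1 and id 2
intersect(key_list[["1"]], key_list[["2"]])
```

The same calls work on the list column from the data.table answer, for example lengths(res$key_array).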

Add a row for each month for each id in R

So I have a list of IDs.
I would like to create a table consisting of rows for each month of a year for each id in the list.
I tried to use a rbind in a for loop but this takes forever... as such:
for (k in seq_along(members))
for (i in seq(1,12))
df1<-rbind(df1, data.frame(MemYearMo=paste(members[k],"_",year,formatC(i,width=2,flag=0), sep="")))
where members is obviously my list of id's.
My desired output is:
XXX 201712
where XXX is one of my id's.
What would be the fastest way to do something like this?
Using rep:
id <- letters[1:10]
month <- 201701:201712
# repeat each id once per month, and cycle the months once per id
ids <- rep(id, each = length(month))
months <- rep(month, times = length(id))
data.frame(id=ids, ym=months, id_ym = paste(ids, months, sep="_"))
id ym id_ym
1 a 201701 a_201701
2 a 201702 a_201702
3 a 201703 a_201703
I'm not sure I understand your desired output but you can use expand.grid to generate the different combinations of member ids with the months of the year.
df <- expand.grid(list(members = 901:903, ym = 201701:201712))
df$MemYearMo <- paste(df$members, df$ym, sep = "_")
df
#> members ym MemYearMo
#> 1 901 201701 901_201701
#> 2 902 201701 902_201701
#> 3 903 201701 903_201701
#> 4 901 201702 901_201702
#> 5 902 201702 902_201702
#> 6 903 201702 903_201702
#> ...(and so on)...
#> 34 901 201712 901_201712
#> 35 902 201712 902_201712
#> 36 903 201712 903_201712
Using data.table and paste0 you can do this, assuming that your ids are unique.
library(data.table)
id <- data.table(letters)  # assumed example ids; the original definition is missing
id <- id[, .(output = paste0(letters, paste0("_2017", c(paste0(0, 1:9), 10:12)))), by = .(letters)]
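For the exact MemYearMo format built in the question's loop, with the year held in a variable, a vectorised base R sketch could look like this. The member ids are made up, and grid is an illustrative name.

```r
members <- c("901", "902", "903")   # hypothetical ids
year <- 2017

# All month/member combinations; the first argument varies fastest,
# so each member's twelve months stay together
grid <- expand.grid(month = 1:12, member = members, stringsAsFactors = FALSE)

# Zero-padded month appended to the year, e.g. "901_201701"
grid$MemYearMo <- sprintf("%s_%d%02d", grid$member, year, grid$month)
head(grid, 3)
```

Everything is built in one vectorised call, which avoids growing a data frame with rbind inside a loop, the usual cause of the slowdown described in the question.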

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!
Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
dat
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
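The four steps from the question (split, drop blanks and "NA", reorder both vectors by name, collapse again) can also be written as one small helper applied over the two columns with mapply. This is a sketch on two of the example rows; sort_pair is a hypothetical helper name, and it assumes the serial numbers and names in a row always split into the same number of pieces.

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "987\n654\n321\n975\n "),
  Year      = c(2015, 2010),
  Name      = c("Michael\nLewis\nPaul\n ", "John\nPaul\nGeorge\nRingo\nNA"),
  stringsAsFactors = FALSE
)

# Split one row's strings, drop blank and "NA" names, sort both by name, re-collapse
sort_pair <- function(serial, name) {
  s <- trimws(strsplit(serial, "\n", fixed = TRUE)[[1]])
  n <- trimws(strsplit(name, "\n", fixed = TRUE)[[1]])
  keep <- nzchar(n) & n != "NA"
  ord <- order(n[keep])
  c(SerialNum = paste(s[keep][ord], collapse = "\n"),
    Name      = paste(n[keep][ord], collapse = "\n"))
}

# mapply returns a 2 x nrow matrix; transpose it to one row per record
res <- t(mapply(sort_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE))
Data$SerialNum <- res[, "SerialNum"]
Data$Name <- res[, "Name"]
Data
```

This still loops internally, but it avoids building an intermediate data frame per row.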
eipi10 offered a great answer. In addition, I'd like to leave what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the part of temp2 shown in the code. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(splitstackshape)
library(dplyr)
library(data.table)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#Source: local data frame [1 x 12]
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")

Count number of rows meeting criteria in another table - R Programming

I have two tables, one with property listings and another one with contacts made for a property (i.e. if someone is interested in the property they will "contact" the owner).
Sample "listings" table below:
listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"), listing_date = c("01/03/2015", "14/03/2015", "30/03/2015", "07/04/2015", "18/04/2015"))
listings$listing_date <- as.Date(listings$listing_date, "%d/%m/%Y")
#    id city listing_date
#1 6174    A   2015-03-01
#2 2175    B   2015-03-14
#3 9176    B   2015-03-30
#4 4176    B   2015-04-07
#5 9177    A   2015-04-18
Sample "contacts" table below:
contacts <- data.frame (id = c ("6174", "6174", "6174", "6174", "2175", "2175", "2175", "9176", "9176", "4176", "4176", "9177"), contact_date = c("13/03/2015","14/04/2015", "27/03/2015", "13/04/2015", "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015", "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015" ))
contacts$contact_date <- as.Date(contacts$contact_date, "%d/%m/%Y")
# id contact_date
#1 6174 2015-03-13
#2 6174 2015-04-14
#3 6174 2015-03-27
#4 6174 2015-04-13
#5 2175 2015-03-15
#6 2175 2015-03-16
#7 2175 2015-03-17
#8 9176 2015-03-30
#9 9176 2015-06-01
#10 4176 2015-05-08
#11 4176 2015-05-09
#12 9177 2015-04-23
1. I need to count the number of contacts made for a property within 'x' days of listing. The output should be a new column added to "listings" with # contacts:
Sample ('x' = 30 days)
# id city listing_date ngs
#1 6174 A 2015-03-01 2
#2 2175 B 2015-03-14 3
#3 9176 B 2015-03-30 1
#4 4176 B 2015-04-07 0
#5 9177 A 2015-04-18 1
I have done this with a for loop, but it is horribly slow on live data:
n <- nrow(listings)
mat <- vector ("integer", n)
for (i in 1:n) {
mat[i] <- nrow (contacts[contacts$id==listings[i,"id"] & as.numeric (contacts$contact_date - listings[i,"listing_date"]) <=30,])
}
listings$ngs <- mat
2. I need to prepare a histogram of # contacts vs days with 'x' as a variable, through the manipulate function. I can't figure out a way to do all this inside the manipulate function.
Here's a possible solution using data.table rolling joins
library(data.table)
# key `listings` by the proper columns in order to perform the binary join
setkey(setDT(listings), id, listing_date)
# Perform a binary rolling join while extracting the matched indices and counting them
indx <- data.table(listings[contacts, roll = 30, which = TRUE])[, .N, by = V1]
# Joining back to `listings` by proper rows while assigning the counts by reference
listings[indx$V1, ngs := indx$N]
# id city listing_date ngs
# 1: 2175 B 2015-03-14 3
# 2: 4176 B 2015-04-07 NA
# 3: 6174 A 2015-03-01 2
# 4: 9176 B 2015-03-30 1
# 5: 9177 A 2015-04-18 1
I'm not sure if your actual id values are factor, but I'll start by making those numeric. Using them as factors will cause you problems:
listings$id <- as.numeric(as.character(listings$id))
contacts$id <- as.numeric(as.character(contacts$id))
Then, the strategy is to calculate the "days since listing" value for each contact and add this to your contacts data.frame. Then, aggregate this new data.frame (in your example, sum of contacts within 30 days), and then merge the resulting count back into your original data.
contacts$ngs <- contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
a <- aggregate(ngs ~ id, data = contacts, FUN = function(x) sum(x <= 30))
merge(listings, a)
# id city listing_date ngs
# 1 2175 B 2015-03-14 3
# 2 4176 B 2015-04-07 0
# 3 6174 A 2015-03-01 2
# 4 9176 B 2015-03-30 1
# 5 9177 A 2015-04-18 1
indx <- match(contacts$id, listings$id)
days_since <- contacts$contact_date - listings$listing_date[indx]
n <- with(contacts[days_since <= 30, ], tapply(id, id, length))
listings$n <- n[match(listings$id, names(n))]
# listings without any contact inside the window get a count of 0
listings$n[is.na(listings$n)] <- 0
It's similar to Thomas' answer but uses tapply and match instead of aggregate and merge.
You could use the dplyr package. First merge the data (the name merged is a placeholder):
merged <- merge(contacts, listings, by = "id")
Set a target number of days:
number.of.days <- 30
Then group the data by ID (group_by), exclude the contacts that fall outside the time frame (filter) and count the number of rows (summarise).
result <- merged %>% group_by(id) %>% filter(contact_date <= listing_date + number.of.days) %>% summarise(count.of.contacts = length(id))
I think there are a number of ways this could be solved, but I have found dplyr to be very helpful in a lot of circumstances.
Sorry, I should have thought about that a little more. Does this work?
result <- merged %>% group_by(id,city,listing_date) %>% summarise(ngs = length(id[which(contact_date < listing_date + number.of.days)]))
I don't think zero results can be passed sensibly through the filter stage (understandably, the goal is usually the opposite). I'm not too sure what sort of impact the 'which' component will have on processing time, likely to be slower than using the 'filter' function but might not matter.
Using dplyr for your first problem:
library(dplyr)
left_join(contacts, listings, by = c("id" = "id")) %>%
filter(abs(listing_date - contact_date) < 30) %>%
group_by(id) %>% summarise(cnt = n()) %>%
right_join(listings, by = "id")
And the output is:
id cnt city listing_date
1 6174 2 A 2015-03-01
2 2175 3 B 2015-03-14
3 9176 1 B 2015-03-30
4 4176 NA B 2015-04-07
5 9177 1 A 2015-04-18
I'm not sure I understand your second question to answer it.
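For the second part of the question, it may help to wrap the count in a function of x so the same code can be re-run for any window, for instance from a slider. The sketch below is base R on the sample data; count_contacts is a hypothetical helper name, and it assumes that contacts dated before the listing should not be counted, which the question does not state.

```r
listings <- data.frame(
  id = c("6174", "2175", "9176", "4176", "9177"),
  listing_date = as.Date(c("2015-03-01", "2015-03-14", "2015-03-30",
                           "2015-04-07", "2015-04-18")),
  stringsAsFactors = FALSE
)
contacts <- data.frame(
  id = c("6174", "6174", "6174", "6174", "2175", "2175", "2175",
         "9176", "9176", "4176", "4176", "9177"),
  contact_date = as.Date(c("2015-03-13", "2015-04-14", "2015-03-27", "2015-04-13",
                           "2015-03-15", "2015-03-16", "2015-03-17", "2015-03-30",
                           "2015-06-01", "2015-05-08", "2015-05-09", "2015-04-23")),
  stringsAsFactors = FALSE
)

# Number of contacts per listing made within x days of the listing date
count_contacts <- function(listings, contacts, x = 30) {
  days <- as.numeric(contacts$contact_date -
                       listings$listing_date[match(contacts$id, listings$id)])
  in_window <- !is.na(days) & days >= 0 & days <= x
  # Using the listing ids as factor levels keeps listings with zero contacts
  counts <- table(factor(contacts$id[in_window], levels = listings$id))
  as.integer(counts)
}

listings$ngs <- count_contacts(listings, contacts, x = 30)
listings
```

A histogram for a given x is then hist(count_contacts(listings, contacts, x)), which is the kind of expression that can be placed inside manipulate() with a slider for x.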
