Extracting information from PDFs that have line spills using R

I am trying to extract information from PDF files using R. The data I want are in tables, although the tables aren't recognised by R.
I am using the pdftools package to read in the PDF, export it to a text file, and then re-read it line by line.
The files look like this.
I want to extract the "Net cash from / (used in) operating activities" figures, but as you can see, the line spills make this very hard.
txt <- pdftools::pdf_text("test.pdf")   # avoid masking the pdf_text() function
write.table(txt, "out.txt")
just <- readLines("out.txt")            # read the exported text back in line by line
> just[30:40]
[1] " (g) insurance costs - (137)"
[2] " 1.3 Dividends received (see note 3) - -"
[3] " 1.4 Interest received 9 21"
[4] " 1.5 Interest and other costs of finance paid - -"
[5] " 1.6 Income taxes paid - -"
[6] " 1.7 Government grants and tax incentives - -"
[7] " 1.8 Other (provide details if material) - -"
[8] " 1.9 Net cash from / (used in) operating"
[9] " (1,258) (3,785)"
[10] " activities"
I want to grab the numbers (1,258) and (3,785), still with the parentheses around them.
A common problem is that the numbers may end up on line 8, 9, or 10 (using my example above as a reference), so I can't simply write code that grabs the data 'next' to "Net cash from / (used in) operating activities".

This code almost arrives at the desired result:
> text_file <- readLines("out.txt")
> operating_line <- grep("Net cash from / \\(used in\\) operat", text_file)
> operating_line <- operating_line[1]
> number_line1 <- text_file[operating_line]
> number_line2 <- text_file[operating_line + 1]
> number_line3 <- text_file[operating_line - 1]
> if (gsub("[^()[:digit:],]+", "", number_line1) != "") {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line1)
+ } else if (gsub("[^()[:digit:],]+", "", number_line2) != "") {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line2)
+ } else {
+ numbers <- gsub("[^()[:digit:],]+", "", number_line3)
+ }
> numbers <- gsub("\\d+\\(\\)", "", numbers)
> numbers
[1] "(1,258)(3,785)"
However, there is no gap between (1,258) and (3,785), i.e. they are not being identified as separate elements.
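One way to keep the two figures as separate elements is to extract each figure as its own match instead of deleting everything around it. The snippet below is only a sketch: it reuses number_line1, number_line2 and number_line3 from the code above, and the pattern assumes the figures are either wrapped in parentheses or contain a thousands separator (so item numbers like "1.9" are not picked up).
get_figures <- function(line) {
  # each parenthesised figure, e.g. "(1,258)", or bare figure with a comma, e.g. "3,785"
  regmatches(line, gregexpr("\\(\\d{1,3}(,\\d{3})*\\)|\\b\\d{1,3}(,\\d{3})+\\b", line))[[1]]
}
figures <- unlist(lapply(c(number_line1, number_line2, number_line3), get_figures))
figures
# [1] "(1,258)" "(3,785)"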

Related

How to find the frequency of words in book titles that I have scraped from a website in R

I am very new to R and web scraping. For practice I am scraping book titles from a website and working out some basic stats using the titles. So far I have managed to scrape the book titles, add them to a table, and find the mean title length.
I now want to find the most commonly used word in the book titles; it is probably 'the', but I want to prove this using R. At the moment my program only looks at the full book titles, so I need to split the titles into individual words in order to count how often each word occurs. However, I am not sure how to do this.
Code:
library(rvest)
url <- 'http://books.toscrape.com/index.html'
bookNames <- read_html(url) %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "product_pod", " " ))]//a') %>%
  html_text()
View(bookNames)
values <- lapply(bookNames, nchar)
mean(unlist(values))
bookNames <- tolower(bookNames)
sort(table(bookNames), decreasing = TRUE)[1:2]
I think splitting every word into a new list would solve my problem, yet I am not sure how to do this.
Thanks in advance.
Above is the table of books I have been able to produce.
You can get all the book titles with:
library(rvest)
url <- 'http://books.toscrape.com/index.html'
url %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title') -> titles
titles
# [1] "A Light in the Attic"
# [2] "Tipping the Velvet"
# [3] "Soumission"
# [4] "Sharp Objects"
# [5] "Sapiens: A Brief History of Humankind"
# [6] "The Requiem Red"
# [7] "The Dirty Little Secrets of Getting Your Dream Job"
#....
To get the most common words in the title then you can split the string on whitespace and use table to count the frequency.
head(sort(table(tolower(unlist(strsplit(titles, '\\s+')))), decreasing = TRUE))
#  the    a   of  #1)  and  for
#   14    3    3    2    2    2
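If tokens such as "#1)" bother you, one option (just a sketch, reusing the titles object from above) is to strip punctuation before counting:
words <- unlist(strsplit(tolower(titles), '\\s+'))
words <- gsub('[[:punct:]]', '', words)   # drop characters such as "#", ")" and ":"
words <- words[words != '']               # discard tokens that were punctuation only
head(sort(table(words), decreasing = TRUE))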

Removing dates and other junk from texts using R

I am cleaning a huge dataset made up of tens of thousands of texts using R. I know regular expressions will do the job conveniently, but I am not good with them. I have combed Stack Overflow but could not find a solution. This is my dummy data:
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982",
"04/02/2016 Health is a priority: WAI000553",
"09/ 08/2016 Economy is bad: 2031CE8D",
": : 21 / 05 / 13: Vehicle license is needed: DPH2790 ")
I want to remove all the dates, punctuation and IDs, and want my result to be this:
[1] "Education is good"
[2] "Health is a priority"
[3] "Economy is bad"
[4] "Vehicle license is needed"
Any help in R will be appreciated.
I think specificity is in order here:
First, let's remove the date-like strings. I'll assume either mm/dd/yyyy or dd/mm/yyyy, where the first two can be 1-2 digits, and the third is always 4 digits. If this is variable, the regex can be changed to be a little more permissive:
foo_data2 <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}", "", foo_data)
foo_data2
# [1] " Education is good: WO0001982" " Health is a priority: WO0002021" " Economy is bad: WO001999" " Vehicle license is needed: WO001050"
From here, the abbreviations seem rather easy to remove, as the other answers have demonstrated. You have not specified whether the abbreviation is hard-coded to be anything after a colon, a number prefixed with "WO", or just some one-word combination of letters and numbers. Those could be:
gsub(":.*", "", foo_data2)
# [1] " Education is good" " Health is a priority" " Economy is bad" " Vehicle license is needed"
gsub("\\bWO\\S*", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
gsub("\\b[A-Za-z]+\\d+\\b", "", foo_data2)
# [1] " Education is good: " " Health is a priority: " " Economy is bad: " " Vehicle license is needed: "
The : removal should be straightforward, and using trimws(.) will remove the leading/trailing spaces.
This can obviously be combined into a single regex (using the logical | with pattern grouping) or a single R call (nested gsub) without complication; I kept them broken apart for discussion.
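For illustration, one possible single-call version (a sketch; it assumes the 4-digit-year foo_data shown in the data block below, and that everything from the first colon onward can be dropped):
# remove either a date or the ": <id>" tail, then trim the leftover spaces
out <- gsub("\\d{1,2}\\s*/\\s*\\d{1,2}\\s*/\\s*\\d{4}|:.*", "", foo_data)
trimws(out)
# [1] "Education is good"  "Health is a priority"  "Economy is bad"  "Vehicle license is needed"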
I think https://stackoverflow.com/a/22944075/3358272 is a good reference for regex in general; note that while that page shows many regex constructs with single backslashes, R requires all of those to use double backslashes (e.g., \d in regex needs to be \\d in R). The exception to this is R 4's new raw strings, where these two are identical:
"\\b[A-Za-z]+\\d+\\b"
r"(\b[A-Za-z]+\d+\b)"
Using stringr try this:
library(stringr)
library(magrittr)
str_remove_all(foo_data, "\\/|\\d+|\\: WO") %>%
str_squish()
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)
data
foo_data <- c("03 / 05 / 2016 Education is good: WO0001982", "04/02/2016 Health is a priority: WO0002021",
"09/ 08/2016 Economy is bad: WO001999", "09/08/ 2016 Vehicle license is needed: WO001050")
gsub(".*\\d{4}[[:space:]]+(.*):.*", "\\1", foo_data)
#> [1] "Education is good" "Health is a priority"
#> [3] "Economy is bad" "Vehicle license is needed"
Created on 2021-04-22 by the reprex package (v2.0.0)

R: pasting (or combining) a variable amount of rows together as one

I have a text file I am trying to parse and put the information into a data frame. Each 'event' may or may not have some notes with it, and the notes can span a varying number of rows. I need to concatenate the notes for each event into one string to store in a column of the data frame.
ID: 20470
Version: 1
notes:
ID: 01040
Version: 2
notes:
The customer was late.
Project took 20 min. longer than anticipated
Work was successfully completed
ID: 00000
Version: 1
notes:
Customer was not at home.
ID: 00000
Version: 7
notes:
Fax at 2:30 pm
Called but no answer
Visit home no answer
Left note on door with call back number
Made a final attempt on 12/5/2013
closed case on 12/10 with nothing resolved
So, for example, for the second event the notes should be one long string: "The customer was late. Project took 20 min. longer than anticipated Work was successfully completed", which would then be stored in the notes column of the data frame.
For each event I know how many rows the notes span.
Something like this (actually, you would be happier and learn more figuring it out yourself, I was just procrastinating between two tasks):
x <- readLines("R/xample.txt") # you'll probably read it from a file
ids <- grep("^ID:", x) # detecting lines starting with ID:
versions <- grep("^Version:", x)
notes <- grep("^notes:", x)
nStart <- notes + 1 # lines where the notes start
nEnd <- c(ids[-1]-1, length(x)) # notes end one line before the next ID: line
ids <- sapply(strsplit(x[ids], ": "), "[[", 2)
versions <- sapply(strsplit(x[versions], ": "), "[[", 2)
notes <- mapply(function(i,j) paste(x[i:j], collapse=" "), nStart, nEnd)
df <- data.frame(ID=ids, ver=versions, note=notes, stringsAsFactors=FALSE)
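As a small follow-up (just a sketch, operating on the df built above), the stray leading/trailing spaces in the pasted notes can be squashed afterwards:
df$note <- trimws(gsub("\\s+", " ", df$note))  # collapse runs of whitespace and trim the ends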
dput of data
> dput(x)
c("ID: 20470", "Version: 1", "notes: ", " ", " ", "ID: 01040",
"Version: 2", "notes: ", " The customer was late.", "Project took 20 min. longer than anticipated",
"Work was successfully completed", "", "ID: 00000", "Version: 1",
"notes: ", " Customer was not at home.", "", "ID: 00000", "Version: 7",
"notes: ", " Fax at 2:30 pm", "Called but no answer", "Visit home no answer",
"Left note on door with call back number", "Made a final attempt on 12/5/2013",
"closed case on 12/10 with nothing resolved ")

Extracting layers from a large netcdf file using R

I am trying to extract layers from a large netcdf file using R.
My nc file has 5 variables:
[1] " double time_bnds[bnds,time] "
[1] " double plev_bnds[bnds,plev] "
[1] " double lat_bnds[bnds,lat] "
[1] " double lon_bnds[bnds,lon] "
[1] " float hur[lon,lat,plev,time] "
[1] " standard_name: relative_humidity"
[1] " long_name: Relative Humidity"
So I want to get the relative humidity (hur) data for all lon and lat, at a specific plev and a specific time:
plev<-get.var.ncdf(ncin,"plev")
time<-get.var.ncdf(ncin,"time")
plev2<-plev[2]
time2<-time[2]
hur<-get.var.ncdf(ncin,"hur",start=c(1,1,plev2,time2), count=c(1,1,2,2))
I am getting this error:
Error in R_nc_get_vara_double: NetCDF: Index exceeds dimension bound
Var: hur  Ndims: 4  Start: 711901,92499,0,0  Count: 2,2,144,192
Error in get.var.ncdf(ncin, "hur", start = c(1, 1, plev2, time2), count = c(-1, :
  C function R_nc_get_vara_double returned error
How do I solve this?
edit: solution
lat and lon are vectors:
hur <- get.var.ncdf(ncin, "hur", start = c(1, 1, 2, 2), count = c(dim(lon), dim(lat), 1, 1))
Previously I passed the actual value of plev2 directly, but start expects the index, which is 2 here.
Thank you @pascal.
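For reference, a rough equivalent with the newer ncdf4 package might look like this (a sketch; the variable and dimension names are assumed to match the file above, and "myfile.nc" is a placeholder):
library(ncdf4)

ncin <- nc_open("myfile.nc")              # placeholder file name
lon  <- ncvar_get(ncin, "lon")
lat  <- ncvar_get(ncin, "lat")

# start/count are indices along each dimension (lon, lat, plev, time):
# all longitudes and latitudes, 2nd pressure level, 2nd time step
hur <- ncvar_get(ncin, "hur",
                 start = c(1, 1, 2, 2),
                 count = c(length(lon), length(lat), 1, 1))
nc_close(ncin)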

Splitting a string in R, different split argument elements

I imported some data with no column names, so now I have just over a million rows, and 1 column (instead of 5 columns).
Each row is formatted like this:
x <- "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
strsplit( x , split = c(" ", " ", "%", " "))
and got
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct" "19"
[5] "2012" "23:59:01:"
[7] "%FWSM-6-305011:" "Built"
[9] "dynamic" "tcp"
[11] "translation" "from"
[13] "Inside:10.2.45.62/56455" "to"
[15] "outside:192.101.136.224/9874"
I know that it has to do with recycling of the split argument, but I can't seem to figure out how to get it the way I want:
[[1]]
[1] "2012-10-19T16:59:01-07:00" "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01"   "%FWSM-6-305011"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
Each row has a different message as the fifth element, but after the 4th element I just want to keep the rest of the string together.
Any help would be appreciated.
You can use paste with the collapse argument to combine every element starting with the fifth element.
A <- strsplit( x = "2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874", split = c(" ", " ", "%", " "))
c(A[[1]][1:4], paste(A[[1]][5:length(A[[1]])], collapse=" "))
As @DWin points out, split = c(" ", " ", "%", " ") is not used in order; in other words, it is identical to split = c(" ", "%").
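As an alternative, the five fields could be captured with a single regular expression (only a sketch; the pattern is tailored to the sample line above and assumes the third field always looks like "<190>Oct 19 2012 23:59:01" and the fourth starts with "%"):
pat <- "^(\\S+) (\\S+) (<[0-9]+>\\w+ [0-9]+ [0-9]+ [0-9:]+): (%\\S+): (.*)$"
m <- regmatches(x, regexec(pat, x))
lapply(m, `[`, -1)   # drop the full match, keep the five captured fields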
I think here you don't need to use strsplit. I use read.table to read the lines via its text argument. Then you aggregate columns using paste. Since you have a lot of rows, it is better to do the column aggregation within a data.table.
dt <- read.table(text=x)
library(data.table)
DT <- as.data.table(dt)
DT[ , c('V3','V8') := list(paste(V3,V4,V5),
V8=paste(V8,V9,V10,V11,V12,V13,V14,V15))]
DT[,paste0('V',c(1:3,6:7,8)),with=FALSE]
V1 V2 V3 V6 V7
1: 2012-10-19T16:59:01-07:00 192.101.136.140 <190>Oct 19 2012 23:59:01: %FWSM-6-305011:
V8
1: Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874
Here is a function that I think works in the way that you thought strsplit functioned:
split.seq <- function(x, delimiters) {
  break.point <- regexpr(delimiters[1], x)
  first  <- mapply(substring, x, 1, break.point - 1, USE.NAMES = FALSE)
  second <- mapply(substring, x, break.point + 1, nchar(x), USE.NAMES = FALSE)
  if (length(delimiters) == 1)
    return(lapply(1:length(first), function(x) c(first[x], second[x])))
  else
    mapply(function(x, y) c(x, y), first, split.seq(second, delimiters[-1]),
           USE.NAMES = FALSE, SIMPLIFY = FALSE)
}
split.seq(x,delimiters)
A test:
x<-rep(x,2)
delimiters=c(" ", " ", "%", " ")
split.seq(x,delimiters)
[[1]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
[[2]]
[1] "2012-10-19T16:59:01-07:00"
[2] "192.101.136.140"
[3] "<190>Oct 19 2012 23:59:01: "
[4] "FWSM-6-305011:"
[5] "Built dynamic tcp translation from Inside:10.2.45.62/56455 to outside:192.101.136.224/9874"
