I'm attempting to extract data from a PDF, which can be located at https://www.dol.gov/ui/data.pdf. The data I'm interested in are on page 4 of the PDF: the 3 observations of Initial Claims (NSA), the 3 observations of Insured Unemployment (NSA), and the covered employment figure used for the most recent week (footnote 2).
I've read the PDF into R using pdftools, but the text output which is generated is quite ugly (kind of to be expected - due to the nature of PDFs). Is there any way I can extract specific data from this text output? I believe the data will always be in the same place in the output, which is helpful.
The output I'm looking at can be seen with the following script:
library(pdftools)
download.file("https://www.dol.gov/ui/data.pdf", "data.pdf", mode="wb")
uidata <- pdf_text("data.pdf")
uidata[4]
I've searched for similar questions and fiddled around with scan() and grep(), but can't seem to figure out a way to isolate and extract the data I need from the text output. Thanks in advance if anyone stumbles upon this and can point me in the right direction; if not, I'll keep trying to figure this out!
With grep and a little regex, you can get everything you need into a usable structure:
library(magrittr)
x <- pdftools::pdf_text('https://www.dol.gov/ui/data.pdf')
x2 <- readLines(textConnection(x[4]))
r <- grep('WEEK ENDING', x2)
l <- lapply(seq_along(r), function(i){
x2[r[i]:(na.omit(c(r[i + 1], grep('FOOTNOTE', x2)))[1] - 1)] %>%
trimws() %>%
gsub('\\s{2,}', ';', .) %>%
paste(collapse = '\n') %>%
read.csv2(text = ., dec = '.')
})
from_footnote <- as.numeric(gsub('^2|\\D', '', x2[grep('2\\.', x2)]))
l[[1]][3,]
#> WEEK.ENDING December.17 December.10 Change
#> Initial Claims (NSA) 315,613 305,333 +10,280 352,534
#> December.3
#> Initial Claims (NSA) 319,641
from_footnote
#> [1] 138322138
You'll still need to parse the numbers, but at least it's usable.
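For that last parsing step, a minimal sketch (assuming the parsed blocks keep the comma- and plus-sign formatting shown in the output above; adjust if your columns differ):
# Not part of the original pipeline: strip commas and plus signs, then coerce
# each column to numeric (any purely text column would simply become NA here).
parse_block <- function(df) {
  df[] <- lapply(df, function(col) as.numeric(gsub('[+,]', '', as.character(col))))
  df
}
initial_claims_nsa <- parse_block(l[[1]])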
Related
Currently, I use the following code to store Excel files (which are stored in a folder on my PC) in a list.
decrease_names <- list.files("4_large_decreases",pattern = ".xlsx",full.names = T)
decrease_list <- sapply(decrease_names,read_excel,simplify = F)
After that, I combine the dataframes into one object by using the following code.
decrease <- decrease_list %>%
keep(function(x) nrow(x) > 0) %>%
bind_rows()
The problem I have is that the Excel files that are stored in the folder contain decimal points (points ".") as well as thousand separators (commas ","). I think R (and read_excel() in particular) converts the thousand separators into decimal points, which results in incorrect data.
Although I know that I can remove the thousand separators in Excel first, this would result in a lot of manual work and hence I am interested in a solution that recognises the thousand separator and keeps it intact (or removes it, the goal is to keep the nature of the data correct).
EDIT: as #dario suggested I add a snippet of a tibble that is stored in decrease_list after I run the code. The snippet looks like this:
Raised Avg. change
526.000 2.04
186.000 3.24
...
In the Raised column, the "." used to be a "," but has become a ".". The "." in Avg. change was already a decimal point.
Assuming that each Excel file contains data in the same format, we can apply the following code:
library(tidyverse)
library(readxl)
decrease_names <- list.files("4_large_decreases",pattern = ".xlsx",full.names = T)
# 10 columns as written in your comment
decrease_list <- sapply(decrease_names, readxl::read_excel, col_types = rep("text", 10L), simplify = F)
# Not tested
decrease <- decrease_list %>%
keep(function(x) nrow(x) > 0) %>%
bind_rows() %>%
mutate(across(where(is.character), ~ as.numeric(gsub("\\,", "", .x))))
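If you would rather not hard-code the number of columns, a possible alternative (an untested sketch) is to recycle a single "text" col_type and let readr::parse_number() strip the grouping commas:
library(tidyverse)
library(readxl)
decrease_names <- list.files("4_large_decreases", pattern = ".xlsx", full.names = TRUE)
# read everything as text so read_excel cannot reinterpret the separators,
# then parse_number() drops the "," grouping mark; genuinely textual columns
# would turn into NA here, so narrow the across() selection if you have any
decrease <- decrease_names %>%
  map(~ read_excel(.x, col_types = "text")) %>%
  keep(~ nrow(.x) > 0) %>%
  bind_rows() %>%
  mutate(across(where(is.character), readr::parse_number))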
I have a .txt import file from a weather station using some pretty advanced code, and I need to sort based on one area of content within each line. Here's a few lines:
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
I basically need to be able to group together all lines with a $GPGGA, and do the same for $GPGLL, $GPVTG, and I believe 6 other types of entries that repeat. group_by() does not work, nor do select() or sort(), for obvious reasons. The formatting here is clearly not in any organized table format, making this very difficult for me. How do I do this?
Here's the code I used to import the original file (I replaced my actual username with "my username"):
filefolder <-"C:\\Users\\"my username"\\Downloads\\"
Weather_data = paste(filefolder, "Jul_13_2021_Weatherstation_Test_File.txt", sep = "")
Weather_data <- read.delim("Jul_13_2021_Weatherstation_Test_File.txt")
And here's what I have so far in my attempt:
Screenshot of what I have so far: https://i.stack.imgur.com/FSlzf.png
As you say there is no organisation in the table. I would suggest doing something with regular expressions:
df <- data.frame(text = c("13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68",
"13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72",
"13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E"))
library(dplyr)
df %>%
mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
group_by(Entry)
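If you would rather end up with one data frame per sentence type ($GPGGA, $GPGLL, $GPVTG, ...) instead of a single grouped tibble, a small sketch building on the same regex (the object names here are just for illustration):
parsed <- df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text))
# base split() gives a named list with one data frame per NMEA sentence type
by_type <- split(parsed, parsed$Entry)
by_type$GPGGA   # all $GPGGA rows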
The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:
I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.
In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.
1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
/Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
/description="American Gut control"
/ENA checklist="ERC000011"
/INSDC center alias="UCSDMI"
/INSDC center name="University of California San Diego Microbiome Initiative"
/INSDC first public="2018-07-13T17:03:10Z"
/INSDC last update="2018-07-13T14:50:03Z"
/INSDC status="public"
/SRA accession="ERS2609990"
I've tried (including but not limited to):
Read .txt file (adding a delimiter hasn't made a difference, am I missing something here?)
I've tried reading the data using various delimiters
I've even removed the header data in Sublime Text, leaving only "Attributes:" and the "/"-delimited key/value pairs in order to mess with the column that way
I've split the column and found all unique values in col1 to maybe create a df from scratch, etc.
Can't seem to get past the cleaning steps:
library(splitstackshape)  # provides cSplit()
samples <- read.csv("~/biosample_result_full.txt")
# "Colname" is a placeholder for the name of the single text column
samples_split <- cSplit(samples, splitCols = "Colname", sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)
Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.
Any help is greatly appreciated.
I see that the website you've linked to allows the option to export data to XML. I strongly suggest doing so; R can handle/parse XML files very efficiently.
When I download the first three results from that site to a file biosample_result.xml, it's easy to process using the xml2 package:
library( xml2 )
library( magrittr )
doc <- read_xml( "./biosample_result.xml")
# get all BioSample nodes
BioSample.Nodes <- xml_find_all( doc, "//BioSample")
#build a data.frame
data.frame(
sample_name = xml_find_first( BioSample.Nodes , ".//Id[@db='SRA']") %>% xml_text(),
stringsAsFactors = FALSE )
# sample_name
# 1 ERS2609990
# 2 ERS2609989
# 3 ERS2609988
So if you can use the XML, you will just have to use the right xpath-syntax to get the data/nodes you need, into the columns you want...
In the example above, I extracted (from each BioSample node) the first Id node with attribute db equal to 'SRA', and stored the result in the column sample_name.
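For example, your header also shows a BioSample accession ('BioSample: SAMEA4790059'), so the same pattern should give you a second column (assuming the XML carries an Id node with db='BioSample'; verify this in your download):
# sketch: same idea as above, one extra column per Id node you care about
data.frame(
  sra_accession = xml_find_first( BioSample.Nodes, ".//Id[@db='SRA']" ) %>% xml_text(),
  biosample_id  = xml_find_first( BioSample.Nodes, ".//Id[@db='BioSample']" ) %>% xml_text(),
  stringsAsFactors = FALSE )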
Still assuming you can use the xml-data.
If you are looking to get all attributes into one df, you need the functions from purrr, so just load the entire tidyverse:
library( tidyverse )
df <- xml_find_all( doc, "//BioSample") %>%
map_df(~{
set_names(
xml_find_all(.x, ".//Attribute") %>% xml_text(),
xml_find_all(.x, ".//Attribute") %>% xml_attr( "attribute_name" )
) %>%
as.list() %>%
flatten_df()
})
will result in a df with one row per BioSample and one column per attribute.
I'm trying to scrape a website that has numerous pieces of information I want in paragraphs. I got this to work perfectly... However, I don't understand how to break the text up and create a dataframe.
Website I want scraped: https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then make the dataframe like that but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = T)
tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'
url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()
coordFunc <- function(chunk){
  # capture the number following 'Longitude:' in each chunk
  pattern_lon <- 'Longitude:.*(-[[:digit:]]{1,2}.[[:digit:]]{0,15})'
  ret <- regmatches(x = chunk, m = regexec(pattern = pattern_lon, text = chunk))
  return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes to be numeric, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof-of-concept but you can modify your function to extract all relevant bits in a single call. For getting the right tag you can use the SelectorGadget extension (works well in Chrome for me). Alternatively, most browsers let you 'inspect element' to get the html tag. The function could return the extracted values in a data.table which can then be combined into one using rbindlist.
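For instance, a sketch along those lines (the regexes here are my own guesses based on the chunk shown above, so check them against the page; data.table is already loaded above):
coordTable <- function(chunk){
  # capture the numbers following 'Latitude:' and 'Longitude:'; NA if absent
  lat <- regmatches(chunk, regexec('Latitude:[[:space:]]*(-?[[:digit:].]+)', chunk))[[1]][2]
  lon <- regmatches(chunk, regexec('Longitude:[[:space:]]*(-?[[:digit:].]+)', chunk))[[1]][2]
  data.table(latitude = as.numeric(lat), longitude = as.numeric(lon))
}
coords <- rbindlist(lapply(chunks, coordTable))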
You could even advance pages programmatically to scrape the entire website - be sure to check the usage policy (it's generally frowned upon or restricted to scrape websites).
Edit
The text is not structured the same way throughout the webpage, so you'll need to spend more time examining what exceptions can take place.
Here's a new function to resolve each chunk into separate lines and then you can try to use additional regular expressions to get what you want.
newfunc <- function(chunk){
# Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
# the output is a list so we use 'unlist' to get a vector
# then use 'trimws' to remove whitespace around it - try out each of these functions
# separately to understand what is going on. The final output here is a vector.
txt <- trimws(unlist(strsplit(chunk, '\r\n')))
return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:
if(length(txt) == 1){
return(NULL)
}
This handles the entries that don't have any text in them. Since this is a proof of concept I haven't checked all entries, but there's some simple logic:
The first line is typically the name
The coordinates are in the last line
The fuel can be either unleaded or diesel. You can grep on these two strings to see what each depot offers, e.g. grepl('diesel', newfunc(chunks[12]), ignore.case = TRUE)
Another approach would be to use a different set of html tags, e.g. all coordinates and opening hours are in boldface and have the tag strong. You can extract those separately and then use regular expressions to get what you want (see the sketch after this list).
You could search for 24(Hrs|Hours) to first extract all sites that are open 24 hours and then use selective regex on the remainder to get their operating times.
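A minimal sketch of that strong-tag idea (it assumes the hours and coordinates really do sit inside strong tags on this page, so check with the inspector first):
# reuse url_text from above and pull only the text inside <strong> tags
strong_text <- url_text %>% html_nodes('strong') %>% html_text()
# flag the sites advertising 24-hour opening, as suggested above
open_24 <- grepl('24[[:space:]]*(Hrs|Hours)', strong_text, ignore.case = TRUE)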
There is no simple easy answer with most web-scraping, you have to find patterns and then apply some logic based on that. Only on the most structured websites will you find something that works for the entire page/range.
You can use the tidyverse packages (stringr, tibble, purrr):
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfc(t, trimws) # trim whitespace in every column
I'm trying to figure out how to read the data table of a Google Chart in R.
For example, the source code of this page contains historical Peercoin daily prices. I would like to copy into an R matrix the content of the data table that begins at line 497 with:
var data = google.visualization.arrayToDataTable([
['Period', right_title_name],
['2014/10/01 18:00', 0.01189974],
['2014/10/02 18:00', 0.01194000],
['2014/10/03 18:00', 0.01171897],
['2014/10/04 18:00', 0.01199999],
['2014/10/05 18:00', 0.01200000],
['2014/10/06 18:00', 0.01188685],
['2014/10/07 18:00', 0.01161999],
// data here
]);
I've installed several packages like RCurl, XML and data.table and followed examples from related questions (i.e. using fread, readHTMLTable and getURL), but I'm facing various issues reading the correct data from the source code; there is too much noise I can't filter out. For example, with RCurl:
library(RCurl)
address <- "http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day"
data <- getURL(address)
data has all the data, but I'm not able to select the dates and prices with strsplit(data, "some code here").
Could somebody suggest an idea to achieve this?
Thank you,
Florent
Probably there's a better way but what I usually do after getting the page source with getURL, as you posted, is to use some string manipulations.
My try:
library(stringr)  # for str_locate() and str_trim()
pageSource <- getURL(address)
index1<-str_locate(pageSource,"'Period', right_title_name],")[[2]]
sourceCut1<-substr(pageSource,index1+1,nchar(pageSource))
index2<-str_locate(sourceCut1,"]);")[[1]]
sourceCut2<-substr(sourceCut1,1,index2-1)
#sourceCut2 is the part of page source with the data
data<-str_trim(strsplit(sourceCut2,"\n")[[1]]) #split data rows
dates<-gsub("^.*'([0-9/: ]+).*$", "\\1", data) #extract dates
dates<-as.POSIXct(dates,format="%Y/%m/%d %H:%M")
values<-as.numeric(gsub("^.*,([0-9 .]+).*$", "\\1", data)) #extract numeric values
mydata<-data.frame(dates=dates,values=values)
Note that it will continue working only if the structure of the data (date format, blank spaces, square brackets) remains unchanged, otherwise you will probably need to modify some of the regex.
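One cheap guard against a silent structure change is to sanity-check the parsed vectors before using them, for example:
# mostly-NA output usually means the page layout (or a regex) has changed
if (all(is.na(dates)) || all(is.na(values))) {
  stop("parsing failed - check the regexes against the current page source")
}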
This answer is highly specific to your situation (and that URL) but it may be enough of a base for others with similar challenges. You can use the V8 package to parse & interpret javascript, so I grab the page, extract the javascript for the table, do some cleanup of it so it can be interpreted pretty easily then post-process the conversion. It's not pretty and others might be able to optimize it, but it will get you what you need:
library(V8)
library(httr)
library(stringr)
library(dplyr)
library(magrittr)
pg <- GET("http://alt19.com/19R/chart_showing_btc.php?shw=1&label=LTC_BTC&source=cryptsy&period=1day")
content(pg, as="text") %>%
str_extract("(google\\.visualization\\.arrayToDataTable.*\\]\\);)") %>%
str_replace("google\\.visualization\\.arrayToDataTable\\(", "[") %>%
str_replace("\\)", "]") %>%
str_replace("right_title_name", "'right_title_name'") -> tbl
ct <- new_context()
ct$eval(tbl) %>%
str_split(",") %>%
extract2(1) %>%
matrix(ncol=2, byrow=TRUE) %>%
data.frame(stringsAsFactors=FALSE) %>%
tail(-1) %>%
select(timestamp=1, value=2) %>%
mutate(timestamp=as.Date(as.POSIXct(timestamp)),
value=as.numeric(value)) -> dat
glimpse(dat)
## Observations: 227
## Variables:
## $ timestamp (date) 2014-10-01, 2014-10-02, 2014-10-03, 2014-1...
## $ value (dbl) 0.01189974, 0.01194000, 0.01171897, 0.01199...
library(ggplot2)
ggplot(dat, aes(timestamp, value)) + geom_line(size=0.5) + theme_bw()
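And since the question mentions wanting an R matrix, a small follow-up sketch (note the dates become character row names here):
# one-column numeric matrix with the timestamps as row names
m <- as.matrix(setNames(dat$value, as.character(dat$timestamp)))
colnames(m) <- "value"
head(m)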