I am quite new to R. I have few text (.txt) files in a folder that have been converted from PDF with page break character (#12). I need to produce a data frame by reading these text files in R with condition that one row in R represents one PDF page. It means that every time there is a page break (\f), it will only then create a new row.
The problem is when the text file gets load into R, every new line became a new row and I do not want this.
Please assist me on this. Thanks!
Some methods that I have tried are read.table and readLines.
As you can see in lines 273 & 293, there is \f, so I need whatever that comes after \f to be in a row (which represents a page)
Base R:
vec <- c("a","b","\fd","e","\ff","g")
# vec <- readLines("file.txt")
out <- data.frame(page = sapply(split(vec, cumsum(grepl("^\f", vec))), paste, collapse = "\n"))
out
# page
# 0 a\nb
# 1 \fd\ne
# 2 \ff\ng
If you need the leading \f removed, easily done with
out$page <- sub("^\f", "", out$page)
Does something like this work?
library(tidyverse)
read_file("mytxt.txt") %>%
str_split("␌") %>%
unlist() %>%
as_tibble_col("data")
It just reads the file as raw text then splits afterwards. You may have to replace the splitting character with something else.
Related
I wrote an R script to make some scientometric analyses of Journal Citation Report data (JCR), which I have been using and updating in the past years.
Today, Clarivate has just introduced some changes in its database and now the exported CSV file contains one last empty column, which spoils my script. Because of this last empty column, read.csv automatically assumes that the first column contains the row names.
As before, there is also one first useless row, which is automatically removed in my script with skip = 1.
One simple solution to this "empty column situation" would be to manually remove this last column in Excel, and then proceed with my script as usual.
However, is there a way to add this removal to my script using base R?
The beginning of my script is:
jcreco = read.csv("data/jcr ecology 2020.csv",
na = "n/a", skip = 1, header = T)
The original CSV file downloaded from JCR is available in my Dropbox.
Could you please help me? Thank you!
The real problem is that empty column doesn't have a header. If they had only had the extra comma at the end of the header line this probably wouldn't be as messy. But you can also do a bit of column shuffling with fill=TRUE. For example
dd <- read.table("~/../Downloads/jcr ecology 2020.csv", sep=",",
skip=2, fill=T, header=T, row.names=NULL)
names(dd)[-ncol(dd)] <- names(dd)[-1]
dd <- dd[,-ncol(dd)]
This reads in the data but puts the rows names in the data.frame and fills the last column with NA. Then you shift all the column names over to the left and drop the last column.
Here is a way.
Read the data as text lines;
Discard the first line;
Remove the end comma with sub;
Create a text connection;
And read in the data from the connection.
The variable fl holds the file, on my disk I had to set the directory.
fl <- "jcr_ecology_2020.csv"
txt <- readLines(fl)
txt <- txt[-1]
txt <- sub(",$", "", txt)
con <- textConnection(txt)
df1 <- read.csv(con)
close(con)
head(df1)
I'm trying to scrape a website that has numerous different information I want in paragraphs. I got this to work perfect... However, I don't understand how to break the text up and create a dataframe.
Website :Website I want Scraped
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then make the dataframe like that but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = T)
tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'
url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()
coordFunc <- function(chunk){
patter_lat <- 'Longitude:.*(-[[:digit:]]{1,2}.[[:digit:]]{0,15})'
ret <- regmatches(x = chunk, m = regexec(pattern = patter_lat, text = chunk))
return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes to be numeric, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof-of-concept but you can modify your function to extract all relevant bits in a single call. For getting the right tag you can use SelectorGadget extension (works well in Chrome for me). Alliteratively most browsers let you 'inspect element' to get the html tag. The function could return the extracted values in a data.table which can then be combined into one using rbindlist.
You could even advance pages programatically to scrape the entire website - be sure to check with the usage policy (it's generally frowned upon or restricted to scrape websites).
Edit
the text is not structured the same way throughout the webpage so you'll need to spend more time examining what exceptions can take place.
Here's a new function to resolve each chunk into separate lines and then you can try to use additional regular expressions to get what you want.
newfunc <- function(chunk){
# Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
# the output is a list so we use 'unlist' to get a vector
# then use 'trimws' to remove whitespace around it - try out each of these functions
# separately to understand what is going on. The final output here is a vector.
txt <- trimws(unlist(strsplit(chunk, '\r\n')))
return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:
if(length(txt) == 1){
return(NULL)
}
This is because that is for the entries that don't have any text in them. since this a proof of concept I haven't checked all entries but there's some simple logic:
The first line is typically the name
the coordinates are in the last line
The fuel can be either unleaded or diesel. You can grep on these two strings to see what each depot offers. e.g. grepl('diesel', newfunc(chunks[12]))
Another approach would be to use a different set of html tags e.g. all coorindates and opening hours are in boldface and have the tag strong. You can extract those separately and then use regular expressions to get what you want.
You could search for 24(Hrs|Hours) to first extract all sites that are open 24 hours and then use selective regex on the remainder to get their operating times.
There is no simple easy answer with most web-scraping, you have to find patterns and then apply some logic based on that. Only on the most structured websites will you find something that works for the entire page/range.
You can use tidyverse package (stringr, tibble, purrr)
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfr(t,trimws) # delete blanks
I have several .txt files which need to be imported to R as dataframes for some data analysis. One of these files has no EOL in any form, so I'm left wondering how I would go about to import that.
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
This is how the first ~500 characters of that .txt file look like. The EOL would need to be placed like this:
\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\"
\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA
Normally I would just gsub a "\n" to the places I need it to be, but there is no reoccurring string at the places where I would place a \n, so I don't think that gsub would work in this instance.
Seeing how the missing values are clearly indicated with NA, is there a function similar to read_delim that has a "col_number = x" argument? Like the first x values are the headers, the next x values are the values of the first row and so on and so forth?
If it changes anything, these .txt files are rather big (>300mb).
Big thank you to Julian_Hn. Works like a charm.
I would probably just read this in as a vector and then reformat as matrix with the number of columns you know are in the dataset. This essentially does what you want
str <- "\"A\";\"B\";\"C\";\"D\";\"D\";\"E\";\"F\";\"G\";\"H\";\"I\";\"J\";\"K\";\"L\";\"M\";\"N\";\"O\";\"P\";\"Q\";\"R\";\"S\";\"T\";\"U\";\"V\";\"1\";4;\"55-555-5555-555\";1234-56-78;\"111\";1510;5;1234-12-17;12345.1234512345;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;\"2\";6;\"22-222-2222-222\";5678-56-78;\"222\";2051;0;NA;0;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA;NA"
vec <- strsplit(str,";")[[1]]
//EDIT: add byrow = T To stay in the right format. Thanks Yuriy
table <- matrix(vec,ncol=23,nrow=3, byrow = T)
df <- as.data.frame(table)
I'm improving my R-skills rebuilding some of the amazing stuff they do on r-bloggers. Right now im trying to reproduce this:
http://wiekvoet.blogspot.nl/2015/06/deaths-in-netherlands-by-cause-and-age.html. The relevant dataset for this excersize could be found here:
http://statline.cbs.nl/Statweb/publication/?VW=D&DM=SLNL&PA=7052_95&D1=0-1%2c7%2c30-31%2c34%2c38%2c42%2c49%2c56%2c62-63%2c66%2c69-71%2c75%2c79%2c92&D2=0&D3=0&D4=0%2c10%2c20%2c30%2c40%2c50%2c60%2c63-64&HD=150710-0924&HDR=G1%2cG2%2cG3&STB=T
If I'm diving into the code (to be found at the bottom of the first link) and am running into this piece of code:
r1 <- read.csv(sep=';',header=FALSE,
col.names=c('Causes','Causes2','Age','year','aantal','count'),
na.strings='-',text=txtlines[3:length(txtlines)]) %>%
select(.,-aantal,-Causes2)
Could anybody help me seperating the steps that are taken here?
Here is an explanation of what each line in the call to read.csv() is doing from your example. Note that the assignment of the last parameter text is complicated and is dependent on the script from the link you gave above. From a high level, he is first reading in all lines from the file "Overledenen__doodsoo_170615161506.csv" which contain the string "Centraal", using only the third to final lines from that filtered set. There is an additional step applied to these lines as well.
r1 <- read.csv( # columns separate by semi-colon
sep=';',
# first row is data (i.e. is NOT a header)
header=FALSE,
# names of the six columns
col.names=c('Causes','Causes2','Age','year','aantal','count'),
# treat hyphen as NA
na.strings='-',
# read from third line to final line of the original input
# Overledenen__doodsoo_170615161506.csv, after some
# filtering has been applied
text=txtlines[3:length(txtlines)]) %>% select(.,-aantal,-Causes2)
The read.csv, read the csv file, separating column with the separator ";"
so that an input like this a;b;c will be separated in: first column=a, second=b, third=c
header=FALSE -> It specifies no header in the original file was given.
col.names assigns the listed names to your columns in r
na.strings='-' substitutes NA values with '-'
text=txtlines[3:length(txtlines)]) read the lines from position 3 till the end.
%>% select(.,-aantal,-Causes2) filter the data frame
Hi I am performing a task on a text file chosen from file.choose which just separates out the lines and then adds a new number to each separation.
I would like to name the outputted file the same as file.choose but instead of .txt I would like it to be called .fa. So far I have
fileConn<-file("outputtbb.txt")
longlist <- readLines(file.choose())
lvls1 <- unique(longlist)
cat(paste0("Sequence>", seq_along(lvls1), "\n", lvls1, collapse="\n"), file=fileConn)
close(fileConn)
I know its something to do with changing file=fileConn but cant figure out how to manipulate it and all I get is outputtbb.
Please help!
You could seperate the readLines(file.choose()) to two lines.
filename <- file.choose()
longlist <- readLines(filename)
You now have file name available as a variable and you can bend it, you can twist it all day long (no reference to iPhone 6 bending intended).
To change endings, assuming no ".txt" appears anywhere else in the file name, this would work
x <- "file.txt"
sub(".txt", replacement = ".fa", x = x)
[1] "file.fa"