Obtain specific sections of a text file and their contents - R

I have a text file that looks like:
1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last
I have another text file that looks like:
1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last
I was wondering how you could read in the data from the first text file and use its section headings to find the matching sections within the second file, along with all the content after each heading up until the next section. So basically, I'm trying to get something like:
Section Content
1 Hello My name is John. It was nice to meet you.
1.1 Hi Hi again. My last name is Doe. 1.1.1 Bye
1.2 Hey Greetings.
.....And so on
I was wondering how I could do so.

The following solution can certainly be improved, but it should give you an idea of how to approach your issue. Depending on the size and structure of the files you need to process, this approach may be fine as-is or may need more tuning regarding section detection and speed.
file1 =
"1 Hello
1.1 Hi
1.2 Hey
2 Next section
2.1 New section
3 thrid
4 last"
file2 =
"1 Hello
My name is John. It was nice to meet you.
1.1 Hi
Hi again. My last name is Doe.
1.1.1 Bye
1.2 Hey
Greetings.
2 Next section
This is the second section. I am majoring in CS.
2.1 New Section
Welcome. I am an undergraduate student.
3 third
1. hi
2. hello
3. hey
4 last"
# split the raw strings into one element per line
file1 = unlist(strsplit(file1, "\n", fixed = TRUE))
file2 = unlist(strsplit(file2, "\n", fixed = TRUE))
# find the line in file2 where each heading from file1 occurs
# (headings without an exact match, like "3 thrid", are silently dropped)
positions = unlist(sapply(file1, function(x) grep(paste0("^", x, "$"), file2, ignore.case = TRUE)))
# pair each start line with the line before the next heading (or the end of file2)
positions = cbind(positions, c(positions[-1] - 1, length(file2)))
# extract each section's lines and drop the heading line itself
text = mapply(function(x, y) file2[x:y], positions[, 1], positions[, 2])
text = lapply(text, function(x) x[-1])
result = cbind(positions, text)
result
# positions text
# 1 Hello 1 2 "My name is John. It was nice to meet you."
# 1.1 Hi 3 5 Character,2
# 1.2 Hey 6 7 "Greetings."
# 2 Next section 8 9 "This is the second section. I am majoring in CS."
# 2.1 New section 10 15 Character,5
# 4 last 16 16 Character,0
# Note that the text column contains lists storing the individual lines.
# e.g. for "2.1 New section":
class(result[5, "text"])
# list
result[5, "text"]
# [[1]]
# [1] "Welcome. I am an undergraduate student." "3 third" #<< note the different spelling of third
# [3] "1. hi" "2. hello"
# [5] "3. hey"
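To get the two-column Section / Content layout shown in the question, one further step (a small sketch building on the objects created above) collapses each section's lines into a single string:
# combine each heading with its content, collapsed into one string per section
data.frame(
  Section = rownames(positions),
  Content = sapply(text, paste, collapse = " ")
)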

The answer to this question is yes, it can be done. The implementation will vary widely based on which programming language you are using to accomplish this task. A high-level overview would be:
Split the original file into a string array by line. These are your list of keys for searching the second document.
Read the second file into a string variable.
Iterate through all your keys (iterator x) and find their index in the second document, something like:
int start = seconddocument.indexOf(keys[x]);
int end = seconddocument.indexOf(keys[x+1]);
Then, with these start and end positions, you can use a substring() function to extract the content:
string matchedContent = seconddocument.substring(start, end);
This works until you get to the last match, because keys[x+1] will not exist when x is the last key. In that case, end needs to be set to the position of the last character in the document, or you can use a substring method that just takes a starting point. A sketch of this approach in R follows.
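In R, a minimal sketch of this index-based approach could look like the following (extract_sections is a hypothetical helper for illustration only; it assumes every key actually occurs in the second document, so missing headings like "3 thrid" would need extra handling):
extract_sections <- function(keys, seconddocument) {
  # starting position of each key (fixed = TRUE: treat keys literally, not as regex)
  starts <- sapply(keys, function(k) regexpr(k, seconddocument, fixed = TRUE))
  # each section ends right before the next key; the last one runs to the end
  ends <- c(starts[-1] - 1, nchar(seconddocument))
  substring(seconddocument, starts, ends)
}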
HTH

Related

Problem with web scraping from Wikipedia

I have been practicing web scraping from Wikipedia with the rvest library, and I would like to solve a problem that I found when using the str_replace_all() function. Here is the code:
library(tidyverse)
library(rvest)
pagina <- read_html("https://es.wikipedia.org/wiki/Anexo:Premio_Grammy_al_mejor_%C3%A1lbum_de_rap") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # convert to a table
  html_table()
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- str_replace_all(rap$Artista, '\\[[^\\]]*\\]', '')
rap$Trabajo <- str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
The problem is that after I remove the elements between brackets (Wikipedia hyperlinks) from the Artista variable and tabulate the count by artist, Eminem is repeated three times as if it were three different artists; the same happens with Kanye West, which is repeated twice. I appreciate any solutions in advance.
There are some hidden bits still attached to the strings, and trimws() does not remove them. You can use nchar(sort(test)) to see the number of characters associated with each entry.
Here is a messy regular expression that extracts the letters, spaces, commas, & and -, and skips everything else at the end.
rap <- pagina[[2]]
rap <- rap[, -c(5)]
rap$Artista <- gsub("([a-zA-Z -,&]+).*", "\\1", rap$Artista)
rap$Trabajo <- stringr::str_replace_all(rap$Trabajo, '\\[[^\\]]*\\]', '')
table(rap$Artista)
Cardi B Chance the Rapper Drake Eminem Jay Kanye West Kendrick Lamar
      1                 1     1      6   1          4              2
Lil Wayne Ludacris Macklemore & Ryan Lewis Nas Naughty by Nature Outkast Puff Daddy
        1        1                       1   1                 1       2          1
The Fugees Tyler, the Creator
         1                  2
Here is another regular expression that seems a bit clearer:
gsub("[^[:alpha:]]*$", "", rap$Artista)
From the end, replace zero or more characters which are not a to z or A to Z.
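As a tiny illustration of why duplicates appear, here is a made-up example with invisible trailing characters (a non-breaking space and a zero-width space) of the kind Wikipedia markup can leave behind:
x <- c("Eminem", "Eminem\u00a0", "Eminem\u200b")
table(x)                             # three "different" artists
table(gsub("[^[:alpha:]]*$", "", x)) # a single Eminem with count 3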

Compiling API outputs in XML format in R

I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers to get info on each time you call the API) and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. However, when I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
  usps <- input_tracking_info[i] # input_tracking_info = GET commands
  usps <- read_xml(usps)
  final_tracking_info[[i]] <- usps$TrackResponse
  gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output, "final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(final_tracking_info), file = "Final_Tracking_Info.txt")) # exported the list to a text file; not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is: is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there any easy way/preferred format to select from, given that I know most of the columns will have the same name ("TrackDetail") and the DFs will have to be different lengths (since each package will have a different number of track details) when I'm trying to compile thousands of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(magrittr) # provides the pipe used below
xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flattening
  unname()     # removing names
data.frame(
  ID = tail(xml_data, 1),      # getting last element
  Summary = head(xml_data, 1), # getting first element
  Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
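To get from there to the table the asker wants (tracking number plus first and last TrackDetail) across many saved responses, a hedged sketch could look like this; files is a hypothetical vector of file names, and the code assumes each response parses to the flat structure shown above:
library(XML)
files <- c("71563898.xml") # hypothetical: one saved XML file per API response
rows <- lapply(files, function(f) {
  x <- xmlToList(f)
  # all TrackDetail entries, in the order they appear in the response
  details <- unlist(x[names(x) == "TrackDetail"], use.names = FALSE)
  data.frame(
    ID = x$.attrs[["ID"]],
    FirstDetail = details[1],
    LastDetail = details[length(details)]
  )
})
do.call(rbind, rows)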

Extract specific text parts from df column R

I have a question about how to extract parts of a text string and convert them to a df output.
This is an example from my df: the output of one row in my one column (the content of one cell).
[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]
My expected output would be to have 2 columns with as many rows as I get from this string:
effortDate
2021-07-04
2021-07-11
and second column
effort
2
1
Any suggestions on how to achieve that?
Thanks!
This looks like JSON content, but the => messes with the parsing. If you replace it with :, you should be able to read it properly:
mystr <- '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]'
jsonlite::fromJSON(gsub("=>", ":", mystr))
# id effortDate effort author
# 1 aaaaaaaaaaaaaaaa 2021-07-04T23:00:00.000Z 2 a:aa:a
# 2 bbbbbbbbbbbbbb 2021-07-11T23:00:00.000Z 1 b:bb:b
# 3 ccccccccccccc 2021-07-17T23:00:00.000Z 1 c:cc:c
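From the parsed data frame, the two requested columns can then be selected, with as.Date() dropping the time component:
df <- jsonlite::fromJSON(gsub("=>", ":", mystr))
data.frame(
  effortDate = as.Date(df$effortDate), # "2021-07-04T23:00:00.000Z" -> "2021-07-04"
  effort     = df$effort
)
#   effortDate effort
# 1 2021-07-04      2
# 2 2021-07-11      1
# 3 2021-07-17      1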

Start with certain phrase up to certain phrase

I have some texts and some of them actually have pre-defined templates which add no value to the analysis.
I want to use regex to systematically remove the template (which typically consists of header text, like a greeting, and closing text, like a thank-you), so that I can focus on the variable text.
Both the header and the closing may contain variable text, like a variable location or a variable staff name. So text 1 may have location equal to ABC and staff name equal to Sofia.
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
want <- "\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n"
header <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:"
tail <- "\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
My current attempt is below.
# remove everything before 'menu'
gsub('(.*)menu:','', have)
# want to correct the above to
# remove everything that
# starts with "Hello, thank you for contacting" up to "Please find our available menu"
# remove everything after Sincerely, inclusive
gsub('Sincerely.*','', have)
# want to correct the above to
# remove everything that
# starts with "Sincerely,\nThe Awesome Pizza Team" up to "\nDelivering Pizza 24/7"
2nd attempt
# text
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
# remove any text in between 'Hello, thank you for contacting`
# up to 'Please find below our available menu:'
# and also the anchoring texts
(want <- gsub(pattern = '(Hello, thank you for contacting).*(Please find below our available menu:)',
              '', x = have))
# remove any text after `\n\n Sincerely,\nThe Awesome Pizza Team\n`, inclusive the text itself
(want <- gsub(pattern = '\n\n Sincerely,\nThe Awesome Pizza Team\n.*',
              '', x = want))
An option might be to match all lines before Menu. Then capture all consecutive lines that start with Menu and match the rest of the lines starting at Sincerely.
In the replacement use capture group 1.
^[\s\S]*?\R((?:Menu .*\R+)*)\s*Sincerely,[\s\S]*
The pattern matches:
^ Start of string
[\s\S]*?\R Match any char, as few as possible, followed by a newline
( Capture group 1
(?:Menu .*\R+)* Repeatedly match lines that start with Menu, each followed by one or more newlines
) Close group 1
\s* Match optional whitespace chars
Sincerely, Match literally
[\s\S]* Match the rest of the lines
Example
have <- "Hello, thank you for contacting our Pizza Store, <variable location>. \n\r Please find below our available menu:\nMenu 1 USD 1.99\nMenu 2 USD 3.99\n\n\n Sincerely,\nThe Awesome Pizza Team\n<variable staff name>\nDelivering Pizza 24/7"
trimws(gsub('^[\\s\\S]*?\\R((?:Menu .*\\R+)*)\\s*Sincerely,[\\s\\S]*','\\1', have, perl = TRUE))
Output
[1] "Menu 1 USD 1.99\nMenu 2 USD 3.99"
A bit longer and more precise pattern could be:
^(?:(?!Menu ).*(?:\R(?!Menu ).*)*\R+)?(Menu .*(?:\RMenu .*)*)\R\s*Sincerely,[\s\S]*
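Applied in R the same way as the earlier call (an untested sketch by analogy, with the backslashes doubled):
trimws(gsub('^(?:(?!Menu ).*(?:\\R(?!Menu ).*)*\\R+)?(Menu .*(?:\\RMenu .*)*)\\R\\s*Sincerely,[\\s\\S]*',
            '\\1', have, perl = TRUE))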

How do I subset a list with mixed data type and data structure?

I have a list which includes a mix of data types (character) and data structures (dataframe).
I want to keep only the dataframes and remove the rest.
> head(list)
[[1]]
[1] "/Users/Jane/R/12498798.txt error"
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
[[3]]
[1] "/Users/Jane/R/12498780.txt error"
I expect the final list to contain only dataframes:
[[2]]
match
1 Japan arrests man for taking gun
2 Extradition bill turns ugly
file
1 /Users/Jane/R/12498770.txt
2 /Users/Jane/R/12498770.txt
Based on the example, it is possible that the OP's list elements are vectors, and the goal is to remove any element containing an 'error' substring:
list[!sapply(list, function(x) any(grepl("error$", x)))]
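Since the stated goal is to keep only the data frames, a more direct alternative is to filter on the class itself (note that naming a variable list shadows the base function, so renaming it is advisable):
# keep only the elements that are data frames
Filter(is.data.frame, list)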
