Split Text String in R - r

I have a large list in R with more than 5000 elements. The elements are of the form:
$`/home/ricardo/MultiClass/data//Execucao_PUBLICACAO_DECISAO_INTERLOCUTORIA_DETERMINACAO_DE_PAGAMENTO/1117.txt.V1
[1] DATA DE DISPONIBILIZACAO DA PUBLICACAO PELA FONTE OFICIAL: 16/11/2016 Pag 4279 Decisao Processo N RTOrd-0122200-90.2006.5.15.0087 <truncated>`
I would like to transform this into a two-column data frame where:
c1
The contents between $ and [1]
c2
rest of the text
How can I do this split? Note that the number of strings between $ and [1] can vary, and the characters $, [ and ] can also appear in the rest of the text.
Thanks in advance,
Ricardo.

library(stringr)
string <- '$/home/ricardo/MultiClass/data//Execucao_PUBLICACAO_DECISAO_INTERLOCUTORIA_DETERMINACAO_DE_PAGAMENTO/1117.txt.V1 [1] DATA DE DISPONIBILIZACAO DA PUBLICACAO PELA FONTE OFICIAL: 16/11/2016 Pag 4279 Decisao Processo N RTOrd-0122200-90.2006.5.15.0087'
# match once; the lazy (.*?) stops at the first " [1] ", even if "[1]" appears again later in the text
m <- str_match(string = string, pattern = "^\\$(.*?) \\[1\\] (.*)")
c1 <- m[,2] # text between $ and [1]
c2 <- m[,3] # rest of the text

The $ ... text is the name of the list element, and the [1] ... is the value of that element. You can extract these (or better yet, assign them correctly when reading in your data).
a <- list(`this is the name` = "data stored in that variable")
a
#> $`this is the name`
#> [1] "data stored in that variable"
names(a)
#> [1] "this is the name"
as.character(a)
#> [1] "data stored in that variable"
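Building on that, the two columns asked for can be assembled directly from the names and the values. This is a minimal sketch in base R; the list `a` and its paths are made-up stand-ins for the real data, and it assumes every element holds a single string.

```r
# Hypothetical stand-in for the real list: names are file paths, values are texts
a <- list(
  `/path/one.txt` = "first text [1] with $ inside",
  `/path/two.txt` = "second text"
)

# c1 takes the element names, c2 the stored values (one string per element assumed)
df <- data.frame(
  c1 = names(a),
  c2 = unlist(a, use.names = FALSE),
  stringsAsFactors = FALSE
)
df
```

Because nothing is parsed from printed output, stray $, [ or ] characters in the text cannot affect the split.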

Related

Subsetting elements in a list and placing them in a data frame

I have a list ("listanswer") that looks something like this:
> str(listanswer)
List of 100
$ : chr [1:3] "" "" "\t\t"
$ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York, ZIPCODE 1" ...
$ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York, ZIPCODE 2" ...
> listanswer
[[1]]
[1] "" "" "\t\t"
[[2]]
[1] "" "Dr. Smith" "123 Fake Street" "New York"
[5] "ZIPCODE 1"
[[3]]
[1] "" "Dr. Jones" "124 Fake Street," "New York"
[5] "ZIPCODE2"
For each element in this list, I noticed the following pattern within the sub-elements:
# first sub-element is always empty
> listanswer[[2]][[1]]
[1] ""
# second sub-element is the name
> listanswer[[2]][[2]]
[1] "Dr. Smith"
# third sub-element is always the address
> listanswer[[2]][[3]]
[1] "123 Fake Street"
# fourth sub-element is always the city
> listanswer[[2]][[4]]
[1] "New York"
# fifth sub-element is always the ZIP
> listanswer[[2]][[5]]
[1] "ZIPCODE 1"
I want to create a data frame that contains the information from this list in row format. For example:
id name address city ZIP
1 2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2 3 Dr. Jones 124 Fake Street New York ZIPCODE 2
I thought of the following way to do this:
name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])
final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)
My Question: I just wanted to confirm - Is this the correct way to reference sub-elements in lists?
If it works, it's the correct way, although there might be a more efficient or more readable way to do the same thing.
Another way is to create an empty data frame with your columns and add rows to it, e.g.:
#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")
#add rows (a for loop is used because an assignment inside lapply() would only change a local copy of df)
for (x in listanswer) df[nrow(df) + 1,] <- x[2:5]
This is simply another way to solve the same problem. Readability is a personal preference, and there's nothing wrong with your solution either.
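A third variant, sketched here in base R, stacks the fields with `do.call(rbind, ...)` and keeps the original list position as the id. The three-element `listanswer` below is a made-up stand-in for the real list, and the sketch assumes that a complete record always has exactly five fields.

```r
# Hypothetical stand-in for the list in the question
listanswer <- list(
  c("", "", "\t\t"),
  c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
  c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2")
)

# Positions of the complete records (assumed to have exactly five fields)
complete <- which(lengths(listanswer) == 5)

# Fields 2..5 of each complete record become one row of a character matrix
fields <- do.call(rbind, lapply(listanswer[complete], function(x) x[2:5]))

final_data <- data.frame(id = complete, fields, stringsAsFactors = FALSE)
names(final_data) <- c("id", "name", "address", "city", "zip")
final_data
```

Filtering on length first avoids rows full of NA for incomplete elements such as the first one.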
If this is based on your elephant question, for businesses in Vancouver, then this mostly works.
library(rvest)
library(dplyr) # for bind_rows()
url<-"Website/british-columbia/"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("div.one_third")
listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record
rows<-lapply(listanswer, function(element){
vect<-element[-1] #remove first blank field
cityindex<-as.integer(grep("Vancouver", vect)) #find city field
#add some error checking and corrections
if(length(cityindex)==0) {
cityindex <- length(vect)-1 }
else if(length(cityindex)>1) {
cityindex <- cityindex[2] }
#get the fields of interest
address <- vect[cityindex-1]
city<-vect[cityindex]
phone <- vect[cityindex+1]
if( cityindex < 3) {
cityindex <- 3
} #error check
#first groups combine into 1 name
name <- toString(vect[1:(cityindex-2)])
data.frame(name, address, city, phone)
})
answer<-bind_rows(rows)
#clean up
answer$phone <- sub("Website", "", answer$phone)
answer
This still needs some cleanup to handle the inconsistencies, but it should be 80-90% complete.

Split texts by tags and set column names

I have a text column that has a tag style. I want to split this text into columns where column names are the tags with their corresponding values.
text = "{\"article_id\":-41,\"word-count\":379,\"article_date\":05012017,\"source\":\"news::abc\",\"author\":\"Peter K\",\"title\":\"The rise of AI\",\"topics\":{\"Business\":10, \"Computer\":5},\"topics-group\":[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]}"
Desired output:
data = data.frame("article_id" = -41, "word-count" = 379, "article_date" = 05012017,
"source"= "news::abc", "author" = "Peter K", "title" = "The rise of AI",
"topics" = "{\"Business\":10, \"Computer\":5}",
"topics-group" = "[{\"primary\":\"Business\",\"secondary\":\"Computer\"}]")
I tried with strsplit
test = strsplit(as.character(text), ",\\\"")
test
[[1]]
[1] "{\"article_id\":-41" "word-count\":379"
[3] "article_date\":05012017" "source\":\"news::abc\""
[5] "author\":\"Peter K\"" "title\":\"The rise of AI\""
[7] "topics\":{\"Business\":10, \"Computer\":5}" "topics-group\":[{\"primary\":\"Business\""
[9] "secondary\":\"Computer\"}]}"
but there are problems with tags like topics-group which is split into 2.
My workflow thought is to complete the split, then do another split for each element to separate the tags and values. But I think there must be a better way to split and set names to these tags as column names.
We may do this in tidyverse
Use str_replace_all to change the digits (\\d+) after the 'article_date":' to integer class (as there was a 0 padding at the beginning)
Use fromJSON to convert the JSON to R object
Flatten the nested list of data.frame - invoke
Use as_tibble to convert the list to a tibble
Finally, use mdy from lubridate to convert the 'article_date' to Date class
library(dplyr)
library(stringr)
library(jsonlite)
library(lubridate)
library(purrr)
text %>%
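str_replace_all('(?<=article_date":)(\\d+)', function(x) as.character(as.integer(x))) %>% # drop the leading zero; recent stringr versions expect the function to return character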
fromJSON %>%
invoke(c, .) %>%
as_tibble %>%
mutate(article_date = mdy(article_date))
-output
# A tibble: 1 x 10
article_id `word-count` article_date source author title topics.Business topics.Computer `topics-group.prima… `topics-group.second…
<int> <int> <date> <chr> <chr> <chr> <int> <int> <chr> <chr>
1 -41 379 2017-05-01 news::a… Peter K The rise o… 10 5 Business Computer
You can add double quotes around the article_date field and parse the JSON string with jsonlite:
text <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text)
library(jsonlite)
document <- fromJSON(txt=text)
> as.data.frame(document)
# article_id word.count article_date source author title topics.Business topics.Computer topics.group.primary topics.group.secondary
# 1 -41 379 05012017 news::abc Peter K The rise of AI 10 5 Business Computer
Regex details:
("article_date":) - Group 1: "article_date": string
(\d+) - Group 2: one or more digits.
The replacement is \1"\2": Group 1 value + Group 2 value enclosed with double quotation marks.
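The effect of the replacement can be checked in base R on a shortened string. The JSON fragment below is a trimmed, illustrative version of the text in the question, and the final step assumes the digits are in month-day-year order.

```r
# Shortened, illustrative version of the JSON text from the question
text <- '{"article_id":-41,"article_date":05012017,"source":"news::abc"}'

# Wrap the digits after "article_date": in double quotes so the JSON becomes valid
fixed <- gsub('("article_date":)(\\d+)', '\\1"\\2"', text)
fixed

# Pull the quoted digits back out and convert them to a Date (MMDDYYYY assumed)
date_str <- sub('.*"article_date":"(\\d+)".*', '\\1', fixed)
article_date <- as.Date(date_str, format = "%m%d%Y")
article_date
```

Quoting keeps the leading zero intact, so the date can still be parsed from the string afterwards.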

How to use hunspell package to suggest correct words in a column in R?

I'm currently working with a large data frame containing lots of text in each row and would like to effectively identify and replace misspelled words in each sentence with the hunspell package. I was able to identify the misspelled words, but can't figure out how to do hunspell_suggest on a list.
Here is an example of the data frame:
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the text column into character and used hunspell to identify the misspelled words within each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps giving this error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
I'm new to this, so I'm not exactly sure how the suggest column produced by hunspell_suggest would turn out. Any help will be greatly appreciated.
Check your intermediate steps. The output of df1$word_check is as follows:
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
which is of type list, while hunspell_suggest expects a character vector. Running lapply(df1$word_check, hunspell_suggest) applies it to each element and returns the suggestions.
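The shape of the result can be sketched without the package. Here `suggest_stub` is a hypothetical stand-in for `hunspell_suggest` with a made-up lookup table; it only mimics the interface (a character vector in, one character vector of suggestions per word out) so that the `lapply` pattern is runnable on its own.

```r
# Hypothetical stand-in for hunspell_suggest(): takes a character vector and
# returns a list with one character vector of suggestions per word
suggest_stub <- function(words) {
  stopifnot(is.character(words))
  lookup <- list(arived = c("arrived", "rived"), staton = c("station"))
  lapply(words, function(w) {
    if (is.null(lookup[[w]])) character(0) else lookup[[w]]
  })
}

# A list column like df1$word_check: one character vector per row
word_check <- list(c("arived", "staton"), character(0))

# Passing the whole list fails the is.character() check, so go element by element
suggestions <- lapply(word_check, suggest_stub)
str(suggestions)
```

The result is a list of lists: one entry per row, holding one vector of suggestions per misspelled word, and an empty list for rows without misspellings.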
EDIT
I decided to go into more detail on this question as I have not seen any easy alternative. This is what I have come up with:
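cleantext = function(x){
  for (y in seq_along(x)){
    # misspelled words in this string
    bad = hunspell(x[y])[[1]]
    # first suggestion for each misspelled word
    good = unlist(lapply(hunspell_suggest(bad), `[[`, 1))
    # replace each misspelled word literally (fixed = TRUE avoids regex surprises)
    for (i in seq_along(bad)){
      x[y] = gsub(bad[i], good[i], x[y], fixed = TRUE)
    }
  }
  x
}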
Although there probably is a more elegant way of doing it, this function returns a vector of character strings corrected as such:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
Watch out, as this returns the first suggestion given by hunspell - which may or may not be correct.

Importing multiple invoices (.PDF) in R. Turning them from strings to a tibble

So I'm doing a project where I need to load a large number of PDFs into R. This part is somewhat covered. The problem is that when importing the PDFs into R, every line is a string. Not all the information in the string is relevant, and in some cases information is missing. So I want to select the info I need and place it into a tibble for further analysis.
Importing the PDFs is done with pdftools. It's working, but hints or tips are welcome.
library(pdftools)
library(tidyverse)
invoice_pdfs = list.files(pattern="\\.pdf$") # gather all the .pdf files in the current wd (pattern is a regex).
invoice_list <- map(invoice_pdfs, .f = function(invoices){ # Using the purrr::map function .
pdf_text(invoices) %>% # extracting text from listed pdf file(s)
readr::read_lines() %>% # read all text from pdf
str_squish() %>% # clear all white space in text.
str_to_lower # convert string to lower case
})
reproducible example:
invoice_example <- c("invoice",
"to: rade ris",
"cane nompany",
"kakber street 23d",
"nork wey",
"+223 (0)56 015 6542",
"invoice id: 85600023",
"date reference product product reference weigth amount",
"01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
"07-02-2016 840000048 product b 24.45.7 qf8463641 19.000 kg a 50 per tonne 950,00",
"03-02-2016 840000032 product b 24.34.2 qf8463641 4.000 kg per tonne 250,00",
"02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
"subtotal 2.034,00",
"sales tax 183,06",
"total 2.217,06")
So here is where the problem starts.
What I've tried is using stringr and rebus to select specific parts of the text. I've made the following function to search the document for specific string, it returns the rownumber:
word_finder <- function(x, findWord){
word_hit <- x %>% # temp for storing TRUE or FALSE
str_detect(pattern = fixed(findWord))
which(word_hit == TRUE) # give rownumber if TRUE
}
And the following searchpatterns:
detect_date <- dgt(2) %R% "-" %R% dgt(2) %R% "-" %R% dgt(2)
detect_money <- optional(DIGIT) %R% optional(".") %R% one_or_more(DIGIT) %R% "," %R% dgt(2)
detect_invoice_num <- str_trim(SPC %R% dgt(8) %R% optional(SPC))
The next step should be to make a tibble (or data frame) with the column names c("date", "reference", "product", "product reference", "weight", "amount"). I've also tried making a tibble of the whole invoice_example; the problem is that some fields are missing, so the column names don't match the corresponding values.
So I would like to write a function that uses the search patterns and places each value in a predetermined column. I have no clue how to get this done. Or should I handle this completely differently?
The final result should be something like this.
reproducible example:
invoice_nr <- c("85600023", "85600023", "85600023", "85600023" )
date <- c( "01-02-2016", "07-02-2016", "03-02-2016", "02-02-2016")
reference <- c( "840000023", "840000048", "840000032", "840000027")
product_id <- c( "de6583621", "qf8463641", "qf8463641", "ke7801465")
weight <- c("14.900", "19.000", "4.000", "1.780")
amount <- c("745.00", "950.00", "250.00", "89.00")
example_tibble <- tibble(invoice_nr, date, reference, product_id, weight, amount)
Result:
# A tibble: 4 x 6
invoice_nr date reference product_id weight amount
<chr> <chr> <chr> <chr> <chr> <chr>
1 85600023 01-02-2016 840000023 de6583621 14.900 745.00
2 85600023 07-02-2016 840000048 qf8463641 19.000 950.00
3 85600023 03-02-2016 840000032 qf8463641 4.000 250.00
4 85600023 02-02-2016 840000027 ke7801465 1.780 89.00
Any suggested ways of dealing with this will be appreciated!
Actually you can use the functions of library(stringr) to achieve your goal (I skipped the rebus part, as it seems to be 'just' a helper for creating the regex, which I did by hand):
library(tidyverse)
parse_invoice <- function(in_text) {
## define regex, some assumptions:
## product id is 2 lower characters followed by 7 digits
## weight is some digits with a dot followed by kg
## amount is some digits at the end with a comma
all_regex <- list(date = "\\d{2}-\\d{2}-\\d{4}",
reference = "\\d{9}",
product_id = "[a-z]{2}\\d{7}",
weight = "\\d+\\.\\d+ kg",
amount = "\\d+,\\d+$")
## look only at lines where there is invoice data
rel_lines <- str_subset(in_text, all_regex$date)
## extract the pieces from the regex
ret <- as_tibble(map(all_regex, str_extract, string = rel_lines))
## clean up the data
ret %>%
mutate(invoice_nr = str_extract(str_subset(in_text, "invoice id:"), "\\d{8}"),
date = as.Date(date, "%d-%m-%Y"),
weight = as.numeric(str_replace(weight, "(\\d+.\\d+) kg", "\\1")),
amount = as.numeric(str_replace(amount, ",", "."))
) %>%
select(invoice_nr,
date,
reference,
product_id,
weight,
amount)
}
str(parse_invoice(invoice_example))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 6 variables:
# $ invoice_nr: chr "85600023" "85600023" "85600023" "85600023"
# $ date : Date, format: "2016-02-01" "2016-02-07" ...
# $ reference : chr "840000023" "840000048" "840000032" "840000027"
# $ product_id: chr "de6583621" "qf8463641" "qf8463641" "ke7801465"
# $ weight : num 14.9 19 4 1.78
# $ amount : num 745 950 250 89
Since I'm not familiar with rebus I've rewritten your code. Assuming the invoices are at least somewhat structured the same I could generate a tibble from your example. You would just have to apply this to your whole list and then purrr::reduce it to a big tibble:
df <- tibble(date=na.omit(str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4}")))
df %>% mutate(invoice_nr=na.omit(sub("invoice id: ","",str_extract(invoice_example,"invoice id: [0-9]+"))),
reference=na.omit(sub("\\d{2}-\\d{2}-\\d{4} ","",str_extract(invoice_example,"\\d{2}-\\d{2}-\\d{4} \\d{9}"))),
product_id=na.omit(str_extract(invoice_example,"[:lower:]{2}\\d{7}")),
weight=na.omit(sub(" kg","",str_extract(invoice_example,"[0-9\\.]+ kg"))),
amount=na.omit(sub("tonne ","",str_extract(invoice_example,"tonne [0-9,]+"))))
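One caveat with dropping NAs column by column: if a field is missing from a single invoice line, the remaining values shift up and the rows no longer line up. A hedged alternative, sketched here in base R, first keeps only the lines that contain a date and then extracts every field per line, leaving NA where nothing matches. The helper `extract_first` and the shortened `invoice_lines` vector are illustrative names, not part of the original code.

```r
# Shortened, illustrative subset of the invoice text from the question
invoice_lines <- c(
  "invoice id: 85600023",
  "01-02-2016 840000023 product a 24.45.6 de6583621 14.900 kg a 50 per tonne 745,00",
  "02-02-2016 840000027 ke7801465 1.780 kg per tonne 89,00",
  "subtotal 2.034,00"
)

# First match of `pattern` in each string, NA where there is no match
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

date_pattern <- "\\d{2}-\\d{2}-\\d{4}"

# Only lines that contain a date are treated as invoice rows
rows <- grep(date_pattern, invoice_lines, value = TRUE, perl = TRUE)

parsed <- data.frame(
  date       = as.Date(extract_first(rows, date_pattern), "%d-%m-%Y"),
  reference  = extract_first(rows, "\\b\\d{9}\\b"),
  product_id = extract_first(rows, "[a-z]{2}\\d{7}"),
  amount     = as.numeric(sub(",", ".", extract_first(rows, "\\d+,\\d{2}$"), fixed = TRUE)),
  stringsAsFactors = FALSE
)
parsed
```

Since every column is extracted from the same set of lines, each vector has one entry per invoice row and a missing field simply shows up as NA.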

Convert XML to Data frame in R (different number of elements in my list)

I have a similar question to this:
How to transform XML data into a data.frame?
I have an XML file that I want to convert to a data frame. But when I try this on my data, it doesn't work because the elements in my list have different lengths:
df3 = plyr::ldply(xmlToList(books), data.frame)
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, : arguments imply differing number of rows: 9, 10
Could anyone tell me how to convert XML to data frame when there are different number of elements in my list?
Thanks,
If you look closely at the XML file, there are 105 nodes under patient. If you pick one like "drugs", you still get 22 subnodes, some tags with text and attributes, some with only attributes and some with more subnodes. ldply can do lots of things, but not combine this mess.
doc <- xmlParse( file )
x <- xmlToList( doc)
names(x)
[1] "admin" "patient" ".attrs"
names(x$patient)
[1] "additional_studies"
[2] "tumor_tissue_site"
[3] "tumor_tissue_site_other"
[4] "prior_dx"
[5] "gender"
[6] "vital_status"
[7] "days_to_birth"
...
[103] "drugs"
[104] "radiations"
[105] "clinical_cqcf"
sapply(x$patient$drugs$drug, names)
## text and attributes (usually 9)
$tx_on_clinical_trial
[1] "text" ".attrs"
# attributes only
$regimen_number
[1] "preferred_name" "display_order" "cde" "cde_ver"
[5] "xsd_ver" "tier" "owner" "procurement_status"
[9] "restricted"
## 2 sub nodes
$therapy_types
[1] "therapy_type" "therapy_type_notes"
...
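For the simpler case where the records are flat but some fields are missing, the differing lengths can be handled by padding with NA before binding. This is only a sketch under that assumption: the `records` list is made-up example data, and it does not cover nested nodes or attribute-only tags like the ones shown above.

```r
# Made-up flat records; the second one lacks the "year" field
records <- list(
  list(title = "A", author = "X", year = "2001"),
  list(title = "B", author = "Y")
)

# Union of all field names across the records
all_fields <- unique(unlist(lapply(records, names)))

rows <- lapply(records, function(rec) {
  vals <- rec[all_fields]                                  # absent fields come back as NULL
  vals[vapply(vals, is.null, logical(1))] <- NA_character_ # pad them with NA
  names(vals) <- all_fields
  as.data.frame(vals, stringsAsFactors = FALSE)
})

# Every row now has the same columns, so rbind succeeds
df <- do.call(rbind, rows)
df
```

For deeply nested documents, it is usually easier to pick the specific nodes of interest first and build one such flat record per node.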
