Instructing R to find variable names in rows when reading a CSV file

Is there a way to have R read the column/variable name in each cell when reading a CSV file?
My CSV file is malformed: not every row has every variable, and not every row is the same length. However, every value carries its variable name with it, e.g. "id": "37189", "city": "Phoenix", "type": "business". When I tell R to read the CSV, can I instruct it to find the variable names within the data and sort the values into the right columns accordingly?
Data sample for your convenience:
business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: UsFtqoBl7naz8AVUBZMjQQ,full_address: 202 McClure St\nDravosburg, PA 15034, open: true, categories: [Nightlife], city: Dravosburg, review_count: 4, name: Clancy's Pub, neighborhoods: [], longitude: -79.886930000000007, state: PA, stars: 3.5, latitude: 40.350518999999998, attributes: Happy Hour: true, Accepts Credit Cards: true, Good For Groups: true, Outdoor Seating: false, Price Range: 1, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
business_id: cE27W9VPgO88Qxe4ol6y_g,{ full_address: 1530 Hamilton Rd\nBethel Park, PA 15234}, open: false, categories: [Active Life, Mini Golf, Golf], city: Bethel Park, review_count: 5, name: Cool Springs Golf Center, neighborhoods: [], longitude: -80.015910000000005, state: PA, stars: 2.5, latitude: 40.356896200000001, attributes: Good for Kids: true, type: business,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A few of the variables (for example close, the open times, and the various attributes) do not appear in every entry.

This will get you started but you still have quite a bit of work to do. This works for one line (and it may work for the other two in the example) but it can be extrapolated to work with all of the lines (lapply FTW). Basically you need to rebuild the JSON structure from that single field (there may be alternative ways, especially if you do not need all the fields). It's easier than it might otherwise be since the Yelp schema is known.
You have to attack it in a pretty deterministic way, converting some fields before others, accounting for spaces in field names, dealing with arrays & nested structures, etc. As I said, you have quite a bit of work ahead of you. If your regex-fu is weak, this will provide ample practice to become a regex ninja.
library(stringi)
library(stringr)
library(jsonlite)
txt <- 'business_id: vcNAWiLM4dR7D2nwwJ7nCA, full_address: 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018, close: 17:00, open: 08:00, open: true, categories: [Doctors, Health & Medical], city: Phoenix, review_count: 9, name: Eric Goldberg, MD, neighborhoods: [], longitude: -111.98375799999999, state: AZ, stars: 3.5, latitude: 33.499313000000001, attributes: By Appointment Only: true, type: business'
txt <- gsub("\n", "|", txt)
txt <- sub("business_id: ([[:alnum:]\\:]+)", '"business_id": "\\1"', txt)
txt <- sub('attributes: ', '"attributes": {', txt)
txt <- sub('By Appointment Only: ', '"By Appointment Only": ', txt)
txt <- sub('Accepts Credit Cards: ', '"Accepts Credit Cards": ', txt)
txt <- sub('Good For Groups: ', '"Good For Groups": ', txt)
txt <- sub('Outdoor Seating: ', '"Outdoor Seating": ', txt)
txt <- sub('Price Range: ', '"Price Range": ', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), close:", '"full_address": "\\1", close:', txt)
txt <- sub("full_address: ([[:alnum:][:space:],\\|\\-\\.]+), open:", '"full_address": "\\1", open:', txt)
txt <- sub("name: (.*), neighborhoods:", '"name": "\\1", "neighborhoods":', txt)
txt <- gsub("open: ([[:alnum:]\\:]+)", '"open": "\\1"', txt)
txt <- sub("close: ([[:alnum:]\\:]+)", '"close": "\\1"', txt)
txt <- sub("longitude: ([[:digit:]\\.-]+)", '"longitude": "\\1"', txt)
txt <- sub("latitude: ([[:digit:]\\.-]+)", '"latitude": "\\1"', txt)
txt <- sub("review_count: ([[:digit:]\\.]+)", '"review_count": "\\1"', txt)
txt <- sub("stars: ([[:digit:]\\.]+)", '"stars": "\\1"', txt)
txt <- sub("state: ([[:alpha:]]+)", '"state": "\\1"', txt)
txt <- sub("city: ([[:alpha:] \\.-]+)", '"city": "\\1"', txt)
txt <- sub("type: ([[:alpha:]]+)", '"type": "\\1"', txt)
cats <- paste0(sprintf('"%s"', str_trim(str_split(str_match_all(txt, "categories: \\[([[:alpha:] &-,]+)\\],")[[1]][,2], ",")[[1]])), collapse=", ")
txt <- sub("categories: \\[([[:alpha:] &-,]+)\\],", '"categories": [' %s+% cats %s+% '],', txt)
txt <- "{" %s+% txt %s+% "}}"
fromJSON(txt)
## $business_id
## [1] "vcNAWiLM4dR7D2nwwJ7nCA"
##
## $full_address
## [1] "4840 E Indian School Rd|Ste 101|Phoenix, AZ 85018"
##
## $close
## [1] "17:00"
##
## $open
## [1] "08:00"
##
## $open
## [1] "true"
##
## $categories
## [1] "Doctors" "Health & Medical"
##
## $city
## [1] "Phoenix"
##
## $review_count
## [1] "9"
##
## $name
## [1] "Eric Goldberg, MD"
##
## $neighborhoods
## list()
##
## $longitude
## [1] "-111.98375799999999"
##
## $state
## [1] "AZ"
##
## $stars
## [1] "3.5"
##
## $latitude
## [1] "33.499313000000001"
##
## $attributes
## $attributes$`By Appointment Only`
## [1] TRUE
##
## $attributes$type
## [1] "business"
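If you do not need the full JSON rebuild, a rougher first pass is to split each line only at commas that are followed by something shaped like a `key:` label. This is a sketch on a simplified, hypothetical line; it still mishandles the nested attributes and the bracketed categories, but it recovers most of the flat fields:

```r
line <- "business_id: abc123, open: true, name: Eric Goldberg, MD, stars: 3.5, type: business"

# split only at commas that are followed by something shaped like "key: ",
# so commas inside values such as "Eric Goldberg, MD" survive the split
fields <- strsplit(line, ",\\s*(?=[[:alnum:]_ ]+:\\s)", perl = TRUE)[[1]]

# pull the key and value out of each field (everything before the first colon is the key)
kv <- regmatches(fields, regexec("^\\s*([^:]+):\\s*(.*)$", fields))
out <- setNames(vapply(kv, function(m) m[3], character(1)),
                vapply(kv, function(m) m[2], character(1)))
```

Duplicate keys (the two `open` fields in the sample) simply become repeated names in the resulting vector, which you would still have to reconcile.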
And, whoever gave you this file deserves whatever evil comes their way in their programmatic life. I'd give them back whatever they wanted from this in gnarly XML with EBCDIC encoding.

Subsetting elements in a list and placing them in a data frame

I have a list ("listanswer") that looks something like this:
> str(listanswer)
List of 100
$ : chr [1:3] "" "" "\t\t"
$ : chr [1:5] "" "Dr. Smith" "123 Fake Street" "New York, ZIPCODE 1" ...
$ : chr [1:5] "" "Dr. Jones" "124 Fake Street" "New York, ZIPCODE 2" ...
> listanswer
[[1]]
[1] "" "" "\t\t"
[[2]]
[1] "" "Dr. Smith" "123 Fake Street" "New York"
[5] "ZIPCODE 1"
[[3]]
[1] "" "Dr. Jones" "124 Fake Street," "New York"
[5] "ZIPCODE2"
For each element in this list, I noticed the following pattern within the sub-elements:
# first sub-element is always empty
> listanswer[[2]][[1]]
[1] ""
# second sub-element is the name
> listanswer[[2]][[2]]
[1] "Dr. Smith"
# third sub-element is always the address
> listanswer[[2]][[3]]
[1] "123 Fake Street"
# fourth sub-element is always the city
> listanswer[[2]][[4]]
[1] "New York"
# fifth sub-element is always the ZIP
> listanswer[[2]][[5]]
[1] "ZIPCODE 1"
I want to create a data frame that contains the information from this list in row format. For example:
id name address city ZIP
1 2 Dr. Smith 123 Fake Street New York ZIPCODE 1
2 3 Dr. Jones 124 Fake Street New York ZIPCODE 2
I thought of the following way to do this:
name = sapply(listanswer,function(x) x[2])
address = sapply(listanswer,function(x) x[3])
city = sapply(listanswer,function(x) x[4])
zip = sapply(listanswer,function(x) x[5])
final_data = data.frame(name, address, city, zip)
id = 1:nrow(final_data)
My Question: I just wanted to confirm - Is this the correct way to reference sub-elements in lists?
If it works, it's the correct way, although there might be a more efficient or more readable way to do the same thing.
Another way to do this is to create a data frame with your columns and add rows to it, i.e.
#create an empty data frame
df <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(df) <- c("name", "address", "city", "zip")
#add rows (a for loop here: assigning to df inside an lapply function would
#only modify a local copy, leaving the outer df empty)
for (x in listanswer) {
  df[nrow(df) + 1, ] <- x[2:5]
}
This is simply another way to solve the same problem. Readability is a personal preference, and there's nothing wrong with your solution either.
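A third option, sketched here under the assumption that the useful elements always have at least five sub-elements like your sample, is to build all the rows at once instead of growing the data frame:

```r
# sample data shaped like the question's list
listanswer <- list(c("", "", "\t\t"),
                   c("", "Dr. Smith", "123 Fake Street", "New York", "ZIPCODE 1"),
                   c("", "Dr. Jones", "124 Fake Street", "New York", "ZIPCODE 2"))

# keep only elements long enough to hold name/address/city/zip
keep <- listanswer[lengths(listanswer) >= 5]

# build one single-row data frame per element, then stack them all at once
final_data <- do.call(rbind, lapply(keep, function(x) {
  data.frame(name = x[2], address = x[3], city = x[4], zip = x[5])
}))
final_data$id <- seq_len(nrow(final_data))
```

Growing a data frame one row at a time copies it on every iteration; building the rows in a list and binding once scales much better for long lists.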
If this is based on your elephant question, for businesses in Vancouver, then this mostly works.
library(rvest)
library(dplyr) # for bind_rows()
url<-"Website/british-columbia/"
page <-read_html(url)
#find the div tab of class=one_third
b = page %>% html_nodes("div.one_third")
listanswer <- b %>% html_text() %>% strsplit("\\n")
#listanswer2 <- b %>% html_text2() %>% strsplit("\\n")
listanswer[[1]]<-NULL #remove first blank record
rows <- lapply(listanswer, function(element){
  vect <- element[-1]  #remove first blank field
  cityindex <- as.integer(grep("Vancouver", vect))  #find city field
  #add some error checking and corrections
  if (length(cityindex) == 0) {
    cityindex <- length(vect) - 1
  } else if (length(cityindex) > 1) {
    cityindex <- cityindex[2]
  }
  #get the fields of interest
  address <- vect[cityindex - 1]
  city <- vect[cityindex]
  phone <- vect[cityindex + 1]
  if (cityindex < 3) {
    cityindex <- 3
  }  #error check
  #first groups combine into 1 name
  name <- toString(vect[1:(cityindex - 2)])
  data.frame(name, address, city, phone)
})
answer<-bind_rows(rows)
#clean up
answer$phone <- sub("Website", "", answer$phone)
answer
This still needs some clean-up to handle the inconsistencies, but it should be 80-90% complete.

R grepl not giving desired result loading CSV

I don't know what I could be overlooking here, but I am importing a CSV file with a bunch of names into a data.frame. When I pull a value from the data frame and run grepl against it, there is no match. If I take that same value and manually create a string, it matches fine. Any help would be appreciated.
I obviously can't give you the CSV or the data source, so I have tried to include all the code below.
After a further look, it seems the string doesn't actually contain a space:
> Parks[1,2]
[1] "Abraham Lincoln Birthplace National Historical Park"
> typeof(Parks[1,2])
[1] "character"
> grepl(" ", Parks[1,2], fixed = TRUE)
[1] FALSE
> grepl("National Historical Park", Parks[1,2])
[1] FALSE
> grepl("National", Parks[1,2], fixed = TRUE)
[1] TRUE
> grepl("National Historical Park", "Abraham Lincoln Birthplace National Historical Park")
[1] TRUE
> grepl(" ", "Abraham Lincoln Birthplace National Historical Park")
[1] TRUE
The blank spaces were actually Unicode \u2022 (bullet) characters. Running the following code before grepl gives the desired result (note that the cleaned copy, Code, is what gets tested):
> Code <- Parks[1,2]
> Code <- gsub('[^\x20-\x7E]', ' ', Code)
> grepl(" ", Code, fixed = TRUE)
[1] TRUE
> grepl("National Historical Park", Code)
[1] TRUE
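To make every later lookup work rather than one cell at a time, the same substitution can be applied to the whole column in one go. A small self-contained sketch (the sample value is hypothetical, with \u2022 bullets standing in for the spaces):

```r
# sample data frame with invisible bullet characters where spaces should be
Parks <- data.frame(
  id = 1,
  name = "Abraham\u2022Lincoln\u2022Birthplace\u2022National\u2022Historical\u2022Park"
)

# replace every non-printable-ASCII character with a space, column-wide
Parks$name <- gsub("[^\x20-\x7E]", " ", Parks$name)

grepl("National Historical Park", Parks$name)
# [1] TRUE
```

Be aware that the `[^\x20-\x7E]` class also strips legitimate accented characters, so it is only appropriate when the column is expected to be plain ASCII.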

How can I efficiently extract data from pdf by looping through each line and extracting the characters in between a specific pattern of strings?

I'm attempting to parse text data from a pdf application using:
Credibly_text <- pdf_text("Credibly_Business__Funding_Application(385).pdf") %>%
  readr::read_lines()
reducedWhitespace <- Credibly_text[9:45] %>%
  str_squish()
which returns lines that look like:
[1] "Legal/corporate name: RTT Enterprises Inc. DBA: RTT Enterprises Inc."
[2] "Physical address: 2145 W Suhest St City: Springfield State: MO Zip: 65807"
[3] "Mailing Address: 2145 W Suhest St City: Springfield State: MO Zip: 65807"
[4] "Federal tax ID: 208088643 Business phone: 4178485439 Fax:"
[5] "Contact: Richard Hare Email: rchare1#msn.com Website: airservicesheatac.com"
[6] "Date business started: 01/01/1964 Length of ownership: 2006-12-26 Years at location: 12 # of locations: 1"
So far I have been extracting the values that I need by using a brute force method:
temp <- reducedWhitespace[[1]]
res <- str_match(temp, "name: (.*?) DBA")
values[[1]] <- res[,2]
values[[1]]
which returns:
"RTT Enterprises Inc."
This feels way too inefficient, and I was wondering if there is a way to loop through each line and identify the strings I need to extract the value between. I am very inexperienced with regex, so I'm having trouble accounting for all the cases. In one line you can have
"Physical address: 2145 W Suhest St City:"
where you can simply extract the string between two labels that end with a colon. But there are also cases where a label itself contains spaces:
"Federal tax ID: 208088643 Business phone:"
or where the needed value is not between two colon-terminated labels but only comes after one, such as:
"Zip: 65807"
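Since the set of labels on the form is fixed and known, one way around the ambiguity (labels containing spaces, labels with empty values) is to search for the known labels themselves and take whatever text sits between two matches. A base-R sketch, using only labels visible in the sample lines above:

```r
# labels taken from the sample lines; extend as needed for the full form
labels <- c("Legal/corporate name", "DBA", "Physical address", "Mailing Address",
            "City", "State", "Zip", "Federal tax ID", "Business phone", "Fax",
            "Contact", "Email", "Website", "Date business started",
            "Length of ownership", "Years at location", "# of locations")
pat <- paste0("(", paste(labels, collapse = "|"), "):")

extract_pairs <- function(line) {
  m <- gregexpr(pat, line)[[1]]            # positions of every label in the line
  starts <- as.integer(m)
  if (starts[1] == -1) return(character(0))
  lens <- attr(m, "match.length")
  keys <- substring(line, starts, starts + lens - 2)          # label without ":"
  # each value runs from just after a label to just before the next one
  vals <- substring(line, starts + lens, c(starts[-1] - 1, nchar(line)))
  setNames(trimws(vals), keys)
}

extract_pairs("Federal tax ID: 208088643 Business phone: 4178485439 Fax:")
```

Labels with no value (like "Fax:" above) come back as empty strings, and lapply-ing `extract_pairs` over `reducedWhitespace` gives one named vector per line.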

In R, how do I wrap text around all words in a string but a specific one (going from left to right)? Iteration and string manipulation

I know my question is a little vague, so I have an example of what I'm trying to do.
input <- c('I go to school')
#Output
'"I " * phantom("go to school")'
'phantom("I ") * "go" * phantom("to school")'
'phantom("I go ") * "to" * phantom("school")'
'phantom("I go to ") * "school"'
I've written a function, but I'm having a lot of trouble figuring out how to make it applicable to strings with different numbers of words and I can't figure out how I can include iteration to reduce copied code. It does generate the output above though.
Right now my function only works on strings with 4 words. It also includes no iteration.
My main questions are: How can I include iteration into my function? How can I make it work for any number of words?
add_phantom <- function(stuff){
  strings <- c()
  stuff <- str_split(stuff, ' ')
  strings[1] <- str_c('"', stuff[[1]][[1]], ' "', ' * ',
                      'phantom("', str_c(stuff[[1]][[2]], stuff[[1]][[3]], stuff[[1]][[4]], sep = ' '), '")')
  strings[2] <- str_c('phantom("', stuff[[1]][[1]], ' ")',
                      ' * "', stuff[[1]][[2]], '" * ',
                      'phantom("', str_c(stuff[[1]][[3]], stuff[[1]][[4]], sep = ' '), '")')
  strings[3] <- str_c('phantom("', str_c(stuff[[1]][[1]], stuff[[1]][[2]], sep = ' '), ' ")',
                      ' * "', stuff[[1]][[3]], '" * ',
                      'phantom("', stuff[[1]][[4]], '")')
  strings[4] <- str_c('phantom("', str_c(stuff[[1]][[1]], stuff[[1]][[2]], stuff[[1]][[3]], sep = ' '), ' ")',
                      ' * "', stuff[[1]][[4]], '"')
  return(strings)
}
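The pattern in the four hand-written branches can be captured with a single loop over word positions: everything before word i goes into one phantom(), everything after into another. A base-R sketch that reproduces the example output for any number of words (the first word keeps its trailing space inside the quotes, matching the example):

```r
add_phantom <- function(stuff) {
  words <- strsplit(stuff, " ")[[1]]
  n <- length(words)
  vapply(seq_len(n), function(i) {
    # the visible word; only the first keeps its trailing space, as in the example
    word <- if (i == 1 && n > 1) paste0('"', words[i], ' "') else paste0('"', words[i], '"')
    # everything before word i, wrapped in phantom()
    before <- if (i > 1)
      paste0('phantom("', paste(words[seq_len(i - 1)], collapse = " "), ' ") * ') else ""
    # everything after word i, wrapped in phantom()
    after <- if (i < n)
      paste0(' * phantom("', paste(words[seq(i + 1, n)], collapse = " "), '")') else ""
    paste0(before, word, after)
  }, character(1))
}

add_phantom('I go to school')
```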
This is some butcher work, but it gives the expected output :):
input <- c('I go to school')
library(purrr)
inp <- c(list(NULL), strsplit(input, " ")[[1]])
phantomize <- function(x, leftside = TRUE){
  if (length(x) == 1) return("")
  if (leftside) {
    ph <- paste0('phantom("', paste(x[-1], collapse = " "), ' ") * ')
  } else {
    ph <- paste0(' * phantom("', paste(x[-1], collapse = " "), '")')
  }
  ph
}
map(1:(length(inp) - 1),
    ~ paste0(phantomize(inp[1:.x]),
             inp[[.x + 1]],
             phantomize(inp[(.x + 1):length(inp)], FALSE)))
# [[1]]
# [1] "I * phantom(\"go to school\")"
#
# [[2]]
# [1] "phantom(\"I \") * go * phantom(\"to school\")"
#
# [[3]]
# [1] "phantom(\"I go \") * to * phantom(\"school\")"
#
# [[4]]
# [1] "phantom(\"I go to \") * school"
This is a bit of a hack, but I think it gets at what you're trying to do:
library(corpus)
input <- 'I go to school'
types <- text_types(input, collapse = TRUE) # all word types
(loc <- text_locate(input, types)) # locate all word types, get context
## text before instance after
## 1 1 I go to school
## 2 1 I go to school
## 3 1 I go to school
## 4 1 I go to school
The return value is a data frame whose columns have type corpus_text. This approach seems crazy, but it doesn't actually allocate new strings for the before and after contexts.
Here's the output you wanted:
paste0("phantom(", loc$before, ") *", loc$instance, "* phantom(", loc$after, ")")
## [1] "phantom() *I* phantom( go to school)"
## [2] "phantom(I ) *go* phantom( to school)"
## [3] "phantom(I go ) *to* phantom( school)"
## [4] "phantom(I go to ) *school* phantom()"
If you want to really get crazy and ignore punctuation:
phantomize <- function(input, ...) {
types <- text_types(input, collapse = TRUE, ...)
loc <- text_locate(input, types, ...)
paste0("phantom(", loc$before, ") *", loc$instance, "* phantom(",
loc$after, ")")
}
phantomize("I! go to school (?)...don't you?", drop_punct = TRUE)
## [1] "phantom() *I* phantom(! go to school (?)...don't you?)"
## [2] "phantom(I! ) *go* phantom( to school (?)...don't you?)"
## [3] "phantom(I! go ) *to* phantom( school (?)...don't you?)"
## [4] "phantom(I! go to ) *school* phantom( (?)...don't you?)"
## [5] "phantom(I! go to school (?)...) *don't* phantom( you?)"
## [6] "phantom(I! go to school (?)...don't ) *you* phantom(?)"
I would suggest something like this:
library(tidyverse)
library(glue)
test_string <- "i go to school"
str_split(test_string, " ") %>%
  map(~ str_split(test_string, .x, simplify = TRUE)) %>%
  flatten() %>%
  map(str_trim) %>%
  keep(~ .x != "") %>%
  map(~ glue("phantom({string})", string = .x))
This code snippet can easily be implemented in a function and will return the following output.
[[1]]
phantom(i)
[[2]]
phantom(i go)
[[3]]
phantom(i go to)
[[4]]
phantom(go to school)
[[5]]
phantom(to school)
[[6]]
phantom(school)
I might have misinterpreted your question -- I am not quite sure whether you really want the output to have the same format as in your exemplary output.

How to remove words in corpus that start with $ in R?

I am trying to do preprocessing on a corpus in R, and I need to remove the words that start with $. The code below removes the $ but not the $words, and I am puzzled.
inspect(data.corpus1[1:2])
# <<SimpleCorpus>>
# Metadata: corpus specific: 1, document level (indexed): 0
# Content: documents: 2
#
# [1] $rprx loading mid .60's, think potential. 12m vol fri already 11m today
# [2] members report success see track record $itek $rprx $nete $cnet $zn $cwbr $inpx
removePunctWords <- function(x) {
gsub(pattern = "\\$", "", x)
}
data.corpus1 <-
tm_map(data.corpus1,
content_transformer(removePunctWords))
inspect(data.corpus1[1:2])
# <<SimpleCorpus>>
# Metadata: corpus specific: 1, document level (indexed): 0
# Content: documents: 2
#
# [1] rprx loading mid .60's, think potential. 12m vol fri already 11m today
# [2] members report success see track record itek rprx nete cnet zn cwbr inpx
Your regular expression only specifies the $. You need to include the rest of the word.
removePunctWords <- function(x) {
gsub(pattern = "\\$\\w*", "", x)
}
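An extended version of the transformer (an optional tweak, not part of the original answer) also collapses the doubled spaces that the substitution leaves behind:

```r
removePunctWords <- function(x) {
  x <- gsub("\\$\\w*", "", x)     # drop "$" plus the word attached to it
  gsub("\\s+", " ", trimws(x))    # collapse leftover runs of whitespace
}

removePunctWords("members report success see track record $itek $rprx $nete")
# [1] "members report success see track record"
```

Note `\\w*` stops at non-word characters, so a ticker like "$zn." would leave the "." behind; use `\\$\\S*` instead if the cashtags can carry trailing punctuation.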
