Convert JSON nested data to dataframe in R - r

I have a JSON string that is a nested dataframe and is full of characters that need to be escaped, such as \n, \r and \. I have not been able to convert it with jsonlite.
Here's a dput of the first element of the list.
fromJSON(json_data) gives the following error:
Replacing the character "{" with blank character is not working.
Any help would be greatly appreciated.

This solution is meant to be a stop-gap for one known flaw in the json validation: two (or more) dictionaries are not separated by a comma. I discourage the use of regular expressions to fix this, but a fixed string-replacement can suffice:
json_data <- gsub("} {", "},{", json_data, fixed = TRUE)
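A minimal sketch of that replacement on a made-up input, since the original data is not shown. The sample string and the step of wrapping the result in square brackets (so the comma-separated objects form one JSON array) are assumptions, not part of the answer above. The jsonlite call only runs if the package is installed.

```r
# Hypothetical input: two JSON objects separated by a space instead of a comma.
json_data <- '{"a": 1} {"a": 2}'

# fixed = TRUE treats "} {" as a literal string, so no regex escaping is needed.
fixed <- gsub("} {", "},{", json_data, fixed = TRUE)

# Assumption: the objects belong in one array, so wrap them in [ ].
json_array <- paste0("[", fixed, "]")
writeLines(json_array)

# Parse only when jsonlite is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  parsed <- jsonlite::fromJSON(json_array)
  print(parsed)
}
```

If the objects are separated by newlines or other whitespace rather than a single space, the fixed pattern would need to match that separator instead.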

Related

How to remove "\" from paste function output with quotation marks?

I'm working with the following code:
Y_Columns <- c("Y.1.1")
paste('{"ImportId":"', Y_Columns, '"}', sep = "")
The paste function produces the following output:
"{\"ImportId\":\"Y.1.1\"}"
How do I get the paste function to omit the \? Such that, the output is:
"{"ImportId":"Y.1.1"}"
Thank you for your help.
Note: I did do a search on SO to see if there were any Q's that asked "what is an escape character in R". But I didn't review all the 160 answers, only the first 20.
This is one way of demonstrating what I wrote in my comment:
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")
out
#[1] "{\"ImportId\":\"Y.1.1\"}"
?print
print(out,quote=FALSE)
#[1] {"ImportId":"Y.1.1"}
Both R and regex patterns use escape characters to allow special characters to be displayed in print output or input. (And sometimes regex patterns need to have doubled escapes.) R has a few characters that need to be "escaped" in certain situations. You illustrated one such situation: including a double-quote character inside a result that will be printed with surrounding double-quotes. The backslashes appear only in the printed representation; they are not part of the string itself. If you were intending to include any single quotes inside a character value that was delimited by single quotes at the time of creation, they would have needed to be escaped as well.
out2 <- '\'quoted\''
nchar(out2)
#[1] 8 ... note that neither the surrounding single-quotes nor the backslashes get counted
out2
#[1] "'quoted'" ... and the default output quote-char is a double-quote.
Here's a good Q&A to review: How to replace '+' using gsub() function in R
It has two answers, both useful: one shows how to double escape a special character and the other shows how to use the fixed argument to get around that requirement.
And another potentially useful Q&A on the topic of handling Windows paths:
File path issues in R using Windows ("Hex digits in character string" error)
And some further useful reading suggestions: Look at the series of help pages that start with capital letters. (Since I can never remember which one has which nugget of essential information, I tried ?Syntax first and it has a "See Also" list of essential reading: Arithmetic, Comparison, Control, Extract, Logic, NumericConstants, Paren, Quotes, Reserved.) I then realized that what I wanted to refer you to was most likely ?Quotes, where all the R-specific escape sequence letters should be listed.
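To make the point above concrete, here is a small sketch using the same paste() call. It shows that the backslashes exist only in print()'s escaped display, while cat() and writeLines() emit the raw characters. The checks with nchar() and grepl() are additions for illustration.

```r
Y_Columns <- c("Y.1.1")
out <- paste('{"ImportId":"', Y_Columns, '"}', sep = "")

print(out)       # escaped representation, wrapped in double quotes
writeLines(out)  # the raw characters, with no backslashes
cat(out, "\n")   # same raw characters, followed by a newline

# The string holds 20 characters; none of them is a backslash.
nchar(out)
grepl("\\", out, fixed = TRUE)
```

So nothing needs to be removed from the value before passing it on to another function or writing it to a file.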

Using R, how does one extract multiple URLs/pattern matches from a string in a dataset, and then place each URL in its own adjacent column?

I have a (large) dataset that initially consists of an identifier and associated text (in raw HTML). Oftentimes the text will include one or more embedded links. Here's a sample dataset:
id text
1 <p>I love dogs!</p>
2 <p>My <strong>favorite</strong> dog is this kind.</p>
3 <p>I've had both Labs and Huskies in my life.</p>
What I'd like as output (with the text column included in the same spot, but I removed it for visibility here) is:
id link1 link2
1
2 doge.com
3 labs.com huskies.com
I've tried using str_extract_all() paired with <a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1, but even when I double escape the backslashes I either get an "unexpected" error OR it keeps asking me for more and I have to Escape out. I feel like this method is the one I want and SHOULD work, but I can't seem to get the regex to play nicely. Here are my results so far:
> str_extract_all(text, "<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1")
Error: '\s' is an unrecognized escape in character string starting ""<a\s"
> str_extract_all(text, perl(<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1))
Error: unexpected '<' in "str_extract_all(text, perl(<"
> str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1")
+
> str_extract_all(text, perl(<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1))
Error: unexpected '<' in "str_extract_all(text, perl(<"
I've also tried parseURI from the XML package and for whatever reason it crashes my R session.
The other solutions I've found to date either only deal with single links, or return items in a list or vector altogether. I want to keep things separated by their identifier and in a dataset.
If needed, I could tolerate generating a separate dataset and merging them together, but there will be cases where there are no links, so I'd want to avoid any pitfalls of rows being deleted due to not having a value in any of the link columns.
R does not like quotes within strings so in your example above R is considering the string ongoing:
str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=(["'])(.*?)\\1")
R is still looking for the end of the string because the double quote inside the pattern was not escaped, so it closed the string early. R has special cases in which a single \ can be used for escaping (e.g. \n for a new line), see this. Inside an R string literal, \' escapes a single quote and \" escapes a double quote:
str_extract_all(text, "<a\\s+(?:[^>]*?\\s+)?href=([\"'])(.*?)\\1")
"\ itself is a special character that needs escape, e.g. \\d. Do not
confuse these regular expressions with R escape sequences such as
\t."
or in your case \"
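One possible way to get from the matches to one column per link, sketched in base R so that it does not depend on stringr. The sample HTML is hypothetical (the links in the question's sample data did not survive), and the padding with NA is an assumption about how rows without links should look.

```r
# Hypothetical sample data with zero, one and two links per row.
df <- data.frame(
  id = 1:3,
  text = c(
    "<p>I love dogs!</p>",
    "<p>My <a href=\"doge.com\">favorite</a> dog is this kind.</p>",
    "<p>I've had <a href='labs.com'>Labs</a> and <a href=\"huskies.com\">Huskies</a>.</p>"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the question, with backslashes and the double quote escaped.
pattern <- "<a\\s+(?:[^>]*?\\s+)?href=([\"'])(.*?)\\1"

# One character vector of matched tag fragments per row (empty if no link).
tags <- regmatches(df$text, gregexpr(pattern, df$text, perl = TRUE))

# Keep only the part between the quotes that follow href=.
links <- lapply(tags, function(m) sub(".*href=[\"'](.*)[\"']$", "\\1", m, perl = TRUE))

# Pad every row to the same width with NA so that no row is dropped.
n_max <- max(lengths(links), 1L)
padded <- lapply(links, function(x) c(x, rep(NA_character_, n_max - length(x))))
link_mat <- matrix(unlist(padded), ncol = n_max, byrow = TRUE)
colnames(link_mat) <- paste0("link", seq_len(n_max))

result <- cbind(df["id"], as.data.frame(link_mat, stringsAsFactors = FALSE))
print(result)
```

Because every id keeps its row, the result can be merged back onto the original dataset by id. For messy real-world HTML, a proper HTML parser is usually more robust than a regex.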

R programming - How to remove special characters from a data set?

I have a data set that contains strings and special characters like the one below can be found in the data set.
Special character
How do I remove special characters like the above from my data set?
Use regular expressions to remove unwanted characters, for example:
dataset$textcolumn <- gsub("[^\\w\\s]", "", dataset$textcolumn, perl=TRUE)
to remove everything except word characters and whitespace. For more complex replacements, look into the help topic ?regexp.
Also look into the encoding (Encoding and iconv are helpful here); maybe the text is correct but the wrong encoding is assumed.
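A short sketch of the gsub() call on a made-up column, since the question's data is not shown. The data frame and its contents are hypothetical.

```r
# Hypothetical data; the real column will differ.
dataset <- data.frame(
  textcolumn = c("Hello, world! #R", "plain text"),
  stringsAsFactors = FALSE
)

# Drop every character that is neither a word character nor whitespace.
dataset$textcolumn <- gsub("[^\\w\\s]", "", dataset$textcolumn, perl = TRUE)
print(dataset$textcolumn)

# Report the declared encoding of each element before deciding whether
# the characters are really unwanted or merely misread.
Encoding(dataset$textcolumn)
```

Note that this pattern also removes punctuation, which may or may not be what is wanted.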

How to convert special characters into unicode in R?

When doing some textual data cleaning in R, I found some special characters. In order to get rid of them, I have to know their Unicode code points; for example, € is \u20AC. I would like to know if it is possible to "see" the code points with a function that takes the string containing the special character as input.
Referring to Cath's comment, iconv can do the job:
iconv("é", toRaw = TRUE)
Then, you may want to unlist and paste with \u00.
special_char <- "%"
Unicode::as.u_char(utf8ToInt(special_char))
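If installing the Unicode package is not an option, a base R sketch along the same lines is possible. The helper name to_codepoints is hypothetical, and it assumes a single UTF-8 string as input, which is what utf8ToInt() expects.

```r
# Hypothetical helper: format each code point of one string as U+XXXX.
to_codepoints <- function(x) sprintf("U+%04X", utf8ToInt(x))

to_codepoints("\u20AC")   # the euro sign
to_codepoints("a\u20AC")  # one entry per character
```

To process a whole vector of strings, the helper could be applied with lapply().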

How to handle blank items when converting dates in R

I have a csv download of data from a Management Information system. There are some variables which are dates and are written in the csv as strings of the format "2012/11/16 00:00:00".
After reading in the csv file, I convert the date variables into a date using the function as.Date(). This works fine for all variables that do not contain any blank items.
For those which do contain blank items I get the following error message:
"character string is not in a standard unambiguous format"
How can I get R to replace blank items with something like "0000/00/00 00:00:00" so that the as.Date() function does not break? Are there other approaches you might recommend?
If they're strings, does something as simple as
mystr <- c("2012/11/16 00:00:00"," ","")
mystr[grepl("^ *$",mystr)] <- NA
as.Date(mystr)
work? (The regular expression "^ *$" looks for strings consisting of the start of the string (^), zero or more spaces ( *), followed by the end of the string ($). More generally I think you could use "^[[:space:]]*$" to capture other kinds of whitespace, such as tabs.)
Even better, have the NAs correctly inserted when you read in the CSV:
read.csv(..., na.strings='')
or to specify a vector of all the values which should be read as NA...
read.csv(..., na.strings=c('',' ',' '))
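Both approaches can be tried without a file on disk. This is a sketch under a few assumptions: the inline CSV text and its column names are made up, and the explicit format "%Y/%m/%d" is an addition, used so that as.Date() does not have to guess the format.

```r
# Approach 1: blank out whitespace-only strings, then convert.
mystr <- c("2012/11/16 00:00:00", " ", "")
mystr[grepl("^[[:space:]]*$", mystr)] <- NA
dates <- as.Date(mystr, format = "%Y/%m/%d")
print(dates)

# Approach 2: let read.csv() insert the NAs while reading.
# Hypothetical CSV content, passed via text= instead of a file name.
csv_text <- "id,start\n1,2012/11/16 00:00:00\n2,\n"
dat <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = FALSE)
dat$start <- as.Date(dat$start, format = "%Y/%m/%d")
print(dat)
```

Missing dates end up as NA in both cases, which is generally safer than a placeholder such as "0000/00/00 00:00:00" that is not a valid date.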

Resources