Problems extracting data using JSON in R (getting a lexical error)

Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
  return(match(TRUE, map(states, ~ .x$name == state)))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>% as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before, but now this part of it throws an error:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text.
                                       NA
                     (right here) ------^
Any proposed solutions? The website looks like it has stayed the same over the year, so what kind of change could be breaking the code?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This worked for a while, but it seems the site has since changed its underlying HTML again.

Change the regex pattern as shown below so that it correctly captures the desired string within the response text, i.e. the JavaScript object to use for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in R the backslash escape is doubled, e.g. \\s rather than the \s used in plain regex syntax.
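If the site changes its markup yet again, str_match() will return NA and parse_json() will fail with the same lexical error. A small guard makes that failure explicit (a minimal sketch, reusing the pattern from the answer above):
raw_json <- stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2]
# Stop with a clear message instead of handing NA to the JSON parser
if (is.na(raw_json)) stop("Regex did not match; the page structure has likely changed again.")
all_data <- jsonlite::parse_json(raw_json)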

Related

How can I make R import each line of a .txt file as a character string?

I have a complex .txt file (sample lines below). I need each line as its own character string so that I can group the lines by the 5-letter code near the beginning of each line (group together all GPGGA lines, for example; see the sample below) and process them as I need to. Here's what I've run so far:
df <- data.frame(Weather_data)
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
This received the error:
Error: Problem with mutate() column Entry.
i Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text).
x cannot coerce type 'closure' to vector of type 'character'
I had success filtering as I needed when I copied and pasted the first few lines in and manually made them character strings to see if I could get the code to function, so making each line a character string NOT manually (there are over 3,000 lines) is the next step. Anyone have a function to do this?
Here are some of the lines produced when I load the imported txt file:
HEADER
<chr>
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
13:30:00.827: <- $HCHDG,74.8,0.0,E,3.6,W*6E
13:30:01.003: <- $WIMDA,29.9641,I,1.0147,B,26.5,C,,,48.2,,14.6,C,323.0,T,326.6,M,1.4,N,0.7,M*66
13:30:01.051: <- $WIMWV,248.4,R,1.1,N,A*29
13:30:01.114: <- $WIMWV,255.6,T,1.3,N,A*23
13:30:01.195: <- $YXXDR,A,-53.9,D,PTCH,A,-34.2,D,ROLL*57
13:30:01.307: <- $YXXDR,A,0.571,G,XACC,A,0.783,G,YACC,A,-0.181,G,ZACC*57
13:30:01.578: <- $GPGGA,183001.30,4415.6242,N,08823.9769,W,1,7,1.7,225.9,M,-33.4,M,,*64
You referenced the variable text which does not exist in your data.frame. Your column is named HEADER.
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", HEADER)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
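As for importing each line as its own character string: readLines() does exactly that, returning one string per line, which can then be wrapped in a data frame. A minimal sketch, where "weather.txt" is a hypothetical file name standing in for the actual log file:
# readLines() returns a character vector with one element per line of the file
lines <- readLines("weather.txt")
df <- data.frame(HEADER = lines, stringsAsFactors = FALSE)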

Using separate items in a list as input for the pipe operator in R

I have written a script which takes a list of URLs as input and then scrapes certain information from the websites. I did this with a for loop, but the processing time is already very long, and I expect the list to grow over time, so I wanted to re-code the script in a more efficient way. My idea was to eliminate the for loop and use pipe operators to reduce the processing time. My original (working) code is as follows:
imo <- c()
mmsi <- c()
for (i in 1:nrow(data)) {
  url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME[i])
  page <- read_html(url)
  CSSextract1 <- html_nodes(page, '.td_imo')
  CSSextract2 <- html_nodes(page, '.td_mmsi')
  imos <- html_text(CSSextract1)[2]
  imo[i] <- imos
  mmsis <- html_text(CSSextract2)[2]
  mmsi[i] <- mmsis
}
data$IMO <- gsub("[\r \n \t]", "", imo)
data$MMSI <- gsub("[\r \n \t]", "", mmsi)
data$NAME <- gsub("\\+", " ", data$NAME)
I have re-written the code, trying to eliminate the for loop, as follows:
CSSex1 <- function(page) {
  CSSextract <- html_nodes(page, '.td_imo')
  return(CSSextract)
}
data$url <- sprintf("https://www.marinevesseltraffic.com/vessels?vessel=%s&flag=&page=1&sort=lenght&direction=desc", data$NAME)
data$mmsi <- data$url %>% read_html() %>% CSSex1() %>% html_text()[2]
However, it gives me the error:
Error: `x` must be a string of length 1
I assume that, the way I coded it, the whole list (data$url) is taken as input at once, so my question is:
Is it possible, and if so how, to take each element from data$url as input without using a (for) loop?
You may wish to set up url as a column of a data frame (data) to try:
mmsi_func <- function(x) {
  z <- x %>%
    read_html() %>%
    CSSex1() %>%
    html_text()
  z[2]
}
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = mmsi_func(url))
or something along those lines. I am not sure what the expected output is supposed to look like, but if it is a list rather than a vector, you can use this minor adjustment for a list column in the dataframe:
data <- data %>%
  rowwise() %>%
  dplyr::mutate(mmsi = list(mmsi_func(url)))
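If you prefer to avoid rowwise(), purrr applies the same function element-wise; a sketch under the same assumptions, reusing mmsi_func from above:
library(purrr)
# map_chr() calls mmsi_func on each URL and returns a character vector
data$mmsi <- map_chr(data$url, mmsi_func)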

Errors with data frames from json and xml

I need to get a data frame from JSON or XML files (the data is available in both formats here). Yet I get errors when I try to create those data frames in R.
With the JSON file, the error is:
Error in parse_con(txt, bigint_as_char) : lexical error: invalid bytes in UTF8 string.
stion":"0","name_question":"Óðî÷èñòå çàñ³äàííÿ Âåðõîâíî¿ Ðàä
(right here) ------^
With the XML file, the error is:
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(date_agenda = "27112014", : duplicate subscripts for columns
The commands I use are:
library(jsonlite)
library(XML)
k <- fromJSON("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json", encoding = "UTF-8")
m <- xmlToDataFrame("agendas_8_skl.xml")
Prior to executing the commands, I download the files to the working directory.
I do not understand how I can get the data. Please help!
This answer is based on @user2554330's answer here:
library(jsonlite)
library(RCurl)
# In case your locale is different from Ukrainian
Sys.setlocale("LC_CTYPE", "ukrainian")
k <- fromJSON(getURL("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json",
                     .encoding = "ISO-8859-5"))
# Turn k into a data frame using tidyr::unnest
library(dplyr)
library(tidyr)
df <- tibble(date_agenda = k[[1]]$date_agenda, question = k[[1]]$question) %>%
  unnest(question) %>%
  unnest(reporter_question, keep_empty = TRUE)
Here is a solution that works with the XML data. See the code comments for details:
library(xml2)
library(dplyr)
# Read the page
page <- read_xml("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.xml")
# Obtain a list of parent nodes
agendas <- xml_find_all(page, "agenda")
output <- lapply(agendas, function(agenda) {
  # Get the date
  date <- agenda %>% xml_find_first(".//date_agenda") %>% xml_text() %>% as.Date(format = "%d%m%Y")
  # Pull the question id from the attribute
  question_id <- agenda %>% xml_find_all(".//question") %>% xml_attr("id_question")
  # Obtain the information from all of the nodes (assumes an equal number of each)
  number_questions <- agenda %>% xml_find_all(".//number_question") %>% xml_text()
  init_questions <- agenda %>% xml_find_all(".//init_question") %>% xml_text()
  name_questions <- agenda %>% xml_find_all(".//name_question") %>% xml_text()
  # Create a data frame of answers (long format)
  data.frame(date, question_id, number_questions, init_questions, name_questions, stringsAsFactors = FALSE)
})
# Bind into one large long-format data frame
finalanswer <- bind_rows(output)
head(finalanswer)
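As an aside, the lapply()/bind_rows() pair can be collapsed into one purrr call; a sketch where parse_agenda stands for the anonymous function defined above:
library(purrr)
# map_dfr() applies the function to each agenda node and row-binds the results
finalanswer <- map_dfr(agendas, parse_agenda)  # parse_agenda = the per-agenda function above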

Adding ifelse() into a Map function

I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to get a scraper that gets all of the text files and downloads them to my working directory. My goal: use an ifelse() or a plain if statement to only scrape a file based on a certain date.
E.g., if four files were posted on 1/31/19 and I pointed my ifelse at that date, the function would return those four files. Code:
library(tidyverse)
library(rvest)
library(lubridate)  # parse_date_time() comes from lubridate
# URL set-up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href")
# Getting date elements
dates <- page %>%
  html_nodes("node.dates") %>%
  html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
                         locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  write <- writeBin(fun1$response$content)
  ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and also tried moving the writeBin around. (Usually writeBin would not be vectorized; I did it for easy viewing in my ifelse.) Error:
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] :
replacement has length zero
If I leave out the if code, everything works great, it just returns many text files, when I only want the ones from the specified date.
Based on the description, it seems like we need to check the corresponding 'dates' for each of the 'links' and then apply the if/else. If that is the case, then we can have two arguments in Map:
Map(function(ln, y) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  write <- writeBin(fun1$response$content)
  if (y == '2019-01-31') {
    write
  } else "He's dead, Jim"
},
links, dates)
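One caveat: writeBin() normally needs a destination file or connection as its second argument, so write <- writeBin(fun1$response$content) will not actually save anything. A sketch that writes only the matching files to disk, with basename(ln) as a hypothetical file name derived from each link:
Map(function(ln, y) {
  if (y == '2019-01-31') {
    fun1 <- html_session(URLencode(paste0("https://www.example-blog", ln)),
                         config(ssl_verifypeer = FALSE))
    # writeBin() takes the destination as its second argument
    writeBin(fun1$response$content, basename(ln))  # basename(ln) is a hypothetical file name
  } else "He's dead, Jim"
}, links, dates)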

Convert character to numeric without NA in R

I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character/Factor to Numeric without NA Coercion in R, etc.), but I cannot figure out what is going on in this one particular case (Warning message: NAs introduced by coercion). Here is some reproducible data I'm working with.
# Dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
# Scrape data from the website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="content"]/div[16]/table') %>%
  html_table(fill = TRUE)
ir.pres2014 <- ir.pres2014[[1]]
colnames(ir.pres2014) <- c("province", "Rouhani", "Velayati", "Jalili", "Ghalibaf", "Rezai", "Gharazi")
ir.pres2014 <- ir.pres2014[-1, ]
# Get rid of unnecessary rows
ir.pres2014 <- ir.pres2014 %>%
  subset(province != "Votes Per Candidate") %>%
  subset(province != "Total Votes")
# Get rid of commas
clean_numbers <- function(x) str_replace_all(x, '[, ]', '')
ir.pres2014 <- ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
# Remove any possible whitespace in strings
no_space <- function(x) gsub(" ", "", x)
ir.pres2014 <- ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but got all NAs each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
# First check the class of the vector
class(ir.pres2014$Rouhani)
# Convert character to numeric
ir.pres2014$Rouhani.num <- as.numeric(ir.pres2014$Rouhani)
The above returns a vector of all NAs. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All of those return NAs. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single-digit numbers, so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.
The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
Those bytes (c2 a0) are the UTF-8 encoding of a non-breaking space (U+00A0), which commonly comes from &nbsp; in scraped HTML; it needs to be replaced:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
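Note that mutate_each() and funs() are deprecated in current dplyr; an equivalent sketch using across(), with the same clean_numbers helper as above:
# Apply clean_numbers to every column except province
ir.pres2014 <- ir.pres2014 %>% mutate(across(-province, clean_numbers))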
