Errors with data frames from JSON and XML - R

I need to build a data frame from JSON or XML files (the data is available in both formats here), yet I get errors when I try to create those data frames in R.
With the JSON file, the error is:
Error in parse_con(txt, bigint_as_char) : lexical error: invalid bytes in UTF8 string.
stion":"0","name_question":"Óðî÷èñòå çàñ³äàííÿ Âåðõîâíî¿ Ðàä
(right here) ------^
With the XML file, the error is:
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(date_agenda = "27112014", : duplicate subscripts for columns
The commands I use are
library(jsonlite)
library(XML)
k <- fromJSON("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json", encoding = "UTF-8")
m <- xmlToDataFrame("agendas_8_skl.xml")
Prior to executing the commands, I download the files to the working directory.
I do not understand how I can get the data. Please help!

This answer is based on @user2554330's answer here.
library(jsonlite)
library(RCurl)
#In case your locale is different from Ukrainian
Sys.setlocale("LC_CTYPE", "ukrainian")
k <- fromJSON(getURL("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json",
                     .encoding = "ISO-8859-5"))
#convert k into a data frame using tidyr::unnest
library(dplyr)
library(tidyr)
df <- tibble(date_agenda = k[[1]]$date_agenda, question = k[[1]]$question) %>%
  unnest(question) %>%
  unnest(reporter_question, keep_empty = TRUE)
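If the locale/RCurl route does not work on your system, an alternative sketch (not from the original answer) is to download the file and re-encode it explicitly before parsing. The CP1251 encoding below is an assumption about the source file; adjust it if the text still comes out garbled.
library(jsonlite)
#download the raw bytes and convert them to UTF-8 before parsing
#"CP1251" is an assumed source encoding; try "ISO-8859-5" if the text looks wrong
tmp <- tempfile(fileext = ".json")
download.file("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.json",
              tmp, mode = "wb")
raw_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
json_txt <- iconv(rawToChar(raw_bytes), from = "CP1251", to = "UTF-8")
k <- fromJSON(json_txt)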

Here is a solution that works with the XML data.
See the code comments for details:
library(xml2)
library(dplyr)
#read page
page <- read_xml("https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/agendas_8_skl.xml")
#obtain a list of parent nodes
agendas <- xml_find_all(page, "agenda")
output <- lapply(agendas, function(agenda) {
  #get date
  date <- agenda %>% xml_find_first(".//date_agenda") %>% xml_text() %>% as.Date(format = "%d%m%Y")
  #pull question id from attribute
  question_id <- agenda %>% xml_find_all(".//question") %>% xml_attr("id_question")
  #obtain the information from all of the nodes (assumes equal number of each)
  number_questions <- agenda %>% xml_find_all(".//number_question") %>% xml_text()
  init_questions <- agenda %>% xml_find_all(".//init_question") %>% xml_text()
  name_questions <- agenda %>% xml_find_all(".//name_question") %>% xml_text()
  #create a data frame of answer (long format)
  data.frame(date, question_id, number_questions, init_questions, name_questions, stringsAsFactors = FALSE)
})
#bind into 1 large long formatted data frame
finalanswer <- bind_rows(output)
head(finalanswer)
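A slightly more defensive variant (just a sketch, not part of the original answer) iterates over each question node individually, so a missing child element (for example, no init_question) becomes NA instead of shifting rows:
library(purrr)
output2 <- map_dfr(agendas, function(agenda) {
  #agenda-level date, reused for every question row
  date <- agenda %>% xml_find_first(".//date_agenda") %>% xml_text() %>% as.Date(format = "%d%m%Y")
  map_dfr(xml_find_all(agenda, ".//question"), function(q) {
    tibble::tibble(
      date,
      question_id     = xml_attr(q, "id_question"),
      number_question = q %>% xml_find_first(".//number_question") %>% xml_text(),
      init_question   = q %>% xml_find_first(".//init_question") %>% xml_text(),
      name_question   = q %>% xml_find_first(".//name_question") %>% xml_text()
    )
  })
})
head(output2)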

Problems extracting data using JSON in R (getting a lexical error)

Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
  return(match(T, map(states, ~ {
    .x$name == state
  })))
}
s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook
hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>% as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before but now it is throwing an error with this part of the code:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text. NA (right here) ------^
Any proposed solutions? The website seems to have stayed the same over the year, so what kind of change is causing the code to break?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This was working for a while, but then it seems their website changed the underlying HTML code again.
Change the regex pattern as shown below to ensure it correctly captures the desired string within the response text, i.e. the JavaScript object to use for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in R strings the backslash escape is doubled, e.g. \\s rather than the plain regex \s.

How can I make R import each line of a .txt file as a character string?

I have a complex .txt file, of which I'll add a screenshot below. I need each line as its own character string so I can group the lines by the 5-letter code near the beginning of each line (for example, group together all GPGGA lines; see screenshot) and then process them as needed. Here's what I've run so far:
df <- data.frame(Weather_data)
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
This received the error:
"Error: Problem with mutate() column Entry. i Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text). x cannot coerce type 'closure' to vector of type 'character'"
I had success filtering as I needed when I copied and pasted the first few lines in and manually made them character strings to see if I could get the code to function, so the next step is making each line a character string NOT manually (there are over 3000 lines). Does anyone have a function to do this?
Here are some of the lines produced when I load the imported txt file:
HEADER
<chr>
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
13:30:00.827: <- $HCHDG,74.8,0.0,E,3.6,W*6E
13:30:01.003: <- $WIMDA,29.9641,I,1.0147,B,26.5,C,,,48.2,,14.6,C,323.0,T,326.6,M,1.4,N,0.7,M*66
13:30:01.051: <- $WIMWV,248.4,R,1.1,N,A*29
13:30:01.114: <- $WIMWV,255.6,T,1.3,N,A*23
13:30:01.195: <- $YXXDR,A,-53.9,D,PTCH,A,-34.2,D,ROLL*57
13:30:01.307: <- $YXXDR,A,0.571,G,XACC,A,0.783,G,YACC,A,-0.181,G,ZACC*57
13:30:01.578: <- $GPGGA,183001.30,4415.6242,N,08823.9769,W,1,7,1.7,225.9,M,-33.4,M,,*64
You referenced the variable text, which does not exist in your data.frame; your column is named HEADER.
df %>%
  mutate("Entry" = gsub(".*\\$([A-Z]+),.*", "\\1", HEADER)) %>%
  group_by(Entry) %>%
  filter(Entry == "GPGGA")
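If the lines still need to be imported as character strings in the first place, readLines() returns one character string per line, which can then be wrapped into a data frame like the one printed above. A minimal sketch (weather_data.txt is a placeholder for your actual file name):
#read every line of the raw file as its own character string
lines <- readLines("weather_data.txt")
df <- data.frame(HEADER = lines, stringsAsFactors = FALSE)
#the mutate/filter pipeline above can now extract the 5-letter codes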

Error in stream_delim when exporting to CSV

I'm trying to write this StatsBomb Data into a CSV but I keep on getting the following error message:
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) :
Don't know how to handle vector of type list.
I'm lost (I've tried multiple things) and not sure what I did wrong here. Does anyone out there know how to solve this? I've included my code below.
library(StatsBombR)
library(tidyverse)
### Read in all free events and matches from the FAWSL
data <- StatsBombFreeEvents()
matches <- FreeMatches(Competitions = 72)
### Clean and separate all data loaded above
dataclean <- allclean(data)
### Filter event data to include only FAWSL data.
data1 <- dataclean %>%
  filter(dataclean$competition_id == 72)
### Join event and match data by "match_id"
data1 <- left_join(data1, matches, by = "match_id")
FullData <- data1 %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame, location, pass.end_location, shot.end_location, goalkeeper.end_location))
setwd()
write_csv(FullData, "StatsBomb_FullData.csv")
I had the same problem. Unlisting the column fixed mine.
df$listcolumn <- sapply(df$listcolumn, function(x) paste0(unlist(x), collapse = "\n"))
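If the data frame has several list columns (as the StatsBomb frame above does), the same idea can be applied to every list column at once before writing the CSV. A sketch, assuming dplyr >= 1.0 for across():
library(dplyr)
library(readr)
#collapse every list column into a single string so write_csv can handle it
FullData_flat <- FullData %>%
  mutate(across(where(is.list), ~ sapply(.x, function(x) paste0(unlist(x), collapse = "\n"))))
write_csv(FullData_flat, "StatsBomb_FullData.csv")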

Web Scraping in R using rvest and finding the html_note

I am trying to find the right html_node to fetch the replies count for each post in this forum: https://d.cosx.org/. I used a CSS selector and it said .DiscussionListItem-count, but it doesn't seem to work.
My code:
library(rvest)
library(tidyverse)
COS_link <- read_html("https://d.cosx.org/")
COS_link %>%
  # The relevant tag
  html_nodes(css = '.DiscussionListItem-count') %>%
  html_text()
I would like to fetch the replies count, for example 1k for the 1st post and 30 for the 2nd post. Am I missing something, or does anyone have a better idea?
You can use the API and parse the JSON response for the title and participantCount attributes.
The API endpoint returning that info is:
https://d.cosx.org/api
Substring the response to remove the trailing 0 and the leading ac76, then parse with a JSON library of your choice.
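A rough sketch of that API route (the wrapper characters mentioned above are taken on trust; inspect the parsed object before relying on any particular field path):
library(jsonlite)
library(stringr)
#fetch the raw API response and collapse it into a single string
resp <- paste(readLines("https://d.cosx.org/api", warn = FALSE), collapse = "")
#keep only the outermost JSON object, dropping any leading/trailing wrapper text
json_txt <- str_extract(resp, "\\{.*\\}")
api_json <- fromJSON(json_txt)
#inspect the structure to locate the title / participantCount fields
str(api_json, max.level = 2)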
A less optimal approach is to regex out the JSON string from the original URL:
library(rvest)
library(jsonlite)
library(stringr)
url <- "https://d.cosx.org/"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][,2])
counts <- json$resources$attributes$participantCount
For those wishing to pair up the titles with the counts, and who don't have Chinese locale settings, a colleague helped me write the following:
library(rvest)
library(jsonlite)
library(stringr)
library(corpus)
url <- "https://d.cosx.org/"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][,2])
titles <- json$resources$attributes$title
counts <- json$resources$attributes$participantCount
cf <- corpus_frame(name = titles, text = counts)
names(cf) <- c("titles", "counts")
print(cf[which(!is.na(cf$counts)),], 100)

Numbers of columns of arguments do not match

I am using this example to conduct sentiment analysis of a collection of txt documents in R. The code is:
library(tm)
library(tidyverse)
library(tidytext)
library(glue)
library(stringr)
library(dplyr)
library(wordcloud)
require(reshape2)
files <- list.files(inputdir,pattern="*.txt")
GetNrcSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("nrc")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
    mutate(sentiment = positive - negative) %>% # positive - negative
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  return(sentiment)
}
The .txt files are stored in inputdir and have names like AB-City.0000, where AB is a country abbreviation, City is a city name, and 0000 is a year (ranging from 2000 to 2017).
The function works for a single file as expected, i.e. GetNrcSentiment(files[1]) gives me a tibble with proper counts per sentiment. However, when I try to run it for the whole set, i.e.
nrc_sentiments <- data_frame()
for(i in files){
  nrc_sentiments <- rbind(nrc_sentiments, GetNrcSentiment(i))
}
I get the following error message:
Joining, by = "word"
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The exact same code works well with longer documents but gives an error when dealing with shorter texts. It seems that not all sentiments are found in small documents, and as a result the number of columns varies from document to document, which might lead to this error, but I am not sure. I would appreciate any advice on how to fix the problem. If a sentiment is not found, I would want the entry to be equal to zero (if that is the cause of my problem).
As an aside, the bing sentiment function runs through about two dozen files and then gives a different error, which seems to point to the same problem (negative sentiment not found?):
GetBingSentiment <- function(file){
  fileName <- glue(inputdir, file, sep = "")
  fileName <- trimws(fileName)
  fileText <- glue(read_file(fileName))
  fileText <- gsub("\\$", "", fileText)
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  # get the sentiment from the first text:
  sentiment <- tokens %>%
    inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
    count(sentiment) %>% # count the # of positive & negative words
    spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
    mutate(sentiment = positive - negative) %>%
    mutate(file = file) %>% # add the name of our file
    mutate(year = as.numeric(str_match(file, "\\d{4}"))) %>% # add the year
    mutate(city = str_match(file, "(.*?).2")[2])
  # return our sentiment dataframe
  return(sentiment)
}
Error in mutate_impl(.data, dots) :
Evaluation error: object 'negative' not found.
EDIT: Following the recommendation by David Klotz, I edited the code to:
for(i in files){
  nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
}
As a result, instead of throwing an error, the NRC function generates NA if words from a certain sentiment are not found; however, after 22 joins I get a different error:
Error in mutate_impl(.data, dots) : Evaluation error: object 'negative' not found.
The same error shows up when I run the bing function with dplyr. By the time the functions reach the 22nd document, both data frames contain columns for all sentiments. What may be causing the error, and how can I diagnose it?
dplyr's bind_rows function is more flexible than rbind, at least when it comes to missing columns:
nrc_sentiments <- dplyr::bind_rows(nrc_sentiments, GetNrcSentiment(i))
The input might be missing the "negative" column that is used in the mutate expression.
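One way to make the function robust to that (GetBingSentimentSafe is a hypothetical rewrite, not from the original answer) is to fill any sentiment column that did not occur in a document with zero before computing the score, which matches the asker's stated preference:
GetBingSentimentSafe <- function(file){
  fileName <- trimws(glue(inputdir, file, sep = ""))
  fileText <- gsub("\\$", "", glue(read_file(fileName)))
  tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
  counts <- tokens %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(sentiment) %>%
    spread(sentiment, n, fill = 0)
  # fill in any sentiment that did not occur in this document with zero,
  # so the mutate() below never sees a missing column
  for (col in c("positive", "negative")) {
    if (!col %in% names(counts)) counts[[col]] <- 0
  }
  counts %>%
    mutate(sentiment = positive - negative,
           file = file,
           year = as.numeric(str_match(file, "\\d{4}")),
           city = str_match(file, "(.*?).2")[2])
}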
