How to read a data frame in jsonlite - r

I am trying to read a data frame with jsonlite. I have a JSON file (generated from the HTML of a newspaper article and structured into a few fields: title, body, comments), and I managed to read it into R:
article1 <- fromJSON("/Users/eva/Desktop/bdk.blufaerdighed.html.json")
I now want to "see" the content of my file (I want to see the text belonging to the variables; title, body, comments etc.), but I don't know which function to use.
my json file looks like this:
{ "author":"Ulla Gunge", "body":"De har »opført sig blufærdighedskrænkende«. I en uge har to otteårige drenge derfor været bortvist fra deres skole i Horsens.", "comments":[], "subtitle":"To otteårige drenge er blevet bortvist fra deres skole for at have vist deres tissemænd til en pige. Men den slags er helt normalt i den alder, fortæller forsker i voksnes syn på børns seksualitet, doktorlege og pædofilifrygt.", "title":"Blufærdighed i børnehøjde" }

To see those fields you just print them
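A minimal sketch of what "just print them" looks like. To keep it self-contained it builds a stand-in list by hand instead of calling fromJSON, so the shortened values and the empty comments element are assumptions about what the parsed object looks like:

```r
# Stand-in for the list that fromJSON() returns for the file in the question
# (text values shortened; the real object comes from the asker's file)
article1 <- list(
  author   = "Ulla Gunge",
  body     = "De har ...",
  comments = list(),
  subtitle = "To drenge ...",
  title    = "Bluf..."
)

# Which fields are there?
field_names <- names(article1)
print(field_names)

# Print a single field, either with $ or with [[ ]]
print(article1$title)
print(article1[["author"]])

# Compact overview of every field at once
str(article1)
```

Typing the object name on its own (article1) at the console prints everything.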

Data frame structure is a list of vectors of equal length. The comment field in your json has length 0. You can see this for example by typing:
sapply(article1, length)
# author body comments subtitle title
# 1 1 0 1 1
If you can change the JSON from "comments":[] to "comments":["NA"], you can get the data frame directly from fromJSON by using its simplifyDataFrame argument.
Or you could change the list with article1$comments <- NA and get the data frame with:
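A minimal sketch of that second option, again on a hand-built stand-in for article1 (the shortened values are placeholders, not the real text). Once every element has length 1, as.data.frame gives a one-row data frame:

```r
# Stand-in for the parsed JSON; comments is empty, as in the question
article1 <- list(
  author   = "Ulla Gunge",
  body     = "...",
  comments = list(),
  subtitle = "...",
  title    = "..."
)

# Replace the length-0 element so that all elements have length 1
article1$comments <- NA

# Now the list can be turned into a one-row data frame
article_df <- as.data.frame(article1, stringsAsFactors = FALSE)
print(dim(article_df))
```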


Importing csv to R, extract polygon and decode it with googlePolylines packages

I have been trying to decode encoded Google polygons with R using the googlePolylines package. This works when I paste the polygon in as a text string, but I cannot get the same procedure to work when I import a csv file containing the polygons.
For example, I have the following code:
This works, but I have to change the encoded polygon manually each time. What I want to do is import a csv (datapolygons.csv) with the encoded polygon (column polygon) and then decode it. I have tried the following code, but it gives me an error message.
datapolygons.csv (polygon is written the same as the previous encoded polygon)
store_id | store_name | polygon
12345 | RANDOM | "j~thAxdvtMtXmL....."
polyfile=read.table('datapolygon.csv', header=TRUE, sep=',')
Error message:
Error in rcpp_decode_polyline(polylines, "coords") :
basic_string::at: __n (which is 363) >= this->size() (which is 363)
Any help would be greatly appreciated. Thank you!
Is it possible the string is corrupt in your .csv file? Perhaps when you made the data.frame to write out? The corruption seems to occur during conversion to factor().
I have no trouble writing it out and back in:
polyfile <- data.frame(store_address_id = 12345,
store_name = "RANDOM",
polygon = "j~thAxdvtMtXmL|LgCpKxCjQzD~T`GzXrAlMePnDsKpVoF`QgBnNoM~IcA`JiGnd#iYvZuLzKkSlH_d#wDef#iJwFyZfUmc#hCsa#N{\\|SeCzOaFgQwD{SaPiC|WwBpPr#bGsG{FeQxEkOtXa#dXgFtHyJoIiGui#NqaAvFuRyQxEia#he#u_#b\\og#dHuZyJaX_d#zK_Ua#eBwMq`A}AyRcTkKce#c]uWeS{DuRzZtCh]|b#zGuCbm#fUdTsLbS__#wBoTlu#zVbm#dc#l]oExn#gf#vUgPbAkL|HtCnXi#zKtRvc#CzPxBpLxAnEvFvDpPx^cFlBTh#??Dn##f#~G|QFH~\\gMj#Su#}#",
stringsAsFactors = FALSE)
polyfile2 <- read.table('~/test.csv', header=TRUE, sep=',',stringsAsFactors = FALSE)
lat lon
1 -12.07286 -76.95965
2 -12.07697 -76.95750
3 -12.07920 -76.95682
4 -12.08121 -76.95759
5 -12.08415 -76.95853
6 -12.08767 -76.95982
7 -12.09181 -76.96024
8 -12.09412 -76.95749
9 -12.09500 -76.95547
10 -12.09877 -76.95427
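A small base R sketch of the write-out/read-back check described above. It does not call googlePolylines. The temporary file and the short string (the first characters of the polygon, treated only as an opaque string) are assumptions for illustration. The point is that with stringsAsFactors = FALSE the column comes back as an unchanged character string:

```r
# First characters of the encoded polygon, used only as an opaque test string
encoded <- "j~thAxdvtMtXmL|LgCpKxCjQzD~T`GzX"

polyfile <- data.frame(store_id = 12345,
                       store_name = "RANDOM",
                       polygon = encoded,
                       stringsAsFactors = FALSE)

# Write to a temporary csv and read it back in
csv_path <- tempfile(fileext = ".csv")
write.csv(polyfile, csv_path, row.names = FALSE)
polyfile2 <- read.csv(csv_path, stringsAsFactors = FALSE)

# The column should be character (not factor) and byte-for-byte the same
print(class(polyfile2$polygon))
print(identical(polyfile2$polygon, encoded))
```

If the second check fails on your own file, the string was already altered before it was decoded.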

Trouble merging two dataframes in R (VLOOKUP)

I need help merging two data frames with R. I'm a little desperate, since I have tried everything I could. Any help would be appreciated.
The thing is that I'm doing some daily web scraping, and I need to compare today's results with yesterday's results in order to detect whether there have been any changes.
I only have two variables (title of the page and url) in two dataframes (one for today and one for yesterday), and I want to merge them in one.
The possible changes are:
Changes in the name.
Changes in the url.
New programs (new name and new url).
Deleted programs.
I've tried with merge, cast & melt, ifelse, etc. etc. and I can't solve the problem. For example:
yesterday <- read.csv2("Yesterday.csv")
today <- read.csv2("Today.csv")
new <- merge(x = today, y = yesterday, all = TRUE, sort = TRUE)
But without the desired result. I'm attaching three files:
Today.csv, with the results of today scraping
Yesterday.csv, with the results of yesterday scraping
Results.xlsx with the desired output. A VLOOKUP in Excel, highlighting the changes I want to detect (in this case name changes).
I would need a solution for all four kinds of change. The output could be different, I don't care about that, but I need the comparison to be correct. Even if you find that this question is a duplicate, I would need the link to the other one, because I haven't been able to find it.
Thanks in advance.
The answer has been updated in response to the comments below:
library(dplyr)
bind_rows(
  anti_join(today, yest) %>%
    mutate(label = ifelse(programa %in% yest$programa, 'changed', 'added')),
  anti_join(yest, select(today, programa)) %>% mutate(label = "deleted")
)
Which, while applying it to the whole data sets, returns following results:
# # A tibble: 6 x 3
# programa url label
# <chr> <chr> <chr>
# 1 Carrera de Derecho a distancia |~ added
# 2 "Carrera de Criminolog\xeda a di~ added
# 3 "Carrera Ingenier\xeda Inform\xe~ added
# 4 Grado en Derecho a distancia | U~ dele~
# 5 "Grado en Criminolog\xeda a dist~ dele~
# 6 "Grado Ingenier\xeda Inform\xe1t~ dele~
To check whether it can register changes in a program, we can do the following:
yest[22, 2] <- yest[23, 2]
Piping the changed data into the code above returns a table with an additional record, labelled as changed:
# # A tibble: 7 x 3
# programa url label
# <chr> <chr> <chr>
# 1 "M\xe1ster en Direcci\xf3n Hotel~ chan~
# 2 Carrera de Derecho a distancia |~ added
# 3 "Carrera de Criminolog\xeda a di~ added
# 4 "Carrera Ingenier\xeda Inform\xe~ added
# 5 Grado en Derecho a distancia | U~ dele~
# 6 "Grado en Criminolog\xeda a dist~ dele~
# 7 "Grado Ingenier\xeda Inform\xe1t~ dele~
Everything enclosed in bind_rows() is combined into a single tibble. Since we have two separate anti_join() statements here, and each of them returns its own tibble, we have to row-bind them into one.
anti_join() is a filtering join: given two tables A and B, it returns the rows of A that have no matching row in B. In other words, the result is the difference between A and B.
When we call anti_join(today, yest) we obtain the subset of today whose records are either not present in yest at all, or have a program name or url that changed compared to yest. We pipe those results into the mutate() call and assign the value changed to label if the value of programa is the same as yesterday (programa %in% yest$programa) while the url value changed. If programa %in% yest$programa is FALSE, the program name wasn't present in yest, so it is a new program, and we label it as added.
When we call anti_join() a second time, we are looking for the difference between the program names of yest and today. In other words: 'Which programs present in yest are not present in today?' We achieve this by looking for the subset of yest whose program names are not among the program names of today (that's why you need select(today, programa)). If any such records are detected, they are labeled as deleted.
Sorry if this explanation is somewhat clumsy, but I hope it will help you to navigate the code.
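For readers who want to step through the logic without dplyr, here is a rough base R sketch of the same three labels on toy data. The two small data frames and the key() helper are made up for illustration. This is not the dplyr code itself, just the same idea written with %in%:

```r
# Toy stand-ins for the two scrapes (hypothetical data)
yest <- data.frame(programa = c("A", "B", "C"),
                   url = c("u/a", "u/b", "u/c"),
                   stringsAsFactors = FALSE)
today <- data.frame(programa = c("A", "B", "D"),
                    url = c("u/a", "u/b2", "u/d"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: one key per row, built from both columns
key <- function(d) paste(d$programa, d$url, sep = "\r")

# Rows of today with no identical row in yest,
# the counterpart of anti_join(today, yest)
new_or_changed <- today[!(key(today) %in% key(yest)), ]
new_or_changed$label <- ifelse(new_or_changed$programa %in% yest$programa,
                               "changed", "added")

# Rows of yest whose program name no longer appears in today
deleted <- yest[!(yest$programa %in% today$programa), ]
deleted$label <- rep("deleted", nrow(deleted))

# Counterpart of bind_rows(): stack both pieces
changes <- rbind(new_or_changed, deleted)
print(changes)
```

Here B keeps its name but gets a new url (changed), D is new (added), and C has disappeared (deleted).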
tmp <- tempfile()
today <- read_delim(tmp, delim = ";")
yest <- read_delim(tmp, delim = ";")

CleanNLP package in R: metadata data frame?

Let's assume my dataframe looks like this:
bio_text <- c("Georg Aemilius, eigentlich Georg Oemler, andere Namensvariationen „Aemylius“ und „Emilius“ (* 25. Juni 1517 in Mansfeld; † 22. Mai 1569 in Stolberg (Harz))...", "Johannes Aepinus auch: Johann Hoeck, Huck, Hugk, Hoch oder Äpinus (* um 1499 in Ziesar; † 13. Mai 1553 in Hamburg) war ein deutscher evangelischer Theologe und Reformator.\nAepinus wurde als Sohn des Ratsherrn Hans Hoeck im brandenburgischen Ziesar 1499 geboren...")
doc_id <- c("1", "2")
url <- c("", "")
name <- c("Aemilius, Georg", "Aepinus, Johannes")
place_of_birth <- c("Mansfeld", "Ziesar")
full_wikidata <- data.frame(bio_text, doc_id, url, name, place_of_birth)
I want to carry out Named Entity Recognition with the cleanNLP package in R. Therefore, I initialize the tokenizers and the spaCy backend, everything works fine:
options(stringsAsFactors = FALSE)
wikidata <- full_wikidata[,c("doc_id", "bio_text")]
wikimeta <- full_wikidata[,c("url", "name", "place_of_birth")]
spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(wikidata, as_strings = TRUE, meta = wikimeta)
My only problem is the metadata. When I run it like this, I get the following warning message: In cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE, meta = wikimeta) : data frame input given along with meta; ignoring the latter. To be honest, I don't get the documentation concerning meta in cnlp_annotate: "an optional data frame to bind to the document table". This means that I should deliver a data frame containing the metadata, right?! Later on, I want to be able to do something like this, e.g. filter out all person entities in document no. 3:
cnlp_get_entity(spacy_annotatedWikidata) %>%
filter(doc_id == 3, entity_type == "PER")
Therefore, I have to find a way to access the metadata. Any help would be highly appreciated!
Fortunately, in the meantime I got some help and the advice to take a closer look at the method code of cnlp_annotate on GitHub:
It says that you can only pass in a metadata data frame if the input itself is not a data frame but a file path. So if you do want to pass in a data frame, the first column has to be doc_id, the second the text, and the remaining ones are automatically treated as metadata. So in my example only the column order in full_wikidata has to be changed:
full_wikidata <- data.frame(doc_id, bio_text, url, name, place_of_birth)
Like this, it can be used directly as input to cnlp_annotate:
spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE)
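If the data frame already exists and you don't want to rebuild it, the columns can also be reordered in place. A minimal base R sketch with shortened biography texts (the first_cols and reordered names are just for illustration, and cleanNLP itself is not called here):

```r
# Same columns as in the question, texts shortened for the example
full_wikidata <- data.frame(
  bio_text = c("Georg Aemilius ...", "Johannes Aepinus ..."),
  doc_id = c("1", "2"),
  url = c("", ""),
  name = c("Aemilius, Georg", "Aepinus, Johannes"),
  place_of_birth = c("Mansfeld", "Ziesar"),
  stringsAsFactors = FALSE
)

# Put the id and text columns first, keep all others (the metadata) after them
first_cols <- c("doc_id", "bio_text")
reordered <- full_wikidata[, c(first_cols,
                               setdiff(names(full_wikidata), first_cols))]
print(names(reordered))
```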

Convert R JSON Twitter data to list

After using searchTwitteR, I converted the result to a data frame and then exported it to JSON. However, all the text ends up on one line (sample below). I need to separate it so that each tweet is its own record.
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "phishdf.json")
json_phishdf <- fromJSON(file="phishdf.json")
I tried converting to a list and am wondering if maybe converting to a data frame is a mistake.
However, for a list, I tried:
But this will just give me the text for the first tweet. Is there a way to iterate over the entire data set, maybe in a for loop?
{"text":["#ilazer #abbijacobson I do feel compelled to say that I phind phish awphul... sorry, Abbi!","#phish This on-sale was an embarrassment. Something needs to change.","FS: Have 2 Tix To Phish In Chula Vista #Phish #facevaluetickets #phish #facevalue GO:","RT #WKUPhiDelt: Come unwind from a busy week of class and kick off the weekend with a Phish Fry! 4:30-7:30 at the Phi Delt house. Cost is $\u2026","RT #phish: Tickets for Phish's July 15 & 16 shows at The Gorge go on sale in fifteen minutes at 1PM ET: https://t.c\u2026"],
"replyToUID":["26152867","14503997","NA","NA","NA"],"statusSource":["Mobile Web (M5)","Twitter for iPhone","CashorTrade - Face Value Tickets","Twitter for iPhone","Twitter for Android"],
I followed your code and don't have the issue you're describing. Are you using library(twitteR) and library(jsonlite)?
Here is the code that works for me:
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "./../phishdf.json")
## note the `txt` argument, as opposed to `file` used in the question
json_phishdf <- fromJSON(txt="./../phishdf.json")
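If what you get back is the column-wise structure shown in the question (one vector per field), no for loop is needed to get one record per tweet. A minimal base R sketch, using a hand-written stand-in for json_phishdf with made-up tweet texts instead of the real file:

```r
# Stand-in for the parsed JSON: one vector per field, one entry per tweet
json_phishdf <- list(
  text = c("tweet one", "tweet two", "tweet three"),
  replyToUID = c("26152867", "14503997", "NA"),
  statusSource = c("Mobile Web (M5)", "Twitter for iPhone", "Twitter for Android")
)

# One row per tweet
tweets_df <- as.data.frame(json_phishdf, stringsAsFactors = FALSE)

# One list element per tweet, each holding that tweet's fields
tweets <- lapply(seq_len(nrow(tweets_df)),
                 function(i) as.list(tweets_df[i, ]))

print(tweets[[2]]$text)
```

tweets_df$text gives all tweet texts at once, while tweets[[i]] gives everything about tweet i.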

Reading sdmx-xml files into a dataframe in R

I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is (1mb).
I saved the file as “pensions_funds.xml” to the pwd and tried to use the XML package to read it:
fileName <- system.file("pensions", "pensions_funds.xml", package="XML")
doc <- xmlTreeParse(fileName)
r <- xmlRoot(doc)
tmp = xmlSApply(r, function(x) xmlSApply(x, xmlValue))
The few lines above basically follow the example here
but I think I would first need to somehow ignore the header (I have pasted below the first couple of pages of the file I’m trying to read). So I think the above might work but it starts from the wrong node for my purposes. I would like to grab the obs_values, indexed by their time_period and ref_area.
The first thing would be to find the right node and start there however I suspect I might be on a fool’s errand since I have limited knowledge of data formats and I’m not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this
I can’t find this package for download on its homepage here
(I can’t see any link/download section but maybe I’m blind) and it seems to be early stages. The existence of the rsdmx suggests using the xml package to read sdmx might not be easy so I’m ready to give up at this stage unless anyone has had success with this. Actually I’m mainly interested in reading this file
But this is a 10mb file so I was starting smaller.
attempting sgibb's answer on large file using changes in Mischa's comment
url <- ""
sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp values to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA
  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }
  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()
Specification mandate value for attribute OBS_VALUE
attributes construct error
Couldn't find end of Start Tag Obs line 15108
Premature end of data in tag Series line 15041
Premature end of data in tag DataSet line 91
Premature end of data in tag CompactData line 2
Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
the answer from sgibb looks ideal and works perfectly on the smaller file. I tried to run it on
url <-
(the 10mb file, original link corrected), with the only modification being the addition of two lines:
data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])
data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])
(these are additional id variables which are needed to identify a row in this larger dataset).
It ran for a few minutes then finished with this error:
Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The basic format of the data seems very similar so I thought this might work. The basic format of the 10mb file is as below:
<Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
<Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>
desired data format:
Ref_area time_period obs_value
At 2006 118
At 2007 119
Be 2006 101
Here’s the first bit of the data.
DataSet xsi:schemaLocation="" xmlns="">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
RSDMX seems to be in an early state of development. IMHO there is no package available yet. But you could easily implement it on your own using the XML package. I would suggest using xmlEventParse (see ?xmlEventParse for details):
EDIT: adapt example to changed requirements of outstanding_amounts.xml
EDIT2: add download.file
#url <- ""
url <- ""
## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp)
library(XML)
sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA
  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
    data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
    ## update current row
    i <<- i + 1
  }
  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- attr["BS_ITEM"]
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}
## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()
# refArea bsItem bsCountSector timePeriod obsValue
#1 DE A20 2210 12053 39.6
#2 DE A20 2210 12084 46.1
#3 DE A20 2210 12112 50.2
#4 DE A20 2210 12143 52.0
#5 DE A20 2210 12173 52.3
#6 DE A20 2210 12204 47.3
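The handler above works because Obs and Series share state through the enclosing function and update it with <<-. A stripped-down base R sketch of just that pattern follows. It uses no XML package; the events are called by hand, standing in for what xmlEventParse would deliver. makeHandler and the toy attribute values are made up for illustration:

```r
# Closure that accumulates rows; same pattern as sdmxHandler(), minus the parsing
makeHandler <- function() {
  rows <- list()     # collected observations
  refArea <- NA      # REF_AREA of the Series currently being read

  ## called for every Series tag: remember its REF_AREA
  Series <- function(name, attr) {
    refArea <<- attr[["REF_AREA"]]
  }
  ## called for every Obs tag: store one row, tagged with the current REF_AREA
  Obs <- function(name, attr) {
    rows[[length(rows) + 1]] <<- data.frame(
      refArea = refArea,
      timePeriod = attr[["TIME_PERIOD"]],
      obsValue = as.numeric(attr[["OBS_VALUE"]]),
      stringsAsFactors = FALSE)
  }
  list(Series = Series, Obs = Obs,
       getData = function() do.call(rbind, rows))
}

h <- makeHandler()
## hand-written events in document order
h$Series("Series", c(REF_AREA = "AT"))
h$Obs("Obs", c(TIME_PERIOD = "2008", OBS_VALUE = "112"))
h$Obs("Obs", c(TIME_PERIOD = "2009", OBS_VALUE = "119"))
h$Series("Series", c(REF_AREA = "BE"))
h$Obs("Obs", c(TIME_PERIOD = "2008", OBS_VALUE = "101"))
result <- h$getData()
print(result)
```

Collecting rows in a list and binding them once at the end is a design choice of this sketch. It avoids growing a data.frame row by row, which can get slow on large files.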
The rsdmx package lets you read SDMX-ML files and coerce them to a data.frame. It is hosted on GitHub and is also available on CRAN; if you prefer, you can easily install it from GitHub with the following:
devtools::install_github("opensdmx/rsdmx")
Applying to your data, you can do the following:
sdmx <- readSDMX("")
df <- as.data.frame(sdmx)
More examples are given in the rsdmx wiki
Note that it currently loads the whole XML document into R, stored in a slot of the SDMX objects that rsdmx creates. In the future, we would like to investigate how rsdmx could use xmlEventParse (as suggested above by @sgibb) to read very large datasets.
xmlparsed <- xmlParse(file(url))
## obtain dataset node::
series_data <- getNodeSet(xmlparsed, "//Series")
datasetnode <- xmlChildren( xmlChildren(xmlparsed)[[1]])[[2]]
series_data<-xmlChildren(datasetnode)[ names(xmlChildren(datasetnode))=="Series"]
## prepare dataset
dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')
## row counter for filling dataset.frame (used as `counter` in the loop)
counter <- 1
## loop over data
for (i in 1: length(series_data)){
if('Obs'%in%names(xmlChildren(series_data[[i]])) ){ ## To ignore empty //Series nodes
for (j in 1: length(xmlChildren(series_data[[i]]))){
dataset.frame[counter,1] <- xmlAttrs(series_data[[i]])['REF_AREA']
dataset.frame[counter,2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
dataset.frame[counter,3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
