Read JSON file with nested lists in R

I have a large JSON dataset and I would like to convert it to a data frame in R.
(Sorry if this is a duplicate question, but other answers didn't help me.)
My JSON file is as follows:
[{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.899-1", "orig_lang": "en", "new": ["- having regard to its resolution of 13", "December 20071 on Justice for the", "'Comfort Women' (sex slaves in Asia", "before and during World War II) as well", "as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993", "and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the", "Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for", "wartime victims, including victims of the", "'comfort women' system,", "_______________________", "1", "OJ C 323E, 18.12.2008, p.531"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 6 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.863v01-00", "reference": "2014/2016(INI)", "date": "2014-02-27T00:00:00", "committee": ["AFET"], "seq": 1, "id": "PE529.863-1", "orig_lang": "en", "new": ["- having regard to the Statement by the", "Vice-President of the Commission/ High", "Representative of the Union for Foreign", "affairs and Security Policy (VP/HR)", "Catherine Ashton of 20 March 2013 on", "the Magnitsky case in the Russian", "Federation,"], "authors": "Jacek Protasiewicz", "meps": [23782], "location": [["Motion for a resolution", "Citation 4 a (new)"]], "meta": {"created": "2019-07-03T05:06:17"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.713v01-00", "reference": "2013/2149(INI)", "date": "2014-02-12T00:00:00", "committee": ["AFET"], "seq": 238, "id": "PE529.713-238", "orig_lang": "en", "old": ["A. whereas the European Neighbourhood", "Policy (ENP), in particular the Eastern", "Partnership (EaP), aims to extend the", "values and ideas of the founders of the EU;"], "new": ["A. whereas the European Neighbourhood", "Policy (ENP) embraces the values and", "ideas of the founders of the EU, notably", "the principles of Peace, Solidarity and", "Prosperity;"], "authors": "Mário David", "meps": [96973], "location": [["Motion for a resolution", "Recital A"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
,{"src": "http://www.europarl.eu", "peid": "PE529.899v01-00", "reference": "2014/2021(INI)", "date": "2014-03-05T00:00:00", "committee": ["AFET"], "seq": 2, "id": "PE529.899-2", "orig_lang": "en", "new": ["- having regard to the catastrophic", "earthquake and subsequent tsunami", "which devastated important parts of", "Japan's coast on 11 March 2011 and led", "to the destruction of the Fukushima", "nuclear power plant, causing possibly the", "greatest radiation disaster in human", "history,"], "authors": "Reinhard Bütikofer on behalf of the Verts/ALE Group", "meps": [96739], "location": [["Motion for a resolution", "Citation 11 a (new)"]], "meta": {"created": "2019-07-03T05:06:18"}, "changes": {}}
I would like to have a dataframe as follows:
src peid reference date committee seq id orig_lang new ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 1 PE529.899-1 en ["- having ... p.531"] ...
http://www.europarl.eu PE529.863v01-00 2014/2016(INI) 2014-02-27T00:00:00 AFET 1 PE529.863-1 en ["- having ..."Federation,"] ...
http://www.europarl.eu PE529.713v01-00 2013/2149(INI) 2014-02-12T00:00:00 AFET 238 PE529.713-238 en ["A. whereas ... Prosperity;"] ...
http://www.europarl.eu PE529.899v01-00 2014/2021(INI) 2014-03-05T00:00:00 AFET 2 PE529.899-2 en ["- having ... history,"] ...
(I didn't write the complete table above)
I have already tried the following code:
library(rjson)
library(jsonlite)
Data <- fromJSON(file="data.json")
but each row is shown as below:
[[1]]
[[1]]$src
[1] "http://www.europarl.eu/sides/getDoc.do?pubRef=-//EP//NONSGML+COMPARL+PE-529.899+01+DOC+PDF+V0//EN&language=EN"
[[1]]$peid
[1] "PE529.899v01-00"
[[1]]$reference
[1] "2014/2021(INI)"
[[1]]$date
[1] "2014-03-05T00:00:00"
[[1]]$committee
[1] "AFET"
[[1]]$seq
[1] 1
[[1]]$id
[1] "PE529.899-1"
[[1]]$orig_lang
[1] "en"
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
[[1]]$authors
[1] "Reinhard Bütikofer on behalf of the Verts/ALE Group"
[[1]]$meps
[1] 96739
[[1]]$location
[[1]]$location[[1]]
[1] "Motion for a resolution" "Citation 6 a (new)"
[[1]]$meta
[[1]]$meta$created
[1] "2019-07-03T05:06:17"
[[1]]$changes
list()
dput version is below:
list(list(src = "http://www.europarl.eu",
peid = "PE529.899v01-00", reference = "2014/2021(INI)", date = "2014-03-05T00:00:00",
committee = "AFET", seq = 1, id = "PE529.899-1", orig_lang = "en",
new = c("- having regard to its resolution of 13", "December 20071 on Justice for the",
"'Comfort Women' (sex slaves in Asia", "before and during World War II) as well",
"as the statements by Japanese Chief", "Cabinet Secretary Yohei Kono in 1993",
"and by the then Prime Minister Tomiichi", "Murayama in 1995, the resolutions of the",
"Japanese parliament (the Diet) of 1995", "and 2005 expressing apologies for",
"wartime victims, including victims of the", "'comfort women' system,",
"_______________________", "1", "OJ C 323E, 18.12.2008, p.531"
), authors = "Reinhard Bütikofer on behalf of the Verts/ALE Group",
meps = 96739, location = list(c("Motion for a resolution",
"Citation 6 a (new)")), meta = list(created = "2019-07-03T05:06:17"),
changes = list()))
One of the problems I have is in column 9: as you can see below, I want to put all 15 components in one cell of the data frame.
[[1]]$new
[1] "- having regard to its resolution of 13" "December 20071 on Justice for the"
[3] "'Comfort Women' (sex slaves in Asia" "before and during World War II) as well"
[5] "as the statements by Japanese Chief" "Cabinet Secretary Yohei Kono in 1993"
[7] "and by the then Prime Minister Tomiichi" "Murayama in 1995, the resolutions of the"
[9] "Japanese parliament (the Diet) of 1995" "and 2005 expressing apologies for"
[11] "wartime victims, including victims of the" "'comfort women' system,"
[13] "_______________________" "1"
[15] "OJ C 323E, 18.12.2008, p.531"
How can I get the table I mentioned above?

We can either convert the nested list elements with length greater than 1 to a single string by pasting them together (str_c) and then bind the named lists into rows with map_dfr:
library(purrr)
library(dplyr)
library(stringr)
map_dfr(Data, ~ map(.x, unlist) %>%
  map_dfr(~ if (length(.x) > 1) str_c(.x, collapse = ";") else .x))
Or use the recursive function rrapply() to bind the elements having length greater than 1 as list columns:
library(rrapply)
map_dfr(Data, ~ rrapply(.x, how = "bind"))
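If you prefer to stay in base R, the same idea can be sketched without purrr: collapse any element of length greater than 1 into one string, then row-bind the records. `recs` below is a hypothetical stand-in for `Data`, using two trimmed-down records:

```r
# A minimal base-R sketch of the same idea: collapse any element with
# length > 1 into one string, then bind the records into a data frame.
# `recs` is a made-up stand-in for `Data` (two trimmed-down records).
recs <- list(
  list(id = "PE529.899-1", committee = "AFET",
       new = c("- having regard to", "its resolution")),
  list(id = "PE529.863-1", committee = "AFET",
       new = c("- having regard to", "the Statement"))
)

collapse_rec <- function(rec) {
  # flatten nested pieces, then join multi-element fields with ";"
  flat <- lapply(rec, unlist)
  as.data.frame(lapply(flat, paste, collapse = ";"),
                stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(recs, collapse_rec))
df$new
#> [1] "- having regard to;its resolution" "- having regard to;the Statement"
```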

Related

How to count how many strings are in a column

I have a dataset with a column "amenities" and I want to count how many amenities are in each row.
> airbnbT$amenities[1]
[1] ["Essentials", "Refrigerator", "Shampoo", "TV", "Dedicated workspace", "Hangers", "Iron", "Long term stays allowed", "Dishes and silverware", "First aid kit", "Free parking on premises", "Hair dryer", "Patio or balcony", "Washer", "Dryer", "Cooking basics", "Coffee maker", "Private entrance", "Hot water", "Fire extinguisher", "Wifi", "Air conditioning", "Hot tub", "Kitchen", "Microwave", "Oven", "Smoke alarm"]
14673 Levels: ["Air conditioning", "Baby bath", "Long term stays allowed", "Baby monitor"] ...
> class(airbnbT$amenities[1])
[1] "factor"
Here, for row 1, there are 27 amenities.
Is there a way to count the commas "," in each row? That would give the number of amenities.
Try str_count from the stringr package. You will need to add 1 since there will be one fewer comma than the number of amenities:
library(stringr)
airbnbT$amenities_count = str_count(airbnbT$amenities,",") + 1
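If stringr is not available, the same count can be sketched in base R (the column is a factor, so coerce it to character first). `x` below is a made-up example vector standing in for `airbnbT$amenities`:

```r
# Base-R sketch of the same comma count; the column is a factor,
# so coerce it to character first. `x` is a made-up example.
x <- factor(c('["Wifi", "Kitchen", "Washer"]', '["Wifi"]'))
s <- as.character(x)
# count commas by measuring how much shorter the string gets
# when the commas are removed, then add 1
counts <- nchar(s) - nchar(gsub(",", "", s)) + 1
counts
#> [1] 3 1
```

Note that, like the str_count version, this over-counts by one for an empty list `[]`.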

How to extract ngrams from a text in R (newspaper articles)

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE)
I am trying to extract bigrams (e.g. "climate change", "global warming") but keep getting an error message saying the ngrams argument is not used when I type the following:
dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE, ngrams = 2)
I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries.
Below is a screenshot of my corpus.
Doc_iD is the article titles. I need the bigrams to be extracted from the "texts" column.
Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?
Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But it sounds like you would rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm: the dfm will have already split your tokens into bag-of-words features, from which ngrams or MWEs can no longer be formed.
Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.
library("quanteda")
## Package version: 2.0.1
toks <- tokens(data_corpus_inaugural) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 united states 157 0 2 7.893348 41.19480
## 2 let us 97 0 2 6.291169 36.15544
## 3 fellow citizens 78 0 2 7.963377 32.93830
## 4 american people 40 0 2 4.426593 23.45074
## 5 years ago 26 0 2 7.896667 23.26947
## 6 federal government 32 0 2 5.312744 21.80345
These are by default scored and sorted in order of descending score.
To "extract" them, just take the collocation column:
head(colls$collocation, 50)
## [1] "united states" "let us" "fellow citizens"
## [4] "american people" "years ago" "federal government"
## [7] "almighty god" "general government" "fellow americans"
## [10] "go forward" "every citizen" "chief justice"
## [13] "four years" "god bless" "one another"
## [16] "state governments" "political parties" "foreign nations"
## [19] "solemn oath" "public debt" "religious liberty"
## [22] "public money" "domestic concerns" "national life"
## [25] "future generations" "two centuries" "social order"
## [28] "passed away" "good faith" "move forward"
## [31] "earnest desire" "naval force" "executive department"
## [34] "best interests" "human dignity" "public expenditures"
## [37] "public officers" "domestic institutions" "tariff bill"
## [40] "first time" "race feeling" "western hemisphere"
## [43] "upon us" "civil service" "nuclear weapons"
## [46] "foreign affairs" "executive branch" "may well"
## [49] "state authorities" "highest degree"
I think you need to create the ngrams directly from the corpus. This is an example adapted from the quanteda tutorial website:
library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)
tokens_ngrams(toks, n = 2)
Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens_of" "of_the" "the_Senate" "Senate_and" "and_of" "of_the" "the_House"
[8] "House_of" "of_Representatives" "Representatives_:" ":_Among" "Among_the"
[ ... and 1,524 more ]
EDITED: this example from the dfm help page may be useful:
library(quanteda)
# You say you're already creating the corpus?
# Where it says "data_corpus_inaugural", put your corpus name.
# Where it says "the_senate", put "climate_change";
# where it says "the_house", put "global_warming".
tokens(data_corpus_inaugural) %>%
  tokens_ngrams(n = 2) %>%
  dfm(stem = TRUE, select = c("the_senate", "the_house"))
#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#> features
#> docs the_senat the_hous
#> 1789-Washington 1 2
#> 1793-Washington 0 0
#> 1797-Adams 0 0
#> 1801-Jefferson 0 0
#> 1805-Jefferson 0 0
#> 1809-Madison 0 0
#> [ reached max_ndoc ... 52 more documents ]
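Under the hood, bigram formation is just pasting each token to its successor with an underscore; a base-R sketch of what tokens_ngrams(n = 2) produces for a single document (using a made-up token vector):

```r
# Base-R sketch of what tokens_ngrams(n = 2) does for one document:
# paste each token to its successor with "_".
toks <- c("climate", "change", "is", "global", "warming")
bigrams <- paste(head(toks, -1), tail(toks, -1), sep = "_")
bigrams
#> [1] "climate_change" "change_is" "is_global" "global_warming"
```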

Split a number in a character string onto its own new line

I have a character vector with 84 elements.
> head(output.by.line)
[1] "\n17"
[2] "Now when Joseph saw that his father"
[3] "laid his right hand on the head of"
[4] "Ephraim, it displeased him; so he took"
[5] "hold of his father's hand to remove it"
[6] "from Ephraim's head to Manasseh's"
But there is a line with a number (49) that is not on its own line:
[35] "49And Jacob called his sons and"
I'd like to transform this into:
[35] "\n49"
[36] "And Jacob called his sons and"
And insert it at the correct position, after element 34.
Dput Output:
dput(output.by.line)
c("\n17", "Now when Joseph saw that his father", "laid his right hand on the head of",
"Ephraim, it displeased him; so he took", "hold of his father's hand to remove it",
"from Ephraim's head to Manasseh's", "head.", "\n18", "And Joseph said to his father, \"Not so,",
"my father, for this one is the firstborn;", "put your right hand on his head.\"",
"\n19", "But his father refused and said, \"I", "know, my son, I know. He also shall",
"become a people, and he also shall be", "great; but truly his younger brother shall",
"be greater than he, and his descendants", "shall become a multitude of nations.\"",
"\n20", "So he blessed them that day, saying,", "\"By you Israel will bless, saying, \"May",
"God make you as Ephraim and as", "Manasseh!\"' And thus he set Ephraim",
"before Manasseh.", "\n21", "Then Israel said to Joseph, \"Behold, I",
"am dying, but God will be with you and", "bring you back to the land of your",
"fathers.", "\n22", "Moreover I have given to you one", "portion above your brothers, which I",
"took from the hand of the Amorite with", "my sword and my bow.\"",
"49And Jacob called his sons and", "said, \"Gather together, that I may tell",
"you what shall befall you in the last", "days:", "\n2", "\"Gather together and hear, you sons of",
"Jacob, And listen to Israel your father.", "\n3", "\"Reuben, you are my firstborn, My",
"might and the beginning of my strength,", "The excellency of dignity and the",
"excellency of power.", "\n4", "Unstable as water, you shall not excel,",
"Because you went up to your father's", "bed; Then you defiled it-- He went up to",
"my couch.", "\n5", "\"Simeon and Levi are brothers;", "Instruments of cruelty are in their",
"dwelling place.", "\n6", "Let not my soul enter their council; Let",
"not my honor be united to their", "assembly; For in their anger they slew a",
"man, And in their self-will they", "hamstrung an ox.", "\n7",
"Cursed be their anger, for it is fierce;", "And their wrath, for it is cruel! I will",
"divide them in Jacob And scatter them", "in Israel.", "\n8",
"\"Judah, you are he whom your brothers", "shall praise; Your hand shall be on the",
"neck of your enemies; Your father's", "children shall bow down before you.",
"\n9", "Judah is a lion's whelp; From the prey,", "my son, you have gone up. He bows",
"down, he lies down as a lion; And as a", "lion, who shall rouse him?",
"\n10", "The scepter shall not depart from", "Judah, Nor a lawgiver from between his",
"feet, Until Shiloh comes; And to Him", "shall be the obedience of the people.",
"\n11", "Binding his donkey to the vine, And his", "donkey's colt to the choice vine, He"
)
Please check this:
library(tidyverse)
split_line_number <- function(x) {
x %>%
str_replace("^([0-9]+)", "\n\\1\b") %>%
str_split("\b")
}
output.by.line %>%
map(split_line_number) %>%
unlist()
# Output:
# [35] "\n49"
# [36] "And Jacob called his sons and"
# [37] "said, \"Gather together, that I may tell"
# [38] "you what shall befall you in the last"
An option using stringr::str_match is to match two groups: an optional number followed by everything else. Take the captured groups from the match matrix (columns 2:3) and create a new vector of strings by dropping NAs and empty strings.
vals <- c(t(stringr::str_match(output.by.line, "(\n?\\d+)?(.*)")[, 2:3]))
output <- vals[!is.na(vals) & vals != ""]
output[32:39]
#[1] "portion above your brothers, which I"
#[2] "took from the hand of the Amorite with"
#[3] "my sword and my bow.\""
#[4] "49"
#[5] "And Jacob called his sons and"
#[6] "said, \"Gather together, that I may tell"
#[7] "you what shall befall you in the last" "days:"
We'll make use of the stringr package:
library(stringr)
Modify the object:
output.by.line <- unlist(
  ifelse(grepl('[[:digit:]][[:alpha:]]', output.by.line),
         str_split(gsub('([[:digit:]]+)([[:alpha:]])', paste0('\n', '\\1 \\2'), output.by.line),
                   '[[:blank:]]', n = 2),
         output.by.line)
)
Print the results:
print(output.by.line)
#[32] "portion above your brothers, which I"
#[33] "took from the hand of the Amorite with"
#[34] "my sword and my bow.\""
#[35] "\n49"
#[36] "And Jacob called his sons and"
#[37] "said, \"Gather together, that I may tell"
#[38] "you what shall befall you in the last"
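For completeness, the same split can be sketched in base R with sub() alone, assuming the problem line always starts with digits glued directly to a letter:

```r
# Base-R sketch: detect a leading number glued to a letter and split
# it into "\n<number>" plus the remaining text.
x <- "49And Jacob called his sons and"
num  <- sub("^([0-9]+)[A-Za-z].*$", "\\1", x)  # capture the leading digits: "49"
rest <- sub("^[0-9]+", "", x)                  # drop the leading digits
fixed <- c(paste0("\n", num), rest)
fixed
#> [1] "\n49"                          "And Jacob called his sons and"
```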

using regular expressions with R

I have a character vector in R. Some of the strings have a '(number)' pattern appended. I'm trying to remove this '(number)' string using regular expressions but cannot figure it out. I can find all the rows where the string has a whitespace then a character, but there must be a way to match these number strings specifically.
dat <- c("Alabama-Birmingham", "Arizona State", "Canisius", "UCF", "George Washington",
"Green Bay", "Iona", "Louisville (7)", "UMass", "Memphis", "Michigan State",
"Milwaukee", "Nebraska", "Niagara", "Northern Kentucky", "Notre Dame (21)",
"Quinnipiac", "Siena", "Tulsa", "Washington State", "Wright State",
"Xavier")
rows <- grep(" (.*)", dat)
fixed <- gsub(" (.*)","",games[rows,])
dat = fixed
First, you need to escape the parentheses, and it would be good to be more specific about what is inside them:
gsub("\\s+\\(\\d+\\)", "", dat)
[1] "Alabama-Birmingham" "Arizona State" "Canisius"
[4] "UCF" "George Washington" "Green Bay"
[7] "Iona" "Louisville" "UMass"
[10] "Memphis" "Michigan State" "Milwaukee"
[13] "Nebraska" "Niagara" "Northern Kentucky"
[16] "Notre Dame" "Quinnipiac" "Siena"
[19] "Tulsa" "Washington State" "Wright State"
[22] "Xavier"
We can do this with sub:
sub("\\s*\\(.*", "", dat)
#[1] "Alabama-Birmingham" "Arizona State" "Canisius"
#[4] "UCF" "George Washington" "Green Bay"
#[7] "Iona" "Louisville" "UMass"
#[10] "Memphis" "Michigan State" "Milwaukee"
#[13] "Nebraska" "Niagara" "Northern Kentucky"
#[16] "Notre Dame" "Quinnipiac" "Siena"
#[19] "Tulsa" "Washington State" "Wright State"
#[22] "Xavier"
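If you want to be stricter and only strip a parenthesised number that ends the string (rather than everything after the first parenthesis), anchoring the pattern with `$` is one option:

```r
# Stricter variant: only remove a "(number)" at the end of the string.
dat <- c("Louisville (7)", "Notre Dame (21)", "Arizona State")
cleaned <- gsub("\\s*\\(\\d+\\)$", "", dat)
cleaned
#> [1] "Louisville"    "Notre Dame"    "Arizona State"
```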

Parsing text file of one line JSON objects using RJSONIO

What I want: I would like to parse a text file of the form
{"business_id": "rncjoVoEFUJGCUoC1JgnUA", "full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345", "open": true, "categories": ["Accountants", "Professional Services", "Tax Services", "Financial Services"], "city": "Peoria", "review_count": 3, "name": "Peoria Income Tax Service", "neighborhoods": [], "longitude": -112.241596, "state": "AZ", "stars": 5.0, "latitude": 33.581867000000003, "type": "business"}
{"business_id": "0FNFSzCFP_rGUoJx8W7tJg", "full_address": "2149 W Wood Dr\nPhoenix, AZ 85029", "open": true, "categories": ["Sporting Goods", "Bikes", "Shopping"], "city": "Phoenix", "review_count": 5, "name": "Bike Doctor", "neighborhoods": [], "longitude": -112.10593299999999, "state": "AZ", "stars": 5.0, "latitude": 33.604053999999998, "type": "business"}
where every line is an individual JSON object. I would like the parsed form to be of a type that rpart can take as an argument.
I can get this working if I loop through every line, but according to this SO answer it is more R-like to use an apply function rather than looping through each line individually:
For each row in an R dataframe
Problem: When I run my code I'm getting this error
Error in apply(yelp_df, 1, fromJSON) : dim(X) must have a positive length
My code
#!/usr/bin/Rscript
require(graphics)
require(RJSONIO)
con <- file("yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json", "r")
yelp_df <- readLines(con) # rather than guessing the optimal buffer size, I'll just read everything into memory
apply(yelp_df, 1, fromJSON)
readLines returns a character vector, while apply expects an array. Use lapply or something similar:
out <- lapply(readLines("test.txt"), fromJSON)
> head(out[[1]])
$business_id
[1] "rncjoVoEFUJGCUoC1JgnUA"
$full_address
[1] "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345"
$open
[1] TRUE
$categories
[1] "Accountants" "Professional Services" "Tax Services"
[4] "Financial Services"
$city
[1] "Peoria"
$review_count
[1] 3
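Once lapply() has produced a list of parsed records, they can be bound into a data frame with base R. The sketch below uses two trimmed-down, made-up records in place of the parsed Yelp objects (keeping only a few scalar fields, since list-valued fields like `categories` need collapsing first):

```r
# Base-R sketch of binding the parsed records into a data frame.
# `out` is a made-up stand-in for lapply(readLines("test.txt"), fromJSON),
# keeping only scalar fields.
out <- list(
  list(business_id = "rncjoVoEFUJGCUoC1JgnUA", city = "Peoria",  stars = 5.0),
  list(business_id = "0FNFSzCFP_rGUoJx8W7tJg", city = "Phoenix", stars = 5.0)
)
yelp_df <- do.call(rbind, lapply(out, function(rec)
  as.data.frame(rec, stringsAsFactors = FALSE)))
yelp_df$city
#> [1] "Peoria"  "Phoenix"
```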
