cleanNLP package in R: metadata data frame?

Let's assume my dataframe looks like this:
bio_text <- c("Georg Aemilius, eigentlich Georg Oemler, andere Namensvariationen „Aemylius“ und „Emilius“ (* 25. Juni 1517 in Mansfeld; † 22. Mai 1569 in Stolberg (Harz))...", "Johannes Aepinus auch: Johann Hoeck, Huck, Hugk, Hoch oder Äpinus (* um 1499 in Ziesar; † 13. Mai 1553 in Hamburg) war ein deutscher evangelischer Theologe und Reformator.\nAepinus wurde als Sohn des Ratsherrn Hans Hoeck im brandenburgischen Ziesar 1499 geboren...")
doc_id <- c("1", "2")
url <- c("https://de.wikipedia.org/wiki/Georg_Aemilius", "https://de.wikipedia.org/wiki/Johannes_Aepinus")
name <- c("Aemilius, Georg", "Aepinus, Johannes")
place_of_birth <- c("Mansfeld", "Ziesar")
full_wikidata <- data.frame(bio_text, doc_id, url, name, place_of_birth)
I want to carry out Named Entity Recognition with the cleanNLP package in R. Therefore, I initialize the tokenizers and the spaCy backend, everything works fine:
options(stringsAsFactors = FALSE)
library(cleanNLP)
cnlp_init_tokenizers()
require(reticulate)
cnlp_init_spacy("de")
wikidata <- full_wikidata[,c("doc_id", "bio_text")]
wikimeta <- full_wikidata[,c("url", "name", "place_of_birth")]
spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(wikidata, as_strings = TRUE, meta = wikimeta)
My only problem is the metadata. When I run it like this, I get the following warning message: In cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE, meta = wikimeta) : data frame input given along with meta; ignoring the latter. To be honest, I don't understand the documentation for the meta argument of cnlp_annotate: "an optional data frame to bind to the document table". This means that I should supply a data frame containing the metadata, right? Later on, I want to be able to do something like this, e.g. extract and count all person entities in document no. 3:
cnlp_get_entity(spacy_annotatedWikidata) %>%
filter(doc_id == 3, entity_type == "PER") %>%
count(entity)
Therefore, I have to find a way to access the metadata. Any help would be highly appreciated!

Fortunately, in the meantime I got some help and the advice to take a closer look at the source code of cnlp_annotate on GitHub: https://github.com/statsmaths/cleanNLP/blob/master/R/annotate.R
It shows that you can only pass in a metadata data frame if the input itself is not a data frame but a file path. So if you do want to pass in a data frame, the first column has to be doc_id, the second the text, and the remaining ones are automatically treated as metadata. So in my example only the column order in full_wikidata has to be changed:
full_wikidata <- data.frame(doc_id, bio_text, url, name, place_of_birth)
Like this, it can be used directly as input to cnlp_annotate:
spacy_annotatedWikidata <- cleanNLP::cnlp_annotate(full_wikidata, as_strings = TRUE)
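The reordering step, and a way to get at the metadata afterwards, can be sketched in base R. This is only an illustration under assumptions: the data is a toy stand-in for the real biographies, and the `entities` data frame is a hypothetical substitute for what cnlp_get_entity() would return, not actual cleanNLP output.

```r
# Toy stand-in for full_wikidata, with the text column deliberately first
full_wikidata <- data.frame(
  bio_text = c("Text about Georg Aemilius", "Text about Johannes Aepinus"),
  doc_id = c("1", "2"),
  name = c("Aemilius, Georg", "Aepinus, Johannes"),
  place_of_birth = c("Mansfeld", "Ziesar"),
  stringsAsFactors = FALSE
)

# Move doc_id and the text column to the front; everything else follows as metadata
lead_cols <- c("doc_id", "bio_text")
reordered <- full_wikidata[, c(lead_cols, setdiff(names(full_wikidata), lead_cols))]

# Hypothetical entity table (assumed shape: one row per entity, keyed by doc_id)
entities <- data.frame(
  doc_id = c("1", "1", "2"),
  entity = c("Georg Aemilius", "Mansfeld", "Johannes Aepinus"),
  entity_type = c("PER", "LOC", "PER"),
  stringsAsFactors = FALSE
)

# Attach the metadata to each entity by joining on doc_id
meta_cols <- setdiff(names(reordered), "bio_text")
entities_with_meta <- merge(entities, reordered[, meta_cols], by = "doc_id")
```

Joining on doc_id keeps the metadata out of the annotation step entirely, so it works regardless of how the annotation object stores its document table.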

Related

bib list print to a character string in R

I'm reading a bib file exported from Google Scholar with BIB <- bibtex::read.bib("file.bib"), which creates a list object. If I use paste(BIB) or as.character(BIB), the console shows a line like this for every item in the list:
"list(title = "A Lealdade no Sistema Financeiro Portugu{\\^e}s", author = list(list(given = c("Francisco", "José", "dos", "Santos", "Mota", "Ferreira"), family = "Guerra", role = NULL, email = NULL, comment = NULL)), year = "2017", school = "Universidade de Coimbra")"
And if I use print(), it shows:
Guerra FJdSMF (2017). A Lealdade no Sistema Financeiro Português. Ph.D. thesis,
Universidade de Coimbra.
I need to get the second form into a new character string, but nothing I try works. I've tried A <- paste(print(BIB)), A <- as.character(print(BIB)) and just A <- print(BIB); each gives me either the first kind of line or an object identical to BIB.
I have already tried opening the same file with bib2df::bib2df(), but that has problems with the encoding and with the data frame's columns and rows.
Try format(BIB). For example:
bib <- read.bib( package = "bibtex" )
x <- format(bib)
x
# [1] "R Development Core Team (2009). _R: A Language and Environment for\nStatistical Computing_. R Foundation for Statistical Computing, Vienna,\nAustria. ISBN 3-900051-07-0, <http://www.R-project.org>."
I found this by looking at class(BIB), which returned "bibentry". I then listed all methods defined for that class with methods(class="bibentry"), and format seemed like a good candidate.
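The same discovery process can be tried without a .bib file. The sketch below builds a bibentry by hand with utils::bibentry(); the title, author and school are made-up placeholders, not the entry from the question.

```r
# Build a bibentry by hand (utils ships with base R); all field values are placeholders
bib <- utils::bibentry(
  bibtype = "PhdThesis",
  title = "An Example Thesis Title",
  author = utils::person("Jane", "Doe"),
  year = "2017",
  school = "Example University"
)

# Inspect the class, then list the methods registered for it
bib_class <- class(bib)
fmt_methods <- methods(class = "bibentry")

# format() returns the formatted citation as a character vector instead of printing it
citation_text <- format(bib)
```

Unlike print(), which writes to the console and returns its argument invisibly, format() hands back the text itself, so it can be stored or pasted into other strings.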

Having problem with ggmap's mapdist() function

I have my Google API key set up and registered in R, and the Distance Matrix API is enabled in the Google Cloud console.
Here is the data frame I have: 25 random pairs of FROM and TO postal codes.
Dataset_test = data.frame(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4","L8E 4Y0","H2Y 7N6",
"K1B 7C0","G4A 5B0","E4P 3T2","E4V 5P4","H3J 1R5",
"G0B 4J7","E7A 6E7","E5B 2Y9","S4H 1T8","A2V 4G5",
"V8L 2A9","T9E 1M5","A5A 5M2","E4T 5B4","S2V 6C4",
"S9H 5P8","B1Y 0V0"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9","S4H 2M0","A1Y 0B8",
"A1W 1E9","P9N 7X1","E4R 4B0","N0P 0M8","E1W 9Y7",
"T9W 8E2","G6X 4S9","A0E 0V4","J5X 7N8","N4N 8A1",
"V9K 0B9","L4G 3H7","E1W 0T2","G5R 9G3","L7C 9S2",
"E8P 2X6","E2A 2M1")
)
Here is the simple script I have to try to calculate the distance between the two postal codes by driving using Google's Distance Matrix API.
Driving_Distance = mapdist(from = Dataset_test[["FROM_POSTAL"]], to = Dataset_test[["TO_POSTAL"]], mode = c("driving")) %>% distinct()
When I run this, it throws the following error:
Error: Argument 1 is a list, must contain atomic vectors
Your Canadian postal codes work with the Distance Matrix API; the code below queries it through googleway::google_distance() rather than mapdist().
The number of addresses was shortened for the sake of brevity.
A tibble was used instead of a data frame so that the variables are character rather than factor. The actual Google API key has been replaced with placeholder text.
The working code is below:
library(ggmap)
library(plyr)
library(googleway)
library(tidyverse)
df = tibble(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9"))
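# Query the Distance Matrix API once per row; the names must match the tibble's columns
dd <- apply(df, 1, function(x){
google_distance(origins = list(x[["FROM_POSTAL"]]),
destinations = list(x[["TO_POSTAL"]]),
key="My_secret_key")
})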
dd

How to `data.frame` with different number of rows but related (not `by`)

Here is the sample of the XML format in my dataset.
<info>
<a>1990-01-02T06:58:12+08:00</a>
<b>120.980</b>
<c>23.786</c>
<d>18.7</d>
<e>2</e>
</info>
<info>
<a>1990-02-02T06:58:12+08:00</a>
<b>120.804</b>
<c>23.790</c>
</info>
But the number of d and e tags is not the same as the number of a, b and c tags; for example, there are 4000 rows for tags a, b, c and only 3950 rows for tags d, e.
Here is my code in R
library(xml2)
df <- data.frame(Time = xml_text(xml_find_all(xml_data, ".//a")),
                 Num = xml_text(xml_find_all(xml_data, ".//b")),
                 Dist = xml_text(xml_find_all(xml_data, ".//c")),
                 Gap = xml_text(xml_find_all(xml_data, ".//d")),
                 Type = xml_text(xml_find_all(xml_data, ".//e")),
                 stringsAsFactors = FALSE)
The error message is (I knew this would happen):
arguments imply differing number of rows
The output I want will be like the table below:
Time Num Dist Gap Type
1990-01-02T06:58:12+08:00 120.980 23.786 18.7 2
1990-02-02T06:58:12+08:00 120.804 23.790 <NA> <NA>
...
1993-03-03T08:42:15+08:00 120.412 23.523 <NA> 1
Which function or library should I try for this?
Thanks for helping me !!
I have also tried other approaches, such as map_if.
Finally I found the solution!
When working with an XML file, be sure to get the nodes of the individual records first.
Here I will show you how it works.
Take this XML file as an example (name it test.xml):
<dataset>
<dataset_info>
<data_count>2</data_count>
<status>Actual</status>
</dataset_info>
<data>
<time>2019-06-01</time>
<event>event1</event>
<describe>describe for event1</describe>
</data>
<data>
<time>2019-06-02</time>
<event>event2</event>
</data>
</dataset>
The describe tag is missing in event2, but we still want to build a data frame from this XML data. I was taught to use the function xml2::xml_find_all to get the values of a selected tag, with R code like this:
# library import
library(xml2)
# file reading
xml <- read_xml("path/where/the/file/is/test.xml")
data.frame(Time = xml_text(xml_find_all(xml, ".//time")),
           Event = xml_text(xml_find_all(xml, ".//event")),
           Describe = xml_text(xml_find_all(xml, ".//describe"))
)
This gives the error message: arguments imply differing number of rows
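So what we need to do is get the record nodes first, and then look up each tag once per record with xml_find_first (xml_find_all on the records would still return only the tags that exist, so the counts would still differ):
# library import
library(xml2)
# file reading
xml <- read_xml("path/where/the/file/is/test.xml")
# one node per record
record <- xml_find_all(xml, ".//data")
# xml_find_first returns one result per record, NA where the tag is missing
data.frame(Time = xml_text(xml_find_first(record, ".//time")),
           Event = xml_text(xml_find_first(record, ".//event")),
           Describe = xml_text(xml_find_first(record, ".//describe"))
)
After adding record <- xml_find_all(xml, ".//data") and looking up each field per record, every column has exactly one entry per record, so the differing counts no longer cause an error.
Hope this helps!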
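A self-contained sketch of the per-record lookup, using an inline copy of the test.xml records so it runs without a file on disk. It relies on xml2::xml_find_first, which returns one result per input node and a missing node (NA after xml_text) when a tag is absent; the variable names are only illustrative.

```r
library(xml2)

# Inline copy of the two records from test.xml, so no file is needed
doc <- read_xml("<dataset>
  <data><time>2019-06-01</time><event>event1</event><describe>describe for event1</describe></data>
  <data><time>2019-06-02</time><event>event2</event></data>
</dataset>")

# One node per record
records <- xml_find_all(doc, ".//data")

# xml_find_first() yields exactly one result per record (a missing node if absent),
# so every column has the same length and missing tags become NA
df <- data.frame(
  Time = xml_text(xml_find_first(records, "./time")),
  Event = xml_text(xml_find_first(records, "./event")),
  Describe = xml_text(xml_find_first(records, "./describe")),
  stringsAsFactors = FALSE
)
```

The key point is that the lookup is anchored at each record: the number of rows is fixed by the number of data nodes, not by how many times each child tag happens to occur.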

Convert R JSON Twitter data to list

After running searchTwitteR, I converted the result to a data frame and then exported it to JSON. However, all the text ends up in one array, and likewise for every other field (sample below). I need to separate the data so that each tweet is its own record.
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "phishdf.json")
json_phishdf <- fromJSON(file="phishdf.json")
I tried converting to a list and am wondering if maybe converting to a data frame is a mistake.
However, for a list, I tried:
newlist['text']=phish[[1]]$getText()
But this will just give me the text for the first tweet. Is there a way to iterate over the entire data set, maybe in a for loop?
{"text":["#ilazer #abbijacobson I do feel compelled to say that I phind phish awphul... sorry, Abbi!","#phish This on-sale was an embarrassment. Something needs to change.","FS: Have 2 Tix To Phish In Chula Vista #Phish #facevaluetickets #phish #facevalue GO: https://t.co/dFdrpyaotp","RT #WKUPhiDelt: Come unwind from a busy week of class and kick off the weekend with a Phish Fry! 4:30-7:30 at the Phi Delt house. Cost is $\u2026","RT #phish: Tickets for Phish's July 15 & 16 shows at The Gorge go on sale in fifteen minutes at 1PM ET: https://t.co/tEKLNjI5u7 https://t.c\u2026"],
"favorited":[false,false,false,false,false],
"favoriteCount":[0,0,0,0,0],
"replyToSN":["rAlexandria","phish","NA","NA","NA"],
"created":[1456521159,1456521114,1456521022,1456521016,1456520988],
"truncated":[false,false,false,false,false],
"replyToSID":["703326502629277696","703304948990222337","NA","NA","NA"],
"id":["703326837720662016","703326646074343424","703326261045829632","703326236722991105","703326119328686080"],
"replyToUID":["26152867","14503997","NA","NA","NA"],"statusSource":["Mobile Web (M5)","Twitter for iPhone","CashorTrade - Face Value Tickets","Twitter for iPhone","Twitter for Android"],
"screenName":["rAlexandria","adamgelvan","CashorTrade","Kyle_Smith1087","timogrennell"],
"retweetCount":[0,0,0,2,5],
"isRetweet":[false,false,false,true,true],
"retweeted":[false,false,false,false,false],
"longitude":["NA","NA","NA","NA","NA"],
"latitude":["NA","NA","NA","NA","NA"]}
I followed your code and don't get the issue you're describing. Are you using library(twitteR) and library(jsonlite)?
Here is the code that works for me:
library(twitteR)
library(jsonlite)
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "./../phishdf.json")
## note the `txt` argument, as opposed to `file` used in the question
json_phishdf <- fromJSON(txt="./../phishdf.json")
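The difference between one-array-per-field and one-object-per-tweet output can be seen without Twitter access. The sketch below uses a made-up two-row data frame in place of phishdf and jsonlite's `dataframe` argument of toJSON; it is an illustration, not the exact pipeline from the question.

```r
library(jsonlite)

# Toy stand-in for the tweet data frame (values are made up for illustration)
tweets <- data.frame(
  text = c("first tweet", "second tweet"),
  retweetCount = c(0, 2),
  stringsAsFactors = FALSE
)

# Row-oriented JSON: one object per tweet (the jsonlite default for data frames)
json_rows <- toJSON(tweets, dataframe = "rows")

# Column-oriented JSON: one array per field, like the sample in the question
json_cols <- toJSON(tweets, dataframe = "columns")

# simplifyVector = FALSE keeps each tweet as its own list element
tweet_list <- fromJSON(json_rows, simplifyVector = FALSE)
texts <- vapply(tweet_list, function(tw) tw$text, character(1))
```

Iterating with vapply over tweet_list is the list-based equivalent of the for loop asked about in the question.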

How to read a data frame in jsonlite

I am trying to read a data frame with jsonlite. I have a JSON file (made from the HTML of a newspaper article and structured into some variables: title, body, comments), and I managed to read it in.
install.packages("jsonlite")
library("jsonlite")
article1 <- fromJSON("/Users/eva/Desktop/bdk.blufaerdighed.html.json")
I now want to "see" the content of my file (I want to see the text belonging to the variables; title, body, comments etc.), but I don't know which function to use.
My JSON file looks like this:
{ "author":"Ulla Gunge", "body":"De har »opført sig blufærdighedskrænkende«. I en uge har to otteårige drenge derfor været bortvist fra deres skole i Horsens.", "comments":[], "subtitle":"To otteårige drenge er blevet bortvist fra deres skole for at have vist deres tissemænd til en pige. Men den slags er helt normalt i den alder, fortæller forsker i voksnes syn på børns seksualitet, doktorlege og pædofilifrygt.", "title":"Blufærdighed i børnehøjde" }
To see those fields you just print them
article1$title
article1$body
article1$comments
A data frame is a list of vectors of equal length. The comments field in your JSON has length 0. You can see this, for example, by typing:
sapply(article1, length)
# author body comments subtitle title
# 1 1 0 1 1
If you can change the JSON from "comments":[] to "comments":["NA"], it is possible to get the data frame directly from the fromJSON function by using the simplifyDataFrame argument.
Or you could change the list with article1$comments <- NA and get the data frame with:
as.data.frame(article1)
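The second route can be checked end to end with an inline JSON string. The sketch below uses a trimmed-down, partly placeholder copy of the article (the body and title values are shortened stand-ins), so only the structure matches the real file.

```r
library(jsonlite)

# Trimmed-down copy of the article JSON (body and title are placeholder values)
article1 <- fromJSON('{"author":"Ulla Gunge","body":"...","comments":[],"title":"Example title"}')

# The empty comments array has length 0, which blocks a direct conversion
field_lengths <- sapply(article1, length)

# Replace the empty field with NA so every element has length 1
article1$comments <- NA
article_df <- as.data.frame(article1, stringsAsFactors = FALSE)
```

After the replacement every element has length 1, so as.data.frame() yields a single-row data frame with one column per field.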
