Webscraping with R JSON - r

I want to get the names of the companies by two columns Region and Name of role-player. I find json links on each page already, but with RJSonio it didnt work. It's collect data, but how could I get it to a readable view? Could anybody help, thanks.
Here is the link
I try this code from another similiar question on Stackoverflow
library(RJSONIO)
library(RCurl)
grab the data
raw_data <- getURL("http://www.milksa.co.za/admin/settings/mis_rest/webservicereceive/GET/index/page:1/regionID:7.json")
#Then covert from JSON into a list in R
data <- fromJSON(raw_data)
length(data)
final_data <- do.call(rbind, data)
head (final_data)

My personal preference for this is to use the library jsonlite and not use fromJSON at all
require(jsonlite)
data<-jsonlite::fromJSON(raw_data, simplifyDataFrame = TRUE)
finalData<-data.frame(cbind(data$rolePlayers$RolePlayer$orgName, data$rolePlayers$Region$RegionNameEng))
colnames(finalData)<-c("Name", "Region")
Which gives you the following data frame:
Name Region
GoodHope Cheese (Pty) Ltd Western Cape
Jay Chem (Pty) Ltd Western Cape
Coltrade International cc Western Cape
GC Rieber Compact South Africa (Pty) Ltd Western Cape
Latana Cheese Pty Ltd Western Cape
Marco Frischknecht Western Cape
A great way to visualize how to query and what is in your JSON string can be found here:Chris Photo JSON viewer
You can just cut and paste it in there from the raw_data (removing external quotation marks). From there it becomes easy to see how to structure your data using addressing like you would with a traditional data frame and the $ operator.

Related

Extracting university names from affiliation in Pubmed data with R

I've been using the extremely useful rentrez package in R to get information about author, article ID and author affiliation from the Pubmed database. This works fine but now I would like to extract information from the affiliation field. Unfortunately the affiliation field is widely unstructured, not standardized string with various types of information such as the name of university, name of department, address and more delimited by commas. Therefore text mining approach is necessary to get any useful information from this field.
I tried the package easyPubmed in combination with rentrez, and even though easyPubmed package can extract some information from the affiliation field (e.g. email address, which is very useful), to my knowledge it cannot extract university name. I also tried the package pubmed.mineR, but unfortunately this also does not provide university name extraction. I startet to experiment with grep and regex functions but as I am no R expert I could not make this work.
I was able to find very similar threads solving the issue with python:
Regex for extracting names of colleges, universities, and institutes?
How to extract university/school/college name from string in python using regular expression?
But unfortunately I do not know how to convert the python regex function to an R regex function as I am not familiar with python.
Here is some example data:
PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = as.data.frame(cbind(PMID,author,Affiliation))
df
PMID author Affiliation
1 121 author1 blabla,University Ghent,blablabla
2 122 author2 University Washington, blabla, blablabla, blablabalbalba
3 123 author3 blabla,University of Florence,blabla
4 124 author4 University Chicago, Harvard University
5 125 author5 Oxford University
What I would like to get:
PMID author Affiliation University
1 121 author1 blabla,University Ghent,blablabla University Ghent
2 122 author2 University Washington,ba, bla, bla University Washington
3 123 author3 blabla,University Florence,blabla University of Florence
4 124 author4 University Chicago, Harvard Univ University Chicago, Harvard University
5 125 author5 Oxford University Oxford University
Please sorry if there is already a solution online, but I honestly googled a lot and did not find any clear solution for R. I would be very thankful for any hints and solutions to this task.
In general, regex expressions can be ported to R with some changes. For example, using the php link you included, you can create a new variable with extracted text using that regex expression, and only changing the escape character ("\\" instead "\"). So, using dplyr and stringr packages:
library(dplyr)
library(stringr)
df <- df %>%
mutate(Organization=str_extract(Affiliation,
"([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d)"))

Extract metadata with R

Good day
I am a newbie to Stackoverflow:)
I am trying my hand with programming with R and found this platform a great source of help.
I have developed some code leveraging stackoverflow, but now I am failing to read the metadata from this htm file
Please direct download this file before using in R
setwd("~/NLP")
library(tm)
library(rvest)
library(tm.plugin.factiva)
file <-read_html("facts.htm")
source <- FactivaSource(file)
corpus <- Corpus(source, readerControl = list(language = NA))
# See the contents of the documents
inspect(corpus)
head(corpus)
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
See meta-data associated with first article
meta(corpus[[3]])
meta(corpus[[3]])
author : character(0)
datetimestamp: 2017-08-31
description : character(0)
heading : Rain, Rain, Rain
id : TIMEUK-170830-e
language : en
origin : thetimes.co.uk
edition : character(0)
section : Comment
subject : c("Hurricanes/Typhoons", "Storms", "Political/General News", "Disasters/Accidents", "Natural Disasters/Catastrophes", "Risk News", "Weather")
coverage : c("United States", "North America")
company : character(0)
industry : character(0)
infocode : character(0)
infodesc : character(0)
wordcount : 333
publisher : News UK & Ireland Limited
rights : © Times Newspapers Limited 2017
How can I save each metadata (SE, HD, AU, ..PUB, AU) - all 18 metadata elements column-wise in a dataframe or write to excel for each document in corpus?
Example of output:
SE HD AU ...
Doc 1
2
3
Thank you for your help
The simplest way I know of to do it is:
Make a data frame from each of the three lists in your corpus:
one<-data.frame(unlist(meta(corpus[[1]])))
two<-data.frame(unlist(meta(corpus[[2]])))
three<-data.frame(unlist(meta(corpus[[3]])))
Then you will want to merge them into a single data frame. For the first two, this is easy to do, as using "row.names" will cause them to merge on the NON VARIABLE row names. But the second merge, you need to merge based on the column now named "Row.Names" So you need to create and rename the first column of the third file with the row names, using setDT allows you to do this without adding another full set of information, just redirecting R to see the row names as the first column
setDT(three, keep.rownames = TRUE)[]
colnames(three)[1] <- "Row.names"
then you simply merge the first and second data frame into variable named meta, and then merge meta with three using "Row.names" (the new name of the first column now).
meta <- merge(one, two, by="row.names", all=TRUE)
meta <- merge(meta, three, by = "Row.names", all=TRUE)
Your data will look like this:
Row.names unlist.meta.corpus..1.... unlist.meta.corpus..2.... unlist.meta.corpus..3....
1 author Jenni Russell <NA> <NA>
2 coverage1 United States North Korea United States
3 coverage2 North America United States North America
4 coverage3 <NA> Japan <NA>
5 coverage4 <NA> Pyongyang <NA>
6 coverage5 <NA> Asia Pacific <NA>
Those NA values are there because not all of the sub-lists had values for all of the observations.
By using the all=TRUE on both merges, you preserve all of the fields, with and without data, which makes it easy to work with moving forward.
If you look at this PDF from CRAN on page two the section Details shows you how to access the content and metadata. From there is is simply about unlisting to move them into data frames.
If you get lost, send a comment and I will do what I can to help you out!
EDIT BY REQUEST:
To write this to Excel is not super difficult because the data is already "square" in a uniform data frame. You would just install xlsx package and xlxsjars then use the following function:
write.xlsx(meta, file, sheetName="Sheet1",
col.names=TRUE, row.names=TRUE, append=FALSE, showNA=TRUE)
You can find information about the package here: page 38 gives more detail.
And if you want to save the content, you can change meta to content in the files which extract the data from corpus and make the initial dataframes. The entire process will be the same otherwise

Extracting from result [duplicate]

I want to get the names of the companies by two columns Region and Name of role-player. I find json links on each page already, but with RJSonio it didnt work. It's collect data, but how could I get it to a readable view? Could anybody help, thanks.
Here is the link
I try this code from another similiar question on Stackoverflow
library(RJSONIO)
library(RCurl)
grab the data
raw_data <- getURL("http://www.milksa.co.za/admin/settings/mis_rest/webservicereceive/GET/index/page:1/regionID:7.json")
#Then covert from JSON into a list in R
data <- fromJSON(raw_data)
length(data)
final_data <- do.call(rbind, data)
head (final_data)
My personal preference for this is to use the library jsonlite and not use fromJSON at all
require(jsonlite)
data<-jsonlite::fromJSON(raw_data, simplifyDataFrame = TRUE)
finalData<-data.frame(cbind(data$rolePlayers$RolePlayer$orgName, data$rolePlayers$Region$RegionNameEng))
colnames(finalData)<-c("Name", "Region")
Which gives you the following data frame:
Name Region
GoodHope Cheese (Pty) Ltd Western Cape
Jay Chem (Pty) Ltd Western Cape
Coltrade International cc Western Cape
GC Rieber Compact South Africa (Pty) Ltd Western Cape
Latana Cheese Pty Ltd Western Cape
Marco Frischknecht Western Cape
A great way to visualize how to query and what is in your JSON string can be found here:Chris Photo JSON viewer
You can just cut and paste it in there from the raw_data (removing external quotation marks). From there it becomes easy to see how to structure your data using addressing like you would with a traditional data frame and the $ operator.

Can't import this excel file into R

I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.
This image might help describe how I retrieved the data
Here's what I've tried typing into the console, along with the results:
Input:
> library('readxl')
> data1992 <- read_excel("1992.xls")
Output:
Not an excel file
Error in eval(substitute(expr), envir, enclos) :
Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls
Input:
> data1992 <- read.csv ("1992.xls", sep ="\t")
Output:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I'm not sure whether or not this is relevant, but I'm using GalliumOS (linux). Because I'm using Linux, Excel isn't installed on my computer. LibreOffice is.
Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?
# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'
library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
read_html() %>%
html_nodes(xpath='//*[#id="orgtable"]') %>%
html_table()%>%
extract2(1) %>%
mutate(Funding = as.numeric(gsub('[^0-9.]','',Funding)))
head(df)
returns
Organization City State Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 356221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 1097158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 629946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 1757241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 2161146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 450411
If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.
This works for me
library(gdata)
dat1 <- read.xls("1992.xls")
If you're on 32-bit Windows this will also work:
require(RODBC)
dat1 <- odbcConnectExcel("1992.xls")
For several more options that rely on rJava-based packages like xlsx you can check out this link.
As someone mentioned in the comments it's also easy to save the file as a .csv and read it in that way. This will save you the trouble of dealing with the effects of strange formatting or metadata on your imported file:
dat1 <- read.csv("1992.csv")
head(dat1)
ORGANIZATION CITY STATE COUNTRY AWARDS FUNDING
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 $356,221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 $1,097,158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 $629,946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 $1,757,241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 $2,161,146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 $450,411
Converting to .csv is also usually the fastest way in my opinion (though this is only an issue with Big Data).

How do I preserve prexisting identifiers when geocoding a list of addresses in R?

I'm currently working with an R script set up to use RDSTK, a wrapper for the Data Science Toolkit API based on this, to geocode a list of addresses from a CSV.
The script appears to work, but the list of addresses has a preexisting unique identifier which isn't preserved in the process - the input file has two columns: id, and address. The id column, for the purposes of the geocoding process, is meaningless, but I'd like the output to retain it - that is, I'd like the output, which has three columns (address, long, and lat) to have four - id being the first.
The issue is that
The output is not in the same order as the input addresses, or doesn't appear to be, so I cannot simply tack on the column of addresses at the end, and
The output does not include nulls, so the two would not be the same number of rows in any case, even if it was the same order, and
I am not sure how to effectively tie the id column in such that it becomes a part of the geocoding process, which obviously would be the ideal solution.
Here is the script:
require("RDSTK")
library(httr)
library(rjson)
dff = read.csv("C:/Users/name/Documents/batchtestv2.csv")
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json, function(x) c(long=x$longitude,lat=x$latitude)))
geocode
write.csv(geocode, file = "C:/Users/name/Documents/geocodetest.csv")
And here is a sample of the output:
2633 Camino Ramon Suite 500 San Ramon California 94583 United States -121.96208 37.77027
555 Lordship Boulevard Stratford Connecticut 6615 United States -73.14098 41.16542
500 West 13th Street Fort Worth Texas 76102 United States -97.33288 32.74782
50 North Laura Street Suite 2500 Jacksonville Florida 32202 United States -81.65923 30.32733
7781 South Little Egypt Road Stanley North Carolina 28164 United States -81.00597 35.44482
Maybe the solution is extraordinarily simple and I'm just being dense - it's entirely possible (I don't have extensive experience with any particular language, so I sometimes miss obvious things) but I haven't been able to solve it.
Thanks in advance!

Resources