I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.
Here's what I've tried typing into the console, along with the results:
Input:
> library('readxl')
> data1992 <- read_excel("1992.xls")
Output:
Not an excel file
Error in eval(substitute(expr), envir, enclos) :
Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls
Input:
> data1992 <- read.csv ("1992.xls", sep ="\t")
Output:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I'm not sure whether this is relevant, but I'm using GalliumOS (Linux). Since I'm on Linux, Excel isn't installed on my computer; LibreOffice is.
Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?
# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'
library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="orgtable"]') %>%
  html_table() %>%
  extract2(1) %>%
  mutate(Funding = as.numeric(gsub('[^0-9.]', '', Funding)))
head(df)
returns
Organization City State Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 356221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 1097158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 629946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 1757241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 2161146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 450411
If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.
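For instance, here is a minimal sketch of that loop, assuming the fy= query parameter behaves the same for every year, and reusing the table id and Funding cleanup from above:
library(rvest)
library(magrittr)
library(dplyr)

get_year <- function(fy) {
  sprintf('http://report.nih.gov/award/index.cfm?fy=%d', fy) %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="orgtable"]') %>%
    html_table() %>%
    extract2(1) %>%
    mutate(Funding = as.numeric(gsub('[^0-9.]', '', Funding)),
           Year = fy)  # tag each table with its fiscal year
}

all_years <- bind_rows(lapply(1992:2016, get_year))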
This works for me (gdata's read.xls converts the file via Perl, which should already be installed on a Linux system):
library(gdata)
dat1 <- read.xls("1992.xls")
If you're on 32-bit Windows this will also work, though note that odbcConnectExcel returns a connection rather than a data frame, so you still have to fetch the sheet:
require(RODBC)
con <- odbcConnectExcel("1992.xls")
dat1 <- sqlFetch(con, "1992")  # sheet name is a guess; list the sheets with sqlTables(con)
odbcClose(con)
For several more options that rely on rJava-based packages like xlsx, you can check out this link.
As someone mentioned in the comments, it's also easy to save the file as a .csv and read it in that way, which saves you the trouble of dealing with strange formatting or metadata in the imported file:
dat1 <- read.csv("1992.csv")
head(dat1)
ORGANIZATION CITY STATE COUNTRY AWARDS FUNDING
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 $356,221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 $1,097,158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 $629,946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 $1,757,241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 $2,161,146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 $450,411
In my opinion, converting to .csv is also usually the fastest option (though speed only really becomes an issue with big data).
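One caveat: in the output above, FUNDING comes through as text because of the dollar signs and commas, so you will likely want to strip those before doing any arithmetic (the same gsub trick as in the scraping answer):
# strip "$" and "," and convert to numeric
dat1$FUNDING <- as.numeric(gsub("[^0-9.]", "", dat1$FUNDING))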
I am currently trying to webscrape the following website: https://chicago.suntimes.com/crime/archives
I have been relying on SelectorGadget to find the XPath for web scraping. However, I am unable to use the gadget on this website, so I have to inspect the page source to find what I need. I have been trying to find the relevant CSS and XPath by scrolling through the source, but I was not able to, given my limited experience.
Could you please help me find the XPath or CSS for
Title
Author
Date
I am sorry if this is just a laundry list of everything... but I am really stuck. I would really appreciate any help!
Thank you very much.
For each element you want to extract, find the relevant tag and its class with SelectorGadget, and you'll be able to get what you want.
library(rvest)
url <- 'https://chicago.suntimes.com/crime/archives'
webpage <- url %>% read_html()
title <- webpage %>% html_nodes('h2.c-entry-box--compact__title') %>% html_text()
author <- webpage %>% html_nodes('span.c-byline__author-name') %>% html_text()
date <- webpage %>% html_nodes('time.c-byline__item')%>% html_text() %>% trimws()
result <- data.frame(title, author, date)
result
# title author date
#1 Belmont Cragin man charged with carjacking in Little Village: police Sun-Times Wire February 17
#2 Gas station robbed, man carjacked in Horner Park Jermaine Nolen February 17
#3 8 shot, 2 fatally, Tuesday in Chicago Sun-Times Wire February 17
#4 Businesses robbed at gunpoint on the Northwest Side: police Sun-Times Wire February 17
#5 Man charged with carjacking in Aurora Sun-Times Wire February 16
#6 Woman fatally stabbed in Park Manor apartment Sun-Times Wire February 16
#7 Woman critically hurt by gunfire in Woodlawn David Struett February 16
#8 Teen boy, 17, charged with attempted carjacking in Back of the Yards Sun-Times Wire February 16
#...
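One caveat with building the result this way: data.frame(title, author, date) only works if all three vectors have the same length, which fails if any article is missing an author or date. A more defensive sketch (the wrapper class c-entry-box--compact is my guess, inferred from the class names above) selects each article node first, then uses html_node, which returns NA for missing children:
entries <- webpage %>% html_nodes('.c-entry-box--compact')  # one node per article (assumed wrapper class)
result <- data.frame(
  title  = entries %>% html_node('h2.c-entry-box--compact__title') %>% html_text(trim = TRUE),
  author = entries %>% html_node('span.c-byline__author-name') %>% html_text(trim = TRUE),
  date   = entries %>% html_node('time.c-byline__item') %>% html_text(trim = TRUE)
)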
Sorry if this is repetitive, but I've looked everywhere and can't seem to find anything that addresses my specific problem in R. I have a column with city names:
cities <- data.frame(CityName = c("Sydney", "Dusseldorf", "LidCombe", "Portland"))
Ideally I'd like to attach a column with either the lat/long for each city or the time zone. I have tried the ggmap package, but my request exceeds the maximum number of requests it allows per day. I found the geonames package, which converts lat/long to time zones, so if I can get the lat/long for each city I should be able to take it from there.
Edit, to address a potential duplicate question: I would like to do this without the ggmap package, as I have too many rows for its daily request limit.
You can get at least the major cities from the world.cities data in the maps package.
## Changing your data to a vector
cities <- c("Sydney", "Dusseldorf", "LidCombe", "Portland")
## Load up data
library(maps)
data(world.cities)
world.cities[match(cities, world.cities$name), ]
name country.etc pop lat long capital
36817 Sydney Australia 4444513 -33.87 151.21 0
10026 Dusseldorf Germany 573521 51.24 6.79 0
NA <NA> <NA> NA NA NA NA
29625 Portland Australia 8757 -38.34 141.59 0
Note: LidCombe was not included.
Warning: For many names, there is more than one world city. For example,
world.cities[grep("Portland", world.cities$name), ]
name country.etc pop lat long capital
29625 Portland Australia 8757 -38.34 141.59 0
29626 Portland USA 542751 45.54 -122.66 0
29627 Portland USA 62882 43.66 -70.28 0
Of course the two in the USA are Portland, Maine and Portland, Oregon.
match is just giving the first one on the list. You may need to use more information than just the name to get a good result.
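If you do hit duplicates, one heuristic (my assumption being that the most populous city is usually the one you mean) is to keep the biggest match per name:
library(dplyr)
world.cities %>%
  filter(name %in% cities) %>%
  group_by(name) %>%
  slice_max(pop, n = 1) %>%  # keep the most populous city for each name
  ungroup()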
I want to get the names of the companies in two columns, Region and Name of role-player. I have already found the JSON links on each page, but with RJSONIO it didn't work: it collects the data, but how can I get it into a readable form? Could anybody help? Thanks.
Here is the link
I tried this code from another similar question on Stack Overflow:
library(RJSONIO)
library(RCurl)
# grab the data
raw_data <- getURL("http://www.milksa.co.za/admin/settings/mis_rest/webservicereceive/GET/index/page:1/regionID:7.json")
# Then convert from JSON into a list in R
data <- fromJSON(raw_data)
length(data)
final_data <- do.call(rbind, data)
head (final_data)
My personal preference here is to use the jsonlite library rather than RJSONIO's fromJSON:
require(jsonlite)
data <- jsonlite::fromJSON(raw_data, simplifyDataFrame = TRUE)
finalData <- data.frame(cbind(data$rolePlayers$RolePlayer$orgName, data$rolePlayers$Region$RegionNameEng))
colnames(finalData) <- c("Name", "Region")
Which gives you the following data frame:
Name Region
GoodHope Cheese (Pty) Ltd Western Cape
Jay Chem (Pty) Ltd Western Cape
Coltrade International cc Western Cape
GC Rieber Compact South Africa (Pty) Ltd Western Cape
Latana Cheese Pty Ltd Western Cape
Marco Frischknecht Western Cape
A great way to visualize how to query your JSON string, and what is in it, can be found here: Chris Photo JSON viewer.
You can just cut and paste it in there from the raw_data (removing external quotation marks). From there it becomes easy to see how to structure your data using addressing like you would with a traditional data frame and the $ operator.
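If you'd rather stay inside R, str() gives a similar view of the nesting, which is how you can discover paths like data$rolePlayers$RolePlayer$orgName in the first place:
# limit the depth so the listing stays readable
str(data, max.level = 3)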
I'm currently working with an R script set up to use RDSTK, a wrapper for the Data Science Toolkit API based on this, to geocode a list of addresses from a CSV.
The script appears to work, but the list of addresses has a preexisting unique identifier which isn't preserved in the process. The input file has two columns, id and address; the id column is meaningless for the geocoding itself, but I'd like the output, which currently has three columns (address, long, and lat), to have four, with id as the first.
The issue is that:
1. The output is not in the same order as the input addresses (or doesn't appear to be), so I cannot simply tack the id column on at the end;
2. The output does not include nulls, so even if the order matched, the two would not have the same number of rows; and
3. I am not sure how to tie the id column in so that it becomes part of the geocoding process, which would obviously be the ideal solution.
Here is the script:
require("RDSTK")
library(httr)
library(rjson)
dff = read.csv("C:/Users/name/Documents/batchtestv2.csv")
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json, function(x) c(long=x$longitude,lat=x$latitude)))
geocode
write.csv(geocode, file = "C:/Users/name/Documents/geocodetest.csv")
And here is a sample of the output:
2633 Camino Ramon Suite 500 San Ramon California 94583 United States -121.96208 37.77027
555 Lordship Boulevard Stratford Connecticut 6615 United States -73.14098 41.16542
500 West 13th Street Fort Worth Texas 76102 United States -97.33288 32.74782
50 North Laura Street Suite 2500 Jacksonville Florida 32202 United States -81.65923 30.32733
7781 South Little Egypt Road Stanley North Carolina 28164 United States -81.00597 35.44482
Maybe the solution is extraordinarily simple and I'm just being dense; it's entirely possible (I don't have extensive experience with any particular language, so I sometimes miss obvious things), but I haven't been able to solve it.
Thanks in advance!
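One possible approach, as a sketch: the DSTK street2coordinates response is, as far as I know, keyed by the input address strings, so you can carry the address out of the JSON via names(json) and then merge back onto dff. That keeps id, and a left join turns missing geocodes into NAs rather than dropped rows:
# build the address into each result row so it can be joined back to the input
geocode <- do.call(rbind, lapply(names(json), function(addr) {
  x <- json[[addr]]
  data.frame(address = addr,
             long = if (is.null(x$longitude)) NA else x$longitude,
             lat = if (is.null(x$latitude)) NA else x$latitude)
}))

# left join: every input row keeps its id; unmatched addresses get NA coordinates
out <- merge(dff, geocode, by = "address", all.x = TRUE)
write.csv(out, file = "C:/Users/name/Documents/geocodetest.csv", row.names = FALSE)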