Web Scraping in R -- readHTMLTable has table names as NULL

Here's my code for reading the tables, but the tables that come back have NULL names. Is there a better method for finding the land area of each state in square miles, without the commas in the numbers? My idea was to extract the second table and convert it to a data.frame, but now that the tables have NULL names I am not sure what to do, or whether there is a better approach.
require("XML")
url <- "http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
wiki_page <- readLines(url)
length(wiki_page)
tables <- readHTMLTable(url)
Here's a sample output:
> tables
$`NULL`
Rank State km² miles²
1 1 Alaska 1,717,854 663,267
2 2 Texas 696,621 268,581
3 3 California 423,970 163,696
4 4 Montana 380,838 147,042
5 5 New Mexico 314,915 121,589
6 6 Arizona 295,254 113,998
7 7 Nevada 286,351 110,561
8 8 Colorado 269,601 104,094
9 9 Oregon 254,805 98,381
....

You should read the section headings from the page and assign them as names to the tables:
library(XML)
url <- "http://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"
doc <- htmlParse(url)
## the XPath attribute selector is @class; [-4] drops the extra headline so the names line up with the three tables
nn <- xpathSApply(doc, '//*[@class="mw-headline"]', xmlValue)[-4]
tabs <- readHTMLTable(url)
names(tabs) <- nn
Check the result:
str(tabs, max.level = 1)
# List of 3
# $ Total area:'data.frame': 50 obs. of 4 variables:
# $ Land area :'data.frame': 50 obs. of 4 variables:
# $ Water area:'data.frame': 50 obs. of 5 variables:
Numeric conversion (strip the commas, then convert):
convert_num <- function(x) as.numeric(gsub(',', '', x))
tabs <- lapply(tabs, function(y) {
  y[, -c(1, 2)] <- sapply(y[, -c(1, 2)], convert_num)
  y
})
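With the converted tables, the original question (land area of each state in square miles, without commas) can be answered by indexing the named list. A small sketch, assuming the "Land area" table has the same column order as the sample output above (Rank, State, km², miles²):
## column 4 is the square-miles column; it is numeric after convert_num
land <- tabs[["Land area"]]
land_sqmi <- land[, 4]
head(land_sqmi)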

Related

How can I store a value in a name?

I use the neotoma package to get data for a geographical site, which is identified by an ID. What I want to do is "store" that number in a name, like Sitenum, so that I only need to write the ID down once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing, like this:
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c() coerces temp to character when it is combined with the character strings "other" and "that".
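Applied back to the neotoma object from the question, the same indexing works. A sketch, using the site ID and the structure shown in str(Site) above:
Sitenum <- 20131
Site <- get_download(Sitenum, verbose = TRUE)
## index with [[ ]] and the ID converted to character; $Sitenum would look for an element literally named "Sitenum"
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)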

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "-" appears 2 times):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-" not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick; however, it converts the user names to numeric codes.
=========================================================================
I am new to this; I have tried looking at previous posts on the topic (including the one below), but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions such as as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your duplicate removal works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data <- unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
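For example, a minimal sketch with a placeholder file name:
## "yourdata.csv" is a placeholder for your actual file
data1 <- read.csv("yourdata.csv", stringsAsFactors = FALSE)
testclean <- data1[!duplicated(data1), ]   # no factor warning, since userName is now character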

Raw data: reading attributes with varied number of spaces in R

I am trying to read in the Baylor dataset but I can't use read.csv since the spaces are not consistent.
I do have the column numbers so I was thinking read.fwf would help fix my issue but that means I have to review more than 100 attributes and check the line widths.
Is there an easier way to read the data?
baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=F)
Column Numbers
Baylor Religion 2007 Survey Data
I haven't tested carefully, but I think this does it:
Define URLs:
lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"
Read file with column info:
nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)
Extract starting column for each field:
startcol <- as.numeric(            ## convert to numeric
  sapply(
    strsplit(nums[, 3], "-"),      ## split strings on dashes
    "[", 1))                       ## select first element of each result
## sapply(z, "[", 1) is the same as sapply(z, function(x) x[1])
Field widths are differences (assume last field is length 1):
w <- c(diff(startcol),1)
Read fixed width:
r <- read.fwf(url(survey_url),widths=w)
Assign field names:
names(r) <- gsub(":","",nums$COL)
Some quick checks:
str(r[,1:8])
## 'data.frame': 1648 obs. of 8 variables:
## $ ID : num 1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
## $ WEIGHT : num 0.822 0.312 1.604 1.184 1.35 ...
## $ REGION : int 3 3 4 3 2 2 2 4 2 2 ...
## $ RELIG1 : int 12 12 46 45 14 31 16 33 16 16 ...
## $ RELIG2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ DENOM : Factor w/ 301 levels " ",..: 231 231 1 1 1 1 83 113 1 23 ...
## $ RELGIOUS: int 3 4 1 3 3 4 4 4 3 4 ...
## $ ATTEND : int 5 8 0 8 3 0 8 7 1 8 ...
tail(sort(levels(r$DENOM)))
## [1] " RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] " ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] " WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] " THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] " GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"
Some more processing (e.g. stripping white space in the denominations) might be in order, and I would certainly further check these results, but this should get you most of the way there.
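For instance, the leading white space in DENOM could be stripped like this (a small sketch; trimws() is in base R from version 3.2.0 on):
## drop the factor and remove leading/trailing spaces from the denomination strings
r$DENOM <- trimws(as.character(r$DENOM))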
For future reference it might be worth downloading the data from the original download site and checking cross-tabulations against the code book ...

Plotting great circles from a subset in R

I have a data frame that, after some processing (geocoding, for example), has the following characteristics:
'data.frame': 13 obs. of 5 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ ciudad : Factor w/ 10 levels "Auch","Barcelona",..: 8 4 5 3 2 7 9 10 6 6 ...
$ proyecto: int 1 1 1 1 1 1 1 1 2 2 ...
$ lon : num -1.131 0.564 -9.139 0.627 2.173 ...
$ lat : num 38 44.2 38.7 44.5 41.4 ...
Every project (proyecto) has a list of cities, and I need to connect the first of them radially to the others in the same project. This is what I have done so far:
# Needed packages
library(stringi)  # stri_trans_totitle
library(ggmap)    # geocode
library(maps)     # map
# Capitalizing first letters
municipios <- read.csv("ciudades.csv", header=TRUE, sep=";")
municipios$ciudad <- stri_trans_totitle(as.character(municipios$ciudad))  # assign the result, or the change is lost
write.csv(municipios, file = "municipios.csv")
# Obtaining latitude & longitude
lonlat <- geocode(as.character(municipios$ciudad))
municipios_lonlat <- cbind(municipios, lonlat)
write.csv(municipios_lonlat, file = "municipios_lonlat.csv")
str(municipios_lonlat)
# Plotting a simple map
xlim <- c(-13.08, 8.68)
ylim <- c(34.87, 49.50)
map("world", col="#191919", fill=TRUE, bg="#000000", lwd=0.05, xlim=xlim, ylim=ylim)
# Plotting cities
symbols(municipios_lonlat$lon, municipios_lonlat$lat, bg="#e2373f", fg="#ffffff", lwd=0.5, circles=rep(1, length(municipios_lonlat$lon)), inches=0.05, add=TRUE)
# Subsetting, splitting & connecting
uniq <- unique(municipios_lonlat$proyecto)
for (i in seq_along(uniq)) {
  data_1 <- subset(municipios_lonlat, proyecto == uniq[i])
  for (j in 2:nrow(data_1)) {   # use a different index than the outer loop; 2:n connects city 1 to each of the others
    lngs <- c(data_1$lon[1], data_1$lon[j])
    lats <- c(data_1$lat[1], data_1$lat[j])
    lines(lngs, lats, col="#e2373f", lwd=2)
  }
}
But it does not look quite right, so I need to use great circles to improve the resulting map. I know I have to use the geosphere library, with a loop similar to the one in the last block, but the things I tried did not work. Please could you help me? You are my only hope, Obi-Wan Kenobi.
Note: here you can download my data.
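A minimal sketch of how the final loop could draw great-circle arcs with geosphere::gcIntermediate instead of straight lines (this assumes the municipios_lonlat data frame built above, and is untested against the linked data):
library(geosphere)
uniq <- unique(municipios_lonlat$proyecto)
for (i in seq_along(uniq)) {
  data_1 <- subset(municipios_lonlat, proyecto == uniq[i])
  for (j in 2:nrow(data_1)) {
    ## interpolate points along the great circle between the first city and city j
    gc <- gcIntermediate(c(data_1$lon[1], data_1$lat[1]),
                         c(data_1$lon[j], data_1$lat[j]),
                         n = 50, addStartEnd = TRUE)
    lines(gc, col = "#e2373f", lwd = 2)
  }
}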

Histogram of Weekdays by Year R

I have a .csv file that I have loaded into R using the following basic command:
lace <- read.csv("lace for R.csv")
It pulls in my data just fine. Here is the str of the data:
str(lace)
'data.frame': 2054 obs. of 20 variables:
$ Admission.Day : Factor w/ 872 levels "1/1/2013","1/10/2011",..: 231 238 238 50 59 64 64 64 67 67 ...
$ Year : int 2010 2010 2010 2011 2011 2011 2011 2011 2011 2011 ...
$ Month : int 12 12 12 1 1 1 1 1 1 1 ...
$ Day : int 28 30 30 3 4 6 6 6 7 7 ...
$ DayOfWeekNumber : int 3 5 5 2 3 5 5 5 6 6 ...
$ Day.of.Week : Factor w/ 7 levels "Friday","Monday",..: 6 5 5 2 6 5 5 5 1 1 ...
What I am trying to do is create three (3) different histograms and then plot them all together in one figure. I want a histogram for each year, where the x axis labels will be the days of the week, starting with Sunday and ending on Saturday.
First, how would I go about creating a histogram from factors, which is how the days of the week are stored?
Second, how do I create a histogram of the days of the week for a given year?
I have tried using the approach from the post linked here but cannot get it working. I use Admission.Day as the variable and get an error message:
dat <- as.Date(lace$Admission.Day)
Error in charToDate(x) : character string is not in a standard unambiguous format
Thank you,
Expanding on the comment above: the problem seems to be with importing dates, rather than making the histogram. Assuming there is an Excel workbook "lace for R.xlsx" with a sheet "lace":
## Not tested...
library(XLConnect)
myData <- "lace for R.xlsx" # NOTE: need path also...
wb <- loadWorkbook(myData)
lace <- readWorksheet(wb, sheet="lace")
lace$Admission.Day <- as.Date(lace$Admission.Day)
should provide dates that work with all the R date functions. Also, the lubridate package provides a number of functions that are more intuitive to use than format(...).
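If re-importing from Excel is not an option, a small sketch of parsing the existing column in place (assuming the levels really are month/day/year text, as "1/1/2013" in str(lace) suggests):
## convert the factor to character first, then parse with an explicit format
lace$Admission.Day <- as.Date(as.character(lace$Admission.Day), format = "%m/%d/%Y")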
Then, as an example:
library(lubridate) # for year(...) and wday(...)
library(ggplot2)
# random dates around Jun 1, across 5 years...
set.seed(123)
lace <- data.frame(date=as.Date(rnorm(1000,sd=50)+365*(0:4),origin="2008/6/1"))
lace$year <- factor(year(lace$date))
lace$dow <- wday(lace$date, label=T)
# This creates the histograms...
ggplot(lace) +
  geom_histogram(aes(x=dow, fill=year)) +    # fill color by year
  facet_grid(~year) +                        # facet by year
  theme(axis.text.x=element_text(angle=90))  # to rotate weekday names...
This produces one histogram of weekday counts per year, in side-by-side facets.
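Applied to the original lace data frame from the question rather than the simulated one, a sketch using the column names from str(lace) above (geom_bar() counts a discrete variable directly, and the factor levels are reordered so the week runs Sunday through Saturday):
lace$Day.of.Week <- factor(lace$Day.of.Week,
                           levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                      "Thursday", "Friday", "Saturday"))
ggplot(lace) +
  geom_bar(aes(x = Day.of.Week, fill = factor(Year))) +
  facet_grid(~ Year) +
  theme(axis.text.x = element_text(angle = 90))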
