Converting scraped R data using readHTMLTable()

Converting scraped R data using readHTMLTable() - r

I'm trying to scrape this website http://www.hockeyfights.com/fightlog/ but having hard time putting the into a nice data frame. So far I have this:
> asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
> asdf.asdf <- readHTMLTable(asdf)
Then I get this giant list. How do I convert this into a 2 column dataframe that has only player names (who were in a fight) with n rows (number of fights)?
Thanks for your help in advance.

Is this the output you're after?
require(RCurl); require(XML)
asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
asdf.asdf <- readHTMLTable(asdf)
First, make a table of each player and the count of fights they've been in...
# get variable with player names
one <- as.character(na.omit(asdf.asdf[[1]]$V3))
# get counts of how many times each name appears
two <- data.frame(table(one))
# remove non-name data
three <- two[two$one != 'Away / Home Player',]
# check
head(three)
one Freq
1 Aaron Volpatti 1
3 Brandon Bollig 1
4 Brian Boyle 1
5 Brian McGrattan 1
6 Chris Neil 2
7 Colin Greening 1
Second, make a table of who is in each fight...
# make data frame of pairs by subsetting the vector of names
four <- data.frame(away = one[seq(2, length(one), 3)],
home = one[seq(3, length(one), 3)])
# check
head(four)
away home
1 Brian Boyle Zdeno Chara
2 Tom Sestito Chris Neil
3 Dale Weise Mark Borowiecki
4 Brandon Bollig Brian McGrattan
5 Scott Hartnell Eric Brewer
6 Colin Greening Aaron Volpatti

Related

htmlparsing in a loop with a large list (18000 urls)

I'm new to R and have been trying to do some figure out what I can do to move this along. I know loops are not the best thing to use, but it's all I can figure out. I've searched the here and the net, and am seeing options like tapply, but I can't figure out if it's something I'm doing wrong, or if tapply isn't compatible with this type of data. I think it's the latter, but I'm new and what do I know. HA!
I have a data.frame that holds all the players that have played from a previous parse that amounts to over 18000 rows. The script below takes that url and scrapes another URL if they played last year. Is there anything a I can do to make this quicker or less of a memory pig, as it routinely pegs my ram at 99% after around 15 minutes? Thanks for any help!
#GET YEARS PLAYED LINKS
yplist = NULL
playerURLs <- paste("http://www.baseball-reference.com",datafile[,c("hrefs")],sep="")
for(thisplayerURL in playerURLs){
doc <- htmlParse(thisplayerURL)
yplinks <- data.frame(
names = xpathSApply(doc, '//*[#id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a',xmlValue),
hrefs = xpathSApply(doc, '//*[#id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a',xmlGetAttr,'href'))
yplist = rbind(yplist, yplinks)
}
yplist[,c("hrefs")]
Example datafile list in playerURLs (there are 2 mel queens, is different)
X names hrefs
1 1 Jason Kipnis /players/k/kipnija01.shtml
2 2 Tom Qualters /players/q/qualtto01.shtml
3 3 Paul Quantrill /players/q/quantpa01.shtml
4 4 Bill Quarles /players/q/quarlbi01.shtml
5 5 Billy Queen /players/q/queenbi01.shtml
6 6 Mel Queen /players/q/queenme01.shtml
7 7 Mel Queen /players/q/queenme02.shtml
If anyone of those guys played in 2014 my script above would return a data.frame that looks like the following
X names hrefs
1 1 Jason Kipnis players/gl.cgi?id=kipnija01&t=b&year=2014
2 2 Tom Qualters /players/gl.cgi?id=qualtto01&t=b&year=2014
3 3 Paul Quantrill /players/gl.cgi?id=quantpa01&t=b&year=2014
4 4 Bill Quarles /players/gl.cgi?id=quarlbi01&t=b&year=2014
5 5 Billy Queen /players/gl.cgi?id=queenbi01&t=b&year=2014
6 6 Mel Queen /players/gl.cgi?id=queenme01&t=b&year=2014
7 7 Mel Queen /players/gl.cgi?id=queenme02&t=b&year=2014

Using R, Randomly Assigning Students Into Groups Of 4

I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function but I'm quickly getting over my head. Any suggestions? Here is sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.

I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.

I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)

Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
*Ran the group number generator and cbind in alternating order until I got through the whole set. (Wanted to make sure the order of the numbers was unique for each section).
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.

I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in a easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here's with base r subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2

If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complimentary set

How to use R to check data consistency (make sure no contradiction between case and value)?

Let's say I have:
Person Movie Rating
Sally Titanic 4
Bill Titanic 4
Rob Titanic 4
Sue Cars 8
Alex Cars **9**
Bob Cars 8
As you can see, there is a contradiction for Alex. All the same movies should have the same ranking, but there was a data error entry for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in excel or something? Is there a command on R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?
Thanks

Here's a data.table solution
Define the function
Myfunc <- function(x) {
temp <- table(x)
names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or If you want to remove the "bad" ratings
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8

It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group), along with the "Mode" function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
Movie = rep(c("Titanic", "Cars"), each = 3),
Rating = c(4, 4, 4, 8, 9, 8))

If the goal is to see if all the values within a group are the same (or if there are some differences) then this can be a simple application of tapply (or aggregate, etc.) used with a function like var (or compute the range). If all the values are the same then the variance and range will be 0. If it is any other value (outside of rounding error) then there must be a value that is different. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN=var)
which(.Last.value > 0.00001)
tapply(dat$Rating, dat$Movie, FUN=function(x)diff(range(x)))
which(.Last.value != 0)
which( abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)
which.max( abs(dat$Rating - ave(dat$Rating, dat$Movie)) )
dat[.Last.value,]

I would add a variable for mode so I can see if there is anything weird going on with the data, like missing data, text, many different answers instead of the rare anomaly,etc. I used "x" as your dataset
# one of many functions to find mode, could use any other
modefunc <- function(x){
names(table(x))[table(x)==max(table(x))]
}
# add variable for mode split by Movie
x$mode <- ave(x = x$Rating,x$Movie,FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If you want another function for mode, try other functions for mode

Manipulate factor list in R [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I have a vector in R that is a factor list, a list of 256 nfl teams. I need to change every team name from "Washington Redskins" into "WAS" or "New England Patriots" into "NE". What is the best technique for this type of problem. I'm sure this is something easy so don't beat me up on this one.

You could read the acronyms from a web page and match the team names against yours.
Here's one example.
library(XML)
tab <- readHTMLTable("http://sportsdelve.wordpress.com/abbreviations/")[[1]]
head(tab)
# V1 V2
# 1 ARZ Arizona Cardinals
# 2 ATL Atlanta Falcons
# 3 BAL Baltimore Ravens
# 4 BALC Baltimore Colts
# 5 BCLT Baltimore Colts (1950)
# 6 BALCLT Baltimore Colts (AAFC)
And you can use regular expression matching to find your teams...
tab[grepl("WAS|NE", tab[[1]]), ]
# V1 V2
# 38 NE New England Patriots
# 58 WAS Washington Redskins

One way is to have a dictionary, i.e. a file with each full name and each short name. You can then match this file to your full names, using the full names as the ID for the match.
Example:
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash")) ## needs to be a data frame in order for plyr::join to work
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd")) ## the dictionary; one row per unique name
matched <- plyr::join(x = full.names, y = dic, by = "full") ## using join from the plyr package
Output:
full short
1 wash ww
2 wash ww
3 denv dd
4 denv dd
5 wash ww

'merge' command also works: (Using Chaconne's data here)
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash"))
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd"))
merge(full.names,dic)
full short
1 denv dd
2 denv dd
3 wash ww
4 wash ww
5 wash ww

You can just change the levels directly
levels(team)
will list the order of the levels assigned to your factor
levels(team) <- c("ARZ","ATL", ...)
will change the labels.

Looping through csv and dumping to data frame in R

Using R, I am trying to take a csv file, loop through it, extract values, and dump them into a data frame. There are four columns in the csv: ID, UG_inst, Freq, and Year. Specifically, I want to loop through the UG_inst column by institution name for each year (2010-11,2011-12,2012-13,and 2013-14) and put the value at that cell into the respective "cell" in the R data frame. Right now, the csv just has a Year column, but the data frame I've created has a column for each year. The ultimate idea is to be able to create bar graphs representing the frequency per institution per year. Currently, the code below throws up NO errors, but appears to do nothing to the R data frame "j".
A couple of caveats: 1) Doing a nested for loop was making my head spin, so I decided to just use 2010-11 for now and just loop through the institution name. Since there are only 4 years, I can rewrite this four times, each time with a different year. 2) Also, in the csv, there are repeat names. So, if an institution name appears twice (will be adjacent rows in the csv due to alphabetical arrangement), is there a way to dump the SUM of these into the data frame in R?
All relevant info is below. Thanks so much for any help!!!!
Here is a link to the .csv file: https://www.dropbox.com/s/9et7muchkrgtgz7/UG_inst_ALL.csv
And here is the R code I am trying:
abc <- read.csv(insert file path to above csv here)
inst_string <- unique(abc$UG_inst)
j <- data.frame("UG_inst"=inst_string,"2010-11"=NA,"2011-12"=NA,"2012-13"=NA,"2013-14"=NA)
for (i in inst_string) {
inst.index <- which(abc$UG_inst == i && abc$Year == "2010-11")
j$X2010.11[j$Ug_inst==i] <- abc$Freq[inst.index]
}

Instead of using a nested loop (or a loop at all) I suggest using the reshape() function in base R.
abc <- read.csv("UG_inst_ALL.csv")
abc <- abc[2:4]
reshape(data = abc,
v.names = "Freq",
timevar = "Year",
idvar = "UG_inst",
direction = "wide")

This is known as "reshaping" your data, and you are going from a "long" format to a "wide" format.
In addition to base R's reshape function, here are a few other options to consider.
I'll assume that we are starting with data read in like the following.
abc <- read.csv("~/Downloads/UG_inst_ALL.csv", row.names = 1)
head(abc)
# UG_inst Freq Year
# 1 Abilene Christian University 0 2010-11
# 2 Adams State University 0 2010-11
# 3 Adrian College 1 2010-11
# 4 Agnes Scott College 0 2010-11
# 5 Alabama A&M University 1 2010-11
# 6 Albion College 1 2010-11
Option 1: xtabs
out <- as.data.frame.matrix(xtabs(Freq ~ UG_inst + Year, abc))
head(out)
# 2010-11 2011-12 2012-13 2013-14
# Abilene Christian University 0 1 0 0
# Adams State University 0 0 0 1
# Adrian College 1 0 0 0
# Agnes Scott College 0 0 1 0
# Alabama A&M University 1 3 1 2
# Albion College 1 0 0 0
Option 2: dcast from "reshape2"
library(reshape2)
head(dcast(abc, UG_inst ~ Year, value.var = "Freq"))
Option 3: spread from "tidyr"
library(dplyr)
library(tidyr)
abc %>% select(-X) %>% group_by(UG_inst) %>% spread(Year, Freq)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Converting scraped R data using readHTMLTable() - r

Related

htmlparsing in a loop with a large list (18000 urls)

Using R, Randomly Assigning Students Into Groups Of 4

How to use R to check data consistency (make sure no contradiction between case and value)?

Manipulate factor list in R [closed]

Looping through csv and dumping to data frame in R

Categories

Resources