htmlparsing in a loop with a large list (18000 urls) - r

I'm new to R and have been trying to figure out how to move this along. I know loops are not the best thing to use, but they're all I can figure out. I've searched here and around the net and am seeing options like tapply, but I can't tell whether I'm doing something wrong or whether tapply just isn't compatible with this type of data. I think it's the latter, but I'm new, so what do I know. HA!
I have a data.frame, built by a previous parse, that holds all the players that have played; it amounts to over 18000 rows. The script below takes each player's URL and scrapes another URL from that page if they played last year. Is there anything I can do to make this quicker or less of a memory pig? It routinely pegs my RAM at 99% after around 15 minutes. Thanks for any help!
#GET YEARS PLAYED LINKS
yplist = NULL
playerURLs <- paste("http://www.baseball-reference.com",datafile[,c("hrefs")],sep="")
for(thisplayerURL in playerURLs){
doc <- htmlParse(thisplayerURL)
yplinks <- data.frame(
names = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a',xmlValue),
hrefs = xpathSApply(doc, '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a',xmlGetAttr,'href'))
yplist = rbind(yplist, yplinks)
}
yplist[,c("hrefs")]
Example rows from datafile, the source of playerURLs (there are two Mel Queens; their hrefs differ):
X names hrefs
1 1 Jason Kipnis /players/k/kipnija01.shtml
2 2 Tom Qualters /players/q/qualtto01.shtml
3 3 Paul Quantrill /players/q/quantpa01.shtml
4 4 Bill Quarles /players/q/quarlbi01.shtml
5 5 Billy Queen /players/q/queenbi01.shtml
6 6 Mel Queen /players/q/queenme01.shtml
7 7 Mel Queen /players/q/queenme02.shtml
If any of those players played in 2014, my script above would return a data.frame that looks like the following:
X names hrefs
1 1 Jason Kipnis players/gl.cgi?id=kipnija01&t=b&year=2014
2 2 Tom Qualters /players/gl.cgi?id=qualtto01&t=b&year=2014
3 3 Paul Quantrill /players/gl.cgi?id=quantpa01&t=b&year=2014
4 4 Bill Quarles /players/gl.cgi?id=quarlbi01&t=b&year=2014
5 5 Billy Queen /players/gl.cgi?id=queenbi01&t=b&year=2014
6 6 Mel Queen /players/gl.cgi?id=queenme01&t=b&year=2014
7 7 Mel Queen /players/gl.cgi?id=queenme02&t=b&year=2014
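One possible direction, offered as a sketch and not a tested fix for the real site: growing `yplist` with `rbind` on every iteration copies the whole data frame each time, and the documents returned by `htmlParse` are held in C-level memory until they are released. The version below collects one small data frame per page in a list, binds them once at the end, and calls `free(doc)` after each page. The helper name `scrape_2014` and the two inline HTML snippets are made up; the snippets only mimic the structure the XPath assumes, so the example runs without hitting the network.

```r
library(XML)

# XPath from the question, written with @id (attribute test).
path_2014 <- '//*[@id="all_standard_batting"]/div/ul/li[2]/ul/li[. = "2014"]/a'

# Hypothetical helper: parse one page, pull out the 2014 link(s),
# release the parsed document, and return a small data frame
# (or NULL when the player has no 2014 entry).
scrape_2014 <- function(page, asText = FALSE) {
  doc <- htmlParse(page, asText = asText)
  nm <- unlist(xpathSApply(doc, path_2014, xmlValue))
  hr <- unlist(xpathSApply(doc, path_2014, xmlGetAttr, "href"))
  free(doc)  # explicitly release the C-level document
  if (length(nm) == 0) return(NULL)
  data.frame(names = nm, hrefs = hr, stringsAsFactors = FALSE)
}

# Made-up stand-ins for two player pages (one with a 2014 entry, one without).
played <- '<div id="all_standard_batting"><div><ul><li>Years</li><li><ul><li><a href="/players/gl.cgi?id=kipnija01&amp;t=b&amp;year=2014">2014</a></li></ul></li></ul></div></div>'
not_played <- '<div id="all_standard_batting"><div><ul><li>Years</li><li><ul><li><a href="/players/gl.cgi?id=qualtto01&amp;t=b&amp;year=1958">1958</a></li></ul></li></ul></div></div>'

# Collect per-page results in a list, then bind once at the end.
results <- lapply(c(played, not_played), scrape_2014, asText = TRUE)
yplist <- do.call(rbind, results)
yplist
```

With the real data the call would presumably be `lapply(playerURLs, scrape_2014)`. Whether this is enough to keep memory flat over 18000 pages is something to measure, not something this sketch guarantees.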

Related

R - Finding identical rows or rows that only differ by x columns

I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differ by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
[1] 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work OK, but when the number of modes of transportation and people is large, it becomes really difficult to know not just which rows are duplicates but which rows actually match each other. In this case, that means knowing that rows 2, 4 and 6 are identical and that rows 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,")
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get the membership list of any particular IdenticalGrp, you can pull it like this:
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
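For the second part of the question (treating people as a match when their modes of transportation differ in at most one position), one possible sketch is to turn the X marks into a 0/1 matrix and count the mismatches between every pair of people. This uses base R only; the names `m`, `d`, `idx` and `near` are made up for illustration, and the threshold of 1 is the one from the question.

```r
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", stringsAsFactors = FALSE)

# One row per person, one column per mode of transportation; 1 = owns it.
m <- t(ex[, -1] == "X") * 1
colnames(m) <- ex$Type

# Manhattan distance between 0/1 rows = number of positions that differ.
d <- as.matrix(dist(m, method = "manhattan"))

# Keep each pair once (upper triangle) when it differs in at most one position.
idx <- which(d <= 1 & upper.tri(d), arr.ind = TRUE)
near <- data.frame(a = rownames(d)[idx[, "row"]],
                   b = colnames(d)[idx[, "col"]],
                   differences = d[idx])
near
```

The result is a list of pairs, not groups, because "differs by at most one" is not transitive: A can be within one of B and B within one of C while A and C differ by two.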

Summing matched values in two different dataframes

Extremely new to R and coding in general. My intuition is that this should have a very basic answer, so feel free to send me back to basic intro class if this is too basic to spend your time on.
To make things easier I will reduce my problem to a much more simple situation with the same salient features.
I have two dataframes. The first shows how many games some people played as "white". The second shows how many games some people played as "black". Some players played both as white and black; others played in only one of these roles.
I would like to merge these two dataframes into one showing all players who have played in either role and how many total games they played, whether as white or black.
A reproducible example:
player_as_white <- c('John', 'Max', 'Grace', 'Zoe', 'Peter')
games_white <- c(sample(1:20,5))
dat1 <- data.frame(player_as_white, games_white)
player_as_black <- c('John', 'Eddie', 'Zoe')
games_black <- c(sample(1:20, 3))
dat2 <- data.frame(player_as_black, games_black)
How do I get a consolidated dataset showing how many total games all 6 players have played, whether as white or black?
Thanks!
For reproducibility, it's good practice to specify a random seed so the example works the same each time you run it, and for others. I'd also suggest using stringsAsFactors = FALSE so that the names are treated as characters and not factors, which will make this a little simpler. (edit: But it should work fine here with the default, too.)
set.seed(0)
player_as_white <- c('John', 'Max', 'Grace', 'Zoe', 'Peter')
games_white <- c(sample(1:20,5))
dat1 <- data.frame(player_as_white, games_white, stringsAsFactors = FALSE)
player_as_black <- c('John', 'Eddie', 'Zoe')
games_black <- c(sample(1:20, 3))
dat2 <- data.frame(player_as_black, games_black, stringsAsFactors = FALSE)
Then we can use merge to combine the two:
merge(dat1, dat2, by.x = "player_as_white", by.y = "player_as_black", all = TRUE)
# player_as_white games_white games_black
#1 Eddie NA 18
#2 Grace 7 NA
#3 John 18 5
#4 Max 6 NA
#5 Peter 15 NA
#6 Zoe 10 19
Or a dplyr solution, which keeps the order from dat1
library(dplyr)
full_join(dat1, dat2, by = c("player_as_white" = "player_as_black"))
# player_as_white games_white games_black
#1 John 18 5
#2 Max 6 NA
#3 Grace 7 NA
#4 Zoe 10 19
#5 Peter 15 NA
#6 Eddie NA 18
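Neither join adds the two columns up, which is what the question asks for at the end. A minimal sketch of that last step follows. It hard-codes the values printed above instead of calling `sample()`, because R 3.6.0 changed the default sampling algorithm and the same seed may give different numbers on a newer version. The column names `player` and `total_games` are made up here.

```r
dat1 <- data.frame(player_as_white = c("John", "Max", "Grace", "Zoe", "Peter"),
                   games_white = c(18, 6, 7, 10, 15),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(player_as_black = c("John", "Eddie", "Zoe"),
                   games_black = c(5, 18, 19),
                   stringsAsFactors = FALSE)

# Full outer join: keep players who appear in either data frame.
both <- merge(dat1, dat2,
              by.x = "player_as_white", by.y = "player_as_black", all = TRUE)
names(both)[1] <- "player"

# Treat a missing count as zero games when summing across the two roles.
both$total_games <- rowSums(both[, c("games_white", "games_black")], na.rm = TRUE)
both[, c("player", "total_games")]
# John: 18 + 5 = 23, Zoe: 10 + 19 = 29, Eddie: 18 (black only)
```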

R Looping: Assign record to class with least existing records

I have a group of individuals that I am distributing items to in an effort to move toward even distribution of total items across individuals.
Each individual can receive only certain item types.
The starting distribution of items is not equal.
The number of available items of each type is known, and must fully be exhausted.
df contains an example format for the person data. Note that Chuck has 14 items total, not 14 bats and 14 gloves.
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),startingtotalitemspossessed=c(14,9,7,12,9,7,12,14))
otherdf contains an example format for the items and the number needing assignment
otherdf<-data.frame(item=c("bat","ball","glove"),numberneedingassignment=c(3,4,7))
Is there a best method for coding this form of item distribution? I imagine the steps to be:
Check which person that can receive a given item has the lowest total items assigned. Break a tie at random.
Assign 1 of the given item to this person.
Update the startingtotalitemspossessed for the person receiving the item.
Update the remaining number of the item left to assign.
Stop this loop for a given item if the total remaining is 0, and move to the next item.
Below is a partial representation of how I'd imagine this working, as a view inside the loop, left to right.
Note: The number of items and people is very large. If possible, a method that would scale to any given number of people or items would be ideal!
Thank you in advance for your help!
I'm sure there are better ways, but here is an example:
df<-data.frame(person=c("Chuck","Walter","Mickey","Vince","Walter","Mickey","Vince","Chuck"),
alloweditem=c("bat","bat","bat","bat","ball","ball","glove","glove"),
total=c(14,9,7,12,9,7,12,14))
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 9
## 3 Mickey bat 7
## 4 Vince bat 12
## 5 Walter ball 9
## 6 Mickey ball 7
## 7 Vince glove 12
## 8 Chuck glove 14
otherdf<-data.frame(item=c("bat","ball","glove"),
numberneedingassignment=c(3,4,7))
# Items in queue
queue <- rep(otherdf$item, otherdf$numberneedingassignment)
for (i in seq_along(queue)) {
  # Rows of df whose person is allowed to receive this item
  eligible <- df$alloweditem == queue[i]
  # Person with the lowest total among those rows (the first row wins ties)
  personToBeAssigned <- df$person[eligible & df$total == min(df$total[eligible])][1]
  # Add one item to that person's row for this item type
  target <- eligible & df$person == personToBeAssigned
  df$total[target] <- df$total[target] + 1
}
print(df)
## person alloweditem total
## 1 Chuck bat 14
## 2 Walter bat 10
## 3 Mickey bat 9
## 4 Vince bat 12
## 5 Walter ball 10
## 6 Mickey ball 10
## 7 Vince glove 17
## 8 Chuck glove 16
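Two things in the loop above differ from what the question describes: ties always go to the first matching row instead of being broken at random, and each (person, item) row keeps its own total, so a bat given to Mickey is not counted when the balls are handed out. Here is a sketch of one way to handle both, assuming every person has a single running total. The names `allowed`, `totals`, `to_assign` and `assigned` are invented here.

```r
set.seed(1)  # only so the random tie-breaks are repeatable

allowed <- data.frame(
  person = c("Chuck", "Walter", "Mickey", "Vince", "Walter", "Mickey", "Vince", "Chuck"),
  alloweditem = c("bat", "bat", "bat", "bat", "ball", "ball", "glove", "glove"),
  stringsAsFactors = FALSE)

# One running total per person, and the number of each item still to hand out.
totals <- c(Chuck = 14, Walter = 9, Mickey = 7, Vince = 12)
to_assign <- c(bat = 3, ball = 4, glove = 7)

# Record of who received how many of each item.
assigned <- matrix(0, nrow = length(totals), ncol = length(to_assign),
                   dimnames = list(names(totals), names(to_assign)))

for (item in names(to_assign)) {
  eligible <- allowed$person[allowed$alloweditem == item]
  for (k in seq_len(to_assign[[item]])) {
    # Everyone tied for the lowest current total among eligible people
    lowest <- eligible[totals[eligible] == min(totals[eligible])]
    # Break ties at random
    pick <- lowest[sample.int(length(lowest), 1)]
    totals[pick] <- totals[pick] + 1
    assigned[pick, item] <- assigned[pick, item] + 1
  }
}

totals
assigned
```

This still hands out one item per iteration, so it scales linearly with the number of items. For very large inputs, a priority queue or a batch "fill up to a common level" calculation might be worth exploring, but that is beyond this sketch.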

Manipulate factor list in R [closed]

I have a vector in R that is a factor: a list of 256 NFL teams. I need to change every team name, e.g. "Washington Redskins" into "WAS" or "New England Patriots" into "NE". What is the best technique for this type of problem? I'm sure this is something easy, so don't beat me up on this one.
You could read the acronyms from a web page and match the team names against yours.
Here's one example.
library(XML)
tab <- readHTMLTable("http://sportsdelve.wordpress.com/abbreviations/")[[1]]
head(tab)
# V1 V2
# 1 ARZ Arizona Cardinals
# 2 ATL Atlanta Falcons
# 3 BAL Baltimore Ravens
# 4 BALC Baltimore Colts
# 5 BCLT Baltimore Colts (1950)
# 6 BALCLT Baltimore Colts (AAFC)
And you can use regular expression matching to find your teams...
tab[grepl("WAS|NE", tab[[1]]), ]
# V1 V2
# 38 NE New England Patriots
# 58 WAS Washington Redskins
One way is to have a dictionary, i.e. a file with each full name and each short name. You can then match this file to your full names, using the full names as the ID for the match.
Example:
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash")) ## needs to be a data frame in order for plyr::join to work
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd")) ## the dictionary; one row per unique name
matched <- plyr::join(x = full.names, y = dic, by = "full") ## using join from the plyr package
Output:
full short
1 wash ww
2 wash ww
3 denv dd
4 denv dd
5 wash ww
'merge' command also works: (Using Chaconne's data here)
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash"))
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd"))
merge(full.names,dic)
full short
1 denv dd
2 denv dd
3 wash ww
4 wash ww
5 wash ww
You can just change the levels directly
levels(team)
will list the order of the levels assigned to your factor
levels(team) <- c("ARZ","ATL", ...)
will change the labels.
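Assigning `levels(team) <- c(...)` depends on typing the abbreviations in exactly the order that `levels(team)` prints them. Here is a sketch of a variant that looks each level up by name instead. The three-element `team` factor and the `lookup` vector are toy stand-ins, not the asker's data.

```r
team <- factor(c("Washington Redskins", "New England Patriots", "Washington Redskins"))

# Named vector: full name -> abbreviation.
lookup <- c("Washington Redskins" = "WAS", "New England Patriots" = "NE")

# Fail early if some team has no abbreviation in the lookup.
stopifnot(all(levels(team) %in% names(lookup)))

# Replace each level by its abbreviation, matched by name and not by position.
levels(team) <- unname(lookup[levels(team)])
as.character(team)
```

Because only the levels are touched, this costs the same for a 256-element vector as for a much longer one.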

Converting scraped R data using readHTMLTable()

I'm trying to scrape this website, http://www.hockeyfights.com/fightlog/, but I'm having a hard time putting the data into a nice data frame. So far I have this:
> asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
> asdf.asdf <- readHTMLTable(asdf)
Then I get this giant list. How do I convert this into a 2 column dataframe that has only player names (who were in a fight) with n rows (number of fights)?
Thanks for your help in advance.
Is this the output you're after?
require(RCurl); require(XML)
asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
asdf.asdf <- readHTMLTable(asdf)
First, make a table of each player and the count of fights they've been in...
# get variable with player names
one <- as.character(na.omit(asdf.asdf[[1]]$V3))
# get counts of how many times each name appears
two <- data.frame(table(one))
# remove non-name data
three <- two[two$one != 'Away / Home Player',]
# check
head(three)
one Freq
1 Aaron Volpatti 1
3 Brandon Bollig 1
4 Brian Boyle 1
5 Brian McGrattan 1
6 Chris Neil 2
7 Colin Greening 1
Second, make a table of who is in each fight...
# make data frame of pairs by subsetting the vector of names
four <- data.frame(away = one[seq(2, length(one), 3)],
home = one[seq(3, length(one), 3)])
# check
head(four)
away home
1 Brian Boyle Zdeno Chara
2 Tom Sestito Chris Neil
3 Dale Weise Mark Borowiecki
4 Brandon Bollig Brian McGrattan
5 Scott Hartnell Eric Brewer
6 Colin Greening Aaron Volpatti
