Manipulate factor list in R

I have a vector in R that is a factor, a list of 256 NFL teams. I need to change every team name, e.g. "Washington Redskins" into "WAS" or "New England Patriots" into "NE". What is the best technique for this type of problem? I'm sure this is something easy, so don't beat me up on this one.

You could read the acronyms from a web page and match the team names against yours.
Here's one example.
library(XML)
tab <- readHTMLTable("http://sportsdelve.wordpress.com/abbreviations/")[[1]]
head(tab)
# V1 V2
# 1 ARZ Arizona Cardinals
# 2 ATL Atlanta Falcons
# 3 BAL Baltimore Ravens
# 4 BALC Baltimore Colts
# 5 BCLT Baltimore Colts (1950)
# 6 BALCLT Baltimore Colts (AAFC)
And you can use regular expression matching to find your teams...
tab[grepl("WAS|NE", tab[[1]]), ]
# V1 V2
# 38 NE New England Patriots
# 58 WAS Washington Redskins

One way is to have a dictionary, i.e. a file with each full name and each short name. You can then match this file to your full names, using the full names as the ID for the match.
Example:
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash")) ## needs to be a data frame in order for plyr::join to work
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd")) ## the dictionary; one row per unique name
matched <- plyr::join(x = full.names, y = dic, by = "full") ## using join from the plyr package
Output:
full short
1 wash ww
2 wash ww
3 denv dd
4 denv dd
5 wash ww
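The same lookup can be sketched in base R with match(), which avoids the plyr dependency. This is only an illustration on the toy data above; the variable name short is made up for the example.

```r
# Toy data from the example above
full <- c("wash", "wash", "denv", "denv", "wash")
dic <- data.frame(full = c("wash", "denv"),
                  short = c("ww", "dd"),
                  stringsAsFactors = FALSE)

# match() returns, for each full name, its row position in the dictionary;
# indexing dic$short with those positions keeps the original order
short <- dic$short[match(full, dic$full)]
print(short)
```

Names that are missing from the dictionary come back as NA, which makes gaps in the dictionary easy to spot.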

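The merge command also works (using Chaconne's data here). Note that merge sorts the result by the key column by default, so the original row order is not kept: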
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash"))
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd"))
merge(full.names,dic)
full short
1 denv dd
2 denv dd
3 wash ww
4 wash ww
5 wash ww

You can just change the levels directly.
levels(team)
lists the levels of your factor in their current order, and
levels(team) <- c("ARZ","ATL", ...)
replaces the labels. The new labels must be given in the same order as the existing levels.
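To avoid depending on the order of the levels, one option is a named lookup vector indexed by the current levels. A minimal sketch, assuming a small factor called team and a hypothetical lookup vector that covers every level:

```r
# Small stand-in for the real factor of team names
team <- factor(c("Washington Redskins", "New England Patriots",
                 "Washington Redskins"))

# Hypothetical lookup: names are full team names, values are abbreviations
lookup <- c("New England Patriots" = "NE",
            "Washington Redskins"  = "WAS")

# Index the lookup by the current levels so the order always lines up
levels(team) <- unname(lookup[levels(team)])
print(team)
```

Because the lookup is indexed by levels(team), it does not matter in which order the abbreviations are listed.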


Converting XML to dataframe

I want to convert an XML file to a data frame.
I'm aware of XML::xmlToDataFrame, but it gives an error in my case.
The XML can be found here:
https://api.data.gov.hk/v1/historical-archive/get-file?url=https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Ftraffic-detectors%2FrawSpeedVol-all.xml&time=20211216-0513
Thanks for all answers!
Since your XML file contains multiple levels of nested children, XML::xmlToDataFrame gives an error.
I've approached the problem with a naive method, but it works.
Here's what I've done:
The following code creates a data frame from the tags inside <lane>.
library(xml2)
require(XML)
pg <- read_xml("https://s3-ap-southeast-1.amazonaws.com/historical-resource-archive/2021/12/16/https%253A%252F%252Fresource.data.one.gov.hk%252Ftd%252Ftraffic-detectors%252FrawSpeedVol-all.xml/0513")
records <- xml_find_all(pg, "//lane")
nodenames<-xml_name(xml_children(records))
nodevalues<-trimws(xml_text(xml_children(records)))
lane_id <- nodevalues[seq(1, length(nodevalues), 6)]
speed <- nodevalues[seq(2, length(nodevalues), 6)]
occupancy <- nodevalues[seq(3, length(nodevalues), 6)]
volume <- nodevalues[seq(4, length(nodevalues), 6)]
s.d. <- nodevalues[seq(5, length(nodevalues), 6)]
valid <- nodevalues[seq(6, length(nodevalues), 6)]
df <- data.frame(lane_id, speed, occupancy, volume, s.d., valid)
head(df)
The df looks like this:
lane_id speed occupancy volume s.d. valid
1 Fast Lane 70 0 0 0 Y
2 Middle Lane 76 6 3 11.1 Y
3 Slow Lane 70 6 0 0 Y
4 Fast Lane 82 1 1 0 Y
5 Middle Lane 63 3 1 0 Y
6 Slow Lane 79 2 1 0 Y
If you want to extract the data of <detectors>, you can use the following code:
################ Extract Detector Data #########
records2 <- xml_find_all(pg, "//detector")
vals2 <- trimws(xml_text(records2))
nodenames2 <-xml_name(xml_children(records2))
nodevalues2 <-trimws(xml_text(xml_children(records2)))
detector_id <- nodevalues2[seq(1, length(nodevalues2), 3)]
direction <- nodevalues2[seq(2, length(nodevalues2), 3)]
lanes <- nodevalues2[seq(3, length(nodevalues2), 3)]
df2 <- data.frame(detector_id, direction, lanes)
head(df2)
The df2 looks like this:
detector_id direction lanes
1 AID01101 South East Fast Lane70000YMiddle Lane766311.1YSlow Lane70600Y
2 AID01102 North East Fast Lane82110YMiddle Lane63310YSlow Lane79210Y
3 AID01103 South East Fast Lane50000YMiddle Lane65210YSlow Lane192310Y
4 AID01104 North East Fast Lane50000YSlow Lane63110Y
5 AID01105 North East Fast Lane50100YSlow Lane53410Y
6 AID01106 South East Fast Lane50300YSlow Lane56510Y
As you can see, the lanes column is not clean: that child holds the <lane> tags as grandchildren, so xml_text() concatenates all of their text into one string.
You could, however, build a new data frame from df and df2 as needed.
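One way to combine the two levels is to start from each <lane> node and look up its enclosing detector with an XPath ancestor step. The sketch below runs on a small inline sample rather than the real feed; the tag names detector_id, lane_id and speed are assumptions taken from the column names above, and df3 is a made-up name.

```r
library(xml2)

# Inline sample; the structure and tag names are assumed, not copied from the feed
doc <- read_xml('<detectors>
  <detector>
    <detector_id>AID01101</detector_id>
    <direction>South East</direction>
    <lanes>
      <lane><lane_id>Fast Lane</lane_id><speed>70</speed></lane>
      <lane><lane_id>Slow Lane</lane_id><speed>76</speed></lane>
    </lanes>
  </detector>
  <detector>
    <detector_id>AID01102</detector_id>
    <direction>North East</direction>
    <lanes>
      <lane><lane_id>Fast Lane</lane_id><speed>82</speed></lane>
    </lanes>
  </detector>
</detectors>')

lanes <- xml_find_all(doc, "//lane")

# One row per lane; xml_find_first() returns one result per lane node
df3 <- data.frame(
  detector_id = xml_text(xml_find_first(lanes, "ancestor::detector/detector_id")),
  lane_id     = xml_text(xml_find_first(lanes, "lane_id")),
  speed       = as.numeric(xml_text(xml_find_first(lanes, "speed"))),
  stringsAsFactors = FALSE
)
print(df3)
```

Looking tags up by name also removes the assumption that every lane has exactly six children in a fixed order.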

Is there a way to selectively apply this stringr function?

I have a dataframe of users, with one column containing their self-reported location. Because of this, some locations reported are nonsensical but can lead to a false positive when matching this column to other columns of known locations. Below is an example of the data frame.
data <- data.frame(X = (1:5), Y = c("", "Washington, DC", "Huntsville, AL", "Mobile,AL", "ALL OVER"))
With this data, I then run this code below to establish matches with AL.
library(stringr)
data$match_ab <- str_extract(data[,2], str_c("AL", collapse = "|"))
This correctly identifies Huntsville and Mobile as matches, but ALL OVER is also matched, incorrectly, because of the AL within the string.
Is there a way to adapt this script so that it detects matches within strings while ignoring strings that have letters attached to the desired part of the string? In other words, can this detect AL while there might be spaces or punctuation on either side of the partial string while ignoring the match if alphabetical letters are adjacent to the string?
Thanks in advance.
If I understood you correctly, does this work for you?
data$match_ab <- str_extract(data[,2], "\\bAL\\b")
\\b is a word boundary, so AL does not match when it is directly preceded or followed by a word character (a letter, digit, or underscore). As per the documentation: the symbol \b matches the empty string at either edge of a word.
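The word-boundary idea extends to several codes at once by wrapping the alternation in a group. A small sketch, assuming a hypothetical vector of codes named codes:

```r
library(stringr)

y <- c("", "Washington, DC", "Huntsville, AL", "Mobile,AL", "ALL OVER")

# Hypothetical list of codes to look for
codes <- c("AL", "DC")

# Group the alternation so that \\b applies to every code, not just the ends
pattern <- str_c("\\b(", str_c(codes, collapse = "|"), ")\\b")

match_ab <- str_extract(y, pattern)
print(match_ab)
```

Without the parentheses, the pattern \\bAL|DC\\b would anchor only the start of AL and the end of DC.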
Just a little tweak to match at a particular position: add $ after your search item. In a regex, $ anchors the match to the end of the string, so AL is matched only if it appears there.
data$match_ab <- str_extract(data[,2], str_c("AL$", collapse = "|")); data;
X Y match_ab
1 1 <NA>
2 2 Washington, DC <NA>
3 3 Huntsville, AL AL
4 4 Mobile,AL AL
5 5 ALL OVER <NA>
If AL can appear in the middle of the string, a more general pattern is needed:
data <- data.frame(X = (1:5), Y = c("", "Washington, DC", "Huntsville, AL, SOMETHING_AT_THE_END", "Mobile,AL", "ALL OVER")); data;
X Y
1 1
2 2 Washington, DC
3 3 Huntsville, AL, SOMETHING_AT_THE_END
4 4 Mobile,AL
5 5 ALL OVER
data$match_ab <- str_extract(data[,2], str_c("AL(?!L)", collapse = "|")); data;
X Y match_ab
1 1 <NA>
2 2 Washington, DC <NA>
3 3 Huntsville, AL, SOMETHING_AT_THE_END AL
4 4 Mobile,AL AL
5 5 ALL OVER <NA>
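Where (?!L) is a negative lookahead: AL matches only if it is not (!) immediately followed by L. Note that this excludes only a following L, so AL inside another word (for example ALABAMA) would still match.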
We can also use stri_extract from stringi
library(stringi)
data$match_ab <- stri_extract(data[,2], regex = "\\bAL\\b")

Splitting complex string between symbols R

I have a dataset full of IDs and qualification strings. My issue with this is twofold:
how to deal with splits between different symbols, and
how to iterate the output down a data frame whilst retaining the ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 B: Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
From my perspective the problem looks complex, and I wonder whether continuing to fudge it in Excel is ultimately the way to go for data structured like this. I would love to hear otherwise if there are any recommendations or pointers.
A solution using data.table and stringr. data.table is used only for my personal convenience; you could use data.frame with do.call(rbind, .) instead of rbindlist().
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
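In short, I use a positive lookbehind (?<=) and a positive lookahead (?=). [A-Z]+ matches a run of uppercase letters, [\\w ]+ a run of word characters and spaces, and [A-z0-9]+ a run of letters (upper and lower case) and digits. Note that the range A-z also covers a few punctuation characters that sit between Z and a, so [A-Za-z0-9]+ is stricter, and that the grade pattern stops at the first space, which is why row 1 shows 45 rather than 45 Distinctions. str_extract_all returns a list holding all the matches for each element of the character vector.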
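If the full grade text is wanted (45 Distinctions rather than 45), one alternative is a single pattern with capture groups and str_match_all. This is a sketch of a different technique from the answer above, run on a shortened copy of the data; it assumes every entry has the form CODE:Subject = Grade and that a new entry always starts with an uppercase code followed by a colon.

```r
library(stringr)

s <- data.frame(
  ID = c(1, 2),
  Qualstring = c("LE:Science = 45 Distinctions",
                 "A:Chemistry = A A:Biology = A"),
  stringsAsFactors = FALSE
)

# Groups: code, subject, grade. The lazy grade group stops just before
# the next "CODE:" or at the end of the string.
pattern <- "([A-Z]+):(.+?) = (.+?)(?=\\s+[A-Z]+:|$)"
m <- str_match_all(s$Qualstring, pattern)

# One data frame per ID, stacked with base R
out <- do.call(rbind, lapply(seq_along(m), function(i) {
  data.frame(ID = s$ID[i],
             Qualification = m[[i]][, 2],
             Subject = m[[i]][, 3],
             Grade = m[[i]][, 4],
             stringsAsFactors = FALSE)
}))
print(out)
```

Capturing all three fields in one pattern keeps them aligned per entry, so a missing grade cannot shift the columns against each other.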

Substitute DT1.x with DT2.y when DT1.x and DT2.x match in R

I want to do this using data tables in R.
So I start with this
dtMain
Name state
1: CompanyA CA
2: CompanyB MN
3: CompanyC California
4: CompanyD TX
statesFile
stateExpan state
1: Texas TX
2: Minnesota MN
3: California CA
Where dtMain$state == statesFile$state, I want to replace dtMain$state with statesFile$stateExpan
and get this
dtMain
Name state
1: CompanyA California
2: CompanyB Minnesota
3: CompanyC California
4: CompanyD Texas
Here's code to create the 2 files
library(data.table)
dtMain <- data.table(Name = c("CompanyA" ,"CompanyB","CompanyC","CompanyD"),
state = c("CA","MN","California","TX"))
statesFile <- data.table( stateExpan = c("Texas","Minnesota","California"),
state = c("TX","MN","CA"))
My problem is the next level of this question:
R finding rows of a data frame where certain columns match those of another
and I am looking for a data.table solution.
Use an update join:
dtMain[statesFile, on=.(state), state := i.stateExpan ]
The i.* prefix indicates that it's from the i table in x[i, on=, j]. It is optional here.
See ?data.table for details.
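Put together with the question's sample data, the update join can be run end to end. The sketch below only combines the code already shown, to illustrate that rows without a match (here the one already spelled California) are left unchanged.

```r
library(data.table)

dtMain <- data.table(Name  = c("CompanyA", "CompanyB", "CompanyC", "CompanyD"),
                     state = c("CA", "MN", "California", "TX"))
statesFile <- data.table(stateExpan = c("Texas", "Minnesota", "California"),
                         state      = c("TX", "MN", "CA"))

# Update join: modifies dtMain by reference, only on rows that match on state
dtMain[statesFile, on = .(state), state := i.stateExpan]
print(dtMain)
```

Because := updates by reference, no assignment back to dtMain is needed.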

Converting scraped R data using readHTMLTable()

I'm trying to scrape this website http://www.hockeyfights.com/fightlog/ but I'm having a hard time putting the data into a nice data frame. So far I have this:
> asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
> asdf.asdf <- readHTMLTable(asdf)
Then I get this giant list. How do I convert this into a 2 column dataframe that has only player names (who were in a fight) with n rows (number of fights)?
Thanks for your help in advance.
Is this the output you're after?
require(RCurl); require(XML)
asdf <- htmlParse("http://www.hockeyfights.com/fightlog/1")
asdf.asdf <- readHTMLTable(asdf)
First, make a table of each player and the count of fights they've been in...
# get variable with player names
one <- as.character(na.omit(asdf.asdf[[1]]$V3))
# get counts of how many times each name appears
two <- data.frame(table(one))
# remove non-name data
three <- two[two$one != 'Away / Home Player',]
# check
head(three)
one Freq
1 Aaron Volpatti 1
3 Brandon Bollig 1
4 Brian Boyle 1
5 Brian McGrattan 1
6 Chris Neil 2
7 Colin Greening 1
Second, make a table of who is in each fight...
# make data frame of pairs by subsetting the vector of names
four <- data.frame(away = one[seq(2, length(one), 3)],
home = one[seq(3, length(one), 3)])
# check
head(four)
away home
1 Brian Boyle Zdeno Chara
2 Tom Sestito Chris Neil
3 Dale Weise Mark Borowiecki
4 Brandon Bollig Brian McGrattan
5 Scott Hartnell Eric Brewer
6 Colin Greening Aaron Volpatti
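The pairing step can also be sketched without seq() by reshaping the vector into a three-column matrix. This is an alternative to the subsetting above, shown on a short hand-made vector; it assumes, as the code above does, that one repeats in groups of three (header cell, away player, home player).

```r
# Hand-made stand-in for the scraped vector `one`
one <- c("Away / Home Player", "Brian Boyle", "Zdeno Chara",
         "Away / Home Player", "Tom Sestito", "Chris Neil")

# Fill row by row so that each fight becomes one row of three cells
m <- matrix(one, ncol = 3, byrow = TRUE)

# Column 1 is the header cell; columns 2 and 3 are the two players
four <- data.frame(away = m[, 2], home = m[, 3], stringsAsFactors = FALSE)
print(four)
```

If the length of one is not a multiple of three, matrix() emits a warning, which is a cheap check that the grouping assumption holds.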
