Strip HTML Formatting from R Strings - r

I'm trying to scrape information from this url: http://www.sports-reference.com/cbb/boxscores/index.cgi?month=2&day=3&year=2017 and have gotten decently far to the point where I have strings for each game that look like this:
str <-"Yale\n\t\t\t87\n\t\t\t\n\t\t\t\tFinal\n\t\t\t\t\n\t\t\t\n\t\tColumbia\n\t\t\t78\n\t\t\t \n\t\t\t\n\t\t"
Ideally I'd like to get to a vector or dataframe that looks something like:
str_vec <- c('Yale',87,'Columbia',78)
I've tried a few things that didn't work like:
without_n <- gsub(x = str, pattern = '\n')
without_Final <- gsub(x = without_n, pattern = 'Final')
str_vec <- strslpit(x = without_Final, split = '\t')
Thanks in advance for any helpful tips/answers!

You can use gsub to first strip every non-alphanumeric character (and the word Final) from the string, then insert a space between each name and score. After that, split the string on spaces to get the pieces you need.
require(stringr)
step_1 <- gsub('([^[:alnum:]]|(Final))', "", str)
#"Yale87Columbia78"
step_2 <- gsub("([[:alpha:]]+)([[:digit:]]+)", "\\1 \\2 ", step_1)
strsplit(str_trim(step_2)," ")
#"Yale" "87" "Columbia" "78"
I assume the string pattern is consistent, for this to work reliably.
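A base-R sketch of the same idea, assuming every field sits on its own line as in the sample string, so team names containing spaces would survive intact. The names parts and game are illustrative and not from the original post.

```r
# Sample string from the question
str <- "Yale\n\t\t\t87\n\t\t\t\n\t\t\t\tFinal\n\t\t\t\t\n\t\t\t\n\t\tColumbia\n\t\t\t78\n\t\t\t \n\t\t\t\n\t\t"

# Split on newlines, then strip tabs and spaces around each piece
parts <- trimws(strsplit(str, "\n", fixed = TRUE)[[1]])

# Drop empty pieces and the "Final" status marker
parts <- parts[parts != "" & parts != "Final"]

# Assumes the remaining order is team, score, team, score
game <- data.frame(team = parts[c(1, 3)],
                   score = as.integer(parts[c(2, 4)]),
                   stringsAsFactors = FALSE)
print(game)
```

If a game is not final (for example a postponed game with no score), the positional indexing would need to be adapted.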

Related

Within a column, I'd like to gsub each row of string values and remove any value that matches a list of values I created

Context
I am working with a messy datafile right now. I have a list of comments that I'd like to sort out and grab the most common combination of phrases. An example phrase would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to go through and remove Stop Words so I can match X and Y as a common phrase. I was able to easily do this for common single words, but phrases are a little difficult. Below is my code for context
Create Datafile
dat1 <- dat %>% filter(Action != Exclude)
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
Issue
This is what pops up and is traced back to the gsub code
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
"\\b(a|a's|able|about|above|according|accordingly|across|actually|after|afterwards|again|against|ain't|all|allow|allows|almost|alone|along|already|also|although|always|am|among|amongst|an|and|another|any|anybody|anyhow|anyone|anything|anyway|anyways|anywhere|apart|appear|appreciate|appropriate|are|aren't|around|as|aside|ask|asking|associated|at|available|away|awfully|b|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|believe|below|beside|besides|best|better|between|beyond|both|brief|but|by|c|c'mon|c's|came|can|can't|cannot|cant|cause|causes|certain|certainly|changes|clearly|co|com|come|comes|concerning|consequently|consider|considering|contain|containing|contains|corresponding|could|couldn't|course|currently|d|definitely|described|despite|did|didn't|different|do|does|doesn't|doing|don't|done|down|downwards|during|e|each|edu|eg|eight|either|else|elsewhere|enough|entirely|especially|et|etc|even|ever|every|everybody|everyone|everything|everywhere|ex|exactly|example|except|f|far|few|fifth|first|five|followed|following|follows|for|former|formerly|forth|four|from|further|furthermore|g|get|gets|getting|given|gives|go|goes|going|gone|got|gotten|greetings|h|had|hadn't|happens|hardly|has|hasn't|have|haven't|having|he|he's|hello|help|hence|her|here|here's|hereafter|hereby|herein|hereupon|hers|herself|hi|him|himself|his|hither|hopefully|how|howbeit|however|i|i'd|i'll|i'm|i've|ie|if|ignored|immediate|in|inasmuch|inc|indeed|indicate|indicated|indicates|inner|insofar|instead|into|inward|is|isn't|it|it'd|it'll|it's|its|itself|j|just|k|keep|keeps|kept|know|knows|known|l|last|lately|later|latter|latterly|least|less|lest|let|let's|like|liked|likely|little|look|looking|looks|ltd|m|mainly|many|may|maybe|me|mean|meanwhile|merely|might|more|moreover|most|mostly|much|must|my|myself|n|name|namely|nd|near|nearly|necessary|need|needs|neither|never|nevertheless|new|next|nine|no|nobody|non|none|noone|nor|normally|not|nothing|novel|now|nowhere|o|obviously|of|off|ofte
n|oh|ok|okay|old|on|once|one|ones|only|onto|or|other|others|otherwise|ought|our|ours|ourselves|out|outside|over|overall|own|p|particular|particularly|per|perhaps|placed|please|plus|possible|presumably|probably|provides|q|que|quite|qv|r|rather|rd|re|really|reasonably|regarding|regardless|regards|relatively|respectively|right|s|said|same|saw|say|saying|says|second|secondly|see|seeing|seem|seemed|seeming|seems|seen|self|selves|sensible|sent|serious|seriously|seven|several|shall|she|should|shouldn't|since|six|so|some|somebody|somehow|someone|something|sometime|sometimes|somewhat|somewhere|soon|sorry|specified|specify|specifying|still|sub|such|sup|sure|t|t's|take|taken|tell|tends|th|than|thank|thanks|thanx|that|that's|thats|the|their|theirs|them|themselves|then|thence|there|there's|thereafter|thereby|therefore|therein|theres|thereupon|these|they|they'd|they'll|they're|they've|think|third|this|thorough|thoroughly|those|though|three|through|throughout|thru|thus|to|together|too|took|toward|towards|tried|tries|truly|try|trying|twice|two|u|un|under|unfortunately|unless|unlikely|until|unto|up|upon|us|use|used|useful|uses|using|usually|uucp|v|value|various|very|via|viz|vs|w|want|wants|was|wasn't|way|we|we'd|we'll|we're|we've|welcome|well|went|were|weren't|what|what's|whatever|when|whence|whenever|where|where's|whereafter|whereas|whereby|wherein|whereupon|wherever|whether|which|while|whither|who|who's|whoever|whole|whom|whose|why|will|willing|wish|with|within|without|won't|wonder|would|would|wouldn't|x|y|yes|yet|you|you'd|you'll|you're|you've|your|yours|yourself|yourselves|z|zero|i|me|my|myself|we|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|would|should|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you
'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|around|as|ask|asked|asking|asks|at|away|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|best|better|between|big|both|but|by|came|can|cannot|case|cases|certain|certainly|clear|clearly|come|could|did|differ|different|differently|do|does|done|down|down|downed|downing|downs|during|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|had|has|have|having|he|her|here|herself|high|high|high|higher|highest|him|himself|his|how|however|i|if|important|in|interest|interested|interesting|interests|into|is|it|its|itself|just|keep|keeps|kind|knew|know|known|knows|large|largely|last|later|latest|least|less|let|lets|like|likely|long|longer|longest|made|make|making|man|many|may|me|member|members|men|might|more|most|mostly|mr|mrs|much|must|my|myself|necessary|need|needed|needing|needs|never|new|new|newer|newest|next|no|nobody|non|noone|not|nothing|now|nowhere|number|numbers|of|off|often|old|older|oldest|on|once|one|only|open|opened|opening|open
s|or|order|ordered|ordering|orders|other|others|our|out|over|part|parted|parting|parts|per|perhaps|place|places|point|pointed|pointing|points|possible|present|presented|presenting|presents|problem|problems|put|puts|quite|rather|really|right|right|room|rooms|said|same|saw|say|says|second|seconds|see|seem|seemed|seeming|seems|sees|several|shall|she|should|show|showed|showing|shows|side|sides|since|small|smaller|smallest|some|somebody|someone|something|somewhere|state|states|still|still|such|sure|take|taken|than|that|the|their|them|then|there|therefore|these|they|thing|things|think|thinks|this|those|though|thought|thoughts|three|through|thus|to|today|together|too|took|toward|turn|turned|turning|turns|two|under|until|up|upon|us|use|used|uses|very|want|wanted|wanting|wants|was|way|ways|we|well|wells|went|were|what|when|where|whether|which|while|who|whole|whose|why|will|with|within|without|work|worked|working|works|would|year|years|yet|you|young|younger|youngest|your|yours)\\b"
Both TRE (the default regex engine in base R regex functions) and PCRE (the engine used in base R regex functions when perl=TRUE) impose fairly strict limits on pattern length.
In your case, the stringr regex functions will work better because they use the ICU regex engine, which supports much longer patterns.
So you can replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")
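A different technique, not the one suggested above, is to skip the giant regex entirely: split each comment into words and drop the words found in the stop list. The sketch below uses base R only; comments, stop_list and remove_stop_words are hypothetical names, and the tiny stop list stands in for stop_words$word.

```r
# Hypothetical sample data standing in for dat1$Comments
comments <- c("did not qualify because of income and age",
              "did not qualify because of age and income")

# Hypothetical short stop list standing in for stop_words$word
stop_list <- c("did", "not", "because", "of", "and")

# Split each comment on whitespace, keep words not in the stop list, re-join
remove_stop_words <- function(x, stops) {
  vapply(strsplit(x, "[[:space:]]+"), function(words) {
    paste(words[!(tolower(words) %in% stops)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

cleaned <- remove_stop_words(comments, stop_list)
print(cleaned)
```

Because the lookup uses %in% rather than a compiled pattern, the length of the stop list does not matter, and the result is a plain character vector rather than the list that lapply would return.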

String: extract wanted character instead of removing unwanted

I was wondering if there is a function in R like KeepChar("abcde....xyz", some_text) that you feed with all the characters you want to keep and that returns the string with only those characters left in it. Here the function would keep only the lowercase letters of the alphabet. I would like something that looks like this:
some_text <- "Hel-_l0o W#oRr^ld"
some_text <- KeepChar("abcdefghijklmnopqrstuvwxyz ", some_text)
some_text
> "hello world"
I feel that the removal methods I am currently using, such as gsub("#\\w+", "", some_text), tm_map(some_text, stripWhitespace) or str_replace_all(some_text,"[^[:graph:]]", " "), take a lot of time and lines of code, with a constant risk of forgetting to remove a specific character, especially when you already know exactly what you want to keep.
I ask because I am coding a platform to run sentiment analysis on texts from various sources like Twitter, and I want to make sure I don't forget to remove any unwanted character.
To handle a pattern without using regex I will try this:
string <- "Hel-_l0o W#oRr^ld"
pattern <- "abcdefghijklmnopqrstuvwxyz"
KeepChar = function(pattern, string){
splitted_string <- unlist(strsplit(string, ""))
splitted_pattern <- unlist(strsplit(pattern, ""))
ids_string <- splitted_string %in% splitted_pattern
return(paste(splitted_string[ids_string], sep = "", collapse = ""))
}
some_text <- KeepChar(pattern = pattern, string = string)
You can try this:
some_text <- "Hel-_l0o W#oRr^ld"
gsub("[^[:alpha:] ]", "", some_text) # keeps letters of either case plus spaces: "Hello WoRrld"
gsub("[^[:lower:] ]", "", some_text) # keeps lowercase letters plus spaces: "ello orld"
gsub("[^[:upper:] ]", "", some_text) # keeps uppercase letters plus spaces: "H WR"
You can also look at the page https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html to see the matches available in R
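For completeness, here is a sketch of a regex-free KeepChar that also works on a vector of strings, which the single-string version earlier in this thread does not. The name keep_chars and the second sample string are made up for illustration.

```r
# Keep only the characters listed in `allowed`, for each element of `text`
keep_chars <- function(allowed, text) {
  allowed_set <- strsplit(allowed, "", fixed = TRUE)[[1]]
  vapply(strsplit(text, "", fixed = TRUE), function(chars) {
    paste(chars[chars %in% allowed_set], collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

kept <- keep_chars("abcdefghijklmnopqrstuvwxyz ",
                   c("Hel-_l0o W#oRr^ld", "R 4.3!"))
print(kept)
```

Note that uppercase letters are dropped, not converted. Wrapping the input in tolower() first would keep them as lowercase instead.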

Formatting / adjusting incoming string to R

I'm having trouble doing some extraction & coercing of a string in R. I'm not very good with R... just enough to be dangerous. Any help would be appreciated.
I am trying to take a string of this form:
"AAA,BBB,CCC'
And create two items:
A list containing each element separately (i.e. 3 entries) - c("AAA","BBB","CCC"). I've tried strsplit(string, ",") but I get a list of length 1
A data frame with names = lower case entries, values = entries. e.g. df = data.frame(aaa=AAA, bbb=BBB, ccc=CCC). I'm not sure how to pull out each of the elements, and lowercase the references.
Hopefully this is doable with R. Appreciate your time!
If the string is malformed (mismatched quotes), read it in with quote handling turned off:
malform <- read.table("weirdstring.txt", colClasses='character',quote = "")
str = gsub("\'|\"", "", malform[1,1])
The string should now look like:
str = "AAA,BBB,CCC"
## as a character vector (strsplit returns a list, so unlist it)
ll <- unlist(strsplit(str, ","))
## as a one-row data frame with lowercase column names
df <- data.frame(t(ll))
names(df) <- tolower(ll)
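The "list of length 1" in the question comes from strsplit returning one list element per input string. Below is a compact base-R sketch of both steps; parts and wide are illustrative names.

```r
s <- "AAA,BBB,CCC"

# strsplit returns a list with one element per input string;
# [[1]] pulls out the character vector for the single input
parts <- strsplit(s, ",", fixed = TRUE)[[1]]

# Name each value by its lowercase form, then build a one-row data frame
wide <- as.data.frame(as.list(setNames(parts, tolower(parts))),
                      stringsAsFactors = FALSE)
print(wide)
```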

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename,header=TRUE), and then the space in variable names became ".", for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, while the variable names are changed. How to address this problem?
Besides, in the output .xlsx file, the first column becomes row indices (i.e., 1 to N), which is not what I expect.
If you set check.names=FALSE in read.csv when you read the data in, the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you need to quote the column names (with backticks in some cases) or refer to the columns by position rather than by name while editing.
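A small self-contained sketch of the difference, reading from an inline string instead of a file. The column names and values are made up for illustration.

```r
csv_text <- "Full Code,Unit Price\nA1,10\nB2,20"

# Default behaviour: names are passed through make.names()
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written
dat <- read.csv(text = csv_text, check.names = FALSE)
kept_names <- names(dat)

# Columns with spaces need [[ ]] or backticks
total <- sum(dat[["Unit Price"]])
same  <- identical(dat$`Full Code`, dat[["Full Code"]])

print(default_names)
print(kept_names)
```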
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
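The two gsub passes can be folded into one expression. This is a sketch, with tidy_names as a hypothetical helper that works on a character vector of names, not on the data frame itself.

```r
# Collapse runs of dots into one space, then trim leading/trailing spaces
tidy_names <- function(x) trimws(gsub("\\.+", " ", x))

cleaned_names <- tidy_names(c("Full.Code", "a..b.", "plain"))
print(cleaned_names)
```

It would be applied as names(ds) <- tidy_names(names(ds)). As with the longer version, any dot that was genuinely part of an original column name is also turned into a space.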
Just to add to the answers already provided, here is another way to replace the "." or any other kind of punctuation in column names, using a regex with the stringr package:
library(stringr)
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"

Parsing tweets to extract hashtags in R

I was wondering if anyone has a quick solution to extracting hashtags from the tweets in R.
For example, given the following string, how can I parse it to extract the word with the hashtag?
string <- 'Crowdsourcing is awesome. #stackoverflow'
Unlike HTML, I expect you probably can parse hashtags with regex.
library(stringr)
string <- "#hashtag Crowd#sourcing is awesome. #stackoverflow #question"
# I don't use Twitter, so this regex may not match the exact
# set of allowable hashtag characters.
# A plain pattern string is interpreted as an ICU regex by stringr.
hashtag.regex <- "(?<=^|\\s)#\\S+"
hashtags <- str_extract_all(string, hashtag.regex)
Which yields:
> print(hashtags)
[[1]]
[1] "#hashtag" "#stackoverflow" "#question"
Note that this also works unmodified if string is actually a vector of many tweets. It returns a list of character vectors.
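The same extraction can be sketched in base R with regmatches and gregexpr, which avoids the stringr dependency. The pattern here is an assumption: it treats a hashtag as # followed by word characters, and only when the # is not preceded by a non-space character.

```r
tweets <- c("#hashtag Crowd#sourcing is awesome. #stackoverflow #question",
            "no tags here")

# (?<!\S) : the # must be at the start or preceded by whitespace
# #\w+    : the hashtag itself (letters, digits, underscore)
pattern <- "(?<!\\S)#\\w+"

# gregexpr finds all matches per string; regmatches extracts them as a list
hashtags <- regmatches(tweets, gregexpr(pattern, tweets, perl = TRUE))
print(hashtags)
```

Tweets without any hashtag yield an empty character vector in the corresponding list slot.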
Something like this?
string <- c('Crowdsourcing is awesome. #stackoverflow #answer',
"another #tag in this tweet")
step1 <- strsplit(string, "#")
step2 <- lapply(step1, tail, -1)
result <- lapply(step2, function(x){
sapply(strsplit(x, " "), head, 1)
})
