Grabbing part of a link from a URL in R

I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the third / and before the .shtml portion. I want my resulting vector to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this may be a job for sub, but after much trial and error I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather do it in a way that dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better every day.

You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ removes everything up to and including the last /
| or
\\..*$ removes the . and everything after it, through the end of the string
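To see what each alternative contributes, the pattern can be split into two sub calls applied in sequence. This is only an illustrative sketch; step1 and step2 are names introduced here and are not part of the original answer.

```r
teamplayerlinks <- c(
  "/players/i/iannech01.shtml",
  "/players/l/lindad01.shtml",
  "/players/c/canoro01.shtml"
)

# First alternative: drop everything up to and including the last "/"
step1 <- sub(".*/", "", teamplayerlinks)

# Second alternative: drop the "." and everything after it
step2 <- sub("\\..*$", "", step1)

print(step1)
print(step2)
```

The single gsub call does both removals in one pass because the | lets either alternative match wherever it applies.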
By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package retrosheet for downloading data sets from retrosheet.com. It's also on CRAN. Check it out!

The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"

This can also be done without regex
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"
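If a single call without helper functions is preferred, a capture group is another option. The sketch below assumes every link ends in a file name with exactly one extension; player_ids is a name made up for this example.

```r
teamplayerlinks <- c(
  "/players/i/iannech01.shtml",
  "/players/l/lindad01.shtml",
  "/players/c/canoro01.shtml"
)

# Match the whole string, capture the part between the last "/" and the
# final ".", and keep only that captured group.
player_ids <- sub("^.*/([^/.]+)\\.[^/.]+$", "\\1", teamplayerlinks)

print(player_ids)
```

Because the pattern is anchored at both ends, a string that does not fit the assumed shape is returned unchanged rather than partially trimmed.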

Related

How to grep a string ending in a specific punctuation mark

I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to replace everything between the dashes (dashes included) with "" and keep the text to the right of the second dash, which ends in a full stop. So I would like the output to be (based on the examples above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro
If you want to stay between the square brackets, you can start the match at #, then use a negated character class [^][]* that matches any number of characters other than an opening or closing square bracket, and then match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[-  not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."
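For the three strings in the question's update, which have no square brackets, a simpler anchored pattern may be enough. This is a sketch that assumes the text between # and the second dash never contains a dash itself; x and cleaned are names chosen here for illustration.

```r
x <- c(
  "- # (Piano) - no, and neither can you.",
  "- # (Piano) - uh-huh.",
  "- # Many dreams ago - Try it again."
)

# From the leading dash, through "#", up to and including the next dash
# and any whitespace after it. [^-]* cannot cross a dash, so the hyphen in
# "uh-huh." is left alone.
cleaned <- sub("^-\\s*#[^-]*-\\s*", "", x)

print(cleaned)
```

Unlike [A-Z][a-z]*, which matches a single capitalised word, [^-]* also covers multi-word text such as "Many dreams ago".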

How to check whether an English word is meaningful in Julia?

In Julia, how can I check whether an English word is meaningful? Suppose I want to know whether "Hello" is meaningful or not. In Python, one can use the enchant or nltk packages (examples: [1], [2]). Is it possible to do this in Julia as well?
What I need is a function like this:
is_english("Hello")
>>>true
is_english("Hlo")
>>>false
# Because it doesn't have meaning! We don't have such a word in English terminology!
is_english("explicit")
>>>true
is_english("eeplicit")
>>>false
Here is what I've tried so far:
I have a dataset of frequent five-character English words (link to google drive), which I'm including in my question to make it easier to understand. This dataset is not adequate, because it contains only frequent five-character words rather than all meaningful English words of any length, but it is good enough to show what I want:
using CSV
using DataFrames
df = CSV.read("frequent_5_char_words.csv" , DataFrame , skipto=2)
df = [lowercase(item) for item in df[:,"0"]]
function is_english(word::String)::Bool
    return lowercase(word) in df
end
Then when I try these:
julia>is_english("Helo")
false
julia>is_english("Hello")
true
But I don't have a comprehensive dataset, so this isn't enough. Are there any Julia packages like the ones I mentioned above?
(not enough rep to post a comment!)
You can still use NLTK in Julia via PyCall. Or, since it seems you don't need an NLP tool but just a dictionary, you can use Wiktionary to do lookups or to build the dataset.
There is a fairly new package named LanguageDetect.jl. It does not return true/false, but a list of probabilities. You could define something like:
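using LanguageDetect: detect

function is_english(text, threshold=0.8)
    langs = detect(text)
    for lang in langs
        if lang.language == "en"
            return lang.probability >= threshold
        end
    end
    # The original snippet is cut off after "ret". Returning false when "en"
    # is not among the detected languages is an assumed completion.
    return false
end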

Selecting / grouping by first character string until a specific symbol

Good day,
I'm currently working on a dataset where there's a column in this format.
PA-121-1512-asa-1241
PWW-121-1571-accs-21561
PSAWA-171-1616-gfaa-161
QSF-16-1613-63-asdfa
H-Elevator-15-asf-1112
QSF-asa-sda-afas-112
The first sequence of letters before the "-" symbol identifies the "building location". Because of this, I would like to save that first sequence of letters in a separate column.
I would like to know how to copy these values into a new column, so that I end up with a column like this:
Location:
PA
PWW
PSAWA
QSF
H
QSF
I tried the function:
str_extract("PA-121-1512-asa-1241", ".+?(?<=-)")
PA-121-1512-asa-1241 is just an example; in practice I selected the whole column.
What I got printed out was PA- instead of just PA.
If more data or a more elaborate explanation is needed, please tell me. I'm still fairly new to writing questions on this site.
Happy holidays!
E.D.D.
Post post...
After looking at my code again to copy and paste a proper example, as Mr. Cyrus suggested, I found my mistake. Instead of:
str_extract("PA-121-1512-asa-1241", ".+?(?<=-)")
it is:
str_extract("PA-121-1512-asa-1241", "[^-]+")
This returns PA instead of PA-
This shows that reading through your code 50 times does help, because the previous 49 didn't.
If anyone has a more elegant / efficient method, I'm still interested, since running this code through 5 million rows took me quite a while.
The format str_extract() requires for its "pattern" argument is quite confusing to get right. My teacher has advised me to read more about grep.
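As a possible alternative for large columns, base R's sub is vectorised and can drop everything from the first "-" onward in one call over the whole column. This is a sketch using a small made-up data frame (df, code and location are hypothetical names); whether it is actually faster on 5 million rows has not been measured here.

```r
df <- data.frame(
  code = c(
    "PA-121-1512-asa-1241",
    "PWW-121-1571-accs-21561",
    "PSAWA-171-1616-gfaa-161",
    "QSF-16-1613-63-asdfa",
    "H-Elevator-15-asf-1112",
    "QSF-asa-sda-afas-112"
  ),
  stringsAsFactors = FALSE
)

# Remove the first "-" and everything after it, for every row at once
df$location <- sub("-.*$", "", df$code)

print(df$location)
```

Passing the whole column in a single call, rather than one string at a time in a loop, is usually what matters most for speed with either sub or str_extract.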

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub("B15", "", df$col)
But I'm not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case for regex. You can use base R as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, you can use the following code example to help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex, "B15" is the literal text you're looking for. The "." means any character, and "{2}" says how many of those characters to match after "B15" (here exactly two). You can change it as you need. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means zero or more of the preceding ".", so the match runs to the end of the string.
edit: If you want to require that "B15" is at the start of the string, you can add "^" to the start of the pattern, like so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regex constructs you can use to be more specific.
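Applied to identifiers shaped like the ones in the question, the anchored pattern with ".*" blanks every B15 entry and leaves the others alone. The values in ids below are invented for illustration, and emptied is a name introduced here.

```r
ids <- c("B15006788", "B16001234", "B15999999", "B17000001")

# "^B15" requires the identifier to start with B15; ".*$" then consumes the
# rest of the string, so the whole value is replaced by "".
emptied <- sub("^B15.*$", "", ids)

print(emptied)
```

The same effect is available without regex through ids[startsWith(ids, "B15")] <- "", which some readers may find easier to follow.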

R: Extract value and lines after key word (text file mining)

Setting:
I have (simple) .csv and .dat files created by laboratory devices and other programs, storing information on measurements or calculations. I have found solutions for this in other languages, but not for R.
Problem:
Using R, I am trying to extract values to quickly display results w/o opening the created files. Hereby I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally I would like to loop over dozens of files in a folder and give me a summary (to make the picture complete: I will manage this part)
I would appreciate any form of help.
OK, it works for the keyword value (although perhaps not very nicely):
ks <- scan("file.csv", what=character(), sep="")
returns a character vector of everything in the file.
ks[grep("keyword", ks)+2] # + 2 as the actual value is stored two places ahead
returns the sought values as characters.
as.numeric(lapply(ks, gsub, patt=",", replace="."))
For completeness: the data had to be converted to numbers, and the "," versus "." decimal separator problem needed to be solved.
In one line:
data=as.numeric(lapply(ks[grep("Ks_Boden", ks)+2], gsub, patt=",", replace="."))
Perseverance is not too bad of an asset ;-)
The rest isn't finished yet; I will post once it is.
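For settings a) and b) from the question, reading the file with readLines and indexing relative to a grep hit is one possible approach. The sketch below writes a small made-up file to a temporary path so that it runs on its own; the file layout, the "Results" marker line and all variable names are assumptions rather than the real device format.

```r
# Hypothetical sample file standing in for a device export
path <- tempfile(fileext = ".dat")
writeLines(c(
  "Device: demo",
  "Ks_Boden = 1,25",
  "Results",
  "10,5",
  "11,0"
), path)

lines <- readLines(path)

# a) value after a known keyword: split the matching line on whitespace and
#    take the token two places after the keyword
hit <- grep("Ks_Boden", lines, fixed = TRUE)[1]
tokens <- strsplit(lines[hit], "[[:space:]]+")[[1]]
value <- as.numeric(sub(",", ".", tokens[which(tokens == "Ks_Boden") + 2], fixed = TRUE))

# b) the two lines following a known keyword line
start <- grep("^Results$", lines)[1]
after <- lines[(start + 1):min(start + 2, length(lines))]
after_num <- as.numeric(sub(",", ".", after, fixed = TRUE))

print(value)
print(after_num)
```

For step c), the same code could be wrapped in a function and applied over list.files(folder, full.names = TRUE) with lapply, where folder is a placeholder for the directory that holds the files.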
