Does R have a wildcard expression (such as an asterisk (*))? - r

I hope I am not missing something obvious but here is my question:
Given data like this
Data1=
Name Number
AndyBullxxx 12
AlexPullxcx 14
AndyPamxvb 56
RickRalldfg 34
AndyPantert 45
SamSaltedf 45
I would like to be able to pull out all of the rows starting with "Andy"
subset(Data1,Name=="Andy*")
AndyBullxxx 12
AndyPamxvb 56
AndyPantert 45
So basically a wild card symbol that will let me subset all rows that begin with a certain character designation.

Try,
df[grep("^Andy",rownames(df)),]
the first argument of grep is a pattern, a regular expression. grep gives back the rows that contain the pattern given in this first argument. the ^ means beginning with.
Let's make this reproducible:
df <- data.frame(x=c(12,14,56,34,45,45))
rownames(df) <- c("AndyBullxxx", "AlexPullxcx","AndyPamxvb", "RickRalldfg","AndyPantert","SamSaltedf")
## see what grep does
grep("^Andy",rownames(df))

If you are not comfortable with regex there is a function in the utils package, which can convert wildcard based expressions to regex. So you can do
df[grepl(glob2rx('Andy*'), rownames(df)),]

I think you want to go for a regular expression:
subset(Data1, grepl("^\bAndy", Name))
or
Data1[grepl("^\bAndy", Data1$Name),]
In the regular expression the "^" stands for startswith, and \b for the next set of characters is going to be a word. Regular expressions are a powerful text processing tool that require some study. There are a lot of tutorials and websites online. One of them that I use is:
http://www.regular-expressions.info/
I can't use R right now (my wifes laptop ;)), so these pieces of code go untested. Maybe next time provide us with an example dataset, that makes it really easy to provide a working example.

Related

What is wrong with this gsub one liner

This may be a silly question, but its doing my head in. I have always used gsub, but for some reason, it isnt working for this one:
Example of dataset
ColumnS
I 2,[3],4:i:-
I 2,[3],4:i:-
I 2,[3],4:b:-
Give
Derby
Panama
Kentucky
This is what I've been trying
dataset$ColumnS<-gsub("I 2,[3],4","2,[3],4", dataset$ColumnS)
What is wrong with it?
Square brackets are special characters when it comes to pattern recognition and if you want to match them, you have to use escape characters to inform R.
dataset$ColumnS <- gsub("I 2,\\[3\\],4","2,[3],4", dataset$ColumnS)
You can also use the argument fixed=TRUE which will take the pattern as a string.
dataset$ColumnS <- gsub("I 2,[3],4","2,[3],4", dataset$ColumnS, fixed=TRUE)

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know if there was just one thing I wanted to remove, like the B15, I could do;
sub(“B15”, ””, df$col)
But I’m not sure on the how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case of regex. You can use base R as I show here or look into the stringR package for handy tools that are easier to understand. You can also look for regex rules to help define what you want to look for. For what you ask you can use the following code example to help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function to replace a pattern with something else in one or a series of inputs. To test our regex I created the testStrings vector for different examples.
Breaking down the regex code, "B15" is the pattern you're specifically looking for. The "." means any character and the "{2}" is saying what range of any character we want to grab after "B15". You can change it as you need. If you want to remove everything after "B15". replace the pattern with "B15.". the "" means everything till the end.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern as so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has a info on different regex's you can make to be more particular.

Extract numerical value before a string in R

I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by a . I tried using Xpath but that seemed even harder than a regex expression. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try following.
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: using gsub(global substitution) function of R here. Putting condition here where it should replace all occurrences of space with small or capital alphabets with NULL for variable val.
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")

How to remove characters before matching pattern and after matching pattern in R in one line?

I have this vector Target <- c( "tes_1123_SS1G_340T01", "tes_23_SS2G_340T021". I want to remove anything before SS and anything after T0 (including T0).
Result I want in one line of code:
SS1G_340 SS2G_340
Code I have tried:
gsub("^.*?SS|\\T0", "", Target)
We can use str_extract
library(stringr)
str_extract(Target, "SS[^T]*")
#[1] "SS1G_340" "SS2G_340"
Try this:
gsub(".*(SS.*)T0.*","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Why it works:
With regex, we can choose to keep a pattern and remove everything outside of that pattern with a two-step process. Step 1 is to put the pattern we'd like to keep in parentheses. Step 2 is to reference the number of the parentheses-bound pattern we'd like to keep, as sometimes we might have multiple parentheses-bound elements. See the example below for example:
gsub(".*(SS.*)+(T0.*)","\\1",Target)
[1] "SS1G_340" "SS2G_340"
Note that I've put the T0.* in parentheses this time, but we still get the correct answer because I've told gsub to return the first of the two parentheses-bound patterns. But now see what happens if I use \\2 instead:
gsub(".*(SS.*)+(T0.*)","\\2",Target)
[1] "T01" "T021"
The .* are wild cards by the way. If you'd like to learn more about using regex in R, here's a reference that can get you started.

Finding number of occurrences of a word in a file using R functions

I am using the following code for finding number of occurrences of a word memory in a file and I am getting the wrong result. Can you please help me to know what I am missing?
NOTE1: The question is looking for exact occurrence of word "memory"!
NOTE2: What I have realized they are exactly looking for "memory" and even something like "memory," is not accepted! That was the part which has brought up the confusion I guess. I tried it for word "action" and the correct answer is 7! You can try as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe it assumes it is the start of a quoted string and reads all characters up until the next apostrophe into a single entry of your names array. One of these long entries happens to include two instances of the word "memory" and so reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep. It does not behave in exactly the same way as the usual GNU/Linux program. In particular, the way you have used it here WILL find the number of matching words and not just the total number of matching lines as some people have suggested.
As pointed by #andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems ok:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10

Resources