Identifying if one field is partially derived from another in R - r

Given a pattern, say two letters, and the position of that pattern (prefix, suffix, or middle), I need to identify whether a field is partially derived from another. For example, given the following dataset:
data.V1 data.V2
1 GH GH1001
2 FD FD2002
3 TH 2345TH
4 ED ED56763
5 US 4345US
6 FG F6736tG
Suppose LL is the pattern for column one, where LL stands for two letters. If the pattern for column 2 is LL#, the two letters must be the first characters of each element in column 2. In the dataset above, rows 1, 2 and 4 obey the pattern.
I have tried if-then statements, but these did not work when the pattern was in the middle (#LL#). I have also tried the function regmatches, but that did not work either.

apply(df,1,function(x) grepl(paste0("^",x["data.V1"]),x["data.V2"]))
For every row (that's the 1 in apply), this checks whether the contents of data.V1 appear right at the beginning of data.V2 (^ means the beginning of a string in regular expressions).
Result:
1 2 3 4 5 6
TRUE TRUE FALSE TRUE FALSE FALSE
Replace the grepl first argument with the following for:
End of string: paste0(x["data.V1"],"$")
Middle of string: paste0(".+",x["data.V1"],".+")
After n characters (n defined elsewhere): paste0(".{",n,"}",x["data.V1"])
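(For the last one, the form "{n}" means the preceding element is repeated n times. Since it is preceded by ".", it matches any n characters. Prepend "^" if the key must start exactly after the first n characters rather than after at least n characters.)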
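The three variants above can be wrapped in one vectorised helper. This is only a sketch: matches_at is a hypothetical name, the data frame is rebuilt from the question, and the keys are assumed to contain no regex metacharacters.

```r
# Rebuild the example data from the question
df <- data.frame(
  data.V1 = c("GH", "FD", "TH", "ED", "US", "FG"),
  data.V2 = c("GH1001", "FD2002", "2345TH", "ED56763", "4345US", "F6736tG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one regex per row, then test it against the target
matches_at <- function(key, target, position = c("prefix", "suffix", "middle")) {
  position <- match.arg(position)
  pattern <- switch(position,
    prefix = paste0("^", key),        # key at the start
    suffix = paste0(key, "$"),        # key at the end
    middle = paste0(".+", key, ".+")  # key with at least one character on each side
  )
  # mapply pairs the i-th pattern with the i-th target string
  unname(mapply(grepl, pattern, target))
}

is_prefix <- matches_at(df$data.V1, df$data.V2, "prefix")
is_suffix <- matches_at(df$data.V1, df$data.V2, "suffix")
is_middle <- matches_at(df$data.V1, df$data.V2, "middle")
```

Working on the columns directly avoids apply() converting each row to a character vector, and the position becomes an argument instead of an edited pattern.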

Related

Regular expression weird result [duplicate]

Code
gsub('101', '111', '110101101')
#[1] "111101111"
Would anyone know why the second 0 in the input isn't being replaced by a 1 in the output?
I'm looking for the pattern 101 in the string and want to replace it with 111. Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
You could use a lookahead, (?=...).
The way this works is that q(?=u) matches a q that is followed by a u, without making the u part of the match.
Example:
gsub('10(?=1)', '11', '110101101', perl=TRUE)
#[1] "111111111"
Edit: you need to use gsub in perl mode (perl=TRUE) to use lookaheads.
It's because gsub doesn't work in a recursive way.
gsub('101', '111', '110101101') consumes the input string as it finds matches. It finds the first 101 and is left with 01101, in which only the final 101 still matches. Think about it: if it replaced "recursively", something like gsub('11', '111', '11') would return an infinite string of '1' and break. It doesn't check the already "replaced" text.
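To see the single-pass behaviour, one option is to rerun gsub until the string stops changing. This is a minimal sketch, not something the answer proposes; it terminates here because every pass that changes the string removes at least one 0.

```r
x <- "110101101"

single_pass <- gsub("101", "111", x, fixed = TRUE)  # one left-to-right scan

# Re-apply the substitution until a pass leaves the string unchanged
result <- x
repeat {
  nxt <- gsub("101", "111", result, fixed = TRUE)
  if (identical(nxt, result)) break
  result <- nxt
}
```

After the first pass single_pass still contains a 0, while result ends up with none, because the second pass sees the 101 that the first pass had already scanned past.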
This is because once R has matched the first 101 in 110101101, the following 0 starts the remainder 01101, so it is treated as part of 011 rather than as the middle of another 101.
It seems that you only want to replace '0' by '1'. Then you can just use gsub('0', '1', '110101101')
Later on I wish to turn longer sub-sequences into sequences of 1's, such as 10001 to 11111.
Hopefully, R provides a means to generate the replacement string based on the matched substring. (This is a common feature.)
If so, search for 10+, and have the replacement string generator create a string consisting of a number of 1 characters equal to the length of the match. (e.g. If 100 is matched, replace with 111. If 1000 is matched, replace with 1111. etc.)
I don't know R in the least. Here's how it's done in some other languages in case that helps:
Perl:
$s =~ s{10+}{ "1" x length($&) }ger
Python:
re.sub(r'10+', lambda match: '1' * len(match.group()), s)
JavaScript:
s.replace(/10+/g, function(match) { return '1'.repeat(match.length) })
JavaScript (ES6):
s.replace(/10+/g, match => '1'.repeat(match.length))
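For R itself, base R can compute the replacement from each match using gregexpr() together with the replacement form of regmatches(). The following is a sketch of the same 10+ idea; the input vector is made up for illustration.

```r
s <- c("110101101", "10001", "1100")

# Locate every run of a 1 followed by one or more 0s
m <- gregexpr("10+", s)

# Replace each match by a string of 1s of the same length
regmatches(s, m) <- lapply(
  regmatches(s, m),
  function(hit) strrep("1", nchar(hit))
)
s
```

Note that 10+ does not require a 1 after the zeros, so the trailing zeros in "1100" are replaced as well.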
According to the OP
Later on I wish to turn longer sub-sequences into sequences of 1's,
such as 10001 to 11111.
If I understand correctly, the final goal is to replace any sub-sequence of consecutive 0 into the same number of 1 if they are surrounded by a 1 on both sides.
In R, this can be achieved by the str_replace_all() function from the stringr package. For demonstration and testing, the input vector contains some edge cases where substrings of 0 are not surrounded by 1.
input <- c("110101101",
"11010110001",
"110-01101",
"11010110000",
"00010110001")
library(stringr)
str_replace_all(input, "(?<=1)0+(?=1)", function(x) str_dup("1", str_length(x)))
[1] "111111111" "11111111111" "110-01111" "11111110000" "00011111111"
The regex "(?<=1)0+(?=1)" uses look behind (?<=1) as well as look ahead (?=1) to ensure that the subsequence 0+ to replace is surrounded by 1. Thus, leading and trailing subsequences of 0 are not replaced.
The replacement is computed by a function which returns a subsequence of 1 of the same length as the subsequence of 0 to replace.

Remove single value from vector leaving other occurrences of the same value

Suppose I have a large vector of integers in which a single integer can occur in the vector multiple times. I do not know the order of the values within the vector. Consider the code below: I have vector and I want to remove a single 1 to get newVector. Since the order within the vector is not known outside this example, I cannot simply use vector[-1].
vector<-c(1,1,2,2,3)
newVector<-c(1,2,2,3)
Some background: I iteratively pick two values from the vector (using sample) and then want to remove the values I picked from the vector.
Of course I could loop through the vector until I find the first occurrence of the value I wish to remove and remove it using the index, however, that is very time consuming. All the other results I found end up removing all occurrences of the value, which I don't want.
I think this would work, as which.max returns the index of the first match, which we can then remove using negative subsetting.
vector[-which.max(vector == 1)]
#[1] 1 2 2 3
Also, match does the same
vector[-match(1, vector)]
#[1] 1 2 2 3
You could use match. This finds the first occurrence of the specified value returning its index
vector<-c(1,1,2,2,3)
vector[-match(1, vector)]
# [1] 1 2 2 3
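One caveat with both approaches is a value that is not in the vector at all: match() then returns NA, and which.max() returns 1 for an all-FALSE comparison. A small guard avoids both problems. This is a sketch, and remove_first is a hypothetical helper name.

```r
# Hypothetical helper: drop only the first occurrence of `value`, if any
remove_first <- function(v, value) {
  i <- match(value, v)      # index of the first occurrence, or NA
  if (is.na(i)) v else v[-i]
}

vector <- c(1, 1, 2, 2, 3)
one_removed <- remove_first(vector, 1)
unchanged   <- remove_first(vector, 9)   # 9 is absent, so nothing is removed

# For the sampling use case, drawing positions instead of values
# sidesteps the lookup entirely
set.seed(1)
idx       <- sample(length(vector), 2)
picked    <- vector[idx]
remaining <- vector[-idx]
```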

R commands for finding mode in R seem to be wrong

I watched a video on YouTube about finding the mode of a list of numerics in R. When I enter the commands, they do not work; R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then instructor says use
temp <- table(as.vector(x))
to basically sort all unique values in list. R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given. Then he says to use command,
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list. I would like to stay with these commands as far as is possible as the instructor explains them in detail. Am I doing a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
to basically sort all unique values in list.
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, it's best not to use the words interchangeably). What table() does is to count the number of instances of each unique value (in sorted order); also, as.vector() is redundant.
R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given.
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list.
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note ==, not --) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...
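The same steps can be collected into a small function. This is a sketch, and stat_mode is a hypothetical name (base R's own mode() reports the storage mode of an object, not the statistical mode).

```r
# Hypothetical helper: return every value that attains the highest count
stat_mode <- function(x) {
  counts <- table(x)                           # count each unique value
  winners <- names(counts)[counts == max(counts)]
  as.numeric(winners)                          # names are character, convert back
}

X <- c(1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9)
single <- stat_mode(X)
tied   <- stat_mode(c(1, 1, 2, 2, 3))          # two values share the top count
```

Because the comparison counts == max(counts) can be TRUE in several places, ties come back as a vector of all modes.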

Get indices of all elements that do not include a special character element in R

I want to get the indices of all vector elements that do not include a special character, for example "5".
Example:
a<-c("2","2.34","4.5","3","5.1")
with5<-grep("5",a)
[1] 3 5
How can I get the "without5" indices?
without5<- ...
[1] 1 2 4
Use the invert argument:
a = c("2","2.34","4.5","3","5.1")
grep("5", a, invert = TRUE)
However, I would advise against dealing with numbers as characters unless there is a good reason for it.
We can also match the pattern that starts (^) with one or more characters that are not 5 ([^5]+) until the end of the string ($).
grep('^[^5]+$', a)
#[1] 1 2 4
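For comparison, here is a short sketch that puts the inverted grep() next to the equivalent grepl() form. fixed = TRUE is an addition not in the answers above; it treats "5" as a literal string, which matters once the character being searched for is a regex metacharacter such as ".".

```r
a <- c("2", "2.34", "4.5", "3", "5.1")

with5    <- grep("5", a, fixed = TRUE)                 # elements containing "5"
without5 <- grep("5", a, fixed = TRUE, invert = TRUE)  # elements not containing "5"

# Same result from a logical vector, handy when a mask is needed as well
mask        <- grepl("5", a, fixed = TRUE)
without5_lg <- which(!mask)
```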

help me understand partial matching in data.frame column names [duplicate]

I've encountered a strange behavior when dropping columns from data.frame. Initially I have:
> a <- data.frame("a" = c(1,2,3), "abc" = c(3,2,1)); print(a)
a abc
1 1 3
2 2 2
3 3 1
Now, I remove a$a from the data.frame
> a$a <- NULL; print(a)
abc
1 3
2 2
3 1
As expected, I have only abc column in my data.frame. But the strange part begins, when I try to reference deleted column a.
> print(a$a)
[1] 3 2 1
> print(is.null(a$a))
[1] FALSE
It looks like R returns value of the a$abc instead of NULL.
This happens when the beginning of the name of remaining column exactly matches the name of deleted column.
Is it a bug or do I miss something here?
From the help (?"$"):
name: A literal character string or a
name (possibly backtick quoted). For
extraction, this is normally (see
under ‘Environments’) partially
matched to the names of the object.
So that's the normal behaviour because the name is partially matched. See ?pmatch for more info about partial matching.
Cheers
Perhaps it's worth pointing out (since it didn't come up on the previous related question) that this partial matching behavior is potentially a reason to avoid using '$' except as a convenient shorthand when using R interactively (at least, it's a reason to be careful using it).
Selecting a column via dat[,'ind'] if you know the name of the column, but not the position, or via dat[,3] if you know the position, is often safer since you won't run afoul of the partial matching.
While your exact question has already been answered in the comments, an alternative that avoids this behaviour is to convert your data.frame to a tibble, which is a stripped-down version of a data.frame, without column name munging, among other things (as_tibble() replaces the older, now deprecated as_data_frame()):
library(tibble)
df_t <- as_tibble(a)
df_t
# A tibble: 3 × 1
abc
<dbl>
1 3
2 2
3 1
> df_t$a
NULL
Warning message:
Unknown column 'a'
From the R Language Definition [section 3.4.1 pg.16-17] --
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
• Character: The strings in i are matched against the names attribute of x and the resulting integers are used. For [[ and $ partial matching is used if exact matching fails, so x$aa will match x$aabb if x does not contain a component named "aa" and "aabb" is the only name which has prefix "aa". For [[, partial matching can be controlled via the exact argument which defaults to NA indicating that partial matching is allowed, but should result in a warning when it occurs. Setting exact to TRUE prevents partial matching from occurring, a FALSE value allows it and does not issue any warnings. Note that [ always requires an exact match. The string "" is treated specially: it indicates ‘no name’ and matches no element (not even those without a name). Note that partial matching is only used when extracting and not when replacing.
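The difference between the extraction operators can be checked directly on the data frame from the question after column a has been dropped. This is a minimal sketch; for data frames, [[ matches exactly unless exact = FALSE is passed.

```r
a <- data.frame(abc = c(3, 2, 1))     # state after `a$a <- NULL`

partial <- a$a                        # `$` falls back to partial matching on "abc"
exact   <- a[["a"]]                   # `[[` on a data frame matches exactly
allowed <- a[["a", exact = FALSE]]    # partial matching requested explicitly
has_col <- "a" %in% names(a)          # explicit existence check

# Optional: make `$` warn whenever it matches partially
# options(warnPartialMatchDollar = TRUE)
```

Checking names() or using [[ is a way to get a reliable "column is gone" answer without switching to tibbles.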

Resources