finding repeated letters in a string - r

I have a character vector in which I want to detect repeated letters. For example,
A <- c('A-B-A-B-C', 'A-B-C-D', 'A-B-A-C-D-E-F', 'A-B-A-B')
I want to create a vector B which takes a value of 0 if there is no repetition of letters, 1 otherwise.
B <- c('1','0','1','1')

You can combine strsplit and anyDuplicated in base R to get close to what you want.
sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated)
[1] 3 0 3 3
anyDuplicated returns the index of the first duplicated element, or 0 if there are no duplicates. As @rich-scriven notes, adding fixed=TRUE to the strsplit call should increase efficiency, because the separator is then matched literally (no regular expressions involved).
You could wrap this in pmin to get your desired result:
pmin(sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated), 1)
[1] 1 0 1 1
or, as @rich-scriven notes, use sign to convert the values:
sign(sapply(strsplit(A, "-", fixed=TRUE), anyDuplicated))
[1] 1 0 1 1
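If the result should be a character vector like the B shown in the question, the same idea can be wrapped in a small helper. This is only a sketch: has_repeat is a made-up name, and it assumes the letters are always separated by a single literal "-".

```r
A <- c("A-B-A-B-C", "A-B-C-D", "A-B-A-C-D-E-F", "A-B-A-B")

# has_repeat is a hypothetical helper, not part of base R.
# It returns TRUE for each string that contains a repeated token.
has_repeat <- function(x, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)
  # anyDuplicated() is 0 when there is no duplicate, a positive index otherwise
  vapply(parts, function(p) anyDuplicated(p) > 0L, logical(1))
}

# Convert TRUE/FALSE to "1"/"0" to match the desired character vector
B <- as.character(as.integer(has_repeat(A)))
B
```

vapply is used instead of sapply so that the return type is always logical, even for zero-length input.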

Related

Is there a way in R to count the number of substrings in a string enclosed in square brackets, all substrings are separated by commas and are quoted?

['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, i.e. 4 here as output. I want to do this for an entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or we may also extract the alphanumeric characters into a list and get the lengths
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
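The comma-based counts above return 1 for an empty list such as "[]", because they count separators rather than items. If that case matters, one possible alternative is to count the quoted items directly. The sketch below uses only base R; count_quoted is a hypothetical name, and it assumes items are wrapped in single quotes and contain no embedded single quotes.

```r
str1 <- "['ax', 'byc', 'crm', 'dop']"

# count_quoted is a hypothetical helper: it counts '...' items per string.
count_quoted <- function(x) {
  # gregexpr() finds every single-quoted run; regmatches() extracts them
  lengths(regmatches(x, gregexpr("'[^']*'", x)))
}

counts <- count_quoted(c(str1, "[]"))
counts
```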

How to read a comma-separated numerical string and perform various functions on it

I have a column with numerical comma-separated strings, e.g., '0,1,17,200,6,0,1'.
I want to create new columns for the sums of those numbers (or substrings) in the strings that are not equal to 0.
I can use something like this to count the non-zero numbers in the whole string:
df1$F1 <- sapply(strsplit(df1$a, ","), function(x) length(which(x>0)))
[1] 5
This outputs 5, which is correct: the example string '0,1,17,200,6,0,1' contains 5 non-zero substrings.
The challenge, however, is to restrict the count to a given number of substrings. For example, how can I get the count for only the first 3 or 6 substrings in the string?
You can use gsub with a backreference to cut the string to the desired length before you count how many substrings are > 0:
DATA:
df1 <- data.frame(a = "0,1,17,200,6,0,1")
df1$a <- as.character(df1$a)
SOLUTION:
First cut the string to however many substrings you want (here, the first three numeric substrings, the first two of which are followed by a comma) and store the result in a new column:
df1$a_3 <- gsub("^(\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
df1$a_3
[1] "0,1,17"
Now insert the new vector into your sapply statement to count how many substrings are greater than 0:
sapply(strsplit(df1$a_3, ","), function(x) length(which(x>0)))
[1] 2
To vary the number of substrings, vary the number of repetitions of \\d+ in the pattern accordingly. For example, this works for 6 substrings:
df1$a_6 <- gsub("^(\\d+,\\d+,\\d+,\\d+,\\d+,\\d+)(.*)", "\\1", df1$a)
sapply(strsplit(df1$a_6, ","), function(x) length(which(x>0)))
[1] 4
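Writing out one \\d+ per substring gets unwieldy as the number grows. A possible alternative is to split first and keep the first n pieces with head. Note also that strsplit returns character vectors, so x > 0 compares strings; converting with as.numeric makes the comparison numeric. The following is a sketch only, and count_positive_first_n is a hypothetical helper name.

```r
df1 <- data.frame(a = "0,1,17,200,6,0,1", stringsAsFactors = FALSE)

# Hypothetical helper: count values > 0 among the first n substrings.
count_positive_first_n <- function(s, n) {
  vapply(
    strsplit(s, ",", fixed = TRUE),
    # head() keeps at most n pieces; as.numeric() makes the comparison numeric
    function(x) sum(as.numeric(head(x, n)) > 0),
    integer(1)
  )
}

count_positive_first_n(df1$a, 3)
count_positive_first_n(df1$a, 6)
```

If n is larger than the number of substrings, head simply returns all of them, so the whole string is counted.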
EDIT TO ACCOUNT FOR NEW SET OF QUESTIONS:
To compute the maximum value of the substrings, shown here for the whole string df1$a (for the restricted strings, use the relevant column instead, e.g., df1$a_3 or df1$a_6):
First split the string with strsplit, flatten the resulting list with unlist, and convert the resulting vector from character to numeric, storing the result in a vector, e.g., string_a:
string_a <- as.numeric(unlist(strsplit(df1$a, ",")))
string_a
[1] 0 1 17 200 6 0 1
On that vector you can perform all sorts of functions, including max for the maximum value, and sum for the sum of the values:
max(string_a)
[1] 200
sum(string_a)
[1] 225
To count the values that are equal to 0, change the condition in your sapply statement to x == 0:
sapply(strsplit(df1$a, ","), function(x) length(which(x == 0)))
[1] 2
Hope this helps!
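Note that unlist pools the numbers of all rows into one vector, which is fine for a single string but not for a column with several rows. A per-row variant might look like the sketch below; df2 and its second row are invented here purely for illustration.

```r
# df2 is made-up example data with two rows
df2 <- data.frame(a = c("0,1,17,200,6,0,1", "3,0,5"), stringsAsFactors = FALSE)

# One numeric vector per row, kept as a list instead of being unlisted
nums <- lapply(strsplit(df2$a, ",", fixed = TRUE), as.numeric)

df2$max    <- vapply(nums, max, numeric(1))
df2$sum    <- vapply(nums, sum, numeric(1))
df2$n_zero <- vapply(nums, function(x) sum(x == 0), integer(1))
df2
```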

How to remove only numbers from string

I have the following data frame in R
ID Village_Name
1 23
2 Name-23
3 34
4 Vasai2
5 23
I want to remove only the rows whose Village_Name consists entirely of numbers; my desired data frame would be
ID Village_Name
1 Name-23
2 Vasai2
How can I do it in R?
We can use grepl to match one or more digits from the start (^) to the end ($) of the string and negate (!) the result, so that elements consisting only of digits become FALSE and all others TRUE
i1 <- !grepl("^[0-9]+$", df1$Village_Name)
df1[i1, ]
Based on the OP's post, it could be also
data.frame(ID = head(df1$ID, sum(i1)), Village_Name = df1$Village_Name[i1])
# ID Village_Name
#1 1 Name-23
#2 2 Vasai2
Another option is to convert the column to numeric, so that non-numeric elements become NA (with a warning), and then turn that into a logical vector with is.na. If the column is a factor, wrap it in as.character first.
df1[is.na(as.numeric(df1$Village_Name)),]
Here is another option using gsub:
df1[nchar(gsub("\\d+", "", df1$Village_Name)) > 0, ]
The basic idea is to strip off all digits from the Village_Name column, then assert that there is at least one character remaining, which would imply that the entry is not entirely numerical.
But I would probably go with the grepl option given by @akrun in practice.
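For reference, here is a self-contained sketch of the grepl approach on the data from the question. The data frame is rebuilt by hand from the table shown there, and renumbering the ID column is an assumption about what the desired output implies.

```r
# Data reconstructed from the table in the question
df1 <- data.frame(
  ID = 1:5,
  Village_Name = c("23", "Name-23", "34", "Vasai2", "23"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose name is not made up entirely of digits
keep <- !grepl("^[0-9]+$", df1$Village_Name)

res <- df1[keep, ]
res$ID <- seq_len(nrow(res))   # assumption: IDs restart at 1, as in the desired output
rownames(res) <- NULL
res
```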

Sum number in a character string (R)

I have a vector that looks like :
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
(not always the same number of "/" character)
How can I create another vector with the sum of the numbers of each string?
Something like :
sum
3
3
4
4
4
One solution with strsplit and sapply:
sapply(strsplit(numbers, '/'), function(x) sum(as.numeric(x)))
#[1] 3 3 4 4 4
strsplit will split your strings on / (it doesn't matter how many /s you have). The output of strsplit is a list, so we iterate over it with sapply to calculate each sum.
What seems to me to be the most straightforward approach here is to convert your number strings to actual valid string arithmetic expressions, and then evaluate them in R using eval along with parse. Hence, the string 1/0/2 would become 1+0+2, and then we can simply evaluate that expression.
sapply(numbers, function(x) { eval(parse(text=gsub("/", "+", x))) })
1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
3 3 4 4 4
1) strapply. strapply matches each string of digits using \\d+ and then applies as.numeric to it, returning a list with one vector of numbers per input string. We then apply sum to each of those vectors. This solution seems particularly short.
library(gsubfn)
sapply(strapply(numbers, "\\d+", as.numeric), sum)
## [1] 3 3 4 4 4
2) read.table. This applies sum(read.table(...)) to each string. It is a bit longer (but still only one line of code) but uses no packages.
sapply(numbers, function(x) sum(read.table(text = x, sep = "/")))
## 1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
## 3 3 4 4 4
Add the USE.NAMES = FALSE argument to sapply if you don't want names on the output.
scan(textConnection(x), sep = "/", quiet = TRUE) could be used in place of read.table but is longer.
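For comparison, a type-stable base R variant of the strsplit approach could look like the sketch below. It assumes every piece between the slashes is a valid number; sum_parts is a hypothetical name.

```r
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")

# Hypothetical helper: split on a literal separator and sum each piece
sum_parts <- function(x, sep = "/") {
  vapply(
    strsplit(x, sep, fixed = TRUE),
    function(p) sum(as.numeric(p)),
    numeric(1)          # vapply guarantees a plain numeric vector back
  )
}

sums <- sum_parts(numbers)
sums
```

Unlike the eval(parse(...)) variant, this never evaluates the strings as code, which is the safer choice when the input is not fully trusted.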

Sum of Numbers in string in R separated by Vertical Bar

I have a string having value as given below separated by vertical bar.
String1 <- "5|10|25|25|10|10|10|5"
String2 <- "5|10|25|25"
Is there a direct function to get the sum of the numbers in a string?
In this case it should be 100 for String1 and 65 for String2, and I have a character vector of such strings.
>chk
chk
1 5|10|25|25|10|10|10|5
2 5|55|20|5|5|5|5
3 6
4 Not Available
> sum(scan(text=gsub("\\Not Available\\b", "NA", chk$chk), sep="|", what = numeric(), quiet=TRUE), na.rm = TRUE)
[1] 206
whereas it should be
[1] 100 100 6 NA
We can do a scan and then sum
sum(scan(text=String1, sep="|", what = numeric(), quiet=TRUE))
For multiple vectors, place it in a list and do the same operation
sapply(mget(paste0("String", 1:2)), function(x)
sum(scan(text=x, sep="|", what=numeric(), quiet=TRUE)))
# String1 String2
# 100 65
Another option is eval(parse(..)) (not recommended, though) after replacing each | with +
eval(parse(text=gsub("[|]", "+", String1)))
#[1] 100
Or, as @thelatemail mentioned in the comments, assign (<-) + to | and then do the eval(parse(..))
`|` <- `+`
eval(parse(text=String1))
#[1] 100
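Reassigning | at the top level changes the meaning of logical OR for the rest of the session. If this trick is used at all, one way to contain it is to do the reassignment inside a function, so that the override disappears when the function returns. This is a sketch only; sum_bar is a hypothetical name, and eval(parse(..)) should still be avoided on untrusted input.

```r
String1 <- "5|10|25|25|10|10|10|5"

# Hypothetical helper: the override of `|` lives only in this function's frame
sum_bar <- function(s) {
  `|` <- `+`
  # eval() evaluates in the calling frame by default, i.e. inside sum_bar,
  # so the local `|` is the one that gets used
  eval(parse(text = s))
}

total <- sum_bar(String1)
total

# Outside the function, `|` is still the ordinary logical OR
TRUE | FALSE
```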
If we have a data.frame column with strings, then it may be better to split by | into a list of vectors, convert the vectors to numeric (all the non-numeric elements are coerced to NA with a friendly warning), and get the sum with na.rm=TRUE
sapply(strsplit(as.character(chk$chk), "[|]"),
function(x) sum(as.numeric(x), na.rm=TRUE))
#[1] 100 100 6 0
NOTE: The as.character is not needed if the 'chk' column is already a character class
Otherwise, if we are using scan or eval(parse, it should be done for each element.
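The strsplit version returns 0 for the "Not Available" row, while the question asks for NA there. One possible adjustment is to return NA when a row contains no numeric values at all. The sketch below rebuilds chk from the question; row_sum is a hypothetical helper name.

```r
# Data reconstructed from the question
chk <- data.frame(
  chk = c("5|10|25|25|10|10|10|5", "5|55|20|5|5|5|5", "6", "Not Available"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum one string, or NA if nothing in it is numeric
row_sum <- function(s) {
  x <- suppressWarnings(as.numeric(strsplit(s, "|", fixed = TRUE)[[1]]))
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

res <- vapply(chk$chk, row_sum, numeric(1), USE.NAMES = FALSE)
res
```

suppressWarnings hides the coercion warning for non-numeric pieces; drop it if you would rather be told when such values occur.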
We can extract all the numbers from the string and then sum over it
library(stringr)
sum(as.numeric(unlist(str_match_all(String1, "[0-9]+"))))
#[1] 100
sum(as.numeric(unlist(str_match_all(String2, "[0-9]+"))))
#[1] 65
For multiple vectors we can keep it in a list
sapply(list(String1, String2), function(x)
sum(as.numeric(unlist(str_match_all(x, "[0-9]+")))))
#[1] 100 65
