I'm trying to learn R, and a sample problem asks me to reverse only the parts of a string that are in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R, based on the pattern shown in the example. We split the string into individual characters ('v1'), use match to find each character's position in the alphabet (letters), take the difference between adjacent positions, and check whether it differs from 1. The cumulative sum of that logical vector gives a grouping variable ('i1') that starts a new group wherever the alphabetical run breaks. We then reverse (rev) the characters within each group and paste them together to get the expected output.
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
Or as #alexislaz mentioned in the comments
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on #alexislaz's comments
2) Updated with another method suggested by #alexislaz in the comments
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the positions of each letter in the sequence of letters ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20)
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.
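The steps described above can be wrapped into one reusable function. This is only a minimal sketch: the name reverse_runs is hypothetical (it does not appear in either answer), and the is.na() check is an added assumption so that characters outside letters (digits, upper case) simply break a run instead of producing NA.

```r
# Hypothetical helper: reverse every run of consecutive alphabet letters in s
reverse_runs <- function(s) {
  chars <- strsplit(s, "")[[1]]
  pos <- match(chars, letters)      # alphabet position, NA for non-lowercase characters
  step <- diff(pos)
  # a new run starts wherever the step from the previous character is not exactly +1
  new_run <- c(TRUE, is.na(step) | step != 1L)
  grp <- cumsum(new_run)            # run id for every character
  paste(ave(chars, grp, FUN = rev), collapse = "")
}

reverse_runs("abctextdefgtext")
# [1] "cbatextgfedtext"
```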
Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this data.frame for similar strings (e.g. ABBA and ABA, or STING and STRING) and make them the same (it doesn't matter whether ABBA or ABA; either is fine) without having to know the variations in advance? My actual data.frame is very big, so it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked at agrep and stringdist, but the examples I found either compare two data.frames or require naming the column, which I can't do since I have many of them.
Does anyone have an idea? Many thanks!
Best regards,
Steffi
This worked for me but there might be a better solution
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
  if (nrow(y) < 1) { # if y is empty return data
    return(x)
  } else {
    similar <- agrepl(y$s[1], y$s) # find similar occurring strings
    x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
    y <- setdiff(y, y[similar,])
    special(x, y, grp+1)
  }
}
desired <- special(desired,df,grp)
To change the stringency of string similarity, change max.distance like agrepl(x,y,max.distance=0.5)
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
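As an alternative to the recursive approach above, similar strings can also be grouped in base R by clustering pairwise edit distances. This is a different technique from the answer, sketched under assumptions: unify_similar and max_dist are hypothetical names, adist comes from utils, and single-linkage clustering is used, which can chain strings together (A close to B, B close to C) even when A and C are far apart.

```r
# Hypothetical helper: replace each string by the first-seen member of its cluster
unify_similar <- function(v, max_dist = 1) {
  u <- unique(v)
  if (length(u) < 2) return(v)
  d <- adist(u)                                   # pairwise edit distances
  tree <- hclust(as.dist(d), method = "single")
  grp <- cutree(tree, h = max_dist + 0.5)         # merge strings within max_dist edits
  canon <- u[match(grp, grp)]                     # first member of each cluster
  canon[match(v, u)]
}

unify_similar(c("ABBA", "ABA", "STING", "STRING"))
# [1] "ABBA"  "ABBA"  "STING" "STING"
```

To cover many columns without naming them, the helper could be applied to every character column, for example with lapply over the data.frame; how well one tolerance suits all columns is something to check on the real data.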
In the example below, I would like to know the number of 010 sequences, or the number of 1010 sequences. Below is a workable example:
x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
In this example, the number of 010 sequences would be 6 and the number of 1010 sequences would be 4.
What would be the most efficient/simplest way to count the number of consecutive sequences?
A stringless way:
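f = function(x, patt){
  # returns the start indices of every occurrence of patt in x
  if (length(x) == length(patt)) return(if (all(x == patt)) 1L else integer(0))
  w = head(seq_along(x), 1L-length(patt))  # candidate start positions
  for (k in seq_along(patt)) w <- w[ x[w + k - 1L] == patt[k] ]
  w
}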
length(f(x, patt = c(0,1,0))) # 6
length(f(x, patt = c(1,0,1,0))) # 4
Alternatives. From #cryo11, here's another way:
# embed() returns each window in reverse order, hence rev(patt)
function(x,patt) sum(apply(embed(x,length(patt)),1,function(x) all(!xor(x,rev(patt)))))
or another variation:
function(x,patt) sum(!colSums( xor(rev(patt), t(embed(x,length(patt)))) ))
or with data.table:
library(data.table)
setkey(setDT(shift(x, seq_along(patt), type = "lead")))[as.list(patt), .N]
(The shift function is very similar to embed.)
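One detail worth checking when using embed: each row of its result holds a window with the most recent element first, so a pattern has to be reversed before it is compared. For 010 and 1010 on this data the count happens to be the same either way, but for other patterns it can differ. A minimal sketch, where count_windows is a hypothetical helper name:

```r
# Hypothetical helper: count (overlapping) occurrences of patt in x using embed()
count_windows <- function(x, patt) {
  windows <- embed(x, length(patt))   # one row per window, elements in reverse order
  target <- rev(patt)                 # reverse the pattern to match that layout
  sum(apply(windows, 1, function(r) all(r == target)))
}

x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
count_windows(x, c(0,1,0))     # 6
count_windows(x, c(1,0,1,0))   # 4
count_windows(c(0,0,1,0,0,1), c(0,0,1))   # 2
```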
Another solution would be this:
library(stringr)
x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
xx = paste0(x, collapse = "")
str_count(xx, '(?<=010)')
[1] 6
str_count(xx, '(?<=1010)')
[1] 4
As #Pierre Lafortune pointed out in the comments this can be done without using any packages:
length(gregexpr("(?<=010)", xx, perl=TRUE)[[1]])
[1] 6
Logic: take every substring with the same length as the pattern you are searching for and compare it with the pattern.
xx = paste0(x, collapse = "")
# [1] "1001000111001010101010"
# case 1 :
xxx = "010"
sum(sapply(1:(length(x)-nchar(xxx)+1), function(i) substr(xx,i,i+nchar(xxx)-1)==xxx))
# [1] 6
# case 2 :
xxx = "1010"
sum(sapply(1:(length(x)-nchar(xxx)+1), function(i) substr(xx,i,i+nchar(xxx)-1)==xxx))
# [1] 4
R introduced the startsWith function in 3.3.0. Using this and substring, we can implement #joel.wilson's method as
sum(startsWith(substring(paste(x, collapse=""),
head(seq_along(x), -2), tail(seq_along(x), -2)), "010"))
Here, substring constructs every substring of three adjacent characters and startsWith tests whether each of these is equal to "010". The TRUE values are then summed.
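The same substring idea can be generalised to a pattern of any length. A minimal sketch, assuming every element of x and of the pattern is a single digit so that each element maps to exactly one character; count_pattern is a hypothetical name:

```r
# Hypothetical helper: count overlapping occurrences of patt in x via substring()
count_pattern <- function(x, patt) {
  k <- length(patt)
  n <- length(x)
  if (n < k) return(0L)
  s <- paste(x, collapse = "")
  target <- paste(patt, collapse = "")
  starts <- seq_len(n - k + 1L)               # every possible start position
  sum(substring(s, starts, starts + k - 1L) == target)
}

x <- c(1,0,0,1,0,0,0,1,1,1,0,0,1,0,1,0,1,0,1,0,1,0)
count_pattern(x, c(0,1,0))     # 6
count_pattern(x, c(1,0,1,0))   # 4
```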
I have a vector:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
Each element contains two numbers, separated by ",". I would like to get the indexes of the elements containing the number 1.
The expected index list is:
1, 6, 7, 9, 10
grep() will work nicely for this. By default, it returns the indices of the matched pattern.
grep("^1,|,1$", lst)
# [1] 1 6 7 9 10
The regular expression ^1,|,1$ looks to match a string that
^1, = starts with 1,
| OR
,1$ = ends with ,1
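The same pattern can be built for any number by treating it as a whole comma-separated field. A minimal sketch, where which_has is a hypothetical helper name and value is assumed to consist of digits only (no regex metacharacters):

```r
# Hypothetical helper: indices of elements that contain 'value' as a complete field
which_has <- function(v, value) {
  pattern <- paste0("(^|,)", value, "(,|$)")  # field starts at ^ or "," and ends at "," or $
  grep(pattern, v)
}

lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
which_has(lst, 1)
# [1]  1  6  7  9 10
which_has(lst, 10)
# [1] 2 5
```

Anchoring both sides of the field is what keeps "11,0" and "7,10" from matching when searching for 1.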
Each element contains two numbers. My answer is not ideal, but I got what I need.
m <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",1)))
n <- as.numeric(unlist(lapply(strsplit(as.character(lst), "\\,"),"[[",2)))
sort(unique(c(which(m==1),which(n==1))))
Depending on background and context of this task it might be prudent to turn this vector into a data.frame:
lst <- c("2,1","7,10","11,0","7,0","10,0","1,1","1,0","4,0","4,1","0,1","6,0")
DF <- read.table(text = do.call(paste, list(lst, collapse = "\n")), sep = ",")
which(DF$V1 == 1L | DF$V2 == 1L)
#[1] 1 6 7 9 10
I have a vector as below
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
Here I want to extract the first number before the "X" for each of the elements.
When there are two "X"s, as in "6X2X75ML", the result should be the product of the numbers before them: 12 (6 multiplied by 2).
expected output
6, 24, 12, 168
Thank you for the help...
Here's a possible solution using regular expressions :
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
# this regular expression finds any group of digits followed
# by an upper-case 'X' in each string and returns a list of the matches
tokens <- regmatches(data,gregexpr('[[:digit:]]+(?=X)',data,perl=TRUE))
res <- sapply(tokens,function(x)prod(as.numeric(x)))
> res
[1] 6 24 12 168
Here is a method using base R:
dataList <- strsplit(data, split="X")
sapply(dataList, function(x) Reduce("*", as.numeric(head(x, -1))))
[1] 6 24 12 168
strsplit breaks up each element of the vector along "X". The resulting list is fed to sapply, which then performs an operation on all but the final element of each vector in the list. The operation is to convert the elements to numeric and then multiply them. The final element is dropped using head(x, -1).
As #zheyuan-li comments, prod can fill in for Reduce and will probably be a bit faster:
sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
[1] 6 24 12 168
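To make the intermediate steps of the strsplit approach visible, here is a small walk-through on the third element. It is only a sketch; the names parts and counts are illustrative, and vapply is used in place of sapply so the result type is fixed.

```r
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")

parts <- strsplit(data, split = "X")
parts[[3]]             # "6" "2" "75ML"
head(parts[[3]], -1)   # "6" "2"  (the trailing size is dropped)

# multiply whatever is left in front of the last piece
counts <- vapply(parts, function(p) prod(as.numeric(head(p, -1))), numeric(1))
counts
# [1]   6  24  12 168
```

Note that an element containing no "X" at all would yield 1 here, because prod of an empty vector is 1; whether that is acceptable depends on the data.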
We can also use str_extract_all
library(stringr)
sapply(str_extract_all(data, "\\d+(?=X)"), function(x) prod(as.numeric(x)))
#[1] 6 24 12 168
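# position of the first "X" and the number in front of it
ind=regexpr("X",data)
val=as.integer(substr(data, 1, ind-1))
# remainder of each string after the first "X"
data2=substring(data,ind+1)
# does the remainder start with another "<digits>X"?
ind2=regexpr("[0-9]+X", data2)
if (any(ind2==1)) {
  # multiply in the second number for those elements only
  val2 = as.integer(substr(data2[ind2==1], 1, attr(ind2,"match.length")[ind2==1]-1))
  val[ind2==1] = val[ind2==1] * val2
}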
I have a column in a data.frame where each observation is a string of numbers (e.g. "1,5,6,7,0,21"). I am attempting to calculate the difference for the first instance of non-consecutive numbers. In the above example the result would be 5 - 1 = 4. However, with the code I currently have I get 6. If my input is "1,2,0,21" I get the correct result of 21 - 2 = 19 (the numbers are sorted before subtraction occurs). I thought maybe the zero was the issue, but adding one to all values did not solve the issue. Perhaps a problem with my indexing? Any suggestions?
# find distance between number in first gap of non-consecutive numbers
b <- c("1,5,6,7,0,21") # does not work as desired result is 6 instead of 4
# b <- ("1,2,0,21") # works as desired
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
#b.Uncomma <- b.Uncomma + 1 # same result
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1:2] # get the positions/index on either side of the first gap
b.Result <- b.Uncomma[(b.FirstGap[2])] - b.Uncomma[(b.FirstGap[1])] # subtract to get result
inp <- scan(text=b,sep=",")
#Read 6 items
sinp <- sort(inp)
diff(sinp)
#[1] 1 4 1 1 14
> diff(sinp)[diff(sinp) != 1][1]
#[1] 4
Try:
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1] # get the positions/index of the first gap
b.Result <- b.Uncomma[(b.FirstGap+1)] - b.Uncomma[(b.FirstGap)] # subtract to get result
b.Result
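Since the question mentions a whole data.frame column of such strings, the logic above can be wrapped in a function and applied to each element. A minimal sketch: first_gap is a hypothetical name, and returning NA when all numbers are consecutive is an assumption about the desired behaviour.

```r
# Hypothetical helper: size of the first jump larger than 1 in the sorted, de-duplicated numbers
first_gap <- function(s) {
  v <- sort(unique(as.numeric(strsplit(s, ",")[[1]])))
  d <- diff(v)
  gaps <- d[d != 1]
  if (length(gaps) == 0) NA_real_ else gaps[1]
}

first_gap("1,5,6,7,0,21")   # 4
first_gap("1,2,0,21")       # 19

# applied to several strings at once, e.g. a data.frame column
vapply(c("1,5,6,7,0,21", "1,2,0,21", "1,2,3"), first_gap, numeric(1), USE.NAMES = FALSE)
# [1]  4 19 NA
```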