Partial String Matching by Row - r

I'm trying to create a column in a data frame that holds the number of characters that match between two strings, counting from the left side of both strings.
Each row has a comparison string, which we want to test against a user-given string. Given a data frame:
df <- data.frame(x=c("yhf", "rnmqjk", "wok"), y=c("yh", "rnmj", "ok"))
       x    y
1    yhf   yh
2 rnmqjk rnmj
3    wok   ok
Where x is our comparison string and y is our given string, I'm looking to have the values 2, 3, 0 output in column z, like so:
       x    y z
1    yhf   yh 2
2 rnmqjk rnmj 3
3    wok   ok 0
Essentially, I want the given string (y) checked from left to right against the comparison string (x). As soon as the characters stop lining up, the rest of the string should be ignored and the number of matches so far recorded.
Thank you in advance!

This code works for your example:
df$z <- mapply(function(x, y) which.max(x != y),
               strsplit(as.character(df$x), split=""),
               strsplit(as.character(df$y), split="")) - 1
df
       x    y z
1    yhf   yh 2
2 rnmqjk rnmj 3
3    wok   ok 0
As an outline, strsplit splits a string vector into a list of character vectors. Here, each element of a vector is a single character (because of the split="" argument). The which.max function returns the first position where its argument reaches the maximum of the vector. Since the vectors returned by x != y are logical, which.max returns the first position where a difference is observed. mapply takes a function and lists, and applies the function to corresponding elements of the lists.
Note that this produces warnings that the lengths of the strings don't match. This could be addressed in a couple of ways; the easiest is wrapping the function in suppressWarnings if the messages bug you.
As the OP notes in the comments, if the entire word matches, x != y is FALSE everywhere, so which.max returns 1 and z ends up as 0. To return the length of the string instead, I'd add a second line of code that combines logical subsetting with the nchar function:
df$z[as.character(df$x) == as.character(df$y)] <-
  nchar(as.character(df$x[as.character(df$x) == as.character(df$y)]))
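If the warnings and the separate fix-up line bother you, here is a minimal sketch of a variant that avoids both. It truncates each pair of character vectors to the shorter length before comparing, so nothing is recycled, and it returns the compared length when no mismatch is found. The name common_prefix_len is made up for this example; it also covers cases such as x = "abab", y = "ab", where recycling would otherwise hide the fact that y is a full prefix of x.

```r
# Hypothetical helper (name chosen for illustration only):
# number of leading characters shared by each pair of strings.
common_prefix_len <- function(x, y) {
  mapply(function(a, b) {
    n <- min(length(a), length(b))      # compare only the overlapping part
    if (n == 0) return(0L)              # an empty string shares nothing
    mismatch <- a[seq_len(n)] != b[seq_len(n)]
    # first mismatch position minus one, or the whole overlap if none
    if (any(mismatch)) which.max(mismatch) - 1L else n
  },
  strsplit(as.character(x), split = ""),
  strsplit(as.character(y), split = ""),
  USE.NAMES = FALSE)
}

df <- data.frame(x = c("yhf", "rnmqjk", "wok", "abab"),
                 y = c("yh", "rnmj", "ok", "ab"))
df$z <- common_prefix_len(df$x, df$y)
df$z
# [1] 2 3 0 2
```

Because the comparison never runs past the shorter string, no length warnings are raised and no suppressWarnings is needed.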

Related

Find position of closest value to another value given a condition in R

Let's say I have a vector that increases and then decreases, like the simple example below. I want to identify the position (index) in the vector that is closest to a value, but with the condition that the following value must be lower (I always want to pick up the closest value on the downslope of the data).
In the example below, I want the answer to be 13 (rather than 6).
I can't think of a solution using which.min() or match.closest() which would reliably work for this.
Any help gratefully received!
# example vector which increases then decreases
vector <- c(1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1)
# index
index <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
value <- 6.2
Maybe you can use cummax + rev like below
which.min(abs(rev(cummax(rev(vector)))-value))
which gives
[1] 13
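To see why this works, it may help to look at the intermediate result. The sketch below assumes, as in the example, that the data never rise again after the peak. The names flattened and closest_idx are only there for illustration.

```r
vector <- c(1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1)
value <- 6.2

# Running maximum taken from the right-hand end: every element up to the
# last peak is replaced by the peak value, so only the downslope keeps
# its original values.
flattened <- rev(cummax(rev(vector)))
flattened
# [1] 9 9 9 9 9 9 9 9 9 9 8 7 6 5 4 3 2 1

# The upslope can no longer be closest to 6.2, so which.min lands on the
# downslope.
closest_idx <- which.min(abs(flattened - value))
closest_idx
# [1] 13
```

If the downslope had local bumps, cummax would flatten those as well, so the result would need checking in that case.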
Assuming your points always continue to decrease in value after the first decrease, and value is between the point of the first decrease and the last point, you could do this:
closest <- function(value, vec, next_is){
  lead_fun <- function(x) c(tail(x, -1), NA)
  meets_cond <- get(next_is)(lead_fun(vec), vec)
  which.min(abs(vec[meets_cond] - value)) + which.max(meets_cond) - 1
}
closest(6.2, vec = vector, next_is = '<')
# [1] 13
Check which elements in the vector meet your condition, find the index of the closest element in that vector, then add back the number of elements before the first which meets your condition.
Edit: ----------------------------------------
Another version of the function which accepts an arbitrary logical vector which is TRUE for indices meeting a condition:
closest <- function(value, vec, cond_vec){
  which.min(abs(vec[cond_vec] - value)) + which.max(cond_vec) - 1
}
Note that this assumes the values matching your condition are all in one contiguous region (not e.g. the first matches, then the third, then the sixth, etc.)
If your condition is that the point comes after the max value:
closest(6.2, vec = vector, cond_vec = seq_along(vector) > which.max(vector))
# [1] 13

Extract first digit from each element of a numeric vector in R

Similar to another question, I would like to extract the first digit from each element of a numeric vector, without having to turn it into a character vector and back.
d <- c(123, 2, 45)
Expected Output:
[1] 1 2 4
I tried different stuff with floor(), but without the desired result.
One numerical approach here would be to divide each input number by 10 raised to the floor of log base 10. This means that, for example, we divide an input of 123 by 100, to yield 1.23. Then, we take the floor of that to yield the first digit 1.
getFirstDigit <- function(x) {
  floor(x / (10 ^ floor(log10(x))))
}
d <- c(123, 2, 45)
getFirstDigit(d)
[1] 1 2 4
The more brute force way of doing this would be to cast the input vector to character, take the first character, and then cast back to a number. But, I doubt doing it that way would outperform what I have above.
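One caveat with the log10 approach: it returns NaN for 0 and for negative inputs, because log10 is not finite or not defined there. If your data can contain such values, a sketch along the following lines might help. The name first_digit is hypothetical, and treating the first digit of 0 as 0 is an assumption about what you would want.

```r
# Hypothetical variant (not from the answer above): works on the absolute
# value and treats 0 as having first digit 0.
first_digit <- function(x) {
  x <- abs(x)                              # the sign carries no digit
  out <- floor(x / 10^floor(log10(x)))     # same idea as getFirstDigit
  out[x == 0] <- 0                         # log10(0) is -Inf, so patch zeros
  out
}

first_digit(c(123, 2, 45, -789, 0))
# [1] 1 2 4 7 0
```

This is still floating-point arithmetic, so values that sit extremely close to a power of ten may deserve a spot check.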

How to find if two or more consecutive elements of a vector are equal in R

I want to find a way to determine whether two or more consecutive elements of a vector are equal.
For example, in the vector x=c(1,1,1,2,3,1,3), the first, second and third elements are equal.
With the following command, I can check whether a vector, say y, contains two or more consecutive elements that are equal to 2 or 3:
all(rle(y)$lengths[which( rle(y)$values==2 | rle(y)$values==3 )]==1)
Is there any other faster way?
EDIT
Let's say we have the vector z=c(1,1,2,1,2,2,3,2,3,3).
I want a vector with three elements as output. The first element refers to the value 1, the second to 2 and the third to 3. An element of the output vector is 1 if z contains two or more consecutive elements equal to the corresponding value, and 0 otherwise. So the output for the vector z will be (1,1,1).
For the vector w=c(1,1,2,3,2,3,1) the output will be (1,0,0), since only the value 1 occurs in two consecutive positions, namely the first and second positions of w.
I'm not entirely sure if I'm understanding your question as it could be worded better. The first part just asks how you find if continuous elements in a vector are equal. The answer is to use the diff() function combined with a check for a difference of zero:
z <- c(1,1,2,1,2,2,3,2,3,3)
sort(unique(z[which(diff(z) == 0)]))
# [1] 1 2 3
w <- c(1,1,2,3,2,3,1)
sort(unique(w[which(diff(w) == 0)]))
# [1] 1
But your edit example seems to imply you want to know which values are repeated consecutively in a vector that only contains the integers 1, 2, or 3. Your output will always be X, Y, Z, where
X is 1 if there is at least one "1" repeated, else 0
Y is 2 if there is at least one "2" repeated, else 0
Z is 3 if there is at least one "3" repeated, else 0
Is this correct?
If so, see the following
continuously <- function(x){
  s <- sort(unique(x[which(diff(x) == 0)]))
  output <- c(0,0,0)
  output[s] <- s
  return(output)
}
continuously(z)
# [1] 1 2 3
continuously(w)
# [1] 1 0 0
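If you want plain 0/1 flags as in the question's edit (1,1,1 rather than 1 2 3), a small variation might look like the sketch below. The name has_run and the values argument are made up for this example, and it assumes the set of values to check is known in advance.

```r
# Hypothetical helper: for each value of interest, 1 if it occurs in at
# least two consecutive positions of x, otherwise 0.
has_run <- function(x, values = 1:3) {
  repeated <- x[-1][diff(x) == 0]   # elements equal to their predecessor
  as.integer(values %in% repeated)
}

z <- c(1,1,2,1,2,2,3,2,3,3)
w <- c(1,1,2,3,2,3,1)
has_run(z)
# [1] 1 1 1
has_run(w)
# [1] 1 0 0
```

Using %in% keeps the result tied to the values you ask about, regardless of the order in which repeats appear in x.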
Assuming your series is z=c(1,1,2,1,2,2,3,2,3,3), you can do:
(unique(z[c(FALSE, diff(z) == 0)]) >= 0)+0
which outputs 1, 1, 1.
When you run the same command on your other sequence:
w=c(1,1,2,3,2,3,1)
(unique(w[c(FALSE, diff(w) == 0)]) >= 0)+0
it returns 1.
You may also try this for an exact output like 1,1,1 or 1,0,0:
(unique(z[c(FALSE, diff(z) == 0)]) == unique(z))+0 #1,1,1 for z and 1,0,0 for w
Note that == compares element by element and recycles the shorter side, so this last form depends on the repeated values appearing in the same order as in unique(z).
Logic:
diff takes the difference between each item and the one before it. Since there is always one fewer difference than there are items, I prepend FALSE so the logical vector lines up with the original sequence. Subsetting the sequence with it keeps the items that equal their predecessor, and unique drops the duplicates. Finally, we convert them to 1s by asking whether they are greater than or equal to 0 (any other condition that is always true would also give a series of 1s).
This assumes your sequence doesn't have negative numbers.

Split a column with varying delimiters into 2, along with unfortunate fraction structure

I am fairly new to R and programming in general. I was given a data set to work with that unfortunately was structured fairly rough.
It is in the form of
W-X/Y"-Z
The first number being inches, however for values <1 inch it is simply
X/Y"-Z
I need a way to:
a) split Z off (the number after the last "-" delimiter),
as well as
b) convert the W-X/Y" or X/Y" value to its decimal equivalent.
So 1-1/2" to just 1.5
So split the original column into 2 columns, one with the Z value, and one with the decimal inches value. As shown below
input      length  bin
3-1/2"-14  3.5     14
3/4"-20    .75     20
We can split the 'input' column at the last - or at the " character to get a list output. Loop over the list (with lapply), remove the blank elements (x[nzchar(x)]), replace the remaining - with +, use eval(parse(text = ...)) to evaluate the resulting expression (e.g. 3+1/2) to a number, concatenate it with the second value, rbind the list elements, and assign (<-) the result to create two new columns.
df1[c("length", "bin")] <- do.call(rbind, lapply(
  strsplit(df1$input, '-(?=[^-]+$)|"', perl=TRUE),
  function(x) {
    x1 <- x[nzchar(x)]
    c(eval(parse(text=sub("-", "+", x1[1]))), as.numeric(x1[2]))
  }))
df1
# input length bin
#1 3-1/2"-14 3.50 14
#2 3/4"-20 0.75 20
NOTE: If the "input" column is of class factor, convert it to character before passing it to strsplit, i.e. strsplit(as.character(df1$input), ...).
data
df1 <- data.frame(input=c('3-1/2"-14', '3/4"-20'), stringsAsFactors=FALSE)
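If you would rather not evaluate text as code, the same result can be reached with plain string substitution and arithmetic. This is only a sketch: the function name split_inches is made up, and it assumes every entry has the form W-X/Y"-Z or X/Y"-Z as described in the question (an entry without a fraction, such as 2"-14, would not be handled correctly).

```r
# Hypothetical alternative without eval(parse()).
# Assumes each input looks like W-X/Y"-Z or X/Y"-Z.
split_inches <- function(input) {
  input <- as.character(input)
  bin <- as.numeric(sub('^.*-', '', input))       # text after the last "-"
  len <- sub('"-[^-]*$', '', input)               # "3-1/2" or "3/4"
  has_whole <- grepl('-', len, fixed = TRUE)
  whole <- as.numeric(ifelse(has_whole, sub('-.*$', '', len), '0'))
  frac <- sub('^.*-', '', len)                    # "1/2" or "3/4"
  num <- as.numeric(sub('/.*$', '', frac))        # numerator
  den <- as.numeric(sub('^.*/', '', frac))        # denominator
  data.frame(input = input, length = whole + num / den, bin = bin,
             stringsAsFactors = FALSE)
}

res <- split_inches(c('3-1/2"-14', '3/4"-20'))
res$length
# [1] 3.50 0.75
res$bin
# [1] 14 20
```

The trade-off is a few more lines in exchange for never running parsed text, which matters if the column might contain unexpected content.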

Extract digit from numeric in r

I would like to extract the first digit after a decimal place from a numeric vector in R. Is there a way to do this without turning it into a character string? For example:
x <- c(1.0,1.1,1.2)
I would like the function to return a vector:
0,1,2
thanks.
There'll be a bunch of ways, but here's one:
(x %% 1)*10
# [1] 0 1 2
This assumes there's only ever one digit after the decimal place. If that's not the case:
floor((x %% 1)*10)
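Be aware that floating-point representation can bite here: 1.2 %% 1 is stored as slightly less than 0.2, so floor((1.2 %% 1)*10) gives 1 rather than 2. A sketch that nudges the value by a small tolerance before flooring is shown below. The name first_decimal and the tolerance of 1e-9 are assumptions made for this example, not anything built into R.

```r
# Hypothetical helper: first digit after the decimal point, with a small
# tolerance to absorb floating-point representation error.
first_decimal <- function(x, tol = 1e-9) {
  floor(abs(x) * 10 + tol) %% 10
}

x <- c(1.0, 1.1, 1.2, 2.37)
first_decimal(x)
# [1] 0 1 2 3
```

The tolerance should stay far smaller than the precision of your data; otherwise a value like 1.1999999999 would be pushed up to the next digit.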
