how do I find differences between similar strings?

I have a vector of strings (file names to be exact).
pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")
I'm looking for a simple way to find the differences between these strings.
> char_position_fun(pav) # gives the positions where the characters differ
[1] 9 12 13
> char_diff_fun(pav) # removes matching components (position and value)
[1] 3_4_5 4_4_5 4_5_0

Here is my attempt. I split each string into its characters and built a data frame for each string holding the position and the character. Then, for each position, I checked whether every string has the same character there. If check is FALSE, the characters at that position are not all identical. Finally, I kept only the rows where check is FALSE. This way you can see the position and the character together.
library(tidyverse)
strsplit(mytext, split = "") %>%
map_dfr(.x = .,
.f = function(x) enframe(x, name = "position", value = "word"),
.id = "id") %>%
group_by(position) %>%
mutate(check = n_distinct(word) == 1) %>%
filter(check == FALSE)
id position word check
<chr> <int> <chr> <lgl>
1 1 9 3 FALSE
2 1 12 4 FALSE
3 1 13 5 FALSE
4 2 9 4 FALSE
5 2 12 4 FALSE
6 2 13 5 FALSE
7 3 9 4 FALSE
8 3 12 5 FALSE
9 3 13 0 FALSE
If you want the output in the form you described, you can add a few more steps.
strsplit(mytext, split = "") %>%
map_dfr(.x = .,
.f = function(x) enframe(x, name = "position", value = "word"),
.id = "id") %>%
group_by(position) %>%
mutate(check = n_distinct(word) == 1) %>%
filter(check == FALSE) %>%
group_by(id) %>%
summarize_at(vars(position:word),
.funs = list(~paste0(., collapse = "_")))
id position word
<chr> <chr> <chr>
1 1 9_12_13 3_4_5
2 2 9_12_13 4_4_5
3 3 9_12_13 4_5_0
DATA
mytext <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_", "Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")

Here is a base R solution.
First, we can convert the strings from UTF-8 to integer code points, i.e.,
z <- Map(utf8ToInt, v)
Then find the positions where the strings differ:
pos <- unique(unlist(outer(z,z,FUN = Vectorize(function(x,y) which(x!=y)))))
> pos
[1] 9 12 13
and the characters at those positions (note that the index must be pos, the vector computed above):
word <- Map(function(x) paste(intToUtf8(x[pos], multiple = TRUE), collapse = "_"), z)
> word
$Sn_4Khz_3W_45_130_02_30cm_101mm_
[1] "3_4_5"
$Sn_4Khz_4W_45_130_02_30cm_101mm_
[1] "4_4_5"
$Sn_4Khz_4W_50_130_02_30cm_101mm_
[1] "4_5_0"
DATA
v <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_", "Sn_4Khz_4W_45_130_02_30cm_101mm_",
"Sn_4Khz_4W_50_130_02_30cm_101mm_")

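For comparison, here is a compact base R sketch of the two helpers the question asks for. The names char_position_fun and char_diff_fun come from the question; the bodies are only one possible implementation, and they assume that every string has the same number of characters.

```r
# Hypothetical implementations of the helpers named in the question.
# Assumption: all strings in x have the same number of characters.
char_position_fun <- function(x) {
  chars <- do.call(rbind, strsplit(x, split = ""))  # one row per string
  # a column (position) differs if it holds more than one distinct character
  which(apply(chars, 2, function(col) length(unique(col)) > 1))
}

char_diff_fun <- function(x) {
  chars <- do.call(rbind, strsplit(x, split = ""))
  pos <- char_position_fun(x)
  # keep only the differing positions and glue them together per string
  apply(chars[, pos, drop = FALSE], 1, paste, collapse = "_")
}

pav <- c("Sn_4Khz_3W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_45_130_02_30cm_101mm_",
         "Sn_4Khz_4W_50_130_02_30cm_101mm_")

char_position_fun(pav)
# [1]  9 12 13
char_diff_fun(pav)
# [1] "3_4_5" "4_4_5" "4_5_0"
```

If the strings can differ in length, rbind() will recycle the shorter ones with a warning, so they would need to be padded or compared pairwise instead.
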
Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~
linked_df = df2 %>%
mutate(links=case_when(links==0 & var %in% df1$var ~
sample(c(df1$ids),n(),replace=T), # unsure about this line
TRUE ~ links))

I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL

Something like this could do the trick, just a map of a filter of the first dataframe:
df2 %>%
as_tibble() %>%
mutate(links = map(var, ~sample(filter(df1, var == .)$ids), 1),
index = row_number()) %>%
unnest(links, keep_empty = TRUE) %>%
group_by(index) %>%
slice_sample(n = 1) %>%
ungroup() %>%
select(-index)
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 1
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA
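One thing to watch with sample(): when its first argument is a single number x, it samples from 1:x rather than returning x. That appears to be why c is linked to 1 in the output above, even though its only matching id in df1 is 3. Below is a minimal base R sketch that avoids this by sampling an index instead of the ids themselves. pick_one is a hypothetical helper name, and returning 0 for unmatched values follows Note 2 in the question.

```r
df1 <- data.frame(ids = 1:5,
                  var = c("a", "a", "c", "b", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var = c("c", "a", "b", "b", "d"),
                  links = 0,
                  stringsAsFactors = FALSE)

# Hypothetical helper: pick one matching id at random, or 0 if none match
pick_one <- function(key) {
  ids <- df1$ids[df1$var == key]
  if (length(ids) == 0) return(0)
  # sample an index, so a single candidate id is returned as-is
  as.numeric(ids[sample.int(length(ids), 1)])
}

set.seed(47)
df2$links <- vapply(df2$var, pick_one, numeric(1), USE.NAMES = FALSE)
df2
# links is always 3 for "c" and 0 for "d";
# "a" gets 1 or 2 and each "b" gets 4 or 5, depending on the seed
```
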

identifying last occurring duplicates in a vector in R

I would like to identify all unique values and the last occurring instances of repeated values in a vector. For example, I would like to identify the positions
c(2,3,4,6,7)
in the vector:
v <- c("m", "m", "k", "r", "l", "o", "l")
I see that
(duplicated(v) | duplicated(v, fromLast = T))
identifies all duplicated values, yet I would like to only identify the last occurring instances of duplicated elements.
How to achieve this without a loop?

Do you need:
duplicated(v)
[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE
# and for index
which(duplicated(v))
[1] 2 7
or as akrun suggests:
which(!duplicated(v, fromLast = TRUE))
[1] 2 3 4 6 7
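To see why the second form works, it may help to look at the intermediate logical vector. With fromLast = TRUE the scan runs from the end of the vector, so every copy except the last one is flagged as a duplicate; negating that leaves the unique values and the last occurrences. A small sketch (the variable names are only for illustration):

```r
v <- c("m", "m", "k", "r", "l", "o", "l")

# scanning from the end, earlier copies of "m" and "l" count as duplicates
dup_from_end <- duplicated(v, fromLast = TRUE)
dup_from_end
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

# everything that is not an earlier copy: unique values and last occurrences
last_pos <- which(!dup_from_end)
last_pos
# [1] 2 3 4 6 7
```
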

You could do something like:
library(dplyr)
v %>%
as_tibble() %>%
mutate(index = row_number()) %>%
group_by(value) %>%
mutate(id=row_number()) %>%
filter(id == max(id))
Which gives us:
# A tibble: 5 × 3
# Groups: value [5]
value index id
<chr> <int> <int>
1 m 2 2
2 k 3 1
3 r 4 1
4 o 6 1
5 l 7 2
Additionally, if you just want the index, you can do:
v %>%
as_tibble() %>%
mutate(index = row_number()) %>%
group_by(value) %>%
mutate(id=row_number()) %>%
filter(id == max(id)) %>%
pull(index)
...to get:
[1] 2 3 4 6 7

We can try
> sort(tapply(seq_along(v), v, max))
m k r o l
2 3 4 6 7
or
> unique(ave(seq_along(v), v, FUN = max))
[1] 2 3 4 7 6
or
> rev(length(v) - which(!duplicated(rev(v))) + 1)
[1] 2 3 4 6 7

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way of achieving this than using rep() on the first column and separate_rows() on the second column?
Any help would be much appreciated!

Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6

You can use the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # needed for str_c()
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

map over columns and apply custom function

Missing something small here and struggling to pass columns to function. I just want to map (or lapply) over columns and perform a custom function on each of the columns. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
r_r1 = sample(c(0,1), 8, replace = T),
r_r2 = sample(c(0,1), 8, replace = T),
r_r3 = sample(c(0,1), 8, replace = T))
df
# id r_r1 r_r2 r_r3
# 1 1 0 0 1
# 2 1 0 0 1
# 3 1 1 0 1
# 4 2 1 1 0
# 5 3 1 0 0
# 6 3 0 0 1
# 7 3 1 1 1
# 8 3 1 0 0
Here is a function that just filters and counts the unique ids remaining in the dataset:
cnt_un <- function(var) {
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
it works outside of map
cnt_un(r_r1)
# A tibble: 1 x 2
r_r1 n_uniq
<dbl> <int>
1 1 3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work, but it doesn't:
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
any suggestions? thanks

I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
cnt_un_name <- function(varname) {
var <- ensym(varname)
df %>%
filter({{var}} == 1) %>%
group_by({{var}}) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
Calling the function is a little awkward because each result keeps its own column name (calling it on "r_r1" returns the columns "r_r1" and "n_uniq", and so on). One way around this is to take the vector of column names you want, name it so that map_dfr can add an ID column, and then drop the extra columns, which will be mostly NA after binding.
grep("^r_r\\d+", names(df), value = TRUE) %>%
set_names() %>%
map_dfr(cnt_un_name, .id = "y") %>%
select(y, n_uniq)
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 3
#> 2 r_r2 2
#> 3 r_r3 2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
map(cnt_un_name) %>%
map_dfr(pivot_longer, 1, names_to = "y") %>%
select(y, n_uniq)
# same output as above
Alternatively (and maybe better and more scalable), you could do the column renaming inside the function definition.
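A minimal sketch of that last idea, under a few assumptions: cnt_un_col is a hypothetical name, it takes the column name as a string and uses the .data pronoun rather than ensym(), and df is retyped from the table printed in the question so the result does not depend on the random seed.

```r
library(dplyr)
library(purrr)

# df as printed in the question
df <- data.frame(id   = c(1, 1, 1, 2, 3, 3, 3, 3),
                 r_r1 = c(0, 0, 1, 1, 1, 0, 1, 1),
                 r_r2 = c(0, 0, 0, 1, 0, 0, 1, 0),
                 r_r3 = c(1, 1, 1, 0, 0, 1, 1, 0))

# Hypothetical variant: the column name is stored in y inside the function,
# so every result has the same two columns and can be row-bound directly
cnt_un_col <- function(varname) {
  df %>%
    filter(.data[[varname]] == 1) %>%
    summarise(y = varname, n_uniq = n_distinct(id))
}

res <- grep("^r_r\\d+$", names(df), value = TRUE) %>%
  map(cnt_un_col) %>%
  bind_rows()
res
#      y n_uniq
# 1 r_r1      3
# 2 r_r2      2
# 3 r_r3      2
```
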

Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
X <- subset(df, df[,i] == 1)
row <- data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
y n_uniq
1 r_r1 2
2 r_r2 3
3 r_r3 2

Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
df %>%
pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
filter(vals == 1) %>%
group_by(y) %>%
summarise(n_uniq = n_distinct(id)) %>%
ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#> y n_uniq
#> <chr> <int>
#> 1 r_r1 2
#> 2 r_r2 3
#> 3 r_r3 2

skipping elements with Map() and match() in R

I'd like to recode the values in the Df1 data frame using the Df2 data frame so that I end up with a data frame like Df3.
The current code almost does the trick, but there are two problems. First, it introduces NA when there's no match, e.g. there is no match in Df2 for the Df1 aed_bloodpr value "1,2", so the value becomes NA. Second, when a variable in Df1 can't be mapped to Df2, the code won't run (error message).
I have looked into the nomatch argument for match() and the .default argument for Map(), but I can't figure out how to use them so that I end up with Df3.
Starting point:
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
End point:
Df3 <- data.frame("aed_bloodpr" = c("1,2","normal","high","high"),
"aed_gluco" = c("normal","elevated","NA","normal"),
"add_bmi" = c("above","5,7","below","normal"),
"add_asthma"=c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Current code:
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))

You need to clean up before you can relabel. The actual relabeling is more easily accomplished by a join. Here using the tidyverse (translate as you like):
library(tidyverse)
Df1 <- data.frame("aed_bloodpr" = c("1,2","2","1","1"),
"aed_gluco" = c("2","1","3","2"),
"add_bmi" = c("2","5,7","7","5"),
"add_asthma" = c("2","2","7","5"),
"nausea" = c("3","3","4","5"))
Df2 <- data.frame("NameOfVariable" = c("aed_bloodpr","aed_bloodpr","aed_gluco","aed_gluco","aed_gluco","add_bmi","add_bmi","add_bmi"),
"VariableLevel" = c(1,2,1,2,3,2,5,7),
"VariableDef" = c("high","normal","elevated","normal","NA","above","normal","below"))
Df1_long <- Df1 %>%
mutate_all(as.character) %>% # change factors to strings
rowid_to_column('i') %>% # add row index to enable later long-to-wide reshape
gather(variable, value, -i) %>% # reshape to long form
separate_rows(value, convert = TRUE) # unnest nested values and convert to numeric
str(Df1_long)
#> 'data.frame': 22 obs. of 3 variables:
#> $ i : int 1 1 2 3 4 1 2 3 4 1 ...
#> $ variable: chr "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" "aed_bloodpr" ...
#> $ value : int 1 2 2 1 1 2 1 3 2 2 ...
Df2_clean <- Df2 %>%
mutate_if(is.factor, as.character) %>% # change factors to strings
mutate_all(na_if, 'NA') # change "NA" to NA
Df3 <- Df1_long %>%
left_join(Df2_clean, by = c('variable' = 'NameOfVariable', # merge
'value' = 'VariableLevel')) %>%
mutate(VariableDef = coalesce(VariableDef, as.character(value))) %>% # combine labels and values
group_by(i, variable) %>%
summarise(value = toString(VariableDef)) %>% # re-aggregate multiple values
spread(variable, value) # reshape to wide form
Df3
#> # A tibble: 4 x 6
#> # Groups: i [4]
#> i add_asthma add_bmi aed_bloodpr aed_gluco nausea
#> * <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 2 above high, normal normal 3
#> 2 2 2 normal, below normal elevated 3
#> 3 3 7 below high 3 4
#> 4 4 5 normal high normal 5
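If you would rather stay close to the Map() and match() code from the question, a base R sketch along the following lines keeps the original value wherever there is no match and skips columns that have no entry in Df2. recode_col and lookup are hypothetical names, and stringsAsFactors = FALSE is set explicitly as an assumption so the columns are character vectors on older R versions too.

```r
Df1 <- data.frame(aed_bloodpr = c("1,2", "2", "1", "1"),
                  aed_gluco   = c("2", "1", "3", "2"),
                  add_bmi     = c("2", "5,7", "7", "5"),
                  add_asthma  = c("2", "2", "7", "5"),
                  nausea      = c("3", "3", "4", "5"),
                  stringsAsFactors = FALSE)
Df2 <- data.frame(NameOfVariable = c("aed_bloodpr", "aed_bloodpr", "aed_gluco",
                                     "aed_gluco", "aed_gluco", "add_bmi",
                                     "add_bmi", "add_bmi"),
                  VariableLevel  = c(1, 2, 1, 2, 3, 2, 5, 7),
                  VariableDef    = c("high", "normal", "elevated", "normal",
                                     "NA", "above", "normal", "below"),
                  stringsAsFactors = FALSE)

# one small lookup table per variable name
lookup <- split(Df2[2:3], Df2[[1]])

# Hypothetical helper: recode one column, leaving unmatched values untouched
recode_col <- function(x, name) {
  tab <- lookup[[name]]
  if (is.null(tab)) return(x)          # no lookup table: skip this column
  hit <- match(x, tab$VariableLevel)   # NA where the value has no match
  ifelse(is.na(hit), x, tab$VariableDef[hit])
}

Df3 <- data.frame(Map(recode_col, Df1, names(Df1)), stringsAsFactors = FALSE)
Df3
```

Unlike the join-based answer, this leaves multi-valued cells such as "1,2" and "5,7" as they are, which matches the Df3 shown in the question.
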
