String matching two dataframes with a sliding window - r

I have two data frames.
df1
col1
1 a
2 b
3 c
4 c
df2
setID col1
1 1 a
2 1 b
3 1 b
4 1 a
5 2 w
6 2 v
7 2 c
8 2 b
9 3 a
10 3 a
11 3 b
12 3 a
13 4 a
14 4 b
15 4 c
16 4 a
I'm using the following code to match them.
scorematch <- function ()
{
require("dplyr")
#to make sure every element is preceded by the one before that element
combm <- rev(sapply(rev(seq_along(df1$col1)), function(i) paste0(df1$col1[i-1], df1$col1[i])));
tempdf <- df2
#group the history by their ID
tempdf <- group_by(tempdf, setID)
#collapse strings in history
tempdf <- summarise(tempdf, ss = paste(col1, collapse = ""))
tempdf <- rowwise(tempdf)
#add score based on how it matches compared to path
tempdf <- mutate(tempdf, score = sum(sapply(combm, function(x) sum(grepl(x, ss)))))
tempdf <- ungroup(tempdf)
#filter so that only IDs with scores more than 0 are available
tempdf <- filter(tempdf, score != 0)
tempdf <- pull(tempdf, setID)
#filter original history to reflect new history
tempdf2 <- filter(df2, setID %in% tempdf)
tempdf2
}
This code works great. But I want to take this further. I want to apply a sliding window function to get the df1 values I want to match against df2. So far I'm using this function as my sliding window.
slidingwindow <- function(data, window, step)
{
#data is dataframe with colname
total <- length(data)
#spots are start of each window
spots <- seq(from=1, to=(total-step), by=step)
result <- vector(length = length(spots))
for(i in 1:length(spots)){
...
}
return(result)
}
The scorematch function will be nested inside slidingwindow function. I'm unsure how to proceed from there though. Ideally df1 will be split into windows. Starting from the first window it will be matched against df2 using the scorematch function to get a filtered out df2. Then I want the second window of df1 to match against the newly filtered df2 and so on. The loop should end when df2 has been filtered down so that it contains only 1 distinct setID value. The final output can either be the whole filtered df2 or just the remaining setID.
Ideal output would be either
setID col1
1 4 a
2 4 b
3 4 c
4 4 a
or
[1] "4"

Here is a solution without using a for-loop. I use stringr because of its nice consistent syntax, purrr for map (although lapply would be sufficient in this case) and dplyr to group_by setID and collapse the strings for each group.
library(dplyr)
library(purrr)
library(stringr)
First I collapse the string for each group. This makes it easier to use pattern matching with str_detect later:
df2_collapse <- df2 %>%
group_by(setID) %>%
summarise(string = str_c(col1, collapse = ""))
df2_collapse
# A tibble: 4 x 2
# setID string
# <int> <chr>
# 1 1 abba
# 2 2 wvcb
# 3 3 aaba
# 4 4 abca
The "look-up" string is collapsed as well, and then the substrings (i.e. sliding windows) are extracted with str_sub. Here I work along the length of the string (str_length) and, for each starting letter, extract every substring that begins there.
string <- str_c(df1$col1, collapse = "")
string
# [1] "abcc"
substrings <-
unlist(map(1:str_length(string), ~ str_sub(string, start = .x, end = .x:str_length(string))))
substrings
# [1] "a" "ab" "abc" "abcc" "b" "bc" "bcc" "c" "cc" "c"
Store the substrings in a tibble with their length as score.
substrings <- tibble(substring = substrings,
score = str_length(substrings))
substrings
# A tibble: 10 x 2
# substring score
# <chr> <int>
# 1 a 1
# 2 ab 2
# 3 abc 3
# 4 abcc 4
# 5 b 1
# 6 bc 2
# 7 bcc 3
# 8 c 1
# 9 cc 2
# 10 c 1
For each setID we extract the maximum score among the substrings it matches, and then keep only the row with the highest score across all setIDs.
df2_collapse %>%
mutate(score = map_dbl(string,
~ max(substrings$score[str_detect(.x, substrings$substring)]))) %>%
filter(score == max(score))
# A tibble: 1 x 3
# setID string score
# <int> <chr> <dbl>
# 1 4 abca 3
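If you would rather not depend on the tidyverse, the same scoring idea can be sketched in base R. This is only an illustration of the steps above: the helper names all_windows and window_scores are made up here, and the data is rebuilt inline so the block runs on its own.

```r
# Data from the question, rebuilt so the block is self-contained.
df1 <- data.frame(col1 = c("a", "b", "c", "c"))
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("a", "b", "b", "a", "w", "v", "c", "b",
           "a", "a", "b", "a", "a", "b", "c", "a")
)

# Hypothetical helper: every contiguous substring ("window") of s.
all_windows <- function(s) {
  n <- nchar(s)
  starts <- rep(seq_len(n), times = rev(seq_len(n)))
  ends <- unlist(lapply(seq_len(n), function(i) i:n))
  unique(substring(s, starts, ends))
}

# One collapsed string per setID, named by setID.
sets <- vapply(split(df2$col1, df2$setID), paste, character(1), collapse = "")

# Hypothetical helper: length of the longest window found in each set (0 if none).
window_scores <- function(pattern, sets) {
  windows <- all_windows(pattern)
  vapply(sets, function(s) {
    hit <- vapply(windows, grepl, logical(1), x = s, fixed = TRUE)
    if (any(hit)) max(nchar(windows[hit])) else 0L
  }, integer(1))
}

scores <- window_scores(paste(df1$col1, collapse = ""), sets)
best <- names(scores)[scores == max(scores)]
best
```

Using fixed = TRUE treats each window as a literal string rather than a regular expression, which seems safer if the values can contain characters such as "." or "*".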
Data
df1 <- structure(list(col1 = c("a", "b", "c", "c")),
class = "data.frame", row.names = c("1", "2", "3", "4"))
df2 <-
structure(list(setID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
col1 = c("a", "b", "b", "a", "w", "v", "c", "b", "a", "a", "b", "a", "a", "b", "c", "a")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"))
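The question also describes a step-by-step variant: take fixed-size windows of df1 one after another and keep narrowing df2 until a single setID is left. Below is a rough base R sketch of that idea, not a tested replacement for the functions in the question. The function name narrow_sets and the defaults window = 2 and step = 1 are assumptions, as is the choice to skip a window that matches none of the remaining sets instead of ending up with an empty result.

```r
# Data from the question, rebuilt so the block is self-contained.
df1 <- data.frame(col1 = c("a", "b", "c", "c"))
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("a", "b", "b", "a", "w", "v", "c", "b",
           "a", "a", "b", "a", "a", "b", "c", "a")
)

# One collapsed string per setID, named by setID.
sets <- vapply(split(df2$col1, df2$setID), paste, character(1), collapse = "")

# Hypothetical helper: filter the candidate sets window by window.
narrow_sets <- function(pattern, sets, window = 2, step = 1) {
  stopifnot(length(pattern) >= window)
  starts <- seq(1, length(pattern) - window + 1, by = step)
  remaining <- sets
  for (s in starts) {
    w <- paste(pattern[s:(s + window - 1)], collapse = "")
    hit <- grepl(w, remaining, fixed = TRUE)
    # Assumption: a window that matches nothing is skipped, not applied.
    if (any(hit)) remaining <- remaining[hit]
    # Stop as soon as only one candidate setID is left.
    if (length(remaining) == 1) break
  }
  names(remaining)
}

last_id <- narrow_sets(df1$col1, sets)
last_id
df2[df2$setID %in% last_id, ]
```

With the example data the first window ("ab") leaves setIDs 1, 3 and 4, and the second window ("bc") leaves only setID 4, so the loop stops there. If several sets survive every window, the function returns all of them.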

Related

How to subset R data frame based on duplicates in one column and unique values in another

This seems pretty straightforward but I am stumped. I have a data frame that looks like this:
df1 %>% head()
values paired
<chr> <int>
1 apples 1
2 x 1
3 oranges 2
4 z 2
5 bananas 3
6 y 3
7 apples 4
8 p 4
I would like to create a new data frame that extracts all paired values based on a search criteria. So if I want all pairs that correspond to apples I would like to end up with something like this:
df1 %>% head()
values paired
<chr> <int>
1 apples 1
2 x 1
3 apples 4
4 p 4
I have tried using:
new_pairs <- df1 %>%
arrange(values, paired) %>%
filter(duplicated(paired) == TRUE,
values=="apples")
But I am only getting the apples rows back.
You'll need to group on the paired variable before filtering.
How about:
df1 %>%
group_by(paired) %>%
filter("apples" %in% values) %>%
ungroup()
Result:
# A tibble: 4 x 2
values paired
<chr> <int>
1 apples 1
2 x 1
3 apples 4
4 p 4
Your data:
df1 <- structure(list(values = c("apples", "x", "oranges", "z", "bananas", "y", "apples", "p"),
paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
Here is another tidyverse possibility. I filter for the rows that have apples and also keep the rows that immediately follow apples.
library(tidyverse)
df %>%
filter((values == "apples" |
row_number() %in% c(which(
values == "apples", arr.ind = TRUE
) + 1)))
Output
values paired
1 apples 1
2 x 1
3 apples 4
4 p 4
Here is a data.table option (subset is only used to change the order of the columns):
library(data.table)
dt <- as.data.table(df)
subset(dt[, .SD[any(values == "apples")], by = paired], select = c("values", "paired"))
values paired
1: apples 1
2: x 1
3: apples 4
4: p 4
Data
df <-
structure(list(
values = c("apples", "x", "oranges", "z", "bananas",
"y", "apples", "p"),
paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
),
class = "data.frame",
row.names = c(NA,-8L))
In base R, find the values of the pairs of interest
pairs = subset(df1, values %in% "apples")$paired
and create a subset of the data
subset(df1, paired %in% pairs)
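The two base R steps can also be folded into a single logical index with ave(), which applies a function within each group and returns a result as long as the input. This is just a sketch; the variable names has_target and keep are mine, and the data is rebuilt inline so the block runs on its own.

```r
df1 <- data.frame(
  values = c("apples", "x", "oranges", "z", "bananas", "y", "apples", "p"),
  paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
)

# TRUE for rows that are "apples" themselves.
has_target <- df1$values == "apples"

# Spread that flag to every row of the same pair: any() is evaluated per group.
keep <- as.logical(ave(has_target, df1$paired, FUN = any))

result <- df1[keep, ]
result
```

Unlike the approach that keeps the row following each apples row, this does not depend on the order of rows within a pair.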

How to add a common number to rows that have same value in another column?

After years of using the advice you have given to other users, here is my so far unsolvable issue...
I have a dataset with thousands of rows and hundreds of columns, in which one column can hold values shared by several rows. Here is a subset of my dataset:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")
mat <- cbind(ID, Dose, Value)
What I want is to assign the same code to all rows that share a value in the "Value" column, like this:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("153254", "258634", "896411", "258634", "896411")
Code <- c("1", "2", "3", "2", "3")
mat <- cbind(ID, Dose, Value, Code)
Does anyone have an idea that could help me a little?
Thanks !
We may use match here
library(dplyr)
mat %>%
mutate(Code = match(Value, unique(Value)))
-output
ID Dose Value Code
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
data
mat <- data.frame(ID, Dose, Value)
You should consider using a data.frame:
mat <- data.frame(ID, Dose, Value)
Using dplyr you could create the desired output:
library(dplyr)
mat %>%
group_by(Value) %>%
mutate(Code = cur_group_id()) %>%
ungroup()
This returns
# A tibble: 5 x 4
ID Dose Value Code
<chr> <chr> <chr> <int>
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
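One thing to keep in mind: the two approaches only agree here because the values first appear in sorted order. match(Value, unique(Value)) numbers the values by first appearance, whereas a grouping-based id follows the sorted group keys (as far as I know this is also how cur_group_id() behaves after group_by()). A small base R sketch with made-up, unsorted values shows the difference, using factor() as a stand-in for the sorted numbering:

```r
# Made-up values whose first appearances are not in sorted order.
Value <- c("x2", "x1", "x2", "x3", "x1")

# Codes in order of first appearance: x2 -> 1, x1 -> 2, x3 -> 3.
code_first_seen <- match(Value, unique(Value))

# Codes in sorted order of the distinct values: x1 -> 1, x2 -> 2, x3 -> 3.
code_sorted <- as.integer(factor(Value))

data.frame(Value, code_first_seen, code_sorted)
```

Either numbering gives identical codes to identical values; pick whichever ordering you need downstream.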

Rename columns of a dataframe based on another dataframe except columns not in that dataframe in R

Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA: as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I managed to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7. These are positions in df2, and indices 6 and 7 do not exist in df1, which has only 5 columns.
To index df1, swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will work.
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])
df4
# E Q R Z G
# 1 1 2 3 4 5
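A base R alternative that avoids na.omit altogether is to look up every column name once and fall back to the old name wherever the lookup returns NA. This is a minimal sketch that uses the reordered df2 from the answer above to show that the order of df2 does not matter; idx is just an illustrative variable name.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(
  Col1 = c("H", "I", "K", "A", "B", "C", "D"),
  Col2 = c("a1", "a2", "a3", "E", "Q", "R", "Z"),
  stringsAsFactors = FALSE
)

# Position of each df1 column name in df2$Col1 (NA when it is not listed).
idx <- match(names(df1), df2$Col1)

# Keep the old name where there is no match, otherwise take the new one.
names(df1) <- ifelse(is.na(idx), names(df1), df2$Col2[idx])
df1
```

Because both sides of ifelse() have one entry per column of df1, the new names always line up with the right columns.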

Find index based on the minimum number for every group

I want to extract the row index of the minimum number within every Group:
Group <- c("A","A","A","A","A","B","B","C","C","C","C")
Number <- c(12,45,15,65,54,21,23,12,3,5,6)
data.frame(Group,Number)
Group Number
1 A 12
2 A 45
3 A 15
4 A 65
5 A 54
6 B 21
7 B 23
8 C 12
9 C 3
10 C 5
11 C 6
The result should be a vector that contains the indices:
Answer
vector <- c(1,6,9)
Create a sequence column, grouped by 'Group', summarise by returning the corresponding row number based on the index of min value of 'Number' (which.min) and pull the column as a vector
library(dplyr)
df1 %>%
mutate(rn = row_number()) %>%
group_by(Group) %>%
summarise(n = rn[which.min(Number)]) %>%
pull(n)
#[1] 1 6 9
data
df1 <- structure(list(Group = c("A", "A", "A", "A", "A", "B", "B", "C",
"C", "C", "C"), Number = c(12L, 45L, 15L, 65L, 54L, 21L, 23L,
12L, 3L, 5L, 6L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11"))
Does this work for you?
library(dplyr)
df %>%
mutate(row_n = row_number()) %>%
group_by(Group) %>%
slice_min(Number)
# A tibble: 3 x 3
# Groups: Group [3]
Group Number row_n
<chr> <dbl> <int>
1 A 12 1
2 B 12 7
3 C 3 8
The row numbers are in column row_n. If you only want the row numbers in the output, add %>% ungroup() %>% select(-c(1:2)) like so:
df %>%
mutate(row_n = row_number()) %>%
group_by(Group) %>%
slice_min(Number) %>%
ungroup() %>%
select(-c(1:2))
# A tibble: 3 x 1
row_n
<int>
1 1
2 7
3 8
Data:
Group <- c("A","A","A","A","A","B","B","C","C","C","C")
Number <- c(12,45,65,54,21,23,12,3,5,6,34)
df <- data.frame(Group,Number)
This function returns the index i of the smallest value in v
FUN = function(v, i) i[which.min(v)]
Here are the values by group
v = split(df$Number, df$Group)
and the index into the original data.frame by group
i = split(seq_along(df$Number), df$Group)
Apply our function to each group
mapply(FUN, v, i)
In one go:
FUN = function(v, i) i[which.min(v)]
v = split(df$Number, df$Group)
i = split(seq_along(df$Number), df$Group)
mapply(FUN, v, i)
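The split/mapply steps can be compressed into one tapply() call over the row indices. Here is a small sketch using the data from the question (not the modified data of the previous answers); min_rows is an illustrative name.

```r
Group <- c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C", "C")
Number <- c(12, 45, 15, 65, 54, 21, 23, 12, 3, 5, 6)

# For each group, take its row indices i and keep the one where Number is smallest.
idx <- tapply(seq_along(Number), Group, function(i) i[which.min(Number[i])])

# Drop the group names to get a plain vector of row indices.
min_rows <- as.vector(idx)
min_rows
```

which.min() returns only the first minimum, so a group with tied minima contributes a single index.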

How to swap row values in the same column of a data frame?

I have a data frame that looks like the following:
ID Loc
1 N
2 A
3 N
4 H
5 H
I would like to swap A and H in the column Loc while not touching rows that have values of N, such that I get:
ID Loc
1 N
2 H
3 N
4 A
5 A
This dataframe is the result of a pipe so I'm looking to see if it's possible to append this operation to the pipe.
You could try:
df$Loc <- chartr("AH", "HA", df$Loc)
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
We can try chaining together two calls to ifelse, for a base R option:
df <- data.frame(ID=c(1:5), Loc=c("N", "A", "N", "H", "H"), stringsAsFactors=FALSE)
df$Loc <- ifelse(df$Loc=="A", "H", ifelse(df$Loc=="H", "A", df$Loc))
df
ID Loc
1 1 N
2 2 H
3 3 N
4 4 A
5 5 A
If you have a factor, you could simply swap those levels
l <- levels(df$Loc)
l[l %in% c("A", "H")] <- c("H", "A")
levels(df$Loc) <- l
df
# ID Loc
# 1 1 N
# 2 2 H
# 3 3 N
# 4 4 A
# 5 5 A
Data:
df <- structure(list(ID = 1:5, Loc = structure(c(3L, 1L, 3L, 2L, 2L
), .Label = c("A", "H", "N"), class = "factor")), .Names = c("ID",
"Loc"), class = "data.frame", row.names = c(NA, -5L))
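Since the question asks for something that can be appended to a pipe, one option is to wrap the swap in a small function and call it inside transform() (or mutate()). This is only a sketch: swap_values is a made-up helper name, and unlike chartr() it compares whole values, so it should also work for labels longer than one character.

```r
# Hypothetical helper: exchange two values in a vector, leaving the rest alone.
swap_values <- function(x, a, b) {
  x <- as.character(x)
  out <- x
  out[x == a] <- b
  out[x == b] <- a
  out
}

df <- data.frame(ID = 1:5, Loc = c("N", "A", "N", "H", "H"),
                 stringsAsFactors = FALSE)

res <- transform(df, Loc = swap_values(Loc, "A", "H"))
res

# Assuming R >= 4.1, the same step can close a pipeline:
# df |> transform(Loc = swap_values(Loc, "A", "H"))
```

Both assignments index on the original x rather than on out, which is what keeps the second replacement from undoing the first.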
