How to find common strings across several files

How to find common strings across several files - r

I have data like this:
df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA,
-17L))
df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ",
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
and
df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-10L))
I want to know how many strings are common across all these 3 data frames.
If it was two, I could do it with match and join function but I want to know how many are shared between df1 and df2 and df3 or a combination.

example (if only identical strings count for duplicates):
library(dplyr)
df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))
bind_rows(df1, df2, df3, .id = "origin") %>%
group_by(origin) %>%
distinct(test) %>% ## remove within-dataframe duplicates
group_by(test) %>%
summarise(replicates = n()) %>%
filter(replicates > 1)

Here is an update in case only identical strings are wished:
library(dplyr)
bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
filter(duplicated(test) | duplicated(test, fromLast=TRUE))
id test
1 df1 STOMQ
2 df1 TMEIL1
3 df1 YIPF1
4 df2 STOMQ
5 df2 TMEIL1
6 df2 YIPF1
7 df3 STOMQ
8 df3 YIPF1
First answer:
Here is a suggestion:
First bring all dataframes in a list of dataframes with an identifier and arrange by the the string. Now you could check visually:
library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>%
arrange(test)
To automate the process you have to use a kind of string distance, there are some different out there and I can't tell which one is better or more appropriate. One example is Jaccard_index https://en.wikipedia.org/wiki/Jaccard_index
Here we use the Jaro-Winkler distance: Learned here: How to group similar strings together in a database in R
in the group column you could find the similar strings:
You can define what does similar mean, by changing the value of "jw". Try and change it from 0.4 to 0.1 then you will see that the groups change:
library(tidyverse)
library(stringdist)
map_dfr(x$test, ~ {
i <- which(stringdist(., x$test, "jw") < 0.40)
tibble(index = i, title = x$test[i])
}, .id = "group") %>%
distinct(index, .keep_all = T) %>%
mutate(group = as.integer(group)) +
bind_cols(df_id = x$id)
group index title df_id
<int> <int> <chr> <chr>
1 1 1 BRsts df3
2 2 2 GFD df3
3 3 3 RSEST df3
4 3 31 TRS df2
5 3 32 TRSF df3
6 4 4 SNTLK df1
7 4 5 SNTLKM df2
8 4 6 SNTM1 df1
9 4 8 STOLA df1
10 4 12 STR2 df2
# ... with 27 more rows

Related

Group dataframe row and column wise based on other dataframe?

I have a dataframe that I would like to group in both directions, first rowise and columnwise after. The first part worked well, but I am stuck with the second one. I would appreciate any help or advice for a solution that does both steps at the same time.
This is the dataframe:
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
This is the second dataframe, which holds the "recipe" for grouping:
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
Rowise grouping:
df1_grouped<-bind_cols(df1[1:2], map_df(df2, ~rowSums(df1[unique(.x)])))
Now i would like to apply the same grouping to the ID2 column and sum the values in the other columns. My idea was to mutate a another column (e.g. "group", which contains the name of the final group of ID2. After this i can use group_by() and summarise() to calculate the sum for each. However, I can't figure out an automated way to do it
bind_cols(df1_grouped,
#add group label
data.frame(
group = rep(c("Group_2","Group_1","Group_1","Group_2","Group_3"),2))) %>%
#remove temporary label column and make ID a character column
mutate(ID2=group,
ID=as.character(ID))%>%
select(-group) %>%
#summarise
group_by(ID,ID2)%>%
summarise_if(is.numeric, sum, na.rm = TRUE)
This is the final table I need, but I had to manually assign the groups, which is impossible for big datasets

I will offer such a solution
library(tidyverse)
set.seed(1)
df1 <- data.frame(
ID = c(rep(1,5),rep(2,5)),
ID2 = rep(c("A","B","C","D","E"),2),
A = rnorm(10,20,1),
B = rnorm(10,50,1),
C = rnorm(10,10,1),
D = rnorm(10,15,1),
E = rnorm(10,5,1)
)
df2 <- data.frame (
Group_1 = c("B","C"),
Group_2 = c("D","A"),
Group_3 = ("E"), stringsAsFactors = FALSE)
df2 <- df2 %>% pivot_longer(everything())
df1 %>%
pivot_longer(-c(ID, ID2)) %>%
mutate(gr_r = df2$name[match(ID2, table = df2$value)],
gr_c = df2$name[match(name, table = df2$value)]) %>%
arrange(ID, gr_r, gr_c) %>%
pivot_wider(c(ID, gr_r), names_from = gr_c, values_from = value, values_fn = list(value = sum))

Writing for loop in r to combine columns that has matching names (with little variance)

I have a data frame where column names are duplicated once. Now I need to combine them to get a proper data set. I can use dplyr select command to extract matching columns and combine them later. However, I wish to achieve it using for loop.
#Example data frame
x <- c(1, NA, 3)
y <- c(1, NA, 4)
x.1 <- c(NA, 3, NA)
y.1 <- c(NA, 5, NA)
data <- data.frame(x, y, x1, y1)
##with `dplyr` I can do like
t1 <- data%>%select(contains("x"))%>%
mutate(x = rowSums(., na.rm = TRUE))%>%
select(x)
t2 <- data%>%select(contains("y"))%>%
mutate(y = rowSums(., na.rm = TRUE))%>%
select(y)
data <- cbind(t1,t2)
This is cumbersome as I have more than 25 similar columns
How to achieve the same result using for loop by matching columns names and perform rowSums. Or even simple approach using dplyr will also help.

We can use split.default to split based on the substring of the column names into a list and then apply the rowSums
library(dplyr)
library(stringr)
library(purrr)
data %>%
split.default(str_remove(names(.), "\\.\\d+")) %>%
map_dfr(rowSums, na.rm = TRUE)
# A tibble: 3 x 2
# x y
# <dbl> <dbl>
#1 1 1
#2 3 5
#3 3 4
If we want to use a for loop
un1 <- unique(sub("\\..*", "", names(data)))
out <- setNames(rep(list(NA), length(un1)), un1)
for(un in un1) {
out[[un]] <- rowSums(data[grep(un, names(data))], na.rm = TRUE)
}
as.data.frame(out)
data
data <- structure(list(x = c(1, NA, 3), y = c(1, NA, 4), x.1 = c(NA,
3, NA), y.1 = c(NA, 5, NA)), class = "data.frame", row.names = c(NA,
-3L))

Using purrr::map_dfc and transmute instead of mutate
library(dplyr)
purrr::map_dfc(c('x','y'), ~data %>% select(contains(.x)) %>%
transmute(!!.x := rowSums(., na.rm = TRUE)))
x y
1 1 1
2 3 5
3 3 4

If column2 is NA join on column1 else join on column1 and column2

I'm sure this question has been asked before, but I can't find the answer for the life of me. I want to use dplyr to join two tibbles together. If the second column is NA then just join on the first column. If the second column is not NA, then join on the first and second column. The solution below doesn't work, but it's what I'm trying to do.
library(tidyverse)
df1 <- tibble(x = c("Name", "City", "City"), y = c("Table5", "Table1", "Table2"))
df2 <- tibble(x2 = c("Name", "City", "City"), y2 = c(NA, "Table1", "Table2"), z = c("a", "b", "c"))
joined_data <- if (is.na(df2$y2)) {
df1 %>%
left_join(df2, by = c("x" = "x2"))
} else {
df1 %>%
left_join(df2, by = c("x" = "x2", "y" = "y2"))
}
The final result should be
x y z
<chr> <chr> <chr>
1 Name Table5 a
2 City Table1 b
3 City Table2 c

We first find all the NA indices and then join them in two separate calls. For non-NA indices we join them on x and y whereas for NA indices we join them on only x and select non-NA value between y and y2 using coalesce and then bind the rows together.
library(tidyverse)
NAinds <- is.na(df2$y2)
df1[!NAinds,] %>%
left_join(df2, by = c("x" = "x2", "y" = "y2")) %>%
bind_rows(df1[NAinds, ] %>%
left_join(df2, by = c("x" = "x2")) %>%
mutate(y = coalesce(y, y2)) %>%
select(-y2))
# x y z
# <chr> <chr> <chr>
#1 City Table1 b
#2 City Table2 c
#3 Name Table5 a

Is it possible to use the by variable from the left and right dataframes separately after a left_join procedure

After joining two dataframes (df1 and df2) I would like to flag in column check which ids are in df1 but not df2 (example below).
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2))
df3 <-
left_join(df1, df2)
Required result
id check
1 1 Y
2 2 Y
3 3 N
I can achieve the result using a temp column (example below)
df1 <-
data.frame(id = c(1, 2, 3))
df2 <-
data.frame(id = c(1, 2), temp = "Y")
df3 <-
left_join(df1, df2) %>%
mutate(check = ifelse(is.na(temp), "N", "Y")) %>%
select(-temp)
but I was hoping for a solution where a temp column isn't required, I've tried some different approaches (for example below), but haven't been able to find a better solution.
df3 <-
left_join(x = df1, y = df2) %>%
mutate(check = ifelse(is.na(y.id), "N", "Y"))
but this errors with...
Joining, by = "id"
Error: object 'y.id' not found

discard last or first group after group_by by referencing group directly

Data:
df <- data.frame(A=c(rep(letters[1],3),rep(letters[2],3),rep(letters[3],3)),
B=rnorm(9),
stringsAsFactors=F)
I don't know if there's a way to do this, but what I'd like to know is if there's way to discard the last group by directly referencing the groups after group_by(A) to get the desired output:
A B
1 a -0.4900863
2 a 1.4106594
3 a -0.2245738
4 b -0.2124955
5 b 0.6963785
6 b 0.9151825
I AM INTERESTED IN SOLUTIONS THAT DIRECTLY WORK AT THE GROUPS LEVEL
For instance, something like:
df %>% group_by(A) %>% head(.Groups,-1)
or
df %>% group_by(A) %>% Groups[1:2]
I AM NOT INTERESTED IN THE FOLLOWING KINDS OF SOLUTIONS
df %>% filter(!(A == max(A)))
df %>% filter(!(A %in% max(A)))
OR OTHER SOLUTIONS THAT DO NOT REQUIRE group_by TO WORK

I was assuming you were not supposed to be assuming that we knew in advance what the number of groups might be. Try using the labels attribute:
all_but_last <- df %>% group_by(A) %>% attr("labels") %>% head(-1)
A
1 a
2 b
... to extract desired rows
> df %>% filter(A %in% all_but_last[[1]])
A B
1 a -0.799026840
2 a -0.712402478
3 a 0.685320094
4 b 0.971492883
5 b -0.001479117
6 b -0.817766296
Helps to use dput to look at the actual contents of a "grouped_df":
dput( df %>% group_by(A) )
structure(list(A = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), B = c(-0.799026840397576, -0.712402478350695, 0.685320094252465,
0.971492883452258, -0.00147911717469651, -0.817766295631676,
-1.00112471676908, 1.88145909873596, -0.305560178617216)), .Names = c("A",
"B"), row.names = c(NA, -9L), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), vars = "A", drop = TRUE, indices = list(
0:2, 3:5, 6:8), group_sizes = c(3L, 3L, 3L), biggest_group_size = 3L,
labels = structure(list(
A = c("a", "b", "c")),
row.names = c(NA, -3L),
class = "data.frame",
vars = "A", drop = TRUE, .Names = "A"))
Note that the labels are a data.frame so you could have further applied unlist to the result that became all_but_last and you then would not have needed to extract its value with "[[".

Perhaps this helps
library(dplyr)
df %>%
group_by(A) %>%
group_indices(.) %in% 1:2 %>%
df[.,]
Or with data.table
library(data.table)
setDT(df)[, grp := .GRP, A][grp %in% unique(grp)[1:2]][, grp := NULL][]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to find common strings across several files - r

Related

Group dataframe row and column wise based on other dataframe?

Writing for loop in r to combine columns that has matching names (with little variance)

If column2 is NA join on column1 else join on column1 and column2

Is it possible to use the by variable from the left and right dataframes separately after a left_join procedure

discard last or first group after group_by by referencing group directly

Categories

Resources