How to concatenate character strings based on condition in r? - r

I need to prepare queries that are made of characters strings (DOI, Digital Object Identifier) stored in a data frame. All strings associated with the same case have to be joined to produce one query.
The df looks like this:
Case
DOI
1
1212313/dfsjk23
1
322332/jdkdsa12
2
21323/xsw.w3
2
311331313/q1231
2
1212121/1231312
The output should be a data frame looking like this:
Case
Query
1
DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2
DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
The prefix ("DO="), suffix (")") and "OR" are not critical, I can add them later, but how to aggregate character strings based on a case number?

In base R you could do:
aggregate(DOI~Case, df1, function(x) sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))
Case DOI
1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
if Using R 4.1.0
aggregate(DOI~Case, df1, \(x)sprintf('DO=(%s)', paste0(x, collapse = ' OR ')))

We can use glue with str_c to collapse the 'DOI' column after grouping by 'Case'
library(stringr)
library(dplyr)
df1 %>%
group_by(Case) %>%
summarise(Query = glue::glue("DO=({str_c(DOI, collapse= ' OR ')})"))
-output
## A tibble: 2 x 2
# Case Query
# <int> <glue>
#1 1 DO=(1212313/dfsjk23 OR 322332/jdkdsa12)
#2 2 DO=(21323/xsw.w3 OR 311331313/q1231 OR 1212121/1231312)
data
df1 <- structure(list(Case = c(1L, 1L, 2L, 2L, 2L), DOI = c("1212313/dfsjk23",
"322332/jdkdsa12", "21323/xsw.w3", "311331313/q1231", "1212121/1231312"
)), class = "data.frame", row.names = c(NA, -5L))

Related

Using lapply to sum a subset of a dataframe

I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.
group_a
group_b
n_variants_a
n_variants_b
1
NA
1
2
NA
2
5
4
1
2
2
0
I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.
Running this over just one group and n_variant set works:
sum(subset(df, (!is.na(group_a)))$n_variants_a
However I want to sum every n_variant column based on every grouping. My lapply script for this outputs values of 0 for each sum.
summed_variants <- lapply(list_of_groups, function(g) {
lapply(list_of_variants, function(v) {
sum(subset(df, !(is.na(g)))$v)
I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.
Thanks for your help!
We may use Map/mapply for this - loop over the group names, and its corresponding 'n_variants' (assuming they are in order), extract the columns based on the names, apply the condition (!is.na), subset the 'n_variants' and get the sum
mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
names(df1)[1:2], names(df1)[3:4])
group_a group_b
3 4
Or another option can be done using tidyverse. Loop across the 'n_variants' columns, get the column name (cur_column()) replace the substring with 'group', get the value, create the condition to subset the column and get the sum
library(stringr)
library(dplyr)
df1 %>%
summarise(across(contains('variants'),
~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))
-output
n_variants_a n_variants_b
1 3 4
data
df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))

exclude part of a character string in a column in a dataframe

I have a df that has several ids separated with an underscore, like so:
df1:
id v1
1001 2
10002_10002 19
I want the underscore removed and anything after the underscore, like so:
df1:
id v1
1001 2
10002 19
I tried this code, but it's giving me a list, not a df. Can someone please help?
df2 <- strsplit(df1$id, split='_', fixed=TRUE)
You need to access the contents of the list, and then retain the first of two elements resulting from the split:
df1$id <- strsplit(df1$id, "_", fixed=TRUE)[[1]][1]
You could also use sub here:
df1$id <- sub("_.*$", "", df1$id)
Here a solution with the tidyverse/stringr:
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
id = c("1001", "10002_10002"),
v1 = c(2L, 19L)
)
my_df %>%
mutate(id=str_remove(id, regex("(_.*)")))
#> id v1
#> 1 1001 2
#> 2 10002 19
Created on 2020-12-03 by the reprex package (v0.3.0)
I like to use the stringr package
require(stringr)
df1 <- data.frame(id = c("1001","1002_10002"), v1 = c(2,19))
df1$id <- str_remove(df1$id, pattern = "_.+")
We can use word from stringr
library(stringr)
library(dplyr)
df1 %>%
mutate(id = word(id, 1, sep="_"))
# id v1
#1 1001 2
#2 10002 19
Another option is trimws from base R
df1$id <- trimws(df1$id, whitespace = "_.*")
data
df1 <- structure(list(id = c("1001", "10002_10002"),
v1 = c(2L, 19L)), class = "data.frame", row.names = c(NA,
-2L))

Determine if all IDs in the column of one data.frame are in a column of a data.frame in R

I have two data.frame tables in R. Both have IDs for users who took particular actions. The users in the second table should all have done the actions in the first table, but I want to confirm. What would be the best way to determine if all the IDs in table 2 are represented in table, and if not what IDs aren't?
Table A
**Unique ID** **Count**
abc123 1
zyx456 15
888aaaa 4
Table B
**Unique ID** **Count**
abc123 1
zyx456 1
zzzzz123 2
I'm trying to get a response that abc123 and zyx456 in Table B are in Table A and that zzzzz123 is not represented in Table A but is in B (which would be an error, since all B should be in A).
This is an efficient one-liner in base R:
setdiff(TableB$ID, TableA$ID)
It will return an empty result if everything in TableB is in TableA, and return the missing fields if there are any.
Other answers may be better choices with broader context, but this is a simple solution for a simple problem.
We can do this easily with a join in the tidyverse:
library(tidyverse)
JoinedTable = full_join(
x = TableA %>% mutate(in.A = TRUE),
y = TableB %>% mutate(in.B = TRUE).
by = "UniqueID",
suffix = c(".A",".B")
)
### Use whichever of the following is applicable
## Is in both
JoinedTable %>%
filter(in.A, in.B)
## In A only
JoinedTable %>%
filter(in.A, !in.B)
## In B only
JoinedTable %>%
filter(!in.A, in.B)
Use a full_join to intersect the tables; set "by" to your ID column and adding a suffix to differentiate other columns that aren't unique to a particular column. I've added mutates to make the filtering code more clear, but you could also just look for NAs in the respective Counts columns (i.e. filter(!is.na(Count.A), is.na(Count.B)) to find ones in A but not B).
If you just want a vector of the ones that meet each condition, just tack on %>% pull(UniqueID) to grab that.
You can add another column to table B show if it is also in table A. Here is the code can make it (assuming dfA and dfB denote tables A and B):
dfB <- within(dfB, in_dfA <- UniqueID %in% tbla$UniqueID)
gives
> dfB
UniqueID Count in_dfA
1 abc123 1 TRUE
2 zyx456 1 TRUE
3 zzzzz123 2 FALSE
DATA
dfA <- structure(list(UniqueID = structure(c(2L, 3L, 1L), .Label = c("888aaaa",
"abc123", "zyx456"), class = "factor"), Count = c(1L, 15L, 4L
)), class = "data.frame", row.names = c(NA, -3L))
dfB <- structure(list(UniqueID = structure(1:3, .Label = c("abc123",
"zyx456", "zzzzz123"), class = "factor"), Count = c(1L, 1L, 2L
), in_dfA = c(TRUE, TRUE, FALSE)), row.names = c(NA, -3L), class = "data.frame")
How about using the %in% operator to see which are in both versus those that are not:
library(tibble)
library(tidyverse)
df1 <- tribble(~ID, ~Count,
'abc', 1,
'zyx', 15,
'other', 3)
df2 <- tribble(~ID, ~Count,
'abc', 2,
'zyx', 33,
'another', 334)
match <- df2[which(df2$ID %in% df1$ID),'ID']
notmatch <- df2[which(!(df2$ID %in% df1$ID)),'ID']
This outputs two comparisons that you can use to check for values in a function and pass errors if need be:
match
A tibble: 2 x 1
ID
<chr>
1 abc
2 zyx
notmatch
# A tibble: 1 x 1
ID
<chr>
1 another
You could do an update join to see which IDs are/aren't in the first table
tblb[tbla, on = 'UniqueID', in_tbla := i.UniqueID
][, in_tbla := !is.na(in_tbla)]
tblb
# UniqueID Count in_tbla
# 1: abc123 1 TRUE
# 2: zyx456 1 TRUE
# 3: zzzzz123 2 FALSE
Not sure if that's any better than #Onyambu's suggestion though (same output)
tblb[, in_tbla := UniqueID %in% tbla$UniqueID]
Data used:
tbla <- fread('
UniqueID Count
abc123 1
zyx456 15
888aaaa 4
')
tblb <- fread('
UniqueID Count
abc123 1
zyx456 1
zzzzz123 2
')

Compare strings in a column without considering character's order and if equal make them identical (same order) in R

My data frame has (8211 observation) but following is a simplified example. If I have the following data Frame in R
Var1 Freq
a/b/e 1
b/a/e 2
a/c/d 3
d/c/a 1
How can I obtain the following data frame:
Var1 Freq
a/b/e 3
a/c/d 4
Here is a way
df1[, "Var1"] <- sapply(strsplit(df1$Var1, "/"), function(x) paste0(sort(x), collapse = "/"))
aggregate(Freq ~ Var1, df1, FUN = sum)
# Var1 Freq
#1 a/b/e 3
#2 a/c/d 4
We use strsplit to split column Var1 on "/". This returns a list of character vectors which we sort, paste back together and later aggregate.
data
df1 <- structure(list(Var1 = c("a/b/e", "a/b/e", "a/c/d", "a/c/d"),
Freq = c(1L, 2L, 3L, 1L)), .Names = c("Var1", "Freq"), row.names = c(NA,
-4L), class = "data.frame")

Find and remove matching substrings from two data frames

I have two data frames: df1 and df2
df1<- structure(list(sample_1 = structure(c(7L, 6L, 5L, 1L, 2L, 4L,
3L), .Label = c("P41182;Q9HCP0", "Q09472", "Q9Y6H1;Q5T1J5", "Q9Y6I3",
"Q9Y6Q9", "Q9Y6U3", "Q9Y6W5"), class = "factor"), sample_2 = structure(c(7L,
6L, 4L, 3L, 2L, 5L, 1L), .Label = c("O15143", "P31908", "P3R117",
"P41356;P54612;A41PH2", "P54112", "P61809;Q92831", "Q16835"), class = "factor")), .Names = c("sample_1",
"sample_2"), class = "data.frame", row.names = c(NA, -7L))
df2<- structure(list(subunits..UniProt.IDs. = structure(c(4L, 6L, 5L,
12L, 3L, 9L, 14L, 16L, 15L, 11L, 13L, 8L, 1L, 2L, 10L, 7L), .Label = c("O55102,Q9CWG9,Q5U5M8,Q8VED2,Q91WZ8,Q8R015,Q9R0C0,Q9Z266",
"P30561,O08915,P07901,P11499", "P30561,P53762", "P41182,P56524",
"P41182,Q8WUI4", "P41182,Q9UQL6", "P61160,P61158,O15143,O15144,O15145,P59998,O15511",
"P78537,Q6QNY1,Q6QNY0,Q9NUP1,Q96EV8,Q8TDH9,Q9UL45,O95295", "Q15021,Q9BPX3,Q15003,O95347,Q9NTJ3",
"Q8WMR7,(P67776,P11493),(P54612,P54613)", "Q91VB4,P59438,Q8BLY7",
"Q92793,Q09472,Q9Y6Q9,Q92831", "Q92828,Q13227,O15379,O75376,O60907,Q9BZK7",
"Q92902,Q9NQG7", "Q92903,Q96NY9", "Q969F9,Q9UPZ3,Q86YV9"), class = "factor")), .Names = "subunits..UniProt.IDs.", class = "data.frame", row.names = c(NA,
-16L))
I want to look at each semicolon-separated string in df1 and if it contains a match to one of the comma-separated strings in df2, then remove it. So, my output will look like below:
sample_1 sample_2
1 Q9Y6W5 Q16835
2 Q9Y6U3 P61809
3 P41356;A41PH2
4 Q9HCP0 P3R117
5 P31908
6 Q9Y6I3 P54112
7 Q9Y6H1;Q5T1J5
The sample_1 has strings in row 3, 4 and 5 that match one of the strings in df2, and those matching strings are removed.
The sample_2 has strings in row 2, 3 and 7 that match strings in df2, and those matching strings are removed.
First, you could gather all the possible strings to remove:
toRmv <- unique(unlist(strsplit(as.character(df2[,1]), ",", fixed = TRUE)))
toRmv <- gsub("\\W", "", toRmv, perl = TRUE)
Then remove them. I like the stringi package here for its ability to replace multiple strings with an empty string using the handy vectorize_all argument set to FALSE.
library(stringi)
df1[] <- lapply(df1, stri_replace_all_fixed,
pattern = toRmv, replacement = "", vectorize_all = FALSE)
df1
# sample_1 sample_2
#1 Q9Y6W5 Q16835
#2 Q9Y6U3 P61809;
#3 P41356;;A41PH2
#4 ;Q9HCP0 P3R117
#5 P31908
#6 Q9Y6I3 P54112
#7 Q9Y6H1;Q5T1J5
Now, it's just a matter of getting rid of leading semicolons (^;), trailing semicolons (;$), and multiple semicolons ((?<=;);):
df1[] <- lapply(df1, gsub, pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
df1
# sample_1 sample_2
#1 Q9Y6W5 Q16835
#2 Q9Y6U3 P61809
#3 P41356;A41PH2
#4 Q9HCP0 P3R117
#5 P31908
#6 Q9Y6I3 P54112
#7 Q9Y6H1;Q5T1J5
As requested in the comment, here it is in function form. I didn't test this part. Feel free to test and adjust as you see fit:
stringRemove <- function(removeFrom, toRemove) {
library(stringi)
toRemove <- unique(unlist(strsplit(as.character(toRemove), ",", fixed = TRUE)))
toRemove <- gsub("\\W", "", toRemove, perl = TRUE)
removeFrom[] <- lapply(removeFrom, stri_replace_all_fixed,
pattern = toRemove, replacement = "", vectorize_all = FALSE)
removeFrom[] <- lapply(removeFrom, gsub,
pattern = "^;|;$|(?<=;);", replacement = "", perl = TRUE)
removeFrom
}
# use it
stringRemove(removeFrom = df1, toRemove = df2[,1])
Firstly, you should almost definitely rearrange your data so it's tidy, i.e. has a column for each variable and a row for each observation, but not knowing what it is or how it's related, I can't do that for you. Thus, the only way left is to hack through what are effectively list columns:
library(dplyr)
# For each column,
df1 %>% mutate_each(funs(
# convert to character,
as.character(.) %>%
# split each string into a list of strings to evaluate,
strsplit(';') %>%
# loop over the items in each list,
lapply(function(x){
# replacing any in a similarly split and unlisted df2 with NA,
ifelse(x %in% unlist(strsplit(as.character(df2[,1]), '[(),]+')),
NA_character_, x)
}) %>%
# then loop over them again,
sapply(function(x){
# removing NAs where there are non-NA strings.
ifelse(all(is.na(x)), list(NA_character_), list(x[!is.na(x)]))
})))
# sample_1 sample_2
# 1 Q9Y6W5 Q16835
# 2 Q9Y6U3 P61809
# 3 NA P41356, A41PH2
# 4 Q9HCP0 P3R117
# 5 NA P31908
# 6 Q9Y6I3 P54112
# 7 Q9Y6H1, Q5T1J5 NA
If you want to collapse the actual list columns you end with back into strings, you can do so with paste, but really, list columns are more useful.
Edit
If your data is big enough that it's worth the annoyance to make it faster, take the munging of df2 out of the chain and store it separately so you don't calculate it for every iteration. Here's a version that does so, built in purrr, which works with lists instead of data.frames and can be faster than mutate_each for non-trivial functions. Edit as you like.
library(purrr)
df2_unlisted <- df2 %>% map(as.character) %>% # convert; unnecessary if stringsAsFactors = FALSE
map(strsplit, '[(),]') %>% # split
unlist() # unlist to vector
df1 %>% map(as.character) %>% # convert; unnecessary if stringsAsFactors = FALSE
map(strsplit, ';') %>% # split
at_depth(2, ~.x[!.x %in% df2_unlisted]) %>% # subset out unwanted
at_depth(2, ~if(is_empty(.x)) NA_character_ else .x) %>% # insert NA for chr(0)
as_data_frame() %>% data.frame() # for printing
Results are identical.

Resources