Using regex to drop duplicated elements in columns of an R dataframe

I have a dummy dataframe df which has dimensions 6 X 4.
df <- data.frame(
  Hits = c("Hit1", "Hit2", "Hit3", "Hit4", "Hit5", "Hit6"),
  GO = c("GO:0005634~nucleus,", "", "GO:0005737~cytoplasm,",
         "GO:0005634~nucleus,GO:0005737~cytoplasm,", "",
         "GO:0005634~nucleus,GO:0005654~nucleoplasm,"),
  KEGG = c("", "", "", "", "", ""),
  SMART = c("SM00394:RIIa,", "SM00394:RIIa,", "", "SM00054:EFh,",
            "", "SM00394:RIIa,SM00239:C2,"))
The elements in the columns consist of two parts:
an identifier (e.g. GO:0005634~, SM00394:, etc.)
a term (e.g. nucleus, EFh, etc.)
For each column, I want to retain a row if it contains at least one term that is not present in any row above it. For example, in the column GO, rows 1 and 3 contain unique terms, so they should be retained. Row 4 contains only terms already present in rows 1 and 3, so it should be dropped. Row 6 has one term (nucleoplasm) that is not present in any row above it, so it should also be retained.
I have been able to come up with regular expressions to extract the terms from the columns GO and SMART
Regex for GO: (?<=~).*?(?=,(?:GO:\\d+~|$))
Regex for SMART: (?<=:).*?(?=,(?:\\w+\\d+:|$))
But I'm unable to figure out a way to integrate the regex and the conditions mentioned above into a solution. The output should keep rows 1, 3, 4, and 6, with any cell whose terms have all been seen in earlier rows set to "".
Any suggestions on how to solve this?
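As a sanity check, the question's GO regex does extract the bare terms when run with perl = TRUE (a minimal sketch added for reference, not part of the original question):
x <- "GO:0005634~nucleus,GO:0005737~cytoplasm,"
m <- gregexpr("(?<=~).*?(?=,(?:GO:\\d+~|$))", x, perl = TRUE)
regmatches(x, m)[[1]]
## [1] "nucleus"   "cytoplasm"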

Here is a general approach that will handle GO, SMART, and potentially KEGG, though it is impossible to say for sure without any information about the KEGG format.
The function f below takes as arguments
x, a character vector
split, the delimiter separating items in lists
sep, the delimiter separating identifiers and terms within items
and returns a logical vector indexing the elements of x with at least one non-duplicated term.
f <- function(x, split, sep) {
  l1 <- strsplit(x, split)  # split each row into its items
  tt <- sub(paste0("^[^", sep, "]*", sep), "", unlist(l1))  # drop identifiers, keep terms
  l2 <- relist(duplicated(tt), l1)  # flag terms already seen, restoring the row structure
  !vapply(l2, all, NA)  # TRUE for rows with at least one new term
}
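To see what f does internally, here is a toy round trip through strsplit/relist (illustrative values, not from the question):
l1 <- strsplit(c("a~x,b~y", "c~x"), ",")  # two rows, three items in total
tt <- sub("^[^~]*~", "", unlist(l1))      # terms: "x" "y" "x"
relist(duplicated(tt), l1)                # list(c(FALSE, FALSE), TRUE)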
Applying f to GO and SMART:
nms <- c("GO", "SMART")
l <- Map(f, x = df[nms], split = ",", sep = c("~", ":"))
l
## $GO
## [1] TRUE FALSE TRUE FALSE FALSE TRUE
##
## $SMART
## [1] TRUE FALSE FALSE TRUE FALSE TRUE
Setting to "" elements of GO and SMART with zero non-duplicated terms, then filtering out empty rows, we obtain the desired result:
df2 <- df
df2[nms] <- Map(replace, df2[nms], lapply(l, `!`), "")
df2[Reduce(`|`, l), ]
## Hits GO KEGG SMART
## 1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
## 3 Hit3 GO:0005737~cytoplasm,
## 4 Hit4 SM00054:EFh,
## 6 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,

The following algorithm is applied to each annotation column (GO, SMART, KEGG):
split each cell into its comma-separated identifier+term items (see stringr::str_split etc.)
extract the term from each item with a regex
accumulate the set of terms seen so far down the dataframe
take the set difference between each row's accumulated set and that of the row immediately preceding it
replace the string with "" if no new term is introduced
filter out rows where all the terms are ""
library(dplyr)
library(stringr)
library(purrr)
termred <- function(terms, rx) {
  terms |>
    stringr::str_split(",") |>
    purrr::map(stringr::str_trim) |>
    purrr::map(~ .x[.x != ""]) |>
    purrr::map(~ stringr::str_extract(.x, rx)) |>
    purrr::accumulate(union) %>%
    {mapply(setdiff, ., lag(., 1), SIMPLIFY = TRUE)} %>%
    {ifelse(sapply(., length) > 0, terms, "")}
}
df |>
  transform(GO = termred(GO, "~.*$")) |>
  transform(SMART = termred(SMART, ":.*$")) |>
  filter(GO != "" | SMART != "" | KEGG != "")
##> Hits GO KEGG SMART
##>1 Hit1 GO:0005634~nucleus, SM00394:RIIa,
##>2 Hit3 GO:0005737~cytoplasm,
##>3 Hit4 SM00054:EFh,
##>4 Hit6 GO:0005634~nucleus,GO:0005654~nucleoplasm, SM00394:RIIa,SM00239:C2,
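The accumulate/lag/setdiff step is the crux of termred; here is a toy sketch of the same idea with made-up terms, using a base-R predecessor list in place of lag:
library(purrr)
sets <- accumulate(list("a", c("a", "b"), "b"), union)  # cumulative term sets per row
prev <- c(list(character(0)), head(sets, -1))           # each row's predecessor set
Map(setdiff, sets, prev)  # new terms per row: "a", then "b", then character(0)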

Related

Filter rows where mirror-image delimiters are not paired

I have transcriptions of speech with "mirror-image" delimiters, i.e., paired symbols marking opening and, respectively, closing, such as ( and ) or < and >. The delimiter in this data is the square bracket:
df <- data.frame(
  id = 1:9,
  Utterance = c("[but if I came !ho!me",                    # <- closing square bracket is missing
                "=[ye::a:h]",                               # OK!
                "=[yeah] I mean [does it",                  # <- closing square bracket is missing
                "bu[t if (.) you know",                     # <- closing square bracket is missing
                "=ye::a:h]",                                # <- opening square bracket is missing
                "[that's right] YEAH (laughs)] [ye::a:h]",  # <- opening square bracket is missing
                "cos I've [heard] very sketchy stories",    # OK!
                "[cos] I've [heard very sketchy [stories]", # <- closing square bracket is missing
                "oh well] that's great"                     # <- opening square bracket is missing
))
I want to filter those rows where at least one of the opening or at least one of the closing delimiters is missing (as this represents a transcription error).
I'm actually doing fine with this str_count method:
library(stringr)
library(dplyr)
df %>%
filter(str_count(Utterance, "\\[|\\]") %in% c(1,3,5,7,9))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
but was wondering whether regexes could be devised to detect the strings with missing elements directly. I've been trying this regex for missing opening brackets:
p_op <- "(?<!.{0,10}\\[.{0,10})\\].*$"
df %>%
filter(str_detect(Utterance, p_op))
which works well, and this one for missing closing brackets, which does not capture all matches:
p_cl<- "\\[(?!.*\\]).*$"
df %>%
filter(str_detect(Utterance, p_cl))
How can the pattern or the patterns be formulated better?
You may use the pattern (\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]) in str_detect:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
Here we check for an opening bracket [ followed by one or more characters that are not ], followed by another [ or the end of the string ($). The second alternative is the mirror-image check for a closing bracket with no matching opening bracket.
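As a quick illustration (made-up test strings), each alternation branch flags one failure mode:
library(stringr)
str_detect("bu[t if you know", "\\[[^\\]]+(\\[|$)")       # TRUE: "[" is never closed
str_detect("oh well] that's great", "(^|\\])[^\\[]+\\]")  # TRUE: "]" is never opened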
Another possible solution, using purrr::map_dfr.
EXPLANATION
I provide, in what follows, an explanation of my solution, as asked for by @ChrisRuehlemann:
With str_extract_all(df$Utterance, "\\[|\\]"), we extract all [ and ] of each utterance as a list, in the order they appear in the utterance.
We then iterate over these bracket vectors. Since each element is a vector of single brackets, we first collapse it into one string (str_c(.x, collapse = "")).
We compare each utterance's bracket string with a reference string of properly paired brackets, [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If the two strings are not equal, then square brackets are missing!
When map_dfr finishes, we end up with a column of TRUE and FALSE (TRUE when brackets are missing, despite the column name OK), which we can use to filter the original dataframe as wanted.
library(tidyverse)
str_extract_all(df$Utterance, "\\[|\\]") %>%
  map_dfr(~ list(OK = str_c(.x, collapse = "") !=
                   str_c(rep("[]", length(.x)/2), collapse = ""))) %>%
  filter(df, .)
#> id Utterance
#> 1 1 [but if I came !ho!me
#> 2 3 =[yeah] I mean [does it
#> 3 4 bu[t if (.) you know
#> 4 5 =ye::a:h]
#> 5 6 [that's right] YEAH (laughs)] [ye::a:h]
#> 6 8 [cos] I've [heard very sketchy [stories]
#> 7 9 oh well] that's great
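The same bracket-sequence check can be written more compactly without map_dfr (a sketch, not from the original answer; strrep requires R >= 3.3):
library(stringr)
bad <- sapply(str_extract_all(df$Utterance, "\\[|\\]"), function(x)
  paste(x, collapse = "") != strrep("[]", length(x) %/% 2))
df[bad, ]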
If you need a function to validate (possibly nested) delimiters, here is a stack-based one.
valid_delim <- function(x, delim = c(open = "[", close = "]"), max_stack_size = 10L){
  f <- function(x, delim, max_stack_size){
    if(is.null(names(delim))) {
      names(delim) <- c("open", "close")
    }
    if(nchar(x) > 0L){
      valid <- TRUE
      stack <- character(max_stack_size)
      i_stack <- 0L
      y <- unlist(strsplit(x, ""))
      for(i in seq_along(y)){
        if(y[i] == delim["open"]){
          # push the expected closing delimiter onto the stack
          i_stack <- i_stack + 1L
          stack[i_stack] <- delim["close"]
        } else if(y[i] == delim["close"]) {
          # check the stack is non-empty *before* indexing into it
          valid <- (i_stack > 0L) && (stack[i_stack] == delim["close"])
          if(valid)
            i_stack <- i_stack - 1L
          else break
        }
      }
      valid && (i_stack == 0L)  # every opened delimiter must have been closed
    } else NULL
  }
  x <- as.character(x)
  y <- sapply(x, f, delim = delim, max_stack_size = max_stack_size)
  unname(y)
}
library(dplyr)
valid_delim(df$Utterance)
#[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
df %>% filter(valid_delim(Utterance))
# id Utterance
#1 2 =[ye::a:h]
#2 7 cos I've [heard] very sketchy stories

How to split a sentence in two halves in R

I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
                            "qsdf",
                            "mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
                            "qsddddddddddddddddddddddddddddddd",
                            "qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"),
                   stringsAsFactors = FALSE)
I want to get something like this:
                               first                                     sec
1            qsdf mqsldkfop mqsdfmlk               lksdfp pqpdfm mqsdfmj mlk
2                               qsdf
3                         mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4  qsddddddddddddddddddddddddddddddd
5                        qsdfmlk mlk         mkljlmkjlmkjml lmj mjjmjmjm lkj
Any solution that does not cut in halves but rather "so that the first part isn't longer than X characters" would also be great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is lower than half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
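To see why the which.max(cumsum(...)) trick returns the position of the last TRUE, consider the word lengths of the first row (an illustrative walk-through):
nch <- c(4, 9, 8, 6, 6, 7, 3)         # nchar of each word in row 1
flags <- cumsum(nch) <= sum(nch) / 2  # TRUE while still within the first half
which.max(cumsum(flags))              # index of the last TRUE -> 3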
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
  if(length(x) == 1){  # single-word strings cannot be split
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "),
             paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
                               first                                  second
1            qsdf mqsldkfop mqsdfmlk               lksdfp pqpdfm mqsdfmj mlk
2                               qsdf                                    <NA>
3                         mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4  qsddddddddddddddddddddddddddddddd                                    <NA>
5                        qsdfmlk mlk         mkljlmkjlmkjml lmj mjjmjmjm lkj
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
#This counts the numbers of words of each row:
nwords = str_count(test$init, " ") + 1
#This is the position where break the line for each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#the map2_chr is applying a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
  mutate(init = map2_chr(init, break_here,
                         ~ breaker::nbreak(string = .x, n = .y, loop = F))) %>%
  separate(init, c("first", "second"), sep = "\n")
#             first     second
# 1     Phrase with four words
# 2 That phrase has five words

How do I replace values in a matrix from an uploaded CSV file in R?

These are the steps I took:
1) Read in CSV file
rawdata <- read.csv('name of my file', stringsAsFactors=FALSE)
2) Cleaned my data by removing certain records based on x-criteria
data <- rawdata[!(rawdata$YOURID==""), all()]
data <- data[(data$thiscolumn=="right"), all()]
data <- data[(data$thatcolumn=="right"), all()]
3) Now I want to replace certain values throughout the whole matrix with a number (i.e. replace a string with a numeric value). I have tried the following commands and nothing works (I've tried both gsub and replace):
gsub("Not the right string", 2, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
data <- replace(data, data$thiscolumn == "Not the right string" , 2)
gsub("\\Not the right string", "2", data$thiscolumn, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
I am new to R; I normally code in C++. The only other thing for me to try is a for loop. I might want R to look only at certain columns to replace certain values, but I'd prefer a search through the whole matrix. Either is fine.
These are the guidelines per R Help:
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
replace(x, list, values)
Arguments
x vector
list an index vector
values replacement values
Example: I want to replace the text "Extremely Relevant 5", or whatever given text, with a corresponding number value.
You can replace the for loop with logical indexing. First you identify the indices of what you want to replace, then assign the new value at those indices.
Here's a small example. Let's say we have this vector:
x <- c(1, 2, 99, 4, 2, 99)
# x
# [1] 1 2 99 4 2 99
And we want to find all places where it's 99 and replace it with 0. When you apply x == 99 you get a TRUE/FALSE vector.
x == 99
# [1] FALSE FALSE TRUE FALSE FALSE TRUE
You can use this vector as an index to assign the new value where the condition is met.
x[x == 99] <- 0
# x
# [1] 1 2 0 4 2 0
Similarly, you can use this approach to apply the replacement across a dataframe or a matrix in one shot:
df <- data.frame(col1 = c(2, 99, 3), col2 = c(99, 4, 99))
# df:
# col1 col2
# 1 2 99
# 2 99 4
# 3 3 99
df[df==99] <- 0
# df
# col1 col2
# 1 2 0
# 2 0 4
# 3 3 0
For a dataframe with strings it might be trickier, since a column can be a factor and the value you're trying to replace is not one of its levels. You can get around that by converting to character and then applying the replacement.
> df <- data.frame(col1 = c(2, "this string", 3), col2 = c("this string", 4, "this string"))
> df
col1 col2
1 2 this string
2 this string 4
3 3 this string
> sapply(df, class)
col1 col2
"factor" "factor"
> df <- sapply(df, as.character)
> df
col1 col2
[1,] "2" "this string"
[2,] "this string" "4"
[3,] "3" "this string"
> df[df == "this string"] <- 0
> df <- as.data.frame(df)
> df
col1 col2
1 2 0
2 0 4
3 3 0
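Note that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so on current R these columns are character to begin with and the conversion step can be skipped:
df <- data.frame(col1 = c(2, "this string", 3), col2 = c("this string", 4, "this string"))
sapply(df, class)
##        col1        col2
## "character" "character"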
I have found a few solutions to my own question that I thought I'd share, having worked on it a little more just now.
1) I added library(stringr) at the top so that R can match strings.
2) I used a for loop to go down the entries of a specific column of my dataframe and change them to the indicated value. See as follows:
#possible solution 5 - This totally works!
for (i in 1:nrow(data)){
  if (data$columnofinterest[i] == "String of Interest")
    data$columnofinterest[i] <- "Becca is da bomb dot com"
}
#possible solution 6 - This totally works!
for (i in 1:nrow(data)){
  if (data$columnofinterest[i] == "Becca is da bomb dot com")
    data$columnofinterest[i] <- 7
}
As you can see, replacing specific records between text and a numerical value is possible (text to numerical value and vice versa). As the comments indicate, it took me until solutions 5 and 6 to figure this much out. It's still not the whole matrix, but at least I can go through one column of interest at a time, which is still a lot faster.
Here's a dplyr/tidyverse solution adapted from "changing multiple column values given a condition in dplyr". You can use mutate_all:
library(tidyverse)
data <- tibble(a = c("don't change", "change", "don't change"),
b = c("change", "Change", "don't change"))
data %>%
  mutate_all(funs(if_else(. == "change", "xxx", .)))
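Note that funs() was deprecated in dplyr 0.8 and mutate_all() has since been superseded; on current dplyr the same idea would be written with across() (an equivalent sketch):
library(dplyr)
data %>%
  mutate(across(everything(), ~ if_else(.x == "change", "xxx", .x)))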

R - looking up strings and exclude based on other string

I could not find an answer on how to count words in a data frame while excluding rows where another word is found.
I have the df below:
words <- c("INSTANCE find", "LA LA LA", "instance during",
"instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = T)
This flags all instances of "instance". I have been trying to also exclude any row where the word "find" is present. I could add another grepl looking for "find" and exclude based on that, but I'm trying to limit the number of lines in my code.
I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = T, ignore.case = T))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
lapply(grepl, df$words, perl = T, ignore.case = T) %>%
reduce(`&`)
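The '^((?!find).)*$' pattern matches only strings in which "find" never occurs; a quick illustration:
grepl("^((?!find).)*$", c("instance", "find instance"), perl = TRUE)
## [1]  TRUE FALSE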
If all you need is the number of times "instance" appears in a string, zeroing that count whenever "find" is found anywhere in the same string:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0

How to delete everything after nth delimiter in R?

I have this vector myvec. I want to remove everything after the second ':' and get the result below. How do I remove the string after the nth ':'?
myvec <- c("chr2:213403244:213403244:G:T:snp", "chr7:55240586:55240586:T:G:snp", "chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
We can use sub. We match one or more characters that are not : from the start of the string (^([^:]+)), followed by a :, followed by one or more characters that are not : ([^:]+), and place all of it in a capture group, i.e. within the parentheses. We then replace the whole match with the capture group (\\1).
sub('^([^:]+:[^:]+).*', '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
The above works for the example posted. For the general case of removing everything after the nth delimiter:
n <- 2
pat <- paste0('^([^:]+(?::[^:]+){',n-1,'}).*')
sub(pat, '\\1', myvec)
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Checking with a different 'n', say
n <- 3
and repeating the same steps (rebuilding pat with the new n):
sub(pat, '\\1', myvec)
#[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
#[3] "chr7:55241607:55241607"
Another option would be to split by : and then paste the first n components back together:
n <- 2
vapply(strsplit(myvec, ':'), function(x)
  paste(x[seq.int(n)], collapse = ':'), character(1L))
#[1] "chr2:213403244" "chr7:55240586" "chr7:55241607"
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the columns desired and paste it back together again:
k <- 3 # keep first 3 fields only
do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":"))
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
2) sprintf/sub Construct the appropriate regular expression (in the case below of k equal to 3 it would be ^((.*?:){2}.*?):.* ) and use it with sub:
k <- 3
sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 1: For k=1 this can be further simplified to sub(":.*", "", myvec) and for k=n-1 it can be further simplified to sub(":[^:]*$", "", myvec)
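For instance, the k = 1 shortcut gives:
sub(":.*", "", myvec)
## [1] "chr2" "chr7" "chr7"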
Note 2: For k equal to 3 the regular expression reads:
^((.*?:){2}.*?):.*
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above like this:
n <- 6 # number of fields
k <- 3 # number of fields to retain
out <- myvec
for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)
If we wanted to set n automatically we could optionally replace the hard coded line setting n above with this:
n <- count.fields(textConnection(myvec[1]), sep = ":")
4) locate position of kth colon Locate the positions of the colons using gregexpr, extract the location of the kth, and subtract one from it since we don't want the trailing colon. Then use substr to extract that many characters from each string.
k <- 3
substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1)
giving:
[1] "chr2:213403244:213403244" "chr7:55240586:55240586"
[3] "chr7:55241607:55241607"
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using paste0(myvec, ":") as the input instead of myvec.
Note 4: We compare performance:
library(rbenchmark)
benchmark(
.read.table = do.call(paste, c(read.table(text = myvec, sep = ":")[1:k], sep = ":")),
.sprintf.sub = sub(sprintf("^((.*?:){%d}.*?):.*", k-1), "\\1", myvec),
.for = { out <- myvec; for(i in seq_len(n-k)) out <- sub(":[^:]*$", "", out)},
.gregexpr = substr(myvec, 1, sapply(gregexpr(":", myvec), "[", k) - 1),
order = "elapsed", replications = 1000)[1:4]
giving:
test replications elapsed relative
2 .sprintf.sub 1000 0.11 1.000
4 .gregexpr 1000 0.14 1.273
3 .for 1000 0.15 1.364
1 .read.table 1000 2.16 19.636
The solution using sprintf and sub is the fastest although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
