Compare two data.frame and delete rows with common characters - r

I have two data frames, x1 and x2. I want to remove a row from x2 if any of its genes is also found in x1.
x1 <- chr start end Genes
1 8401 8410 Mndal,Mnda,Ifi203,Ifi202b
2 8001 8020 Cyb5r1,Adipor1,Klhl12
3 4001 4020 Alyref2,Itln1,Cd244
x2 <- chr start end Genes
1 8861 8868 Olfr1193
1 8405 8420 Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
Expected result (x2 after removing the row that shares a gene with x1):
x2 <- chr start end Genes
1 8861 8868 Olfr1193
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5

You could try
x2[mapply(function(x,y) !any(x %in% y),
strsplit(x1$Genes, ','), strsplit(x2$Genes, ',')),]
# chr start end Genes
#2 2 8501 8520 Chia,Chi3l3,Chi3l4
#3 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
Or replace !any(x %in% y) with length(intersect(x,y))==0.
NOTE: If the "Genes" column is a factor, convert it to character first, because strsplit does not accept the 'factor' class, i.e. strsplit(as.character(x1$Genes), ',')
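As a minimal sketch of the intersect variant together with the factor conversion from the note: the data below is an assumed reconstruction of the question's x1 and of the three-row x2 that the output above was produced from, with "Genes" deliberately stored as factors. Note that mapply pairs row i of x1 with row i of x2, so this only works when both data frames have the same number of rows in matching order.

```r
# Assumed reconstruction of the question's data, with Genes stored as factors
x1 <- data.frame(
  chr = 1:3,
  Genes = c("Mndal,Mnda,Ifi203,Ifi202b", "Cyb5r1,Adipor1,Klhl12", "Alyref2,Itln1,Cd244"),
  stringsAsFactors = TRUE
)
x2 <- data.frame(
  chr = 1:3,
  Genes = c("Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b", "Chia,Chi3l3,Chi3l4", "Tdpoz4,Tdpoz3,Tdpoz5"),
  stringsAsFactors = TRUE
)

# Convert factors to character before splitting, then keep the rows of x2
# that share no gene with the row of x1 in the same position
keep <- mapply(
  function(a, b) length(intersect(a, b)) == 0,
  strsplit(as.character(x1$Genes), ","),
  strsplit(as.character(x2$Genes), ",")
)
kept <- x2[keep, ]
```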
Update
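Based on the new dataset for 'x2', we can merge the two datasets by the 'chr' column, strsplit 'Genes.x' and 'Genes.y' from the merged dataset ('xNew'), build a logical index that is TRUE when any element of 'Genes.x' occurs in 'Genes.y', and use its negation to subset 'x2'. This relies on the rows of 'xNew' lining up with the rows of 'x2', which holds here because each 'chr' occurs only once in 'x1' and 'x2' is already sorted by 'chr'.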
xNew <- merge(x1, x2[,c(1,4)], by='chr')
indx <- mapply(function(x,y) any(x %in% y),
strsplit(xNew$Genes.x, ','), strsplit(xNew$Genes.y, ','))
x2[!indx,]
# chr start end Genes
#1 1 8861 8868 Olfr1193
#3 2 8501 8520 Chia,Chi3l3,Chi3l4
#4 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
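If the rows cannot be assumed to line up, a variant that looks up the x1 genes by chromosome for every row of x2 may be more robust. This is only a sketch in base R, not part of the answer above: the data is retyped from the question, and the names x1_genes and has_common are made up for illustration.

```r
# Data retyped from the question (four-row version of x2)
x1 <- data.frame(
  chr = c(1, 2, 3),
  start = c(8401, 8001, 4001),
  end = c(8410, 8020, 4020),
  Genes = c("Mndal,Mnda,Ifi203,Ifi202b", "Cyb5r1,Adipor1,Klhl12", "Alyref2,Itln1,Cd244"),
  stringsAsFactors = FALSE
)
x2 <- data.frame(
  chr = c(1, 1, 2, 3),
  start = c(8861, 8405, 8501, 4321),
  end = c(8868, 8420, 8520, 4670),
  Genes = c("Olfr1193", "Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b",
            "Chia,Chi3l3,Chi3l4", "Tdpoz4,Tdpoz3,Tdpoz5"),
  stringsAsFactors = FALSE
)

# Pool the genes of x1 per chromosome into a named list ("1", "2", "3")
x1_genes <- lapply(split(x1$Genes, x1$chr), function(g) unlist(strsplit(g, ",")))

# For each row of x2, check its genes against the x1 genes on the same chromosome;
# a chromosome missing from x1 yields NULL and therefore FALSE
has_common <- mapply(
  function(genes, chr) any(genes %in% x1_genes[[as.character(chr)]]),
  strsplit(x2$Genes, ","), x2$chr
)
result <- x2[!has_common, ]
```

Because the lookup is keyed on chr, neither the row order of x2 nor the number of rows per chromosome matters.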

Related

How do I gsub the complete time string behind #

(this is my first question, if i need to improve anything about it, pls let me know!)
I am analysing a large observational dataset. The start and stop time of each observation have been recorded, so I was able to calculate the duration. But there is a note column which includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from the total duration.
My problem is that one column includes several kinds of notes: not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "#HH:MM").
I only want to look at time periods in the format HH:MM-HH:MM and I want to exclude all event times labeled "#HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this:
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
I tried several ways of using grep or gsub, indicating either what to keep or what to leave out, but I failed. The closest I got was R dropping "#HH" but keeping ":MM". For this I used
gsub("#([[:digit:]]|[_])*", "", df$timepoints)
as found for a similar problem, just with words, here: remove all words that start with "#" from a string
The aim is to get (e.g.):
id    timepoints
3990  "7:16-7:23, 7:25-7:43"
or
id    timepoints
3990  "7:16-7:23", "7:25-7:43"
If possible, separated by comma or directly separated into different columns, so I can extract the times and subtract them from my total observation time.
Any help would be greatly appreciated!
How about matching the strings you're interested in instead?
With base:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23, 7:25-7:43
2 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
With tidyverse (in a long format for easy calculations!):
library(stringr)
library(dplyr)
library(tidyr)
df |>
group_by(id) |>
mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
unnest_longer(new_timepoints) |>
ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23
2 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:25-7:43
3 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
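For completeness, the gsub attempt from the question can probably be rescued as well: it stopped at the colon because the character class only allowed digits and underscores. The sketch below lets the pattern cover the whole "#H:MM" token and then tidies up the leftover commas. The variable names no_events and cleaned are made up for illustration.

```r
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,",
                "#6:19,,7:25-7:43,#7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")

# Remove every event time of the form #H:MM or #HH:MM, colon and minutes included
no_events <- gsub("#\\d{1,2}:\\d{2}", "", timepoints)

# Collapse runs of commas, then strip leading and trailing commas
cleaned <- gsub("^,+|,+$", "", gsub(",{2,}", ",", no_events))
```

Matching what to keep, as above, is usually less fragile than deleting what to drop, because it does not depend on cleaning up separators afterwards.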
You can do something like this:
f <- function(x) {
  lapply(x, \(s) {
    s = strsplit(s,",")[[1]]
    s[grepl("^\\d",s)]
  })
}
and then apply that function to the timepoints column
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest(timepoints)
Output:
id timepoints
<chr> <chr>
1 3990 7:16-7:23
2 3990 7:25-7:43
3 3989 7:25-7:43
4 3004 7:30-7:39
5 3004 7:45-7:48
6 3004 7:49-7:54
You could also use unnest_wider() to get these as columns; for that I would adjust my f() to include the names of the timepoints:
f <- function(x) {
  lapply(x, \(s) {
    s = strsplit(s,",")[[1]]
    s = s[grepl("^\\d",s)]
    # seq_along() is safe even when no timepoints are left
    setNames(s, paste0("tp", seq_along(s)))
  })
}
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest_wider(timepoints)
Output:
id tp1 tp2 tp3
<chr> <chr> <chr> <chr>
1 3990 7:16-7:23 7:25-7:43 NA
2 3989 7:25-7:43 NA NA
3 3004 7:30-7:39 7:45-7:48 7:49-7:54
Setting the data with the package data.table
library(data.table)
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.table(id, timepoints)
Note that I saved it as a data.table
Splitting the timepoints by comma and storing the value in the new_time column.
df[,new_time:=strsplit(timepoints, ",")]
Removing the string values that contain #
df[,new_time:=sapply(new_time, function(x) return(x[!grepl("[#]", x)]))]
Since the timepoints column has multiple commas in a row, empty strings ("") remain after splitting, so I remove them
df[,new_time:=sapply(new_time, function(x) return(x[!stringi::stri_isempty(x)]))]
Now the new_time column looks like this
df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"
[[2]]
[1] "7:25-7:43"
[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"
If you want the new_time column to hold a single string per row
df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43" "7:25-7:43" "7:30-7:39, 7:45-7:48, 7:49-7:54"
1) list: Split by comma and then grep out the components with a dash. No packages are used. This gives a list of character vectors as the timepoints column.
df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","),
grep, pattern = "-", value = TRUE)
df2
## id timepoints
## 1 3990 7:16-7:23, 7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ id : chr "3990" "3989" "3004"
## $ timepoints:List of 3
## ..$ : chr "7:16-7:23" "7:25-7:43"
## ..$ : chr "7:25-7:43"
## ..$ : chr "7:30-7:39" "7:45-7:48" "7:49-7:54"
2) character: If you want a comma separated character string in each row, add this:
transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
## id timepoints
## 1 3990 7:16-7:23,7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54
3) long form: Or, if you prefer long form, use this:
long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
## id timepoints
## 1 3990 7:16-7:23
## 2 3990 7:25-7:43
## 3 3989 7:25-7:43
## 4 3004 7:30-7:39
## 5 3004 7:45-7:48
## 6 3004 7:49-7:54
4) wide form: Or a wide form matrix:
nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
## seq
## id 1 2 3
## 3990 "7:16-7:23" "7:25-7:43" NA
## 3989 "7:25-7:43" NA NA
## 3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"
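To get from these intervals to the pause durations the question is ultimately after, each "H:MM-H:MM" string can be split into its two ends and converted to minutes. This is a hedged sketch in base R: it starts from the long form shown above (retyped here), the helper to_minutes and the column pause_min are made-up names, and it assumes no pause crosses midnight.

```r
# Long-form result from above, retyped so the example is self-contained
long <- data.frame(
  id = c("3990", "3990", "3989", "3004", "3004", "3004"),
  timepoints = c("7:16-7:23", "7:25-7:43", "7:25-7:43",
                 "7:30-7:39", "7:45-7:48", "7:49-7:54"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "H:MM" strings to minutes since midnight
to_minutes <- function(hhmm) {
  parts <- strsplit(hhmm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Split each interval into its start and stop time
bounds <- strsplit(long$timepoints, "-", fixed = TRUE)
starts <- vapply(bounds, `[`, character(1), 1)
stops  <- vapply(bounds, `[`, character(1), 2)

# Length of each pause in minutes, then the total per id
long$pause_min <- to_minutes(stops) - to_minutes(starts)
pause_total <- vapply(split(long$pause_min, long$id), sum, numeric(1))
```

pause_total is a named numeric vector keyed by id, which can then be subtracted from the total observation time per id.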

split dataframe with multiple delimiters in R

df1 <-
Gene GeneLocus
CPA1|1357 chr7:130020290-130027948:+
GUCY2D|3000 chr17:7905988-7923658:+
UBC|7316 chr12:125396194-125399577:-
C11orf95|65998 chr11:63527365-63536113:-
ANKMY2|57037 chr7:16639413-16685398:-
expected output
df2 <-
Gene.1 Gene.2 chr start end
CPA1 1357 7 130020290 130027948
GUCY2D 3000 17 7905988 7923658
UBC 7316 12 125396194 125399577
C11orf95 65998 11 63527365 63536113
ANKMY2 57037 7 16639413 16685398
I tried this way..
install.packages("splitstackshape")
library(splitstackshape)
df1 <- cSplit(df1,"Gene", sep="|", direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus",sep=":",direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus_2",sep="-",direction="wide", fixed=T)
df1 <- data.frame(df1)
df1$GeneLocus_1 <- gsub("chr","", df1$GeneLocus_1)
I would like to know if there is a simpler alternative way to do this.
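Here you go. You can ignore the warning, since it does not affect the output; it is raised because separate() discards the extra pieces after the third one, which has the welcome side effect of removing the strand information (:+ or :-). Passing extra = "drop" to the second separate() call should silence the warning.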
library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1","Gene.2")) %>%
  separate(GeneLocus, c("chr","start","end")) %>%
  mutate(chr=sub("chr","",chr))
Output:
Gene.1 Gene.2 chr start end
1 CPA1 1357 7 130020290 130027948
2 GUCY2D 3000 17 7905988 7923658
3 UBC 7316 12 125396194 125399577
4 C11orf95 65998 11 63527365 63536113
5 ANKMY2 57037 7 16639413 16685398
I would suggest something like the following approach (mydf below is the input data.frame, called df1 in the question):
Make a single delimiter in your "GeneLocus" column (and strip out the unnecessary parts while you're at it).
Split both columns at once. Note that cSplit "balances" the columns being split according to the number of output columns detected. Thus, since the first column would only result in 2 columns when split, but the second would result in 4, you would need to drop columns 3 and 4 from the result.
library(splitstackshape)
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
cSplit(as.data.table(mydf)[, GeneLocus := gsub(
GLPat, "\\1|\\2|\\3|\\4", GeneLocus)], names(mydf), "|")[
, 3:4 := NULL, with = FALSE][]
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
Alternatively, you can try col_flatten from my "SOfun" package, with which you can do:
library(SOfun)
Pat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
Fun <- function(invec) strsplit(gsub(Pat, "\\1|\\2|\\3|\\4", invec), "|", TRUE)
col_flatten(as.data.table(mydf)[, lapply(.SD, Fun)], names(mydf), drop = TRUE)
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
SOfun is only on GitHub, so you can install it with:
source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/SOfun")
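If installing extra packages is not an option, a base R route might be strcapture() from the utils package (available since R 3.4.0), which turns regex capture groups directly into typed columns. This is a sketch rather than a drop-in replacement for the answers above: the data is retyped from the question, and the pattern names gene_pat and locus_pat are made up.

```r
df1 <- data.frame(
  Gene = c("CPA1|1357", "GUCY2D|3000", "UBC|7316", "C11orf95|65998", "ANKMY2|57037"),
  GeneLocus = c("chr7:130020290-130027948:+", "chr17:7905988-7923658:+",
                "chr12:125396194-125399577:-", "chr11:63527365-63536113:-",
                "chr7:16639413-16685398:-"),
  stringsAsFactors = FALSE
)

# One capture group per output column; the prototype sets names and types
gene_pat  <- "^([^|]+)\\|([0-9]+)$"
locus_pat <- "^chr([^:]+):([0-9]+)-([0-9]+):([+-])$"

gene <- strcapture(
  gene_pat, df1$Gene,
  proto = data.frame(Gene.1 = character(), Gene.2 = integer(), stringsAsFactors = FALSE)
)
locus <- strcapture(
  locus_pat, df1$GeneLocus,
  proto = data.frame(chr = character(), start = integer(), end = integer(),
                     strand = character(), stringsAsFactors = FALSE)
)

# Combine and drop the strand column to match the expected output
df2 <- cbind(gene, locus[c("chr", "start", "end")])
```

Rows that do not match a pattern come back as NA, which makes malformed loci easy to spot.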

mapping common names in two different datasets

I have two data frames. I want to find the alternative gene name for each gene in data_frame_1 by comparing it with data_frame_2.
data_frame_1
chr start end CNA Genes No.of.Gene
1 13991 1401 gain Cfh,Gm26048 2
1 14011 1490 gain Zfp788,Rik 2
data.frame_2
Associated_Gene_Name Chromosome_Name Gene_Start Gene_End Associated_Gene_Name_1 Chromosome_Name_1 Gene_Start_1 Gene_End_1
Cfh 1 13900 14100 CFH 3 43900 54100
Gm26048 1 13998 14010 TFE 1 76710 76790
Zfp788 2 43970 44180 ELF 4 131950 133100
Rik 3 202100 202600 RIK 5 881100 1036800
data_frame_result
chr start end CNA Genes No.of.Gene Associated.Gene.name_1
1 13991 1401 gain Cfh,Gm26048 2 CFH,TFE
1 14011 1490 gain Zfp788,Rik 2 ELF,RIK
Having multiple values separated by commas really makes things messy. Here's a chain that will "normalize" the data to one value per row so that you can do a standard merge. I use the magrittr library to chain the commands.
#test data
data_frame_1<-data.frame(
Genes=c("Cfh,Gm26048","Gm5852,Gm5773","Elf","Ttn")
)
data_frame_2<-data.frame(
Genes_1=c("Cfh","Gm26048","Gm5852","Gm5773","Elf","Ttn"),
Alternate_Gene_name = c("CFH","FGFR","NAA","TFE","ELF","TTN")
)
library(magrittr)
idxstack <- function(x, idx=if(!is.null(names(x))) {names(x)} else {seq_along(x)})
do.call(rbind, Map(function(a,b) cbind.data.frame(idx=a,val=b), idx, x))
as.character(data_frame_1$Genes) %>%
{setNames(strsplit(., split=","), .)} %>%
idxstack %>%
merge(data_frame_2, by.x="val", by.y="Genes_1", all.x=TRUE) %>%
aggregate(Alternate_Gene_name~idx, ., paste0, collapse=",") %>%
merge(data_frame_1,., by.x="Genes", by.y="idx")
which returns
Genes Alternate_Gene_name
1 Cfh,Gm26048 CFH,FGFR
2 Elf ELF
3 Gm5852,Gm5773 TFE,NAA
4 Ttn TTN
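A shorter alternative, sketched here in base R under the assumption that every gene appears at most once in data_frame_2, is a named lookup vector applied to each split string. Unlike the merge chain above, it keeps the alternate names in the same order as the original genes. The name lookup is made up for illustration, and a gene without a match would show up as the text "NA".

```r
# Same test data as above, kept as character columns
data_frame_1 <- data.frame(
  Genes = c("Cfh,Gm26048", "Gm5852,Gm5773", "Elf", "Ttn"),
  stringsAsFactors = FALSE
)
data_frame_2 <- data.frame(
  Genes_1 = c("Cfh", "Gm26048", "Gm5852", "Gm5773", "Elf", "Ttn"),
  Alternate_Gene_name = c("CFH", "FGFR", "NAA", "TFE", "ELF", "TTN"),
  stringsAsFactors = FALSE
)

# Named vector: gene name -> alternate name
lookup <- setNames(data_frame_2$Alternate_Gene_name, data_frame_2$Genes_1)

# Split each row, translate gene by gene, and paste back together
data_frame_1$Alternate_Gene_name <- vapply(
  strsplit(data_frame_1$Genes, ",", fixed = TRUE),
  function(g) paste(lookup[g], collapse = ","),
  character(1)
)
```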

Subset only those rows whose intervals do not fall within another data.frame

How can I compare two data frames (test and control) of unequal length and remove rows from test based on three criteria:
i) test$chr == control$chr
ii) test$start and test$end lie within the range of control$start and control$end
iii) test$CNA and control$CNA are the same.
test =
R_level logp chr start end CNA Gene
2 7.079 11 1159 1360 gain Recl,Bcl
11 2.4 12 6335 6345 loss Pekg
3 19 13 7180 7229 loss Sox1
control =
R_level logp chr start end CNA Gene
2 5.9 11 1100 1400 gain Recl,Bcl
2 3.46 11 1002 1345 gain Trp1
2 6.4 12 6705 6845 gain Pekg
4 7 13 6480 8129 loss Sox1
The result should look something like this
result =
R_level logp chr start end CNA Gene
11 2.4 12 6335 6345 loss Pekg
Here's one way using foverlaps() from data.table.
require(data.table) # v1.9.4+
dt1 <- as.data.table(test)
dt2 <- as.data.table(control)
setkey(dt2, chr, CNA, start, end)
olaps = foverlaps(dt1, dt2, nomatch=0L, which=TRUE, type="within")
# xid yid
# 1: 1 2
# 2: 3 4
dt1[!olaps$xid]
# R_level logp chr start end CNA Gene
# 1: 11 2.4 12 6335 6345 loss Pekg
Read ?foverlaps and see the examples section for more info.
Alternatively, you can also use GenomicRanges package. However, you might have to filter based on CNA after merging by overlapping regions (AFAICT).
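When you say "exclude the variable", I assume you mean you want to remove the rows that satisfy those criteria.
If so, you are nearly there. The following should work, provided data1 and data2 have the same number of rows and are compared row by row:
# TRUE where row i of data1 lies within row i of data2 (same chr, same CNA)
exclude_bool <- data1[,3] == data2[,3] &
data1[,4] >= data2[,4] &
data1[,5] <= data2[,5] &
data1[,6] == data2[,6]
data1 <- data1[!exclude_bool , ]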
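Since test and control have different numbers of rows in the question, a row-by-row comparison does not apply directly. A minimal base R sketch that checks each test row against every control row could look like this. The data is retyped from the question, and the name covered is made up for illustration.

```r
test <- data.frame(
  R_level = c(2, 11, 3), logp = c(7.079, 2.4, 19), chr = c(11, 12, 13),
  start = c(1159, 6335, 7180), end = c(1360, 6345, 7229),
  CNA = c("gain", "loss", "loss"), Gene = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2, 2, 2, 4), logp = c(5.9, 3.46, 6.4, 7), chr = c(11, 11, 12, 13),
  start = c(1100, 1002, 6705, 6480), end = c(1400, 1345, 6845, 8129),
  CNA = c("gain", "gain", "gain", "loss"), Gene = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# For each test row: is there any control row with the same chr and CNA
# whose interval fully contains the test interval?
covered <- vapply(seq_len(nrow(test)), function(i) {
  any(control$chr == test$chr[i] &
      control$CNA == test$CNA[i] &
      control$start <= test$start[i] &
      control$end >= test$end[i])
}, logical(1))

result <- test[!covered, ]
```

This does one pass over control per test row, so for large genomic tables the foverlaps() approach above is likely to scale better.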

a complex merge in R to flag unmatched observations?

I'm trying to join two datasets together. Call them x and y. I believe that the ID variables in y are a subset of the ID variables in x. But not in the pure sense because I know that x contains more IDs than y but I don't know the mapping. That is, some (but not all) of the IDs in x and y can be matched 1:1.
My ultimate goal is to figure out where this 1:1 mapping fails and flag these observations. I thought merge would be the way to go but maybe not. An example is below:
id <- c(1:10, 1:100)
X1 <- rnorm(110, mean = 0, sd = 1)
year <- c("2004","2005","2006","2001","2002")
year <- rep(year, 22)
month = c("Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr")
month <- rep(month, 11)
#dataset X
x <- cbind(id, X1, month, year)
#dataset Y
id2 <- c(1:10, 200)
Y1 <- rnorm(11, mean = 0 , sd = 1)
y <- cbind(id2,Y1)
#merge on the IDs; but we get an error because when id2 == 200 in y we don't
#have a match in x
result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE)
The merge threw an error because id2 == 200 had no match in the x dataset. Unfortunately, I lost the ID and all the information as well! (it should equal 200 in row 111):
tail(result)
id X1 month year Y1
106 95 -0.0748386054887876 Nov 2002 NA
107 96 0.196765325477989 Dec 2004 NA
108 97 0.527922135906927 Jan 2005 NA
109 98 0.197927230533413 Feb 2006 NA
110 99 -0.00720474886698309 Mar 2001 NA
111 <NA> <NA> <NA> <NA> -0.9664941
What's more, I get duplicate observations on the ID variable in the merged file. The id2 == 1 observation only existed once but it just copied it twice (e.g. Y1 takes on the value 1.55 twice).
head(result)
id X1 month year Y1
1 1 -0.67371266313441 Jul 2004 1.553220
2 1 -0.318666983469993 Jul 2004 1.553220
3 10 -0.608192898092431 Apr 2002 1.234325
4 10 -0.72299929212347 Apr 2002 1.234325
5 100 -0.842111221826554 Apr 2002 NA
6 11 -0.16316681842082 Jul 2004 NA
This merge has made things more complicated than I intended. I was hoping I could examine every observation in x and figure out where the id matched id2 in y and flag the ones that didn't. So I would get a new vector, call it flag, that takes on a value 1 if x$id had a match in y$id2 and zero otherwise. This way, I could know where the 1:1 mapping failed. I could potentially get some traction on this by re-coding the NAs, but what about the error that gets thrown when id2 == 200? It just discards the information.
I have tried appending by rows with no luck, and it looks like I should give up on merge as well. Perhaps it's better to write a loop or function to do something along these lines:
for every observation in x
id2 = which(id2) corresponds to id-month-year
flag = 1 if length of above is == 1, 0 otherwise
etc.
Hopefully this all makes sense. I'd be very grateful for any help or guidance.
If you are looking for which things in x$id are in y$id2, then you can use
x$id %in% y$id2
to get a logical vector returning matches. It does not guarantee a 1-to-1 correspondence, however; just a 1-to-many. You can then add this vector to your data frame
x$match.y <- x$id %in% y$id2
to see what rows of x have a corresponding ID in y.
To see which observations are 1-to-1, you could do something like
y$id2[duplicated(y$id2)] #vector of duplicate elements in y$id2
(x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
to filter out elements that appear more than once in y$id2. You can also add this to x:
x$match.y.unique <- (x$id %in% y$id2) & !(x$id %in% y$id2[duplicated(y$id2)])
The same procedure can be done for y to determine what rows of y match in x, and which ones match uniquely.
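Put together, the flagging could be sketched as follows. This assumes x and y are built with data.frame() rather than cbind(), so that the $ accessor works, and it uses a fixed seed for reproducibility. The column name flag follows the question, while dup_ids is made up.

```r
set.seed(1)
x <- data.frame(id = c(1:10, 1:100), X1 = rnorm(110))
y <- data.frame(id2 = c(1:10, 200), Y1 = rnorm(11))

# 1 if the id in x has a match in y, 0 otherwise
x$flag <- as.integer(x$id %in% y$id2)

# The same check in the other direction shows which ids in y are missing from x
y$flag <- as.integer(y$id2 %in% x$id)
unmatched_in_y <- y$id2[y$flag == 0]

# ids that occur more than once in x, i.e. where a 1:1 mapping cannot hold
dup_ids <- unique(x$id[duplicated(x$id)])
```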
The reason your merge failed was that you gave it two different structures (one a numeric matrix and the other a character matrix) for x and y. Using cbind when data.frame should be chosen is a common strategy for failure.
> str(x)
chr [1:110, 1:4] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "id" "X1" "month" "year"
> str(y)
num [1:11, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "id2" "Y1"
If you used the data.frame function (since dataframes are what merge is supposed to be working with) it would have succeeded:
> x <- data.frame(id, X1, month, year); y <- data.frame(id2,Y1)
> str( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
'data.frame': 111 obs. of 5 variables:
$ id : num 1 1 2 2 3 3 4 4 5 5 ...
$ X1 : num 1.5063 2.5035 0.7889 -0.4907 -0.0446 ...
$ month: Factor w/ 10 levels "Apr","Aug","Dec",..: 6 6 2 2 10 10 9 9 8 8 ...
$ year : Factor w/ 5 levels "2001","2002",..: 3 3 4 4 5 5 1 1 2 2 ...
$ Y1 : num 1.449 1.449 -0.134 -0.134 -0.828 ...
> tail( result <- merge(x, y, by.x="id", by.y = "id2", all =TRUE) )
id X1 month year Y1
106 96 -0.3869157 Dec 2004 NA
107 97 0.6373009 Jan 2005 NA
108 98 -0.7735626 Feb 2006 NA
109 99 -1.3537915 Mar 2001 NA
110 100 0.2626190 Apr 2002 NA
111 200 NA <NA> <NA> -1.509818
If you have duplicates in your 'x' argument, then you should get duplicates in the result. It's then your responsibility to use !duplicated in whatever manner you deem appropriate (either before or after the merge), but you cannot expect merge to be making decisions like that for you.
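If the merged table itself should carry the flag, one common pattern is to add an indicator column to each input before merging and then turn the resulting NAs into FALSE. The sketch below assumes data frames as in the corrected example above. The column names in_x and in_y are made up for illustration.

```r
set.seed(42)
x <- data.frame(id = c(1:10, 1:100), X1 = rnorm(110))
y <- data.frame(id2 = c(1:10, 200), Y1 = rnorm(11))

# Indicator columns that survive the merge only where a row came from that input
x$in_x <- TRUE
y$in_y <- TRUE

merged <- merge(x, y, by.x = "id", by.y = "id2", all = TRUE)

# NA in an indicator means the row had no partner in that input
merged$in_x[is.na(merged$in_x)] <- FALSE
merged$in_y[is.na(merged$in_y)] <- FALSE

# ids present in y but not in x
only_in_y <- merged$id[!merged$in_x]
```

Using dedicated indicator columns avoids confusing "no match" with a genuine NA in X1 or Y1.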

Resources