mapping common names in two different datasets - r

I have two data frames. I want to find the alternative gene name for each gene in data_frame_1 by comparing it with data_frame_2.
data_frame_1
chr start end CNA Genes No.of.Gene
1 13991 1401 gain Cfh,Gm26048 2
1 14011 1490 gain Zfp788,Rik 2
data_frame_2
Associated_Gene_Name Chromosome_Name Gene_Start Gene_End Associated_Gene_Name_1 Chromosome_Name_1 Gene_Start_1 Gene_End_1
Cfh 1 13900 14100 CFH 3 43900 54100
Gm26048 1 13998 14010 TFE 1 76710 76790
Zfp788 2 43970 44180 ELF 4 131950 133100
Rik 3 202100 202600 RIK 5 881100 1036800
data_frame_result
chr start end CNA Genes No.of.Gene Associated.Gene.name_1
1 13991 1401 gain Cfh,Gm26048 2 CFH,TFE
1 14011 1490 gain Zfp788,Rik 2 ELF,RIK

Having multiple values separated by commas really makes things messy. Here's a chain that will "normalize" the data to one value per row so that you can do a standard merge. I use the magrittr library to chain the commands.
#test data
data_frame_1<-data.frame(
Genes=c("Cfh,Gm26048","Gm5852,Gm5773","Elf","Ttn")
)
data_frame_2<-data.frame(
Genes_1=c("Cfh","Gm26048","Gm5852","Gm5773","Elf","Ttn"),
Alternate_Gene_name = c("CFH","FGFR","NAA","TFE","ELF","TTN")
)
library(magrittr)
idxstack <- function(x, idx=if(!is.null(names(x))) {names(x)} else {seq_along(x)})
do.call(rbind, Map(function(a,b) cbind.data.frame(idx=a,val=b), idx, x))
as.character(data_frame_1$Genes) %>%
{setNames(strsplit(., split=","), .)} %>%
idxstack %>%
merge(data_frame_2, by.x="val", by.y="Genes_1", all.x=TRUE) %>%
aggregate(Alternate_Gene_name~idx, ., paste0, collapse=",") %>%
merge(data_frame_1,., by.x="Genes", by.y="idx")
which returns
Genes Alternate_Gene_name
1 Cfh,Gm26048 CFH,FGFR
2 Elf ELF
3 Gm5852,Gm5773 TFE,NAA
4 Ttn TTN
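For comparison, here is a minimal base R sketch of the same lookup without the merge step. It assumes the test data from the answer above; the helper name map_genes and the lookup vector are hypothetical names introduced only for illustration.

```r
# Test data as in the answer above (kept as character, not factor)
data_frame_1 <- data.frame(
  Genes = c("Cfh,Gm26048", "Gm5852,Gm5773", "Elf", "Ttn"),
  stringsAsFactors = FALSE
)
data_frame_2 <- data.frame(
  Genes_1 = c("Cfh", "Gm26048", "Gm5852", "Gm5773", "Elf", "Ttn"),
  Alternate_Gene_name = c("CFH", "FGFR", "NAA", "TFE", "ELF", "TTN"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are the original gene names, values the alternates
lookup <- setNames(data_frame_2$Alternate_Gene_name, data_frame_2$Genes_1)

# map_genes (hypothetical helper): split on commas, look up each gene,
# and paste the alternates back together in the original order
map_genes <- function(genes) {
  vapply(strsplit(genes, ",", fixed = TRUE),
         function(g) paste(lookup[g], collapse = ","),
         character(1))
}

data_frame_1$Alternate_Gene_name <- map_genes(data_frame_1$Genes)
data_frame_1
```

Unlike the merge chain, this keeps the row order of data_frame_1 and the gene order within each row, so the second row comes out as NAA,TFE. A gene that is missing from the lookup would show up as the text "NA" in the pasted string.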

Related

How do I gsub the complete time string behind #

(This is my first question; if I need to improve anything about it, please let me know!)
I am analysing a large observational dataset. The start and stop times of each observation are recorded, so I was able to calculate the duration. But there is a note column which includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from the total duration.
My problem is that one column includes several kinds of notes: not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "#HH:MM").
I only want to look at time periods in the format HH:MM-HH:MM, and I want to exclude all event times labeled "#HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this:
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
I tried several ways with grep or gsub to indicate either which parts to keep or which to leave out, but I failed. The closest I got was R dropping "#HH" but keeping ":MM". For this I used
gsub("#([[:digit:]]|[_])*", "", df$timepoints)
, as found for a similar problem just with words here: remove all words that start with "#" from a string
The aim is to get (e.g.):
id timepoints
3990 "7:16-7:23, 7:25-7:43"
or
id timepoints
3990 "7:16-7:23", "7:25-7:43"
If possible, separated by commas or directly split into different columns, so I can extract the times and subtract them from my total observation time.
Any help would be greatly appreciated!
How about matching the strings you're interested in instead?
With base:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23, 7:25-7:43
2 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
With tidyverse (in a long format for easy calculations!):
library(stringr)
library(dplyr)
library(tidyr)
df |>
group_by(id) |>
mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
unnest_longer(new_timepoints) |>
ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:16-7:23
2 3990 #6:19,,7:16-7:23,7:25-7:43,#7:53, 7:25-7:43
3 3989 #6:19,,7:25-7:43,#7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
You can do something like this:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s[grepl("^\\d",s)]
})
}
and then apply that function to the timepoints column
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest(timepoints)
Output:
id timepoints
<chr> <chr>
1 3990 7:16-7:23
2 3990 7:25-7:43
3 3989 7:25-7:43
4 3004 7:30-7:39
5 3004 7:45-7:48
6 3004 7:49-7:54
You could also use unnest_wider() to get these as columns; for that I would adjust my f() to include the names of the timepoints:
f <- function(x) {
lapply(x, \(s) {
s = strsplit(s,",")[[1]]
s = s[grepl("^\\d",s)]
setNames(s, paste0("tp", seq_along(s)))
})
}
library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>%
unnest_wider(timepoints)
Output:
id tp1 tp2 tp3
<chr> <chr> <chr> <chr>
1 3990 7:16-7:23 7:25-7:43 NA
2 3989 7:25-7:43 NA NA
3 3004 7:30-7:39 7:45-7:48 7:49-7:54
Setting the data with the package data.table
library(data.table)
id <- c("3990", "3989", "3004")
timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,", "#6:19,,7:25-7:43,#7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.table(id, timepoints)
Note that I saved it as a data.table
Splitting the timepoints by comma and storing the value in the new_time column.
df[,new_time:=strsplit(timepoints, ",")]
Removing the string values that contain #
df[,new_time:=lapply(new_time, function(x) return(x[!grepl("[#]", x)]))]
Since the timepoints column has several commas in a row, empty strings ("") remain, so I remove them
df[,new_time:=lapply(new_time, function(x) return(x[!stringi::stri_isempty(x)]))]
Now the new_time column looks like this
df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"
[[2]]
[1] "7:25-7:43"
[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"
If you want the new_time column to hold one whole string per row
df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43" "7:25-7:43" "7:30-7:39, 7:45-7:48, 7:49-7:54"
1) list Split by comma and then grep out the components with a dash. No packages are used. This gives a list of character vectors as the timepoints column.
df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","),
grep, pattern = "-", value = TRUE)
df2
## id timepoints
## 1 3990 7:16-7:23, 7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54
str(df2)
## 'data.frame': 3 obs. of 2 variables:
## $ id : chr "3990" "3989" "3004"
## $ timepoints:List of 3
## ..$ : chr "7:16-7:23" "7:25-7:43"
## ..$ : chr "7:25-7:43"
## ..$ : chr "7:30-7:39" "7:45-7:48" "7:49-7:54"
2) character If you want a comma separated character string in each row add this:
transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
## id timepoints
## 1 3990 7:16-7:23,7:25-7:43
## 2 3989 7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54
3) long form or if you prefer long form use this:
long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
## id timepoints
## 1 3990 7:16-7:23
## 2 3990 7:25-7:43
## 3 3989 7:25-7:43
## 4 3004 7:30-7:39
## 5 3004 7:45-7:48
## 6 3004 7:49-7:54
4) wide form or a wide form matrix:
nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
## seq
## id 1 2 3
## 3990 "7:16-7:23" "7:25-7:43" NA
## 3989 "7:25-7:43" NA NA
## 3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"
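The question's end goal is to subtract the pauses from the total observation time. A minimal base R sketch of that last step follows, assuming every pause has the form H:MM-H:MM and does not cross midnight. The helper name pause_minutes is hypothetical and not part of any answer above.

```r
# pause_minutes (hypothetical helper): total length in minutes of all
# "H:MM-H:MM" ranges in one string. "#H:MM" event times are ignored
# because they never match the range pattern.
pause_minutes <- function(s) {
  ranges <- regmatches(s, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", s))[[1]]
  if (length(ranges) == 0) return(0)
  # Each range becomes four numbers: start hour, start minute, end hour, end minute
  parts <- matrix(as.numeric(unlist(strsplit(ranges, "[:-]"))),
                  ncol = 4, byrow = TRUE)
  sum((parts[, 3] * 60 + parts[, 4]) - (parts[, 1] * 60 + parts[, 2]))
}

timepoints <- c("#6:19,,7:16-7:23,7:25-7:43,#7:53,",
                "#6:19,,7:25-7:43,#7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")
pause_min <- vapply(timepoints, pause_minutes, numeric(1), USE.NAMES = FALSE)
pause_min
# 25 18 17
```

The resulting vector can be stored as a new column and subtracted from the observation duration, provided that duration is also expressed in minutes.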

Is there a reason RowSums(df[grep wouldn't work accurately?

I used
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
to sum count values from any column whose name contains p.n, but the values it produced are way off. The columns are counts of certain combinations of language types in a language corpus. I want to get a summary of all times p.n. was used within other combinations, but am struggling. It seems like perhaps it is counting other occurrences like e.sp.NR in my variable names, but shouldn't ignore.case=FALSE take care of that? I've also tried tidyverse and dplyr solutions to no avail.
Here's example of df structure:
ID. do.p.n.NP do.p.n.SE p.d.e.sp.SR
1510 4 6 2
1515 2 0 1
and what I need:
ID. do.p.n.NP do.p.n.SE p.d.e.sp.SR Total.P.n
1510 4 6 2 10
1515 2 0 1 2
Update, after the OP changed the column names:
The code then looks like this:
df$Total.P.n <- rowSums(df[grep('p.n', names(df), ignore.case = FALSE)])
df$p.d.e.sp.SR <- rowSums(df[,2:3]!=0)
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 2 10
2 1515 2 0 1 2
First answer:
The pattern you are searching for, p.n, does not exist in the column names of df. Therefore I think you mean pn, in which case your code works as expected:
df$Total.P.n <- rowSums(df[grep('pn', names(df), ignore.case = FALSE)])
ID. do.pn.NP do.pn.SE. p.d.e.sp.SR Total.P.n
1 1510 4 6 0 10
2 1515 2 0 1 2
If we can use dplyr, I would suggest using a tidy-select function / selection helper like matches. And please mind that your regex is likely wrong. If we need to match literal dots (.), we need to escape the metacharacter with a double backslash. The appropriate regex would be p\\.n.
library(dplyr)
data
df <- tibble(`ID.` = c(1510, 1515), `do.p.n.NP` = c(4,2), `do.p.n.SE.` = c(6,0), `p.d.e.sp.SR` = c(0,1))
answer
df %>%
mutate(Total.P.n = rowSums(across(matches('p\\.n'))))
# A tibble: 2 × 5
ID. do.p.n.NP do.p.n.SE. p.d.e.sp.SR Total.P.n
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1510 4 6 0 10
2 1515 2 0 1 2
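To see why the unescaped pattern can pick up too many columns, here is a small base R sketch. The column name dopxn.XX is made up for the illustration; the other names come from the question.

```r
# Column names from the question plus one made-up name (dopxn.XX)
cols <- c("ID.", "do.p.n.NP", "do.p.n.SE", "p.d.e.sp.SR", "dopxn.XX")

# Unescaped: "." matches any single character, so "pxn" matches as well
loose <- grep("p.n", cols, value = TRUE)

# Escaped dot matches only the literal text "p.n"
strict <- grep("p\\.n", cols, value = TRUE)

# fixed = TRUE treats the whole pattern as plain text, with the same effect
plain <- grep("p.n", cols, fixed = TRUE, value = TRUE)

loose
strict
plain
```

Note that ignore.case only controls upper versus lower case; it has no influence on how the dot is interpreted.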

Cut function alternative in R

I have some data in the form:
Person.ID Household.ID Composition
1 4593 1A_0C
2 4992 2A_1C
3 9843 1A_1C
4 8385 2A_2C
5 9823 8A_1C
6 3458 1C_9C
7 7485 2C_0C
: : :
We can think of the Composition variable as a count of adults/children, i.e. 2A_1C would equate to two adults and one child.
What I want to do is reduce the number of possible levels of Composition. For person 5 we have a composition of 8A_1C; I am looking for a way to reduce this to 4+A_1C. So, for example, we would have 4+ for any composition value greater than 4A.
Person.ID Household.ID Composition
5 9823 4+A_1C
6 3458 1A_4+C
: : :
I am unsure how to do this in R. I am thinking of using filter() or select() from dplyr. Otherwise I would need to use some sort of regular expression.
Any help would be appreciated. Thanks
Data:
library(dplyr) # provides tibble() and %>%
Person.ID <- c(1,2,3,4,5,6,7)
Household.ID <- c(4593,4992,9843,8385,9823,3458,7485)
Composition <- c("1A_0C","2A_1C","1A_1C","2A_2C","8A_1C","1A_9C","2A_0C")
dat <- tibble(Person.ID, Household.ID, Composition)
Function:
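above4 <- function(f){
# keep only the digits and compare as a number, so that e.g. "10A" also counts as greater than 4
ff <- as.numeric(gsub("[^0-9]","",f))
if(ff>4){return("4+")}
return(as.character(ff))
}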
Apply function (done on separated data, but can recombine after):
dat_ <- dat %>% tidyr::separate(., col=Composition,
into=c("Adults", "Children"),
sep="_") %>%
dplyr::mutate(Adults_ = unlist(lapply(Adults,above4)),
Children_ = unlist(lapply(Children,above4)))
You might then use select, filter to get your required dataset.
dat_ %>% dplyr::mutate(Composition_ = paste0(Adults_, "A_", Children_, "C")) %>%
dplyr::select(Person.ID, Household.ID, Composition=Composition_)
# A tibble: 7 x 3
Person.ID Household.ID Composition
<dbl> <dbl> <chr>
1 1. 4593. 1A_0C
2 2. 4992. 2A_1C
3 3. 9843. 1A_1C
4 4. 8385. 2A_2C
5 5. 9823. 4+A_1C
6 6. 3458. 1A_4+C
7 7. 7485. 2A_0C
We can use gsub:
df$Composition <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", df$Composition, perl = TRUE)
This assumes that 2 or more consecutive digits represent a number that's always greater than 4 (i.e. no 01, 02, or 001).
Output:
Person.ID Household.ID Composition
1 1 4593 1A_0C
2 2 4992 2A_1C
3 3 9843 1A_1C
4 4 8385 2A_2C
5 5 9823 4+A_1C
6 6 3458 1C_4+C
7 7 7485 2C_0C
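A vectorised base R variant that compares the counts as numbers could look like the sketch below. It assumes every value has the form <digits>A_<digits>C; the helper name cap_counts, its max_n argument, and the extra value 12A_10C are hypothetical and only serve the illustration.

```r
# cap_counts (hypothetical helper): replace any count above max_n with "<max_n>+"
cap_counts <- function(comp, max_n = 4) {
  # Pull the adult and child counts out of strings such as "8A_1C"
  adults   <- as.numeric(sub("^(\\d+)A_.*$", "\\1", comp))
  children <- as.numeric(sub("^.*_(\\d+)C$", "\\1", comp))
  # Numeric comparison, so multi-digit counts are handled correctly
  cap <- function(n) ifelse(n > max_n, paste0(max_n, "+"), as.character(n))
  paste0(cap(adults), "A_", cap(children), "C")
}

Composition <- c("1A_0C", "2A_1C", "1A_1C", "2A_2C",
                 "8A_1C", "1A_9C", "2A_0C", "12A_10C")
capped <- cap_counts(Composition)
capped
```

Because the counts are parsed as numbers, this does not depend on the assumption that two or more digits always mean a value above 4.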

Compare two data.frame and delete rows with common characters

I have two data.frame x1 & x2. I want to remove rows from x2 if there is a common gene found in x1 and x2
x1 <- chr start end Genes
1 8401 8410 Mndal,Mnda,Ifi203,Ifi202b
2 8001 8020 Cyb5r1,Adipor1,Klhl12
3 4001 4020 Alyref2,Itln1,Cd244
x2 <- chr start end Genes
1 8861 8868 Olfr1193
1 8405 8420 Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
Desired result:
x2 <- chr start end Genes
1 8861 8868 Olfr1193
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
You could try
x2[mapply(function(x,y) !any(x %in% y),
strsplit(x1$Genes, ','), strsplit(x2$Genes, ',')),]
# chr start end Genes
#2 2 8501 8520 Chia,Chi3l3,Chi3l4
#3 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
Or replace !any(x %in% y) with length(intersect(x,y))==0.
NOTE: If the "Genes" column is "factor", convert it to "character" as strsplit cannot take 'factor' class. i.e. strsplit(as.character(x1$Genes), ',')
Update
Based on the new dataset for 'x2', we can merge the two datasets by the 'chr' column, strsplit the 'Genes.x' and 'Genes.y' columns of the merged dataset ('xNew'), get a logical index based on the occurrence of any element of 'Genes.x' in the 'Genes.y' strings, and use that to subset the 'x2' dataset.
xNew <- merge(x1, x2[,c(1,4)], by='chr')
indx <- mapply(function(x,y) any(x %in% y),
strsplit(xNew$Genes.x, ','), strsplit(xNew$Genes.y, ','))
x2[!indx,]
# chr start end Genes
#1 1 8861 8868 Olfr1193
#3 2 8501 8520 Chia,Chi3l3,Chi3l4
#4 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
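If a shared gene should remove a row of x2 regardless of chromosome, a base R sketch without the merge could look like this. It is a different rule from the chr-wise comparison above, so treat it as an assumption about what is wanted; the data frames are rebuilt from the question, and x1_genes and shared are names introduced here.

```r
x1 <- data.frame(
  chr = c(1, 2, 3), start = c(8401, 8001, 4001), end = c(8410, 8020, 4020),
  Genes = c("Mndal,Mnda,Ifi203,Ifi202b", "Cyb5r1,Adipor1,Klhl12", "Alyref2,Itln1,Cd244"),
  stringsAsFactors = FALSE
)
x2 <- data.frame(
  chr = c(1, 1, 2, 3), start = c(8861, 8405, 8501, 4321), end = c(8868, 8420, 8520, 4670),
  Genes = c("Olfr1193", "Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b",
            "Chia,Chi3l3,Chi3l4", "Tdpoz4,Tdpoz3,Tdpoz5"),
  stringsAsFactors = FALSE
)

# All gene names that occur anywhere in x1
x1_genes <- unique(unlist(strsplit(x1$Genes, ",", fixed = TRUE)))

# TRUE for every x2 row that contains at least one of those genes
shared <- vapply(strsplit(x2$Genes, ",", fixed = TRUE),
                 function(g) any(g %in% x1_genes),
                 logical(1))

x2[!shared, ]
```

This does not depend on the row order produced by merge, and it works when x1 and x2 have different numbers of rows.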

Subset only those rows whose intervals does not fall within another data.frame

How can I compare two data frames (test and control) of unequal length, and remove rows from test based on three criteria: i) test$chr == control$chr,
ii) test$start and test$end lie within the range of control$start and control$end,
iii) test$CNA and control$CNA are the same.
test =
R_level logp chr start end CNA Gene
2 7.079 11 1159 1360 gain Recl,Bcl
11 2.4 12 6335 6345 loss Pekg
3 19 13 7180 7229 loss Sox1
control =
R_level logp chr start end CNA Gene
2 5.9 11 1100 1400 gain Recl,Bcl
2 3.46 11 1002 1345 gain Trp1
2 6.4 12 6705 6845 gain Pekg
4 7 13 6480 8129 loss Sox1
The result should look something like this
result =
R_level logp chr start end CNA Gene
11 2.4 12 6335 6345 loss Pekg
Here's one way using foverlaps() from data.table.
require(data.table) # v1.9.4+
dt1 <- as.data.table(test)
dt2 <- as.data.table(control)
setkey(dt2, chr, CNA, start, end)
olaps = foverlaps(dt1, dt2, nomatch=0L, which=TRUE, type="within")
# xid yid
# 1: 1 2
# 2: 3 4
dt1[!olaps$xid]
# R_level logp chr start end CNA Gene
# 1: 11 2.4 12 6335 6345 loss Pekg
Read ?foverlaps and see the examples section for more info.
Alternatively, you can also use GenomicRanges package. However, you might have to filter based on CNA after merging by overlapping regions (AFAICT).
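When you say "exclude the variable", I assume you mean you want to remove the rows that satisfy those criteria.
If so, you are nearly there. The following should work, provided data1 and data2 have the same number of rows, since the comparison is made row by row:
exclude_bool <- data1[,3] == data2[,3] & # same chr
data1[,4] >= data2[,4] & # start at or after the control start
data1[,5] <= data2[,5] & # end at or before the control end
data1[,6] == data2[,6] # same CNA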
data1 <- data1[!exclude_bool , ]
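For data frames of unequal length without data.table, each test row can be checked against all control rows. The following is a base R sketch under that reading of the three criteria, with the example data rebuilt from the question; covered is a name introduced here.

```r
test <- data.frame(
  R_level = c(2, 11, 3), logp = c(7.079, 2.4, 19), chr = c(11, 12, 13),
  start = c(1159, 6335, 7180), end = c(1360, 6345, 7229),
  CNA = c("gain", "loss", "loss"), Gene = c("Recl,Bcl", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)
control <- data.frame(
  R_level = c(2, 2, 2, 4), logp = c(5.9, 3.46, 6.4, 7), chr = c(11, 11, 12, 13),
  start = c(1100, 1002, 6705, 6480), end = c(1400, 1345, 6845, 8129),
  CNA = c("gain", "gain", "gain", "loss"), Gene = c("Recl,Bcl", "Trp1", "Pekg", "Sox1"),
  stringsAsFactors = FALSE
)

# For each test row: does ANY control row have the same chr and CNA
# and an interval that fully contains the test interval?
covered <- vapply(seq_len(nrow(test)), function(i) {
  any(control$chr == test$chr[i] &
      control$CNA == test$CNA[i] &
      control$start <= test$start[i] &
      control$end >= test$end[i])
}, logical(1))

# Keep only the test rows that are not covered by any control row
test[!covered, ]
```

This loops over the test rows, which is fine for small tables; for large inputs the foverlaps() approach is likely to scale better.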
