Find closest match in gene expression data in R

I am analyzing a dataset and need to find matching samples between two versions of the data.
They (should) contain the same expression data, but they have different sample identifiers. Let's say the first dataframe looks like this:
gene sample expression
1 a a 1
2 a b 2
3 a c 3
4 a d 4
5 a e 5
6 a f 6
7 a g 7
8 a h 8
9 a i 9
10 a j 10
11 a k 11
12 a l 12
13 a m 13
14 a n 14
I made the dataframe for one gene, but you can imagine that this is a large dataset containing ~20k genes. What I need to do is find the closest match in gene expression so I know which samples correspond. The second dataframe might look like this:
gene sample expression
1 a z 1.5
2 a y 2.5
3 a x 3
4 a w 4.5
5 a v 5.7
6 a u 6.2
7 a t 7.8
8 a s 8.1
9 a r 9.8
10 a q 10.5
11 a p 11
12 a o 12
13 a 2 13.3
14 a 4 14.4
What I need to do is write a function (or something like that) that tries to match the gene expression values between the two dataframes as closely as possible (for all genes) and reports the sample identifiers with the closest match. I'm quite new to R and could use a little help.
I would like the output to look like this:
gene sample expression sample2
1 a a 1 z
2 a b 2 y
3 a c 3 x
4 a d 4 w
5 a e 5 v
6 a f 6 u
7 a g 7 t
8 a h 8 s
9 a i 9 r
10 a j 10 q
11 a k 11 p
12 a l 12 o
13 a m 13 2
14 a n 14 4
That is, an extra column per sample that specifies the closest match in gene expression across all genes. The extra column must be created based on all genes, not on one gene.

Here are two options. In your example, it looks like there are always whole number matches, so you could join by whole number. Alternatively, you could try to extract the closest number. I use floor because it looks like you want 1.5 to be joined to 1 and not 2.
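(Note: df1 and df2 are not defined anywhere in the question, so the code below assumes they were rebuilt as tibbles from the printed data, roughly like this.)
library(tibble)
# Assumed reconstruction of the example data from the question
df1 <- tibble(gene = "a",
              sample = letters[1:14],
              expression = as.numeric(1:14))
df2 <- tibble(gene = "a",
              sample = c(letters[26:15], "2", "4"),
              expression = c(1.5, 2.5, 3, 4.5, 5.7, 6.2, 7.8,
                             8.1, 9.8, 10.5, 11, 12, 13.3, 14.4))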
library(tidyverse)
# extract closest whole number
df1 |>
  mutate(sample2 = map_chr(expression,
                           \(x) df2$sample[which.min(abs(x - floor(df2$expression)))]))
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
# join by whole number
left_join(df1,
          df2 |>
            mutate(expression = as.numeric(gsub("^(.*)\\.\\d+$", "\\1", expression))) |>
            select(sample2 = sample, expression),
          by = "expression")
#> # A tibble: 14 x 4
#> gene sample expression sample2
#> <chr> <chr> <dbl> <chr>
#> 1 a a 1 z
#> 2 a b 2 y
#> 3 a c 3 x
#> 4 a d 4 w
#> 5 a e 5 v
#> 6 a f 6 u
#> 7 a g 7 t
#> 8 a h 8 s
#> 9 a i 9 r
#> 10 a j 10 q
#> 11 a k 11 p
#> 12 a l 12 o
#> 13 a m 13 2
#> 14 a n 14 4
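Both options above match on one gene at a time. Since the question asks for a match based on all ~20k genes, here is a rough sketch (not part of the original answer) of one way to do it with the full long-format data: for every pair of samples, sum the absolute expression differences over all shared genes and keep the df2 sample with the smallest total distance. The function name match_samples() and the assumption that both data frames have gene, sample and expression columns are illustrative only.
library(dplyr)
# Sketch: pair every df1 sample with every df2 sample gene by gene, then keep
# the df2 sample whose total absolute expression difference is smallest.
match_samples <- function(df1, df2) {
  inner_join(df1, df2, by = "gene", suffix = c("", "2"),
             relationship = "many-to-many") |>  # relationship argument needs dplyr >= 1.1.0
    group_by(sample, sample2) |>
    summarise(dist = sum(abs(expression - expression2)), .groups = "drop") |>
    group_by(sample) |>
    slice_min(dist, n = 1, with_ties = FALSE) |>
    ungroup()
}
# Usage: match_samples(df1, df2) returns one row per df1 sample with its closest df2 sample.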

Related

Removing rows from a data.frame with -grep removes all rows if no matches are found (how to prevent this?)

I have a data frame that gets updated frequently, and there are some rows that need to be removed from it if certain strings are found in them. I have done that previously using -grep to remove the rows containing the string in question, e.g.:
dataframe[-grep('some string', dataframe$column),]
However, at times the string doesn't appear in the data frame, in which case the -grep call returns an empty data frame. Here's a minimal reproducible example:
> test.df<-data.frame(number=c(1:10), letter=letters[1:10])
> test.df
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
> test.df[-grep('h', test.df$letter),]
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
9 9 i
10 10 j
> test.df[-grep('k', test.df$letter),]
[1] number letter
<0 rows> (or 0-length row.names)
I could wrap the test.df[-grep(...) call in an if test to check whether the search string is found before removing it, e.g.:
if(any(grepl('k',test.df$letter))){test.df<-test.df[-grep('k', test.df$letter),]}
...but it seems to me that this should be implicit in the -grep command. Is there a better (more efficient) way to accomplish row removal that doesn't threaten to remove all my data if the search string is absent from the data frame?
Using grepl you could do:
test.df <- data.frame(number = c(1:10), letter = letters[1:10])
test.df[!grepl("h", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 9 9 i
#> 10 10 j
test.df[!grepl("k", test.df$letter), ]
#> number letter
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
#> 7 7 g
#> 8 8 h
#> 9 9 i
#> 10 10 j
Created on 2023-01-19 with reprex v2.0.2
Instead of using - when subsetting, you could use grep's invert argument.
test.df[grep('k', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#8 8 h
#9 9 i
#10 10 j
test.df[grep('h', test.df$letter, invert=TRUE),]
# number letter
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
#6 6 f
#7 7 g
#9 9 i
#10 10 j
In this case it looks like the whole string should be matched, so an alternative would be to use == or !=.
test.df[test.df$letter != "k",]
test.df[test.df$letter != "h",]
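For completeness, the reason the original -grep approach can wipe out the whole data frame is that grep() returns integer(0) when nothing matches, and indexing with -integer(0) selects zero rows. A quick illustration (a sketch, not from the answers above):
idx <- grep("k", test.df$letter)        # integer(0): no matches
test.df[-idx, ]                         # -integer(0) is still integer(0), so zero rows are kept
test.df[!grepl("k", test.df$letter), ]  # grepl() returns a full logical vector, so all rows are kept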

Collapse a dataframe in R that contains both numeric and character variables

I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4], 6),
                   "date" = c(sapply(1:3, function(x) rep(x, 8))),
                   "num_var1" = 1:24,
                   "num_var2" = 24:1,
                   "alpha_var1" = LETTERS[1:24],
                   "alpha_var2" = LETTERS[25:2])
and I would like to summarize (mean) its rows by ag and date using dplyr. The issue is that some columns contain characters: for those, I would like to get the first entry by group (the example dataset is already sorted).
Since my dataset has many variables, I would like the code to recognize whether a variable is numeric (including integer) or character. However, the best solution I have so far is the following:
data %>%
  dplyr::group_by(ag, date) %>%
  summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
  dplyr::group_by(ag, date) %>%
  summarise(across(everything(),
                   ~ if (is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
  group_by(ag, date) %>%
  summarise(across(where(is.numeric), mean),
            across(where(is.character), first),
            .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
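A small hedged extension (not from either answer): if the data ever contains columns that are neither numeric nor character (factors, dates), the where(is.character) selection above would leave them out of the result. Assuming every non-numeric column should simply keep its first value per group, the selection can be negated instead:
data %>%
  group_by(ag, date) %>%
  summarise(across(where(is.numeric), mean),
            across(!where(is.numeric), first),  # everything else keeps its first value
            .groups = "drop")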

How to build a variable that summarizes multiple variables

I have data that looks like the sample below; it can be generated with the following code:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
I want to build a variable "Event" that captures all events for a row in a single column (for example "P/N/C" when several events occur on the same day, as shown in the answers' output below).
What should I do? I would like to know as many ways as possible. Thanks.
One option could be using apply(), as shown below. The suggestion from @AllanCameron is also a great choice. Here is the code:
#Vectors
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
#Data
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C,stringsAsFactors = F)
# Option 1
index <- which(grepl('Event', names(Sample.data)))
Sample.data$Event <- apply(Sample.data[, index], 1,
                           function(x) paste0(x[x != ''], collapse = '/'))
Output:
ID Days Event_P Event_N Event_C Event
1 1 -5 C C
2 1 1
3 1 18 P C P/C
4 1 30
5 2 1 N N
6 2 8
7 2 16 P N C P/N/C
8 3 1
9 3 8
10 3 6 P N C P/N/C
11 4 -6 N N
12 4 1
13 4 7 P N P/N
14 4 15 P N P/N
Duck's answer is very good, but you mentioned you want as many ways as possible, so here are two more:
You could also use tidyverse's mutate() and base R's interaction() to combine the columns, then use gsub() to clear out the unnecessary separators:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
library(tidyverse)
Sample.data %>%
  mutate(Event = paste(Event_P, Event_N, Event_C, sep = '/'),
         Event = gsub('^/|^//|/$|//$', '', Event),
         Event = gsub('//', '/', Event))
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Sample.data$Event <-
interaction(Sample.data$Event_P, Sample.data$Event_N, Sample.data$Event_C, sep = '/') %>%
gsub('^/|^//|/$|//$', '', .) %>%
gsub('//', '/', .)
Sample.data
#> ID Days Event_P Event_N Event_C Event
#> 1 1 -5 C C
#> 2 1 1
#> 3 1 18 P C P/C
#> 4 1 30
#> 5 2 1 N N
#> 6 2 8
#> 7 2 16 P N C P/N/C
#> 8 3 1
#> 9 3 8
#> 10 3 6 P N C P/N/C
#> 11 4 -6 N N
#> 12 4 1
#> 13 4 7 P N P/N
#> 14 4 15 P N P/N
Created on 2020-09-18 by the reprex package (v0.3.0)
What the pattern inside gsub('^/|^//|/$|//$', '', ...) does is:
^/|^//: take out a / or // at the start of the string
/$|//$: take out a / or // at the end of the string
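Since the question asks for as many ways as possible, one more hedged option (a sketch, not from either answer) is tidyr::unite(), which pastes the columns together in one step; the extra slashes left by empty cells still need to be cleaned up with gsub(). Note that unite() places the new column where Event_P was, so the column order differs slightly from the outputs above.
library(dplyr)
library(tidyr)
Sample.data %>%
  unite("Event", Event_P, Event_N, Event_C, sep = "/", remove = FALSE) %>%
  mutate(Event = gsub("/+", "/", Event),    # collapse runs of slashes from empty cells
         Event = gsub("^/|/$", "", Event))  # trim a leading or trailing slash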

R: split rows into different rows [duplicate]

Say I have a data frame like the following:
> mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
> mydf
a b
1 A 1
2 B 2
3 C 3
4 D/E 4/5
5 F 6
6 G/H 7/8
7 I/J 9/10
8 K 11
9 L 12
How do I make it look like the following, with an easy one-liner (preferably base)? Thanks
> mydf2
a b
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F 6
7 G 7
8 H 8
9 I 9
10 J 10
11 K 11
12 L 12
You can use separate_rows() from the tidyr package:
library(tidyr)
mydf <- data.frame(a=c('A','B','C','D/E','F','G/H','I/J','K','L'), b=c(1,2,3,'4/5',6,'7/8','9/10',11,12))
mydf
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D/E 4/5
#> 5 F 6
#> 6 G/H 7/8
#> 7 I/J 9/10
#> 8 K 11
#> 9 L 12
separate_rows(mydf, a, b, convert = TRUE)
#> a b
#> 1 A 1
#> 2 B 2
#> 3 C 3
#> 4 D 4
#> 5 E 5
#> 6 F 6
#> 7 G 7
#> 8 H 8
#> 9 I 9
#> 10 J 10
#> 11 K 11
#> 12 L 12
Created on 2018-04-18 by the reprex package (v0.2.0).
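Since the question asked for base R where possible, here is a hedged base alternative (a sketch, assuming every row splits into the same number of "/"-separated pieces in a and b):
# Split both columns on "/" and rebuild the data frame; unlist() preserves the
# row-wise pairing as long as a and b produce the same number of pieces per row.
a_parts <- strsplit(as.character(mydf$a), "/", fixed = TRUE)
b_parts <- strsplit(as.character(mydf$b), "/", fixed = TRUE)
mydf2 <- data.frame(a = unlist(a_parts), b = as.numeric(unlist(b_parts)))
mydf2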

Randomizing consecutive n rows of a dataframe in R

The problem I have is as explained in the title. I want to randomize the top, middle and bottom 3 rows in place. Here is a sample dataframe.
> set.seed(7)
> mydf
Id Name Score Feedback
1 1 AB 11 P
2 2 AA 12 P
3 3 AC 12 P
4 4 AD 31 P
5 5 AE 13 P
6 6 AF 15 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
I could take the top, middle and last 3 rows independently, randomize each, and merge them back as follows:
# Take consecutive sets of 3 rows from mydf
top3 <- head(mydf,3)
middle3 <- mydf[4:6,]
tail3 <- tail(mydf,3)
# randomize the rows
top3r <- top3[sample(nrow(top3)),]
middle3r <- middle3[sample(nrow(middle3)),]
tail3r <- tail3[sample(nrow(tail3)),]
# merge them back
mydfr <- rbind(top3r, middle3r, tail3r)
> mydfr
Id Name Score Feedback
2 2 AA 12 P
1 1 AB 11 P
3 3 AC 12 P
6 6 AF 15 P
4 4 AD 31 P
5 5 AE 13 P
7 7 AG 9 F
8 8 AH 8 F
9 9 AI 11 P
Is there some way I could achieve the same result without going through the manual process of pulling out the n rows?
Thank you,
This is basically the same as your code, but without all the intermediate variables.
mydf[c(sample(1:3), sample(4:6), sample(7:9)), ]
Here is a way it could be done if you wanted to use dplyr (I do like the base solution by @Gregor in the comments though).
library(dplyr)
set.seed(1)
mydf %>%
  mutate(grp = rep(1:3, each = 3)) %>%
  group_by(grp) %>%
  sample_n(3)
#> # A tibble: 9 x 5
#> # Groups: grp [3]
#> Id Name Score Feedback grp
#> <int> <chr> <int> <chr> <int>
#> 1 1 AB 11 P 1
#> 2 3 AC 12 P 1
#> 3 2 AA 12 P 1
#> 4 6 AF 15 P 2
#> 5 4 AD 31 P 2
#> 6 5 AE 13 P 2
#> 7 9 AI 11 P 3
#> 8 8 AH 8 F 3
#> 9 7 AG 9 F 3
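A brief hedged note: in dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(), so the same idea can be written as the sketch below.
mydf %>%
  mutate(grp = rep(1:3, each = 3)) %>%
  group_by(grp) %>%
  slice_sample(n = 3) %>%  # shuffle the 3 rows within each block
  ungroup() %>%
  select(-grp)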
