Find nearest entry between dataframes in R

I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) are not within the range of values in dataframe two (df2$START and df2$STOP), but I want to return the row in dataframe two where df2$START or df2$STOP is closest in value to df1$BP, where the "Chr" column matches between datasets (df1$Chr, df2$Chr).
I have managed to do this (see bottom of Q), but it seems SERIOUSLY unwieldy, and I wondered if there was a more parsimonious manner of achieving the same thing.
So for dataframe one (df1), we have:
df1=data.frame(SNP=c("rs79247094","rs13325007"),
Chr=c(2,3),
BP=c(48554955,107916058))
df1
SNP Chr BP
rs79247094 2 48554955
rs13325007 3 107916058
For dataframe two, we have:
df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
Chr=c(2,2,2,2,3,3,3,3),
START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))
df2
Clump Chr START STOP
1 2 28033538 27451538
2 2 37576136 36998607
3 2 58143438 57845065
4 2 60389362 60242162
5 3 80814042 79814042
6 3 107379837 107118837
7 3 136288405 135530405
8 3 161777035 161092491
I am interested in returning which START/STOP values are closest to BP. Ideally, I could return the row, and what the difference between BP and START or STOP is (df3$Dist), like:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
I have found similar questions, for example: Return rows establishing a "closest value to" in R
But these are finding the closest values based on a fixed value, rather than a value that changes (and matching on the Chr column).
My long winded method is:
df3<-right_join(df1,df2,by="Chr")
to give me all of the combinations of df1 and df2 together.
df3$start_dist<-abs(df3$START-df3$BP)
to create a column with the absolute difference between START and BP
df3$stop_dist<-abs(df3$STOP-df3$BP)
to create a column with the absolute difference between STOP and BP
df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
df3<-df3[with(df3,order(SNP,dist.compare)),]
to create a column (dist.compare) holding the smaller of the two differences between BP and START or STOP (and then re-order by that column)
df3<- df3 %>% group_by(SNP) %>% mutate(Dist = first(dist.compare))
to create a column (Dist) which prints the minimum value from df3$dist.compare
df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]
to only print rows where dist.compare matches Dist (so the minimum value), and drop the intermediate columns, and tidy up by re-ordering by clump. Now that gets me to where I want to be:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
But I feel like it is very very convoluted, and wondered if anyone had any tips on how to refine that process?
Thanks in advance.

Following the logic you've laid out in your code, here is a dplyr solution that is a bit cleaner:
1. right_join your dataframes.
2. Create a variable dist.compare based on the absolute differences.
3. Group by SNP.
4. Filter to keep the smallest distance.
5. Select variables in the order you'd like for your final dataframe. Note that you can rename variables inside a dplyr::select statement (Dist = dist.compare).
6. Order values by clump.
library(dplyr)
df3 <- right_join(df1, df2, by = "Chr") %>%
mutate(dist.compare = ifelse(abs(START - BP) < abs(STOP - BP), abs(START - BP), abs(STOP - BP))) %>%
group_by(SNP) %>%
filter(dist.compare == min(dist.compare)) %>%
select(clump, Chr, START, STOP, SNP, BP, Dist = dist.compare) %>%
arrange(clump)
This gives us:
clump Chr START STOP SNP BP Dist
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 3 2 58143438 57845065 rs79247094 48554955 9290110
2 6 3 107379837 107118837 rs13325007 107916058 536221
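For what it's worth, a slightly more compact variant along the same lines (a sketch I'm adding, not part of the original answer; it assumes dplyr >= 1.0.0 for slice_min()) replaces the ifelse() with a row-wise pmin():
library(dplyr)

df3 <- right_join(df1, df2, by = "Chr") %>%
  mutate(Dist = pmin(abs(START - BP), abs(STOP - BP))) %>%  # row-wise smallest distance to START or STOP
  group_by(SNP) %>%
  slice_min(Dist, n = 1) %>%                                # keep the closest clump per SNP
  ungroup() %>%
  select(clump, Chr, START, STOP, SNP, BP, Dist) %>%
  arrange(clump)
On an exact tie, slice_min() keeps all tied rows by default (like the filter() approach); add with_ties = FALSE if you need exactly one row per SNP.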

Related

How to return the range of values shared between two data frames in R?

I have several data frames that share the same column names: an ID, from and to columns giving the start and end of a range, and a group label.
What I want is to find which from and to values of one data frame fall within the ranges of the other one. I include an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but no luck. It could be done by checking whether between() returns TRUE and, if so, returning the maximum of the from values and the minimum of the to values across the data frames.
Below are example data frames and the result I am trying to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried using this function, which returns the values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can play with the idea that the starting and ending points are in different columns and that the ranges for the same group (a and b) do not overlap. This is my solution; I have called your mutated 'from' and 'to' columns 'point_1' and 'point_2' for clarity.
You can bind the two data frames by rows and compare the from column with the previous row's end point, lag(to), to see whether the current range starts before the previous one ends. You also compare the previous lag(to) with the current to column to see whether the previous range extends beyond the current one.
Importantly, these operations do not distinguish whether the two rows being compared come from the same group (a or b). Filtering out the NAs in point_1 (the new mutated 'from' column) therefore removes the wrongly mutated values.
Also note that I assume a range in 'a' cannot overlap two rows in 'b'. That doesn't happen in your 'results' table, but you should check for it in your data frames.
library(dplyr)

res = rbind(a,b) %>% # Bind by rows
arrange(ID,from) %>% # arrange by ID and starting point (from)
group_by(ID) %>% # perform the following operations grouped by IDs
# Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
# not overlap, when you mutate the following cols the result will be NA for
# point_1.
mutate(point_1 = ifelse(from <= lag(to), from, NA),
point_2 = ifelse(lag(to)>=to, to, lag(to)),
groups = paste(lag(group), group, sep = ',')) %>%
filter(! is.na(point_1)) %>% # remove NAs in from
select(ID,point_1, point_2, groups) # get the result dataframe
If you play around with the code without the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
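For reference, a more direct way to get the shared ranges (my own sketch, not part of the original answer) is to join the two data frames on ID, keep only the overlapping pairs, and take the intersection with pmax()/pmin():
library(dplyr)

overlaps <- inner_join(a, b, by = "ID", suffix = c(".a", ".b")) %>%
  filter(from.a <= to.b, from.b <= to.a) %>%             # keep only pairs of ranges that overlap
  transmute(ID,
            from  = pmax(from.a, from.b),                # start of the shared range
            to    = pmin(to.a, to.b),                    # end of the shared range
            group = paste(group.a, group.b, sep = ", "))
On the example data this reproduces the five rows of results, and it does not rely on the ranges being sorted or on a range in one group overlapping at most one range in the other.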

Copy value into NA from one column using criteria from another column

I'm new to R and I've been struggling with the tidyverse.
I have created the following data frame as an example. My original data frame has 180000 obs and 34 vars.
name <- c("chem1", "chem2", "chem3", "chem4", "chem5")
cas <- c("29331-92-5", "29331-92-6", NA, "29331-92-4", "29331-92-1" )
tib <- tibble(name, cas)
which generate this:
tib
# A tibble: 5 x 2
name cas
<chr> <chr>
1 chem1 29331-92-5
2 chem2 29331-92-6
3 chem3 NA
4 chem4 29331-92-4
5 chem5 29331-92-1
chem3 and chem1 must have the same cas value; however, the input file came with an NA value for chem3.
I do not know how to copy the cas value belonging to chem1, "29331-92-5", into the NA cell.
I've been trying with the tidyverse, but I am happy to receive any base R feedback as well.
What I understood: the value for chem3 was supposed to be equal to chem1's, but your input file came with an error, because in a lot of lines the values of chem3 are different from chem1.
To correct this, I would make a lookup vector where the value for each "chem" is the correct one. Once I make sure that all the values in this lookup vector are correct, I would just map this lookup vector across your tib data frame. To do this, I first extract all the current unique values of each "chem" as follows:
library(tidyverse)
group <- tib %>%
group_by(name, cas) %>%
summarise(
count = n()
)
After that, I turn these unique cas values into a vector and then name each value according to its respective chem. Since the value for "chem3" is incorrect, I need to set it equal to the value for "chem1" before proceeding.
levels <- group$cas
names(levels) <- group$name
levels["chem3"] <- levels["chem1"]
Now that I have corrected the value for "chem3" in the lookup vector levels, I just ask R to repeat the values of levels in the same order as the names appear in your tib data frame, and then I save the result in a new column through the mutate() function.
tib <- tib %>%
mutate(
correct_cas = levels[tib$name]
)
Resulting in this:
# A tibble: 5 x 3
name cas correct_cas
<chr> <chr> <chr>
1 chem1 29331-92-5 29331-92-5
2 chem2 29331-92-6 29331-92-6
3 chem3 NA 29331-92-5
4 chem4 29331-92-4 29331-92-4
5 chem5 29331-92-1 29331-92-1
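If only this one value needs fixing, a shorter base R sketch (my addition, assuming chem3 should simply inherit chem1's cas) is:
# Assumes only chem3 needs to inherit chem1's cas value; adjust the names as needed.
tib$cas[tib$name == "chem3" & is.na(tib$cas)] <- tib$cas[tib$name == "chem1"]
The lookup-vector approach above scales better when many names need correcting.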

Incorrect output from inner_join of dplyr package

I have two datasets, named "results" and "support2", available here.
I want to merge the two datasets by the only common column name "SNP". Code below:
> library(dplyr)
> library(readr)
> results <- read_delim("<path>\\results", delim = "\t", col_names = TRUE)
> support2 <- read_delim("<path>\\support2", delim = "\t", col_names = TRUE)
> head(results)
# A tibble: 6 x 2
SNP p.value
<chr> <dbl>
1 rs28436661 0.334
2 rs9922067 0.322
3 rs2562132 0.848
4 rs3930588 0.332
5 rs2562137 0.323
6 rs3848343 0.363
> head(support2)
# A tibble: 6 x 2
SNP position
<chr> <dbl>
1 rs62028702 60054
2 rs190434815 60085
3 rs62028703 60087
4 rs62028704 60095
5 rs181534180 60164
6 rs186233776 60177
> dim(results)
[1] 188242 2
> dim(support2)
[1] 1210619 2
# determine the number of common SNPs
length(Reduce(intersect, list(results$SNP, support2$SNP)))
[1] 187613
I would expect that after inner_join, the new data would have 187613 rows.
> newdata <- inner_join(results, support2)
Joining, by = "SNP"
> dim(newdata)
[1] 1409812 3
Strangely, instead of having 187613 rows, the new data has 1409812 rows, which is even larger than the combined number of rows of the two dataframes.
I switched to the merge function as below:
> newdata2 <- merge(results, support2)
> dim(newdata2)
[1] 1409812 3
This second new dataframe has the same issue. No idea why.
I would like to know how to obtain a new dataframe whose rows are the common rows of the two dataframes (it should have 187613 rows) and whose columns include the columns of both dataframes.
It could be a result of duplicate elements
results <- data.frame(col1 = rep(letters[1:3], each = 3), col2 = rnorm(9))
support2 <- data.frame(col1 = rep(letters[1:5],each = 2), newcol = runif(10))
library(dplyr)
out <- inner_join(results, support2)
nrow(out)
#[1] 18
Here, the common column ('col1') contains duplicated values in both initial datasets, so the join cannot tell which row to take as the match; the result is a situation similar to a cross join, though not exactly that.
As already pointed out by #akrun, the data may have duplicates; that is quite possibly the only explanation of this behavior.
From the documentation of intersect, it always returns unique values, whereas an inner join can produce duplicates if the "by" value has duplicates, hence the count mismatch.
If you truly want to verify this, look at the count of unique values of the by variable (your unique key); it should match your intersect result. But that doesn't mean your join/merge is right: ideally, any join with duplicates in both table A and table B is not recommended (unless, of course, you have a business or other justification). So check whether the duplicates are present in both tables or only in one of them. If they are found in only one of the tables, then your merge/join is probably alright. I hope I have been able to explain the scenario.
Please let me know if this doesn't answer your question, and I shall remove it.
From Documentations:
intersect:
Each of union, intersect, setdiff and setequal will discard any
duplicated values in the arguments, and they apply as.vector to their
arguments
inner_join():
return all rows from x where there are matching values in y, and all
columns from x and y. If there are multiple matches between x and y,
all combination of the matches are returned.
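As a practical follow-up (my sketch, not part of the original answers), you can check where the duplicates live and, if they carry no extra information, keep one row per SNP before joining:
library(dplyr)

# How many duplicated SNP ids does each table contain?
sum(duplicated(results$SNP))
sum(duplicated(support2$SNP))

# De-duplicate before joining so each SNP matches at most once.
newdata <- results %>%
  distinct(SNP, .keep_all = TRUE) %>%
  inner_join(distinct(support2, SNP, .keep_all = TRUE), by = "SNP")
Whether dropping duplicates is appropriate depends on what the repeated SNP rows mean in your data.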

R: Populating a data frame with multiple matches for a single value without looping

I have a working solution to this problem using a while loop. I have been made aware that it is typically bad practice to use loops in R, so I was wondering about alternative approaches.
I have two dataframes, one single-column df full of gene names:
head(genes)
Genes
1 C1QA
2 C1QB
3 C1QC
4 CSF1R
5 CTSC
6 CTSS
And a two-column df that has pairs of the gene name (HGNC.symbol) and accompanying ensembl ID (Gene.stable.ID) for each transcript of the given gene:
head(ensembl_key)
Gene.stable.ID HGNC.symbol
1 ENSG00000210049 MT-TF
2 ENSG00000211459 MT-RNR1
3 ENSG00000210077 MT-TV
4 ENSG00000210082 MT-RNR2
5 ENSG00000209082 MT-TL1
6 ENSG00000198888 MT-ND1
My goal is to create a df that for each gene in the genes df extracts all corresponding transcript ID's (Gene.stable.ID) from the ensembl_key df.
The reason I have only found the looping solution is that a single entry in genes may have multiple matches in ensembl_key. I need to retain all matches and include them in the final df, and I also do not know a priori how many matches a single ID from genes has.
Here is my current working solution:
# Create large empty df to hold all transcripts
gene_transcript<- data.frame(matrix(NA, nrow= 5000, ncol= 2))
colnames(gene_transcript)<- c("geneID", "ensemblID")
# Populate Ensembl column
curr_gene<- 1
gene_count<- 1
while(gene_count <= dim(genes)[1]){
transcripts<- ensembl_key[which(ensembl_key$HGNC.symbol== genes$Genes[gene_count]),1]
if(length(transcripts)>1){
num<- length(transcripts)-1
gene_transcript$geneID[curr_gene:(curr_gene+num)]<- genes$Genes[gene_count]
gene_transcript$ensemblID[curr_gene:(curr_gene+num)]<- transcripts
gene_count<- gene_count+1
curr_gene<- curr_gene + num + 1
}
else{
gene_transcript$geneID[curr_gene]<- genes$Genes[gene_count]
gene_transcript$ensemblID[curr_gene]<- transcripts
gene_count<- gene_count+1
curr_gene<- curr_gene + 1
}
}
# Remove unused (NA) rows
last_row<- which(is.na(gene_transcript$geneID)==T)[1]-1
gene_transcript<- gene_transcript[1:last_row,]
Any help is greatly appreciated, thanks!
It sounds like you want to join or merge. There are several ways to do this, but the following should work:
merge(genes,
ensembl_key,
by.x = "Genes",
by.y = "HGNC.symbol")
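An equivalent dplyr sketch (my addition; the rename() is only there to match the column names used in the loop version):
library(dplyr)

gene_transcript <- genes %>%
  inner_join(ensembl_key, by = c("Genes" = "HGNC.symbol")) %>%  # keeps every matching row per gene
  rename(geneID = Genes, ensemblID = Gene.stable.ID)
Use left_join() instead of inner_join() if genes without a match in ensembl_key should be kept with NA IDs.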

Scale variables over a moving date window in R: script works, but unacceptably slow. Ways to optimize?

I have a data frame where each row represents data for a specific category on a specific day:
set.seed(1)
k <- 10
df <- data.frame(
name = c(rep('a',k), rep('b',k)),
date = rep(seq(as.Date('2017-01-01'),as.Date('2017-01-01')+k-1, 'days'),2),
x = runif(2*k,1,20),
y = runif(2*k,100,300)
)
View(df)
Head:
head(df)
name date x y
1 a 2017-01-01 6.044665 286.9410
2 a 2017-01-02 8.070354 142.4285
3 a 2017-01-03 11.884214 230.3348
4 a 2017-01-04 18.255948 125.1110
5 a 2017-01-05 4.831957 153.4441
6 a 2017-01-06 18.069404 177.2228
Structure:
str(df)
'data.frame': 20 obs. of 4 variables:
$ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
$ date: Date, format: "2017-01-01" "2017-01-02" "2017-01-03" "2017-01-04" ...
$ x : num 6.04 8.07 11.88 18.26 4.83 ...
$ y : num 287 142 230 125 153 ...
I need to scale x and y variables of this data over a specific date window.
The script I came up with is the following:
library(dplyr)
library(lubridate)
df2 <- df
moving_window_days <- 4
##Iterate over each row in df
for(i in 1:nrow(df)){
df2[i,] <- df %>%
##Give me only rows for 'name' on the current row
##which are within the date window of interest
filter(date <= date(df[i,"date"]) &
date >= date(df[i,"date"]) - moving_window_days &
name == df[i,"name"]
) %>%
##Now scale x and y on this date window
mutate(x = percent_rank(x),
y = percent_rank(y)
) %>%
##Get rid of the rest of the rows - leave only the row we are looking at
filter(date == date(df[i,"date"]))
}
It works as intended (well, I initially wanted to get each observation's percentile in a moving window, but scaled values will work just fine)
The problem is that the real dataset is much larger:
'name' column has 30 local branch offices
'date' is at least a year worth of data for each branch
instead of 'x' and 'y' I have 6 variables
the moving window is 90 days
I ran this script on the real data, and out of 30,000 rows it was able to go over only 5,000 in 4 hours...
This is the first time I run into a problem like this.
I am sure my script is highly inefficient (I'm sure because I am not a pro in R. I am just assuming there is always a better way)
Any way this script can be optimized / improved?
Any way to 'purrrify' (use some of the map functions in purrr)?
Nested dataframe? nest()? Thinking this is a solution... Not sure how to implement...
Anything I can do to perhaps tackle the problem in a different manner?
One thing you can do is parallel processing. I utilize the future package for this. This may annoy some people, who may consider it a hack, because the future package is intended... well... for futures (or "promises" if you're a front-end developer). This approach is finicky, but works very well.
library(future)
# Create a function that iterates over each row in the df:
my_function <- function(df, x) {
x <- df
for(i in 1:nrow(df)){
x[i, ] <- df %>%
##Give me only rows for 'name' on the current row
##which are within the date window of interest
filter(date <= date(df[i,"date"]) &
date >= date(df[i,"date"]) - moving_window_days &
name == df[i,"name"]
) %>%
##Now scale x and y on this date window
mutate(x = percent_rank(x),
y = percent_rank(y)
) %>%
##Get rid of the rest of the rows - leave only the row we are looking at
filter(date == date(df[i,"date"]))
}
return(x)
}
plan(multiprocess) # make sure to always include this in a run of the code.
# Divide df evenly into three separate dataframes:
df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)
# See if your computer is able to split df into 4 or 5 separate dataframes.
# Now bind the dataframes together, but get the indexing right:
rbind(df1, df2[(nrow(df2) - 6):nrow(df2), ], df3[(nrow(df3) - 5):nrow(df3), ])
Parallel processing is one of many ways to optimize code for efficiency. This exact technique has substantially sped up code for me in the past. It has reduced the run time of a program from a day and a half, down to 3 or 4 hours!
Now, ideally, we'd like to work with the apply family and matrices. This answer is just one of many ways we can speed up code. Also, the future package allows us to parallel process without learning a new looping structure, such as in the parallel package (which, nonetheless, is still an amazing package).
Also check out the Rcpp package. It'll take some time to learn, but is incredible for unlocking the speed of C++.
zoo::rollapply can be quite fast.
df2 <- df %>%
group_by(name) %>%
mutate(x2 = zoo::rollapply(x, width = 4, FUN = percent_rank, fill = "extend")[,1],
y2 = zoo::rollapply(y, width = 4, FUN = percent_rank, fill = "extend")[,1])
Each call to rollapply generates a matrix with n=width columns. The first column is the value of the function for the window beginning with that observation, while the nth column is the value of the function for the window ending with that observation. You can change the [,1] to whichever column you want (the percentile in the middle of the window? at the end? at the beginning?).
The argument fill = "extend" duplicates the observations at the beginning or end of windows, since for the last n-k observations there are k missings from the window.
I expanded your dataset to a dummy of 28,496 rows, covering 26 names and 3 years of data, and ran this snippet with a width of 90 days. On my 4 year old desktop this took less than a minute for two variables:
user system elapsed
37.66 0.01 37.77
You could certainly use purrr::map2 to iterate over 6 variables (instead of calling rollapply 6 times in mutate), but I'm not sure it would speed it up substantially.
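As a side note (my sketch, not from the original answer), a trailing window that ends at the current observation can be requested directly with zoo::rollapplyr() (right-aligned) plus partial = TRUE for the shorter windows at the start. This assumes daily data with no gaps, so a 5-row window corresponds to the current day plus the previous 4 days:
library(dplyr)

df2 <- df %>%
  group_by(name) %>%
  mutate(
    x2 = zoo::rollapplyr(x, width = 5, partial = TRUE,
                         FUN = function(v) dplyr::last(percent_rank(v))),  # rank of the current value within its window
    y2 = zoo::rollapplyr(y, width = 5, partial = TRUE,
                         FUN = function(v) dplyr::last(percent_rank(v)))
  )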
#OP You should be CAUTIOUS with the answers provided
--My original answer--
library(tidyverse)
I first split the data frame into a list of data frames grouped by name
split.df <- split(df, df$name)
Using the split data, use lapply and map_df to iterate over the rows of each grouped df, filter for dates within the relevant window of time using between(), then mutate as you did before, and then filter for the relevant row again (I tried to 'copy' your code as closely as possible):
new <- lapply(split.df, function(z) map_df(1:nrow(z), ~z %>%
filter(between(date, z$date[.x]-moving_window_days, z$date[.x])) %>%
mutate(x=percent_rank(x),y=percent_rank(y)) %>%
filter(date==z$date[.x])))
This results in a list. To convert it back to a single data frame:
final <- Reduce("rbind",new)
Output (head)
name date x y
1 a 2017-01-01 0.0000000 0.00
2 a 2017-01-02 1.0000000 0.00
3 a 2017-01-03 1.0000000 0.50
4 a 2017-01-04 1.0000000 0.00
Let's make sure my result matches that of yours.
identical(final$x, OP.output$x)
[1] TRUE
--END of my original answer--
----------------------------COMPARING SOLUTIONS----------------------------
--#Brian's answer--
#Brian's answer does not give the result you expect. You said your function works as intended, so let's compare Brian's result with yours: the first block shows Brian's result, the second shows yours.
name date x y x2 y2
1 a 2017-01-01 6.044665 286.9410 0.0000000 1.0000000
2 a 2017-01-02 8.070354 142.4285 0.0000000 1.0000000
3 a 2017-01-03 11.884214 230.3348 0.3333333 0.3333333
4 a 2017-01-04 18.255948 125.1110 0.3333333 1.0000000
name date x y
1 a 2017-01-01 0.0000000 0.00
2 a 2017-01-02 1.0000000 0.00
3 a 2017-01-03 1.0000000 0.50
4 a 2017-01-04 1.0000000 0.00
identical(Brian.output$x2, OP.output$x)
[1] FALSE
--END #Brian's answer--
--#Odysseus's answer--
#Odysseus's answer returns the right result since it uses your same function, but you have to split the data frame manually. See his code below, which calls my_function:
df1 %<-% my_function(df[1:7, ], df1)
df2 %<-% my_function(df = df[(8 - moving_window_days):14, ], df2) # But from here on out, go back 4 days to include that data in the moving average calculation.
df3 %<-% my_function(df = df[(15 - moving_window_days):20, ], df3)
--END #Odysseus's answer--
You're likely to get the best performance gain from #Odysseus's answer, but you'll need to benchmark it yourself, since it will depend on the number of cores you have. Parallelization is not always faster than a vectorized operation. You'll also need to extend his solution to the rest of your data frame.
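For completeness, a date-aware rolling window (one that looks back a number of days rather than a number of rows) can also be written with the slider package. This is my own sketch, not part of the original answers; it reuses moving_window_days from the question and assumes the extra dependency is acceptable:
library(dplyr)
library(slider)

scaled <- df %>%
  group_by(name) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    x = slide_index_dbl(x, date, ~ dplyr::last(percent_rank(.x)),
                        .before = moving_window_days),  # window = current date plus the previous 4 days
    y = slide_index_dbl(y, date, ~ dplyr::last(percent_rank(.x)),
                        .before = moving_window_days)
  ) %>%
  ungroup()
Because the window is defined by dates, names with missing days are handled correctly, which a fixed-width rollapply() window is not guaranteed to do.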
