Can I identify the same values within a range between 2 columns? - r

I am trying to compare values between two different columns but I need it to accept values within a range of ±3.
I created these two tibbles:
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
And I want the program to link the ones that are the same within a ±3 range.
So for example, I want it to identify that 84 and 84.5 are the same, also 149 and 149.5; 489 and 489; 680.5 and 680.5. But I want it to also tell me that 534, 528.5 and 542 do not have a match.
Is there any way to do this?

This could be achieved via the fuzzyjoin package like so:
library(dplyr)
library(fuzzyjoin)
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
match_fun1 <- function(x, y) {
# (x >= y - 3) & (x <= y + 3)
# or following the suggestion by @DarrenTsai
abs(x - y) <= 3
}
fuzzy_full_join(example_tp1, example_tp2,
by = c("Object_centre"),
match_fun = match_fun1)
#> # A tibble: 7 x 2
#> Object_centre.x Object_centre.y
#> <dbl> <dbl>
#> 1 84 84.5
#> 2 149 150.
#> 3 489 489
#> 4 680. 680.
#> 5 534 NA
#> 6 NA 528.
#> 7 NA 542
Created on 2020-08-22 by the reprex package (v0.3.0)

You could look at all combinations of values and see which ones match.
library(dplyr)
# Data frame of all combinations
example <- expand.grid(c(84, 149, 489, 534, 680.5), c(84.5, 149.5, 489, 528.5, 542, 680.5))
# Assigns a Match if the values are within a range of 3
example %>%
mutate(match = ifelse(abs(Var1-Var2) <= 3, "Match", "No Match"))
Var1 Var2 match
1 84.0 84.5 Match
2 149.0 84.5 No Match
3 489.0 84.5 No Match
4 534.0 84.5 No Match
5 680.5 84.5 No Match
6 84.0 149.5 No Match
7 149.0 149.5 Match
8 489.0 149.5 No Match
9 ..... ..... ........
10 ..... ..... ........
and so on
You could then filter out only the matches or see which values have no match.
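To see which values have no match without printing every pair, the same comparison can be done on a logical matrix. This is a minimal base R sketch, assuming the two columns are pulled out as plain vectors; the names tp1, tp2, within_range, matched and unmatched_* are only illustrative.

```r
tp1 <- c(84, 149, 489, 534, 680.5)
tp2 <- c(84.5, 149.5, 489, 528.5, 542, 680.5)

# Logical matrix: rows correspond to tp1 values, columns to tp2 values
within_range <- outer(tp1, tp2, function(x, y) abs(x - y) <= 3)

# Pairs that are within the +/-3 range of each other
idx <- which(within_range, arr.ind = TRUE)
matched <- data.frame(tp1 = tp1[idx[, "row"]], tp2 = tp2[idx[, "col"]])

# Values that have no partner in the other vector
unmatched_tp1 <- tp1[rowSums(within_range) == 0]  # 534
unmatched_tp2 <- tp2[colSums(within_range) == 0]  # 528.5 542
```

Like expand.grid, this compares every pair, so it is only practical while the product of the two lengths stays manageable.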

Similar to @Jumble's answer, using tidyverse functions:
tidyr::crossing(example_tp1, example_tp2, .name_repair = ~c('col1', 'col2')) %>%
dplyr::filter(abs(col1 - col2) <= 3)
# col1 col2
# <dbl> <dbl>
#1 84 84.5
#2 149 150.
#3 489 489
#4 680. 680.
crossing generates all combinations of example_tp1 and example_tp2, and we keep only those rows where the absolute difference is less than or equal to 3.

Related

How to extract a first 3 numbers within a variable?

My numeric variable looks like this:
u$a <- c(1234, 1432, 1456, 13467)
How do I create a new variable a1 which is the first three characters of the variable a such that it would look like this:
u$a1 <- c(123, 143, 145, 134)
Thank you.
Use integer division:
u$a1 <- u$a%/% 10^(nchar(u$a)-3)
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
You could first convert it to character, use substr to take the first through third characters, and convert the result back to numeric like this:
u$a1 <- as.numeric(substr(as.character(u$a), 1, 3))
u
#> a a1
#> 1 1234 123
#> 2 1432 143
#> 3 1456 145
#> 4 13467 134
Created on 2023-01-26 with reprex v2.0.2
Data used:
u <- data.frame(a = c(1234, 1432, 1456, 13467))
Using sub
u$a1 <- as.numeric(sub("^(...).*", "\\1", u$a))
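All of these approaches rely on the character form of the number (nchar, as.character and sub all coerce it), so they assume non-negative values with at least three digits that R does not convert to scientific notation. A hedged sketch of a variant that forces fixed notation first; first_three_digits is a hypothetical helper name, not an existing function:

```r
# Hypothetical helper: first three digits of each non-negative number
first_three_digits <- function(x) {
  # format() with scientific = FALSE avoids strings such as "1e+05";
  # trim = TRUE stops shorter numbers from being padded with spaces
  digits <- format(x, scientific = FALSE, trim = TRUE)
  as.numeric(substr(digits, 1, 3))
}

u <- data.frame(a = c(1234, 1432, 1456, 13467))
u$a1 <- first_three_digits(u$a)
```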

How to include number of rows aggregated using aggregate() in R [duplicate]

I have a dataset with a parentID variable and a childIQ variable, which represents the IQ of the children of that specific parent:
df <- data.frame("parentID"=c(101,101,101,204,204,465,465),
"childIQ"=c(98,90,81,96,87,71,65))
parentID, childIQ
101, 98
101, 90
101, 81
204, 96
204, 87
465, 71
465, 65
I ran an aggregate() function so there is only 1 row per parent, and the childIQ value becomes the mean IQ of that parent's children:
df_agg <- aggregate(childIQ ~ parentID , data = df, mean)
parentID, avg_childIQ
101, 89.67
204, 91.5
465, 68
However, I want to add another column that represents the number of children for that parent, like this:
parentID, avg_childIQ, num_children
101, 89.67, 3
204, 91.5, 2
465, 68, 2
I'm not sure how to do this using data.table once I have already created df_agg?
It is possible to apply several summary functions in one aggregate call by passing an anonymous function of the form function(x) c(...):
df_agg <- aggregate(childIQ ~ parentID , data = df,
function(x) c(mean = mean(x),
n = length(x)))
#> parentID childIQ.mean childIQ.n
#> 1 101 89.66667 3.00000
#> 2 204 91.50000 2.00000
#> 3 465 68.00000 2.00000
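One thing to keep in mind with this approach is that aggregate stores the combined results as a single matrix column named childIQ rather than as two separate columns. A small sketch of one common way to flatten it, assuming the same df as in the question (df_flat is an illustrative name):

```r
df <- data.frame(parentID = c(101, 101, 101, 204, 204, 465, 465),
                 childIQ  = c(98, 90, 81, 96, 87, 71, 65))

df_agg <- aggregate(childIQ ~ parentID, data = df,
                    FUN = function(x) c(mean = mean(x), n = length(x)))

# df_agg has two columns: parentID and a matrix column childIQ
# do.call(data.frame, ...) expands the matrix into ordinary columns
df_flat <- do.call(data.frame, df_agg)
names(df_flat)
```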
Using dplyr:
library(dplyr)
df %>% group_by(parentID) %>% summarise(avg_childIQ = mean(childIQ), num_children = n())
# A tibble: 3 x 3
parentID avg_childIQ num_children
<dbl> <dbl> <int>
1 101 89.7 3
2 204 91.5 2
3 465 68 2
Using data.table:
library(data.table)
setDT(df)[,list(avg_childIQ = mean(childIQ), num_children = .N), by=parentID]
parentID avg_childIQ num_children
1: 101 89.66667 3
2: 204 91.50000 2
3: 465 68.00000 2
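Since the question asks how to add the count once df_agg already exists, one option is to compute the counts separately and merge them in. A minimal base R sketch; the counts object and the renamed columns are illustrative choices, not part of the original code:

```r
df <- data.frame(parentID = c(101, 101, 101, 204, 204, 465, 465),
                 childIQ  = c(98, 90, 81, 96, 87, 71, 65))

# The aggregate from the question, with the mean column renamed
df_agg <- aggregate(childIQ ~ parentID, data = df, FUN = mean)
names(df_agg)[2] <- "avg_childIQ"

# Number of rows (children) per parent
counts <- aggregate(childIQ ~ parentID, data = df, FUN = length)
names(counts)[2] <- "num_children"

# Join the counts onto the existing summary by parentID
df_agg <- merge(df_agg, counts, by = "parentID")
```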

How can I divide several whole numbers separated by a comma in one column by numbers in another column

I want to divide numbers separated by commas in one column by the numbers in another column.
Here is the input I have:
> df = data.frame (SAMPLE1.DP=c("555","651","641","717"), SAMPLE1.AD=c("555", "68,583","2,639","358,359"), SAMPLE2.DP=c("1023","930","683","1179"), SAMPLE2.AD=c("1023","0,930","683","585,594"))
> df
SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD
1 555 555 1023 1023
2 651 68,583 930 0,930
3 641 2,639 683 683
4 717 358,359 1179 585,594
In the end I want to add two new columns (AD/DP) that divide the values of SAMPLE1.AD by SAMPLE1.DP and SAMPLE2.AD by SAMPLE2.DP, giving the proportion for the number on each side of the comma, like this:
> end = data.frame(SAMPLE1.DP=c("555","651","641","717"),
+ SAMPLE1.AD=c("555", "68,583","204,437","358,359"),
+ SAMPLE1.AD_DP=c("1.00","0.10,0.90","0.32,0.68","0.50,0.50"),
+ SAMPLE2.DP=c("1023","930","683","1179"),
+ SAMPLE2.AD=c("1023","0,930","683","585,594"),
+ SAMPLE2.AD_DP=c("1.00","0.00,1.00","1.00","0.49,0.51"))
> end
SAMPLE1.DP SAMPLE1.AD SAMPLE1.AD_DP SAMPLE2.DP SAMPLE2.AD SAMPLE2.AD_DP
1 555 555 1.00 1023 1023 1.00
2 651 68,583 0.10,0.90 930 0,930 0.00,1.00
3 641 204,437 0.32,0.68 683 683 1.00
4 717 358,359 0.50,0.50 1179 585,594 0.49,0.51
it means :
XX YY,ZZ YY/XX,ZZ/XX AA BB,CC BB/AA,CC/AA
If I convert the values in the table with as.numeric, it does not work because the values are separated by commas.
Do you have any idea how to do this?
Thanks in advance for your help.
First thing you need to do is replace the , with . and cast to numeric. Then split based on your required condition and divide, i.e.
df[] <- lapply(df, function(i)as.numeric(gsub(',', '.', i)))
do.call(cbind, lapply(split.default(df, gsub('\\D+', '', names(df))), function(i) i[2] / i[1]))
# SAMPLE1.AD SAMPLE2.AD
#1 1.000000000 1.000000
#2 0.004066052 0.001000
#3 0.004117005 1.000000
#4 0.499803347 0.496687
If there are commas in your numbers, then the column has most likely been read in as character. What you need to do is convert your columns to numeric and then divide each column respectively.
library(tidyverse)
dat <- tribble(~"SAMPLE1.DP", ~"SAMPLE1.AD", ~"SAMPLE2.DP", ~"SAMPLE2.AD",
555, 555, 1023, 1023,
651, "2,647", 930, ",93",
641, "2,639", 683, 683,
717, "358,359", 1179, "585,594")
dat %>%
mutate_at(c(2,4), list(~str_replace(., ",", "."))) %>%
mutate_all(as.numeric) %>%
mutate(addp1 = SAMPLE1.AD / SAMPLE1.DP,
addp2 = SAMPLE2.AD / SAMPLE2.DP)
#> # A tibble: 4 x 6
#> SAMPLE1.DP SAMPLE1.AD SAMPLE2.DP SAMPLE2.AD addp1 addp2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 555 555 1023 1023 1 1
#> 2 651 2.65 930 0.93 0.00407 0.001
#> 3 641 2.64 683 683 0.00412 1
#> 4 717 358. 1179 586. 0.500 0.497
Created on 2019-05-20 by the reprex package (v0.2.1)
Thanks everyone, but I was not very clear in my question, sorry.
In my input example, I have only whole numbers separated by commas, not decimals.
For example, on line 3 of my example:
2,647 means 2 AND 647, and I want to divide both numbers by 651 to get 2/651 and 647/651, which is about 0.003 and 0.994 (roughly 0.3% and 99.4%).
They are whole numbers (integers) separated by commas.
I hope that is clearer, thanks.
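Given that clarification, the comma is a separator between two counts rather than a decimal mark, so each AD string needs to be split before dividing. A minimal base R sketch under that reading; divide_counts is a hypothetical helper name, and rounding to two decimals is an assumption taken from the desired output in the question:

```r
df <- data.frame(SAMPLE1.DP = c("555", "651", "641", "717"),
                 SAMPLE1.AD = c("555", "68,583", "2,639", "358,359"),
                 SAMPLE2.DP = c("1023", "930", "683", "1179"),
                 SAMPLE2.AD = c("1023", "0,930", "683", "585,594"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split each AD string on commas, divide every
# piece by the matching DP value, and paste the ratios back together
divide_counts <- function(ad, dp) {
  parts <- strsplit(ad, ",", fixed = TRUE)
  mapply(function(counts, depth) {
    ratios <- as.numeric(counts) / as.numeric(depth)
    paste(sprintf("%.2f", ratios), collapse = ",")
  }, parts, dp, USE.NAMES = FALSE)
}

df$SAMPLE1.AD_DP <- divide_counts(df$SAMPLE1.AD, df$SAMPLE1.DP)
df$SAMPLE2.AD_DP <- divide_counts(df$SAMPLE2.AD, df$SAMPLE2.DP)
```

Keeping the ratios as a comma-separated string mirrors the requested layout; if the ratios are needed for further arithmetic, storing them in separate numeric columns would probably be easier to work with.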

How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame?

I have a data frame with four columns, and I need to find a way to sum the values in the third column, but only for rows where the numbers in the first two columns are different. The only way I can think of is to use an if statement inside a loop. Can that be done, or is there a better way?
Genotype summary
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example. I need to sum the third column, Freq, for rows 7 and 13 because their first two columns are different.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
mutate(filter_column = if_else(Dnov1a != Dnov1b, TRUE, FALSE)) %>%
filter(filter_column == TRUE) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
Alternatively, in base R, flag the rows where the two columns differ and sum Freq over them:
data$new = data$Dnov1a!=data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 FALSE
2 220 224 4 0.0135 TRUE
3 224 224 8 0.0269 FALSE
4 220 228 14 0.0471 TRUE
sum(data$Freq[data$new])
18
Is this what you are looking for?

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or, if you prefer base R, we can also try aggregate:
aggregate(Category~Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
Median_Cat = median(x,na.rm = TRUE), Category = toString(x)))
As for the function in the question, it has some syntax issues: it is never assigned to a name, and function(ModRange) inside sapply() has no body, so R cannot parse the call. A corrected version:
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
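If the goal from the appended part of the question is to keep one row per item and show Test_No, Category, Median_Cat and Cat_Count side by side, the group statistics can be attached to every row instead of collapsing the data. A hedged base R sketch using ave, assuming only the Test_No and Category columns from the question (the other columns are omitted here for brevity):

```r
Test <- data.frame(
  Test_No  = c(1, 1, 2, 3, 4, 5, 4, 5, 7, 9, 9, 9, 2),
  Category = c(3, 4, 5, 6, 237, 56, 354, 726, 800, 125, 1005, 28, 76)
)

# ave() applies FUN within each Test_No group and returns a vector
# of the same length as the input, so results line up with the rows
Test$Cat_Count  <- ave(Test$Category, Test$Test_No,
                       FUN = function(x) length(unique(x)))
Test$Median_Cat <- ave(Test$Category, Test$Test_No, FUN = median)

Test[order(Test$Test_No), ]
```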
