How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame? - r

Table showing correct format of dataI have a data frame with four columns, and I need to find a way to sum the values in the third column. Only if the numbers in the first two columns are different. The only way I can think of is to maybe do an If loop? Is that something can be done or is there a better way?
Genotype summary`
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example, I need to sum the third column Freq for rows 7 and 13 because they are different.

Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
mutate(filter_column = if_else(Dnov1a != Dnov1b, TRUE, FALSE)) %>%
filter(filter_column == TRUE) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18

data$new = data$Dnov1a!=data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 TRUE
2 220 224 4 0.0135 FALSE
3 224 224 8 0.0269 TRUE
4 220 228 14 0.0471 FALSE
sum(data$Freq[data$new])
28
Is this what you are looking for?

Related

r count instances where variables x and y are equal and place in table

I have the following code
length(which(tor$TorL==1& tor$SID==351))
length(which(tor$TorL==1 & tor$SID==352))
## The result of this is as follows
[1] 3843
[1] 223
The lines of code give me the count of TorL when SID==xxx.
TorL is a binary variable of a low value
SID goes from 351 to 358, I am only showing 351.
My second code query is
length(which(tor$TorH==1 & tor$SID==351))
length(which(tor$TorH==1 & tor$SID==352))
## Results from above
[1] 155
[1] 96
TorH is a binary variable of a high value
I would like to able to do this count as above and place the results in a table, something like as follows, as I would like to do a correlation on the results.
SID TorL TorH
351 3843 223
352 155 96
Thanks
With tidyverse:
df <- data.frame(SID = sample(c(351, 352, 353), 30, replace = T),
TorL = sample(c(0,1), 30, replace = T),
TorR = sample(c(0,1), 30 , replace = T))
df %>% group_by(SID) %>% summarise_at(vars(TorL, TorR), sum) %>% ungroup()
# A tibble: 3 × 3
SID TorL TorR
<dbl> <dbl> <dbl>
1 351 6 8
2 352 3 6
3 353 6 6
I got it working, playing around a little with asafpr answer
{r}
torlh <- df %>% group_by(SID)%>%
summarise(ltor = sum(TorL), htor = sum(TorH))
torlh

How to include number of rows aggregated using aggregate() in R [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 1 year ago.
I have dataset with a parentID variable and a childIQ variable which represents the IQ of the children of that specific parent:
df <- data.frame("parentID"=c(101,101,101,204,204,465,465),
"childIQ"=c(98,90,81,96,87,71,65))
parentID, childIQ
101, 98
101, 90
101, 81
204, 96
204, 87
465, 71
465, 65
I ran an aggregate() function so there is only 1 row per parent, and the childIQ value becomes the mean IQ of that parent's children:
df_agg <- aggregate(childIQ ~ parentID , data = df, mean)
parentID, avg_childIQ
101, 89.67
204, 91.5
465, 68
However, I want to add another column that represents the number of children for that parent, like this:
parentID, avg_childIQ, num_children
101, 90.67, 3
204, 91.5, 2
465, 68, 2
I'm not sure how to do this using data.table once I have already created df_agg?
It is possible to supply several functions to aggregate by using a function(x) c(...) code.
df_agg <- aggregate(childIQ ~ parentID , data = df,
function(x) c(mean = mean(x),
n = length(x)))
#> parentID childIQ.mean childIQ.n
#> 1 101 89.66667 3.00000
#> 2 204 91.50000 2.00000
#> 3 465 68.00000 2.00000
Using dplyr:
library(dplyr)
df %>% group_by(parentID) %>% summarise(avg_childID = mean(childIQ), num_children = n())
# A tibble: 3 x 3
parentID avg_childID num_children
<dbl> <dbl> <int>
1 101 89.7 3
2 204 91.5 2
3 465 68 2
Using data.table:
library(data.table)
setDT(df)[,list(avg_childID = mean(childIQ), num_children = .N), by=parentID]
parentID avg_childID num_children
1: 101 89.66667 3
2: 204 91.50000 2
3: 465 68.00000 2

Can I identify the same values within a range between 2 columns?

I am trying to compare values between two different columns but I need it to accept values within a range of ±3.
I created this 2 tibbles:
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
And I want the program to link the ones that are the same within a ±3 range.
So for example, I want it to identify that 84 and 84.5 are the same, also 149 and 149.5; 489 and 489; 680.5 and 680.5. But I want it to also tell me that 534, 528.5 and 542 do not have a match.
Is there any way to do this?
This could be achieved via the fuzzyjoin package like so:
library(dplyr)
library(fuzzyjoin)
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
match_fun1 <- function(x, y) {
# (x >= y - 3) & (x <= y + 3)
# or following the suggestion by #DarrenTsai
abs(x - y) <= 3
}
fuzzy_full_join(example_tp1, example_tp2,
by = c("Object_centre"),
match_fun = match_fun1)
#> # A tibble: 7 x 2
#> Object_centre.x Object_centre.y
#> <dbl> <dbl>
#> 1 84 84.5
#> 2 149 150.
#> 3 489 489
#> 4 680. 680.
#> 5 534 NA
#> 6 NA 528.
#> 7 NA 542
Created on 2020-08-22 by the reprex package (v0.3.0)
You could look at all combinations of values and see which ones match.
# Data Frame of all combinations
example <- expand.grid(c(84, 149, 489, 534, 680.5), c(84.5, 149.5, 489, 528.5, 542, 680.5))
# Assigns a Match if the values are within a range of 3
example %>%
mutate(match = ifelse(abs(Var1-Var2) <= 3, "Match", "No Match"))
Var1 Var2 match
1 84.0 84.5 Match
2 149.0 84.5 No Match
3 489.0 84.5 No Match
4 534.0 84.5 No Match
5 680.5 84.5 No Match
6 84.0 149.5 No Match
7 149.0 149.5 Match
8 489.0 149.5 No Match
9 ..... ..... ........
10 ..... ..... ........
and so on
You could then filter out only the matches or see which values have no match.
Similar to #Jumble's answer using tidyverse functions :
tidyr::crossing(example_tp1, example_tp2, .name_repair = ~c('col1', 'col2')) %>%
dplyr::filter(abs(col1 - col2) <= 3)
# col1 col2
# <dbl> <dbl>
#1 84 84.5
#2 149 150.
#3 489 489
#4 680. 680.
crossing generates all combinations of example_tp1 and example_tp2 and we keep only those rows where the difference is less than equal to 3.

dplyr mutate column with nearest value in external list

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.
My data frame looks like this:
index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))
index blockID
1 1 100
2 2 120
3 3 132
4 4 133
5 5 201
6 6 207
7 7 210
8 8 238
9 9 240
10 10 256
I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, It should return the nearest value in blocklist:
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
so the additional column should contain
100 (match),
120 (match),
130 (no match for 132--nearest value is 130),
130 (no match for 133--nearest value is 130),
201,
205 (no match for 207--nearest value is 205),
210,
238,
240,
256
Here's what I've tried:
df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))
I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?
You don't need if..else. Your rule can simplified by saying that we always get the blocklist element with least absolute difference when compared to blockID. If values match then absolute difference is 0 (which will always be the least).
With that here's a simple base R solution -
df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])
index blockID blockmatch
1 1 100 100
2 2 120 120
3 3 132 130
4 4 133 130
5 5 201 201
6 6 207 205
7 7 210 210
8 8 238 238
9 9 240 240
10 10 256 256
Here are a couple of ways with dplyr -
df %>%
rowwise() %>%
mutate(
blockmatch = blocklist[order(abs(blockID - blocklist))][1]
)
df %>%
mutate(
blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
)
Thanks to #Onyambu, here's a faster way -
df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)))]

how to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I would want to split this data based on Test_No and then compute the number of unique Category per Test_No and also the Median Category value. I chose to use split and Sappply in the following way. But, I am getting an error regarding a missing parenthesis. Is there anything wrong in my approach ? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R we can also try with aggregate
aggregate(Category~Test_No, CatRange, function(x) c(Cat_Count = length(unique(x)),
Median_Cat = median(x,na.rm = TRUE), Category = toString(x)))
As far as the function written is concerned I think there are some synatx issues in it.
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(CatRange, CatRange$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28

Resources