How to include number of rows aggregated using aggregate() in R

I have a dataset with a parentID variable and a childIQ variable, which represents the IQ of each child of that parent:
df <- data.frame("parentID"=c(101,101,101,204,204,465,465),
"childIQ"=c(98,90,81,96,87,71,65))
parentID, childIQ
101, 98
101, 90
101, 81
204, 96
204, 87
465, 71
465, 65
I ran an aggregate() function so there is only 1 row per parent, and the childIQ value becomes the mean IQ of that parent's children:
df_agg <- aggregate(childIQ ~ parentID , data = df, mean)
parentID, avg_childIQ
101, 89.67
204, 91.5
465, 68
However, I want to add another column that represents the number of children for that parent, like this:
parentID, avg_childIQ, num_children
101, 89.67, 3
204, 91.5, 2
465, 68, 2
I'm not sure how to do this using data.table once I have already created df_agg.

It is possible to apply several functions in one aggregate() call by passing an anonymous function, function(x) c(...), that returns a named vector:
df_agg <- aggregate(childIQ ~ parentID, data = df,
                    function(x) c(mean = mean(x),
                                  n = length(x)))
#> parentID childIQ.mean childIQ.n
#> 1 101 89.66667 3.00000
#> 2 204 91.50000 2.00000
#> 3 465 68.00000 2.00000
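One thing to watch with this approach: aggregate() stores the two summaries in a single matrix column called childIQ, so df_agg has only two columns even though three are printed. A minimal sketch of one common way to flatten it, assuming the same df as in the question (the name df_flat is only illustrative):

```r
df <- data.frame(parentID = c(101, 101, 101, 204, 204, 465, 465),
                 childIQ  = c(98, 90, 81, 96, 87, 71, 65))

# Two summaries per group, returned as a named vector
df_agg <- aggregate(childIQ ~ parentID, data = df,
                    FUN = function(x) c(mean = mean(x), n = length(x)))

# childIQ is a 3 x 2 matrix column here, so df_agg has 2 columns
ncol(df_agg)

# Flatten the matrix column into ordinary columns
# (childIQ.mean and childIQ.n)
df_flat <- do.call(data.frame, df_agg)
names(df_flat)
```

After flattening, df_flat$childIQ.n can be used like any other column, for example renamed to num_children.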

Using dplyr:
library(dplyr)
df %>% group_by(parentID) %>% summarise(avg_childIQ = mean(childIQ), num_children = n())
# A tibble: 3 x 3
parentID avg_childIQ num_children
<dbl> <dbl> <int>
1 101 89.7 3
2 204 91.5 2
3 465 68 2
Using data.table:
library(data.table)
setDT(df)[, list(avg_childIQ = mean(childIQ), num_children = .N), by = parentID]
parentID avg_childIQ num_children
1: 101 89.66667 3
2: 204 91.50000 2
3: 465 68.00000 2

Related

r count instances where variables x and y are equal and place in table

I have the following code
length(which(tor$TorL==1 & tor$SID==351))
length(which(tor$TorL==1 & tor$SID==352))
## The result of this is as follows
[1] 3843
[1] 223
These lines give me the count of rows where TorL is 1 for a given SID.
TorL is a binary variable flagging a low value.
SID goes from 351 to 358; I am only showing 351 and 352.
My second code query is
length(which(tor$TorH==1 & tor$SID==351))
length(which(tor$TorH==1 & tor$SID==352))
## Results from above
[1] 155
[1] 96
TorH is a binary variable flagging a high value.
I would like to be able to do these counts and place the results in a table, something like the following, as I would like to run a correlation on the results.
SID TorL TorH
351 3843 155
352 223 96
Thanks
With tidyverse:
library(dplyr)
# Random example data, so the counts below will differ between runs
df <- data.frame(SID = sample(c(351, 352, 353), 30, replace = TRUE),
                 TorL = sample(c(0, 1), 30, replace = TRUE),
                 TorH = sample(c(0, 1), 30, replace = TRUE))
df %>% group_by(SID) %>% summarise_at(vars(TorL, TorH), sum) %>% ungroup()
# A tibble: 3 × 3
SID TorL TorH
<dbl> <dbl> <dbl>
1 351 6 8
2 352 3 6
3 353 6 6
I got it working after playing around a little with asafpr's answer:
torlh <- df %>% group_by(SID) %>%
  summarise(ltor = sum(TorL), htor = sum(TorH))
torlh
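For comparison, the same per-SID counts can be sketched in base R with aggregate(). Because TorL and TorH are 0/1 flags, summing them gives the same number as length(which(... == 1)). The small tor data frame below is made up for illustration and is not the asker's data:

```r
# Hypothetical stand-in for the asker's `tor` data frame
tor <- data.frame(SID  = c(351, 351, 351, 352, 352, 352, 353, 353),
                  TorL = c(1, 1, 0, 1, 0, 0, 1, 1),
                  TorH = c(0, 1, 1, 0, 0, 1, 0, 0))

# One row per SID, with the number of 1s in each flag column
counts <- aggregate(cbind(TorL, TorH) ~ SID, data = tor, FUN = sum)
counts

# The two count columns can then be correlated across SIDs
cor(counts$TorL, counts$TorH)
```

With only a handful of SID values the correlation is based on very few points, so it should be read with care.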

Can I identify the same values within a range between 2 columns?

I am trying to compare values between two different columns but I need it to accept values within a range of ±3.
I created these two tibbles:
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
And I want the program to link the ones that are the same within a ±3 range.
So for example, I want it to identify that 84 and 84.5 are the same, also 149 and 149.5; 489 and 489; 680.5 and 680.5. But I want it to also tell me that 534, 528.5 and 542 do not have a match.
Is there any way to do this?
This could be achieved via the fuzzyjoin package like so:
library(dplyr)
library(fuzzyjoin)
example_tp1 <- tibble(Object_centre = c(84, 149, 489, 534, 680.5))
example_tp2 <- tibble(Object_centre = c(84.5, 149.5, 489, 528.5, 542, 680.5))
match_fun1 <- function(x, y) {
  # equivalent to (x >= y - 3) & (x <= y + 3);
  # simplified following the suggestion by @DarrenTsai
  abs(x - y) <= 3
}
fuzzy_full_join(example_tp1, example_tp2,
                by = c("Object_centre"),
                match_fun = match_fun1)
#> # A tibble: 7 x 2
#> Object_centre.x Object_centre.y
#> <dbl> <dbl>
#> 1 84 84.5
#> 2 149 150.
#> 3 489 489
#> 4 680. 680.
#> 5 534 NA
#> 6 NA 528.
#> 7 NA 542
Created on 2020-08-22 by the reprex package (v0.3.0)
You could look at all combinations of values and see which ones match.
library(dplyr)
# Data frame of all combinations
example <- expand.grid(c(84, 149, 489, 534, 680.5), c(84.5, 149.5, 489, 528.5, 542, 680.5))
# Flag a match if the two values are within 3 of each other
example %>%
  mutate(match = ifelse(abs(Var1 - Var2) <= 3, "Match", "No Match"))
Var1 Var2 match
1 84.0 84.5 Match
2 149.0 84.5 No Match
3 489.0 84.5 No Match
4 534.0 84.5 No Match
5 680.5 84.5 No Match
6 84.0 149.5 No Match
7 149.0 149.5 Match
8 489.0 149.5 No Match
9 ..... ..... ........
10 ..... ..... ........
and so on
You could then filter out only the matches or see which values have no match.
Similar to @Jumble's answer, using tidyverse functions:
tidyr::crossing(example_tp1, example_tp2, .name_repair = ~c('col1', 'col2')) %>%
dplyr::filter(abs(col1 - col2) <= 3)
# col1 col2
# <dbl> <dbl>
#1 84 84.5
#2 149 150.
#3 489 489
#4 680. 680.
crossing generates all combinations of example_tp1 and example_tp2, and we keep only those rows where the absolute difference is less than or equal to 3.
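The all-combinations idea can also be sketched in base R with outer(), which additionally makes it easy to list the values that have no partner. This is only an illustration; the names tp1, tp2, within3 and pairs are made up here:

```r
tp1 <- c(84, 149, 489, 534, 680.5)
tp2 <- c(84.5, 149.5, 489, 528.5, 542, 680.5)

# Logical matrix: rows index tp1, columns index tp2,
# TRUE where the two values are within 3 of each other
within3 <- abs(outer(tp1, tp2, "-")) <= 3

# Matched pairs
hits <- which(within3, arr.ind = TRUE)
pairs <- data.frame(tp1 = tp1[hits[, "row"]], tp2 = tp2[hits[, "col"]])
pairs

# Values without any match in the other vector
unmatched_tp1 <- tp1[rowSums(within3) == 0]
unmatched_tp2 <- tp2[colSums(within3) == 0]
```

The matrix has length(tp1) * length(tp2) entries, so this suits small vectors like the ones in the question.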

In a series of series, how to subtract every 1st number in each sub-series event from every nth number in those events?

I have multiple series of timepoints. Some series have five timepoints, others have ten or fifteen timepoints. The series are in multiples of five because the event I am measuring is always five timepoints long; some recordings have multiple events in succession. For instance:
Series 1:
0
77
98
125
174
Series 2:
0
69
95
117
179
201
222
246
277
293
0 marks the beginning of each series. Series 1 is a single event, but Series 2 is two events in succession. The 6th timepoint in Series 2 is the start of the second event in that series.
I have an R dataframe that contains every timepoint in one column:
dd <- data.frame(
timepoint=c(0, 77, 98, 125, 174,
0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)
I need to know the duration from the start of each event to the 4th timepoint in each event. For the above data, that means:
Duration 1: 125 - 0 = 125
Duration 2: 117 - 0 = 117
Duration 3: 277 - 201 = 76
How can I write a simple piece of R code that will tell me the duration of that interval regardless of how many series or events there are, i.e. regardless of how many numbers are in the column?
I tried using diff() and seq_along(), but that seems only useful for every nth number, which doesn't work in this case.
diff(vec[seq_along(vec) %% 4 == 1])
This is maybe one way to do it with dplyr. We break the data up into "runs", which reset at each 0, and then into "sequences", which reset every 5 values.
library(dplyr)
dd %>%
  group_by(run = cumsum(timepoint == 0)) %>%
  mutate(seq = (row_number() - 1) %/% 5 + 1) %>%
  group_by(run, seq) %>%
  summarize(diff = timepoint[4] - timepoint[1])
# run seq diff
# <int> <dbl> <dbl>
# 1 1 1 125
# 2 2 1 117
# 3 2 2 76
It makes it somewhat easy to tie the value back to where it came from.
If you just wanted to use indexing, here's a helper function:
diff4v1 <- function(x) {
  # position of each value within its block of five (1 to 5)
  idx <- (seq_along(x) - 1) %% 5 + 1
  x[idx == 4] - x[idx == 1]
}
diff4v1(dd$timepoint)
# [1] 125 117 76
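Since every event is exactly five timepoints long, another way to picture the indexing is to reshape the column into a matrix with one event per column. This is a sketch that assumes the total number of timepoints is a multiple of five; the names events and durations are illustrative:

```r
dd <- data.frame(
  timepoint = c(0, 77, 98, 125, 174,
                0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)

# Fail early if the column cannot be split into whole events
stopifnot(nrow(dd) %% 5 == 0)

# matrix() fills column by column, so each column is one event
events <- matrix(dd$timepoint, nrow = 5)

# 4th timepoint minus 1st timepoint of every event
durations <- events[4, ] - events[1, ]
durations
```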
This is your data frame (hypothetical)
df = data.frame(series = round(rnorm(40, 100, 50)))
head(df)
series
1 16
2 35
3 75
4 125
5 190
6 85
And these are your differences
idx = c(1:nrow(df))
df[which(idx %% 5 == 4), "series"] - df[which(idx %% 5 == 1), "series"]
[1] 109 -38 -101 -47 34 -52 -63 -5

How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame?

I have a data frame with four columns, and I need to sum the values in the third column, but only for rows where the numbers in the first two columns are different. The only way I can think of is a loop with an if statement. Can that be done, or is there a better way?
Genotype summary
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example. I need to sum the third column, Freq, for rows 7 and 13 because their first two values differ.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
  filter(Dnov1a != Dnov1b) %>%
  summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
In base R:
data$new = data$Dnov1a != data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 FALSE
2 220 224 4 0.0135 TRUE
3 224 224 8 0.0269 FALSE
4 220 228 14 0.0471 TRUE
sum(data$Freq[data$new])
18
Is this what you are looking for?
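To connect this back to the loop the question had in mind, here is a small sketch that computes the total both ways on the four example rows. The data frame name geno and the variables total_loop and total_vec are made up for illustration:

```r
geno <- data.frame(Dnov1a = c(220, 220, 224, 220),
                   Dnov1b = c(220, 224, 224, 228),
                   Freq   = c(1, 4, 8, 14))

# Explicit loop with an if statement
total_loop <- 0
for (i in seq_len(nrow(geno))) {
  if (geno$Dnov1a[i] != geno$Dnov1b[i]) {
    total_loop <- total_loop + geno$Freq[i]
  }
}

# Vectorised equivalent: index Freq by the logical comparison
total_vec <- with(geno, sum(Freq[Dnov1a != Dnov1b]))

total_loop
total_vec
```

Both give the same total; the vectorised form is usually the more idiomatic choice in R.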

dplyr mutate column with nearest value in external list

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.
My data frame looks like this:
index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))
index blockID
1 1 100
2 2 120
3 3 132
4 4 133
5 5 201
6 6 207
7 7 210
8 8 238
9 9 240
10 10 256
I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, it should return the nearest value in blocklist:
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
so the additional column should contain
100 (match),
120 (match),
130 (no match for 132--nearest value is 130),
130 (no match for 133--nearest value is 130),
201,
205 (no match for 207--nearest value is 205),
210,
238,
240,
256
Here's what I've tried:
df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))
I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?
You don't need if..else. Your rule can be simplified by saying that we always take the blocklist element with the least absolute difference from blockID. If the values match, the absolute difference is 0 (which will always be the least).
With that, here's a simple base R solution:
df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])
index blockID blockmatch
1 1 100 100
2 2 120 120
3 3 132 130
4 4 133 130
5 5 201 201
6 6 207 205
7 7 210 210
8 8 238 238
9 9 240 240
10 10 256 256
Here are a couple of ways with dplyr:
df %>%
  rowwise() %>%
  mutate(
    blockmatch = blocklist[order(abs(blockID - blocklist))][1]
  )
df %>%
  mutate(
    blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
  )
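Thanks to @Onyambu, here's a faster way (ties.method = "first" is set because max.col() otherwise breaks ties at random):
df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)), ties.method = "first")]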
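The "least absolute difference" rule can also be written with which.min(), which returns the position of the first minimum. A minimal sketch, where the helper name nearest is hypothetical and not taken from any package:

```r
blockID   <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)

# Return the candidate closest to x; an exact match has distance 0
# and therefore always wins. On a tie, the earlier candidate is kept.
nearest <- function(x, candidates) {
  candidates[which.min(abs(x - candidates))]
}

blockmatch <- vapply(blockID, nearest, numeric(1), candidates = blocklist)
blockmatch
```

vapply() is used instead of sapply() so the result is guaranteed to be a numeric vector with one value per blockID.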
