r count instances where variables x and y are equal and place in table - r

I have the following code
length(which(tor$TorL==1& tor$SID==351))
length(which(tor$TorL==1 & tor$SID==352))
## The result of this is as follows
[1] 3843
[1] 223
The lines of code give me the count of TorL when SID==xxx.
TorL is a binary variable of a low value
SID goes from 351 to 358; I am only showing 351 and 352.
My second code query is
length(which(tor$TorH==1 & tor$SID==351))
length(which(tor$TorH==1 & tor$SID==352))
## Results from above
[1] 155
[1] 96
TorH is a binary variable of a high value
I would like to be able to do this count as above and place the results in a table, something like the following, as I would like to run a correlation on the results.
SID TorL TorH
351 3843 155
352 223 96
Thanks

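With tidyverse:
library(dplyr)
# Simulated data; TorR plays the role of the second binary column (TorH in the question)
df <- data.frame(SID = sample(c(351, 352, 353), 30, replace = TRUE),
TorL = sample(c(0, 1), 30, replace = TRUE),
TorR = sample(c(0, 1), 30, replace = TRUE))
# Summing a 0/1 column per SID counts the rows where it equals 1
df %>% group_by(SID) %>% summarise_at(vars(TorL, TorR), sum) %>% ungroup()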
# A tibble: 3 × 3
SID TorL TorR
<dbl> <dbl> <dbl>
1 351 6 8
2 352 3 6
3 353 6 6

I got it working after playing around a little with asafpr's answer:
torlh <- df %>% group_by(SID)%>%
summarise(ltor = sum(TorL), htor = sum(TorH))
torlh
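For comparison, the same per-SID count can be sketched in base R with aggregate(). This is only an illustration on made-up data: the six-row `tor` data frame below is a hypothetical stand-in for the asker's real data, and `counts` is a name chosen for this example.

```r
# Hypothetical toy data standing in for the asker's `tor` data frame
tor <- data.frame(
  SID  = c(351, 351, 351, 352, 352, 352),
  TorL = c(1, 1, 0, 1, 0, 0),
  TorH = c(0, 1, 1, 0, 0, 1)
)

# Summing a 0/1 column counts the rows where it equals 1,
# so one aggregate() call yields both counts per SID
counts <- aggregate(cbind(TorL, TorH) ~ SID, data = tor, FUN = sum)
counts
```

The result is an ordinary data frame with columns SID, TorL and TorH, so the correlation the question mentions could then be computed with cor(counts$TorL, counts$TorH).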

Related

How to include number of rows aggregated using aggregate() in R [duplicate]

I have a dataset with a parentID variable and a childIQ variable, which represents the IQ of the children of that specific parent:
df <- data.frame("parentID"=c(101,101,101,204,204,465,465),
"childIQ"=c(98,90,81,96,87,71,65))
parentID, childIQ
101, 98
101, 90
101, 81
204, 96
204, 87
465, 71
465, 65
I ran an aggregate() function so there is only 1 row per parent, and the childIQ value becomes the mean IQ of that parent's children:
df_agg <- aggregate(childIQ ~ parentID , data = df, mean)
parentID, avg_childIQ
101, 89.67
204, 91.5
465, 68
However, I want to add another column that represents the number of children for that parent, like this:
parentID, avg_childIQ, num_children
101, 89.67, 3
204, 91.5, 2
465, 68, 2
I'm not sure how to do this using data.table once I have already created df_agg?
It is possible to supply several functions to aggregate by wrapping them in a single function(x) c(...) call:
df_agg <- aggregate(childIQ ~ parentID , data = df,
function(x) c(mean = mean(x),
n = length(x)))
#> parentID childIQ.mean childIQ.n
#> 1 101 89.66667 3.00000
#> 2 204 91.50000 2.00000
#> 3 465 68.00000 2.00000
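One detail worth knowing about the aggregate() call above: when the summary function returns a vector, the result holds a single matrix column named childIQ rather than two separate columns. A minimal sketch of flattening it follows; the name `flat` and the final column names are chosen here for illustration.

```r
df <- data.frame(parentID = c(101, 101, 101, 204, 204, 465, 465),
                 childIQ  = c(98, 90, 81, 96, 87, 71, 65))

df_agg <- aggregate(childIQ ~ parentID, data = df,
                    FUN = function(x) c(mean = mean(x), n = length(x)))

# df_agg has only two columns: parentID and a two-column matrix childIQ
ncol(df_agg)

# Rebuilding the data frame spreads the matrix into ordinary columns
flat <- do.call(data.frame, df_agg)
names(flat) <- c("parentID", "avg_childIQ", "num_children")
flat
```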
Using dplyr:
library(dplyr)
df %>% group_by(parentID) %>% summarise(avg_childIQ = mean(childIQ), num_children = n())
# A tibble: 3 x 3
parentID avg_childIQ num_children
<dbl> <dbl> <int>
1 101 89.7 3
2 204 91.5 2
3 465 68 2
Using data.table:
library(data.table)
setDT(df)[, list(avg_childIQ = mean(childIQ), num_children = .N), by = parentID]
parentID avg_childIQ num_children
1: 101 89.66667 3
2: 204 91.50000 2
3: 465 68.00000 2

Randomly divide df in list of df into equal subsets [duplicate]

Yesterday I asked a similar question: R - Randomly split a dataframe in n equal pieces
The answer I got is nearly what I need, but there are still problems with it. I have also thought about other ways to get the result.
This is my example df-list:
set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
I want to randomly split each df within the list into n equal parts (or as close to equal as possible). I already got a very helpful answer from chinsoon12:
new = lapply(df_list, function(df) {
n <- nrow(df)
split(df, cut(sample(n), seq(1, n, by=floor(n/4)), labels=FALSE, include.lowest=TRUE))})
The problem is that it does not work for every number of rows, and not all observations are taken into account. E.g. when I divide my df_list into 5 subsets with that method, I get subsets of 325, 324, 324, 324, 324 for AB_df, which in total is not 1624, so something is missing. When I divide it into 4 pieces, I only get 3 subsets. Any idea why this is happening?
I also thought about 2 different ways of splitting the dfs in the list. One way might be to just shuffle the rows into a random order:
for (a in 1:length(df_list)) {
df_list[[a]] = df_list[[a]][sample(nrow(df_list[[a]])),]}
Now I would only need to divide the dfs into n pieces, but this is the point where I am not sure how to do that.
A third way I thought of would be to create a random vector of the numbers 1:n for n subsamples, add it to the data frames, and then extract the dfs according to that number.
I still think the first way is the easiest and I would prefer it. Any idea what's wrong with the code?
The problem behind your uneven group sizes lies in cut: it always needs a hard interval border on one side, and I don't really know how to set that up in your case.
You could solve your problem with gl; just ignore the warnings.
If you randomize the generated levels before you apply them, you're there.
set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
#the number of groups you want to generate
subs <- 4
splittedList <- lapply(df_list,
function(df){
idx <- gl(n = subs,round(nrow(df)/subs))
split(df, sample(idx))# randomize the groups
})
#> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
#> data length is not a multiple of split variable
#> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
#> data length is not a multiple of split variable
## the groups are appr. equally sized:
lapply(splittedList,function(l){sapply(l,nrow)})
#> $AB_df
#> 1 2 3 4
#> 406 406 406 406
#>
#> $BC_df
#> 1 2 3 4
#> 414 414 414 414
#>
#> $DE_df
#> 1 2 3 4
#> 414 414 414 414
#>
#> $FG_df
#> 1 2 3 4
#> 432 432 433 432
#>
#> $AB_pc
#> 1 2 3 4
#> 406 406 406 406
#>
#> $BC_pc
#> 1 2 3 4
#> 414 414 414 414
#>
#> $DE_pc
#> 1 2 3 4
#> 414 414 414 414
#>
#> $FG_pc
#> 1 2 3 4
#> 432 432 433 432
## and the sizes are right:
sapply(df_list,nrow)
#> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc
#> 1624 1656 1656 1729 1624 1656 1656 1729
sapply(splittedList,function(l){sum(sapply(l,nrow))})
#> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc
#> 1624 1656 1656 1729 1624 1656 1656 1729
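To see where the rows go missing in the cut-based attempt from the question, the breaks can be inspected directly. The sketch below also shows one possible warning-free alternative using rep_len; `split_random` is a hypothetical helper name and this is a swapped-in technique, not the gl approach from the answer.

```r
n <- 1624
k <- 4

# seq() stops at the last value <= n, so k breaks define only k - 1 intervals
breaks <- seq(1, n, by = floor(n / k))   # 1 407 813 1219
grp <- cut(seq_len(n), breaks, labels = FALSE, include.lowest = TRUE)

# Row numbers above the last break fall in no interval, become NA,
# and split() silently drops them
sum(is.na(grp))

# Alternative sketch: recycle the labels 1..k to exactly nrow(df) entries,
# then shuffle them so the assignment is random
split_random <- function(df, k) {
  split(df, sample(rep_len(seq_len(k), nrow(df))))
}
```

Because rep_len always produces exactly nrow(df) labels, every row is assigned to a group and the group sizes differ by at most one.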

How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame?

I have a data frame with four columns, and I need to find a way to sum the values in the third column, but only for rows where the numbers in the first two columns are different. The only way I can think of is an if statement inside a loop. Is that something that can be done, or is there a better way?
Genotype summary:
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example. I need to sum the third column, Freq, for rows 7 and 13 because their first two values differ.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
mutate(filter_column = if_else(Dnov1a != Dnov1b, TRUE, FALSE)) %>%
filter(filter_column == TRUE) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
data$new = data$Dnov1a!=data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 FALSE
2 220 224 4 0.0135 TRUE
3 224 224 8 0.0269 FALSE
4 220 228 14 0.0471 TRUE
sum(data$Freq[data$new])
[1] 18
Is this what you are looking for?
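Since the question asks about an if loop, here is a minimal base R sketch that puts an explicit loop next to the vectorised comparison on the four sample rows. The names `geno`, `total_loop` and `total_vec` are made up for this example.

```r
geno <- data.frame(
  Dnov1a = c(220, 220, 224, 220),
  Dnov1b = c(220, 224, 224, 228),
  Freq   = c(1, 4, 8, 14)
)

# Explicit loop: add Freq only when the two allele columns differ
total_loop <- 0
for (i in seq_len(nrow(geno))) {
  if (geno$Dnov1a[i] != geno$Dnov1b[i]) {
    total_loop <- total_loop + geno$Freq[i]
  }
}

# Vectorised equivalent: the comparison yields a logical vector
# that is used directly as a row index
total_vec <- sum(geno$Freq[geno$Dnov1a != geno$Dnov1b])
```

Both forms give the same total; the vectorised one is usually preferred in R because it avoids the row-by-row bookkeeping.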

If() statement in R

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add 1 to df$fz if sum(df$fz) < 450, but at the same time I have to add 1 only to the max values in df$fz, and only until sum(df$fz) is no longer lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind these subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134
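A base R sketch that reproduces the result the question expects (134 68 71 88 89) is shown below. It assumes that "max values" means the rows with the largest z, as the question's mention of top_n(3, z) suggests, and that each row receives at most one increment; `target`, `deficit` and `top` are names introduced for this example.

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

target  <- 450
# How many increments of 1 are still needed (0 if the target is already met)
deficit <- max(target - sum(df$fz), 0)

# Row positions of the `deficit` largest z values
# (assumption: capped at one increment per row)
top <- order(df$z, decreasing = TRUE)[seq_len(min(deficit, nrow(df)))]

df$new_value <- df$fz
df$new_value[top] <- df$new_value[top] + 1
df
```

If the deficit were larger than the number of rows, this sketch would stop short of the target, so a real implementation would need a rule for distributing the remainder.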

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage total over a look-back range in R.
I know how to do this in Excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from today looking back 3 lines. It then divides this sum by the total sum of columns B + C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If I could get [1] to sum the previous 3 line range and
if I could get [2] to also sum the previous 3 line range.
I am still learning how to apply forward and look-back periods in R, so any additional explanation in the answer would be appreciated!
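Separately from the look-back range, the apply line above has an operator-precedence issue: `/` binds tighter than `+`, so `x[1] / x[1] + x[2]` evaluates to `1 + x[2]`. A small sketch using the first row of the sample data (the variable names are illustrative only):

```r
x <- c(adv = 69, dec = 376)

# As written in the question: (x[1] / x[1]) + x[2]
as_written <- unname(x[1] / x[1] + x[2])

# With parentheses around the denominator: x[1] / (x[1] + x[2])
intended <- unname(x[1] / (x[1] + x[2]))
```

Here `as_written` is simply 1 plus the dec value, while `intended` is the share of adv in the row total.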
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only base R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
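Another base R option, sketched here as an alternative to the approaches above, computes the trailing sums with stats::filter using sides = 1 so that each window looks backwards only. `roll_sum` is a hypothetical helper name introduced for this example.

```r
DF <- data.frame(
  adv = c(69, 113, 270, 74, 308, 236, 252, 287, 219),
  dec = c(376, 293, 150, 371, 96, 173, 134, 129, 187)
)

# Trailing sum over the current row and the k - 1 rows before it;
# the first k - 1 entries are NA because the window is incomplete
roll_sum <- function(x, k) as.numeric(stats::filter(x, rep(1, k), sides = 1))

adv3 <- roll_sum(DF$adv, 3)
dec3 <- roll_sum(DF$dec, 3)
DF$frac <- adv3 / (adv3 + dec3)
DF
```

Calling stats::filter with the namespace prefix avoids a clash with dplyr::filter if dplyr happens to be loaded.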

Resources