Finding proportions based on data.frame subsets - r

I have a set of counts from data with three dimensions:
df <- data.frame(type = c("A", "B", "B", "A", "A", "C", "B", "C"), group = c("Tp", "Tp", "Tp", "Tp", "Fc", "Fc", "Fc", "Fc"), size = c(10,20,30,40,10,20,30,40), count = c(1, 4, 2, 3, 2, 10, 2, 3))
type group size count
1 A Tp 10 1
2 B Tp 20 4
3 B Tp 30 2
4 A Tp 40 3
5 A Fc 10 2
6 C Fc 20 10
7 B Fc 30 2
8 C Fc 40 3
I would like to find the proportion that each count takes up, computed within subsets defined by both the type and group dimensions. For example, what fraction of the counts in group "Tp" and of type "A" belongs to size 10?
I thought there might be a function like aggregate, or something in the plyr package, that computes a value per row based on subsets, but I can't seem to find one. My best effort uses apply:
df$prop <- apply(df, 1, function(x) as.numeric(x["count"])/sum(df[df$type==x["type"] & df$group==x["group"], "count"]))
type group size count prop
1 A Tp 10 1 0.2500000
2 B Tp 20 4 0.6666667
3 B Tp 30 2 0.3333333
4 A Tp 40 3 0.7500000
5 A Fc 10 2 1.0000000
6 C Fc 20 10 0.7692308
7 B Fc 30 2 1.0000000
8 C Fc 40 3 0.2307692
I just wondered if there is an easier way of doing this? If not I will write this up as a custom function.
Thanks.

Try:
transform(df, prop=count/ave(count, type, group, FUN=sum))
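To see why this works, here is a small sketch (the `group_total` column and the `check` object are names I made up for illustration). `ave()` returns a vector as long as `count` that holds each row's group total, so dividing by it gives the per-row proportion:

```r
df <- data.frame(
  type  = c("A", "B", "B", "A", "A", "C", "B", "C"),
  group = c("Tp", "Tp", "Tp", "Tp", "Fc", "Fc", "Fc", "Fc"),
  size  = c(10, 20, 30, 40, 10, 20, 30, 40),
  count = c(1, 4, 2, 3, 2, 10, 2, 3)
)

# ave() repeats each type/group total on every row of that group
df$group_total <- ave(df$count, df$type, df$group, FUN = sum)
df$prop <- df$count / df$group_total

# Sanity check: proportions within each type/group combination sum to 1
check <- aggregate(prop ~ type + group, data = df, FUN = sum)
stopifnot(all(abs(check$prop - 1) < 1e-12))
```

The intermediate column is only there to make the step visible; the one-liner above does the same thing without it.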

With plyr:
library(plyr)
ddply(df, c("type","group"), mutate, prop = count/sum(count))
type group size count prop
1 A Fc 10 2 1.0000000
2 A Tp 10 1 0.2500000
3 A Tp 40 3 0.7500000
4 B Fc 30 2 1.0000000
5 B Tp 20 4 0.6666667
6 B Tp 30 2 0.3333333
7 C Fc 20 10 0.7692308
8 C Fc 40 3 0.2307692

The data.table way, which scales much better and is, in my opinion, more intuitive:
library(data.table)
dt = data.table(df)
dt[, prop := count/sum(count), by = list(type, group)]
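One thing that may surprise people new to data.table: `:=` adds the column in place and prints nothing. A minimal sketch, assuming the data.table package is installed and using the same `df` as in the question:

```r
library(data.table)

df <- data.frame(
  type  = c("A", "B", "B", "A", "A", "C", "B", "C"),
  group = c("Tp", "Tp", "Tp", "Tp", "Fc", "Fc", "Fc", "Fc"),
  size  = c(10, 20, 30, 40, 10, 20, 30, 40),
  count = c(1, 4, 2, 3, 2, 10, 2, 3)
)
dt <- as.data.table(df)  # a data.table copy of df

# := modifies dt by reference; this call prints nothing
dt[, prop := count / sum(count), by = .(type, group)]

# Append [] (or call print(dt)) to display the updated table
dt[]

# df itself is unchanged because as.data.table() made a copy
stopifnot(!"prop" %in% names(df))
```

Unlike ddply, this keeps the rows in their original order.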

Related

divide multiple column by a value based on each condition

I have a dataset that has 3 different conditions. Data within condition 1 needs to be divided by 15, and data within conditions 2 and 3 needs to be divided by 10. I tried using a for() loop to create separate datasets for each condition and then merge the two groups (group 1 is composed of condition 1, group 2 is composed of conditions 2 and 3). This is what I have so far for condition 1. Is there an easier way to do this that does not require creating subgroups?
Group1 <- NULL
for (val in ParticipantID) {
ParticipantID_subset_Group1 <- subset(PronounData, ParticipantID == val & Condition == "1")
I_Words_PPM <- (ParticipantID_subset_Group1$I_Words / 15)
YOU_Words_PPM <- (ParticipantID_subset_Group1$YOU_Words / 15)
WE_Words_PPM <- (ParticipantID_subset_Group1$WE_Words / 15)
df <- data.frame(val, Group, I_Words_PPM, YOU_Words_PPM, WE_Words_PPM)
Group1 <- rbind(Group1, df)
}
dim(Group1)
colnames(Group1) <- c("ParticipantID", "Condition", "I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")
View(Group1)
Couldn't fully test this solution without example data, but this should do what you want:
# make some fake data
PronounData <- data.frame(
ParticipantID = 1:9,
Condition = rep(1:3, 3),
I_Words = sample(0:20, 9, replace = TRUE),
YOU_Words = sample(0:40, 9, replace = TRUE),
WE_Words = sample(0:10, 9, replace = TRUE)
)
# if Condition 1, divide by 15
PronounData[PronounData$Condition == 1, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition == 1, c("I_Words", "YOU_Words", "WE_Words")] / 15
# if Condition 2 or 3, divide by 10
PronounData[PronounData$Condition %in% 2:3, c("I_Words_PPM", "YOU_Words_PPM", "WE_Words_PPM")] <-
PronounData[PronounData$Condition %in% 2:3, c("I_Words", "YOU_Words", "WE_Words")] / 10
# result
PronounData
# ParticipantID Condition I_Words YOU_Words WE_Words I_Words_PPM YOU_Words_PPM WE_Words_PPM
# 1 1 1 17 40 6 1.1333 2.6667 0.4000
# 2 2 2 14 1 6 1.4000 0.1000 0.6000
# 3 3 3 2 34 8 0.2000 3.4000 0.8000
# 4 4 1 0 33 1 0.0000 2.2000 0.0667
# 5 5 2 4 15 0 0.4000 1.5000 0.0000
# 6 6 3 1 7 6 0.1000 0.7000 0.6000
# 7 7 1 6 10 1 0.4000 0.6667 0.0667
# 8 8 2 1 33 9 0.1000 3.3000 0.9000
# 9 9 3 9 40 0 0.9000 4.0000 0.0000
NB, R is built on vectorized operations, so looping through each row is rarely the best solution. Instead, you generally want to find a way of modifying whole vectors/columns at once, or at least subsets of them. This will usually be faster and simpler.
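As one illustration of that point, both cases can be handled in a single step by first building a per-row divisor. This is only a sketch with made-up data; `word_cols` and `divisor` are names I chose for the example:

```r
# small made-up data set with fixed values
PronounData <- data.frame(
  ParticipantID = 1:6,
  Condition = rep(1:3, 2),
  I_Words = c(15, 10, 20, 30, 5, 0),
  YOU_Words = c(45, 20, 10, 15, 0, 30),
  WE_Words = c(0, 10, 5, 15, 20, 10)
)

word_cols <- c("I_Words", "YOU_Words", "WE_Words")

# one divisor per row: 15 for Condition 1, 10 otherwise
divisor <- ifelse(PronounData$Condition == 1, 15, 10)

# dividing a data frame by a vector of length nrow() divides every column row-wise
PronounData[paste0(word_cols, "_PPM")] <- PronounData[word_cols] / divisor
PronounData
```

If more conditions with different divisors turn up later, only the `divisor` line needs to change.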

Replace values in one dataframe with values from another that are not NA

I have two dataframes, A and B, that have the same column names and the same first column (Location):
A <- data.frame("Location" = 1:3, "X" = c(21,15, 7), "Y" = c(41,5, 5), "Z" = c(12,103, 88))
B <- data.frame("Location" = 1:3, "X" = c(NA,NA, 14), "Y" = c(50,8, NA), "Z" = c(NA,14, 12))
How do I replace the values in dataframe A with the values from B wherever the value in B is not NA?
Thanks.
We can use coalesce
library(dplyr)
A %>%
mutate(across(-Location, ~ coalesce(B[[cur_column()]], .)))
Output:
# Location X Y Z
#1 1 21 50 12
#2 2 15 8 14
#3 3 14 5 12
Here's an answer in base R:
i <- which(!is.na(B), arr.ind = TRUE)
A[i] <- B[i]
A
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12
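A closely related sketch uses the logical matrix from `is.na()` directly as the index instead of going through `which()`. The name `keep` is just for illustration, and this assumes A and B have the same dimensions and row order:

```r
A <- data.frame(Location = 1:3, X = c(21, 15, 7), Y = c(41, 5, 5), Z = c(12, 103, 88))
B <- data.frame(Location = 1:3, X = c(NA, NA, 14), Y = c(50, 8, NA), Z = c(NA, 14, 12))

# logical matrix with the same shape as B: TRUE where B has a value
keep <- !is.na(B)

# copy only the non-NA cells of B into the matching cells of A
A[keep] <- B[keep]
A
```

Note that the Location column is copied too, which is harmless here because it is identical in both data frames.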
One option with fcoalesce from the data.table package:
list2DF(Map(data.table::fcoalesce,B,A))
gives
Location X Y Z
1 1 21 50 12
2 2 15 8 14
3 3 14 5 12

R - Sum list of matrix with different columns

I have a large list of matrices with different columns, and I would like to sum these matrices, counting 0 wherever column X does not exist in a matrix.
If you have used the function rbind.fill from plyr, I would like something similar but with a sum function. Of course I could build a function to do that, but I'm looking for a native function efficiently programmed in Fortran or C because my data is large.
Here is an example.
This is the easy case, where all matrices have the same columns:
aa <- list(
m1 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
aa
Reduce('+',aa)
Giving the results:
> aa
$m1
a b c
1 1 4 7
2 2 5 8
3 3 6 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> Reduce('+',aa)
a b c
1 2 8 14
2 4 10 16
3 6 12 18
And with my data:
bb <- list(
m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
bb
Reduce('+',bb)
Here I would like to have b = c(0,0,0) in the first matrix to sum them.
> bb
$m1
a c
1 1 7
2 2 8
3 3 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Many thanks!
Xevi
One option would be
un1 <- sort(unique(unlist(lapply(bb, colnames))))
bb1 <- lapply(bb, function(x) {
nm1 <- setdiff(un1, colnames(x))
m1 <- matrix(0, nrow = nrow(x), ncol = length(nm1), dimnames = list(NULL, nm1))
cbind(x, m1)[, un1]})
and use the Reduce
Reduce(`+`, bb1)
# a b c
# 1 2 4 14
# 2 4 5 16
# 3 6 6 18
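Since the list is large, a variation that might be worth trying is to accumulate into one preallocated matrix instead of building a padded copy of every matrix first. This is only a sketch: `sum_by_colname` is a hypothetical helper name, and it assumes every matrix has the same rows in the same order and has column names.

```r
sum_by_colname <- function(mats) {
  # union of all column names across the list
  cols <- sort(unique(unlist(lapply(mats, colnames))))
  # result starts as all zeros, so missing columns contribute 0
  total <- matrix(0, nrow = nrow(mats[[1]]), ncol = length(cols),
                  dimnames = list(rownames(mats[[1]]), cols))
  for (m in mats) {
    # add each matrix into the columns it actually has
    total[, colnames(m)] <- total[, colnames(m), drop = FALSE] + m
  }
  total
}

bb <- list(
  m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3), c('a','c'))),
  m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3), c('a','b','c')))
)
sum_by_colname(bb)
```

Whether this is actually faster than the lapply/Reduce version on your data would need to be benchmarked.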

Randomly sample values from a pool so that the sum is less than a threshold in R

Let's say we have a pool of values and I want to sample a random number of values from this pool, so that the sum of these values is between two thresholds. I want to design a function in R to implement that.
pool = data.frame(ID = letters, value = sample(1:5, size = 26, replace = TRUE))
> print(pool)
ID value
1 a 1
2 b 4
3 c 4
4 d 2
5 e 2
6 f 4
7 g 5
8 h 5
9 i 4
10 j 3
11 k 3
12 l 5
13 m 3
14 n 2
15 o 3
16 p 4
17 q 1
18 r 1
19 s 5
20 t 1
21 u 2
22 v 4
23 w 5
24 x 2
25 y 4
26 z 1
I want to randomly sample whatever number of IDs is needed so that the sum of values for these IDs is between two thresholds, let's say between 8 and 10 (including the two boundaries). The expected outcome should look like one of these:
c("a", "b", "c")
c("f", "g")
c("a", "d", "e", "j", "k")
I think this question has not been asked previously. Does anyone have clues?
Here's an approach where I shuffle the input and check the cumulative sum of the shuffled output to look for an acceptable sum.
If a subset of that initial sequence happens to work, it outputs that sequence (in this manifestation, the longest sequence under the max threshold). If it doesn't work, it reshuffles and looks again, up to the max number of iterations.
set.seed(42)
library(dplyr)
sample_in_range <- function(src_tbl, min_sum = 8, max_sum = 10, max_iter = 100) {
for(i in 1:max_iter) {
output <- src_tbl %>%
sample_n(nrow(src_tbl)) %>%
mutate(ID = as.character(ID),
cuml = cumsum(value)) %>%
filter(cuml <= max_sum)
if(max(output$cuml) >= min_sum) return(output)
}
}
output <- sample_in_range(pool)
output
ID value cuml
1 k 3 3
2 w 2 5
3 z 4 9
4 t 1 10
output %>% pull(ID)
[1] "k" "w" "z" "t"
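For anyone who would rather avoid the dplyr dependency, the same shuffle-and-cumsum idea can be sketched in base R. `sample_ids_in_range` is a hypothetical name, and it assumes all values are positive, so the rows kept by the cumulative sum always form a prefix of the shuffled pool. It returns NULL explicitly when no shuffle succeeds within `max_iter`:

```r
sample_ids_in_range <- function(pool, min_sum = 8, max_sum = 10, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    # shuffle the rows, then keep the leading rows whose running total fits
    shuffled <- pool[sample(nrow(pool)), ]
    keep <- cumsum(shuffled$value) <= max_sum
    picked <- shuffled[keep, ]
    if (nrow(picked) > 0 && sum(picked$value) >= min_sum) {
      return(as.character(picked$ID))
    }
  }
  NULL  # no acceptable subset found within max_iter shuffles
}

set.seed(42)
pool <- data.frame(ID = letters, value = sample(1:5, size = 26, replace = TRUE))
ids <- sample_ids_in_range(pool)
ids
sum(pool$value[pool$ID %in% ids])  # should lie between min_sum and max_sum
```

Callers should check for a NULL result before using the IDs.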

R- random sample of groups in a data.table

How can I randomly sample e.g. three groups within a data.table so that the result contains three groups with all rows from the original data.table?
library(data.table)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C", "D", "E", "F"), 20,
replace=TRUE))
I know how to select 10 rows randomly from a data.table:
dat.sampl1 <- dat[sample(.N, 10)]
And also how to sample by group
dat[,.SD[sample(.N, min(.N,3))], by = groups]
But how to randomly sample groups? So the result should look like:
ids groups
1 F
11 F
3 F
18 F
8 A
9 A
10 A
17 A
19 A
12 E
14 E
16 E
Do you mean something like:
set.seed(123)
dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C", "D", "E", "F"), 20,
replace=TRUE))
dat[groups %in% sample(unique(dat[, groups]), size = 3)][order(groups)]
# ids groups
# 1: 3 C
# 2: 10 C
# 3: 12 C
# 4: 7 D
# 5: 9 D
# 6: 14 D
# 7: 4 F
# 8: 5 F
# 9: 8 F
# 10: 11 F
# 11: 16 F
# 12: 20 F
If you want to sample groups with replacement, you can do the following, where A has been sampled twice:
dat[unique(dat[, list(groups)])[sample(.N, 3, replace = TRUE)], on = "groups"]
# ids groups
# 1: 3 C
# 2: 10 C
# 3: 12 C
# 4: 6 A
# 5: 15 A
# 6: 18 A
# 7: 6 A
# 8: 15 A
# 9: 18 A
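If you need this in more than one place, the without-replacement version can be wrapped in a small helper. This is a sketch only: `sample_groups` and its `col` argument are names I made up, and it assumes the grouping column is character or factor.

```r
library(data.table)

sample_groups <- function(dt, n, col = "groups") {
  grp <- unique(dt[[col]])
  # fail early if more groups are requested than exist
  stopifnot(n <= length(grp))
  picked <- sample(grp, size = n)
  # keep every row that belongs to one of the sampled groups
  dt[dt[[col]] %in% picked]
}

set.seed(1)
dat <- data.table(ids = 1:20, groups = rep(c("A", "B", "C", "D"), 5))
res <- sample_groups(dat, 3)
res[order(groups)]
```

The `stopifnot()` guard turns a request for too many groups into an immediate error instead of a confusing message from `sample()`.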
This can also be done in a single line of base R, using %in% to keep the rows whose group appears in a random sample of the unique group names:
df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = FALSE), ]
For example:
> df1 <- data.frame("ids" = 1:20, "groups" = sample(LETTERS[1:4], size = 20, replace = TRUE))
> df2 <- df1[df1[,'groups'] %in% sample(unique(df1[,'groups']), size = 3, replace = FALSE), ]
> df2[order(df2[,'groups']),]
ids groups
4 4 B
6 6 B
18 18 B
20 20 B
1 1 C
2 2 C
3 3 C
9 9 C
12 12 C
16 16 C
19 19 C
7 7 D
11 11 D
