How do I compare the relative frequencies in R?

I ran a study asking participants to choose between options A, B, and C based on various parameters.
For example: Which of the following did you find to be most inspirational?
My data looks something like this:
ID   Inspiration
1    A
2    C
3    B
4    C
5    B
6    C
7    B
8    C
9    A
10   B
11   A
12   B
I have calculated the relative frequencies so my data now looks like this:
ID   Inspiration   Proportion
1    A             .25
2    C             .33
3    B             .42
4    C             .33
5    B             .42
6    C             .33
7    B             .42
8    C             .33
9    A             .25
10   B             .42
11   A             .25
12   B             .42
My question is: how do I test whether the relative frequency of each choice differs significantly from the others? That is, how do I know if the frequency with which people chose option A is significantly different from the frequency with which they chose B or C?
I have tried t-tests, ANOVAs, chi-squared tests, and two-proportion z-tests, but none of these seems to be exactly what I'm looking for.

First use dput() to provide your data:
dta <- structure(list(ID = 1:12,
                      Inspiration = c("A", "C", "B", "C", "B", "C",
                                      "B", "C", "A", "B", "A", "B")),
                 class = "data.frame", row.names = c(NA, -12L))
Now construct a table of the counts:
tbl <- table(dta$Inspiration)
tbl
# A B C
# 3 5 4
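The relative frequencies from the question are then just prop.table() applied to this table:
round(prop.table(tbl), 2)
#    A    B    C
# 0.25 0.42 0.33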
Now test the null hypothesis that each choice is selected equally often (i.e. frequency of A == B == C):
chisq.test(tbl)
#
# Chi-squared test for given probabilities
#
# data: tbl
# X-squared = 0.5, df = 2, p-value = 0.7788
#
# Warning message:
# In chisq.test(tbl) : Chi-squared approximation may be incorrect
No significant difference between the categories. (The warning appears because the expected counts are small: 12 responses give an expected count of 4 per cell. A simulated p-value, chisq.test(tbl, simulate.p.value = TRUE), is a safer check at this sample size.)
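If you specifically want per-option comparisons (e.g. is A chosen at a different rate than the equal-choice baseline?), one option is an exact binomial test per category with a multiplicity correction; a minimal sketch, keeping in mind the three tests are not independent because the proportions sum to 1:
# one exact test per option against the equal-choice rate of 1/3
pvals <- sapply(names(tbl), function(ch)
  binom.test(tbl[[ch]], sum(tbl), p = 1/3)$p.value)
# adjust for testing three options
p.adjust(pvals, method = "holm")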

Related

How to get the size of sets given a list of pairs?

Let's say I have run different tests to see if some objects are identical. The testing was done pairwise, and I have a dataframe containing the pairs of objects that are the same:
same.pairs <- data.frame(Test = c(rep(1, 4), rep(2, 6)),
                         First = c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"),
                         Second = c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G"))
##
Test First Second
1 A B
1 A C
1 B C
1 D E
2 A B
2 A E
2 B E
2 C D
2 C G
2 D G
From this I can see that in Test 1, because A = B and A = C and B = C, then A = B = C and these 3 objects belong in one set of size 3.
I want to know the full size of the sets for each test. For this example, I want to know that for Test 1, one set is 3 identical objects (A, B, C) and one set is 2 (D, E), and for Test 2, two sets are size 3 ((A, B, E) and (C, D, G)). I don't need to know which objects are in each set, just the size of the sets and the counts of how many sets are that size:
Test ReplicateSize Count
1 3 1
1 2 1
2 3 2
Is there an elegant way to do this? I thought I had it with this:
sets <- same.pairs %>%
  group_by(Test, First) %>%
  summarize(ReplicateSize = n()) %>%
  # add 1 to size because the above only counts the second genotype; need to include the first
  mutate(ReplicateSize = ReplicateSize + 1) %>%
  select(-First) %>%
  ungroup() %>%
  group_by(Test, ReplicateSize) %>%
  summarize(Count = n()) %>%
  arrange(Test, ReplicateSize)
##
Test ReplicateSize Count
1 2 2
1 3 1
2 2 2
2 3 2
but this double-counts some of the sets: in Test 1, for example, B & C are counted as a set of size 2 instead of being ignored because they are already part of a set with A. I'm not sure how to skip rows whose First object has already appeared as a Second object without writing a complicated for loop.
Any guidance appreciated.
I don't fully understand what you are trying to accomplish, but your current code could be truncated to the following:
same.pairs %>%
  count(Test, First, name = "ReplicateSize") %>%
  count(Test, ReplicateSize, name = "Count") %>%
  mutate(ReplicateSize = ReplicateSize + 1)
Test ReplicateSize Count
1 1 2 2
2 1 3 1
3 2 2 2
4 2 3 2
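That said, if the goal is the true (transitive) set sizes, the pairs describe an undirected graph and the sets are its connected components, which avoids the double counting entirely since each object lands in exactly one component. A sketch using the igraph package (an assumption; it is not used in the question):
library(igraph)

set_sizes <- function(pairs) {
  # identical objects form connected components of the pairwise-equality graph
  g <- graph_from_data_frame(pairs[, c("First", "Second")], directed = FALSE)
  table(components(g)$csize)
}

lapply(split(same.pairs, same.pairs$Test), set_sizes)
# $`1`
#
# 2 3
# 1 1
#
# $`2`
#
# 3
# 2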

Randomly sample values from a pool so that the sum is less than a threshold in R

Let's say we have a pool of values, and I want to sample a random number of values from this pool so that their sum falls between two thresholds. I want to design a function in R to implement that.
pool <- data.frame(ID = letters, value = sample(1:5, size = 26, replace = TRUE))
> print(pool)
ID value
1 a 1
2 b 4
3 c 4
4 d 2
5 e 2
6 f 4
7 g 5
8 h 5
9 i 4
10 j 3
11 k 3
12 l 5
13 m 3
14 n 2
15 o 3
16 p 4
17 q 1
18 r 1
19 s 5
20 t 1
21 u 2
22 v 4
23 w 5
24 x 2
25 y 4
26 z 1
I want to randomly sample whatever number of IDs such that the sum of their values lies between two thresholds, say between 8 and 10 (including the two boundaries). The expected outcome should look like these:
c("a", "b", "c")
c("f", "g")
c("a", "d", "e", "j", "k")
I don't think this question has been asked before. Does anyone have any ideas?
Here's an approach where I shuffle the input and check the cumulative sum of the shuffled output to look for an acceptable sum.
If a subset of that initial sequence happens to work, it outputs that sequence (in this implementation, the longest sequence under the max threshold). If it doesn't work, it reshuffles and looks again, up to the maximum number of iterations.
set.seed(42)
library(dplyr)
sample_in_range <- function(src_tbl, min_sum = 8, max_sum = 10, max_iter = 100) {
  for (i in 1:max_iter) {
    output <- src_tbl %>%
      sample_n(nrow(src_tbl)) %>%            # shuffle the whole pool
      mutate(ID = as.character(ID),
             cuml = cumsum(value)) %>%
      filter(cuml <= max_sum)                # longest prefix not exceeding max_sum
    if (nrow(output) > 0 && max(output$cuml) >= min_sum) return(output)
  }
  NULL                                       # nothing acceptable within max_iter shuffles
}
output <- sample_in_range(pool)
output
ID value cuml
1 k 3 3
2 w 2 5
3 z 4 9
4 t 1 10
output %>% pull(ID)
[1] "k" "w" "z" "t"

Correction for multiple testing for very large files with repetitions

I have 10 files of ~8-9 GB each, like:
7 72603 0.0780181622612
15 72603 0.027069072329
20 72603 0.00215643186987
24 72603 0.00247965378216
29 72603 0.0785606184492
32 72603 0.0486866833899
33 72603 0.000123332654879
For each pair of numbers (1st and 2nd columns) I have a p-value (3rd column).
However, I have repeated pairs (they can occur in different files, with the order reversed) and I want to keep only one of each. If the files were smaller, I would use pandas. E.g.:
7 15 0.0012423442
...
15 7 0.0012423442
I also want to apply a multiple-testing correction to this set, but the vector of p-values is very large.
Is it possible to do this with Python or R?
> df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
+ V2 = c("B", "C", "A", "C", "A", "B"),
+ n = c(1, 3, 1, 2, 3, 2))
> df
V1 V2 n
1 A B 1
2 A C 3
3 B A 1
4 B C 2
5 C A 3
6 C B 2
> df[!duplicated(t(apply(df, 1, sort))), ]
V1 V2 n
1 A B 1
2 A C 3
4 B C 2
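For the multiple-testing part, once the duplicates are dropped the p-values can go straight through p.adjust(), which copes with long vectors. A sketch on the toy data, where column n stands in for the real p-value column (an assumption about your column layout):
dedup <- df[!duplicated(t(apply(df[, c("V1", "V2")], 1, sort))), ]
dedup$p.adj <- p.adjust(dedup$n, method = "BH")  # Benjamini-Hochberg FDR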

Subtraction based on two factors

My dataframe looks like so:
group <- c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C")
value <- c(3:6, 1:4, 4:9)
type <- c("d", "d", "e", "e", "g", "g", "e", "e", "d", "d", "e", "e", "f", "f")
df <- cbind.data.frame(group, value, type)
df
group value type
1 A 3 d
2 A 4 d
3 A 5 e
4 A 6 e
5 B 1 g
6 B 2 g
7 B 3 e
8 B 4 e
9 C 4 d
10 C 5 d
11 C 6 e
12 C 7 e
13 C 8 f
14 C 9 f
Within each level of factor "group" I would like to subtract the values based on "type", such that (for group "A") 3 - 5 (1st value of d - 1st value of e) and 4 - 6 (2nd value of d - 2nd value of e). My outcome should look something like this:
A
group d_e
1 A -2
2 A -2
B
group g_e
1 B -2
2 B -2
C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
So if - as for group C - there are more than 2 types, I would like to calculate the difference between each combination of types.
Reading this post I reckon I could maybe use ddply and transform. However, I am struggling with finding a way to automatically assign the types, given that each group consists of different types and also different numbers of types.
Do you have any suggestions as to how I could manage that?
It's not clear why the sample answer in the post has two identical rows in each output group rather than just one, but at any rate this produces similar output to that shown:
DF <- df[!duplicated(df[-2]), ]  # keep the first row for each group/type pair
f <- function(x) setNames(
  data.frame(group = x$group[1:2], as.list(-combn(x$value, 2, diff))),
  c("group", combn(x$type, 2, paste, collapse = "_"))
)
by(DF, DF$group, f)
giving:
DF$group: A
group d_e
1 A -2
2 A -2
------------------------------------------------------------
DF$group: B
group d_e
1 B -2
2 B -2
------------------------------------------------------------
DF$group: C
group d_e d_f e_f
1 C -2 -4 -2
2 C -2 -4 -2
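The duplicated rows come from group = x$group[1:2], which recycles each single difference across two rows. The key step is the pair of combn() calls; for group C after deduplication the first values are 4, 6, 8 for types d, e, f:
-combn(c(4, 6, 8), 2, diff)
# [1] -2 -4 -2
combn(c("d", "e", "f"), 2, paste, collapse = "_")
# [1] "d_e" "d_f" "e_f"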

map (align) smaller to larger sequence in r

I have the following framework dataset:
master <- data.frame(namest = c("A", "B", "C", "D", "E", "F"),
                     position = c(0, 10, 20, 25, 30, 35))
master
namest position
1 A 0
2 B 10
3 C 20
4 D 25
5 E 30
6 F 35
This is the bigger map (say, a road map), with a place name and a position for each entry. In a second survey we have smaller subsets (many of them; here just 3).
subset1 <- data.frame(namest = c("I", "A", "ii", "iii", "B"),
                      position = c(0, 10, 12, 14, 20))
subset1
namest position
1 I 0
2 A 10
3 ii 12
4 iii 14
5 B 20
subset2 <- data.frame(namest = c("E", "vii", "F"),
                      position = c(0, 3, 5))
subset2
namest position
1 E 0
2 vii 3
3 F 5
subset3 <- data.frame(namest = c("D", "vi", "v", "C", "iv"),
                      position = c(0, 2, 3, 5, 8))
subset3
namest position
1 D 0
2 vi 2
3 v 3
4 C 5
5 iv 8
You can see that each subset shares at least two names with master, for example D and C in subset3.
Now I want to combine these subsets to make a more detailed master, i.e. place the new namest entries on the new map. Note that some subsets (see subset3) run in the reverse order compared to master.
Thus expected output is:
subsetalign <- data.frame(subsett = c(rep("A-B", nrow(subset1)),
                                      rep("C-D", nrow(subset3)),
                                      rep("E-F", nrow(subset2))),
                          namest = c(c("I", "A", "ii", "iii", "B"),
                                     rev(c("D", "vi", "v", "C", "iv")),
                                     c("E", "vii", "F")),
                          position = c(subset1$position, rev(subset3$position), subset2$position))
subsetalign
subsett namest position
1 A-B I 0
2 A-B A 10
3 A-B ii 12
4 A-B iii 14
5 A-B B 20
6 C-D iv 8
7 C-D C 5
8 C-D v 3
9 C-D vi 2
10 C-D D 0
11 E-F E 0
12 E-F vii 3
13 E-F F 5
The output process can be visualized by drawing each subset aligned along the master (I do not mean to create such a figure at this point, just to explain better).
Edits:
It is not simply rbind, due to two things:
(a) The subsets are ordered based on how their common namest entries are arranged in the master file. For example subset1 (A-B) + subset3 (C-D) + subset2 (E-F), as the order in master is A-B-C-D-E-F.
(b) If a subset runs in the reverse order relative to master, it should be reversed. In subset3 the order of namest is "D"-"vi"-"v"-"C"-"iv", but in master D comes after C, so subset3 should be reversed before binding.
Suppose the subsets are in a list
subsets <- list(subset1, subset2, subset3)
The location of the anchors in the master are
idx <- lapply(subsets, function(x, y) match(x$namest, y$namest), master)
The orientation of each subset is
orientation <- sapply(idx, function(elt) unique(diff(elt[!is.na(elt)])))
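For these subsets the orientations are:
orientation
# [1]  1  1 -1
so subset3 runs opposite to the master.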
And the position in the master is
position <- sapply(idx, function(elt) min(elt, na.rm=TRUE))
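Here that gives:
position
# [1] 1 5 3
so order(position) puts them in master order: subset1, subset3, subset2.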
The subsets can be ordered with subsets[order(position)] and reversed where necessary
updt <- Map(function(elt, dir) {
  if (dir == -1)
    elt[rev(seq_len(nrow(elt))), ]
  else elt
}, subsets[order(position)], orientation[order(position)])
and rbind-ed together with do.call(rbind, updt). This assumes that every interval in master is represented exactly once.
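To also attach the subsett label from the expected output, one possible final assembly (a sketch; the label is pasted from each subset's two anchor names, which after reorienting always appear in master order):
aligned <- do.call(rbind, lapply(updt, function(elt) {
  anchors <- elt$namest[elt$namest %in% master$namest]  # the two shared names
  cbind(subsett = paste(anchors, collapse = "-"), elt)
}))
aligned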
