Bind the frequencies of two observations in R

I have two observation vectors and my aim is to bind their frequency counts together so I can perform e.g. a chi-square test.
a <- c(1,1,5,6,3,6,1,5,5,1,2,5,2,1,3,3,6,5,7,4)
b <- c(1,5,4,4,1,5,4,4,2,1,2,1,2)
> table(a)
a
1 2 3 4 5 6 7
5 2 3 1 5 3 1
> table(b)
b
1 2 4 5
4 3 4 2
As the output shows, the second vector has no observations at the levels 3, 6 and 7. Hence I can't bind them using cbind(table(a), table(b)), as this results in:
> cbind(table(a), table(b))
[,1] [,2]
1 5 4
2 2 3
3 3 4
4 1 2
5 5 4
6 3 3
7 1 4
Warning message:
In cbind(table(a), table(b)) :
number of rows of result is not a multiple of vector length (arg 2)
I was wondering about appropriate methods to combine the observations to get a result similar to this:
[,1] [,2]
1 5 4
2 2 3
3 3 0
4 1 4
5 5 2
6 3 0
7 1 0

We can convert each vector to a factor with the levels specified as the sorted union of both vectors, get the frequency counts of each with table, and cbind them:
un1 <- sort(union(a,b))
cbind(table(factor(a, levels = un1)), table(factor(b, levels = un1)))
# [,1] [,2]
#1 5 4
#2 2 3
#3 3 0
#4 1 4
#5 5 2
#6 3 0
#7 1 0
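Since the stated goal is a chi-square test, the bound table can be passed straight to chisq.test; a minimal sketch, where tab is just a name introduced here for the combined matrix:
tab <- cbind(table(factor(a, levels = un1)), table(factor(b, levels = un1)))
chisq.test(tab)
# with several small or zero cells the asymptotic approximation may be poor,
# so simulating the p-value can be more appropriate
chisq.test(tab, simulate.p.value = TRUE)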

This will also work:
df <- merge(table(a), table(b), by.x='a', by.y='b', all=TRUE)[-1]
df[is.na(df)] <- 0
df
# Freq.x Freq.y
#1 5 4
#2 2 3
#3 3 0
#4 1 4
#5 5 2
#6 3 0
#7 1 0

Related

R left_join() replacing joined values rather than adding in new columns

I have the following dataframes:
A <- data.frame(AgentNo = c(1,2,3,4,5,6),
                N = c(2,5,6,1,9,0),
                Rarity = c(1,2,1,1,2,2))
AgentNo N Rarity
1 1 2 1
2 2 5 2
3 3 6 1
4 4 1 1
5 5 9 2
6 6 0 2
B <- data.frame(Rank = c(1,5),
                AgentNo.x = c(2,5),
                AgentNo.y = c(1,4),
                N = c(3,1),
                Rarity = c(1,2))
Rank AgentNo.x AgentNo.y N Rarity
1 1 2 1 3 1
2 5 5 4 1 2
I would like to left-join B onto A by the columns "AgentNo" = "AgentNo.y" and "N" = "N", but rather than adding new columns from B to A, I want to keep A's columns and have the joined values updated with the values taken from B.
For any joined rows I want A.AgentNo to become B.AgentNo.x, A.N to become B.N, and A.Rarity to become B.Rarity. I would like to drop B.Rank and B.AgentNo.y completely.
The result should be:
Result <- data.frame(AgentNo=c(2,2,3,5,5,6), N=c(3,5,6,1,9,0), Rarity=c(1,2,1,2,2,2))
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
After some data wrangling, you can use rows_update to update the rows of A by the values of B:
library(dplyr)
A <- A %>%
  mutate(AgentNo.y = AgentNo)
B <- select(B, AgentNo = AgentNo.x, AgentNo.y, N, Rarity)
rows_update(A, B, by = "AgentNo.y") %>%
  select(-AgentNo.y)
output
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
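Since the question title mentions left_join(), a hedged alternative sketch uses left_join() plus coalesce(), working from the original A and B as defined in the question and joining only on AgentNo = AgentNo.y (the *.new column names are placeholders introduced here):
library(dplyr)
A %>%
  left_join(select(B, AgentNo.y, AgentNo.new = AgentNo.x, N.new = N, Rarity.new = Rarity),
            by = c("AgentNo" = "AgentNo.y")) %>%
  mutate(AgentNo = coalesce(AgentNo.new, AgentNo),   # keep A's value where there was no match
         N       = coalesce(N.new, N),
         Rarity  = coalesce(Rarity.new, Rarity)) %>%
  select(AgentNo, N, Rarity)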

Data merge with data.table for repeating unique values

I am trying to merge two columns in data.table 'A' with a column of unique values in another data.table 'B'. I want to merge in such a way that for every unique combination of the two variables in 'A', all unique values of the column in 'B' are repeated.
I tried merge, but it doesn't give me all the values. I also tried the automatic recycling in data.table, but that doesn't give the desired result either.
Input:
data.table A
X Y
1 1
1 2
1 3
2 1
3 1
4 4
4 5
5 6
data.table B
Z
1
2
Expected output
X Y Z
1 1 1
1 1 2
1 2 1
1 2 2
1 3 1
1 3 2
2 1 1
2 1 2
3 1 1
3 1 2
4 4 1
4 4 2
4 5 1
4 5 2
5 6 1
5 6 2
We can make use of crossing from tidyr
library(tidyr)
crossing(A, B)
# X Y Z
#1 1 1 1
#2 1 1 2
#3 1 2 1
#4 1 2 2
#5 1 3 1
#6 1 3 2
#7 2 1 1
#8 2 1 2
#9 3 1 1
#10 3 1 2
#11 4 4 1
#12 4 4 2
#13 4 5 1
#14 4 5 2
#15 5 6 1
#16 5 6 2
Or with merge from base R, but the order will be slightly different
merge(A, B)
To get the expected order, swap the arguments and then reorder the columns:
merge(B, A)[c(names(A), names(B))]
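Since the question is about data.table, a data.table-only sketch is a join on a temporary constant key (the helper column k is a placeholder introduced here, and A and B are assumed to already be data.tables):
library(data.table)
A[, k := 1L]
B[, k := 1L]
# every row of A paired with every row of B via the constant key
out <- B[A, on = "k", allow.cartesian = TRUE][, k := NULL][]
setcolorder(out, c("X", "Y", "Z"))
A[, k := NULL]   # remove the helper key from the inputs again
B[, k := NULL]
out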

In a large dataset, is there a fast way to identify repeated data records and bucket records with similar repeat patterns in R?

Assume I have a data frame like this:
my_df <- data.frame(mat1 = c(1,2,2,2,1,2,2),
                    mat2 = c(5,4,3,1,5,4,4),
                    mat3 = c(4,1,6,9,4,1,1),
                    mat4 = c(1,2,6,9,1,2,2))
I actually know how to identify the repeats, which gives me the following:
mat1 mat2 mat3 mat4 Repeat
1 1 5 4 1 TRUE
2 2 4 1 2 TRUE
3 2 3 6 6 FALSE
4 2 1 9 9 FALSE
5 1 5 4 1 TRUE
6 2 4 1 2 TRUE
7 2 4 1 2 TRUE
I want to bucket the similar patterns to generate classes as follows:
mat1 mat2 mat3 mat4 Repeat repeat_class
1 1 5 4 1 TRUE 1
2 2 4 1 2 TRUE 2
3 2 3 6 6 FALSE 0
4 2 1 9 9 FALSE 0
5 1 5 4 1 TRUE 1
6 2 4 1 2 TRUE 2
7 2 4 1 2 TRUE 2
where repeat_class = 0 marks non-repeated data records and repeat_class = 1, 2, etc. identifies the similar patterns found in the data records.
I can do it in for loops, but for a large dataset it is just taking too long. I'm wondering if there is any faster way to do that in R?
It looks like you want a column with a unique key for each repeat class in the data frame.
In dplyr, we can use the function group_indices:
library(dplyr)
my_df$repeat_class <- my_df %>%
  group_indices(mat1, mat2, mat3, mat4)
mat1 mat2 mat3 mat4 repeat_class
1 1 5 4 1 1
2 2 4 1 2 4
3 2 3 6 6 3
4 2 1 9 9 2
5 1 5 4 1 1
6 2 4 1 2 4
7 2 4 1 2 4
To match your output, where all non-duplicated rows share class 0, we can set them to 0:
my_df$repeat_class[!(duplicated(my_df$repeat_class) | duplicated(my_df$repeat_class, fromLast = T))] <- 0
mat1 mat2 mat3 mat4 repeat_class
1 1 5 4 1 1
2 2 4 1 2 4
3 2 3 6 6 0
4 2 1 9 9 0
5 1 5 4 1 1
6 2 4 1 2 4
7 2 4 1 2 4
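As a side note, group_indices() is superseded in newer dplyr releases; a roughly equivalent sketch using cur_group_id() inside mutate() (assuming dplyr >= 1.0.0) would be:
library(dplyr)
my_df %>%
  group_by(mat1, mat2, mat3, mat4) %>%
  mutate(repeat_class = cur_group_id() * (n() > 1)) %>%   # zero out singleton groups
  ungroup()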
Here is one option with .GRP from data.table. We group by all the columns of 'my_df' and assign (:=) the group index .GRP to 'repeat_class', zeroed out for groups with only one row (.N == 1):
library(data.table)
setDT(my_df)[, repeat_class := .GRP * (.N > 1), by = names(my_df)]
my_df
# mat1 mat2 mat3 mat4 repeat_class
#1: 1 5 4 1 1
#2: 2 4 1 2 2
#3: 2 3 6 6 0
#4: 2 1 9 9 0
#5: 1 5 4 1 1
#6: 2 4 1 2 2
#7: 2 4 1 2 2
Here's my guess (which mirrors my comment):
#Make a character vector that reflects the pattern:
my_df$pat <- apply(my_df,1,paste, collapse="_")
#Then use ave to measure length of each pattern and subtract 1 from the tally:
(my_df$repeat_class <- ave( seq(nrow(my_df)), my_df$pat, FUN=length ) - 1 )
#[1] 1 2 0 0 1 2 2
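For a base R variant that assigns a distinct class index per repeated pattern (rather than the repeat count minus one), a hedged sketch, assuming my_df is still the plain data frame from the question:
pat <- apply(my_df[, c("mat1", "mat2", "mat3", "mat4")], 1, paste, collapse = "_")
cnt <- ave(seq_along(pat), pat, FUN = length)   # how often each pattern occurs
cls <- match(pat, unique(pat[cnt > 1]))         # class index among repeated patterns only
my_df$repeat_class <- ifelse(cnt > 1, cls, 0)   # singletons get class 0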

Melt the lower half of a symmetric matrix in R

Given that I have a three-by-three symmetric matrix:
> x<-matrix(1:9,3)
> x[lower.tri(x)] = t(x)[lower.tri(x)]
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 5 8
[3,] 7 8 9
Then I use the reshape2 library to convert it to long format.
> library(reshape2)
> x <- melt(x)
> x
Var1 Var2 value
1 1 1 1
2 2 1 4
3 3 1 7
4 1 2 4
5 2 2 5
6 3 2 8
7 1 3 7
8 2 3 8
9 3 3 9
As the upper and lower triangles are identical, I only need half of the result, which should look like this:
Var1 Var2 value
1 1 1
2 1 4
3 1 7
2 2 5
3 2 8
3 3 9
Any elegant approach to do this?
You can set the values of the upper (or lower) triangle to NA and then melt, ignoring missing values. This works on the original matrix (before it was overwritten by the melt above) and assumes the matrix contains no missing values originally, or that you don't need to keep them in the result if it does:
x[upper.tri(x)] = NA
reshape2::melt(x, na.rm=T)
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
As 'x' was already reassigned to the melted data, we can sort the values of the 1st and 2nd columns within each row, get a logical index of the non-duplicated rows, and use it to subset the rows:
x[!duplicated(t(apply(x[1:2], 1, sort))),]
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
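When the matrix has no dimnames, melt() returns integer row and column indices in Var1 and Var2, so a short sketch that keeps only the rows on or below the diagonal of the melted data frame is:
x[x$Var1 >= x$Var2, ]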

Randomly Assign Integers in R within groups without replacement

I am running a study with two experiments: experiment_1 and experiment_2. Each experiment has 5 different treatments (i.e. 1, 2, 3, 4, 5). We are trying to randomly assign the treatments within groups.
We would like to do this via iterative sampling without replacement within each group. We want to do this to ensure we get as balanced a sample as possible across treatments (e.g. we don't want to end up with 4 subjects in group 1 assigned to treatment 2 and no one assigned to treatment 1). So if a group has 23 subjects, we want to split the respondents into 4 subgroups of 5 and 1 subgroup of 3. We then want to randomly sample without replacement within the first subgroup of 5, so that everyone gets assigned 1 of the treatments, do the same for the second, third and fourth subgroups of 5, and for the final subgroup of 3 randomly sample without replacement. This guarantees that every treatment is assigned to at least 4 subjects, and that 3 treatments are assigned to 5 subjects, within this group. We would like to do this for all the groups in the experiment and for both experiments. The resulting output would look something like this:
group experiment_1 experiment_2
[1,] 1 5 3
[2,] 1 3 2
[3,] 1 4 4
[4,] 1 1 5
[5,] 1 2 1
[6,] 1 2 3
[7,] 1 4 1
[8,] 1 3 2
[9,] 2 5 5
[10,] 2 1 4
[11,] 2 3 4
[12,] 2 1 5
[13,] 2 2 1
. . . .
. . . .
. . . .
I know how to use the sample function, but am unsure how to sample without replacement within each group so that the output corresponds to the procedure described above. Any help would be appreciated.
I think we just need to shuffle sample IDs, see this example:
set.seed(124)
# prepare groups and samples (shuffled)
df <- data.frame(group = sort(rep(1:3, 9)),
                 sampleID = sample(1:27, 27))
# treatments repeated along the rows of df
df$ex1 <- rep(c(1,2,3,4,5), ceiling(nrow(df)/5))[1:nrow(df)]
df$ex2 <- rep(c(2,3,4,5,1), ceiling(nrow(df)/5))[1:nrow(df)]
df <- df[order(df$group, df$sampleID), ]
#check treatment distribution
with(df,table(group,ex1))
# ex1
# group 1 2 3 4 5
# 1 2 2 2 2 1
# 2 2 2 2 1 2
# 3 2 2 1 2 2
with(df,table(group,ex2))
# ex2
# group 1 2 3 4 5
# 1 1 2 2 2 2
# 2 2 2 2 2 1
# 3 2 2 2 1 2
How about this function:
f <- function(n,m) {sample( c( rep(1:m,n%/%m), sample(1:m,n%%m) ), n )}
"n" is the group size, "m" the number of treatments.
Each treatment must be contained at least n %/% m times in the group. The treatment numbers of the remaining n %% m group members are assigned arbitrarily without repetition. The vector c( rep(1:m, n %/% m), sample(1:m, n %% m) ) contains these treatment numbers. Finally, the sample function shuffles these numbers.
> f(8,5)
[1] 5 3 1 5 4 2 2 1
> f(8,5)
[1] 4 5 3 4 2 2 1 1
> f(8,5)
[1] 4 2 1 5 3 5 2 3
Here is a function that creates a dataframe, using the above function:
Plan <- function(groupSizes, numExp = 2, numTreatment = 5)
{
  numGroups <- length(groupSizes)
  df <- data.frame(group = rep(1:numGroups, groupSizes))
  for (e in 1:numExp)
  {
    df <- cbind(df, unlist(lapply(groupSizes, function(n) { f(n, numTreatment) })))
    colnames(df)[e + 1] <- sprintf("Exp_%i", e)
  }
  return(df)
}
Example:
> P <- Plan(c(8,23,13,19))
> P
group Exp_1 Exp_2
1 1 4 1
2 1 1 4
3 1 2 2
4 1 2 1
5 1 3 5
6 1 5 5
7 1 1 2
8 1 3 3
9 2 5 1
10 2 2 1
11 2 5 2
12 2 1 2
13 2 2 1
14 2 1 4
15 2 3 5
16 2 5 3
17 2 2 4
18 2 5 4
19 2 2 5
20 2 1 1
21 2 4 2
22 2 3 3
23 2 4 3
24 2 2 5
25 2 3 3
26 2 5 2
27 2 1 5
28 2 3 4
29 2 4 4
30 2 4 2
31 2 4 3
32 3 2 5
33 3 5 3
34 3 5 1
35 3 5 1
36 3 2 5
37 3 4 4
38 3 1 4
39 3 3 2
40 3 3 2
41 3 3 3
42 3 1 1
43 3 4 2
44 3 4 4
45 4 5 1
46 4 3 1
47 4 1 2
48 4 1 5
49 4 3 3
50 4 3 1
51 4 4 5
52 4 2 4
53 4 5 3
54 4 2 1
55 4 4 2
56 4 2 5
57 4 4 4
58 4 5 3
59 4 5 4
60 4 1 2
61 4 2 5
62 4 3 2
63 4 4 4
Check the distribution:
> with(P,table(group,Exp_1))
Exp_1
group 1 2 3 4 5
1 2 2 2 1 1
2 4 5 4 5 5
3 2 2 3 3 3
4 3 4 4 4 4
> with(P,table(group,Exp_2))
Exp_2
group 1 2 3 4 5
1 2 2 1 1 2
2 4 5 5 5 4
3 3 3 2 3 2
4 4 4 3 4 4
The design of efficient experiments is a science on its own and there are a few R-packages dealing with this issue:
https://cran.r-project.org/web/views/ExperimentalDesign.html
I am afraid your approach is not optimal in terms of resources, no matter how you create the samples...
However this might help:
n <- 23
group <- sort(rep(1:5, ceiling(n/5)))[1:n]
exp1 <- rep(NA, length(group))
for (i in 1:max(group)) {
  exp1[which(group == i)] <- sample(1:5)[1:sum(group == i)]
}
Not exactly sure if this meets all your constraints, but you could use the randomizr package:
library(randomizr)
experiment_1 <- complete_ra(N = 23, num_arms = 5)
experiment_2 <- block_ra(experiment_1, num_arms = 5)
table(experiment_1)
table(experiment_2)
table(experiment_1, experiment_2)
Produces output like this:
> table(experiment_1)
experiment_1
T1 T2 T3 T4 T5
4 5 5 4 5
> table(experiment_2)
experiment_2
T1 T2 T3 T4 T5
6 3 6 4 4
> table(experiment_1, experiment_2)
experiment_2
experiment_1 T1 T2 T3 T4 T5
T1 2 0 1 1 0
T2 1 1 1 1 1
T3 1 1 1 1 1
T4 1 0 2 0 1
T5 1 1 1 1 1
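If the real design has known group sizes, a hedged variant blocks directly on those groups so treatments come out as balanced as possible within each one (the sizes 8, 23, 13, 19 below are illustrative, borrowed from the earlier example):
library(randomizr)
group <- rep(1:4, times = c(8, 23, 13, 19))   # illustrative group sizes
experiment_1 <- block_ra(blocks = group, num_arms = 5)
experiment_2 <- block_ra(blocks = group, num_arms = 5)
table(group, experiment_1)   # roughly equal counts per treatment within each group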
