Find frequency of an element in a matrix in R

I have a dataset "data" with 7 rows and 4 columns, as follows:
var1 var2 var3 var4
A    C
A    C    B
B    A    C    D
D    B
     B
D    B
B    C
I want to create the following table "Mat" based on the data I have:
A B C D
1   1
1 1 1
1 1 1 1
  1   1
  1
  1   1
  1 1
Basically, I have taken the unique elements from the original data and created a matrix "Mat", where the number of rows in "Mat" equals the number of rows in "data" and the number of columns in "Mat" equals the number of unique elements in "data" (that is: A, B, C, D).
I wrote the following code in R:
rule <- c("A","B","C","D")
# note: dim() returns NULL for a plain vector, so ncol is NA here;
# length(rule) was presumably intended
mat <- matrix(, nrow = dim(data)[1], ncol = dim(rule)[1])
mat <- data.frame(mat)
x <- rule[,1]  # note: this errors for a plain vector; rule was presumably a data frame originally
nm <- as.character(x)
names(mat) <- nm
n_data <- dim(data)[1]
for (i in 1:n_data) {
  for (j in 2:dim(data)[2]) {
    for (k in 1:dim(mat)[2]) {
      # bug: mat[i,k] == 1 is a comparison, not an assignment, and the
      # ifelse() result is discarded, so mat is never filled and stays NA
      ifelse(data[i,j] == names(mat)[k], mat[i,k] == 1, 0)
    }
  }
}
I am getting all NA in "mat". Also, the running time is far too long, because my original data set has 20,000 rows and "Mat" has 100 columns.
Any advice will be highly appreciated. Thanks!

This should be faster than the nested for loops:
> sapply(c("A", "B", "C", "D"), function(x) { rowSums(df == x, na.rm = T) })
# A B C D
# [1,] 1 0 1 0
# [2,] 1 1 1 0
# [3,] 1 1 1 1
# [4,] 0 1 0 1
# [5,] 0 1 0 0
# [6,] 0 1 0 1
# [7,] 0 1 1 0
Data
df <- read.table(text = "var1 var2 var3 var4
A C NA NA
A C B NA
B A C D
D B NA NA
NA B NA NA
D B NA NA
B C NA NA", header = T, stringsAsFactors = F)

By using table and rep
table(rep(1:nrow(df),dim(df)[2]),unlist(df))
A B C D
1 1 0 1 0
2 1 1 1 0
3 1 1 1 1
4 0 1 0 1
5 0 1 0 0
6 0 1 0 1
7 0 1 1 0
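To unpack that one-liner, here is a sketch of what each piece produces (using the same df):
idx  <- rep(1:nrow(df), dim(df)[2])  # the row number of each value, in column-major order
vals <- unlist(df)                   # all values, flattened column by column
table(idx, vals)                     # table() drops NA pairs by default, so empty cells count as 0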

Related

Crosstab of two identical variables in R - reflect in diagonal

I've got a dataset where I'm interested in the frequencies of different pairs emerging, but it doesn't matter which order the elements occur in. For example:
library(janitor)
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
data.frame(x = sample(options, 20, replace = TRUE),
y = sample(options, 20, replace = TRUE)) %>%
tabyl(x, y)
provides me with the output
x a b c d e f
a 1 0 1 0 1 0
b 0 2 0 1 0 0
c 2 0 1 0 0 0
d 0 0 0 0 1 0
e 1 1 2 0 0 3
f 0 0 1 1 0 1
I'd ideally have the top right or bottom left of this table, where the combination of values a and c would be a total of 3. This is the sum of 1 (in the top right) and 2 (in the middle left). And so on for each other pair of values.
I'm sure there must be a simple way to do this, but I can't figure out what it is...
Edited to add (thanks #Akrun for the request): ideally I'd like the following output
x  a  b  c  d  e  f
a  1  0  3  0  2  0
b     2  0  1  1  0
c        1  0  2  1
d           0  1  1
e              0  3
f                 1
We could + the output (minus the first column) with its transpose, then replace the upper-triangle values of 'out' (subsetting the elements with upper.tri, which returns a logical matrix) with the corresponding elements of that sum, and set the lower-triangle elements to NA:
out2 <- out[-1] + t(out[-1])
out[-1][upper.tri(out[-1])] <- out2[upper.tri(out2)]
out[-1][lower.tri(out[-1])] <- NA
-output
out
# x a b c d e f
# a 1 0 3 0 2 0
# b NA 2 0 1 1 0
# c NA NA 1 0 2 1
# d NA NA NA 0 1 1
# e NA NA NA NA 0 3
# f NA NA NA NA NA 1
data
set.seed(24601)
options <- c("a", "b", "c", "d", "e", "f")
out <- data.frame(x = sample(options, 20, replace = TRUE),
y = sample(options, 20, replace = TRUE)) %>%
tabyl(x, y)
Here is another option, using igraph
out[-1] <- get.adjacency(
graph_from_data_frame(
get.data.frame(
graph_from_adjacency_matrix(
as.matrix(out[-1]), "directed"
)
), FALSE
),
type = "upper",
sparse = FALSE
)
which gives
> out
x a b c d e f
a 1 0 3 0 2 0
b 0 2 0 1 1 0
c 0 0 1 0 2 1
d 0 0 0 0 1 1
e 0 0 0 0 0 3
f 0 0 0 0 0 1
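The nested calls are easier to follow unwound step by step. A sketch of the same pipeline, assuming out is the fresh tabyl built from the data block above:
library(igraph)
g1 <- graph_from_adjacency_matrix(as.matrix(out[-1]), "directed")  # one edge per count
el <- get.data.frame(g1)                                           # the directed edge list
g2 <- graph_from_data_frame(el, FALSE)                             # rebuilt undirected, so a->c and c->a become parallel edges
get.adjacency(g2, type = "upper", sparse = FALSE)                  # edge multiplicities summed into the upper triangle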

R data.table - remove rows corresponding to a given marginal

I have the following problem. I have a data.table and a subset M of its columns, plus a vector x defined on M.
library(data.table)
data <- matrix(c(0,0,NA,1,0,1,NA,1,0,0,1,0,1,1,NA,NA,1,0,0,1,0,0,1,1,1,0,0,1,NA,0,1,1,0,1,1,1), byrow = T, ncol = 6, dimnames = list(NULL, LETTERS[1:6]))
dt <- data.table(data)
dt
% A B C D E F
% 1: 0 0 NA 1 0 1
% 2: NA 1 0 0 1 0
% 3: 1 1 NA NA 1 0
% 4: 0 1 0 0 1 1
% 5: 1 0 0 1 NA 0
% 6: 1 1 0 1 1 1
M = LETTERS[2:5]
x <- dt[2,..M]
x
% B C D E
% 1: 1 0 0 1
I would like to remove all rows from dt whose marginal on M equals x, i.e. rows 2 and 4. Both M and x change during the program. The result for the given M and x will be:
A B C D E F
1: 0 0 NA 1 0 1
2: 1 1 NA NA 1 0
3: 1 0 0 1 NA 0
4: 1 1 0 1 1 1
data.table anti-join
dt[!x, on = M] # also works: dt[!dt[2], on = M]
# A B C D E F
# 1: 0 0 NA 1 0 1
# 2: 1 1 NA NA 1 0
# 3: 1 0 0 1 NA 0
# 4: 1 1 0 1 1 1
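Since M and x change during the program, the anti-join wraps naturally into a small helper (drop_marginal is a hypothetical name, purely for illustration):
drop_marginal <- function(dt, x, M) dt[!x, on = M]  # anti-join rows matching x on columns M
drop_marginal(dt, dt[2, ..M], M)                    # same result as above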
Base R
eq2 <- Reduce('&', lapply(dt[, ..M], function(x) x == x[2]))
dt[-which(eq2),]
# A B C D E F
# 1: 0 0 NA 1 0 1
# 2: 1 1 NA NA 1 0
# 3: 1 0 0 1 NA 0
# 4: 1 1 0 1 1 1
Not really a data.table option, but with base R you can do:
data[rowSums(sweep(data[, M], 2, FUN = `==`, x), na.rm = TRUE) != length(x), ]
A B C D E F
[1,] 0 0 NA 1 0 1
[2,] 1 1 NA NA 1 0
[3,] 1 0 0 1 NA 0
[4,] 1 1 0 1 1 1
Another base R solution
> subset(dt,!data.frame(t(dt[,..M])) %in% data.frame(t(x)))
A B C D E F
1: 0 0 NA 1 0 1
2: 1 1 NA NA 1 0
3: 1 0 0 1 NA 0
4: 1 1 0 1 1 1
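Why the last trick works: t() turns each row of dt[, ..M] into a column, i.e. one list element of a data.frame, and %in% (via match) coerces each list element to a single character string before comparing, so a row matches only when all its values agree. A sketch of the intermediate pieces (the expected result, deduced from the data above, is that rows 2 and 4 match):
rows_as_cols <- data.frame(t(dt[, ..M]))  # one column per original row
target <- data.frame(t(x))
rows_as_cols %in% target
# [1] FALSE  TRUE FALSE  TRUE FALSE FALSE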

Formatting exam results to perform a t-test in R

Question Overview: I have a dataset containing the results of a 15-question pre-instructional and post-instructional exam. I am looking to run a t-test on the results to compare the overall means, but am having difficulty formatting the dataset properly. An example portion of the dataset is given below:
1Pre 1Post 2Pre 2Post 3Pre 3Post 4Pre 4Post
Correct B B A A B B C C
1 B B C D C B C C
2 C B B D C B C A
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 B B B A B B C C
5 B B B A B B C C
6 C B D A A D C B
7 C C D D E E C C
8 C A B B A A <NA> <NA>
Objective: I would like to match the "Correct" value to the values in the rows below for the test takers, such that a value of 1 is correct, and a value of 0 is incorrect. I have accomplished this using the following code:
for (j in 1:ncol(qDat)) {
  for (i in 1:nrow(qDat)) {
    if (qDat[i,j] == correctAns[1]) {
      qDat[i,j] = 1
    } else {
      qDat[i,j] = 0
    }
  }
}
I would then like to run a t-test comparing the pre and post means, in addition to comparing the difference between the pre and post scores for each question. However, I need to omit any data points with NA; currently my method does not work with NA values and simply replaces them with zero. Is there any method of running these tests that omits NA values? Thank you!
The Desired Output:
1Pre 1Post 2Pre 2Post 3Pre 3Post
Correct B B A A B B
1 1 1 0 0 0 1
2 0 1 0 0 0 1
3 <NA> <NA> <NA> <NA> <NA> <NA>
4 1 1 0 0 1 1
5 1 1 0 0 1 1
6 0 1 0 1 0 0
7 0 0 0 0 0 0
8 0 0 0 0 0 0
You can try passing the following argument to the t.test call:
na.action = na.omit
Something like:
with(qDat, t.test(`1Pre`, `1Post`, na.action = na.omit))
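For the overall pre/post comparison, here is a minimal sketch of the next step, assuming the recoded 0/1 data keeps the question's column names (1Pre, 1Post, ...), the Correct row has been dropped, and NAs are left in place rather than zeroed:
pre  <- rowMeans(qDat[, grep("Pre",  names(qDat))], na.rm = TRUE)  # per-student pre score
post <- rowMeans(qDat[, grep("Post", names(qDat))], na.rm = TRUE)  # per-student post score
t.test(pre, post, paired = TRUE)  # the paired t-test keeps only complete pairs itself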
What about this: I rewrote your loop - there is no need to worry too much about NAs if you treat them as 0; we can simply test the results and afterwards set the NAs to FALSE:
test <- qDat == correctAns # or correctAns[1] depending on your needs
test[is.na(test)] <- FALSE
storage.mode(test) <- "integer"
test
# X1 X2 X3 X4 X5 X6 X7 X8
# [1,] 0 1 0 0 1 0 1 0
# [2,] 0 0 1 0 0 0 0 0
# [3,] 0 1 0 0 1 0 0 0
# [4,] 0 0 1 0 0 0 0 0
# [5,] 1 0 0 0 0 0 1 0
# [6,] 0 0 1 1 1 1 1 0
# [7,] 0 0 0 1 0 0 1 0
# [8,] 0 0 0 0 0 0 0 1
with the data
set.seed(123)
correctAns <- sample(LETTERS[1:3], 8, replace = TRUE)
correctAns
# [1] "A" "C" "B" "C" "C" "A" "B" "C"
qDat <- sample(c(LETTERS[1:3], NA_character_), 8*2*4, replace = TRUE)
qDat <- data.frame(matrix(qDat, 8, 4*2), stringsAsFactors = FALSE)
qDat
# X1 X2 X3 X4 X5 X6 X7 X8
# 1 C A C C A B A <NA>
# 2 B A C <NA> B <NA> <NA> B
# 3 <NA> B C A B A <NA> <NA>
# 4 B <NA> C B B B B <NA>
# 5 C <NA> B <NA> A <NA> C <NA>
# 6 C C A A A A A B
# 7 A C <NA> B A C B <NA>
# 8 <NA> <NA> <NA> A B A B C
Edit
set.seed(123)
# correctAns is a vector of length 30
correctAns <- sample(LETTERS[1:3], 30, replace = TRUE)
length(correctAns)
# [1] 30
# qDat is a dataframe of dimensions 106x30
qDat <- sample(c(LETTERS[1:3], NA_character_), 106*30, replace = TRUE)
qDat <- data.frame(matrix(qDat, 106, 30), stringsAsFactors = FALSE)
dim(qDat)
# [1] 106 30
# still works
test <- qDat == correctAns
test[is.na(test)] <- FALSE
storage.mode(test) <- "integer"
str(test)
# int [1:106, 1:30] 0 0 0 0 0 0 0 0 1 0 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:30] "X1" "X2" "X3" "X4" ...

R: Update adjacency matrix/data frame using pairwise combinations

Question
Let's say I have this dataframe:
# mock data set
df.size = 10
cluster.id<- sample(c(1:5), df.size, replace = TRUE)
letters <- sample(LETTERS[1:5], df.size, replace = TRUE)
test.set <- data.frame(cluster.id, letters)
Will be something like:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Now I want to group these per cluster.id and see which letters I can find within a cluster; for example, cluster 3 contains the letters A, E, D, C. Then I want to get all unique pairwise combinations (but not combinations with itself, so no A,A): A,E; A,D; A,C; etc. Then I want to update the pairwise distance for these combinations in an adjacency matrix/data frame.
Idea
# group by cluster.id
# per group get all (unique) pairwise combinations for the letters (excluding pairwise combinations with itself, e.g. A,A)
# update adjacency for each pairwise combinations
What I tried
# empty adjacency df
possible <- LETTERS
adj.df <- data.frame(matrix(0, ncol = length(possible), nrow = length(possible)))
colnames(adj.df) <- rownames(adj.df) <- possible
# what I tried
update.adj <- function(data) {
  for (comb in combn(data$letters, 2)) {
    # stuck
  }
}
test.set %>% group_by(cluster.id) %>% update.adj(.)
Probably there is an easy way to do this, because I see adjacency matrices all the time, but I'm not able to figure it out. Please let me know if anything is unclear.
Answer to comment
Answer to @Manuel Bickel:
For the data I gave as example (the table under "will be something like"):
This matrix will be A-->Z for the full dataset, keep that in mind.
A B C D E
A 0 0 1 1 2
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
I will explain what I did:
cluster.id letters
<int> <fctr>
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A
Only the clusters containing more than 1 unique letter are relevant (because we don't want combinations with itself; e.g. cluster 4 contains only the letter B, so it would only yield the combination B,B and is therefore not relevant):
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
Now I look for each cluster what pairwise combinations I can make:
cluster 3:
A,E
A,D
A,C
E,D
E,C
D,C
Update these combinations in the adjacency matrix:
A B C D E
A 0 0 1 1 1
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 1 0 1 1 0
Then go to the next cluster
cluster 2
A,E
Update the adjacency matrix again:
A B C D E
A 0 0 1 1 2 <-- note the 2 now
B 0 0 0 0 0
C 1 0 0 1 1
D 1 0 1 0 1
E 2 0 1 1 0
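For reference, a naive direct implementation of this walkthrough (a sketch completing the stuck update.adj above; the answers below are the vectorized alternatives):
adj <- matrix(0, 5, 5, dimnames = list(LETTERS[1:5], LETTERS[1:5]))
update.adj <- function(letters.in.cluster, adj) {
  u <- unique(as.character(letters.in.cluster))
  if (length(u) > 1) {
    pairs <- combn(u, 2)                 # all unique pairs, no self-pairs
    for (j in seq_len(ncol(pairs))) {
      a <- pairs[1, j]; b <- pairs[2, j]
      adj[a, b] <- adj[a, b] + 1         # update both triangles to keep the matrix symmetric
      adj[b, a] <- adj[b, a] + 1
    }
  }
  adj
}
for (cl in split(test.set$letters, test.set$cluster.id)) adj <- update.adj(cl, adj)
adj  # matches the final matrix above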
As a reaction to the huge dataset:
library(reshape2)
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x1 <- reshape2::dcast(test.set, cluster.id ~ letters)
x1
#cluster.id A B C D E
#1 1 1 0 0 0 0
#2 2 1 0 0 0 1
#3 3 1 0 1 1 1
#4 4 0 2 0 0 0
#5 5 1 0 0 0 0
x2 <- table(test.set)
x2
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
x1.c <- crossprod(x1)
#Error in crossprod(x, y) :
# requires numeric/complex matrix/vector arguments
x2.c <- crossprod(x2)
#works fine
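The crossprod error occurs because dcast returns a data.frame that still contains the cluster.id column; a sketch of a fix (assuming x1 from above):
x1.m <- as.matrix(x1[, -1])  # drop cluster.id and coerce to a numeric matrix
x1.c <- crossprod(x1.m)      # now identical to crossprod(x2)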
Following the above comment, here is the code of Tyler Rinker used with your data. I hope this is what you want.
UPDATE: Following the comments below, I added a solution using the package reshape2 in order to handle larger amounts of data.
test.set <- read.table(text = "
cluster.id letters
1 5 A
2 4 B
3 4 B
4 3 A
5 3 E
6 3 D
7 3 C
8 2 A
9 2 E
10 1 A", header = T, stringsAsFactors = F)
x <- table(test.set)
x
# letters
#cluster.id A B C D E
# 1 1 0 0 0 0
# 2 1 0 0 0 1
# 3 1 0 1 1 1
# 4 0 2 0 0 0
# 5 1 0 0 0 0
#base approach, based on answer by Tyler Rinker
x <- crossprod(x)
diag(x) <- 0 #this is to set matches such as AA, BB, etc. to zero
x
# letters
# letters
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
#reshape2 approach
x <- acast(test.set, cluster.id ~ letters)
x <- crossprod(x)
diag(x) <- 0
x
# A B C D E
# A 0 0 1 1 2
# B 0 0 0 0 0
# C 1 0 0 1 1
# D 1 0 1 0 1
# E 2 0 1 1 0
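A small note on the reshape2 line: with no value.var given, acast falls back to counting and prints "Aggregation function missing: defaulting to length". Passing the aggregation explicitly makes that intent visible (a stylistic assumption on my part):
x <- acast(test.set, cluster.id ~ letters, fun.aggregate = length)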

Summing labels line-section by line in R

I have a large dataframe of 34,000 rows x 24 columns, each of which contains a category label. I would like to efficiently go through the dataframe and count how many times each label appears in each section of a row, including 0s.
(I've used a for loop driving a length(which()) statement, which wasn't terribly efficient.)
Example:
df.test<-as.data.frame(rbind(c("A", "B", "C","B","A","A"),c("C", "C", "C","C","C","C"), c("A", "B", "B","A","A","A")))
df.res<-as.data.frame(matrix(ncol=6, nrow=3))
Let's say columns 1:3 in df.test are from one dataset, 4:6 from the other. What is the most efficient way to generate df.res to show this:
A B C A B C
1 1 1 2 1 0
0 0 3 0 0 3
1 2 0 3 0 0
A way - using a lot of *apply functions - is the following:
#list with the different data frames
df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[,x:(x+2)], simplify = F)
#count each category
df.res <- do.call(cbind,
  lapply(df_ls, function(df.) {
    t(apply(df., 1, function(x) table(factor(unlist(x), levels = c("A", "B", "C")))))
  }))
#> df.res
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
Simulating a dataframe like the one you described:
DF <- data.frame(replicate(24, sample(LETTERS[1:3], 34000, T)), stringsAsFactors = F)
#> head(DF)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
#1 B C C C B A C B B A C C B C B B B C B C C B B C
#2 C B C A B C B C A B A C B B A A C A B B B C A B
#3 B C C A A A C A C A A A B B A A A C B B A C C C
#4 C C A B A B B B A A A C C A B A C C A C C C B A
#5 B B A A A A C A B B A B B A C A A A C A A C B C
#6 C A C C A B B C C C B C A B B B B B A C A A B A
#> dim(DF)
#[1] 34000 24
DF_ls <- sapply(seq(1, ncol(DF), 3), function(x) DF[,x:(x+2)], simplify = F)
system.time(
DF.res <- do.call(cbind,
lapply(DF_ls, function(df.) { t(apply(df., 1,
function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) })))
#user system elapsed
#59.84 0.07 60.73
#> head(DF.res)
# A B C A B C A B C A B C A B C A B C A B C A B C
#[1,] 0 1 2 1 1 1 0 2 1 1 0 2 0 2 1 0 2 1 0 1 2 0 2 1
#[2,] 0 1 2 1 1 1 1 1 1 1 1 1 1 2 0 2 0 1 0 3 0 1 1 1
#[3,] 0 1 2 3 0 0 1 0 2 3 0 0 1 2 0 2 0 1 1 2 0 0 0 3
#[4,] 1 0 2 1 2 0 1 2 0 2 0 1 1 1 1 1 0 2 1 0 2 1 1 1
#[5,] 1 2 0 3 0 0 1 1 1 1 2 0 1 1 1 3 0 0 2 0 1 0 1 2
#[6,] 1 0 2 1 1 1 0 1 2 0 1 2 1 2 0 0 3 0 2 0 1 2 1 0
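Given the ~60 s timing above, here is a sketch of a possibly faster variant reusing the rowSums(DF == level) idea from the first answer on this page (not benchmarked here; lv is assumed to cover all categories):
lv <- c("A", "B", "C")
DF.res2 <- do.call(cbind, lapply(DF_ls, function(d)
  sapply(lv, function(l) rowSums(d == l, na.rm = TRUE))))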
EDIT Some more comments on the approach.
I'll do the above step by step.
The first step is to subset the different dataframes that are bound together; each one of those dataframes is put in a list. The function function(x) df.test[, x:(x+2)] subsets the whole dataframe based on the values of x in seq(1, ncol(df.test), 3), and simplify = F makes sapply return a list instead of simplifying. Extending this, if your different dataframes were 4 columns apart, 3 would be changed to 4 in the above sequence.
#> df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[,x:(x+2)], simplify = F)
#> df_ls
#[[1]]
# V1 V2 V3
#1 A B C
#2 C C C
#3 A B B
#[[2]]
# V4 V5 V6
#1 B A A
#2 C C C
#3 A A A
The next step is to lapply to the - previously made - list a function that counts each category in each row of one dataframe (i.e. element of the list). The function is this: t(apply(df., 1, function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })). The inner function (function(x)) turns one row into a factor whose levels are all the categories and counts (table) the number of times each category occurred in that row. apply applies this function to each row (MARGIN = 1) of the dataframe. So, now, we have counted the frequency of each category in each row of one dataframe.
#> table(factor(unlist(df_ls[[1]][3,]), levels = c("A", "B", "C")))
#df_ls[[1]][3,] is the third row of the first dataframe of df_ls
#(i.e. _one_ row of _one_ dataframe)
#A B C
#1 2 0
#> apply(df_ls[[1]], 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })
# [,1] [,2] [,3] #df_ls[[1]] is the first dataframe of df_ls (i.e. _one_ dataframe)
#A 1 0 1
#B 1 0 2
#C 1 3 0
Because the return value of apply is not in the wanted form, we use t to swap rows and columns.
The next step is to lapply all the above to each dataframe (i.e. element of the list).
#> lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) })
#[[1]]
# A B C
#[1,] 1 1 1
#[2,] 0 0 3
#[3,] 1 2 0
#[[2]]
# A B C
#[1,] 2 1 0
#[2,] 0 0 3
#[3,] 3 0 0
The last step is to cbind all those elements together. The way to bind all the elements of a list by column is to do.call cbind on that list.
#NOT the expected result, using only cbind
#> cbind(lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# [,1]
#[1,] Integer,9
#[2,] Integer,9
#Correct!
#> do.call(cbind, lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
