R: Frequency table with condition

I have a data frame with two columns, "CaseID" and "Event", and want to know how often event X is directly followed by event Y. I am only interested in consecutive events that share the same CaseID.
The code
df <- data.frame(CaseID = c(1,1,1,2,2,2,3,3,3),
Event = c("A","B","C","A","B","D","B","C","E"))
df
table(df[1:nrow(df) -1, 2], df[2:nrow(df), 2])
results in
CaseID Event
1 1 A
2 1 B
3 1 C
4 2 A
5 2 B
6 2 D
7 3 B
8 3 C
9 3 E
A B C D E
A 0 2 0 0 0
B 0 0 2 1 0
C 1 0 0 0 1
D 0 1 0 0 0
E 0 0 0 0 0
The transitions C -> A and D -> B cross from one CaseID to the next and should be 0, so what I am looking for is
B C D E
A 2 0 0 0
B 0 2 1 0
C 0 0 0 1
D 0 0 0 0
E 0 0 0 0
Is there an elegant way to add a condition to the table command, based on two consecutive rows?
Ben

We can tabulate only those consecutive events that share the same CaseID:
> x <- diff(df$CaseID) == 0
> table(df[1:nrow(df) -1, 2][x], df[2:nrow(df), 2][x])
A B C D E
A 0 2 0 0 0
B 0 0 2 1 0
C 0 0 0 0 1
D 0 0 0 0 0
E 0 0 0 0 0
If CaseID might be non-numeric, compare adjacent values directly instead of using diff:
x <- df$CaseID[-1] == df$CaseID[-length(df$CaseID)]
table(df[1:nrow(df) -1, 2][x], df[2:nrow(df), 2][x])
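The indexing above works because `1:nrow(df) -1` parses as `(1:nrow(df)) - 1`, and the resulting index 0 is silently dropped. On R 4.0 or later, `data.frame()` no longer converts strings to factors by default, so the table may also lose its all-zero rows and columns. The following is a minimal sketch of the same idea with explicit negative indexing and fixed factor levels; the names `same_case`, `lev`, `from`, `to` and `tab` are illustrative and not part of the original answer.

```r
df <- data.frame(CaseID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 Event  = c("A", "B", "C", "A", "B", "D", "B", "C", "E"))

n <- nrow(df)

# Row i is paired with row i + 1; keep a pair only if both rows share a CaseID.
# Using == instead of diff() also works for non-numeric CaseIDs.
same_case <- df$CaseID[-n] == df$CaseID[-1]

# Fixed factor levels keep events that never occur in one position
# (for example E as a predecessor) in the table as all-zero rows/columns.
lev  <- sort(unique(df$Event))
from <- factor(df$Event[-n], levels = lev)[same_case]
to   <- factor(df$Event[-1], levels = lev)[same_case]

tab <- table(from, to)
tab
```

This assumes the rows are already ordered so that events of one case are adjacent; if they are not, the data would need to be sorted by CaseID (and event order) first.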

Related

R: Squared contingency table [duplicate]

I want to make a contingency table of observations and their predictions from a neural network. Since I want the correct predictions to be on the diagonal, I would like my table to be square, regardless of whether some rows contain only 0s. That is, I would like to have
b
a a b c d e f g
a 1 0 1 0 2 1 0
b 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
f 0 0 0 0 0 0 0
g 4 2 1 0 3 1 0
Instead of:
> set.seed(1)
> b<-sample(letters[1:7],40,rep=TRUE)
> a<-sample(letters[1:4],40,rep=TRUE)
>
> table(a,b)
b
a a b c d e f g
a 1 0 1 0 2 1 0
d 2 3 1 2 2 3 2
e 1 2 1 1 0 1 3
g 4 2 1 0 3 1 0
How can I do this?
Convert a and b to factors whose levels are the union of both:
tmp <- sort(union(a, b))
table(factor(a, levels = tmp), factor(b, levels = tmp))
# a b c d e f g
# a 0 1 1 2 2 1 4
# b 2 1 1 1 2 3 2
# c 4 0 1 2 0 1 1
# d 0 1 1 1 3 1 1
# e 0 0 0 0 0 0 0
# f 0 0 0 0 0 0 0
# g 0 0 0 0 0 0 0

How could I calculate the sparsity of a data.frame in R?

I have a data.frame structured like this:
A B C D E
F 1 0 7 0 0
G 0 0 0 1 1
H 1 1 0 0 0
I 1 2 1 0 0
L 1 0 0 0 0
and I want to calculate the sparsity (i.e. the proportion of 0 values) of this data.frame.
How can I do this?
sum(df == 0)/(dim(df)[1]*dim(df)[2])
[1] 0.6
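An equivalent and slightly shorter form takes the mean of the logical comparison. The sketch below rebuilds the example data from the question so it can be run on its own; the variable name `sparsity` is only illustrative.

```r
df <- data.frame(A = c(1, 0, 1, 1, 1),
                 B = c(0, 0, 1, 2, 0),
                 C = c(7, 0, 0, 1, 0),
                 D = c(0, 1, 0, 0, 0),
                 E = c(0, 1, 0, 0, 0),
                 row.names = c("F", "G", "H", "I", "L"))

# df == 0 yields a logical matrix; the mean of logical values is the
# proportion of TRUE entries, i.e. the share of cells equal to zero.
sparsity <- mean(df == 0)
sparsity
```

If the data may contain missing values, `mean(df == 0, na.rm = TRUE)` ignores them; whether NA cells should count toward the denominator is a choice the question does not settle.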

Many categorical variables at the same time in a matrix

I've collected data as follows:
A B C D E F G
1 1 0 0 0 0 0 0
1,2 0 1 0 0 0 0 2
1,2,3 0 0 0 0 0 0 0
1,3 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
2,3 4 0 0 0 5 0 0
3 1 3 0 0 0 2 0
4 0 0 0 0 0 0 0
Each color (A, B, C, D, E, F, G) corresponds to one or more categories (1, 2, 3, 4) at the same time, depending on the sample. Multiple categories are separated by commas.
I want to simplify my data so that it looks as follows:
A B C D E F G
1 1 1 0 0 0 0 2
3 4 0 0 0 5 2 0
2 4 1 0 0 5 0 2
4 0 0 0 0 0 0 0
Is there a simple way (a function) to do this?
Reproducible example:
DF <- read.table(text = " Color Cat
A 1
B 1
C 4,2
D 1,3
E 1,2
F 3
G 5
A 2
B 3
C 1,2
D 4,3
E 3
F 1
G 1" , header = TRUE)
DF = table(DF$Cat,DF$Color)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)
DF <- read.table(text = " A B C D E F G
1 1 0 0 0 0 0 0
1,2 0 1 0 0 0 0 2
1,2,3 0 0 0 0 0 0 0
1,3 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
2,3 4 0 0 0 5 0 0
3 1 3 0 0 0 2 0
4 0 0 0 0 0 0 0", header = TRUE)
#split the row names
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
#repeat each row of the DF times the number of cats
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
#add column with cats
DF$cat <- unlist(cats)
#aggregate (your question is unclear regarding how)
DF <- aggregate(. ~ cat, DF, FUN = sum) #or FUN = max???
# cat A B C D E F G
#1 1 1 1 0 0 0 0 2
#2 2 4 1 0 0 5 0 2
#3 3 5 3 0 0 5 2 0
#4 4 0 0 0 0 0 0 0
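Since the answer leaves open whether the rows should be combined with sum or max, the sketch below runs both on the same expanded data so the results can be compared side by side. The names `long`, `by_sum` and `by_max` are illustrative; which aggregation matches the asker's intent is not settled by the question.

```r
DF <- read.table(text = "A B C D E F G
1 1 0 0 0 0 0 0
1,2 0 1 0 0 0 0 2
1,2,3 0 0 0 0 0 0 0
1,3 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
2,3 4 0 0 0 5 0 0
3 1 3 0 0 0 2 0
4 0 0 0 0 0 0 0", header = TRUE)

# Split each row name such as "1,2" into its single categories.
cats <- strsplit(rownames(DF), ",", fixed = TRUE)

# Repeat every row once per category it belongs to, then label the copies.
long <- DF[rep(seq_len(nrow(DF)), lengths(cats)), ]
long$cat <- unlist(cats)

# Two candidate ways to collapse the copies per category.
by_sum <- aggregate(. ~ cat, long, FUN = sum)
by_max <- aggregate(. ~ cat, long, FUN = max)

by_sum
by_max
```

The two results differ only where a category collects more than one non-zero value in a column, for example column A in category 3.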

Split data frame into chunk and assign names to chunks from vectors

I have a data frame with 5*n columns, where n is the number of categories listed in a vector. I want to break the data frame into chunks of 5 columns (e.g. category 1 is columns 1:5, category 2 is columns 6:10) and then assign the category names from the vector to the chunks.
For example:
*original data frame* *vector of category names*
X a b c d e a b c d e a b c d e 1 apples
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 2 oranges
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 3 bananas
Will become
*apples* *oranges* *bananas*
X a b c d e X a b c d e X a b c d e
1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0
2 0 1 0 1 0 2 0 0 1 0 1 2 1 0 0 0 1
I can find a whole lot of information about splitting data.frames by rows, which is much more common to do, but I can't find anything about splitting a data frame into n chunks by columns. Thanks!
You could split your original data frame by column indices in a similar way:
df <- read.table(header = TRUE, check.names = FALSE, text = "
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")
n <- 5 # fixed chunksize (a-e)
lst <- lapply(split(2:ncol(df), rep(seq(ncol(df[-1])/n), each=n)), function(x) df[, x])
names(lst) <- c("apples", "oranges", "bananas")
# lst
# $apples
# a b c d e
# 1 1 0 0 0 1
# 2 0 1 0 1 0
#
# $oranges
# a b c d e
# 1 0 1 0 1 0
# 2 0 0 1 0 1
#
# $bananas
# a b c d e
# 1 0 0 1 1 0
# 2 1 0 0 0 1
I don't know if this is elegant, but it was the first thing that came to mind.
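The desired output in the question repeats the X column in every chunk and takes the names from a vector. A minimal sketch of that variant follows, assuming the category names live in a character vector (called `categories` here, a hypothetical name) and that every category spans exactly `n` columns.

```r
df <- read.table(header = TRUE, check.names = FALSE, text = "
X a b c d e a b c d e a b c d e
1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0
2 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1")

categories <- c("apples", "oranges", "bananas")  # assumed name vector
n <- 5                                           # columns per category

# All column positions except the first (X).
value_cols <- seq_len(ncol(df))[-1]
stopifnot(length(value_cols) == n * length(categories))

# Assign each value column to its category; the factor keeps the
# categories in their original order instead of sorting them.
groups <- factor(rep(categories, each = n), levels = categories)
idx <- split(value_cols, groups)

# Build one data frame per category, each led by the X column.
chunks <- lapply(idx, function(i) df[, c(1, i)])
chunks$oranges
```

Because `split()` names its result after the grouping levels, the list is already named and no separate `names(lst) <- ...` step is needed.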

Delete columns from a square matrix that sum to zero along with corresponding rows

I have a binary transition matrix. I want to delete rows associated with columns that sum to zero. For example, if
A B C D E
A 0 0 0 1 0
B 1 0 0 1 0
C 0 0 1 1 0
D 0 0 1 0 0
E 0 0 1 1 0
columns B and E sum to zero. I know how to get rid of the columns like this:
> a.adj=a[,!!colSums(a)]
> a.adj
A C D
A 0 0 1
B 1 0 1
C 0 1 1
D 0 1 0
E 0 1 1
but how can I delete rows B and E at the same time to get
A C D
A 0 0 1
C 0 1 1
D 0 1 0
If the row names and column names are in the same order:
indx <- !!colSums(a)
a[indx,indx]
# A C D
#A 0 0 1
#C 0 1 1
#D 0 1 0
Alternatively, use the names to select both columns and rows:
> ind <- colnames(a[,!!colSums(a)])
> a[ind, ind]
A C D
A 0 0 1
C 0 1 1
D 0 1 0
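For a version that can be run on its own, the sketch below rebuilds the example matrix and selects by name with `drop = FALSE`, so the result stays a matrix even if only one column survives. The names `keep` and `res` are illustrative.

```r
a <- matrix(c(0, 0, 0, 1, 0,
              1, 0, 0, 1, 0,
              0, 0, 1, 1, 0,
              0, 0, 1, 0, 0,
              0, 0, 1, 1, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Names of the columns whose sum is non-zero.
keep <- colnames(a)[colSums(a) != 0]

# Selecting rows and columns by name does not depend on the rows and
# columns being stored in the same order.
res <- a[keep, keep, drop = FALSE]
res
```

Note that in the reduced matrix column A now sums to zero, because its only 1 sat in the removed row B. Whether the step should be repeated until no zero-sum column remains depends on the use case and is not specified in the question.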
