Frequency table comparison using R

I have two frequency tables created using R's table() function:
freq1 <- table(unlist(strsplit(topic_list1, split=";")))
freq2 <- table(unlist(strsplit(topic_list2, split=";")))
topic_list1 and topic_list2 are strings containing textual representations of topics separated by ;.
I want a way to compare the two frequencies, graphically if possible.
So if the two lists contain the same topic with different frequencies, I would like to be able to see it. The same goes for topics present in one frequency table, but not in the other.

There's probably a more elegant way to do this, but this ought to work:
# here I'm generating some example data
set.seed(5)
# collapse (not sep) glues the samples into a single ";"-separated string
topic_list1 <- paste(sample(letters, 20, replace=TRUE), collapse=";")
topic_list2 <- paste(sample(letters, 15, replace=TRUE), collapse=";")
# I don't make the tables right away
tl1 <- unlist(strsplit(topic_list1, split=";"))
tl2 <- unlist(strsplit(topic_list2, split=";"))
big_list <- unique(c(tl1, tl2))
# this computes your frequencies: row i of each tMat holds topic i from
# big_list followed by a full copy of the topic vector, so counting
# matches within each row gives that topic's frequency
lbl <- length(big_list)
tMat1 <- matrix(rep(tl1, lbl), byrow=TRUE, nrow=lbl)
tMat2 <- matrix(rep(tl2, lbl), byrow=TRUE, nrow=lbl)
tMat1 <- cbind(big_list, tMat1)
tMat2 <- cbind(big_list, tMat2)
counts1 <- apply(tMat1, 1, function(x){sum(x[1]==x[2:length(x)])})
counts2 <- apply(tMat2, 1, function(x){sum(x[1]==x[2:length(x)])})
total_freqs <- rbind(counts1, counts2, counts1-counts2)
# this makes it nice looking & user friendly
colnames(total_freqs) <- big_list
rownames(total_freqs) <- c("topics1", "topics2", "difference")
total_freqs <- total_freqs[ ,order(total_freqs[3,])]
total_freqs
            d  l  a  z  b  f  s  y  m  r  x  h  n  i  g  k  c  v  o
topics1     0  0  0  0  0  2  1  1  1  1  2  2  1  1  1  1  2  2  2
topics2     2  2  2  1  1  2  1  1  1  0  1  1  0  0  0  0  0  0  0
difference -2 -2 -2 -1 -1  0  0  0  0  1  1  1  1  1  1  1  2  2  2
From there you can use the numbers directly or visualize them however you want (e.g., dotplots). Here's a simple dotplot:
windows()  # opens a new graphics device on Windows; use dev.new() elsewhere
dotchart(t(total_freqs)[,3], main="Frequencies of topics1 - topics2")
abline(v=0)
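For what it's worth, the same frequencies can be had more directly by tabulating both vectors over the union of their topics, so the two tables line up without the matrix juggling (a sketch of mine using tl1, tl2 and big_list from above):
# coerce both vectors to factors with a shared level set, then tabulate
f1 <- table(factor(tl1, levels=big_list))
f2 <- table(factor(tl2, levels=big_list))
rbind(topics1=f1, topics2=f2, difference=f1-f2)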

You can simply barplot them (with the beside=TRUE argument), which gives you a way to visually compare the counts per level. Below is an example using the built-in mtcars data:
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, col=c("darkblue","red"), legend=rownames(counts), beside=TRUE)
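Applied to the topic data from the first answer, the same idea might look like this (a sketch, assuming the total_freqs matrix built above):
# pairs of bars per topic: list 1 vs list 2
barplot(total_freqs[1:2, ], beside=TRUE, col=c("darkblue","red"),
        legend=rownames(total_freqs)[1:2])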

Related

Creating new columns with combinations of string patterns in R

I have a data frame in which one column holds a lengthy string whose elements are separated by _. I want to count the occurrences of certain patterns and combinations in that string: specifically, events A and B but nothing else.
If A and B occur together, as A_B or B_A, on their own or repeated n times, I want to count them, along with every occurrence of those longer combinations.
Example data frame:
participant <- c("A", "B", "C")
trial <- c(1,1,2)
string_pattern <- c("A_B_A_C_A_B", "B_A_B_A_C_D_A_B", "A_B_C_A_B")
df <- data.frame(participant, trial, string_pattern)
Expected output:
   participant trial  string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1.           A     1     A_B_A_C_A_B   2   1     1     0       0
2.           B     1 B_A_B_A_C_D_A_B   2   2     1     1       1
3.           C     2       A_B_C_A_B   2   0     0     0       0
My code:
revised_df <- df %>%
  dplyr::mutate(A_B = stringr::str_count(string_pattern, "A_B"),
                B_A = stringr::str_count(string_pattern, "B_A"),
                B_A_B = stringr::str_count(string_pattern, "B_A_B"))
My approach gets complicated as the number of combinations increases. Hence, looking for a better solution.
You could write a function to solve this:
m <- function(s){
  # enumerate every substring of s with length >= 2
  a <- seq(nchar(s) - 1)
  start <- rep(a, rev(a))
  stop <- ave(start, start, FUN = \(x) seq_along(x) + x)
  b <- substring(s, start, stop)
  # re-insert the underscores between the characters
  gsub('(?<=\\B)|(?=\\B)', '_', b, perl = TRUE)
}
n <- function(x){
  names(x) <- x
  # collapse runs of letters other than A/B into a break, drop the
  # remaining underscores, and split at the breaks
  a <- strsplit(gsub("_", '', gsub("_[^AB]+_", ':', x)), ':')
  # list all substrings per piece, then tabulate them per original string
  b <- t(table(stack(lapply(a, \(y) unlist(sapply(y, m))))))
  data.frame(pattern = x, as.data.frame.matrix(b), row.names = NULL)
}
n(string_pattern)
          pattern A_B A_B_A B_A B_A_B B_A_B_A
1     A_B_A_C_A_B   2     1   1     0       0
2 B_A_B_A_C_D_A_B   2     1   2     1       1
3       A_B_C_A_B   2     0   0     0       0
Try this: it creates the pattern columns up front, then checks each string row for the current column name.
library(dplyr)
library(stringr)
df |>
  mutate(A_B = 0, B_A = 0, A_B_A = 0, B_A_B = 0, B_A_B_A = 0) |>
  mutate(across(A_B:B_A_B_A, ~ str_count(string_pattern, cur_column())))
  participant trial  string_pattern A_B B_A A_B_A B_A_B B_A_B_A
1           A     1     A_B_A_C_A_B   2   1     1     0       0
2           B     1 B_A_B_A_C_D_A_B   2   2     1     1       1
3           C     2       A_B_C_A_B   2   0     0     0       0
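One caveat worth adding (mine, not from either answer): str_count() counts non-overlapping matches, so a pattern such as B_A_B is found only once in B_A_B_A_B. If overlapping occurrences should count, wrap the pattern in a zero-width lookahead:
library(stringr)
str_count("B_A_B_A_B", "B_A_B")      # 1: matches consume characters
str_count("B_A_B_A_B", "(?=B_A_B)")  # 2: lookahead matches can overlap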

How to assign 1s and 0s to columns depending on whether the variable in a row matches x in R

I'm an absolute beginner in coding and R, and this is my third week doing it for a project (for biologists: I'm trying to find the sum of risk alleles for PRS), but I need help with this part.
df
  x y z
1 t c a
2 a t a
3 g g t
so when the code is applied:
  x y z
1 t 0 0
2 a 0 1
3 g 1 0
I'm trying to make it so that if a value in y or z matches the value of x in the same row it changes to 1, and if not, to 0.
I started with:
for(i in 1:ncol(df)){
  df[, i] <- df[df$x == df[,i], df[ ,i] <- 1]
}
But got all NA values
In reality, I have 100 columns I have to compare with x in the data frame. Any help is appreciated
An alternative way to do this is by using ifelse() in base R.
df$y <- ifelse(df$y == df$x, 1, 0)
df$z <- ifelse(df$z == df$x, 1, 0)
df
# x y z
#1 t 0 0
#2 a 0 1
#3 g 1 0
Edit: to extend this step to all columns efficiently, consider the following example.
df1
# x y z w
#1 t c a t
#2 a t a a
#3 g g t m
To edit many columns efficiently, a better approach is to write a function and apply it to all targeted columns of the data frame. Here is a simple function that does the work:
edit_col <- function(any_col) ifelse(any_col == df1$x, 1, 0)
This function takes a single column, compares its elements with the elements of df1$x, and returns the edited column. To apply it to all targeted columns, use apply(). Because in your case x is not a targeted column, exclude it with the index [ , -1], since it is the first column of df1.
# Here number 2 indicates columns. Use number 1 for rows.
df1[, -1] <- apply(df1[,-1], 2, edit_col)
df1
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
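One caution of my own: apply() first coerces the data frame to a matrix, which is harmless here because every column is character, but sapply() over the columns gives the same result without the coercion:
# same result, column by column, no matrix coercion
df1[, -1] <- sapply(df1[, -1], edit_col)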
Of course you can also define a function that edits the whole data frame, so you don't need to call apply() manually. Here is an example of such a function:
edit_df <- function(any_df){
  edit_col <- function(any_col) ifelse(any_col == any_df$x, 1, 0)
  # Create a vector containing the names of all targeted columns.
  target_col_names <- setdiff(colnames(any_df), "x")
  any_df[, target_col_names] <- apply(any_df[, target_col_names], 2, edit_col)
  return(any_df)
}
Then use the function:
edit_df(df1)
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
A tidyverse approach
library(dplyr)
df <-
  tibble(
    x = c("t","a","g"),
    y = c("c","t","g"),
    z = c("a","a","t")
  )
df %>%
  mutate(
    across(
      .cols = c(y, z),
      .fns = ~ if_else(. == x, 1, 0)
    )
  )
# A tibble: 3 x 3
  x         y     z
  <chr> <dbl> <dbl>
1 t         0     0
2 a         0     1
3 g         1     0
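A compact base R equivalent (a sketch of mine, relying on the fact that comparing a data frame with a vector recycles the vector down each column):
# logical matrix of matches, converted to 1/0 by unary +
df[, -1] <- +(df[, -1] == df$x)
df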

Confusion matrix using table in k-means and hierarchical clustering

I have some problems with calculating a confusion matrix. I have created three sets of points from multivariate normal distributions:
library('MASS')
library('ggplot2')
library('reshape2')
library("ClusterR")
library("cluster")
library("dplyr")
library ("factoextra")
library("dendextend")
library("circlize")
mu1<-c(1,1)
mu2<-c(1,-9)
mu3<-c(-7,-2)
sigma1<-matrix(c(1,1,1,2), nrow=2, ncol=2, byrow = TRUE)
sigma2<-matrix(c(1,-1,-1,2), nrow=2, ncol=2, byrow = TRUE)
sigma3<-matrix(c(2,0.5,0.5,0.3), nrow=2, ncol=2, byrow = TRUE)
simulation1<-mvrnorm(100,mu1,sigma1)
simulation2<-mvrnorm(100,mu2,sigma2)
simulation3<-mvrnorm(100,mu3,sigma3)
X<-rbind(simulation1,simulation2,simulation3)
colnames(X)<-c("x","y")
X<-data.frame(X)
I have also constructed clusters using k-means clustering and hierarchical clustering with k initial centers (k=3):
# k-means clustering
k<-3
B<-kmeans(X, centers = k, nstart = 10)
x_cluster = data.frame(X, group=factor(B$cluster))
ggplot(x_cluster, aes(x, y, color = group)) + geom_point()
# hierarchical clustering
single<-hclust(dist(X), method = "single")
clusters2<-cutree(single, k = 3)
fviz_cluster(list(data = X, cluster = clusters2))
How can I calculate the confusion matrix for the full dataset (X) using table() in both of these cases?
Using your data, insert set.seed(42) just before you create sigma1 so that we have a reproducible example. Then, after you have created X:
X.df <- data.frame(Grp=rep(1:3, each=100), x=X[, 1], y=X[, 2])
k <- 3
B <- kmeans(X, centers = k, nstart = 10)
table(X.df$Grp, B$cluster)
#
# 1 2 3
# 1 1 0 99
# 2 0 100 0
# 3 100 0 0
Original group 1 is identified as cluster 3, with one specimen assigned to cluster 1. Original group 2 maps to cluster 2 and original group 3 to cluster 1. The cluster numbers themselves are arbitrary: the classification is perfect if each row and column has all of its counts in a single cell. In this case only one specimen was misplaced.
single <- hclust(dist(X), method = "single")
clusters2 <- cutree(single, k = 3)
table(X.df$Grp, clusters2)
# clusters2
# 1 2 3
# 1 99 1 0
# 2 0 0 100
# 3 0 100 0
The results are the same, but the cluster numbers differ. Again, one specimen from the original group 1 was assigned to the same cluster as the group 3 specimens. To compare the two clusterings with each other:
table(Kmeans=B$cluster, Hierarch=clusters2)
# Hierarch
# Kmeans 1 2 3
# 1 0 101 0
# 2 0 0 100
# 3 99 0 0
Notice that each row and column contains only one nonzero cell: the two cluster analyses agree with one another even though the cluster designations differ. For contrast, here is what a supervised method, linear discriminant analysis from MASS, does with the same data:
D <- lda(Grp~x + y, X.df)
table(X.df$Grp, predict(D)$class)
#
# 1 2 3
# 1 99 0 1
# 2 0 100 0
# 3 0 0 100
Linear discriminant analysis tries to predict the group number from the values of x and y. Because of this, the group labels are not arbitrary, and correct predictions all fall on the diagonal of the table. This is what is usually described as a confusion matrix.
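As a side note (my addition, not part of the original answer): if you want the unsupervised confusion matrices themselves to line up on the diagonal, you can relabel each cluster by the true group it mostly contains, e.g. for the k-means result:
tab <- table(X.df$Grp, B$cluster)
map <- apply(tab, 2, which.max)  # cluster number -> dominant true group
table(X.df$Grp, map[B$cluster])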

Get variable combination matrix

Data
We have numerous text strings that look like this (way longer in our real dataset):
df <- data.frame(
  id = c('text1','text2','text3'), text = c('ABA','ABA','AAA')
)
>df
id text
1 text1 ABA
2 text2 ABA
3 text3 AAA
We want to create a matrix that tells how often a letter at position x is found together with the other letters at other positions, so in this case:
3A  3  1  2  3
2B  2  0  2  2
2A  1  1  0  1
1A  3  1  2  3
   1A 2A 2B 3A
What I tried
I previously converted the data to a binary matrix, which looks like this:
structure(list(pos1_A = c(1, 1, 1), pos2_A = c(0, 0, 1), pos2_B = c(1,
1, 0), pos3_A = c(1, 1, 1)), class = "data.frame", row.names = c("text1",
"text2", "text3"))
pos1_A pos2_A pos2_B pos3_A
text1 1 0 1 1
text2 1 0 1 1
text3 1 1 0 1
Then I can run commands like cor() to get correlations; however, instead of correlations I want the co-occurrence frequencies.
Note this is different from questions about co-occurrences wherein the variable name itself (here position) is neglected, for example like "How to use R to create a word co-occurrence matrix"
Huge credit to @Ronak Shah for the answer here.
It's much simpler if we convert the categorical data to a numeric binary matrix, for example in this hacky but easy way using the homals package, and then apply the method by @Ronak Shah linked above:
library(magrittr)  # for %>%
library(tibble)    # for column_to_rownames()
# The dataset
df <- data.frame(
  id = c('text1','text2','text3'), text = c('ABA','ABA','AAA')
)
# Split the strings into characters and add column names
df2 <- df %>%
  splitstackshape::cSplit('text', sep = '', stripWhite = FALSE,
                          type.convert = FALSE, direction = 'wide') %>%
  column_to_rownames('id')
colnames(df2) <- paste0('pos', 1:ncol(df2))
# Convert to binary matrix (hacky way)
bin.mat <- homals:::expandFrame(df2, clean = F)
# Method by #Ronak Shah to get the frequency matrix
fun <- function(x, y) sum(bin.mat[, x] & bin.mat[, y])
n <- seq_along(bin.mat)
mat <- outer(n, n, Vectorize(fun))
dimnames(mat) <- list(names(bin.mat)[n], names(bin.mat)[n])
This produces the matrix:
> mat
       pos1_A pos2_A pos2_B pos3_A
pos1_A      3      1      2      3
pos2_A      1      1      0      1
pos2_B      2      0      2      2
pos3_A      3      1      2      3
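Incidentally, since bin.mat is a 0/1 matrix, the same result falls out of a single cross-product, which may be easier to remember than the outer()/Vectorize() construction (a sketch assuming bin.mat from above):
# for binary data, t(B) %*% B counts the rows where both columns are 1
crossprod(as.matrix(bin.mat))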
Here's an alternative approach that produces a matrix as originally requested:
# Make all strings the same length:
df$text <- stringr::str_pad(df$text, side = "right", max(nchar(df$text)))
# Create a matrix with all letters labelled by their position:
all_vals <- apply(do.call(rbind, strsplit(df$text, "")), 1,
                  function(x) paste0(seq_along(x), x))
# Create a vector of all possible letter / position combos
all_labs <- do.call(paste0, expand.grid(seq(max(nchar(df$text))),
                                        unique(unlist(strsplit(df$text, "")))))
# Create a function that will count all co-occurences per data frame row
f <- function(y, x) as.vector(outer(x, x, function(a, b) 1 * (a %in% y & b %in% y)))
# Create the results matrix and label it
m <- matrix(rowSums(apply(as.data.frame(all_vals), 2, f, all_labs)), nrow = length(all_labs))
rownames(m) <- all_labs
colnames(m) <- all_labs
m
#> 1A 2A 3A 1B 2B 3B
#> 1A 3 1 3 0 2 0
#> 2A 1 1 1 0 0 0
#> 3A 3 1 3 0 2 0
#> 1B 0 0 0 0 0 0
#> 2B 2 0 2 0 2 0
#> 3B 0 0 0 0 0 0
Created on 2020-05-24 by the reprex package (v0.3.0)
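If the all-zero rows and columns (position/letter combinations that never occur, such as 1B and 3B above) are unwanted, they can be dropped afterwards:
m[rowSums(m) > 0, colSums(m) > 0, drop=FALSE]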

List with output from for loop returns empty

I have written code to obtain crosstab results of a RasterStack for different regions (delimited by a shapefile) covering the raster. However, I am getting an empty list.
This is the function:
transitions <- function(bound, themat) { # bound = shapefile, themat = rasterstack
  result = vector("list", nrow(bound)) # empty result list
  names(result) = bound@data$GEOCODIGO
  for (i in 1:nrow(bound)) { # the number of polygons to iterate through
    single <- bound[i,] # selects a single polygon
    clip <- mask(crop(themat, single), single) # crops the raster to the polygon boundary
    result[i] <- crosstab(clip, digits = 0, long = FALSE, useNA = FALSE)
    return(result)
  }
}
I have tested the steps for the first object in the shapefile outside of the for loop, and they worked well. But I still cannot figure out why I am getting an empty list. Any ideas?
Example data:
library(raster)
p <- shapefile(system.file("external/lux.shp", package="raster"))
b <- brick(raster(p), nl=2)
values(b) <- sample(2, 200, replace=TRUE)
Fixed function. The original had two problems: return(result) sat inside the for loop, so the function returned after the first polygon, and list elements must be assigned with [[i]], not [i]:
transitions <- function(poly, rast) {
  result = vector("list", nrow(poly))
  for (i in 1:nrow(poly)) {
    clip <- mask(crop(rast, poly[i,]), poly[i,])
    result[[i]] <- crosstab(clip, digits = 0, long = FALSE, useNA = FALSE)
  }
  return(result)
}
transitions(p, b)
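A small addition of mine: if the polygons carry an identifier (GEOCODIGO in the original shapefile), naming the returned list keeps each table traceable to its polygon:
res <- transitions(p, b)
names(res) <- p$NAME_2  # the example data has NAME_2; the original data would use GEOCODIGO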
An alternative would be to use extract:
e <- extract(b, p)
To tabulate as in crosstab:
ee <- lapply(e, function(x) aggregate(data.frame(count=rep(1, nrow(x))), data.frame(x), FUN=sum))
To understand that last line, you need to unpack it.
class(e)
#[1] "list"
length(e)
#[1] 12
e[[1]]
# layer.1 layer.2
#[1,] 1 1
#[2,] 1 2
#[3,] 2 2
#[4,] 2 1
#[5,] 2 1
#[6,] 1 2
#[7,] 2 2
e is a list with the same length as the number of polygons (see length(p))
Let's take the first element and aggregate it to get a table with cases and counts.
x <- e[[1]]
aggregate(data.frame(count=rep(1, nrow(x))), data.frame(x), FUN=sum)
# layer.1 layer.2 count
#1 1 1 1
#2 2 1 2
#3 1 2 2
#4 2 2 2
A similar approach via table() (the difference is that you could get Freq values that are zero):
as.data.frame(table(x[,1], x[,2]))
# Var1 Var2 Freq
#1 1 1 1
#2 2 1 2
#3 1 2 2
#4 2 2 2
Now wrap the function you like in an lapply:
z <- lapply(e, function(x) aggregate(data.frame(count=rep(1, nrow(x))), data.frame(x), FUN=sum))
And to take it further, bind the data.frames and add an identifier to link the data back to the polygons:
y <- do.call(rbind, z)
y$id <- rep(1:length(z), sapply(z, nrow))
head(y)
# Var1 Var2 Freq id
#1 1 1 1 1
#2 2 1 2 1
#3 1 2 2 1
#4 2 2 2 1
#5 1 1 1 2
#6 2 1 2 2
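For the final binding step, dplyr users might prefer bind_rows(), which adds the identifier in one go (a sketch, assuming dplyr is acceptable):
library(dplyr)
y <- bind_rows(z, .id="id")  # .id labels rows by list position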
