Making a heatmap of similar rows based on categorical values

In this simple example, I'm trying to count how many students each pair of classes has in common. This is what I came up with, but I'd like to know how to do it without loops, and ideally how to show which students (or positions P1, P2, P3, P4) share a class. If these were numbers, I think a correlation matrix would do it, but given the categorical nature I'm not sure how to go about it other than this.
DF <- (data.frame(row.names= c("ClassA", "ClassB","ClassC","ClassD","ClassE","ClassF"),
P1=c("John","John","Dave","Patrick","Steve","John"),
P2=c("Jim","Jim","Robert","Matt","Jim","Ben"),
P3=c("Marty","Mike","Stu","Geoff","Mike","Leif"),
P4=c("Mark","Mark","Tim","Moby","Chester","Larry")))
DFtally <- matrix(ncol=6, nrow=6)
for (i in 1:dim(DF)[1]) {
  for (j in 1:dim(DF)[1]) {
    DFtally[i,j] <- length(intersect(t(DF[i,]),t(DF[j,])))
  }
}
library(plotly)
p <- plot_ly(z = DFtally, type = "heatmap")
p

Try this:
DF2 <- split(as.matrix(DF), 1:nrow(DF))
DF2 <- crossprod(table(stack(DF2)))
DF2
#    ind
# ind 1 2 3 4 5 6
#   1 4 3 0 0 1 1
#   2 3 4 0 0 2 1
#   3 0 0 4 0 0 0
#   4 0 0 0 4 0 0
#   5 1 2 0 0 4 0
#   6 1 1 0 0 0 4
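To also see which students are shared, one option is to build a student-by-class incidence table once and reuse it. The following is only a sketch, assuming the DF from the question; the names long, inc, shared_classes, shared_students and students_in are made up for illustration.

```r
DF <- data.frame(row.names = c("ClassA", "ClassB", "ClassC", "ClassD", "ClassE", "ClassF"),
                 P1 = c("John", "John", "Dave", "Patrick", "Steve", "John"),
                 P2 = c("Jim", "Jim", "Robert", "Matt", "Jim", "Ben"),
                 P3 = c("Marty", "Mike", "Stu", "Geoff", "Mike", "Leif"),
                 P4 = c("Mark", "Mark", "Tim", "Moby", "Chester", "Larry"))

# Long format: one row per (class, student) pair.
# unlist() walks the data frame column by column, so the class labels
# are repeated once per column to stay aligned with it.
long <- data.frame(class   = rep(rownames(DF), times = ncol(DF)),
                   student = as.character(unlist(DF, use.names = FALSE)))

# Incidence table: rows are students, columns are classes.
inc <- table(long$student, long$class)

# Class-by-class counts of shared students (same numbers as the loop).
shared_classes <- crossprod(inc)

# Student-by-student counts of classes attended together.
shared_students <- tcrossprod(inc)

# Names of the students two given classes have in common.
students_in <- function(cls) as.character(unlist(DF[cls, ], use.names = FALSE))
common_AB <- intersect(students_in("ClassA"), students_in("ClassB"))
```

Either matrix can be passed to plot_ly(z = ..., type = "heatmap") in the same way as DFtally.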

Related

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second column has the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to rearrange/reshape the data into the following format:
Cluster No.  org1  org2  org3  org4
0            0     0     0     1
1            1     0     0     0
I could not figure out how to do it in R.
Thanks

We could use table:
out <- cbind(ClusterNo = seq_len(nrow(df1)),
             as.data.frame.matrix(table(seq_len(nrow(df1)),
               factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
#   ClusterNo org1 org2 org3 org4
# 1         1    0    0    0    1
# 2         2    1    0    0    0
It is also possible that we need to use the first column to get the frequency:
out1 <- as.data.frame.matrix(table(df1[[1]],
          factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
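For reference, here is a self-contained sketch of the second variant, with the sample rows from the question read in through read.table(text = ...). The object name df1 follows the answer above; org is a helper name introduced here, and the four organism levels are an assumption taken from the desired output.

```r
# Sample rows copied from the question.
df1 <- read.table(text = "
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
")

# Strip everything from the pipe onwards, keeping only the organism name.
# Fixing the levels keeps org2 and org3 as all-zero columns.
org <- factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))

# One row per cluster number, one column per organism.
out1 <- as.data.frame.matrix(table(df1[[1]], org))
```

Cluster 10 occurs twice in the input, so its row has a 1 under both org1 and org4.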

Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the organism number from strings such as org4|gene759 using a regular expression, and store it as an integer in a third column of our input:
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
  possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transpose it, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames and rownames to label its columns and rows:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Extracting maximum value of cumulative sum into a new column

A sample of the data set:
testdf <- data.frame(risk_11111 = c(0,0,1,2,3,0,1,2,3,4,0), risk_11112 = c(0,0,1,2,3,0,1,2,0,1,0))
I need an output data set with new columns in which only the maximum values of each cumulative sum are kept:
testdf <- data.frame(risk_11111 = c(0,0,1,2,3,0,1,2,3,4,0),
risk_11111_max = c(0,0,0,0,3,0,0,0,0,4,0),
risk_11112 = c(0,0,1,2,3,0,1,2,0,1,0),
risk_11112_max = c(0,0,0,0,3,0,0,2,0,1,0))
I am guessing it needs some logical subsetting of vectors column-wise with apply, extracting the max value by position index, and mutating into new variables.
I don't know how to extract the values for the new variables.
Thanks

Something like this with base R:
lapply(testdf, function(x) {
  # pad with FALSE so the index has the same length as x and is not recycled
  x[c(diff(x) > 0, FALSE)] <- 0
  x
})
And to have all in one data.frame:
dfout <- cbind(testdf, lapply(testdf, function(x) {
  # pad with FALSE so the index has the same length as x and is not recycled
  x[c(diff(x) > 0, FALSE)] <- 0
  x
}))
names(dfout) <- c(names(testdf), paste0(names(testdf), '_max'))
Output:
   risk_11111 risk_11112 risk_11111_max risk_11112_max
1           0          0              0              0
2           0          0              0              0
3           1          1              0              0
4           2          2              0              0
5           3          3              3              3
6           0          0              0              0
7           1          1              0              0
8           2          2              0              2
9           3          0              0              0
10          4          1              4              1
11          0          0              0              0
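If the column order should match the layout asked for in the question (each _max column right after its source column), the columns can be interleaved afterwards. This is a sketch only; keep_run_max and maxed are names introduced here, and it assumes a run ends wherever the next value is not larger.

```r
testdf <- data.frame(risk_11111 = c(0,0,1,2,3,0,1,2,3,4,0),
                     risk_11112 = c(0,0,1,2,3,0,1,2,0,1,0))

# Zero out every value that is followed by a larger one, so only the
# last (largest) value of each increasing run survives.
keep_run_max <- function(x) {
  rising <- c(diff(x) > 0, FALSE)  # FALSE pads the index to length(x)
  x[rising] <- 0
  x
}

maxed <- lapply(testdf, keep_run_max)
names(maxed) <- paste0(names(testdf), "_max")

dfout <- cbind(testdf, maxed)

# Interleave: original column 1, its max, original column 2, its max, ...
dfout <- dfout[order(rep(seq_along(testdf), 2))]
```

The order() call works because ties are kept in their original order, so each source column stays ahead of its _max partner.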

Creating a boolean data frame from a data frame in R

I have a data frame and I want to create a boolean data frame from it. I want every unique value of every column in the original data frame to become a column name in the boolean data frame. To show it using an example:
mydata =
sex route
m oral
f oral
m topical
f unknown
Then, I want to create
m f oral topical unknown
1 0 1 0 0
0 1 1 0 0
1 0 0 1 0
0 1 0 0 1
I am using the code below to create the boolean data frame. It works in R but not in shiny. What could be the problem?
col_names=c()
for(i in seq(1,ncol(mydata))){
  col_names=c(col_names,unique(mydata[i]))
}
col_names= as.vector(unlist(col_names))
my_boolean= data.frame(matrix(0, nrow = nrow(mydata), ncol = length(col_names)))
colnames( my_boolean)=col_names
for(i in seq(1,nrow(mydata))){
  for(j in seq(1,ncol(mydata))){
    my_boolean[i,which(mydata[i,j]==colnames(my_boolean))]=1
  }
}

There are a few ways you can do this, but I always find table the easiest to understand. Here's an approach with table:
do.call(cbind, lapply(mydata, function(x) table(1:nrow(mydata), x)))
## f m oral topical unknown
## 1 0 1 1 0 0
## 2 1 0 1 0 0
## 3 0 1 0 1 0
## 4 1 0 0 0 1
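As a self-contained sketch of the same idea, with mydata rebuilt from the example in the question (the name indicators is introduced here for illustration):

```r
mydata <- data.frame(sex   = c("m", "f", "m", "f"),
                     route = c("oral", "oral", "topical", "unknown"))

# For each column, cross-tabulate the row number against the value.
# Every row gets exactly one 1 per original column.
indicators <- do.call(cbind,
                      lapply(mydata, function(x) table(seq_along(x), x)))

# Convert to a data frame if that is the shape needed downstream.
my_boolean <- as.data.frame(indicators)
```

Within each original column the new columns come out in sorted order (f before m), so they may need reordering if a specific layout matters.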

How to create a variable that indicates agreement from two dichotomous variables

I'd like to create a new variable that contains 1s and 0s. A 1 represents agreement between the raters (both raters 1 or both raters 0) and a 0 represents disagreement.
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- cbind(rater_A, rater_B)
The new variable would be like the following vector I created manually:
df$agreement <- c(1,0,0,0,1,0,1,1,1,1)
Maybe there's a package or a function I don't know. Any help would be great.

You could create df as a data.frame (instead of using cbind) and use within and ifelse:
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)
##
df <- within(df,
  agreement <- ifelse(rater_A == rater_B, 1, 0))
##
> df
rater_A rater_B agreement
1 1 1 1
2 0 1 0
3 1 0 0
4 1 0 0
5 1 1 1
6 0 1 0
7 0 0 1
8 1 1 1
9 0 0 1
10 0 0 1
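A shorter variant, offered as a sketch: the comparison itself already yields TRUE/FALSE, so converting it with as.integer gives the same 1/0 column without ifelse.

```r
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)

# TRUE where both raters gave the same rating; as.integer maps TRUE/FALSE to 1/0.
df$agreement <- as.integer(df$rater_A == df$rater_B)
```

Note that df must be a data.frame for the $ assignment to work; the matrix produced by cbind in the question does not support it.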

Creating subgroups from categorical data by using lapply in R

I was wondering if you kind folks could answer a question I have. In the sample data I've provided below, in column 1 I have a categorical variable, and in column 2 p-values.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
categorical_data=as.matrix(sample(x,10000))
p_val=as.matrix(runif(10000,0,1))
combi=as.data.frame(cbind(categorical_data,p_val))
head(combi)
V1 V2
1 A 0.484525170875713
2 C 0.48046557046473
3 C 0.228440979029983
4 B 0.216991128632799
5 C 0.521497668232769
6 D 0.358560319757089
I want to now take one of the categorical variables, let's say "C", and create another variable if it is C (print 1 in column 3, or 0 if it isn't).
combi$NEWVAR[combi$V1=="C"] <-1
combi$NEWVAR[combi$V1!="C"] <-0
V1 V2 NEWVAR
1 A 0.484525170875713 0
2 C 0.48046557046473 1
3 C 0.228440979029983 1
4 B 0.216991128632799 0
5 C 0.521497668232769 1
6 D 0.358560319757089 0
I'd like to do this for each of the variables in V1, and then loop over using lapply:
variables=unique(combi$V1)
loopeddata=lapply(variables,function(x){
combi$NEWVAR[combi$V1==x] <-1
combi$NEWVAR[combi$V1!=x]<-0
}
)
My output however looks like this:
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
My desired output would be like the table in the second block of code, but when looping over the third column would be A=1, while B,C,D=0. Then B=1, A,C,D=0 etc.
If anyone could help me out that would be very much appreciated.

How about something like this:
model.matrix(~ -1 + V1, data=combi)
Then you can cbind it to combi if you desire:
combi <- cbind(combi, model.matrix(~ -1 + V1, data=combi))

model.matrix is definitely the way to do this in R. You can, however, also consider using table.
Here's an example using the result I get when using set.seed(1) (always use a seed when sharing example problems with random data).
LoopedData <- table(sequence(nrow(combi)), combi$V1)
head(LoopedData)
#
# A B C D
# 1 0 1 0 0
# 2 0 0 1 0
# 3 0 0 1 0
# 4 0 0 1 0
# 5 0 1 0 0
# 6 0 0 1 0
## If you want to bind it back with the original data
combi <- cbind(combi, as.data.frame.matrix(LoopedData))
head(combi)
# V1 V2 A B C D
# 1 B 0.0647124934475869 0 1 0 0
# 2 C 0.676612401846796 0 0 1 0
# 3 C 0.735371692571789 0 0 1 0
# 4 C 0.111299667274579 0 0 1 0
# 5 B 0.0466546178795397 0 1 0 0
# 6 C 0.130910312291235 0 0 1 0
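As for why the lapply attempt in the question returns 0 for every level: a function returns the value of its last expression, and the value of an assignment is the assigned value, so each call returns 0 and the modified copy of combi inside the function is discarded. A minimal sketch of a looped version that returns the indicator vector instead is shown here; it builds combi directly as a data frame (an assumption, not the asker's exact construction), and levels_v1 and indicators are illustrative names.

```r
set.seed(1)
x <- rep(c("A", "B", "C", "D"), times = c(1000, 2000, 6500, 500))
combi <- data.frame(V1 = sample(x), V2 = runif(10000),
                    stringsAsFactors = FALSE)

levels_v1 <- sort(unique(combi$V1))

# One 0/1 vector per level; sapply binds them into a matrix whose
# columns are named after the levels.
indicators <- sapply(levels_v1, function(level) as.integer(combi$V1 == level))

combi <- cbind(combi, indicators)
```

Each row then has exactly one 1 across the columns A to D.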
