Generating a large matrix from smaller matrices in R

I have a directory containing a series of text-file matrices of 0s and 1s of varying sizes, which look like:
txt.1
0 1 0
1 1 1
0 0 1
txt.2
1 1 0
0 1 1
txt.3
1 1 1 1
0 1 0 1
0 0 0 0
I am trying to create a larger diagonal matrix from these smaller matrices, replacing all the values in the smaller matrices with 0 and filling the empty spaces in the diagonal with 1s, so that the final result looks like:
print(bigmatrix)
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
Is there some way to use bdiag or some other function here? I have only been able to get bdiag to fill in everything with 0s.

You don't need to know the elements of each small matrix; just create N matrices filled with 1s, each of dimension max(dim(mx)):
library(Matrix)
# All-1s blocks matching the dimensions of each small matrix
m1 = matrix(1, 3, 3)
m2 = matrix(1, 3, 3)
m3 = matrix(1, 4, 4)
lst = list(m1, m2, m3)
print(lst)
# Assemble along the diagonal, then swap 0s and 1s
m0 = as.matrix(bdiag(lst))
m0 = ifelse(m0 == 0, 1, 0)
View(m0)
Result:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 0 0 1 1 1 1 1 1 1
2 0 0 0 1 1 1 1 1 1 1
3 0 0 0 1 1 1 1 1 1 1
4 1 1 1 0 0 0 1 1 1 1
5 1 1 1 0 0 0 1 1 1 1
6 1 1 1 0 0 0 1 1 1 1
7 1 1 1 1 1 1 0 0 0 0
8 1 1 1 1 1 1 0 0 0 0
9 1 1 1 1 1 1 0 0 0 0
10 1 1 1 1 1 1 0 0 0 0

This method works:
library(Matrix)
library(MASS)
structural0 <- lapply(dir(), function(x) {as.matrix(read.table(x))})
structural0 <- lapply(structural0, function(x) {ifelse(x == 0, 1, 1)})  # replace every value with 1
structural0 <- bdiag(structural0)
write.matrix(structural0, file = "structural0.txt")
structural0a <- as.matrix(read.table("structural0.txt"))
structural0a <- ifelse(structural0a == 0, 1, 0)  # invert: blocks -> 0, fill -> 1
write.matrix(structural0a, file = "structural0a.txt")
However, I wonder if there is a more efficient way of doing it. Thank you.
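For what it's worth, the same result can be computed entirely in memory, skipping the write.matrix/read.table round trip. A minimal sketch with hard-coded block sizes standing in for the files read from dir(); the single subtraction 1 - bdiag(...) replaces both ifelse passes:

```r
library(Matrix)

# Stand-ins for lapply(dir(), ...): one all-1s block per input matrix
blocks <- list(matrix(1, 3, 3), matrix(1, 3, 3), matrix(1, 4, 4))

# bdiag() puts the blocks on the diagonal with 0s elsewhere;
# subtracting from 1 turns the blocks into 0s and the fill into 1s
bigmatrix <- 1 - as.matrix(bdiag(blocks))
dim(bigmatrix)   # 10 x 10
```

Swap the hard-coded list for lapply(dir(), ...) to read the real files.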

Related

R: new column names in data frame and integration of the original columnames as part of the data

I have the following dataframe:
df <-read.table(header=TRUE, text="1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1")
I want to bring the column names as part of the data and create a column name such as V1, V2, V3 and V4:
df_new <-read.table(header=TRUE, text="V1 V2 V3 V4
1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1")
Assuming in your actual case you have data like this -
df <-read.table(header=TRUE, text="1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1", check.names = FALSE)
You can make the column names the first row with:
rbind(t(as.numeric(names(df))), setNames(df, paste0('V', seq_along(df))))
# V1 V2 V3 V4
#1 1 0 0 1
#2 1 0 1 1
#3 1 0 0 0
#4 1 1 1 0
#5 2 1 0 0
#6 2 1 0 0
#7 2 1 1 0
#8 3 0 1 1
#9 3 0 0 1
#10 3 0 0 1

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option with apply() and a built-in function to test the threshold of 1s across each column. I have created a dummy example: the index i contains the columns to be dropped after using myfun to compute the proportion of 1s in each column. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1s is below 0.1.
You can use colMeans in base R to keep columns where at least 10% of the values are 1. Note the subsetting has to be on columns, not rows:
df[, colMeans(df == 1) >= 0.1]
Or in dplyr use select with where :
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
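A quick sanity check of the colMeans approach on the question's three-column data, where only column c falls below the 10% threshold:

```r
df <- data.frame(a = c(1, 1, 1), b = c(1, 0, 1), c = c(0, 0, 0))

# Logical column mask from the per-column proportion of 1s
keep <- colMeans(df == 1) >= 0.1
df2 <- df[, keep, drop = FALSE]
names(df2)   # "a" "b" -- column c is dropped
```

drop = FALSE guards against the result collapsing to a vector when only one column survives.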

Recreate heatmap

I have the following data set in test.csv (6500 rows)...
Field 1 Field 2 Field 3 Field 4 Field 5 Field 6 Field 7 Field 8 Field 9 Field 10 Field 11 Field 12 Field 13 Field 14 Field 15 Field 16 Field 17 Field 18
1 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
2 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
3 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
4 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
5 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
6 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
7 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
8 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
9 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 1
I need to recreate the layout and format of this heatmap.
The legend and y axis label (Rows / observations) look like they were added outside of R, but so far, here is what I have done...
library(RColorBrewer)
# Read in dataset
df <- read.csv("test.csv")
# Set as integer matrix
m <- as.matrix(df[, -1])
# Create colors
blackgrey <- c("black", "grey")
pal <- colorRampPalette(blackgrey)(100)
# Create heatmap
heatmap(m, Rowv = NA, Colv = NA, scale="column", col = pal)
Which yielded...
How do I change the bounds of the y-axis units so that, instead of counting from 1, they are labelled every 2000 rows? Also, can I move the axis to the left like in the heatmap I am trying to replicate?
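One possible route (an assumption on my part, not from the thread): heatmap() does not expose its axes, but drawing with image() and axes = FALSE lets you add a custom left-hand axis labelled every 2000 rows. A sketch with dummy data standing in for the 6500 x 18 matrix m:

```r
# Dummy stand-in for the 6500 x 18 matrix read from test.csv
m <- matrix(rep(c(1, 0), length.out = 6500 * 18), nrow = 6500)

# t(m) puts the rows of m on the y axis; suppress the default axes
image(t(m), col = c("black", "grey"), axes = FALSE,
      xlab = "Fields", ylab = "Rows / observations")

# Custom y axis: ticks every 2000 rows, mapped into image()'s 0..1 coordinates
ticks <- seq(0, 6000, by = 2000)
axis(2, at = ticks / 6500, labels = ticks, las = 1)
```

Side 2 of axis() is the left margin, which is where the replicated heatmap has its labels.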

Calculate mean values of multiple measurements in a table with two categorical variables and a single continuous variable

I have this puzzle to solve.
This is given data
# A tibble: 351 x 3
# Groups: expcode [?]
expcode rank distributpermm.3
<chr> <int> <dbl>
1 ER02 1 892.325
2 ER02 2 694.030
3 ER02 3 917.110
4 ER02 4 991.475
5 ER02 5 1487.210
6 ER02 6 892.325
7 ER02 7 694.030
8 ER02 8 1710.290
9 ER02 9 1090.620
10 ER02 10 1288.915
# ... with 341 more rows
When I call table on this data like this:
table(ranktab$expcode, ranktab$rank)
I get an ordinary table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
ER02 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER03 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER04 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER05 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER07 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ER11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
ER18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER23 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Now I would like to get a matrix shaped like the table above, but instead of counts of cases I would like the values of the third variable in the data frame; where there are two observations, the mean of these.
Assume your initial data is in a data frame df:
df1 <- with(df, aggregate(distributpermm.3, by = list(expcode, rank), mean))
colnames(df1) <- colnames(df)
#this will give you final output in the desired format
xtabs(distributpermm.3 ~ expcode + rank, df1)
Hope this helps!
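A quick check of this route on a two-code toy subset (values invented for illustration); ER03 has two rank-1 observations, so that cell holds their mean:

```r
df <- data.frame(expcode = c("ER02", "ER02", "ER03", "ER03"),
                 rank = c(1, 2, 1, 1),
                 distributpermm.3 = c(892.325, 694.030, 991.475, 1487.210))

# Mean per (expcode, rank) pair, then spread into a matrix-like table
df1 <- with(df, aggregate(distributpermm.3, by = list(expcode, rank), mean))
colnames(df1) <- colnames(df)
tab <- xtabs(distributpermm.3 ~ expcode + rank, df1)
tab["ER03", "1"]   # (991.475 + 1487.210) / 2 = 1239.3425
```

Cells with no observation come out as 0 in the xtabs result.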
If you just want the mean of distributpermm.3 for each expcode, you can use the aggregate function. Try this:
expcode = c(rep("ER02", 3), rep("ER03", 4), "ER04", rep("ER05", 2))
rank = c(1, 2, 3, 1, 2, 3, 4, 1, 1, 2)
distributpermm.3 = c(892.325, 694.030, 917.110, 991.475, 1487.210, 892.325, 694.030, 1710.290, 1090.620, 1288.915)
data = data.frame(expcode, rank, distributpermm.3)
res = aggregate(data[, 3], list(data$expcode), mean)
colnames(res) = c("expcode", "mean (distributpermm.3)")
res
# > res
# expcode mean (distributpermm.3)
# 1 ER02 834.4883
# 2 ER03 1016.2600
# 3 ER04 1710.2900
# 4 ER05 1189.7675
If you want to keep the rank variable in some way, please clarify what you want to obtain.

Create block diagonal data frame in R

I have a data set that looks like this:
Person Team
114 1
115 1
116 1
117 1
121 1
122 1
123 1
214 2
215 2
216 2
217 2
221 2
222 2
223 2
"Team" ranges from 1 to 33, and teams vary in terms of size (i.e., there can be 5, 6, or 7 members, depending on the team). I need to create a data set into something that looks like this:
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
The sizes of the individual blocks are given by the number of people in a team. How can I do this in R?
You could use bdiag from the package Matrix. For example:
> bdiag(matrix(1,ncol=7,nrow=7),matrix(1,ncol=7,nrow=7))
Another idea, although I guess this is less efficient/elegant than RStudent's:
DF = data.frame(Person = sample(100, 21), Team = rep(1:5, c(3,6,4,5,3)))
DF
# Block sizes are the team sizes
lengths = tapply(DF$Person, DF$Team, length)
mat = matrix(0, sum(lengths), sum(lengths))
# For each block, compute the (row, col) indices it occupies,
# shifted by the cumulative size of the preceding blocks
mat[do.call(rbind,
            mapply(function(a, b) arrayInd(seq_len(a ^ 2), c(a, a)) + b,
                   lengths, cumsum(c(0, lengths[-length(lengths)])),
                   SIMPLIFY = FALSE))] = 1
mat
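The two answers can also be combined: take the block sizes straight from the Team column and let bdiag() do the assembly. A sketch with invented example data:

```r
library(Matrix)

# Example data: 5 teams of sizes 3, 6, 4, 5, 3 (sizes chosen arbitrarily)
DF <- data.frame(Person = 1:21, Team = rep(1:5, c(3, 6, 4, 5, 3)))

# One n-by-n block of 1s per team, placed along the diagonal
sizes <- table(DF$Team)
mat <- as.matrix(bdiag(lapply(sizes, function(n) matrix(1, n, n))))
dim(mat)   # 21 x 21
```

Because the sizes come from table(), this adapts automatically to any number of teams and any team sizes.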