Create block diagonal data frame in R

I have a data set that looks like this:
Person Team
114 1
115 1
116 1
117 1
121 1
122 1
123 1
214 2
215 2
216 2
217 2
221 2
222 2
223 2
"Team" ranges from 1 to 33, and teams vary in terms of size (i.e., there can be 5, 6, or 7 members, depending on the team). I need to create a data set into something that looks like this:
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
The sizes of the individual blocks are given by the number of people in a team. How can I do this in R?

You could use bdiag from the Matrix package. For example:
> library(Matrix)
> bdiag(matrix(1, ncol = 7, nrow = 7), matrix(1, ncol = 7, nrow = 7))
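If the block sizes should come straight from the data instead of being typed out, here is a minimal sketch (assuming your data frame is called df and has a Team column as in the question):
library(Matrix)
sizes <- table(df$Team)                               # members per team = block sizes
blocks <- lapply(sizes, function(n) matrix(1, n, n))  # one all-ones square matrix per team
big <- as.matrix(bdiag(blocks))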

Another idea, though I suspect it is less efficient and elegant than RStudent's:
DF = data.frame(Person = sample(100, 21), Team = rep(1:5, c(3, 6, 4, 5, 3)))
DF
# block sizes: the number of people in each team
lengths = tapply(DF$Person, DF$Team, length)
mat = matrix(0, sum(lengths), sum(lengths))
# for each team, generate the row/column indices of its block, offset by the
# total size of the preceding blocks, and set those cells to 1
mat[do.call(rbind,
            mapply(function(a, b) arrayInd(seq_len(a ^ 2), c(a, a)) + b,
                   lengths, cumsum(c(0, lengths[-length(lengths)])),
                   SIMPLIFY = FALSE))] = 1
mat
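As a side note, because members of the same team sit on consecutive rows, the same matrix can also be obtained from one outer comparison of the Team column with itself; a small sketch under that assumption:
mat2 <- +outer(DF$Team, DF$Team, "==")  # 1 where two rows share a team, 0 otherwise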

Related

Simulating baseline for Q learning research

I'm trying to build a function for my Q-learning research. It is supposed to receive a number of trials and a repeat probability, and simulate data of choosing between two actions (0, 1) at each step, according to the repeat probability (a baseline, so no learning from reward, only switching between actions according to the given probability).
sim.block = function(Ntrl, repeat_p){
  for (i in 2:Ntrl){
    action = sample(x = c(0,1), size = Ntrl, replace=T)
    last.action <- action[i-1] # the number in the previous step
    if(last.action==0){
      action[i] <- sample(x = c(0,1), size = 1, prob = c(repeat_p,1-repeat_p))
    } else {
      action[i] <- sample(x = c(0,1), size = 1, prob = c(1-repeat_p,repeat_p))
    }
  }
  return (data.frame(action))
}
When I test the function with extreme repeat probabilities I don't get what I expect.
For example, with sim.block(400, 0.000000001) I expect no repeats at all, but that's not the case. The same for 0.999999999: I expect only repeats, yet I get a random list of 1s and 0s.
Where is the problem?
There is just a small error: the sample vector is re-created inside the loop on every iteration, which overwrites the choices made so far. Create it once, outside of the loop, e.g.:
set.seed(1)
action = sample(x = c(0,1), size = Ntrl, replace=T) # drawn once, before the loop
for (i in 2:Ntrl){
  last.action <- action[i-1] # the number in the previous step
  if(last.action==0){
    action[i] <- sample(x = c(0,1), size = 1, prob = c(repeat_p,1-repeat_p))
  } else {
    action[i] <- sample(x = c(0,1), size = 1, prob = c(1-repeat_p,repeat_p))
  }
}
The result for your example of Ntrl = 400 and repeat_p = 0.000000001 is then a perfectly alternating sequence:
> action
[1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[73] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[145] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[217] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[289] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
[361] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
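For completeness, here is a sketch of the corrected function plus a quick sanity check; the share of consecutive repeats should track repeat_p:
sim.block <- function(Ntrl, repeat_p) {
  action <- sample(c(0, 1), size = Ntrl, replace = TRUE)  # drawn once
  for (i in 2:Ntrl) {
    if (action[i - 1] == 0) {
      action[i] <- sample(c(0, 1), size = 1, prob = c(repeat_p, 1 - repeat_p))
    } else {
      action[i] <- sample(c(0, 1), size = 1, prob = c(1 - repeat_p, repeat_p))
    }
  }
  data.frame(action)
}
b <- sim.block(400, 0.9)
mean(diff(b$action) == 0)  # observed repeat rate, should be close to 0.9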

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option with apply() and a small helper function to test the threshold of 1's across each column. I have created a dummy example. The index i contains the columns that will be dropped after using myfun to compute the proportion of 1's in each column. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
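One edge case worth guarding against: if no column falls below the threshold, i is integer(0), and df[, -i] then returns zero columns instead of the full data frame. A small guard:
df2 <- if (length(i) > 0) df[, -i] else df  # keep everything when nothing qualifies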
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1's is below 0.1.
You can use colMeans in base R to keep columns that have at least 10% 1's. Note that the logical index belongs in the column position:
df[, colMeans(df == 1) >= 0.1]
Or in dplyr, use select with where:
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
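As a quick check on the dummy data from the first answer (assuming the same df), both versions keep V3 through V10 and drop V1 and V2:
library(dplyr)
df <- as.data.frame(matrix(c(1, 0), 20, 10))  # every column is 50% 1's
df$V1 <- c(1, rep(0, 19))                     # 5% 1's, should be dropped
df$V2 <- c(1, rep(0, 19))                     # 5% 1's, should be dropped
base_keep  <- df[, colMeans(df == 1) >= 0.1]
dplyr_keep <- df %>% select(where(~ mean(. == 1) >= 0.1))
identical(names(base_keep), names(dplyr_keep))  # TRUE: V3..V10 in both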

Calculate mean values of multiple measurements in a table with two categorical variables and a single continuous variable

I have this puzzle to solve.
This is the given data:
# A tibble: 351 x 3
# Groups: expcode [?]
expcode rank distributpermm.3
<chr> <int> <dbl>
1 ER02 1 892.325
2 ER02 2 694.030
3 ER02 3 917.110
4 ER02 4 991.475
5 ER02 5 1487.210
6 ER02 6 892.325
7 ER02 7 694.030
8 ER02 8 1710.290
9 ER02 9 1090.620
10 ER02 10 1288.915
# ... with 341 more rows
When I call table on this data like this:
table(ranktab$expcode, ranktab$rank)
I get an ordinary contingency table:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
ER02 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER03 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER04 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER05 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER07 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
ER11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
ER18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
ER19 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER22 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER23 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ER26 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Now I would like to get a matrix shaped like the table above, but instead of counts of cases I would like to have the values of the third variable in the data frame; if there are two observations in a cell, then the mean of these.
Let's assume that your initial data is in the df data frame:
df1 <- with(df, aggregate(distributpermm.3, by = list(expcode, rank), mean))
colnames(df1) <- colnames(df)
#this will give you final output in the desired format
xtabs(distributpermm.3 ~ expcode + rank, df1)
Hope this helps!
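A minimal alternative sketch, assuming the same df: tapply builds the expcode-by-rank matrix of means directly, with NA where a combination never occurs (xtabs shows 0 there instead):
m <- with(df, tapply(distributpermm.3, list(expcode, rank), mean))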
If you just want to obtain the means of distributpermm.3 relative to expcode, you can use the aggregate function.
Try this:
expcode = c(rep("ER02", 3), rep("ER03", 4), "ER04", rep("ER05", 2))
rank = c(1, 2, 3, 1, 2, 3, 4, 1, 1, 2)
distributpermm.3 = c(892.325, 694.030, 917.110, 991.475, 1487.210, 892.325, 694.030, 1710.290, 1090.620, 1288.915)
data = data.frame(expcode, rank, distributpermm.3)
res = aggregate(data[, 3], list(data$expcode), mean)
colnames(res) = c("expcode", "mean (distributpermm.3)")
res
# > res
# expcode mean (distributpermm.3)
# 1 ER02 834.4883
# 2 ER03 1016.2600
# 3 ER04 1710.2900
# 4 ER05 1189.7675
If you want to keep rank in some way, please clarify what you want to obtain.

Generating a large matrix from smaller matrices in R

I have a directory that contains a series of text file matrices of 0s and 1s of varying sizes, which look like:
txt.1
0 1 0
1 1 1
0 0 1
txt.2
1 1 0
0 1 1
txt.3
1 1 1 1
0 1 0 1
0 0 0 0
I am trying to create a larger block diagonal matrix from these smaller matrices that replaces all the values in the smaller matrices with 0 and fills the remaining off-diagonal space with 1s, so that the final result looks like:
print(bigmatrix)
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
Is there some way to use bdiag or some other function here? I have only been able to get bdiag to fill everything outside the blocks with 0s.
You don't need to know the elements of each small matrix; just create N matrices filled with 1's, each of dimension max(dim(mx)) for the corresponding small matrix mx:
library(Matrix)
m1 = matrix(1, 3, 3)
m2 = matrix(1, 3, 3)
m3 = matrix(1, 4, 4)
lst = list(m1, m2, m3)
print(lst)
m0 = as.matrix(bdiag(lst))  # 1-blocks on the diagonal, 0 elsewhere
m0 = ifelse(m0 == 0, 1, 0)  # invert: 0-blocks on the diagonal, 1 elsewhere
View(m0)
Result:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 0 0 1 1 1 1 1 1 1
2 0 0 0 1 1 1 1 1 1 1
3 0 0 0 1 1 1 1 1 1 1
4 1 1 1 0 0 0 1 1 1 1
5 1 1 1 0 0 0 1 1 1 1
6 1 1 1 0 0 0 1 1 1 1
7 1 1 1 1 1 1 0 0 0 0
8 1 1 1 1 1 1 0 0 0 0
9 1 1 1 1 1 1 0 0 0 0
10 1 1 1 1 1 1 0 0 0 0
This method works:
library(Matrix)
library(MASS)
# read every matrix file in the working directory
structural0 <- lapply(dir(), function(x) as.matrix(read.table(x)))
# replace every entry with 1 (ifelse(x == 0, 1, 1) returns 1 in both branches)
structural0 <- lapply(structural0, function(x) ifelse(x == 0, 1, 1))
structural0 <- bdiag(structural0)
write.matrix(structural0, file = "structural0.txt")
structural0a <- as.matrix(read.table("structural0.txt"))
structural0a <- ifelse(structural0a == 0, 1, 0)  # invert blocks and background
write.matrix(structural0a, file = "structural0a.txt")
However, I wonder if there is a more efficient way of doing it. Thank you.
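For what it's worth, the file round-trip can be skipped by inverting the block-diagonal matrix in memory; a sketch along the lines of the answer above, assuming the working directory holds only the matrix files:
library(Matrix)
mats <- lapply(dir(), function(f) as.matrix(read.table(f)))
ones <- lapply(mats, function(m) { n <- max(dim(m)); matrix(1, n, n) })  # squared 1-blocks
big  <- 1 - as.matrix(bdiag(ones))  # blocks become 0, everything else 1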

Retrieve values in each cluster in R

I have successfully run the DBSCAN algorithm (here is the stripped down command):
results <- dbscan(data,MinPts=15, eps=0.01)
and plotted my clusters:
plot(results, data)
results$cluster is a numeric vector; the value at each index gives the cluster to which the corresponding row of the original data belongs:
[1] 0 1 2 1 0 0 2 1 0 0 0 1 2 0 2 0 2 0 0 1 2 0 2 2 0 1 2 0 1 0 1 0 2 0 0 0 1 1 0 1 2 0 0 0 1 0 0 1 1 0 1
[52] 0 2 2 0 0 1 2 2 0 2 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 2 2 2 2 2 0 0 0 0 0 2 1 2 1 0 2 0 0 1 1 1 0 0 1
[103] 2 1 1 0 1 0 1 1 0 0 0 0 1 2 0 0 1 1 1 1 0 0 0 1 0 0 2 2 1 1 0 1 2 1 0 0 1 0 1 2 0 0 2 0 0 2 2 2 2 0 1
However, how can I retrieve the values of the original data that is in each cluster? For example, how can I get all the values from the original data that are in cluster #2?
Okay, this should do the trick for, e.g., cluster #2:
data[results$cluster==2,]
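And if you want every cluster at once rather than one label at a time, split should do it; a small sketch (with dbscan, cluster 0 holds the noise points):
clusters <- split(as.data.frame(data), results$cluster)
clusters[["2"]]  # same rows as data[results$cluster == 2, ]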
