R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.

Try this option with apply() and a helper function that computes the proportion of 1's in each column and tests it against the threshold. I have created a dummy example; the index i holds the columns to drop, found by applying myfun to each column. Here is the code:
#Data: 10 columns alternating 1 and 0 (50% 1's), then V1 and V2 overwritten
#so each contains a single 1 (5%)
df <- as.data.frame(matrix(c(1, 0), 20, 10))
df$V1 <- c(1, rep(0, 19))
df$V2 <- c(1, rep(0, 19))
#Function: proportion of 1's in a vector
myfun <- function(x) {sum(x == 1) / length(x)}
#Index for removing
i <- unname(which(apply(df, 2, myfun) < 0.1))
#Drop
df2 <- df[, -i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1's is below 0.1.
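To see which proportions drive the drop, you can inspect the per-column output of myfun before filtering (a quick check against the dummy df above):
round(apply(df, 2, myfun), 2)
# V1 and V2 come out at 0.05 (a single 1 in 20 rows); the rest at 0.50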

You can use colMeans in base R to keep columns that have at least 10% 1's. Note the comma placement: the logical vector must select columns, not rows.
df[, colMeans(df == 1) >= 0.1]
Or, in dplyr, use select with where:
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
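Applied to the question's data (reconstructed below; an assumption based on the table shown), both versions keep a (100% 1's) and b (roughly 67%) and drop c (0%):
df <- data.frame(a = c(1, 1, 1), b = c(1, 0, 1), c = c(0, 0, 0))
df[, colMeans(df == 1) >= 0.1]
#  a b
#1 1 1
#2 1 0
#3 1 1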

Related

R: new column names in data frame and integration of the original column names as part of the data

I have the following dataframe:
df <-read.table(header=TRUE, text="1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1")
I want to bring the original column names into the data as the first row and create new column names V1, V2, V3 and V4:
df_new <-read.table(header=TRUE, text="V1 V2 V3 V4
1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1")
Assuming that in your actual case you read the data like this (note check.names = FALSE, which stops R from mangling the numeric header) -
df <-read.table(header=TRUE, text="1 0 0 1
1 0 1 1
1 0 0 0
1 1 1 0
2 1 0 0
2 1 0 0
2 1 1 0
3 0 1 1
3 0 0 1
3 0 0 1", check.names = FALSE)
You can make the column names the first row with -
rbind(t(as.numeric(names(df))), setNames(df, paste0('V', seq_along(df))))
# V1 V2 V3 V4
#1 1 0 0 1
#2 1 0 1 1
#3 1 0 0 0
#4 1 1 1 0
#5 2 1 0 0
#6 2 1 0 0
#7 2 1 1 0
#8 3 0 1 1
#9 3 0 0 1
#10 3 0 0 1
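A more step-by-step base R variant of the same idea (a sketch; first_row is a hypothetical helper name):
# Turn the names into a one-row data frame, then stack and rename
first_row <- setNames(as.data.frame(t(as.numeric(names(df)))), paste0("V", seq_along(df)))
df_new <- rbind(first_row, setNames(df, paste0("V", seq_along(df))))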

Create a loop to get samples in grouped data which meet a condition

I have a dataframe where the data are grouped by ID. I need to know how many cells make up 10% of each group so that I can draw a sample of that size, but the sample should only pick cells where EP is 1.
I've tried a nested for loop: an inner loop to find the number of cells that make up 10% of each group, and an outer loop to repeat the sampling of that many cells meeting the condition EP == 1:
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
x
ID EP
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
7 1 0
8 1 1
9 1 0
10 1 1
11 2 0
12 2 1
13 2 0
14 2 1
15 2 0
16 2 1
17 2 0
18 2 1
19 2 0
20 2 1
for (j in 1:1000) {
  for (i in 1:nrow(x)) {
    d <- x[x$ID == i, ]
    npix <- 10 * nrow(d) / 100
  }
  r <- sample(d[d$EP == 1, ], npix)
  print(r)
}
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
data frame with 0 columns and 0 rows
...
and so on, until 1000.
I would want to get this dataframe, where each sample is a new column in x and each sampled cell is marked with 1:
ID EP s1 s2....s1000
1 1 0 0 0 ....
2 1 1 0 1
3 1 0 0 0
4 1 1 0 0
5 1 0 0 0
6 1 1 0 0
7 1 0 0 0
8 1 1 0 0
9 1 0 0 0
10 1 1 1 0
11 2 0 0 0
12 2 1 0 0
13 2 0 0 0
14 2 1 0 1
15 2 0 0 0
16 2 1 0 0
17 2 0 0 0
18 2 1 1 0
19 2 0 0 0
20 2 1 0 0
Note that each 1 in s1 and s2 marks a sampled cell; the samples correspond to 10% of the cells in each group (1, 2) that meet the condition EP == 1.

You can try:
set.seed(1231)
x <- data.frame("ID"=rep(1:2, each=10),"EP" = rep(0:1, times=10))
library(tidyverse)
x %>%
  group_by(ID) %>%
  mutate(index = ifelse(EP == 1, 1:n(), 0)) %>%
  mutate(s1 = ifelse(index %in% sample(index[index != 0], n() * 0.1), 1, 0)) %>%
  mutate(s2 = ifelse(index %in% sample(index[index != 0], n() * 0.1), 1, 0))
# A tibble: 20 x 5
# Groups: ID [2]
ID EP index s1 s2
<int> <int> <dbl> <dbl> <dbl>
1 1 0 0 0 0
2 1 1 2 0 0
3 1 0 0 0 0
4 1 1 4 0 0
5 1 0 0 0 0
6 1 1 6 1 1
7 1 0 0 0 0
8 1 1 8 0 0
9 1 0 0 0 0
10 1 1 10 0 0
11 2 0 0 0 0
12 2 1 2 0 0
13 2 0 0 0 0
14 2 1 4 0 1
15 2 0 0 0 0
16 2 1 6 0 0
17 2 0 0 0 0
18 2 1 8 0 0
19 2 0 0 0 0
20 2 1 10 1 0
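To scale this to s1 ... s1000 without repeating the mutate() line, one hedged sketch (assuming x as above; the column names are built on the fly, and growing a data frame in a loop is slow for large n):
library(dplyr)
# Create the index column once, then add one sample column per iteration
x <- x %>% group_by(ID) %>% mutate(index = ifelse(EP == 1, 1:n(), 0)) %>% ungroup()
for (k in 1:1000) {
  x <- x %>%
    group_by(ID) %>%
    mutate(!!paste0("s", k) := ifelse(index %in% sample(index[index != 0], n() * 0.1), 1, 0)) %>%
    ungroup()
}
The replicate() approach in the next answer avoids that overhead.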
We can write a function that places 1's in 10% of the rows of each ID group, only where EP == 1.
library(dplyr)
rep_func <- function() {
  x %>%
    group_by(ID) %>%
    mutate(s1 = 0,
           s1 = replace(s1, sample(which(EP == 1), floor(0.1 * n())), 1)) %>%
    pull(s1)
}
Then use replicate to repeat it n times:
n <- 5
x[paste0("s", seq_len(n))] <- replicate(n, rep_func())
x
# ID EP s1 s2 s3 s4 s5
#1 1 0 0 0 0 0 0
#2 1 1 0 0 0 0 0
#3 1 0 0 0 0 0 0
#4 1 1 0 0 0 0 0
#5 1 0 0 0 0 0 0
#6 1 1 1 0 0 1 0
#7 1 0 0 0 0 0 0
#8 1 1 0 1 0 0 0
#9 1 0 0 0 0 0 0
#10 1 1 0 0 1 0 1
#11 2 0 0 0 0 0 0
#12 2 1 0 0 1 0 0
#13 2 0 0 0 0 0 0
#14 2 1 1 1 0 0 0
#15 2 0 0 0 0 0 0
#16 2 1 0 0 0 0 1
#17 2 0 0 0 0 0 0
#18 2 1 0 0 0 1 0
#19 2 0 0 0 0 0 0
#20 2 1 0 0 0 0 0
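A quick sanity check (assuming x as filled above): every sample column should contain exactly floor(0.1 * 10) = 1 one per ID group, and only in rows where EP == 1.
# Per-group totals of each sample column; every cell should be 1
aggregate(x[paste0("s", 1:5)], by = list(ID = x$ID), FUN = sum)
# No sampled cell may sit where EP == 0
stopifnot(all(x[x$EP == 0, paste0("s", 1:5)] == 0))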

Generating a large matrix from smaller matrices in R

I have a directory that contains a series of text-file matrices of 0s and 1s of varying sizes, which look like:
txt.1
0 1 0
1 1 1
0 0 1
txt.2
1 1 0
0 1 1
txt.3
1 1 1 1
0 1 0 1
0 0 0 0
I am trying to create a larger block-diagonal matrix from these smaller matrices that replaces all the values in the smaller matrices with 0 and fills the remaining off-diagonal space with 1s, so that the final result looks like:
print(bigmatrix)
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 0 0 0 1 1 1 1
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
1 1 1 1 1 1 0 0 0 0
Is there some way to use bdiag or some other function here? I have only been able to get bdiag to fill in everything with 0s.
You don't need to know the elements of each small matrix; since every value inside the blocks becomes 0 anyway, only the dimensions matter. Just create N matrices filled with 1's, one per small matrix, with matching dimensions:
library(Matrix)
# One all-1's matrix per input, matching each input's dimensions
m1 = matrix(1, 3, 3)
m2 = matrix(1, 3, 3)
m3 = matrix(1, 4, 4)
lst = list(m1, m2, m3)
print(lst)
# Block-diagonal of the 1-blocks, then swap 0s and 1s
m0 = as.matrix(bdiag(lst))
m0 = ifelse(m0 == 0, 1, 0)
View(m0)
Result:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 0 0 1 1 1 1 1 1 1
2 0 0 0 1 1 1 1 1 1 1
3 0 0 0 1 1 1 1 1 1 1
4 1 1 1 0 0 0 1 1 1 1
5 1 1 1 0 0 0 1 1 1 1
6 1 1 1 0 0 0 1 1 1 1
7 1 1 1 1 1 1 0 0 0 0
8 1 1 1 1 1 1 0 0 0 0
9 1 1 1 1 1 1 0 0 0 0
10 1 1 1 1 1 1 0 0 0 0
This method works:
library(Matrix)
library(MASS)
# Read every file in the directory as a matrix
structural0 <- lapply(dir(), function(x) as.matrix(read.table(x)))
# Replace every entry with 1 (only the dimensions matter)
structural0 <- lapply(structural0, function(x) ifelse(x == 0, 1, 1))
structural0 <- bdiag(structural0)
write.matrix(structural0, file = "structural0.txt")
structural0a <- as.matrix(read.table("structural0.txt"))
# Swap: the 1-blocks become 0, everything else 1
structural0a <- ifelse(structural0a == 0, 1, 0)
write.matrix(structural0a, file = "structural0a.txt")
However, I wonder if there is a more efficient way of doing it. Thank you.
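One hedged simplification (a sketch, assuming the working directory holds only the matrix files): keep just each file's dimensions and flip with plain arithmetic, skipping the write/read round-trip.
library(Matrix)
# Build a 1-filled block per file, matching its dimensions
mats <- lapply(dir(), function(f) {
  m <- as.matrix(read.table(f))
  matrix(1, nrow(m), ncol(m))
})
# bdiag puts the 1-blocks on the diagonal; 1 - x then swaps 0s and 1s
bigmatrix <- 1 - as.matrix(bdiag(mats))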

How to get a matrix with all combinations of three values in R?

Suppose we have a vector x with three values:
x <- c(0,1,2)
How can I fill a matrix with 5 columns (V1, V2, V3, V4, V5) containing all combinations of those values?
For example, we'd have:
V1 V2 V3 V4 V5
0 0 0 0 0
0 0 0 0 1
0 0 0 1 1
...
0 1 0 0 0
...
1 1 1 1 1
...
1 2 1 0 1
...
Is there a way to do that?
Something like:
head(expand.grid(x,x,x,x,x))
Var1 Var2 Var3 Var4 Var5
1 0 0 0 0 0
2 1 0 0 0 0
3 2 0 0 0 0
4 0 1 0 0 0
5 1 1 0 0 0
6 2 1 0 0 0
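To avoid typing x five times, expand.grid also accepts a single list of vectors, so rep() scales this to any number of columns; as.matrix() converts the result if an actual matrix is required:
m <- as.matrix(expand.grid(rep(list(x), 5)))
dim(m)
# [1] 243   5    (3^5 combinations)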

Create block diagonal data frame in R

I have a data set that looks like this:
Person Team
114 1
115 1
116 1
117 1
121 1
122 1
123 1
214 2
215 2
216 2
217 2
221 2
222 2
223 2
"Team" ranges from 1 to 33, and teams vary in terms of size (i.e., there can be 5, 6, or 7 members, depending on the team). I need to create a data set into something that looks like this:
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 1 1 1 1 1 1 1
The sizes of the individual blocks are given by the number of people in a team. How can I do this in R?
You could use bdiag from the package Matrix. For example:
library(Matrix)
bdiag(matrix(1, ncol = 7, nrow = 7), matrix(1, ncol = 7, nrow = 7))
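Note that bdiag() returns a sparse Matrix object (class "dgCMatrix"); if downstream code expects a plain base matrix, wrap it in as.matrix(). A small usage sketch:
library(Matrix)
m <- as.matrix(bdiag(matrix(1, 7, 7), matrix(1, 7, 7)))
dim(m)
# [1] 14 14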
Another idea, although I guess it is less efficient/elegant than RStudent's:
DF = data.frame(Person = sample(100, 21), Team = rep(1:5, c(3, 6, 4, 5, 3)))
DF
lengths = tapply(DF$Person, DF$Team, length)
mat = matrix(0, sum(lengths), sum(lengths))
mat[do.call(rbind,
            mapply(function(a, b) arrayInd(seq_len(a ^ 2), c(a, a)) + b,
                   lengths, cumsum(c(0, lengths[-length(lengths)])),
                   SIMPLIFY = FALSE))] = 1
mat
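A shorter variant in the same spirit (a sketch, assuming DF as above): derive the team sizes with table() and let bdiag() assemble the blocks.
library(Matrix)
sizes <- table(DF$Team)                      # team sizes: 3 6 4 5 3
blocks <- lapply(sizes, function(s) matrix(1, s, s))
mat <- as.matrix(bdiag(blocks))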
