Create subset of the sample by different variables simultaneously

I have a data frame as the following. Variables a and b are continuous, and variables v1-v7 are binary.
> df <- data.frame(a= c(1,1,2,3,5),
+ b = c(3, 6,8, 2, 4),
+ v1 = c(0,0,0,0,0),
+ v2 = c(1,0,0,0,0),
+ v3 = c(0,1,1,1,1),
+ v4 = c(0,1,1,1,1),
+ v5 = c(0,0,0,0,1),
+ v6 = c(0,0,0,0,0),
+ v7 = c(0,0,0,0,0))
> df
a b v1 v2 v3 v4 v5 v6 v7
1 1 3 0 1 0 0 0 0 0
2 1 6 0 0 1 1 0 0 0
3 2 8 0 0 1 1 0 0 0
4 3 2 0 0 1 1 0 0 0
5 5 4 0 0 1 1 1 0 0
>
I want to create seven subsamples based on the data frame shown above. Specifically, for each of v1-v7 I want a subsample containing variables a and b for the rows where that variable equals 1. For example,
> df1 <- df %>% filter(v1==1)
> df1
[1] a b v1 v2 v3 v4 v5 v6 v7
<0 rows> (or 0-length row.names)
> df2 <- df %>% filter(v2==1)
> df2
a b v1 v2 v3 v4 v5 v6 v7
1 1 3 0 1 0 0 0 0 0
> df3 <- df %>% filter(v3==1)
> df3
a b v1 v2 v3 v4 v5 v6 v7
1 1 6 0 0 1 1 0 0 0
2 2 8 0 0 1 1 0 0 0
3 3 2 0 0 1 1 0 0 0
4 5 4 0 0 1 1 1 0 0
How can I do all of these at once in R? Thanks.

Here's a way with lapply(). You are better off keeping your results in a list: the subsample for v1 is subsamples[[1]], and so on.
subsamples <- lapply(3:9, function(x) df[df[[x]]==1, ])
subsamples
[[1]]
[1] a b v1 v2 v3 v4 v5 v6 v7
<0 rows> (or 0-length row.names)
[[2]]
a b v1 v2 v3 v4 v5 v6 v7
1 1 3 0 1 0 0 0 0 0
[[3]]
a b v1 v2 v3 v4 v5 v6 v7
2 1 6 0 0 1 1 0 0 0
3 2 8 0 0 1 1 0 0 0
4 3 2 0 0 1 1 0 0 0
5 5 4 0 0 1 1 1 0 0
[[4]]
a b v1 v2 v3 v4 v5 v6 v7
2 1 6 0 0 1 1 0 0 0
3 2 8 0 0 1 1 0 0 0
4 3 2 0 0 1 1 0 0 0
5 5 4 0 0 1 1 1 0 0
[[5]]
a b v1 v2 v3 v4 v5 v6 v7
5 5 4 0 0 1 1 1 0 0
[[6]]
[1] a b v1 v2 v3 v4 v5 v6 v7
<0 rows> (or 0-length row.names)
[[7]]
[1] a b v1 v2 v3 v4 v5 v6 v7
<0 rows> (or 0-length row.names)
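The question also asks for only a and b in each subsample. A minimal sketch of one way to get that, building on the lapply() idea above and assuming the flag columns are literally named v1 to v7 (the names flag_cols and subsamples_ab are made up for illustration):

```r
# Example data from the question
df <- data.frame(a  = c(1, 1, 2, 3, 5),
                 b  = c(3, 6, 8, 2, 4),
                 v1 = c(0, 0, 0, 0, 0),
                 v2 = c(1, 0, 0, 0, 0),
                 v3 = c(0, 1, 1, 1, 1),
                 v4 = c(0, 1, 1, 1, 1),
                 v5 = c(0, 0, 0, 0, 1),
                 v6 = c(0, 0, 0, 0, 0),
                 v7 = c(0, 0, 0, 0, 0))

# Hypothetical helper names; the flag columns are assumed to be v1..v7
flag_cols <- paste0("v", 1:7)

# For each flag column, keep the rows where it equals 1 and only columns a and b
subsamples_ab <- lapply(flag_cols, function(v) df[df[[v]] == 1, c("a", "b")])

# Name the list elements so they can be looked up by flag name
names(subsamples_ab) <- flag_cols

subsamples_ab$v3
```

If the flag columns can contain NA, wrapping the condition in which() avoids getting rows of NA back.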

In dplyr you can refer to a variable by its name as a character string with the .data pronoun (see data masking):
library(dplyr)
df_samples <- list()
for(i in 1:7)
  df_samples[[i]] <- filter(df, .data[[paste0("v", i)]] == 1)

Just loop over the columns 'v1' to 'v7', filter on each one, and return the results in a list:
library(dplyr)
library(stringr)
library(purrr)
lst1 <- str_subset(names(df), "^v\\d+") %>%
map(~ df %>%
filter(if_all(all_of(.x), ~ .x == 1)))
names(lst1) <- str_c('df', seq_along(lst1))
It is better to keep the results in a list. If you need separate objects in the global environment (not recommended), use list2env on the named list:
list2env(lst1, .GlobalEnv)

Related

How to find top n% of records in a dataframe and change it to 1, otherwise 0

I have a data frame like:
v1 v2 v2 v4 v5 v6 v7 v8
1 10 8 8 50 19 41 20
11 21 87 67 23 49 14 0
88 24 55 67 24 67 56 90
What I want is this: if a value is in the top 5% or 10% of all values, change it to 1; if not, replace it with 0.
The structure should look like the one below (this is not a true result, it just shows the structure I want to get):
0 0 0 0 1 0 0 1
0 0 1 0 0 0 0 1
0 1 0 0 1 0 0 1
Is there a fast way to do this? My data is about 60*639.

In base R:
1*(as.matrix(df) > quantile(unlist(df), 0.95))
#> v1 v2 v2.1 v4 v5 v6 v7 v8
#> [1,] 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0
#> [3,] 1 0 0 0 0 0 0 1

Does this work:
df
# A tibble: 3 x 8
v1 v2 v2_1 v4 v5 v6 v7 v8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 8 8 50 19 41 20
2 11 21 87 67 23 49 14 0
3 88 24 55 67 24 67 56 90
df[] <- lapply(df, function(x) +(x > quantile(sort(unlist(df)), 0.95, names = F)))
df
# A tibble: 3 x 8
v1 v2 v2_1 v4 v5 v6 v7 v8
<int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 1

Another option is to pivot the values to long format and sort them within each original row. The five smallest values of each row then sit in the first five positions and get 0, and the remaining values get 1; finally, reshape back to wide. (I was not sure whether "top 5" means the first five elements or something else. Note also that this computes the flag per row rather than over all values.)
library(dplyr)
library(tidyr)
#Function
newdf <- df %>% mutate(id=row_number()) %>%
pivot_longer(-id) %>%
arrange(id,value) %>%
group_by(id) %>%
mutate(Var=ifelse(row_number()<=5,0,1)) %>%
select(-value) %>%
pivot_wider(names_from=name,values_from=Var) %>%
ungroup() %>% select(-id)
Output:
# A tibble: 3 x 8
v1 v3 v4 v2 v6 v8 v7 v5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 1 1 1
2 0 1 1 0 1 0 0 0
3 1 0 0 0 1 1 0 0

One dplyr option could be:
df %>%
mutate(across(everything(), ~ as.numeric(. > quantile(unlist(df), 0.95))))
v1 v2 v2.1 v4 v5 v6 v7 v8
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 1
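The quantile-based answers above hard-code the 95% cutoff. A small sketch that wraps the same idea in a function so the share (5%, 10%, ...) becomes a parameter; flag_top and its top argument are hypothetical names, and the data frame is re-typed from the question with the duplicated column called v2.1:

```r
# Data from the question (second "v2" column renamed v2.1, as R would do)
df <- data.frame(v1   = c(1, 11, 88),
                 v2   = c(10, 21, 24),
                 v2.1 = c(8, 87, 55),
                 v4   = c(8, 67, 67),
                 v5   = c(50, 23, 24),
                 v6   = c(19, 49, 67),
                 v7   = c(41, 14, 56),
                 v8   = c(20, 0, 90))

# Hypothetical helper: 1 where a value lies strictly above the
# (1 - top) quantile of all values in the table, 0 otherwise
flag_top <- function(data, top = 0.05) {
  m <- as.matrix(data)
  cutoff <- quantile(m, probs = 1 - top, names = FALSE)
  (m > cutoff) * 1L
}

top5  <- flag_top(df, 0.05)
top10 <- flag_top(df, 0.10)
top10
```

Everything here is vectorised, so a 60*639 table should not be a problem. Values exactly equal to the cutoff are not flagged; use >= instead if they should be.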

How to find percentage of people belonging to atleast 2 groups in r

I just started using R, and I have a Stata dataset which I opened in R. In the questionnaire there is a question “Please look carefully at the following list of political groups and say which, if any, do you belong to?”. Variables v1 to v10 represent the different groups, and each takes the value 1 or 0, meaning ‘yes’ or ‘no’.
My question is: how do I find the percentage of people who are members of at least 2 groups?
I think I’m supposed to use dplyr but I am not sure.
One idea I had was to use filter and mutate.

Does this work:
> library(dplyr)
> stat <- data.frame(v1 = sample(c(0,1), 10, T),
+ v2 = sample(c(0,1), 10, T),
+ v3 = sample(c(0,1), 10, T),
+ v4 = sample(c(0,1), 10, T),
+ v5 = sample(c(0,1), 10, T),
+ v6 = sample(c(0,1), 10, T),
+ v7 = sample(c(0,1), 10, T),
+ v8 = sample(c(0,1), 10, T),
+ v9 = sample(c(0,1), 10, T),
+ v10 = sample(c(0,1), 10, T), stringsAsFactors = F)
> stat
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10
1 0 1 1 1 1 1 1 0 0 1
2 0 1 1 0 0 1 1 1 0 1
3 0 1 1 0 1 0 0 1 1 0
4 0 0 1 1 0 1 0 1 0 0
5 0 0 1 1 0 1 0 1 1 0
6 0 1 0 1 1 1 1 1 1 0
7 0 0 1 0 0 0 0 1 0 1
8 0 0 1 1 1 1 0 0 0 1
9 0 1 0 0 0 1 0 0 0 1
10 0 1 1 0 0 0 0 0 1 1
> stat %>% mutate(groups_member = rowSums(.)) %>% mutate(atleast_two_groups = case_when(groups_member >= 2 ~ 'Yes', TRUE ~ 'No')) %>% select(-groups_member)
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 atleast_two_groups
1 0 1 1 1 1 1 1 0 0 1 Yes
2 0 1 1 0 0 1 1 1 0 1 Yes
3 0 1 1 0 1 0 0 1 1 0 Yes
4 0 0 1 1 0 1 0 1 0 0 Yes
5 0 0 1 1 0 1 0 1 1 0 Yes
6 0 1 0 1 1 1 1 1 1 0 Yes
7 0 0 1 0 0 0 0 1 0 1 Yes
8 0 0 1 1 1 1 0 0 0 1 Yes
9 0 1 0 0 0 1 0 0 0 1 Yes
10 0 1 1 0 0 0 0 0 1 1 Yes
>
The data frame is like a matrix with 10 variables, each either 0 or 1. The code creates a new column holding each row's sum, i.e. the number of groups that person belongs to. If that count is at least 2 (at least 20% of the 10 groups), the row is flagged 'Yes'. The percentage you are after is then the share of 'Yes' rows.

You can create a new column that adds up the 1s and 0s in each row, then compute the percentage of rows whose total is below 2 and the percentage whose total is 2 or more.
library(dplyr)
set.seed(1234)
dat <- matrix(ifelse(runif(100)>=0.1,0,1),10,10) %>%
  as_tibble(.name_repair = "unique")
dat %>%
  mutate(rsum = rowSums(.)) %>%
  summarise(fewer_than_two = 100*sum(rsum<2)/n(),
            two_or_more = 100*sum(rsum>=2)/n())
# A tibble: 1 x 2
fewer_than_two two_or_more
<dbl> <dbl>
1 80 20

Start by creating some fake data:
library(dplyr)
df <- tibble(id = 1:5,
v1 = c(1, 1, 0, 0, 0),
v2 = c(1, 1, 0, 0, 0),
v3 = rep(0, 5),
v4 = rep(0, 5),
v5 = rep(0, 5),
v6 = rep(0, 5),
v7 = rep(0, 5),
v8 = rep(0, 5),
v9 = rep(0, 5),
v10 = rep(0, 5))
This is our table. Note that out of 5 observations, 2 people (40%) are members of at least 2 groups.
> df
# A tibble: 5 x 11
id v1 v2 v3 v4 v5 v6 v7 v8 v9 v10
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0 0 0 0 0 0 0
2 2 1 1 0 0 0 0 0 0 0 0
3 3 0 0 0 0 0 0 0 0 0 0
4 4 0 0 0 0 0 0 0 0 0 0
5 5 0 0 0 0 0 0 0 0 0 0
First, I calculate the sum of variables v1 to v10 for each row and create a variable that is TRUE if the sum is greater than or equal to 2 and FALSE otherwise. Then I group by this new variable and calculate the percentages.
result <- df %>%
rowwise() %>%
mutate(two_or_more = sum(c_across(v1:v10)) >= 2) %>%
group_by(two_or_more) %>%
summarize(percentage = sum(n()) / nrow(df) * 100)
The result should look like this
> result
# A tibble: 2 x 2
two_or_more percentage
<lgl> <dbl>
1 FALSE 60
2 TRUE 40
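For comparison, the same percentage can be sketched in base R without grouping, since the mean of a logical vector is the share of TRUE values. The data below re-creates the fake table from this answer; group_cols, n_groups and pct are illustrative names, and the columns are assumed to be called v1 to v10:

```r
# Fake data matching the table above
df <- data.frame(id = 1:5,
                 v1 = c(1, 1, 0, 0, 0),
                 v2 = c(1, 1, 0, 0, 0),
                 v3 = 0, v4 = 0, v5 = 0, v6 = 0,
                 v7 = 0, v8 = 0, v9 = 0, v10 = 0)

# Membership columns (assumed to be named v1..v10)
group_cols <- paste0("v", 1:10)

# Number of groups each respondent belongs to
n_groups <- rowSums(df[group_cols])

# Share of respondents in at least two groups, as a percentage
pct <- mean(n_groups >= 2) * 100
pct
```

If the imported Stata data contains missing answers, rowSums(..., na.rm = TRUE) would count them as 'no'; whether that is appropriate depends on the survey.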

Subsetting a data frame using the sum of each row vector R

Hi I have some data I am reading in from a csv, which is set out in binary form:
1 2 3 4...N
1 0 1 0 1...1
2 1 1 0 1...1
3 0 0 0 0...0
4 1 0 1 1...1
. 1 1 1 0...1
. 1 0 0 0...1
N 0 0 1 1...0
screenshot of str(data)
I want to take a subset of this data where the sum of the row vectors is greater than a number say 10, or x. The first column is a placeholder column for customer ID, so this needs to be excluded. Do you have any suggestions about how I could go about doing this?
I've been trying various things like df=subset() but I've not been able to get the syntax correct.
Thanks in advance.

We can do this with rowSums
df1[rowSums(df1) > 10, , drop = FALSE]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
#7 0 0 0 1 0 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1
#9 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1
Update
In the OP's dataset, the first column 'X' is not binary and holds larger numbers, so including it pushes every row sum above 10. It is the index ID and should not be used in the calculation. Removing it inside rowSums makes the subset work as intended:
df1[rowSums(df1[-1])> 10,]
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 10* 20, replace = TRUE), ncol = 20))
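Since the question mentions struggling with subset(), here is a small sketch of both spellings on a made-up table with an ID in the first column. The data frame cust, its column names, and the threshold x are invented for illustration:

```r
# Made-up data: first column is a customer ID, the rest are 0/1 flags
cust <- data.frame(id = c(101, 102, 103, 104),
                   p1 = c(1, 0, 1, 0),
                   p2 = c(1, 0, 1, 1),
                   p3 = c(1, 0, 0, 0))

x <- 1  # threshold on the row sum

# Row sums over every column except the first (the ID)
row_total <- rowSums(cust[-1])

# Bracket indexing, as in the answer above
res_bracket <- cust[row_total > x, , drop = FALSE]

# Equivalent call with subset()
res_subset <- subset(cust, rowSums(cust[-1]) > x)

res_bracket
```

Both keep the ID column in the result while leaving it out of the sum.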

Find biggest independent subset of a connectivity matrix

I have two groups linked by a connectivity matrix like the following:
#
# X1 X2 X3 X4 X5 X6
# 1 0 0 0 0 0 V1
# 1 1 1 0 0 0 V2
# 0 1 0 0 0 0 V3
# 0 0 1 0 0 0 V4
# 0 0 0 1 0 0 V5
# 0 0 0 1 0 0 V6
# 0 0 0 0 1 0 V7
# 0 0 0 0 1 1 V8
# 0 0 0 0 1 0 V9
# 0 0 0 0 0 1 V10
#
So X1 is linked to V1 and V2 while V2 is linked to X1, X2 and X3 and so on. I need to find a way (algorithm or command) for getting all the biggest independent subsets of the matrix. So, in this case:
# X1 X2 X3
# 1 0 0 V1
# 1 1 1 V2
# 0 1 0 V3
# 0 0 1 V4
and:
# X4
# 1 V5
# 1 V6
and:
# X5 X6
# 1 0 V7
# 1 1 V8
# 1 0 V9
# 0 1 V10
Do you have any hint? I guess there's already some library or function to use either from graph analysis or linear algebra.

As you hinted we can do this with igraph:
# dummy data
df1 <- read.table(text = " X1 X2 X3 X4 X5 X6
V1 1 0 0 0 0 0
V2 1 1 1 0 0 0
V3 0 1 0 0 0 0
V4 0 0 1 0 0 0
V5 0 0 0 1 0 0
V6 0 0 0 1 0 0
V7 0 0 0 0 1 0
V8 0 0 0 0 1 1
V9 0 0 0 0 1 0
V10 0 0 0 0 0 1
")
library(dplyr)
library(tidyr)
library(tibble)
library(igraph)
# make graph object: one edge per V-X pair whose cell equals 1
gg <-
  df1 %>%
  rownames_to_column(var = "V") %>%
  pivot_longer(-V, names_to = "X") %>%
  filter(value == 1) %>%
  graph_from_data_frame()
# split based on connected components of the graph
memb <- components(gg)$membership
lapply(
  split(names(memb), memb),
  function(i)
    df1[intersect(rownames(df1), i),
        intersect(colnames(df1), i),
        drop = FALSE])
# $`1`
# X1 X2 X3
# V1 1 0 0
# V2 1 1 1
# V3 0 1 0
# V4 0 0 1
#
# $`2`
# X4
# V5 1
# V6 1
#
# $`3`
# X5 X6
# V7 1 0
# V8 1 1
# V9 1 0
# V10 0 1
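If igraph is not available, the same blocks can be sketched in base R. This is not the method used in the answer above: it links rows that share a column and takes a transitive closure by repeated matrix multiplication, which is fine for small matrices but not tuned for large ones. The names m, reach and blocks are illustrative:

```r
# Connectivity matrix from the question (rows V1..V10, columns X1..X6)
m <- matrix(c(1, 0, 0, 0, 0, 0,
              1, 1, 1, 0, 0, 0,
              0, 1, 0, 0, 0, 0,
              0, 0, 1, 0, 0, 0,
              0, 0, 0, 1, 0, 0,
              0, 0, 0, 1, 0, 0,
              0, 0, 0, 0, 1, 0,
              0, 0, 0, 0, 1, 1,
              0, 0, 0, 0, 1, 0,
              0, 0, 0, 0, 0, 1),
            nrow = 10, byrow = TRUE,
            dimnames = list(paste0("V", 1:10), paste0("X", 1:6)))

# reach[i, j] is 1 when rows i and j share at least one column
reach <- (m %*% t(m) > 0) * 1

# Transitive closure: keep extending until nothing changes
repeat {
  step <- (reach %*% reach > 0) * 1
  if (identical(step, reach)) break
  reach <- step
}

# Rows with the same reachability pattern belong to the same block
key <- apply(reach, 1, paste, collapse = "")
grp <- match(key, unique(key))

# For each block, keep its rows and the columns those rows touch
blocks <- lapply(split(rownames(m), grp), function(r) {
  m[r, colSums(m[r, , drop = FALSE]) > 0, drop = FALSE]
})
blocks
```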

R Pairwise comparison of matrix columns ignoring empty values

I have an array for which I would like to obtain a measure of the similarity between values in each column. By which I mean I wish to compare the rows between pairwise columns of the array and increment a measure when their values match. The resulting measure would then be at a maximum for two columns exactly the same.
Essentially my problem is the same as discussed here: R: Compare all the columns pairwise in matrix except that I do not wish empty cells to be counted.
With the example data created from code derived from the linked page:
data1 <- c("", "B", "", "", "")
data2 <- c("A", "", "", "", "")
data3 <- c("", "", "C", "", "A")
data4 <- c("", "", "", "", "")
data5 <- c("", "", "C", "", "A")
data6 <- c("", "B", "C", "", "")
my.matrix <- cbind(data1, data2, data3, data4, data5, data6)
similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))
for(col in 1:ncol(my.matrix)){
matches <- my.matrix[,col] == my.matrix
match.counts <- colSums(matches)
match.counts[col] <- 0
similarity.matrix[,col] <- match.counts
}
I obtain:
similarity.matrix =
V1 V2 V3 V4 V5 V6
1 0 3 2 4 2 4
2 3 0 2 4 2 2
3 2 2 0 3 5 3
4 4 4 3 0 3 3
5 2 2 5 3 0 3
6 4 2 3 3 3 0
which also counts pairs of empty cells as matches.
My desired output would be:
expected.output =
V1 V2 V3 V4 V5 V6
1 0 0 0 0 0 1
2 0 0 0 0 0 0
3 0 0 0 0 2 1
4 0 0 0 0 0 0
5 0 0 2 0 0 1
6 1 0 1 0 1 0
Thanks,
Matt

The following is the answer from akrun.
First, change the blank cells to NAs:
is.na(my.matrix) <- my.matrix==''
Then drop the NAs when computing match.counts:
similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix))
for(col in 1:ncol(my.matrix)){
matches <- my.matrix[,col] == my.matrix
match.counts <- colSums(matches, na.rm=TRUE)
match.counts[col] <- 0
similarity.matrix[,col] <- match.counts
}
Which did indeed give me my desired output:
V1 V2 V3 V4 V5 V6
1 0 0 0 0 0 1
2 0 0 0 0 0 0
3 0 0 0 0 2 1
4 0 0 0 0 0 0
5 0 0 2 0 0 1
6 1 0 1 0 1 0
thank you.
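A variant that leaves the matrix untouched, sketched here as an alternative to the NA conversion: count a match only where the cell is non-empty. The names is_value and sim are illustrative, and the data is the example from the question:

```r
data1 <- c("", "B", "", "", "")
data2 <- c("A", "", "", "", "")
data3 <- c("", "", "C", "", "A")
data4 <- c("", "", "", "", "")
data5 <- c("", "", "C", "", "A")
data6 <- c("", "B", "C", "", "")
my.matrix <- cbind(data1, data2, data3, data4, data5, data6)

# TRUE where a cell actually holds a value
is_value <- my.matrix != ""

# For each column j, count rows where another column has the same
# non-empty value; comparing against my.matrix[, j] recycles the
# column down every column of the matrix
sim <- sapply(seq_len(ncol(my.matrix)), function(j) {
  colSums(is_value & my.matrix == my.matrix[, j])
})

# A column is not compared with itself
diag(sim) <- 0
sim
```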
