Count # of IDs that meet both criteria - r

I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userids there are that have company.type of A and B or A and C (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.

We can do this with base R using table (here df1 is the question's data frame):
tbl <- table(df1) > 0   # logical: does each userid have each company.type at least once?
sum(((tbl[, 1] & tbl[, 2]) | (tbl[, 1] & tbl[, 3])) & !(tbl[, 2] & tbl[, 3]))
#[1] 2
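For reference, table(df1) > 0 is a userid-by-company.type logical matrix, so columns 1, 2 and 3 above correspond to A, B and C; printed, it looks roughly like this:
tbl
#       company.type
# userid     A     B     C
#      1  TRUE  TRUE FALSE
#      2  TRUE  TRUE FALSE
#      3 FALSE  TRUE  TRUE
#      4  TRUE FALSE FALSE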

Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
  group_by(userid) %>%
  summarize(temp = setequal(company.type, c("A", "B")) |
                   setequal(company.type, c("A", "C"))) %>%
  pull(temp) %>%
  sum()
# [1] 2
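As a quick check of that setequal behaviour (ordering and duplicates are ignored; dplyr's setequal and base R's behave the same way here):
setequal(c("B", "A"), c("A", "B"))
# [1] TRUE
setequal(c("A", "B", "B"), c("A", "B"))
# [1] TRUE
setequal(c("A", "B", "C"), c("A", "B"))
# [1] FALSE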
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R

Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
  arrange(userid, company.type) %>%
  group_by(userid) %>%
  summarize(types = toString(company.type)) %>%
  ungroup %>%
  filter(grepl("A.*B|A.*C", types) & !grepl("B.*C", types)) %>%
  tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
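For reference, omitting the filter and tally lines shows the intermediate types column that the regular expressions are matched against; it looks roughly like this:
#   userid types
# 1      1 A, B
# 2      2 A, B
# 3      3 B, C
# 4      4 A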
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)

Related

Count percentage of observations that switch value

I have a dataset that has two columns. One column indicates the group, and each group has exactly two rows. The second column represents the category. I would like to compute the percentage of groups whose two rows do not have the same category. So in rows 1 and 2 the Category is not the same, while in rows 3 and 4 it is the same. In the provided data I would get 66.66%, since the Category changes for four groups and stays the same for two groups.
This is my data:
structure(list(Group = c("A", "A", "B", "B", "C", "C", "D", "D",
"E", "E", "F", "F"), Category = c(1L, 2L, 3L, 3L, 5L, 6L, 7L,
7L, 7L, 6L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-12L))
I have tried the following so far:
Data <- Data %>%
  group_by(Group) %>%
  count(n())
But I don't know how to write the code in the last line to get my desired percentage. Could someone help me here?
A base solution with tapply():
mean(with(df, tapply(Category, Group, \(x) length(unique(x)))) > 1)
# [1] 0.6666667
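Note that the \(x) lambda shorthand requires R 4.1 or later; on older versions the same thing can be written with function(x):
mean(with(df, tapply(Category, Group, function(x) length(unique(x)))) > 1)
# [1] 0.6666667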
With dplyr, you could use n_distinct() to count the number of unique values.
library(dplyr)
df %>%
  group_by(Group) %>%
  summarise(N = n_distinct(Category)) %>%
  summarise(Percent = mean(N > 1))
# # A tibble: 1 × 1
# Percent
# <dbl>
# 1 0.667
To show the percentage for both outcomes (category changed vs. unchanged), you can use the following code:
library(dplyr)
Data %>%
  group_by(Group) %>%
  mutate(unique = as.numeric(n_distinct(Category) == 1)) %>%
  ungroup() %>%
  summarise(Percent = prop.table(table(unique)))
Output:
# A tibble: 2 × 1
Percent
<table>
1 0.6666667
2 0.3333333
Using base R
counts <- table(df)
prop.table(table(rowSums(counts != 0)))
-output
1 2
0.3333333 0.6666667
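For reference, the intermediate rowSums(counts != 0) is the number of distinct categories per group, which is what then gets tabulated:
rowSums(counts != 0)
# A B C D E F 
# 2 1 2 1 2 2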

How to collapse rows by identical values in a column

Good evening,
I have a two-column, tab-separated .txt file, as the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows where the column "number" has identical value, by creating a comma separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find an actual solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
  group_by(number) %>%
  summarise(letter = str_c(letter, collapse = ","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter ~ ., df, toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
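Since the question starts from a tab-separated .txt file, the data frame can also be read directly from the file; a minimal sketch, assuming the file is called letters.txt (the file name is hypothetical):
df <- read.delim("letters.txt", header = TRUE)   # read.delim defaults to tab-separated input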

Using a vector as a grep pattern

I am new to R. I am trying to search the column names using grep multiple times within an apply loop. I use grep to specify which columns are summed, based on the vector individuals:
individuals <- c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
  apply(bcdata_clean[, grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of arbitrary size and contains arbitrary data, but its column names contain the elements of individuals as part of the string:
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1], colnames(bcdata_clean)) returns a vector that looks like
[1] 1 2, i.e. the indices of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should happen once for each element of individuals.
However, this returns the warning
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
and results in all the columns of bcdata_total being identical.
Ideally, the element of individuals used would change on each iteration, like this:
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
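To see what the split is doing: sub() strips everything from the hyphen onwards in each column name, and split.default() then groups the columns of the data frame by that prefix:
sub('-.*', '', names(df))
# [1] "ID1" "ID1" "ID2"
# split.default(df, ...) then returns a list of two data frames,
# one holding the ID1-* columns and one holding the ID2-* column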
Naming the function argument individuals, so that grep uses each single ID rather than the whole vector, fixed my issue:
bcdata_total <- sapply(individuals, function(individuals) {
  apply(bcdata_clean[, grep(individuals, colnames(bcdata_clean))], 1, sum)
})
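A slightly more compact variant of the same idea (a sketch based on the objects in the question) uses rowSums, with drop = FALSE so a single matching column still works and fixed = TRUE so the IDs are matched literally rather than as regular expressions:
bcdata_total <- sapply(individuals, function(id) {
  rowSums(bcdata_clean[, grep(id, colnames(bcdata_clean), fixed = TRUE), drop = FALSE])
})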
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
  rownames_to_column('rn') %>%
  pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep = "-") %>%
  group_by(rn) %>%
  summarise(across(starts_with('ID'), sum, na.rm = TRUE), .groups = 'drop') %>%
  column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

how to use a logical vector in R [duplicate]

(This question was closed as a duplicate of: Remove groups that contain certain strings.)
Three doctors diagnose each patient.
Question 1: how to filter the patients whom all 3 doctors diagnosed with disease B (no matter whether it is B.1, B.2 or B.3)?
Question 2: how to filter the patients whom any of the 3 doctors diagnosed with disease A?
set.seed(20200107)
df <- data.frame(id = rep(1:5,each =3),
disease = sample(c('A','B'), 15, replace = T))
df$disease <- as.character(df$disease)
df[1,2] <- 'A'
df[4,2] <- 'B.1'
df[5,2] <- 'B.2'
df[6,2] <- 'B.3'
df
I have an idea of the approach but I don't know how to write the code. I think the any() or all() function should be used.
First, I want to group the patients by id.
Second, check whether all the diseases in each group are A or B.
The code would look something like this:
df %>% group_by(id) %>% filter_all(all_vars(disease == B))
You can use all assuming every patient is checked by 3 doctors only.
library(dplyr)
df %>% group_by(id) %>% summarise(disease_B = all(grepl('B', disease)))
# id disease_B
# <int> <lgl>
#1 1 FALSE
#2 2 TRUE
#3 3 FALSE
#4 4 FALSE
#5 5 FALSE
If you want to subset the rows of the patient, we can use filter
df %>% group_by(id) %>% filter(all(grepl('B', disease)))
For question 2: similarly, we can use any
df %>% group_by(id) %>% summarise(disease_B = any(grepl('A', disease)))
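And, as for question 1, the corresponding rows can be kept with filter:
df %>% group_by(id) %>% filter(any(grepl('A', disease)))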
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), disease = c("A", "A", "A", "B.1", "B.2",
"B.3", "B", "A", "A", "B", "A", "A", "B", "A", "B")), row.names = c(NA,
-15L), class = "data.frame")
For question 1, you can replace B.1, B.2, ... by B, then count the number of rows per patient and disease, and filter to keep only the patients where the count equals 3 and the disease is B:
library(tidyverse)
df %>%
  group_by(id) %>%
  mutate(Disease = gsub("\\.[0-9]+", "", disease)) %>%
  count(Disease) %>%
  filter(n == 3 & Disease == "B")
# A tibble: 2 x 3
# Groups: id [2]
id Disease n
<int> <chr> <int>
1 2 B 3
2 4 B 3
For question 2, similarly, you can replace B.1, ... by B, then keep only the rows where Disease is A, and count the number of rows per patient; the output is the patient id and the number of doctors that diagnosed disease A:
df %>%
  group_by(id) %>%
  mutate(Disease = gsub("\\.[0-9]+", "", disease)) %>%
  filter(Disease == "A") %>%
  count(id)
# A tibble: 3 x 2
# Groups: id [3]
id n
<int> <int>
1 1 1
2 3 3
3 5 2

Summarise number of specific rows containing string variables in R (dplyr/tidyverse codes are appreciated)

I have a big dataset with a variety of variables concerning infectious complications. There are columns containing symptoms recorded as strings ("Dysuria", "Fever", etc.). I would like to know the number of positive symptoms in each observation. I have tried different approaches, using rowSums within mutate_at with is.character and !is.na, trying to keep it as simple and short as a single line of code, but it did not work.
example:
symps_na %>%
mutate_if(~any(is.character(.), rowSums)) %>%
View()
Then I wrote code for each column separately, recoding the string variables to 1, converting them to numeric, and summing them to get the number of symptoms (see the code below).
symps_na <-
  pb_table_ord %>%
  select(ID, dysuria:fever) %>%
  mutate(dysuria = ifelse(dysuria == "Dysuria", 1, dysuria)) %>%
  mutate(frequency = ifelse(frequency == "Frequency", 1, frequency)) %>%
  mutate(urgency = ifelse(urgency == "Urgency", 1, urgency)) %>%
  mutate(prostatepain = ifelse(prostatepain == "Prostate pain", 1, prostatepain)) %>%
  mutate(rigor = ifelse(!is.na(rigor), 1, rigor)) %>%
  mutate(loinpain = ifelse(!is.na(loinpain), 1, loinpain)) %>%
  mutate(fever = ifelse(!is.na(fever), 1, fever)) %>%
  mutate_at(vars(dysuria:fever), as.numeric) %>%
  mutate(symptoms.sum = rowSums(select(., dysuria:fever)))
but the symptoms.sum column returns NAs instead of numbers.
Oh, sorry, I have just realized that I missed na.rm = TRUE! But anyway, can anyone suggest a more elegant way to get, in a separate column, the number of non-NA/string values for each observation?
You can create two sets of columns: one where you need to check whether the value is the same as the column name, and another where you need to check for NA values. I have created sample data (shared at the end of the answer) and two vectors: cols1, the column names whose cells should match the column name, and cols2, the columns where we need to check for NA values. You can change these according to the column names you have.
library(dplyr)
cols1 <- c('b', 'c')
cols2 <- c('d')
purrr::imap_dfc(df %>% select(cols1), `==`) %>%
  mutate_all(as.numeric) %>%
  bind_cols(df %>% transmute_at(vars(cols2), ~ +(!is.na(.)))) %>%
  mutate(symptoms.sum = rowSums(select(., b:d), na.rm = TRUE))
# A tibble: 5 x 4
# b c d symptoms.sum
# <dbl> <dbl> <int> <dbl>
#1 1 1 0 2
#2 0 1 1 2
#3 1 0 1 2
#4 NA NA 1 1
#5 1 NA 0 1
data
Tested on this data, which looks like this:
df <- structure(list(a = 1:5, b = structure(c(1L, 2L, 1L, NA, 1L), .Label = c("b",
"c"), class = "factor"), c = structure(c(1L, 1L, 2L, NA, NA), .Label = c("c",
"d"), class = "factor"), d = c(NA, 1, 2, 4, NA)), class = "data.frame",
row.names = c(NA, -5L))
df
# a b c d
#1 1 b c NA
#2 2 c c 1
#3 3 b d 2
#4 4 <NA> <NA> 4
#5 5 b <NA> NA
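If a symptom simply counts as present whenever its cell is non-NA (as the asker's follow-up suggests), a shorter sketch in the spirit of the asker's own code, reusing the symps_na object and the dysuria:fever columns from the question, is:
symps_na %>%
  mutate(symptoms.sum = rowSums(!is.na(select(., dysuria:fever))))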

Resources