Counting number of unique IDs per group at certain time points [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 1 year ago.
I'm trying to find the number of participants per gene at different time points. I'm attempting to do this with a nested for loop; however, I can't seem to figure it out. Here's something I've been trying:
library(dplyr)

IgH_CDR3_post_challenge_unique <- select(IgH_CDR3_post_challenge_unique, cdr3aa, gene, ID, Timepoint)
participant_list <- unique(IgH_CDR3_post_challenge_unique$gene)
time_list <- unique(IgH_CDR3_post_challenge_unique$Timepoint)

for (c in participant_list) {
  for (i in time_list) {
    # note: i is already a value from time_list, so time_list[i] is NA here,
    # and this filter overwrites the full data frame on every pass
    IgH_CDR3_post_challenge_unique <- filter(IgH_CDR3_post_challenge_unique, Timepoint == time_list[i])
  }
  IgH_CDR3_post_challenge_unique$participant_per_gene[IgH_CDR3_post_challenge_unique$gene == c] <-
    length(unique(IgH_CDR3_post_challenge_unique$ID[IgH_CDR3_post_challenge_unique$gene == c]))
}
I would like the loops to end up calculating the number of participants per gene for each timepoint.
My data looks something like this:
gene  Timepoint  ID
   1  C0         SP1
   2  C1         SP2
   1  C0         SP4
   3  C0         SP2

This can be achieved without a loop by using dplyr; loops tend to get slow and cumbersome as your data grows.
First use group_by to group the data by the relevant columns, then count the number of unique IDs within each group:
library(dplyr)
> dat %>% group_by(Timepoint, gene) %>% summarise(n = length(unique(ID)))
# A tibble: 3 × 3
# Groups:   Timepoint [2]
  Timepoint  gene     n
  <chr>     <dbl> <int>
1 C0            1     2
2 C0            3     1
3 C1            2     1
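As a follow-up, n_distinct() is dplyr's shorthand for length(unique()); a minimal sketch of the same computation, assuming dplyr >= 1.0 for the .groups argument:

library(dplyr)

# n_distinct(ID) counts unique IDs; .groups = "drop" returns an ungrouped tibble
dat %>%
  group_by(Timepoint, gene) %>%
  summarise(n = n_distinct(ID), .groups = "drop")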


Is there any command which gives, for a specific numeric column, how many times each number exists? [duplicate]

This question already has answers here:
Counting the number of elements with the values of x in a vector
(20 answers)
Counting distinct values in column of a data frame in R
(2 answers)
Closed 4 months ago.
Given a specific column like this number_of_columns_with_text:
df <- data.frame(id = c(1,2,3,4,5,6), number_of_columns_with_text = c(3,2,1,3,1,1))
Is there any command which could give the count of each number in this column (how many times each number exists)?
Example output
data.frame(number = c(1,2,3), volume = c(3,1,2))
What you might be looking for is table(...)
> table(df$number_of_columns_with_text)
1 2 3
3 1 2
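If you need the exact data frame from the example output, the table result can be converted; a small sketch (the column names number and volume are taken from the asker's expected output):

# Convert the table to the requested two-column data frame
counts <- table(df$number_of_columns_with_text)
data.frame(number = as.numeric(names(counts)),
           volume = as.vector(counts))
#   number volume
# 1      1      3
# 2      2      1
# 3      3      2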
In dplyr, you can first group_by the variable you want to tabulate and then use n() to count the frequencies of the distinct values:
library(dplyr)
df %>%
  group_by(number_of_columns_with_text) %>%
  summarise(volume = n())
# A tibble: 3 × 2
  number_of_columns_with_text volume
                        <dbl>  <int>
1                           1      3
2                           2      1
3                           3      2
Using dplyr
library(tidyverse)
df %>%
  group_by(number_of_columns_with_text) %>%
  count()
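count() can also be called directly on the data frame, and recent dplyr versions (>= 0.8) let you name the count column to match the requested output; a sketch:

library(dplyr)

# Equivalent one-liner; `name` sets the name of the count column
count(df, number_of_columns_with_text, name = "volume")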

How to count singular entries from multiple entries in a data frame [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I'm trying very hard to break my C mold; as you'll see, it's still present in my R code. I know there will be a smart R way of doing this!
Essentially I'm trying to go through a long list of individuals held in a data frame. Each individual can have multiple rows if they have taken more than one particular drug, or even multiple instances of the same drug. Each row has a drug name entry. Similar to:
patientID drugname
1 A
2 A
2 B
3 C
3 C
4 A
I have a list containing the unique drug names from this data frame (A, B, C). I would like to build a data frame with columns drugname and drugCount. In drugCount I want to count the number of patients a drug was prescribed to, not multiple counts per person; more of a binary operation of "was this drug given to person X?".
The start of an attempt, in a very C-style manner:
uniqueDrugList <- unique(therapyDF$prodcode)
numDrugs <- length(uniqueDrugList)
prevalenceDF <- data.frame(drugName = character(numDrugs),
                           drugcount = integer(numDrugs),
                           prevalence = numeric(numDrugs),
                           stringsAsFactors = FALSE)
for (i in 1:length(idList)) {
  individualDF <- subset(therapyDF, therapyDF$patid == idList[[i]])
  for (j in 1:numDrugs) {
    if (uniqueDrugList[[j]] %in% individualDF$prodcode) {
      # prevalenceDF <- ... somehow tally up here
    }
  }
}
Firstly, I take a subset of my main data frame for each ID in a list of unique IDs. Then, for each unique drug (and this is where it is slow), I check whether that drug is present in that individual's records. I would like to add 1 to the drug's entry if present, and then move on to the next individual's subset.
Expected output
drugname count
A 3
B 1
C 1
We can group by 'drugname' and get the number of unique values of 'patientID':
library(dplyr)
df %>%
  group_by(drugname) %>%
  summarise(count = n_distinct(patientID))
# A tibble: 3 x 2
# drugname count
# <chr> <int>
#1 A 3
#2 B 1
#3 C 1
Or use table from base R after getting the unique rows
table(unique(df)[2])
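To see why this works, a short sketch: unique(df) drops the duplicated (patientID = 3, drugname = "C") row, so each remaining (patient, drug) pair appears exactly once, and tabulating the second column then counts patients per drug:

# unique(df) keeps one row per (patientID, drugname) pair
unique(df)
#   patientID drugname
# 1         1        A
# 2         2        A
# 3         2        B
# 4         3        C
# 6         4        A

table(unique(df)[2])
# A B C
# 3 1 1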

Generate cross-section from panel data in R [duplicate]

This question already has answers here:
data.frame Group By column [duplicate]
(4 answers)
Closed 6 years ago.
I have a panel data file (long format) and I need to convert it to cross-sectional data. That is, I don't just need a transformation to the wide format; I need exactly one observation per individual, containing the mean of each variable.
Here's what I want to do: I have panel data (a number of observations for each individual) in a data frame, and I'm looking for an easy way in R to generate a new data frame that contains aggregated data for each individual, i.e. either the sum of all observations for each variable or their mean. It might also be interesting to get a measure of volatility.
For example I have a given data frame panel_data that contains panel data:
> individual <- c(1,1,2,2,3,3)
> var1 <- c(2,3,3,3,4,3)
> panel_data <- data.frame(individual,var1)
> panel_data
individual var1
1 1 2
2 1 3
3 2 3
4 2 3
5 3 4
6 3 3
The result should look like this:
> cross_data
individual var1
1 1 5
2 2 6
3 3 7
Now this is only an example. I need this feature in a number of varieties, the most important one probably being the intra-individual mean for each variable.
There are ways to do this using base R or the popular packages data.table or dplyr. Everyone has their own preference, and mine is dplyr.
You can very easily perform a variety of operations to summarise your data per individual. With dplyr syntax, you first group_by individual to specify that operations should be performed on the groups defined by the variable "individual". You can then summarise your groups using a function you specify.
Try the following:
library("dplyr")
panel_data %>%
group_by(individual) %>%
summarise(sum_var1 = sum(var1), mean_var1=mean(var1))
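Since the asker's most important variant is the intra-individual mean of every variable, here is a sketch using across() (available in dplyr >= 1.0) that generalises this to all non-grouping columns at once; sd() is one simple volatility measure:

library(dplyr)

# Mean of every non-grouping column per individual
panel_data %>%
  group_by(individual) %>%
  summarise(across(everything(), mean))

# Within-individual standard deviation as a simple volatility measure
panel_data %>%
  group_by(individual) %>%
  summarise(across(everything(), sd))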
Do not be put off by the %>% notation; it is just a convenient shortcut for chaining operations:
x %>% f is equivalent to f(x)
x %>% f(a) is equivalent to f(x, a)
x %>% f(a) %>% g(b) is equivalent to g(f(x, a), b)

R count occurrences of an element by groups [duplicate]

This question already has answers here:
Add column with order counts
(2 answers)
Count number of rows within each group
(17 answers)
Closed 7 years ago.
What is the easiest way to count the occurrences of an element in a vector or data.frame within every group?
I don't mean just counting the total (as other Stack Overflow questions ask) but giving a different number to every successive occurrence.
For example, for this simple data frame (though I will work with data frames with more columns):
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata, myorder = ave(rep(1, nrow(mydata)), mydata$A, FUN = cumsum))
and here is the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there any single command to do it, or a specialized package?
I want to later use tidyr's spread() function on the result.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want to know the total number of occurrences at the end, but the cumulative occurrences up to every element.
OK, my problem is a little bit more complex:
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know how to solve the first example I wrote above.
But what happens when I also want it by a second grouping variable?
Something like occurrences(letter) by group:
group letter "occurrences within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found a way with
ave(rep(1, nrow(mydata)), list(mydata$group, mydata$letter), FUN = cumsum)
though there should be something easier.
Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the operation run within the groups defined by the columns group and letter. .N is the number of rows within each group (if the by argument were empty, it would be the number of rows in the table), so within each sub-table the rows are indexed from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
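If you want the single command the asker hoped for, data.table also provides rowid(), which produces exactly this within-group sequence; a sketch, assuming data.table >= 1.9.8:

library(data.table)

# rowid() numbers successive occurrences within each (group, letter) combination
mydata[, myorder := rowid(group, letter)]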
Or a dplyr solution, which is pretty much the same:
library(dplyr)
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = 1:n())
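row_number() is the idiomatic dplyr spelling of 1:n(); a minimal sketch:

library(dplyr)

# row_number() gives the within-group row index directly
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = row_number()) %>%
  ungroup()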

Count by Year Based on Criterion in R

I am trying to count the number of rows with values fcoli > 15 and produce a vector of these counts by year.
Some sample data:
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
I have been able to count the number of rows one year at a time using:
nrow(subset(sample, sample$fcoli > 15 & sample$Year == 1996))
However I have not been able to use this criterion to produce counts for all the years at once. My actual data consists of over 20 years of data and so I would rather not have to manually iterate this code for each year.
Any suggestions? Thanks!
Here is a simple enough answer.
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
aggregate(fcoli~Year,FUN=length,data=sample[sample$fcoli>15,])
library(dplyr)
df1 %>%   # df1 is your data frame
  filter(fcoli > 15) %>%
  group_by(Year) %>%
  summarise(freq = n())
Source: local data frame [4 x 2]
Year freq
1 1996 1
2 1997 1
3 1998 1
4 1999 2
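One caveat, as a hedged note: with both answers, a year in which no value exceeds 15 disappears from the result entirely, because its rows are filtered out before grouping. If you need an explicit zero for such years, a base R sketch that keeps every year present in the data:

# tapply sums the logical condition within each year, so years with
# no qualifying rows are kept with a count of 0
with(sample, tapply(fcoli > 15, Year, sum))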
