Count by Year Based on Criterion in R

I am trying to count the number of rows where fcoli > 15 and produce a vector of these counts by year.
Some sample data:
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
I have been able to count the number of rows one year at a time using:
nrow(subset(sample, sample$fcoli > 15 & sample$Year == 1996))
However, I have not been able to use this criterion to produce counts for all the years at once. My actual data covers over 20 years, so I would rather not rerun this code manually for each year.
Any suggestions? Thanks!

Here is a simple enough answer.
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
aggregate(fcoli~Year,FUN=length,data=sample[sample$fcoli>15,])
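Since the question asks for a vector, here is a minimal base R sketch that returns a named vector instead of a data frame. It sums the logical condition within each year using tapply. Unlike subsetting first, a year with no rows above 15 would show up as 0 rather than being dropped. The name counts is just for illustration.

```r
Year  <- c(1996, 1996, 1997, 1997, 1998, 1999, 1999, 1999)
fcoli <- c(45, 13, 96, 10, 52, 53, 64, 5)
sample <- data.frame(Year, fcoli)

# TRUE counts as 1 and FALSE as 0, so summing the condition per year
# gives the number of rows with fcoli > 15 in that year
counts <- tapply(sample$fcoli > 15, sample$Year, sum)
counts
```

The result is named by year, so counts["1999"] looks up a single year.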

library(dplyr)
df1 %>% # df1 is your data frame
filter(fcoli>15) %>%
group_by(Year)%>%
summarise(freq=n())
Source: local data frame [4 x 2]
Year freq
1 1996 1
2 1997 1
3 1998 1
4 1999 2

Related

Remove rows in a dataframe based on number of element per factor in one column in R [duplicate]

I'm using the dplyr package in R and have grouped my data by 3 variables (Year, Site, Brood).
I want to get rid of groups made up of fewer than 3 rows. For example, in the following sample I would like to remove the rows for brood '2'. I have a lot of data to process, so while I could painstakingly do it by hand, it would be very helpful to automate it in R.
Year Site Brood Parents
1996 A 1 1
1996 A 1 1
1996 A 1 0
1996 A 1 0
1996 A 2 1
1996 A 2 0
1996 A 3 1
1996 A 3 1
1996 A 3 1
1996 A 3 0
1996 A 3 1
I hope this makes sense and thank you very much in advance for your help! I'm new to R and stackoverflow so apologies if the way I've worded this question isn't very good! Let me know if I need to provide any other information.
One way to do it is to use the magic n() function within filter:
library(dplyr)
my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))
my_data %>%
group_by(Year, Site, Brood) %>%
filter(n() >= 3)
The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
Throwing the data.table approach here to join the party:
library(data.table)
setDT(my_data)
my_data[ , if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
You can also do this using base R:
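temp <- read.csv(paste(folder, "test.csv", sep=""), header=TRUE, sep=",")
# Count the rows in each Year/Site/Brood group
matches <- aggregate(Parents ~ Year + Site + Brood, temp, FUN=length)
# After the merge, Parents.x holds the original values and Parents.y the group size
temp <- merge(temp, matches, by=c("Year","Site","Brood"))
temp <- temp[temp$Parents.y >= 3, c(1,2,3,4)]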
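A more compact base R variant, sketched here on the sample data from the question, uses ave to attach the group size to every row and then subsets on it. The names broods, group_size and kept are made up for this example.

```r
# Sample data from the question
broods <- data.frame(
  Year    = 1996,
  Site    = "A",
  Brood   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1)
)

# ave() applies FUN within each Year/Site/Brood group and returns
# one value per row, here the number of rows in that row's group
group_size <- ave(seq_len(nrow(broods)),
                  broods$Year, broods$Site, broods$Brood,
                  FUN = length)

# Keep only rows that belong to groups with at least 3 rows
kept <- broods[group_size >= 3, ]
kept
```

This avoids the merge step and keeps the original column names.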

Creating a 2 column df from a loop of unique values [duplicate]

I am looking to create a matrix babyList storing the number of unique names in the babynames package for every year in that package. I am only interested in using a loop to do this.
library(babynames)
# Create the data frame that will hold the output, one row per year
babyList <- data.frame(matrix(ncol=2, nrow=length(unique(babynames$year))))
colnames(babyList) <- c("years","unique names")
The for loop is giving me trouble. Here is what I know it will need:
Pseudo code
for (i in babynames$year) {
length(unique(babynames$name[babynames$year == i]))
}
How can I put this all together in a correct structure?
I would suggest using dplyr::summarise() instead:
library(babynames)
library(dplyr)
babynames %>%
group_by(year) %>%
summarise(unique_names = length(unique(name)))
which gives the number of unique names for each year in the babynames dataset:
# A tibble: 138 x 2
year unique_names
* <dbl> <int>
1 1880 1889
2 1881 1830
3 1882 2012
4 1883 1962
...
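Since the question specifically asks for a loop, here is a minimal sketch of one. To keep it self-contained it runs on a small made-up data frame, toy, which only mimics the year and name columns of babynames. Swapping toy for babynames should work the same way, assuming those column names.

```r
# Hypothetical stand-in for babynames (same column names: year, name)
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881, 1882),
  name = c("Mary", "Anna", "Mary", "Mary", "Emma", "Anna"),
  stringsAsFactors = FALSE
)

# Loop over each distinct year once, not over every row
years <- sort(unique(toy$year))

# Pre-allocate the result: one row per year
babyList <- data.frame(years = years, unique_names = NA_integer_)

for (k in seq_along(years)) {
  names_this_year <- toy$name[toy$year == years[k]]
  babyList$unique_names[k] <- length(unique(names_this_year))
}
babyList
```

The two key points are looping over unique years rather than babynames$year itself, and storing each result in the pre-allocated data frame instead of discarding it.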

Counting number of unique IDs per group at certain time points [duplicate]

I'm trying to find the number of participants per gene at different time points. I'm attempting to do this with a nested for loop, however, I can't seem to figure it out. Here's something I've been trying:
IgH_CDR3_post_challenge_unique<- select(IgH_CDR3_post_challenge_unique, cdr3aa, gene, ID, Timepoint)
participant_list <- unique(IgH_CDR3_post_challenge_unique$gene)
time_list<- unique(IgH_CDR3_post_challenge_unique$Timepoint)
for (c in participant_list)
{
for(i in time_list)
{
IgH_CDR3_post_challenge_unique <- filter(IgH_CDR3_post_challenge_unique, Timepoint==time_list[i] )
}
IgH_CDR3_post_challenge_unique$participant_per_gene[IgH_CDR3_post_challenge_unique$gene == c] <- length(unique(IgH_CDR3_post_challenge_unique$ID[IgH_CDR3_post_challenge_unique$gene == c]))
}
I would like the loops to end up calculating the number of participants per gene for each timepoint.
My data looks something like this:
gene Timepoint ID
1 C0 SP1
2 C1 SP2
1 C0 SP4
3 C0 SP2
This could be achieved without the use of a loop using dplyr. Loops tend to get slow and cumbersome when your data becomes large.
First, use group_by to group the data by the relevant columns (here Timepoint and gene), and then count the number of unique IDs within each group.
library(dplyr)
dat %>% group_by(Timepoint, gene) %>% summarise(n = length(unique(ID)))
For the four sample rows above, this gives one row per Timepoint and gene combination:
Timepoint gene n
C0 1 2
C0 3 1
C1 2 1
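If you would rather not load dplyr, the same count can be sketched in base R with aggregate. The data frame dat below is typed in from the sample rows in the question, and the column name n_participants is just an illustrative choice.

```r
# Sample rows from the question
dat <- data.frame(
  gene      = c(1, 2, 1, 3),
  Timepoint = c("C0", "C1", "C0", "C0"),
  ID        = c("SP1", "SP2", "SP4", "SP2")
)

# For every Timepoint/gene combination, count the distinct IDs
res <- aggregate(ID ~ Timepoint + gene, data = dat,
                 FUN = function(x) length(unique(x)))

# aggregate() keeps the name ID for the result column; rename it
names(res)[names(res) == "ID"] <- "n_participants"
res
```

Only combinations that actually occur in the data appear in the result.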

R: Selecting cases when a column condition is met [duplicate]

I have a data frame like this:
x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada
I need to select the rows for countries that appear at least 1000 times in the whole data frame. Let's say, for example, that only USA and Canada meet that condition. The problem is that I have more than 40 countries and 500000 cases, so I can't do it case by case.
I suppose I need a for loop to do this, but I can't figure out how.
First get the names of the countries you want. Then subset by those names.
tab <- table(df$country)
mycountries <- names(tab[tab >= 1000])
df <- df[df$country %in% mycountries, ]
With data.table, and assuming your data frame is named df, we can create a variable named count that holds the total number of rows for each country, and then subset to only those countries with at least 1000 rows:
library(data.table)
setDT(df)
df[ , count := .N, by=country]
df[count >= 1000]
One possible solution using dplyr:
library(dplyr)
df %>%
group_by(country) %>%
summarise(count = n()) %>%
filter(count >= 1000) %>%
arrange(desc(count))
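Note that summarise collapses the data to one row per country, so the pipeline above returns the counts rather than the original rows. A minimal sketch that keeps the rows themselves uses n() inside filter. The toy data and the lowered threshold min_rows are assumptions made so the effect is visible on five rows.

```r
library(dplyr)

# Toy data modelled on the question
df <- data.frame(
  x = c(1, 3, 0, 0, 4),
  country = c("USA", "Canada", "Spain", "USA", "Canada")
)

min_rows <- 2  # the question uses 1000

# Within each country, n() is the group size, so filter() keeps or
# drops all rows of a country together
kept <- df %>%
  group_by(country) %>%
  filter(n() >= min_rows) %>%
  ungroup()
kept
```

Here Spain appears only once and is dropped, while all USA and Canada rows are kept.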

Adding a column with consecutive numbers in R

I apologize if this question is abhorrently simple, but I'm looking for a way to just add a column of consecutive integers to a data frame (if my data frame has 200 observations, for example, starting with 1 for the first observation, and ending with 200 on the last one).
How can I do this?
For a dataframe (df) you could use
df$observation <- 1:nrow(df)
but if you have a matrix you would rather want to use
ma <- cbind(ma, "observation"=1:nrow(ma))
as using the first option will transform your data into a list.
Source: http://r.789695.n4.nabble.com/adding-column-of-ordered-numbers-to-matrix-td2250454.html
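One caveat with 1:nrow(df): if the data frame has zero rows, 1:0 counts down and gives c(1, 0) instead of an empty vector. A small sketch of the safer seq_len form follows. The data frames df and empty are made up for illustration.

```r
df <- data.frame(x = c(10, 20, 30))

# seq_len(n) gives 1, 2, ..., n and an empty vector when n is 0
df$observation <- seq_len(nrow(df))
df

# Zero-row data frame with the same columns
empty <- df[0, , drop = FALSE]

length(1:nrow(empty))         # 2, because 1:0 is c(1, 0)
length(seq_len(nrow(empty)))  # 0
```

For non-empty data both forms give the same result, so seq_len is a drop-in replacement.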
Or use dplyr.
library(dplyr)
df %>% mutate(observation = 1:n())
You might want it to be the first column of df.
df %>% mutate(observation = 1:n()) %>% select(observation, everything())
If you are using the tidyverse, tibble::rowid_to_column is probably what you need.
library(tidyverse)
dat <- tibble(x=c(10, 20, 30),
y=c('alpha', 'beta', 'gamma'))
dat %>% rowid_to_column(var='observation')
# A tibble: 3 x 3
observation x y
<int> <dbl> <chr>
1 1 10 alpha
2 2 20 beta
3 3 30 gamma
