R: Selecting cases when a column condition is met [duplicate]

I have a data frame like this:
x y z country
1 4 1 USA
3 1 1 Canada
0 1 1 Spain
0 2 3 USA
4 1 1 Canada
I need to select the rows whose country appears at least 1000 times in the whole data frame. Let's say, for example, that only USA and Canada meet that condition. The problem is that I have more than 40 countries and 500,000 cases, so I can't do it case by case.
I suppose I need a for loop to do this, but I can't figure out how.

First get the names of the countries you want. Then subset by those names.
tab <- table(df$country)                   # counts per country
mycountries <- names(tab[tab >= 1000])     # countries appearing at least 1000 times
df <- df[df$country %in% mycountries, ]    # keep only rows for those countries
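As a quick sanity check of that pattern on the toy data above (using a threshold of 2 in place of 1000, since the sample is tiny):
df <- data.frame(x = c(1, 3, 0, 0, 4),
                 y = c(4, 1, 1, 2, 1),
                 z = c(1, 1, 1, 3, 1),
                 country = c("USA", "Canada", "Spain", "USA", "Canada"))
tab <- table(df$country)
names(tab[tab >= 2])   # "Canada" "USA" -- Spain appears only once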

With data.table, assuming your data frame is named df, we can create a variable named count holding the total number of rows for each country, and then subset to the countries with at least 1000 rows:
library(data.table)
setDT(df)                          # convert to data.table by reference
df[, count := .N, by = country]    # .N is the number of rows in each group
df[count >= 1000]
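Note that this leaves the helper count column in the result. A one-step alternative that avoids the extra column (the same .SD idiom used in an answer further below):
df[, if (.N >= 1000) .SD, by = country]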

One possible solution using dplyr, which tabulates the countries meeting the condition (note that this returns one count per country rather than the subset of rows):
library(dplyr)
df %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count >= 1000) %>%
  arrange(desc(count))
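To get the actual rows back instead of the counts, a minimal dplyr sketch:
df %>%
  group_by(country) %>%
  filter(n() >= 1000) %>%   # keeps whole groups with at least 1000 rows
  ungroup()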

Related

Remove rows in a dataframe based on number of element per factor in one column in R [duplicate]

I'm using the dplyr package in R and have grouped my data by 3 variables (Year, Site, Brood).
I want to get rid of groups made up of fewer than 3 rows. For example, in the following sample I would like to remove the rows for brood '2'. I have a lot of data, so while I could painstakingly do it by hand, it would be very helpful to automate it in R.
Year Site Brood Parents
1996 A 1 1
1996 A 1 1
1996 A 1 0
1996 A 1 0
1996 A 2 1
1996 A 2 0
1996 A 3 1
1996 A 3 1
1996 A 3 1
1996 A 3 0
1996 A 3 1
I hope this makes sense and thank you very much in advance for your help! I'm new to R and Stack Overflow, so apologies if the way I've worded this question isn't very good. Let me know if I need to provide any other information.
One way to do it is to use the magic n() function within filter:
library(dplyr)
my_data <- data.frame(Year=1996, Site="A", Brood=c(1,1,2,2,2))
my_data %>%
  group_by(Year, Site, Brood) %>%
  filter(n() >= 3)
The n() function gives the number of rows in the current group (or the number of rows total if there is no grouping).
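A tiny sketch of that distinction, using my_data from above:
my_data %>% summarise(rows = n())                        # 5: no grouping, so all rows
my_data %>% group_by(Brood) %>% summarise(rows = n())    # 2 rows for brood 1, 3 for brood 2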
Throwing the data.table approach here to join the party:
library(data.table)
setDT(my_data)
my_data[ , if (.N >= 3L) .SD, by = .(Year, Site, Brood)]
You can also do this using base R. Here the question's sample data is entered directly instead of being read from a file:
temp <- data.frame(Year = 1996, Site = "A",
                   Brood = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
                   Parents = c(1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1))
matches <- aggregate(Parents ~ Year + Site + Brood, temp, FUN = "length")  # group sizes
temp <- merge(temp, matches, by = c("Year", "Site", "Brood"))              # attach sizes to each row
temp <- temp[temp$Parents.y >= 3, c(1, 2, 3, 4)]                           # keep groups with at least 3 rows
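For comparison, a shorter base R sketch with ave(), applied to the same temp data frame as defined above (before the merge-based subsetting):
temp[ave(temp$Parents, temp$Year, temp$Site, temp$Brood, FUN = length) >= 3, ]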

Is there a command that gives, for a specific numeric column, how many times each number occurs? [duplicate]

Given a column like number_of_columns_with_text below:
df <- data.frame(id = c(1,2,3,4,5,6), number_of_columns_with_text = c(3,2,1,3,1,1))
Is there a command that could give the frequency of each number in this column (how many times each number occurs)?
Example output
data.frame(number = c(1,2,3), volume = c(3,1,2))
What you might be looking for is table(...)
> table(df$number_of_columns_with_text)
1 2 3
3 1 2
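If you want exactly the data.frame shape from the example output, the table result can be converted (a small sketch; the first column comes back as a factor, hence the conversion):
out <- as.data.frame(table(df$number_of_columns_with_text))
names(out) <- c("number", "volume")
out$number <- as.numeric(as.character(out$number))
out
#   number volume
# 1      1      3
# 2      2      1
# 3      3      2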
In dplyr, you can first group_by the variable you want to tabulate and then use n() to count the frequencies of the distinct values:
library(dplyr)
df %>%
  group_by(number_of_columns_with_text) %>%
  summarise(volume = n())
# A tibble: 3 × 2
  number_of_columns_with_text volume
                        <dbl>  <int>
1                           1      3
2                           2      1
3                           3      2
Using dplyr (here loaded via the tidyverse):
library(tidyverse)
df %>%
  group_by(number_of_columns_with_text) %>%
  count()
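count() also accepts the variable directly, so the group_by step can be dropped; its name argument sets the output column:
df %>% count(number_of_columns_with_text, name = "volume")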

How to obtain the index of the max value of a column for each level of another column R [duplicate]

Given a data frame like this:
COUNTRY CITIZENS SURFACE
A 20000000 40
A 80000000 78
B 3000000 120
B 200000 27
C 10000000 56
A 5600000 20
C 10000000 30
B 2500000 20
I would like to subset the data frame to just the rows with the maximum value of CITIZENS for each level of COUNTRY.
I was able to obtain the maximum value of CITIZENS for each level of COUNTRY with dplyr and summarize, but I am not able to extract the corresponding SURFACE value for each maximum.
Do you know how I can achieve that?
We can use slice after grouping by 'COUNTRY':
library(dplyr)
df1 %>%
  group_by(COUNTRY) %>%
  slice(which.max(CITIZENS))
Or with filter (note that, unlike slice(which.max(...)), this keeps all tied rows; in the sample data, country C has two rows sharing the maximum CITIZENS value):
df1 %>%
  group_by(COUNTRY) %>%
  filter(CITIZENS == max(CITIZENS))
Or with data.table
library(data.table)
setDT(df1)[, .SD[CITIZENS == max(CITIZENS)], COUNTRY]
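For completeness, a base R sketch of the same idea (it keeps tied rows, like the filter version):
df1[df1$CITIZENS == ave(df1$CITIZENS, df1$COUNTRY, FUN = max), ]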

Find last values by condition [duplicate]

I have a very large data frame that I need to subset by last values. I know that the data.table library includes the last() function which returns the last value of an array, but what I need is to subset foo by the last value in id for every separate value in track. Values in id are consecutive integers, but the last values will be different for every track.
> head(foo)
track id coords.x coords.y
1 0 0 -79.90732 43.26133
2 0 1 -79.90733 43.26124
3 0 2 -79.90733 43.26124
4 0 3 -79.90733 43.26124
5 0 4 -79.90725 43.26121
6 0 5 -79.90725 43.26121
The output would look something like this.
track id coords.x coords.y
1 0 57 -79.90756 43.26123
2 1 98 -79.90777 43.26231
3 2 61 -79.90716 43.26200
... and so on
How would one apply the last() function (or another function like tail()) to produce this output?
We can try with dplyr, grouping by track and keeping only the last row of every group (the data frame is called foo in the question):
library(dplyr)
foo %>%
  group_by(track) %>%
  filter(row_number() == n())
We can use data.table. Convert the data frame to a data.table (setDT(foo)) and, grouped by 'track', take the last row with tail:
library(data.table)
setDT(foo)[, tail(.SD, 1), by = track]
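Since the question mentions data.table's last(), that also works directly on .SD (a one-row-per-group sketch; foo is already a data.table after setDT() above):
foo[, last(.SD), by = track]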
Since the question also notes that the id values are consecutive integers within each track, we can build a logical index with diff over the whole table: wherever the id sequence breaks (and at the very last row), the current row is the last one of its track.
foo[c(diff(id) != 1, TRUE)]   # TRUE where id resets for a new track, plus the final row
Or we can do this using base R itself:
foo[!duplicated(foo$track, fromLast = TRUE), ]
Or another option is dplyr:
library(dplyr)
foo %>%
  group_by(track) %>%
  slice(n())
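In newer dplyr (1.0.0 and later), slice_tail() expresses the same thing directly:
foo %>%
  group_by(track) %>%
  slice_tail(n = 1) %>%
  ungroup()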

Count by Year Based on Criterion in R

I am trying to count the number of rows with fcoli > 15 and produce these counts broken down by year.
Some sample data:
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
I have been able to count the number of rows one year at a time using:
nrow(subset(sample, sample$fcoli > 15 & sample$Year == 1996))
However I have not been able to use this criterion to produce counts for all the years at once. My actual data consists of over 20 years of data and so I would rather not have to manually iterate this code for each year.
Any suggestions? Thanks!
Here is a simple enough answer.
Year <- c(1996,1996,1997,1997,1998,1999,1999,1999)
fcoli <- c(45,13,96,10,52,53,64,5)
sample <- data.frame(Year,fcoli)
aggregate(fcoli ~ Year, FUN = length, data = sample[sample$fcoli > 15, ])
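Note that years with no qualifying rows simply drop out of the result. A quick table() sketch behaves the same way:
table(sample$Year[sample$fcoli > 15])
# 1996 1997 1998 1999
#    1    1    1    2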
Or with dplyr:
library(dplyr)
sample %>%              # `sample` is your data frame
  filter(fcoli > 15) %>%
  group_by(Year) %>%
  summarise(freq = n())
Source: local data frame [4 x 2]
   Year freq
1  1996    1
2  1997    1
3  1998    1
4  1999    2
