Count elements (rows) in group of a data table [duplicate] - r

This question already has answers here:
Count number of rows matching a criteria
(9 answers)
Closed 4 years ago.
I have a table (dt) with several columns.
   X__1 First Name Last Name Gender Country       Age Date       Id
1: 1    Dulce      Abril     Female United States 32  15/10/2017 1562
2: 2    Mara       Hashimoto Female Great Britain 25  16/08/2016 1582
3: 3    Philip     Gent      Male   France        36  21/05/2015 2587
4: 4    Kathleen   Hanner    Female United States 25  15/10/2017 3549
5: 5    Nereida    Magwood   Female United States 58  16/08/2016 2468
I want to count the number of rows that have Country = "France" and Age > 32.
I used the following command, which returns the matching rows, but I need the number of rows in that result. What is the command to do it?
dt[Country == 'France' & Age > 32]

Use the function nrow():
nrow(dt[Country == 'France' & Age > 32])

nrow() is simplest, but if you want to do it using data.table syntax:
dt[Country == 'France' & Age > 32, .N]
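For reference, here is a minimal self-contained sketch of both approaches, assuming the data.table package is installed. The table retypes only the Country and Age columns of the question's sample rows; the per-group count at the end is an illustrative extension and not part of the original answers.

```r
library(data.table)

# Sample rows from the question (only the columns used in the filter).
dt <- data.table(
  Country = c("United States", "Great Britain", "France",
              "United States", "United States"),
  Age     = c(32, 25, 36, 25, 58)
)

# Filter first, then count the rows of the result.
n_base <- nrow(dt[Country == "France" & Age > 32])

# data.table style: .N is the number of rows selected by the filter.
n_dt <- dt[Country == "France" & Age > 32, .N]

# Row counts per group: .N is evaluated once for each Country.
per_country <- dt[, .N, by = Country]

print(n_base)
print(n_dt)
print(per_country)
```

Both counts should agree; the grouped form is probably what you want once you need the count for every country rather than a single one.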

Related

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
  species                  BIN          collectors                                  country                          grade species_frequency
1 Poecilothrissa congica   BOLD:AAF7519 mljs et al,                                 Democratic Republic of the Congo NA    2
2 Acanthurus triostegus    BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India                            NA    54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell                            South Africa                     NA    15
4 Pomadasys commersonnii   BOLD:AAD1338 Allan D. Connell                            South Africa                     NA    12
5 Secutor insidiator       BOLD:AAB2487 Allan D. Connell                            South Africa                     NA    18
6 Sebastes macdonaldi      BOLD:AAJ7419 Merit McCrea                                United States                    NA    3
  BIN_per_species collector_per_species countries_per_species species_per_bin
1 2               1                     1                     1
2 1               21                    15                    1
3 3               6                     6                     1
4 1               2                     1                     1
5 4               5                     4                     2
6 1               1                     1                     1
And after filling the grade column I have something like this (fish19)
  species                  BIN          collectors                                  country                          grade species_frequency
1 Poecilothrissa congica   BOLD:AAF7519 mljs et al,                                 Democratic Republic of the Congo D     2
2 Acanthurus triostegus    BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India                            A     54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell                            South Africa                     C     15
4 Pomadasys commersonnii   BOLD:AAD1338 Allan D. Connell                            South Africa                     A     12
5 Secutor insidiator       BOLD:AAB2487 Allan D. Connell                            South Africa                     E     18
6 Sebastes macdonaldi      BOLD:AAJ7419 Merit McCrea                                United States                    B     3
  BIN_per_species collector_per_species countries_per_species species_per_bin
1 2               1                     1                     1
2 1               21                    15                    1
3 3               6                     6                     1
4 1               2                     1                     1
5 4               5                     4                     2
6 1               1                     1                     1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so every specimen of a species gets the same grade. The problem I'm having is that some rows belonging to the same species end up with different grades, especially in the case of the grades "C" and "E". What I want to incorporate into my ifelse function is: whenever a species has "C" in one row and "E" in another, change every "C" for that species to "E". If one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades = function(fish18) {
  fish19 <- fish18 %>%
    mutate(grade = ifelse(species_frequency < 3, "D",
      ifelse(BIN_per_species == 1 & (collector_per_species > 1 | countries_per_species > 1), "A",
      ifelse(BIN_per_species == 1 & collector_per_species == 1 | countries_per_species == 1, "B",
      ifelse(BIN_per_species > 1 & species_per_bin == 1, "C",
      ifelse(species_per_bin > 1, "E",
      ifelse(fish19$species[fish19$grade == "E"] %in% fish19$species[fish19$grade == "C"] == TRUE, "E", NA)))))))
  assign('fish19', fish19, envir = .GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades = function(fish18) {
  fish19 <- fish18 %>%
    mutate(grade = ifelse(species_frequency < 3, "D",
      ifelse(BIN_per_species == 1 & (collector_per_species > 1 | countries_per_species > 1), "A",
      ifelse(BIN_per_species == 1 & collector_per_species == 1 | countries_per_species == 1, "B",
      ifelse(BIN_per_species > 1 & species_per_bin == 1, "C",
      ifelse(species_per_bin > 1, "E",
      ifelse(fish19$species[fish19$grade == "E"] == fish19$species[fish19$grade == "C"], "E", NA)))))))
  assign('fish19', fish19, envir = .GlobalEnv)
}
assign_grades(fish18)
Neither option worked. The desired output is that if one occurrence of a species name has the grade "E" assigned to it, so do all other occurrences with that same species name.
I'm sorry if this was confusing, but I tried to be as clear as I could. Thank you in advance for any responses.
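Kind of a long-winded answer, but:
library(dplyr)
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
# count the "E" grades per species and join that count back onto every row
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could use ifelse() on sum_e > 0 to set the grade to "E" for every row of a species that has at least one "E".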
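The same idea can be sketched without the join by checking for an "E" inside each species group. This is only a sketch on the toy data from the answer, assuming dplyr is installed; it follows the question's narrower rule (promote "C" to "E") and leaves other grades such as "D" untouched, which is an assumption about what the asker wants.

```r
library(dplyr)

# Toy data from the answer above.
dat <- data.frame(
  species = c("a", "b", "c", "a", "a", "b"),
  grade   = c("E", "E", "C", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# Within each species, promote "C" to "E" whenever any row of that
# species is already graded "E"; every other grade is left as it is.
fixed <- dat %>%
  group_by(species) %>%
  mutate(grade = ifelse(grade == "C" & any(grade == "E"), "E", grade)) %>%
  ungroup()

print(fixed)
```

Species "a" ends up with "E" on all three rows, while species "c" keeps its "C" because it has no "E" row. To force every row of a species to "E", the condition `grade == "C" &` could be dropped.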

How to subset your dataframe to only keep the first duplicate? [duplicate]

This question already has answers here:
Remove duplicates based on 2nd column condition
(4 answers)
Closed 4 years ago.
I have a dataframe with multiple variables, and I am interested in how to subset it so that it only includes the first duplicate.
> head(occurrence)
  userId       occurrence profile.birthday profile.gender postDate count
1 100469891698 6          47               Female         583 days 0
2 100469891698 6          47               Female         55 days  0
3 100469891698 6          47               Female         481 days 0
4 100469891698 6          47               Female         583 days 0
5 100469891698 6          47               Female         583 days 0
6 100469891698 6          47               Female         583 days 0
Here you can see the dataframe. The 'occurrence' column counts how many times the same userId has occurred. I have tried the following code to remove duplicates:
occurrence <- occurrence[!duplicated(occurrence$userId),]
However, this removes "random" duplicates. I want to keep the row that is the oldest by postDate for each userId. So, for example, the first row should look something like this:
userId occurrence profile.birthday profile.gender postDate count
1 100469891698 6 47 Female 583 days 0
Thank you for your help!
Did you try ordering first, like this:
occurrence <- occurrence[order(occurrence$userId, occurrence$postDate, decreasing=TRUE),]
occurrenceClean <- occurrence[!duplicated(occurrence$userId),]
occurrenceClean
You could use dplyr for this: after filtering on the max postDate, use distinct() to remove all duplicate rows. Of course, if the rows with the max postDate differ in other columns, you will get all of those records.
library(dplyr)
occurrence <- occurrence %>%
group_by(userId) %>%
filter(postDate == max(postDate)) %>%
distinct()
occurrence
# A tibble: 1 x 6
# Groups: userId [1]
userId occurrence profile.birthday profile.gender postDate count
<dbl> <int> <int> <chr> <chr> <int>
1 100469891698 6 47 Female 583 days 0
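Both answers depend on how postDate is compared. The printed tibble shows it as <chr>, and in that case the comparison is alphabetical, so "55 days" would sort after "481 days". A hedged base-R sketch that converts the text to a number first is shown below; the second userId and the helper column `days` are made up for illustration.

```r
# Hypothetical sample mirroring the question; postDate is stored as text.
occurrence <- data.frame(
  userId   = c(100469891698, 100469891698, 100469891698, 200000000001),
  postDate = c("583 days", "55 days", "481 days", "12 days"),
  stringsAsFactors = FALSE
)

# Extract the number of days so that sorting is numeric, not alphabetical.
occurrence$days <- as.numeric(sub(" days$", "", occurrence$postDate))

# Sort by user with the oldest post first, then keep each user's first row.
sorted <- occurrence[order(occurrence$userId, -occurrence$days), ]
oldest <- sorted[!duplicated(sorted$userId), ]

print(oldest)
```

If postDate is already a difftime or numeric column, the conversion step is unnecessary and the ordering can use the column directly.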

Show proportion with multiple conditions in R

I have:
> dataframe
GENDER CITY  NUMBER
Male   NY    1
Female Paris 2
Male   Paris 1
Female NY
Female NY    2
Male   Paris 2
Male   Paris
Male   Paris 1
Female NY    2
Female Paris 1
And I would like to return the proportion of Male and Female in both cities (NY first) that have 2 in the third column, NUMBER. The data frame is much longer than my example, and some rows have an empty NUMBER value.
Technically speaking, I want to show a proportion under two conditions (and more conditions in the future).
I tried:
prop.table(table(dataframe$GENDER, dataframe$CITY == 'NY' & dataframe$NUMBER == 2))
But this gives me the wrong results.
The expected output (or anything close to this):
NY
Male 0
Female 20
Do you have any idea how I can get this?
The best would be to have a column per city
Use the data.table package, which makes this much easier. Its syntax is SQL-like and it is very fast if your data grows. The code should be:
library(data.table)
df <- data.table(yourdataframe)
df[, summary(GENDER), by = CITY]
If GENDER is a factor, the output gives the count of each value per city.
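To get percentages with one column per city, as the question asks, a base-R sketch follows. It assumes the denominator is the total number of rows, which is inferred from the expected value of 20 for Female in NY (2 of the 10 sample rows); empty NUMBER cells are represented as NA.

```r
# Sample data from the question; empty NUMBER cells are entered as NA.
dataframe <- data.frame(
  GENDER = c("Male", "Female", "Male", "Female", "Female",
             "Male", "Male", "Male", "Female", "Female"),
  CITY   = c("NY", "Paris", "Paris", "NY", "NY",
             "Paris", "Paris", "Paris", "NY", "Paris"),
  NUMBER = c(1, 2, 1, NA, 2, 2, NA, 1, 2, 1),
  stringsAsFactors = TRUE  # factors keep empty gender/city cells in the table
)

# Keep rows where NUMBER is 2; rows with NA are dropped explicitly.
hits <- dataframe[!is.na(dataframe$NUMBER) & dataframe$NUMBER == 2, ]

# Count by gender and city, then express as a percentage of all rows.
pct <- 100 * table(hits$GENDER, hits$CITY) / nrow(dataframe)

print(pct)
```

Further conditions can be added to the row filter with `&`. Dividing by the number of rows in each city instead would give within-city proportions, if that is what is needed.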

Sorting output of tally / count (dplyr) [duplicate]

This question already has answers here:
Arrange a grouped_df by group variable not working
(2 answers)
Closed 6 years ago.
This should be easy, but I can't find a straightforward way to achieve it. My dataset looks like the following:
DisplayName Nationality Gender Startyear
1 Alfred H. Barr, Jr. American Male 1929
2 Paul C\216zanne French Male 1929
3 Paul Gauguin French Male 1929
4 Vincent van Gogh Dutch Male 1929
5 Georges-Pierre Seurat French Male 1929
6 Charles Burchfield American Male 1929
7 Charles Demuth American Male 1929
8 Preston Dickinson American Male 1929
9 Lyonel Feininger American Male 1929
10 George Overbury ("Pop") Hart American Male 1929
...
I want to group by DisplayName and Gender, and get the counts for each of the names (they are repeated several times in the list, with different year information).
The following 2 commands give me the same output, but they are not sorted by the count output "n". Any ideas on how to achieve this?
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
group_by(DisplayName, Gender) %>%
tally(sort = T) %>%
arrange(desc(n))
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
count(DisplayName, Gender, sort = T)
DisplayName Gender n
(chr) (chr) (int)
1 A. F. Sherman Male 1
2 A. G. Fronzoni Male 2
3 A. Lawrence Kocher Male 3
4 A. M. Cassandre Male 21
5 A. R. De Ycaza Female 1
6 A.R. Penck (Ralf Winkler) Male 20
7 Aaron Siskind Male 25
8 Abigail Perlmutter Female 1
9 Abraham Rattner Male 5
10 Abraham Walkowitz Male 17
.. ... ... ...
Your data is grouped by two variables, so after tally() the data frame is still grouped by DisplayName. arrange(desc(n)) is therefore sorting, but within each DisplayName group. If you want to sort the whole data frame by column n, just ungroup before sorting. Try this:
artists <- data %>%
filter(!is.na(Gender) & Gender != "NULL") %>%
group_by(DisplayName, Gender) %>%
tally(sort = T) %>%
ungroup() %>%
arrange(desc(n))
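A small runnable check of the ungroup-then-arrange pattern, assuming dplyr is installed. The miniature data frame is hypothetical (one row per appearance of an artist) and only stands in for the real dataset.

```r
library(dplyr)

# Hypothetical miniature of the artists data: one row per appearance.
data <- data.frame(
  DisplayName = c("Aaron Siskind", "A. F. Sherman", "Aaron Siskind",
                  "Abraham Rattner", "Aaron Siskind", "Abraham Rattner"),
  Gender      = "Male",
  stringsAsFactors = FALSE
)

artists <- data %>%
  filter(!is.na(Gender) & Gender != "NULL") %>%
  group_by(DisplayName, Gender) %>%
  tally() %>%
  ungroup() %>%      # drop the remaining DisplayName grouping
  arrange(desc(n))   # sort the whole table by n, largest first

print(artists)
```

The most frequent name should come first in the result regardless of alphabetical order.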

Sub-setting with dplyr [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
Dataframe has the columns:
State Sex Year Name Number Percent
For every state and every year, I need to keep one male and one female row: the one with the highest percentage.
Example:
Washington M 2011 John 34 0.46
Washington F 2011 Mary 42 0.67
Washington M 2012 John 46 0.46
Washington F 2012 Mary 64 0.67
and so on for every State and year.
You can try
df %>%
group_by(State, Year, Sex) %>%
slice(which.max(Percent))
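A self-contained sketch of that answer, assuming dplyr is installed. The rows are hypothetical and only follow the question's column layout; the extra names (Liam, Emma) and their values are made up so that each group has more than one candidate.

```r
library(dplyr)

# Hypothetical rows in the question's layout; the values are made up.
df <- data.frame(
  State   = "Washington",
  Sex     = c("M", "M", "F", "F", "M", "F"),
  Year    = c(2011, 2011, 2011, 2011, 2012, 2012),
  Name    = c("John", "Liam", "Mary", "Emma", "John", "Mary"),
  Number  = c(34, 20, 42, 30, 46, 64),
  Percent = c(0.46, 0.27, 0.67, 0.48, 0.46, 0.67),
  stringsAsFactors = FALSE
)

top <- df %>%
  group_by(State, Year, Sex) %>%
  slice(which.max(Percent)) %>%  # first row with the highest Percent per group
  ungroup()

print(top)
```

Note that which.max() returns only the first maximum, so ties within a group are resolved by row order; filter(Percent == max(Percent)) would keep all tied rows instead.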
