I have a dataset of regional patents. I want to count how many Appln_id values have more than one Person_id and how many have only one Person_id.
Appln_id 3 3 3 10 10 10 10 2 4 4
Person_id 23 22 24 49 50 55 51 101 122 104
Here, Appln_id 3 has three different Person_id values (23, 22, 24) and Appln_id 2 has only one Person_id (101). So I want to count how many Appln_id values have more than one Person_id and how many have only one.
Count the number of unique Person_id values for each Appln_id:
library(dplyr)
result <- df %>% group_by(Appln_id) %>% summarise(n = n_distinct(Person_id))
result
# Appln_id n
#* <int> <int>
#1 2 1
#2 3 3
#3 4 2
#4 10 4
Now you can count how many of them have only 1 Person_id and how many of them have more than that.
sum(result$n == 1)
#[1] 1
sum(result$n > 1)
#[1] 3
data
df <- structure(list(Appln_id = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L,
4L, 4L), Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L,
122L, 104L)), class = "data.frame", row.names = c(NA, -10L))
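For readers who prefer not to load a package, the same two counts can be sketched in base R. This is a minimal sketch that rebuilds the example data; the names n_per_appln, single and multiple are introduced here for illustration only.

```r
df <- data.frame(
  Appln_id  = c(3, 3, 3, 10, 10, 10, 10, 2, 4, 4),
  Person_id = c(23, 22, 24, 49, 50, 55, 51, 101, 122, 104)
)

# Number of distinct Person_id values within each Appln_id
n_per_appln <- tapply(df$Person_id, df$Appln_id, function(x) length(unique(x)))

# How many applications have exactly one person, and how many have more
single   <- sum(n_per_appln == 1)
multiple <- sum(n_per_appln > 1)
```

Here single and multiple hold the two tallies asked for in the question.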
We can use data.table
library(data.table)
setDT(df)[, .(n = uniqueN(Person_id)), by = Appln_id]
I have a df that has 2 columns - game_id and score
game id score
1 55
1 59
1 62
1 71
2 74
2 65
2 89
2 98
I would want the result to be
game id score
1 55
2 74
I'm just trying to grab the first row for each game id.
So far I have tried a for loop with an if condition to delete the other rows.
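You can use first() to keep the rows whose score equals the first score in each group (note that this keeps more than one row if a later score in the same group ties with the first):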
library(dplyr)
df %>%
group_by(game_id) %>%
filter(score == first(score))
Data:
df <- data.frame(
game_id = c(1,1,1,1,2,2,2,2),
score = c(55,59,62,71,74,65,89,98)
)
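If scores can repeat within a game, selecting by row position rather than by score value avoids returning tied rows. A minimal base R sketch, assuming the rows are already in the desired order; the name first_rows and the repeated score in the sample data are illustrative only.

```r
df <- data.frame(
  game_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  score   = c(55, 59, 55, 71, 74, 65, 89, 98)  # 55 repeats in game 1 on purpose
)

# duplicated() flags every repeat of a game_id, so negating it keeps
# only the first row seen for each game_id
first_rows <- df[!duplicated(df$game_id), ]
```

With dplyr, `df %>% group_by(game_id) %>% slice(1)` should select by position in the same way.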
A base R approach using aggregate
aggregate(. ~ `game id`, df, "[", 1)
game id score
1 1 55
2 2 74
Data
df <- structure(list(`game id` = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
score = c(55L, 59L, 62L, 71L, 74L, 65L, 89L, 98L)),
class = "data.frame", row.names = c(NA,
-8L))
My data.frame includes the results from a survey and looks like this:
date id age gender ...
01-02 99 20 1 ...
01-20 52 34 2 ...
01-23 47 20 1 ...
01-02 100 56 1 ...
02-05 99 20 1 ...
02-17 78 18 2 ...
02-28 47 20 1 ...
Users are allowed to take part in the survey each month, up to 10 times, so some users' personal data appears more than once in the table.
Now to my problem:
How can I get the mean (e.g. of age) over all users who attended the survey? If I just use mean(df$age), those who attended more than once will obviously be overrepresented.
How can I get a table counting the users who attended once, twice, ... ten times?
e.g.:
number of participations    number of users
1                           2,047
2                           23,127
3                           50,000
I haven't found a solution for this, so I'm grateful for any help.
Thanks in advance!
To get the average age of the participants, keep only one row per unique id and then calculate the average.
In dplyr you can do this with distinct and summarise.
library(dplyr)
df %>%
distinct(id, .keep_all = TRUE) %>%
summarise(avg_age = mean(age))
# avg_age
#1 29.6
To count how many times an individual responded to the survey you can use count
df %>% count(id, name = 'count')
# id count
#1 47 2
#2 52 1
#3 78 1
#4 99 2
#5 100 1
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(date = c("01-02", "01-20", "01-23", "01-02", "02-05",
"02-17", "02-28"), id = c(99L, 52L, 47L, 100L, 99L, 78L, 47L),
age = c(20L, 34L, 20L, 56L, 20L, 18L, 20L), gender = c(1L,
2L, 1L, 1L, 1L, 2L, 1L)), row.names = c(NA, -7L), class = "data.frame")
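The per-id counts above can be tabulated once more to get the number of users per participation count, which is the second table the question asks for. A minimal base R sketch on the sample data; the names per_user and participation_dist are introduced here for illustration.

```r
df <- data.frame(
  id  = c(99, 52, 47, 100, 99, 78, 47),
  age = c(20, 34, 20, 56, 20, 18, 20)
)

# Inner table(): number of rows (participations) per id
per_user <- table(df$id)

# Outer table(): number of users for each participation count
participation_dist <- table(per_user)
```

With the dplyr pipeline, chaining a second count() on the count column, e.g. `df %>% count(id, name = 'count') %>% count(count, name = 'users')`, should give the same distribution.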
ID Amount Previous
1 10 15
1 10 13
2 20 18
2 20 24
3 5 7
3 5 6
I want to remove the duplicate rows from the data frame above, where ID and Amount match. The values in the Previous column differ between the duplicates; when deciding which row to keep, I'd like the one where the Previous value is higher.
This would look like:
ID Amount Previous
1 10 15
2 20 24
3 5 7
One option is distinct() on the columns 'ID' and 'Amount' (after arranging the dataset so that the highest 'Previous' comes first within each group), specifying .keep_all = TRUE to keep all the other columns of the first row for each distinct combination:
library(dplyr)
df1 %>%
arrange(ID, Amount, desc(Previous)) %>%
distinct(ID, Amount, .keep_all = TRUE)
# ID Amount Previous
#1 1 10 15
#2 2 20 24
#3 3 5 7
Or with duplicated from base R applied on the 'ID', 'Amount' to create a logical vector and use that to subset the rows of the dataset
df2 <- df1[with(df1, order(ID, Amount, -Previous)),]
df2[!duplicated(df2[c('ID', 'Amount')]),]
# ID Amount Previous
#1 1 10 15
#4 2 20 24
#5 3 5 7
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Amount = c(10L,
10L, 20L, 20L, 5L, 5L), Previous = c(15L, 13L, 18L, 24L, 7L,
6L)), class = "data.frame", row.names = c(NA, -6L))
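Another way to express "keep the row with the highest Previous per ID/Amount pair" without sorting is to compare each row against its group maximum. A minimal base R sketch; the names key, group_max and kept are illustrative, and the grouping key is built with paste() as a simplifying assumption.

```r
df1 <- data.frame(
  ID       = c(1, 1, 2, 2, 3, 3),
  Amount   = c(10, 10, 20, 20, 5, 5),
  Previous = c(15, 13, 18, 24, 7, 6)
)

# One grouping key per ID/Amount combination
key <- paste(df1$ID, df1$Amount)

# ave() returns, for every row, the maximum Previous within its group
group_max <- ave(df1$Previous, key, FUN = max)

# Keep the rows that hold their group's maximum
kept <- df1[df1$Previous == group_max, ]
```

Unlike the sort-then-deduplicate approaches, this keeps every row that ties for the maximum, so it may return more than one row per group if Previous values repeat.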
The problem I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
- The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included.
- Fred does not have a Jan_18_Weight, therefore both his Jan_18_score and Jan_18_Weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
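As a quick check of the arithmetic for Class 1 in Jan_18, the hand calculation can be compared with R's built-in weighted.mean(). This is only a sketch of the single worked case; the variable names are illustrative.

```r
# Jan_18 scores and weights for Adam, Char and Greg (Fred is excluded)
scores  <- c(3, 5, 2)
weights <- c(150, 30, 80)

# Weighted mean written out by hand: 760 / 260
by_hand <- sum(scores * weights) / sum(weights)

# The same quantity via the helper from the stats package
built_in <- weighted.mean(scores, weights)
```

Both values come to roughly 2.923, matching the figure given above.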
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
library(dplyr)
library(tidyr)

df %>%
  mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%  # NA weights contribute nothing
  gather(month, value, -Student, -Class) %>%                   # long format: one row per student and column
  group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
  summarize(
    weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = TRUE)
  ) %>%
  ungroup() %>%
  spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70
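A variant that leaves the original df untouched is to compute the weighted mean only over the rows where both score and weight are present. The following is a minimal base R sketch under that assumption; pair_wmean, months and result are hypothetical names, and the month prefixes are hard-coded for the two months in the example.

```r
df <- data.frame(
  Class         = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Jan_18_score  = c(3, 5, -7, 2, 1, NA, 5, 8, -2, 5, NA),
  Feb_18_score  = c(2, 7, 8, NA, 2, 6, NA, 8, 7, 3, 8),
  Jan_18_Weight = c(150, 30, NA, 80, 60, 80, 40, 12, 23, 65, 78),
  Feb_18_Weight = c(153, 60, 80, 40, 80, 30, 25, 45, 40, NA, 50)
)

# Weighted mean over the rows where both score and weight are present
pair_wmean <- function(score, weight) {
  ok <- !is.na(score) & !is.na(weight)
  sum(score[ok] * weight[ok]) / sum(weight[ok])
}

months <- c("Jan_18", "Feb_18")

# One row per class, one column per month
result <- do.call(rbind, lapply(split(df, df$Class), function(g) {
  vals <- sapply(months, function(m) {
    pair_wmean(g[[paste0(m, "_score")]], g[[paste0(m, "_Weight")]])
  })
  data.frame(Class = g$Class[1], t(vals))
}))
```

If a class has no complete score/weight pair in some month, pair_wmean() divides zero by zero and returns NaN, which may or may not be the desired behaviour.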
Assume I have this data frame
What I want is this
I want to add a row for each month that holds the sum of the total variable over all persons in that month, together with that month's unique days_in_month value.
Is there a quick and easy way to do this that does not involve multiple spreads and gathers with adorn_totals, after which I would have to change days_in_month back to its original value once the totals were summed?
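One option would be to group by 'month' and 'days_in_month' and apply adorn_totals to each group. In dplyr 0.8.1 and later, group_map() returns a list, so group_modify() is used here to get a grouped tibble back:
library(dplyr)
library(janitor)
df1 %>%
  group_by(month, days_in_month) %>%
  group_modify(~ .x %>%
    adorn_totals("row")) %>%
  select(names(df1))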
# A tibble: 10 x 4
# Groups: month, days_in_month [2]
# month person total days_in_month
# <int> <chr> <int> <int>
# 1 1 John 7 31
# 2 1 Jane 18 31
# 3 1 Tim 20 31
# 4 1 Cindy 11 31
# 5 1 Total 56 31
# 6 2 John 18 28
# 7 2 Jane 13 28
# 8 2 Tim 15 28
# 9 2 Cindy 9 28
#10 2 Total 55 28
If we need other statistics, we can compute them per group with group_modify() (rather than group_map(), which returns a list in current dplyr):
library(tibble)
df1 %>%
  group_by(month, days_in_month) %>%
  group_modify(~ bind_rows(.x, tibble(person = "Mean", total = mean(.x$total))))
data
df1 <- structure(list(month = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), person = c("John",
"Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"), total = c(7L,
18L, 20L, 11L, 18L, 13L, 15L, 9L), days_in_month = c(31L, 31L,
31L, 31L, 28L, 28L, 28L, 28L)), class = "data.frame", row.names = c(NA,
-8L))
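The same per-month total rows can also be sketched in base R, without janitor. This is a minimal sketch on the example data, assuming days_in_month is constant within each month; the names with_totals and total_row are illustrative.

```r
df1 <- data.frame(
  month         = c(1, 1, 1, 1, 2, 2, 2, 2),
  person        = c("John", "Jane", "Tim", "Cindy", "John", "Jane", "Tim", "Cindy"),
  total         = c(7, 18, 20, 11, 18, 13, 15, 9),
  days_in_month = c(31, 31, 31, 31, 28, 28, 28, 28),
  stringsAsFactors = FALSE
)

# For each month, append one row holding the sum of total and the
# month's (single) days_in_month value
with_totals <- do.call(rbind, lapply(split(df1, df1$month), function(g) {
  total_row <- data.frame(
    month         = g$month[1],
    person        = "Total",
    total         = sum(g$total),
    days_in_month = unique(g$days_in_month),
    stringsAsFactors = FALSE
  )
  rbind(g, total_row)
}))
rownames(with_totals) <- NULL
```

The Total rows follow the last person of each month, as in the adorn_totals output above.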