Calculate percent NA by ID variable in R - r

Title is self-explanatory. Looking to calculate percent NA by ID group in R. There are lots of posts on calculating NA by variable column but almost nothing on doing it by row groups.

If there are multiple columns, group by 'ID', then use summarise_at to loop over the remaining columns: create a logical vector with is.na, take its mean, and multiply by 100.
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise_at(vars(-group_cols()), ~ 100 * mean(is.na(.)))
If we want a single percentage per 'ID' across all other variables, reshape to long format first:
library(tidyr)
df1 %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>%
summarise(Perc = 100 * mean(is.na(value)))
Or with aggregate from base R
aggregate(.~ ID, df1, FUN = function(x) 100 * mean(is.na(x)), na.action = na.pass)
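Or, to get the percentage across all the other columns in base R, unlist those columns, build a table from the resulting logical vector and the (recycled) 'ID' column, and use prop.table with margin = 1 to get the share within each 'ID':
vals <- is.na(unlist(df1[setdiff(names(df1), "ID")], use.names = FALSE))
100 * prop.table(table(ID = rep(df1$ID, length.out = length(vals)), value = vals), margin = 1)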
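For a quick sanity check of the approaches above, here is a minimal base R sketch on made-up data. The data frame df1 and its columns a and b are assumptions for illustration only; they are not from the question.

```r
# Toy data (hypothetical): two ID groups with some missing values
df1 <- data.frame(ID = c(1, 1, 2, 2, 2),
                  a  = c(NA, 1, 2, NA, NA),
                  b  = c(1, NA, 3, 4, NA))

# Percent NA per column within each ID; na.pass keeps the NA rows
# so that is.na() can actually see them
per_column <- aggregate(. ~ ID, df1,
                        FUN = function(x) 100 * mean(is.na(x)),
                        na.action = na.pass)

# Percent NA per ID across all non-ID columns taken together
overall <- sapply(split(df1[-1], df1$ID),
                  function(d) 100 * mean(is.na(d)))

print(per_column)
print(overall)
```

Without na.action = na.pass, the formula interface of aggregate would drop every row containing an NA before the function runs, and each percentage would come out as 0.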

Related

How to take the diversity function and add it as a column

I have been following this code here:
calculating diversity by groups
However, when I implement the code it returns only the values. I have tried cbind to put it into another dataframe, but I am afraid that the rows do not match. Is there a way to run that code so that the result is placed in the same dataframe, with rows matching where they were taken from?
Following the code given in the linked answer, what you need is right_join to link the diversity vector with the method_one data.frame:
diversity(aggregate(. ~ LOC_ID + week + year + POSTCODE , method_one, sum)[, 5:25], MARGIN=1, index="simpson") %>%
tibble(diversity=., ID=1:length(.)) %>%
right_join(method_one %>% mutate(ID=1:nrow(.)))
Explanation :
method_one %>% mutate(ID=1:nrow(.)) adds an ID column.
tibble(diversity=., ID=1:length(.)) turns the result of the diversity call into a tibble with an ID column.
right_join(x, y, by = NULL, ...) returns all rows from y and all columns from x and y. Rows in y with no match in x get NA in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. If by is NULL, the join uses the columns with matching names, in this case ID.
Here is an option with dplyr
library(dplyr)
library(vegan)
method_one %>%
group_by(LOC_ID, week, year, POSTCODE) %>%
summarise(across(everything(), sum, na.rm = TRUE)) %>%
ungroup %>%
select(5:25) %>%
diversity %>%
tibble(diversity = .) %>%
bind_cols(method_one, .)
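One thing to keep in mind with both answers: the aggregation yields one diversity value per group, not per original row, so matching by row position only lines up if every group has exactly one row. A hedged alternative is to keep the grouping keys next to the diversity value and merge on those keys. The sketch below uses base R only; the toy method_one, its columns sp1 to sp3, and the hand-written simpson() helper are assumptions for illustration. The helper computes 1 - sum(p^2), the same quantity that diversity(index = "simpson") in vegan returns.

```r
# Hypothetical stand-in for method_one: two key columns, three count columns
method_one <- data.frame(LOC_ID = c(1, 1, 2),
                         week   = c(1, 1, 1),
                         sp1    = c(2, 2, 5),
                         sp2    = c(1, 3, 0),
                         sp3    = c(0, 0, 0))

# Sum the counts per group, keeping the grouping keys in the result
agg <- aggregate(. ~ LOC_ID + week, method_one, sum)

# Hand-rolled Simpson index (assumption: stands in for vegan::diversity)
simpson <- function(x) {
  p <- x / sum(x)
  1 - sum(p^2)
}
agg$diversity <- apply(agg[, c("sp1", "sp2", "sp3")], 1, simpson)

# Attach the per-group value to every original row via the keys
merged <- merge(method_one,
                agg[, c("LOC_ID", "week", "diversity")],
                by = c("LOC_ID", "week"))
print(merged)
```

Joining on the keys instead of on a row counter means each original row receives the diversity of the group it belongs to, regardless of how many rows each group has.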

R- Populating a dataframe based on another one with conditions

I'm new to R and I have a data set of 22 x 252; the 252 rows have many repeated values in column 1 (ID). I made another dataset with one row per unique ID (with those values already populated), and I want to populate the rest of its columns from the first dataset, basically by summing all the values that share the same value in column 1.
Is there a basic function that enables me to do this?
Thanks & Regards
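We can use aggregate in base R. Assuming the first column is named 'ID' and all other columns are numeric, we group by 'ID' and sum the remaining columns. na.action = na.pass keeps rows that contain NA so that na.rm = TRUE can handle them inside sum; the formula method's default, na.omit, would drop those rows entirely.
aggregate(. ~ ID, df1, sum, na.rm = TRUE, na.action = na.pass)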
Or with dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise_at(vars(-group_cols()), sum, na.rm = TRUE)
Or, in newer versions of dplyr (1.0.0 and later), with across
df1 %>%
group_by(ID) %>%
summarise(across(-group_cols(), sum, na.rm = TRUE))
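A caveat worth checking with the base R route: the formula interface of aggregate uses na.action = na.omit by default, which removes any row containing an NA before sum is called, so na.rm = TRUE alone does not rescue the other values in that row. The sketch below shows the difference on made-up data; df1 and its columns x and y are hypothetical.

```r
# Toy data (hypothetical): the second row has an NA in x but a valid y
df1 <- data.frame(ID = c("a", "a", "b"),
                  x  = c(1, NA, 3),
                  y  = c(10, 20, 30))

# Default na.action (na.omit): the whole second row is discarded first
dropped <- aggregate(. ~ ID, df1, sum, na.rm = TRUE)

# na.pass: rows are kept, and na.rm = TRUE skips only the missing cells
kept <- aggregate(. ~ ID, df1, sum, na.rm = TRUE, na.action = na.pass)

print(dropped)
print(kept)
```

In the first result the y total for group "a" is missing the 20 from the discarded row; in the second it is included.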

Choose top n variables in R when matching values

I have a large time series dataset, and would like to choose the top 10 observations from each date based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observation are equal, then they are both picked, so that I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
You can arrange the data and select first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
arrange(Date, desc(col_name)) %>%
group_by(Date) %>%
filter(row_number() <= 10)
With data.table you can do
library(data.table)
setDT(df)
df[order(Date, -value)][, .SD[1:10], by = Date]
Change value to the name of the (numeric) column used for ranking. Note that .SD[1:10] pads groups that have fewer than 10 rows with NA rows; head avoids that:
df[order(Date, -value)][, head(.SD, 10), by = Date]
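We can use base R: sort the data, compute each row's position within its 'Date' group with ave, and keep positions up to 10
df1 <- df[with(df, order(Date, -value)), ]
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along) <= 10), ]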
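To see the tie handling on a small scale, here is a base R sketch on made-up data that keeps the top 2 rows per date instead of the top 10. The data frame, the column name value, and n = 2 are assumptions chosen so the ties are easy to spot.

```r
# Toy data (hypothetical): ties at the cut-off on both dates
df <- data.frame(Date  = rep(c("2020-01-01", "2020-01-02"), each = 4),
                 value = c(5, 3, 3, 1, 2, 2, 2, 9))
n <- 2

# Sort by date, then by value in descending order
ord <- df[order(df$Date, -df$value), ]

# Position of each row within its date group (1, 2, 3, ...)
pos <- ave(seq_along(ord$Date), ord$Date, FUN = seq_along)

# Keep exactly the first n positions per date; ties beyond n are dropped
top <- ord[pos <= n, ]
print(top)
```

Because the filter is on row position rather than on the value itself, tied rows past the cut-off are excluded. Which of the tied rows survives depends on the sort order, so add further sort keys if that choice matters.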

Count the number of missing values in groups in R

I have a tibble, with many observations and variables.
What I want to do is simply calculate (grouping by variable1 and variable2) the mean of variableXXX and the total number of missing values for each group.
This is what I have written so far:
data%>%
group_by(variable1,variable2)%>%
summarise(mean(variableXXX))
How can I calculate the number of missing values for each group? I am new to R, so the simplest solution is best.
We can get the sum of logical vector created with is.na
library(dplyr)
data%>%
group_by(variable1,variable2)%>%
summarise(Mean = mean(variableXXX, na.rm = TRUE),
MissingCount = sum(is.na(variableXXX)))
NOTE: Assuming that we are interested in the count of NAs in the 'variableXXX' column grouped by 'variable1' and 'variable2'
If we need the NA count across all columns of each group's subset of the dataset
library(purrr)
data %>%
group_split(variable1, variable2) %>%
map_dfr(~ .x %>%
summarise(Mean = mean(variableXXX, na.rm = TRUE),
MissingCount = sum(is.na(.))))
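The same per-group mean and missing count can be cross-checked in base R. This is only a sketch on made-up data (the values below are assumptions); it also shows an edge case: a group whose values are all NA has a mean of NaN once na.rm = TRUE removes everything.

```r
# Toy data (hypothetical): group b/y contains only missing values
data <- data.frame(variable1   = c("a", "a", "b", "b"),
                   variable2   = c("x", "x", "y", "y"),
                   variableXXX = c(1, NA, NA, NA))

# Split by both grouping variables, dropping empty combinations
groups <- split(data, list(data$variable1, data$variable2), drop = TRUE)

# One summary row per group: mean ignoring NA, plus the NA count
res <- do.call(rbind, lapply(groups, function(d) {
  data.frame(variable1    = d$variable1[1],
             variable2    = d$variable2[1],
             Mean         = mean(d$variableXXX, na.rm = TRUE),
             MissingCount = sum(is.na(d$variableXXX)))
}))
print(res)
```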

tidyverse: count number of a specific level when summarizing

I would like, when summarizing after grouping, to count the number of a specific level of another factor.
In the working example below, I would like to count the number of "male" levels in each group. I've tried many things with count, tally and so on but cannot find a straightforward and neat way to do it.
df <- data.frame(Group=replicate(20, sample(c("A","B"), 1)),
Value=rnorm(20),
Factor=replicate(20, sample(c("male","female"), 1)))
df %>%
group_by(Group) %>%
summarize(Value = mean(Value),
n_male = ???)
Thanks for your help!
We can use sum on a logical vector i.e. Factor == "male". The TRUE/FALSE will be coerced to 1/0 to get the frequency of 'male' elements when we do the sum
df %>%
group_by(Group) %>%
summarise(Value = mean(Value),
n_male = sum(Factor=="male"))
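The coercion of TRUE/FALSE to 1/0 can be checked without any packages. The sketch below uses a small fixed data frame (made up, so the counts are reproducible) and tapply as a base R stand-in for the grouped summarise.

```r
# Toy data (hypothetical), fixed rather than sampled so the counts are known
df <- data.frame(Group  = c("A", "A", "B", "B", "B"),
                 Factor = c("male", "female", "male", "male", "female"))

# Factor == "male" is a logical vector; sum() counts its TRUE values per group
n_male <- tapply(df$Factor == "male", df$Group, sum)
print(n_male)
```

If Factor can contain NA, the comparison yields NA for those rows and sum returns NA for the group; passing na.rm = TRUE to sum counts only the known matches.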
