Mutating a count of rows per group matching a subset condition - r

I wish to mutate a new column called SF_COUNT which is a count per group (ID) of the number of rows per group where the column type contains 'SF'
A reproducible example looks as follows:
df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'))
My final data frame looks like:
final_df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'), SF_COUNT = c(2,2,2,3,3,3,3,2,2,2))
How can I achieve this in dplyr please?

After grouping by 'ID', get the sum of logical vector (type == 'SF') in mutate to create the new column
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(SF_COUNT = sum(type == 'SF', na.rm = TRUE))
If it is a substring, then use str_detect
library(stringr)
df <- df %>%
group_by(ID) %>%
mutate(SF_COUNT = sum(str_detect(type, 'SF'), na.rm = TRUE))

Related

mutate data frame to check if data duplicate between two data frame

I have two data frame, this is just a sample , database have approx 1 million of records.
can have name, email, alphanumeric code etc.
data1<-data.frame(
'ID 1' = c(86364,"ARV_2612","AGH_2212","IND_2622","CHG_2622"),
sector = c(3,3,1,2,5),
name=c("nhug","hugy","mjuk","ghtr","kuld"),
'Enternal code'=c(1,1,1,1,3),
col3=c(1,1,0,0,0),
col4=c(1,0,0,0,0),
col5=c(1,0,1,1,1)
)
data2<-data.frame(
'ID 1' = c(53265,"ARV_7362",76354,"IND_2622","CHG_9762"),
sector = c(3,3,1,2,5),
name=c("nhug","hugy","mjuk","ghtr","kuld"),
'Enternal code'=c(1,1,1,1,3),
col3=c(1,1,0,0,0),
col4=c(1,0,0,0,0),
col5=c(1,0,1,1,1)
)
data2 %>% mutate(
duplicated = factor(if_else(`ID 1` %in%
pull(data1, `ID 1`),
1,
0)))
now i am looking for a function to mutate my one data frame (data2) like. if I give column names of data1 and data2 to find if the values or string already exist in other data and mutate a new column to 1,0 for true and false.
the function would be like
func(data1 = "name",data2="name",mutated_com="name_exist")
In base R, you can write this function as :
func <- function(data1, data2, data1col, data2col, newcol) {
data2[[newcol]] <- factor(as.integer(data2[[data2col]] %in% data1[[data1col]]))
data2
}
and can call it as :
func(data1, data2, 'name', 'name', 'duplicate')
This will create a column named duplicate in data2 giving 1 where the name in df2 is also present in name of df1 and 0 otherwise.
Using dplyr syntax the above can be written as :
library(dplyr)
library(rlang)
func <- function(data1, data2, data1col, data2col, newcol) {
data2 %>%
mutate(!!newcol := factor(as.integer(.data[[data2col]] %in%
data1[[data1col]])))
}
You can use an inner_join (from dplyr) to determine the overlap between the two dataframes. To use all columns (if both dataframes have the same column names) you do not have to specify the 'by' argument.
You can then add a column 'duplicated' and join back to the original dataframe (df1 or df2) to get the desired result.
overlap <- data1 %>%
inner_join(data2) %>%
mutate(duplicated = 1)
data1 %>% #or data2
left_join(overlap) %>%
mutate(duplicated = ifelse(is.na(duplicated),0,1))

Sampling from a subset of a dataframe where the subset is conditional on a value from another dataframe in R

I have two data frames in R. One contains a row for each individual person and the area they live in. E.g.
df1 = data.frame(Person_ID = seq(1,10,1), Area = c("A","A","A","B","B","C","D","A","D","C"))
The other data frame contains demographic information for each Area.
E.g. for gender df2 = data.frame(Area = c("A","A","B","B","C","C","D","D"), gender = c("M","F","M","F","M","F","M","F"), probability = c(0.4,0.6,0.55,0.45,0.6,0.4,0.5,0.5))
In df1 I want to create a gender column where for each row of df1 I sample a gender from the appropriate subset of df2.
For example, for row 1 of df1 I would sample a gender from df2 %>% filter(Area == "A")
The question is how do I do this for all rows without a for loop as in practice df1 could have up to 5 million rows?
Try using the following :
library(dplyr)
library(tidyr)
out <- df1 %>%
nest(data = -Area) %>%
left_join(df2, by = 'Area') %>%
group_by(Area) %>%
summarise(data = map(data, ~.x %>%
mutate(gender = sample(gender, n(),
prob = probability, replace = TRUE)))) %>%
distinct(Area, .keep_all = TRUE) %>%
unnest(data)
We first nest df1 and join it with df2 by Area. For each Area we sample gender value based on probability in df2 and unnest to get long dataframe.
There are not enough samples in df1 to verify the result but if we increase number of rows in df1 the proportion should be similar to probability in df2.

Calculate mean for each row across a list of dataframes in R

I want to calculate the mean of each row in the 'value' column across a list of dataframes.
The output I want is a dataframe with the 'sample' name and its associated mean 'value' for each row across the whole list.
Below is some example data:
list <- list()
dataframe1 <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("1","1","2","5","4","5","5","3","2","2","2","4","3","1","1","3","2","1","2","5","5","5","2","2","2","1","2","1","1","1","2","1","1","2","2","5","3","3","5"),
value = c("14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500","14.32500"))
list[[1]] <- dataframe1
dataframe2 <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("5","4","3","5","4","5","5","3","2","2","2","2","3","1","1","3","2","1","2","5","5","5","2","2","2","1","2","1","1","1","2","1","1","2","4","4","4","4","4"),
value = c("12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000","12.59000"))
list[[2]] <- dataframe2
dataframe3 <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("5","3","3","5","5","5","5","3","5","3","3","3","3","3","3","3","2","1","2","1","1","1","2","2","2","1","2","1","1","1","2","1","1","4","4","4","4","4","4"),
value = c("20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915","20.06915"))
list[[3]] <- dataframe3
dataframe4 <- data.frame(sample = c("OP2645ii_c","OP5048___e","OP5048___f","OP5046___d","OP2645ii_e","OP2645ii_a","OP5054DNAa","OP5048___c","OP2645ii_d","OP5048___b","OP5047___a","OP5048___h","OP5053DNAb","OP3088i__a","OP5048___g","OP5053DNAa","OP5049___a","OP2645ii_b","OP5046___c","OP5044___c","OP2413iiia","OP5054DNAc","OP5046___e","OP5054DNAb","OP5044___a","OP5046___a","OP5046___b","OP2413iiib","OP5051DNAa","OP5048___d","OP5044___b","OP5049___b","OP5051DNAc","OP5051DNAb","OP5053DNAc","OP5047___b","OP5043___b","OP5043___a","OP5052DNAa"),
gr = c("2","2","2","3","4","5","5","3","2","2","2","4","5","1","1","3","2","1","2","5","5","5","2","2","2","1","2","1","1","1","2","1","1","2","2","5","3","3","5"),
value = c("18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500","18.32500"))
list[[4]] <- dataframe4
Many thanks.
Cheers.
Deon
Using base functions, you could extract all the value columns into a matrix and use row means:
rowMeans(sapply(list, "[[", "value"))
For you sample data, you'd need to also convert to numeric (as below), but I'm hoping your real data has numbers not factors.
rowMeans(sapply(lapply(list, "[[", "value"), function(x) as.numeric(as.character(x))))
This just gives the values (and assumes the rows are in the right order). You can add the sample names with cbind, e.g., cbind(list[[1]][["sample"]], rowMeans(...)).
We can bind the list elements to a single data.frame, create a group of row_number() and use that to get the mean of 'value'
library(dplyr)
bind_rows(list, .id = 'id') %>%
mutate(value = as.numeric(value)) %>%
group_by(id) %>%
group_by(grp = row_number(), sample) %>%
summarise(value = mean(value, na.rm = TRUE))

Recherche value in other dataframe for computation

I have a df that looks like that:
df1 <- data.frame(country = c("C1","C1","C2","C2"),year = c(1998,2001,1998,2001), amount = c(11000,11500,5000,4100))
I created another df based on the first one as follow:
df2 <- aggregate(amount ~ year, df1, sum)
I would to create a new column df1$ratio corresponding to the amount ration of each ID for each year. it should look like:
df3 <- data.frame(country = c("C1","C1","C2","C2"),year = c(1998,2001,1998,2001), amount = c(11000,11500,5000,4100), ratio = c(.6875, .7372,.3125,.2628))
Any Idea?
Instead of two step process, it can be done with ave from base R
df1$ratio <- with(df1, amount/ave(amount, year, FUN = sum))
Or with mutate from dplyr
library(dplyr)
df1 %>%
group_by(year) %>%
mutate(ratio = amount/sum(amount))

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
so for student 3, systolic needs two NAs replaced. I used the min and max values for each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>% replace_na(list(systolic= round(sample(runif(1000, 125,130),2),0),
diastolic =round(sample(runif(1000, 85,85),3),0), heart_rate= round(sample(runif(1000, 79,86),2),0),
phys_score = round(sample(runif(1000, 8,9),2),0)
However it works only when one NA needs replacing: successfully replaced systolic NA values. When I try to replace more than one NAs, this error comes up.
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to this? Any suggestions/comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
Generated some grouped missing data from the mtcars dataset
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
as_tibble(rownames = "car") %>%
select(car) %>%
separate(car, c("make", "name"), " ") %>%
bind_cols(mtcars[, -1] %>%
map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
size = length(.), replace = TRUE)])) %>%
filter(make %in% c("Mazda", "Hornet", "Merc"))
Function to replace na values from a given variable by sampling within the min and max and depending on some group (here make).
replace_na_sample <- function(df_miss, var, group = "make") {
var <- enquo(var)
df_miss %>%
group_by(.dots = group) %>%
mutate(replace_var := round(runif(n(), min(!!var, na.rm = T),
max(!!var, na.rm = T)), 0)) %>%
rowwise %>%
mutate_at(.vars = vars(!!var),
.funs = funs(replace_na(., replace_var))) %>%
select(-replace_var) %>%
ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")

Resources