Mutate a data frame to flag values duplicated between two data frames - r

I have two data frames. This is just a sample; the real database has approximately 1 million records.
The columns can hold names, emails, alphanumeric codes, etc.
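library(dplyr)

# check.names = FALSE keeps the names containing spaces ("ID 1") as written;
# otherwise data.frame() renames them to ID.1 and Enternal.code
data1 <- data.frame(
  'ID 1' = c(86364, "ARV_2612", "AGH_2212", "IND_2622", "CHG_2622"),
  sector = c(3, 3, 1, 2, 5),
  name = c("nhug", "hugy", "mjuk", "ghtr", "kuld"),
  'Enternal code' = c(1, 1, 1, 1, 3),
  col3 = c(1, 1, 0, 0, 0),
  col4 = c(1, 0, 0, 0, 0),
  col5 = c(1, 0, 1, 1, 1),
  check.names = FALSE
)
data2 <- data.frame(
  'ID 1' = c(53265, "ARV_7362", 76354, "IND_2622", "CHG_9762"),
  sector = c(3, 3, 1, 2, 5),
  name = c("nhug", "hugy", "mjuk", "ghtr", "kuld"),
  'Enternal code' = c(1, 1, 1, 1, 3),
  col3 = c(1, 1, 0, 0, 0),
  col4 = c(1, 0, 0, 0, 0),
  col5 = c(1, 0, 1, 1, 1),
  check.names = FALSE
)
# Flag each `ID 1` in data2 that also occurs in data1
data2 %>% mutate(
  duplicated = factor(if_else(`ID 1` %in% pull(data1, `ID 1`), 1, 0)))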
Now I am looking for a function that mutates one data frame (data2): given a column name from data1 and one from data2, it should check whether each value or string already exists in the other data frame and add a new column holding 1 for true and 0 for false.
The function would look like:
func(data1 = "name",data2="name",mutated_com="name_exist")

In base R, you can write this function as :
func <- function(data1, data2, data1col, data2col, newcol) {
data2[[newcol]] <- factor(as.integer(data2[[data2col]] %in% data1[[data1col]]))
data2
}
and can call it as :
func(data1, data2, 'name', 'name', 'duplicate')
This returns a copy of data2 with a column named duplicate that is 1 where the name in data2 also appears in the name column of data1, and 0 otherwise.
Using dplyr syntax the above can be written as :
library(dplyr)
library(rlang)
func <- function(data1, data2, data1col, data2col, newcol) {
  data2 %>%
    mutate(!!newcol := factor(as.integer(.data[[data2col]] %in%
                                           data1[[data1col]])))
}
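As a quick check of the function above, here is a minimal sketch on cut-down versions of the sample data. The new column name id_exists is made up for illustration, and check.names = FALSE is assumed so that the column name with a space survives data.frame(). The base R version of func is used so the block needs no packages.

```r
# Cut-down versions of the sample data; check.names = FALSE keeps "ID 1" intact
data1 <- data.frame("ID 1" = c("86364", "ARV_2612", "IND_2622"),
                    check.names = FALSE, stringsAsFactors = FALSE)
data2 <- data.frame("ID 1" = c("53265", "ARV_7362", "IND_2622"),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Base R version of func from the answer above
func <- function(data1, data2, data1col, data2col, newcol) {
  data2[[newcol]] <- factor(as.integer(data2[[data2col]] %in% data1[[data1col]]))
  data2
}

# Column names are passed as strings, so names with spaces need no backticks
flagged <- func(data1, data2, "ID 1", "ID 1", "id_exists")
flagged$id_exists
```

Because the new column is a factor, as.integer() on it returns the level codes (1 and 2) rather than 0 and 1; use as.integer(as.character(flagged$id_exists)) if plain numbers are needed.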

You can use inner_join() (from dplyr) to find the overlap between the two data frames. If both data frames have the same column names and you want to match on all of them, you can omit the 'by' argument.
You can then add a column 'duplicated' to the overlap and join it back to the original data frame (data1 or data2) to get the desired result.
overlap <- data1 %>%
inner_join(data2) %>%
mutate(duplicated = 1)
data1 %>% #or data2
left_join(overlap) %>%
mutate(duplicated = ifelse(is.na(duplicated),0,1))
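One thing to watch with the join approach is that it can add rows when the overlap itself contains repeated rows. A minimal base R sketch of an alternative that never changes the row count is to build one composite key per row and test membership with %in%. The toy data and the "|" separator are assumptions made for illustration; the separator must not occur in the values.

```r
# Toy data (hypothetical): only the first row of data2 matches data1 on all columns
data1 <- data.frame(id = c("a", "b", "c"), sector = c(1, 2, 3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id = c("a", "b", "d"), sector = c(1, 9, 3),
                    stringsAsFactors = FALSE)

# Compare on every column the two data frames share
shared <- intersect(names(data1), names(data2))

# One pasted key per row; assumes "|" never appears inside a value
key1 <- do.call(paste, c(data1[shared], sep = "|"))
key2 <- do.call(paste, c(data2[shared], sep = "|"))

# 1 if the whole row of data2 also occurs in data1, else 0
data2$duplicated <- as.integer(key2 %in% key1)
data2
```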

Related

Return the names of data frames from a loop into a new data frame as IDs

Suppose I have multiple data frames with the same prefixes and same structure.
mydf_1 <- data.frame('fruit' = 'apples', 'n' = 2)
mydf_2 <- data.frame('fruit' = 'pears', 'n' = 0)
mydf_3 <- data.frame('fruit' = 'oranges', 'n' = 3)
I have a for-loop that grabs all the tables with this prefix, and appends those that match a certain condition.
res <- data.frame()
for(i in mget(apropos("^mydf_"), envir = .GlobalEnv)){
if(sum(i$n) > 0){
res <- rbind.data.frame(res, data.frame('name' = paste0(i[1]),
'n' = sum(i$n)))
}
}
res
This works fine, but I want my 'res' table to identify the name of the original data frame itself in the 'name' column, instead of the column name. My desired result is:
The closest I have gotten to solving this issue is:
'name' = paste0(substitute(i))
instead of
'name' = paste0(i[1])
but it just returns 'i'.
Any simple solution? Base preferred but not essential.
As mentioned in the comments, it is better to put the data frames into a list, as it is much easier to handle and manipulate them that way. However, we could still grab the data frames from the global environment, get the sum for each one, then bind the results together and add the data frame name as a column.
library(tidyverse)
df_list <-
  do.call("list", mget(grep("^mydf_", names(.GlobalEnv), value = TRUE))) %>%
  map(., ~ .x %>% summarise(n = sum(n))) %>%
  # compare the summarised column itself, so the predicate is a single TRUE/FALSE
  discard(~ .x$n == 0) %>%
  bind_rows(., .id = "name")
Or we could use map_dfr to bind together and summarise, then filter out the 0 values:
map_dfr(mget(ls(pattern = "^mydf_")), ~ c(n = sum(.x$n)), .id = "name") %>%
filter(n != 0)
Output
name n
1 mydf_1 2
2 mydf_3 3
To bind a list of data.frames and store the list names as a new column, a convenient way is to set the arg .id in dplyr::bind_rows().
library(dplyr)
mget(apropos("^mydf_")) %>%
bind_rows(.id = "name") %>%
count(name, wt = n) %>%
filter(n > 0)
# name n
# 1 mydf_1 2
# 2 mydf_3 3
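Since the question prefers base R, here is a minimal sketch of the same idea without any packages. It assumes the data frames are kept in a named list, as recommended above, rather than fetched from the global environment; the names dfs, totals and keep are illustrative only.

```r
# The data frames from the question, stored in a named list
dfs <- list(
  mydf_1 = data.frame(fruit = "apples",  n = 2),
  mydf_2 = data.frame(fruit = "pears",   n = 0),
  mydf_3 = data.frame(fruit = "oranges", n = 3)
)

# One total per data frame; vapply() carries the list names over
totals <- vapply(dfs, function(d) sum(d$n), numeric(1))

# Keep only the data frames whose total is positive
keep <- totals > 0
res <- data.frame(name = names(totals)[keep],
                  n = unname(totals[keep]),
                  stringsAsFactors = FALSE)
res
```

Working from the list names avoids the problem in the original loop, where the loop variable only ever sees the data frame's contents and not the name it was stored under.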

How to subset a data frame by id, with sampling 1 row by id? (in R)

I have a big data frame in which each row has an id code.
I want to create another data frame with only one row per id.
How can I do it?
This is one part of the data (the id column is "codigo_pon"):
Using dplyr, you can do this:
library(dplyr)
your_data %>%
group_by(id_column) %>%
sample_n(1) %>%
ungroup()
Based on the question, you could do something like this:
library(tidyverse)
Example data
data <-
  tibble(
    id = rep(1:20, each = 5),
    value = rnorm(100)
  )
Sample data, 1 row by id
data %>%
#Group by id variable
group_by(id) %>%
#Sample 1 row by id
sample_n(size = 1)
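The current dplyr documentation marks sample_n() as superseded by slice_sample(), so the same grouped sampling can also be sketched as below. The set.seed() call and the name one_per_id are additions for illustration, there only to make the draw repeatable and the result easy to inspect.

```r
library(dplyr)

set.seed(42)  # only to make the random draw repeatable

# Same shape as the example data above: 20 ids, 5 rows each
data <- tibble(id = rep(1:20, each = 5), value = rnorm(100))

one_per_id <- data %>%
  group_by(id) %>%          # sample within each id
  slice_sample(n = 1) %>%   # one random row per group
  ungroup()

nrow(one_per_id)
```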
base R
data[!ave(seq_len(nrow(data)), data$codigo_pon,
FUN = function(z) seq_along(z) != sample(length(z), size = 1)),]
or
do.call(rbind, by(data, data$codigo_pon,
FUN = function(z) z[sample(nrow(z), size = 1),]))
(Previously I suggested aggregate, but that sampled each column separately, breaking up the rows.)
data.table
library(data.table)
as.data.table(data)[, .SD[sample(.N, size = 1),], by = codigo_pon]
(dplyr has already been demonstrated twice)

Mutating a count of rows per group matching a subset condition

I wish to add a new column called SF_COUNT that holds, for each group (ID), the number of rows in that group where the column type contains 'SF'.
A reproducible example looks as follows:
df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'))
My final data frame looks like:
final_df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'), SF_COUNT = c(2,2,2,3,3,3,3,2,2,2))
How can I achieve this in dplyr please?
After grouping by 'ID', take the sum of the logical vector (type == 'SF') inside mutate to create the new column:
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(SF_COUNT = sum(type == 'SF', na.rm = TRUE))
If 'SF' may be only a substring of type, use str_detect instead:
library(stringr)
df <- df %>%
group_by(ID) %>%
mutate(SF_COUNT = sum(str_detect(type, 'SF'), na.rm = TRUE))
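For comparison, the same per-group count can be sketched in base R with ave(), which applies a function within each group and returns a vector as long as its input. This is an alternative to the dplyr answer above, not a replacement for it.

```r
df <- data.frame(
  ID = c(1234, 1234, 1234, 4567, 4567, 4567, 4567, 8900, 8900, 8900),
  type = c('RF', 'SF', 'SF', 'RF', 'SF', 'SF', 'SF', 'RF', 'SF', 'SF')
)

# Turn the match into 0/1, then sum it within each ID;
# ave() repeats the group total on every row of that group
df$SF_COUNT <- ave(as.integer(df$type == "SF"), df$ID, FUN = sum)
df
```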

Lookup tables in R

I have a tibble with a ton of data in it, but most importantly, I have a column that references a row in a lookup table by number (ex. 1,2,3 etc).
df <- tibble(ref = c(1,1,1,2,5),
             data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
               Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
               Intercept = c(40,38,40,38,36),
               Min = c(25,25,21,21,18),
               Max = c(36,36,38,37,32))
I need to do a calculation for each row in the original tibble based on the information in the referenced row in the lookup table.
df$result <- df$data - lkup$intercept[lkup$CurveID == df$ref]/lkup$slope[lkup$CurveID == df$ref]
The idea is to access the slope or intercept (etc) value from the correct row of the lookup table based on the number in the data table, and to do this for each data point in the column. But I keep getting an error telling me my data isn't compatible, and that my objects need to be of the same length.
You could also do it with match()
df$result <- df$data - lkup$Intercept[match(df$ref, lkup$CurveID)]/lkup$Slope[match(df$ref, lkup$CurveID)]
df$result
# [1] 43.52632 44.52632 45.52632 45.85714 42.90909
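One property of the match() approach worth knowing: a ref with no entry in the lookup table yields NA rather than an error, because match() returns NA for it. A minimal sketch with a deliberately incomplete lookup table (the shortened data here is made up for illustration, and plain data frames are used to avoid package dependencies):

```r
# Lookup table without CurveID 4
lkup <- data.frame(CurveID = c(1, 2, 5),
                   Slope = c(-3.8, -3.5, -3.3),
                   Intercept = c(40, 38, 36))
df <- data.frame(ref = c(1, 2, 4), data = c(33, 35, 32))

# Row position of each ref in the lookup table; NA where there is no match
idx <- match(df$ref, lkup$CurveID)

# Same formula as above, computing the index only once
df$result <- df$data - lkup$Intercept[idx] / lkup$Slope[idx]
df
```

Storing the index in idx also avoids calling match() twice.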
You could use the dplyr package to join the tibbles together. If the ref column and CurveID column have the same name then left_join will combine the two tibbles by the matching rows.
library(dplyr)
df <- tibble(CurveID = c(1,1,1,2,5),
data = c(33,34,35,35,32))
lkup <- tibble(CurveID = c(1,2,3,4,5),
Slope = c(-3.8,-3.5,-3.1,-3.3,-3.3),
Intercept = c(40,38,40,38,36),
Min = c(25,25,21,21,18),
Max = c(36,36,38,37,32))
df <- df %>% left_join(lkup, by = "CurveID")
Then do the calculation on each row:
df <- df %>% mutate(result = data - (Intercept/Slope)) %>%
select(CurveID, data, result)
For completeness' sake, here's one way to literally do what OP was trying:
library(slider)
df %>%
  # subtract Intercept/Slope from data, as in the question's formula
  mutate(result = data - slide_dbl(ref, ~ slice(lkup, .x)$Intercept /
                                          slice(lkup, .x)$Slope))
though since slice goes by row number, this relies on CurveID equalling the row number (we make no reference to CurveID at all). You can write it differently with filter but it ends up being more code.

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10, 300, replace = TRUE)),
                 group = c(rep("A", 100), rep("B", 100), rep("C", 100)),
                 stringsAsFactors = FALSE)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
dplyr::group_by(group) %>%
dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only, with split(). It should be faster than filtering with == once per unique group:
with(df, split(id, group))
Or, with the tidyverse, we can pull the column after group_split(). group_split() returns a list of tibbles, so it could be slower than the split()-only method above. Still, we can trim some work by dropping the group column (.keep = FALSE; the argument was spelled keep before dplyr 1.0.0) and then pulling the 'id' column from each list element to get a list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, .keep = FALSE) %>%
  map(~ .x %>%
        pull(id))
Or use {} with pipe
df %>%
{split(.$id, .$group)}
Or wrap with with
df %>%
with(., split(id, group))
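To make the shape of the split() result concrete, here is a minimal sketch on a four-row toy data frame (the data is made up for illustration). Unlike the lapply() version in the question, split() returns a named list ordered by the sorted group values, so unname() is shown for the case where an unnamed list is wanted.

```r
df <- data.frame(id = c("id1", "id2", "id1", "id3"),
                 group = c("A", "A", "B", "B"),
                 stringsAsFactors = FALSE)

# One character vector of ids per group, named by group
by_group <- split(df$id, df$group)
by_group

# Drop the names to mirror the unnamed list from the lapply() version
unnamed <- unname(by_group)
```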
