R: automatically generate data frames from a single function

I have a collection of ten dataframes (df.a, df.b, and so on) and the following concept in R
df.x.new = df.x %>% do this %>% do that...
I wondered if there is an elegant way to substitute each of my data frames in turn for df.x in the single line of code above, so that I get ten new data frames as output.
Meaning something like this:
#place your elegant code here
df.a.new = df.a %>% do this %>% do that
df.b.new = df.b %>% do this %>% do that
#and so on
Edit:
#this should serve as a minimal reproducible code
df.a = c(1,2,3)
df.b = c(4,5,6)
df.c = c(7,8,9)
df.a.new = df.a %>% left_join(df.b)
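One possible approach, sketched under assumptions: keep the data frames in a named list and apply a single function to every element. The example data and the process_df helper below are hypothetical stand-ins for the real data frames and for the "do this %>% do that" pipeline.

```r
# Hypothetical example data; the real df.a, df.b, ... come from your session
df.a <- data.frame(id = 1:3, x = c(1, 2, 3))
df.b <- data.frame(id = 1:3, x = c(4, 5, 6))
df.c <- data.frame(id = 1:3, x = c(7, 8, 9))

# Collect the data frames in a named list
dfs <- list(df.a = df.a, df.b = df.b, df.c = df.c)

# Hypothetical placeholder for the "do this %>% do that" steps
process_df <- function(df) {
  df$x_doubled <- df$x * 2
  df
}

# Apply the same steps to every data frame
dfs_new <- lapply(dfs, process_df)
names(dfs_new) <- paste0(names(dfs_new), ".new")

# Optional: copy the results into the global environment as
# df.a.new, df.b.new, ... (working with the list is usually easier)
# list2env(dfs_new, envir = globalenv())
```

Keeping the results in dfs_new and accessing them as dfs_new$df.a.new tends to scale better than creating ten separate objects, but that is a matter of taste.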

Related

Create loop to change structure of multiple data frames in R

I have a bunch of Excel files that I have loaded into R as separate data frames, and I now need to change the structure/layout of every one of them. I have done this for each data frame separately, but it is becoming very time consuming, and I am not sure whether there is a better way. My guess is that I need to combine them all into a list and then write some kind of loop that goes through every data frame in that list. For each one I need to remove rows and columns from the edges, put 'row' in the top-left cell (which is currently empty), and then apply the pivot_longer, mutate, and select calls listed below, which so far I have run on each data frame separately.
names(df)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
I tried the loop below and got an error. Does anyone have a better way than what I am currently doing to accomplish this?
for(x in seq_along(files.list)){
names(files.list)[1] <- 'row'
df <- df %>%
pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
mutate(wellID = paste0(row, plateColumn)) %>%
select(-c(row, plateColumn))
}
If you have a vector of filenames my_files, I think this will work
library(tidyverse)
library(readxl)
prepare_df <- function(df) {
  # make changes to df
  names(df)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  return(df)
}
names(my_files) <- my_files # often useful if the vector we're mapping over has names
dfs <- map(my_files, read_excel) # read into a list of data frames
dfs <- map(dfs, prepare_df) # prepare each one
df <- bind_rows(dfs, .id = "file") # if you prefer one data frame instead
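For comparison, a loop version might look like the sketch below. It assumes files.list is a list of data frames (the plate data here is made up for illustration). Unlike the original attempt, it pulls out element i, modifies it, and writes it back into the list.

```r
library(dplyr)
library(tidyr)

# Hypothetical plate data: first column holds row letters, the rest are plate columns
files.list <- list(
  plate1 = data.frame(X = c("A", "B"), `1` = c(0.1, 0.2), `2` = c(0.3, 0.4),
                      check.names = FALSE),
  plate2 = data.frame(X = c("A", "B"), `1` = c(0.5, 0.6), `2` = c(0.7, 0.8),
                      check.names = FALSE)
)

for (i in seq_along(files.list)) {
  df <- files.list[[i]]        # take the i-th data frame out of the list
  names(df)[1] <- "row"        # rename its first column, not the list's names
  df <- df %>%
    pivot_longer(!row, names_to = "plateColumn", values_to = "Broth_t0") %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  files.list[[i]] <- df        # store the reshaped data frame back
}
```

In the original loop, names(files.list)[1] renamed the first list element rather than a column, and df was never taken from the list, which is probably where the error came from.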

Why does the output have a different column order on different runs?

I have a piece of code in R. Every time I run it on a cluster, I get an answer in which the order of the columns is different. (It seems to be OK on my laptop.) If I reorder the columns so that they have the same order, the answers are identical; the only problem is the ordering of the columns.
NNs_loc_year <- Reduce(cbind,
split(NNs_loc_year,
rep(1:n_neighbors, each=(nrow(NNs_loc_year)/n_neighbors)))) %>%
data.table()
# rename columns
NN_dist <- NN_dist %>% data.table()
names(NN_dist) <- paste0("NN_", c(1:n_neighbors))
names(NNs_loc_year) <- paste0(names(NNs_loc_year), paste0("_NN_", rep(1:n_neighbors, each=2)))
NN_chi <- pchi(as.vector(NN_list$nn.dist), PCs)
NN_sigma <- qchi(NN_chi, 1)
NN_sigma_df = Reduce(cbind,
split(NN_sigma,
rep(1:n_neighbors, each=(length(NN_sigma)/n_neighbors)))) %>%
data.table()
names(NN_sigma_df) <- paste0("sigma_NN_", c(1:n_neighbors))
NN_dist_tb = rbind(NN_dist_tb, NN_dist)
NNs_loc_year_tb = rbind(NNs_loc_year_tb, NNs_loc_year)
NN_sigma_tb = rbind(NN_sigma_tb, NN_sigma_df)}
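The excerpt is not enough to say where the ordering difference comes from, so the following is only a workaround sketch, not a diagnosis. It forces a canonical column order before comparing results from two runs. The run_a and run_b data frames are hypothetical stand-ins for the outputs of two runs.

```r
# Hypothetical outputs of two runs: same values, different column order
run_a <- data.frame(NN_1 = c(0.1, 0.2), NN_2 = c(0.3, 0.4))
run_b <- data.frame(NN_2 = c(0.3, 0.4), NN_1 = c(0.1, 0.2))

# Build one canonical column order. method = "radix" sorts character
# vectors in the C locale, so the order does not depend on the locale
# settings of the machine the code runs on.
canonical <- sort(names(run_a), method = "radix")

run_a <- run_a[, canonical]
run_b <- run_b[, canonical]

# Compare the two runs once the columns are aligned
same_result <- isTRUE(all.equal(run_a, run_b))
```

Locale-dependent sorting is one of the things that can differ between a laptop and a cluster, which is why the sketch avoids it, but whether that is the cause here is an assumption.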

How can I simultaneously assign value to multiple new columns with R and dplyr?

Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you can do is use do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element per new column.
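As a sketch of that idea, assuming dplyr 1.0.0 or later: when an unnamed expression inside mutate() returns a data frame, its columns are added to the result. The f_list function below is a hypothetical variant of f that returns a named list.

```r
library(dplyr)

base <- data.frame(a = 1)

# Hypothetical variant of f() that returns a named list, one element per column
f_list <- function() list(b = 2, c = 3, d = 4)

result <- base %>%
  rowwise() %>%
  mutate(as.data.frame(f_list())) %>%  # unnamed data frame is spliced into columns
  ungroup()
```

Here as.data.frame() turns the list into a one-row data frame, which matches the one-row groups that rowwise() creates.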
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
The new columns get random numeric names, so rename them afterwards.
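If the renaming step is a nuisance, a variant that takes the new column name as an argument might look like this. The add_named_col function and its new_name parameter are hypothetical and not part of the answer above.

```r
# starting data frame, as in the example above
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
add_to <- function(x) { x + 100 }

# Hypothetical variant: the caller chooses the name of the new column
add_named_col <- function(frame, funct, col_choice, new_name) {
  frame[[new_name]] <- funct(frame[[col_choice]])
  frame
}

df_named <- add_named_col(base_frame, add_to, "col_a", "col_a_plus_100")
head(df_named)
```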

Generating a unique ID column for large dataset with the RecordLinkage package

I am trying to generate a unique ID column using the RecordLinkage package. I have successfully done so when working with smaller datasets (<= 1,000,000), but have not been able to reproduce this result for larger datasets (> 1,000,000) that use different (but similar) functions in the package. I am given multiple identifier variables for which I want to generate a unique ID despite the fact that there may be some errors (near matches) or duplicates in the records.
Given some data frame of identifiers:
data(RLdata500)
df_identifiers <- RLdata500
This is the code for the smaller datasets (which works):
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- compare.dedup(df_identifiers)
p=epiWeights(rpairs)
classify <- epiClassify(p,0.3)
matches <- getPairs(object = classify, show = "links", single.rows = TRUE)
# this code writes an "ID" column that is the same for similar identifiers
classify <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_identifiers$ID_prior <- df_identifiers$ID
# merge matching information with the original data
df_identifiers <- left_join(df_identifiers, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))
# replace matches in ID with the thing they match with from ID.1
df_identifiers$ID <- ifelse(is.na(df_identifiers$ID.1), df_identifiers$ID, df_identifiers$ID.1)
This approach is discussed here. But the code does not carry over to larger datasets, which require other functions. For example, the big-data equivalent of compare.dedup is RLBigDataDedup, whose RLBigData class supports similar functions such as epiWeights, epiClassify, getPairs, etc. Simply replacing compare.dedup with RLBigDataDedup does not work in this situation.
Consider the following attempt for large datasets:
df_identifiers <- df_identifiers %>% mutate(ID = 1:nrow(df_identifiers))
rpairs <- RLBigDataDedup(df_identifiers)
p=epiWeights(rpairs)
( . . . )
Here, the remaining code is almost identical to the first version. Although epiWeights and epiClassify work on the RLBigData class as expected, getPairs does not: for this class it does not accept the show = "links" argument, so none of the subsequent code works.
Is there a different approach that needs to be taken to generate a column of unique IDs when working with larger datasets in the RLBigData class, or is this just a limitation?
First, import the following libraries:
library(RecordLinkage)
library(dplyr)
library(magrittr)
Consider these example datasets from the RecordLinkage package:
data(RLdata500)
data(RLdata10000)
Assume we care about these matching variables and threshold:
matching_variables <- c("fname_c1", "lname_c1", "by", "bm", "bd")
threshold <- 0.5
The record linkage for SMALL datasets is as follows:
RLdata <- RLdata500
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
compare.dedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(show = "links", single.rows = TRUE) -> matching_data
Here, the following SMALL data manipulation may be applied to append the appropriate IDs to the given dataset (same code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id1, id2) %>%
arrange(id1) %>% filter(!duplicated(id2)),
by = c("ID" = "id2")) %>%
mutate(ID = ifelse(is.na(id1), ID, id1)) %>%
select(-id1)
RLdata$ID <- RLdata_ID$ID
The equivalent code for LARGE datasets is as follows:
RLdata <- RLdata10000
df_names <- data.frame(RLdata[, matching_variables])
df_names %>%
RLBigDataDedup() %>%
epiWeights() %>%
epiClassify(threshold) %>%
getPairs(filter.link = "link", single.rows = TRUE) -> matching_data
Here, the following LARGE data manipulation may be applied to append the appropriate IDs to the given dataset (similar to code from here):
RLdata_ID <- left_join(mutate(df_names, ID = 1:nrow(df_names)),
select(matching_data, id.1, id.2) %>%
arrange(id.1) %>% filter(!duplicated(id.2)),
by = c("ID" = "id.2")) %>%
mutate(ID = ifelse(is.na(id.1), ID, id.1)) %>%
select(-id.1)
RLdata$ID <- RLdata_ID$ID
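The ID-assignment step is the same in both cases, so it can be looked at in isolation. Below is a minimal base R sketch of that logic on made-up link pairs. The assign_ids function and the toy pairs data frame are hypothetical, and the columns are called id1 and id2 here regardless of how getPairs names them.

```r
# Hypothetical helper: each record keeps its own index as ID unless it
# appears as id2 in a link, in which case it takes the smallest linked id1
assign_ids <- function(n_records, pairs) {
  pairs <- pairs[order(pairs$id1), ]        # smallest id1 first
  pairs <- pairs[!duplicated(pairs$id2), ]  # keep one link per id2
  ids <- seq_len(n_records)
  ids[pairs$id2] <- pairs$id1
  ids
}

# Toy links: records 1, 2 and 3 are all linked to each other
pairs <- data.frame(id1 = c(1, 1, 2), id2 = c(2, 3, 3))
ids <- assign_ids(5, pairs)
```

One caveat of this scheme: chains are not resolved. If record 3 is linked only to record 2, and record 2 is linked to record 1, then record 3 gets ID 2 rather than 1.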

How do I refer to all components of a list with a single function in R?

I have split my df into a list containing 500 groups thus:
c1=cut(SNP_Allele_Frequency$College_SE,500)
splitc1=split(SNP_Allele_Frequency,c1,drop=FALSE)
I need to find the mean of a variable for each of the 500 groups (levels) contained in the list splitc1. Instead of repeating the process 500 times (as below), is there a way I can do this with one function?
mean(splitc1[[1L]]$ACB)....mean(splitc1[[2L]]$ACB)...
mean(splitc1[[500L]]$ACB)
First let's make some reproducible data:
set.seed(24)
SNP_Allele_Frequency <- data.frame(College_SE = rnorm(1000), ACB = rnorm(1000))
Now using your original method:
c1 <- cut(SNP_Allele_Frequency$College_SE, 50)
splitc1 <- split(SNP_Allele_Frequency, c1, drop = FALSE)
lapply(splitc1, function(x) mean(x[["ACB"]]))
We could do it more cleanly in dplyr:
library(dplyr)
SNP_Allele_Frequency %>% mutate(c1 = cut(SNP_Allele_Frequency$College_SE, 50)) %>%
group_by(c1) %>%
summarise(meanACB = mean(ACB))
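If a plain named vector is more convenient than a list or a tibble, base R offers two more options. This is a sketch on the same simulated data as above.

```r
set.seed(24)
SNP_Allele_Frequency <- data.frame(College_SE = rnorm(1000), ACB = rnorm(1000))

c1 <- cut(SNP_Allele_Frequency$College_SE, 50)
splitc1 <- split(SNP_Allele_Frequency, c1, drop = FALSE)

# sapply() simplifies the list of means to a named numeric vector
group_means <- sapply(splitc1, function(x) mean(x$ACB))

# tapply() skips the explicit split() step altogether
group_means_tapply <- tapply(SNP_Allele_Frequency$ACB, c1, mean)
```

The two differ only for empty bins: sapply() calls mean() on zero rows and returns NaN, while tapply() fills empty cells with NA.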

Resources