Create loop to change structure of multiple data frames in R

I have a bunch of Excel files that I have loaded into R as separate data frames. I now need to change the structure/layout of every one of these data frames. I have done all of this separately, but it is becoming very time consuming, and I am sure there is a better way to accomplish it. My guess is that I need to combine them all into a list and then create some type of loop to go through every data frame in that list. For each one I need to remove rows and columns from the edge, add 'row' to the top-left cell that is currently empty, and then apply the pivot_longer, mutate, and select steps listed below, which I have so far run separately.
names(df)[1] <- 'row'
df <- df %>%
  pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
df <- df %>%
  mutate(wellID = paste0(row, plateColumn)) %>%
  select(-c(row, plateColumn))
I have tried what is below and I get an error; does anyone have a better way than what I am currently doing to accomplish this?
for(x in seq_along(files.list)){
  names(files.list)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
}
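The loop fails because its body never indexes files.list[[x]]: names(files.list)[1] renames the list itself rather than the first column of a data frame, and df is neither taken from nor written back to the list. A minimal corrected sketch, assuming files.list is a list of data frames:
for (x in seq_along(files.list)) {
  df <- files.list[[x]]           # take the x-th data frame out of the list
  names(df)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0") %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  files.list[[x]] <- df           # store the reshaped frame back
}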

If you have a vector of filenames my_files, I think this will work
library(tidyverse)
library(readxl)
prepare_df <- function(df) {
  # make changes to df
  names(df)[1] <- 'row'
  df <- df %>%
    pivot_longer((!row), names_to = "plateColumn", values_to = "Broth_t0")
  df <- df %>%
    mutate(wellID = paste0(row, plateColumn)) %>%
    select(-c(row, plateColumn))
  return(df)
}
names(my_files) <- my_files # naming the vector lets bind_rows(.id) record which file each row came from
dfs <- map(my_files, read_excel) # read into a list of data frames
dfs <- map(dfs, prepare_df) # prepare each one
df <- bind_rows(dfs, .id = "file") # if you prefer one data frame instead
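One way to build that vector, as a sketch assuming the workbooks live in the current working directory:
my_files <- list.files(pattern = "\\.xlsx?$", full.names = TRUE) # matches .xls and .xlsx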

Related

R - filtering and merging multiple objects

Can anyone tell me an elegant solution to achieve the same result with less code? As far as I know, there is no "un-filter" in R.
test1 <- dat_long %>%
  filter(time == "0month") %>%
  merge(blood_m1, by = "ID")
test2 <- dat_long %>%
  filter(time == "3month") %>%
  merge(blood_m2, by = "ID")
test3 <- dat_long %>%
  filter(time == "4month") %>%
  merge(blood_m3, by = "ID")
test_long <- test3 %>%
  bind_rows(test2) %>%
  bind_rows(test1)
Basically, I want to build the test_long df as a single object, with everything connected by %>%. Thanks!
I think you want a cleaner blood_all lookup table to begin with; here's a suggestion:
library(dplyr)
blood_all <-
  # create named list
  list("0month" = blood_m1, "3month" = blood_m2, "4month" = blood_m3) %>%
  # bind into a single tibble, placing names into "time" column
  bind_rows(.id = "time")
test <- dat_long %>%
  merge(blood_all, by = c("ID", "time")) # or inner_join to keep it tidyverse
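The inner_join variant mentioned in the comment, as a sketch assuming both tables carry ID and time columns:
test <- dat_long %>%
  inner_join(blood_all, by = c("ID", "time"))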

Split a data.frame by group into a list of vectors rather than a list of data.frames

I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(
  id = paste0("id", sample(1:10, 300, replace = TRUE)),
  group = c(rep("A", 100), rep("B", 100), rep("C", 100)),
  stringsAsFactors = FALSE
)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
  dplyr::group_by(group) %>%
  dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only with split; it should be faster than filtering with == against each unique value:
with(df, split(id, group))
Or, with tidyverse, we can pull the column after group_split. group_split returns a list of data.frames/tibbles and could be slower than the split-only method above, but here we can gain some performance by removing the group column (keep = FALSE) and then, for each element of the list, pulling the 'id' column to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>%
  map(~ pull(.x, id))
Or use {} with the pipe:
df %>%
  {split(.$id, .$group)}
Or wrap it in with:
df %>%
  with(., split(id, group))
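To check the speed claim on your own data, here is a rough timing sketch (assuming the microbenchmark package is installed):
library(microbenchmark)
microbenchmark(
  base_split  = with(df, split(id, group)),
  filter_loop = lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id),
  times = 100
)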

Dynamically type casting to numeric in dplyr and sparklyr

The gist of this question is that I have some R code which works fine on a local data frame, but fails on a Spark data frame, even if otherwise the two tables are identical.
In R, given a dataframe of all character columns, one can dynamically type cast all the columns to numeric that can be safely converted to numeric with the following code:
require(dplyr)
require(varhandle)
require(sparklyr)
checkNumeric <- function(column)
{
  column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
typeCast <- function(df)
{
  columns <- colnames(df)
  numericIdx <- df %>% mutate(across(columns, checkNumeric)) %>% .[1,]
  doThese <- columns[which(numericIdx==T)]
  df <- df %>% mutate_at(all_of(vars(doThese)), as.numeric)
  return(df)
}
For a trivial example, one could run:
df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
newDF <- df %>% typeCast
class(df$Sepal.Length)
class(newDF$Sepal.Length)
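As a side check, checkNumeric can be exercised on plain vectors too (a quick sketch; assumes R >= 4.0 so the as.data.frame step keeps strings as character):
checkNumeric(c("1", "2.5", "30")) # TRUE: every value parses as numeric
checkNumeric(c("1", "abc"))       # FALSE: "abc" does not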
Now, typeCast will not work on a dataset like starwars, which has composite columns. But for other data frames, I would expect it to work just fine on a Spark data frame. It doesn't. That is:
sc <- spark_connect('yarn', config=config) # define your Spark configuration somewhere, that's outside the scope of this question
df <- copy_to(sc, iris, "iris")
newDF <- df %>% typeCast
This fails with the following error:
Error in .[1, ] : incorrect number of dimensions
When debugging, if we try to run this code:
columns <- colnames(df)
df %>% mutate(across(columns, checkNumeric))
This error is returned:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
What gives? Why would the code work fine on a local data frame, but not a Spark data frame?
I didn't find an exact solution per se, but I did find a workaround. (The across() call fails because dplyr verbs on a Spark table are translated to Spark SQL, and an arbitrary R function like checkNumeric has no SQL translation, hence the 'escape' error; spark_apply() instead ships the R code to each partition.)
typeCheckPartition <- function(df)
{
  require(dplyr)
  require(varhandle)
  checkNumeric <- function(column)
  {
    column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
  }
  # this works on non-spark data frames
  columns <- colnames(df)
  numericIdx <- df %>% mutate(across(all_of(columns), checkNumeric)) %>% .[1,]
  return(numericIdx)
}

typeCastSpark <- function(df, max_partitions = 1000, undo_coalesce = T)
{
  # numericIdxDf will have these dimensions: num_partition rows x num_columns
  # so long as num_columns is not absurd, this coalesce should make collect a safe operation
  num_partitions <- sdf_num_partitions(df)
  if (num_partitions > max_partitions)
  {
    undo_coalesce <- T && undo_coalesce
    df <- df %>% sdf_coalesce(max_partitions)
  } else
  {
    undo_coalesce <- F
  }
  columns <- colnames(df)
  numericIdxDf <- df %>% spark_apply(typeCheckPartition, packages=T) %>% collect
  numericIdx <- numericIdxDf %>% as.data.frame %>% apply(2, all)
  doThese <- columns[which(numericIdx==T)]
  df <- df %>% mutate_at(all_of(vars(doThese)), as.numeric)
  if (undo_coalesce)
    df <- df %>% sdf_repartition(num_partitions)
  return(df)
}
Just run the typeCastSpark function against your data frame and it will type cast to numeric all of the columns that can be safely converted.
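A usage sketch (hypothetical; assumes the Spark connection sc from the question):
df <- copy_to(sc, iris, "iris", overwrite = TRUE)
newDF <- typeCastSpark(df)
sdf_schema(newDF) # columns that parse as numeric should come back as DoubleType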

Read a set of files into a matrix in R

I am trying to read a set of tab-separated files into a matrix or data.frame. For each file I need to extract one column and then concatenate all the columns into a single matrix, keeping both column and row names.
I am using the tidyverse (and I am terrible at it). I successfully get the column names, but I lose the row names at the very last stage of processing.
library("purrr")
library("tibble")
samples <- c("a","b","c","d")
a <- samples %>%
  purrr::map_chr(~ file.path(getwd(), TARGET_FOLDER, paste(., "tsv", sep = "."))) %>%
  purrr::map(safely(~ read.table(., row.names = 1, skip = 4))) %>%
  purrr::set_names(samples) %>%
  purrr::transpose()
is_ok <- a$error %>% purrr::map_lgl(is_null)
x <- a$result[is_ok] %>%
  purrr::map(~ {
    v <- .[, 1]
    names(v) <- rownames(.)
    v
  }) %>%
  as_tibble(rownames = NA)
The x data.frame has correct colnames but lacks rownames. All the elements of the a list have the same rownames in the exact same order. I am aware of tricks like rownames(x) <- rownames(a$result[[1]]), but I am looking for a more consistent solution.
It turned out that the solution was easier than expected: using as.data.frame instead of the final as_tibble solved it.
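In other words, only the last step of the pipeline changes:
x <- a$result[is_ok] %>%
  purrr::map(~ {
    v <- .[, 1]
    names(v) <- rownames(.)
    v
  }) %>%
  as.data.frame() # unlike as_tibble(), this keeps the vector names as rownames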

Can I create a data.frame in R from an existing data.frame by assigning a list of col.names?

I have a data.frame where I assign each column name a vector of values:
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
I want to create a new data.frame but instead of assigning each column individually, I want to assign them all at once. For example, if I wanted to rename them all:
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1)
This obviously doesn't work. Is there a way to make it work?
I understand I can just rename using names(), but the scenario where this actually seems useful is when combining multiple data sets that share the same column names (and which I don't want to simply rbind):
dat1 <- data.frame(a=1:5,b=1:5,c=1:5)
dat2 <- data.frame(a=6:10,b=6:10,c=6:10)
dat.new <- data.frame(paste(names(dat1),'1',sep='') = dat1, paste(names(dat1),'2',sep='') = dat2)
library(dplyr)
library(tidyr)
library(magrittr)
Ok, here's the first part:
dat2 <- dat1 %>%
  setNames(names(.) %>% paste0("1"))
Here's the second part. The reshaping is a bit complex but more flexible, especially if you already have row IDs or different numbers of rows:
list(dat1, dat2) %>%
  bind_rows(.id = "number") %>%
  group_by(number) %>%
  mutate(id = 1:n()) %>%
  gather(variable, value, -number, -id) %>%
  unite(new_variable, variable, number) %>%
  spread(new_variable, value)
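For what it's worth, gather/spread have been superseded; here is a sketch of the same reshape with the newer pivot_* verbs (assumes tidyr >= 1.0):
list(dat1, dat2) %>%
  bind_rows(.id = "number") %>%
  group_by(number) %>%
  mutate(id = row_number()) %>%
  ungroup() %>%
  pivot_longer(-c(number, id), names_to = "variable") %>%
  unite(new_variable, variable, number) %>%
  pivot_wider(names_from = new_variable, values_from = value)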
