Write a function to manipulate and then write a dataframe - r

I would like to read multiple .csv files (data frames) from a folder, apply a function that I create to all of them, and finally have this function write the new .csv files.
I want the function to do the following three things:
df$Class <- gsub("null", "OTHER", df$Class)
df$Class <- gsub(': ', ',', df$Class)
df <- df %>% select(c(Image, everything(.), -Name))
I don't really know how to put these things into a function, but I've tried:
`
file_names <- list.files(pattern = "\\.csv$")
tidy_up_fxn <- function(file_names) {
df <- do.call(bind_rows,lapply(file_names,data.table::fread))
df$Class <- gsub("null", "OTHER", df$Class)
df$Class <- gsub(': ', ',', df$Class)
df <- df %>% select(c(Image, everything(.), -Name))
out <- function(df)
fwrite(out, file = file_names, sep = ",")
}
tidy_up_fxn(file_names)
`
When I run it, R gets busy for a few seconds and then nothing happens. Please, help correct my function!

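The following works the way I intended:
library(dplyr)       # bind_rows(), select(), %>%
library(data.table)  # fread(), fwrite()
file_names <- list.files(pattern = "\\.csv$")
tidy_up_fxn <- function(file_names) {
  # Read every file and stack the rows into one data frame
  df <- bind_rows(lapply(file_names, fread))
  df$Class <- gsub("null", "OTHER", df$Class)
  df$Class <- gsub(': ', ',', df$Class)
  # Move Image to the front and drop Name
  df <- df %>% select(Image, everything(), -Name)
  # Write the combined result to a single new file
  fwrite(df, file = "new.csv", sep = ",")
}
tidy_up_fxn(file_names)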
Thank you all!!
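If the goal is one tidied output file per input file, as the question first describes, rather than a single combined new.csv, a per-file helper is one way to go. The following is a minimal sketch in base R only (swapped in for dplyr and data.table so that it runs without extra packages). The function name tidy_one_file, the out_dir argument, and the "tidy" output folder are hypothetical.

```r
# Hypothetical helper: tidy one CSV file and write the result to out_dir.
tidy_one_file <- function(path, out_dir = "tidy") {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df$Class <- gsub("null", "OTHER", df$Class)
  df$Class <- gsub(": ", ",", df$Class)
  # Put Image first, keep the remaining columns in order, drop Name
  keep <- c("Image", setdiff(names(df), c("Image", "Name")))
  df <- df[, keep, drop = FALSE]
  # Write under the same file name in a separate folder,
  # so the inputs are never overwritten
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out_path <- file.path(out_dir, basename(path))
  write.csv(df, out_path, row.names = FALSE)
  out_path
}

# Usage (assumes the working directory contains the .csv files):
# file_names <- list.files(pattern = "\\.csv$")
# written <- vapply(file_names, tidy_one_file, character(1))
```

Returning the output path from the helper makes it easy to check afterwards which files were written.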

Related

generate variable names in for loop

Hope you don't mind if this is too easy for you.
In R, I am using fromJSON() to read from 3 URLs (tier 1 URLs). In the JSON there is a "link" field which gives me another URL (tier 2 URL), and I pass that to read.table() to get my final data. My code currently looks like this:
# note, this code does not run
urlJohn <- www.foo1.com
urlJane <- www.foo2.com
urlJoe <- www.foo3.com
tempJohn <- fromJson(urlJohn)
tempJohn[["data"]][["rows"]]$link %<>%
{clean up this data}
dataJohn <- read.table(tempJohn[["data"]][["rows"]]$link,
header = TRUE,
sep = ",")
tempJane <- fromJson(urlJane)
tempJane[["data"]][["rows"]]$link %<>%
{clean up this data}
dataJane <- read.table(tempJane[["data"]][["rows"]]$link,
header = TRUE,
sep = ",")
tempJoe <- fromJson(urlJoe)
tempJoe[["data"]][["rows"]]$link %<>%
{clean up this data}
dataJoe <- read.table(tempJoe[["data"]][["rows"]]$link,
header = TRUE,
sep = ",")
As you can see, I am just copying-n-pasting code blocks. What I wish is this:
# note, this code also does not run
urlJohn <- www.foo1.com
urlJane <- www.foo2.com
urlJoe <- www.foo3.com
source <- c("John", "Jane", "joe")
for (i in source){
temp <- paste(temp, i, sep = "")
url <- paste(url, i, sep = "")
data <- paste(data, i, sep = "")
temp <- fromJson(url)
temp[["data"]][["rows"]]$link %<>%
{clean up this data}
data <- read.table(temp[["data"]][["rows"]]$link,
header = TRUE,
sep = ",")
}
What do I need to do to make the for loop work? If my question is not clear, please ask me to clarify it.
I usually find lapply more convenient than a for loop, although you can easily convert this to a for loop if needed.
URLs <- c('www.foo1.com', 'www.foo2.com', 'www.foo3.com')
lapply(URLs, function(x) {
temp <- jsonlite::fromJSON(x)
temp[["data"]][["rows"]]$link %<>% {clean up this data}
read.table(temp[["data"]][["rows"]]$link,header = TRUE,sep = ",")
}) -> list_data
list_data
Thanks to @Ronak Shah. The R community strongly favors "non-for-loop" solutions.
The way to get my desired result is lapply.
Below is non-running pseudocode:
URLs <- c('www.foo1.com', 'www.foo2.com', 'www.foo3.com')
lapply(URLs, function(x) {
temp <- jsonlite::fromJSON(x)
x <- temp[["data"]][["rows"]]$link %<>% {clean up this data}
y <- read.table(temp[["data"]][["rows"]]$link,header = TRUE,sep = ",")
return(list(x, y))
})
And this is a running example.
x <- list(alpha = 1:10,
beta = exp(-3:3),
logic = c(TRUE,FALSE,FALSE,TRUE))
lapply(x, function(x){
temp <- sum(x) / 2
temp2 <- list(x,
temp)
return(temp2)
}
)
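Instead of generating variable names such as dataJohn inside a loop, the results can be kept in a named list and looked up by name. The sketch below assumes a hypothetical fetch_data() function standing in for the fromJSON() and read.table() steps, so that it runs without network access; the URLs are the placeholders from the question.

```r
# Hypothetical stand-in for the fromJSON() + read.table() steps.
# It only records the URL so the example runs offline.
fetch_data <- function(url) {
  data.frame(source = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

# A named vector: the names play the role of the variable suffixes
urls <- c(John = "www.foo1.com", Jane = "www.foo2.com", Joe = "www.foo3.com")

# lapply() keeps the names, so each result can be accessed by person
data_by_person <- lapply(urls, fetch_data)
data_by_person$Jane

# The same thing written as a for loop that fills a list
results <- list()
for (person in names(urls)) {
  results[[person]] <- fetch_data(urls[[person]])
}
```

Both versions give a list with elements John, Jane and Joe, which replaces the separate dataJohn, dataJane and dataJoe variables.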

How to operate on variables in R functions

I am trying to do the following operations on data frame variables:
ptinr <- read.csv('ptinr.CSV')
ptinr$project <- gsub("_19T228z1xx","", ptinr$project)
ptinr$Subject <- as.integer(gsub("CTMS-",'', ptinr$Subject))
ptinr$Subject <- sprintf("%03d", ptinr$Subject)
ptinr$Subject <- paste0(ptinr$project,'-',ptinr$Subject)
I want to convert this to a function and pass the file name. Any suggestions?
Do you mean this kind of function?
f <- function(fname) {
ptinr <- read.csv(fname)
ptinr$project <- gsub("_19T228z1xx", "", ptinr$project)
ptinr$Subject <- as.integer(gsub("CTMS-", "", ptinr$Subject))
ptinr$Subject <- sprintf("%03d", ptinr$Subject)
ptinr$Subject <- paste0(ptinr$project, "-", ptinr$Subject)
ptinr
}
An option with tidyverse
library(readr)
library(stringr)
library(dplyr)
f1 <- function(fname) {
read_csv(fname) %>%
mutate(project = str_remove(project, '_19T228z1xx'),
Subject = glue::glue('{project}-',
'{sprintf("%03d", parse_number(Subject))}'))
}
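To see what the base R function does to a small input, it can be run against a temporary file. This is only a sketch: the sample rows and the temporary file are made up for illustration, and the function body is the same as f above.

```r
# Same logic as f above: read the file, then rebuild Subject
f <- function(fname) {
  ptinr <- read.csv(fname)
  ptinr$project <- gsub("_19T228z1xx", "", ptinr$project)
  ptinr$Subject <- as.integer(gsub("CTMS-", "", ptinr$Subject))
  ptinr$Subject <- sprintf("%03d", ptinr$Subject)
  ptinr$Subject <- paste0(ptinr$project, "-", ptinr$Subject)
  ptinr
}

# Made-up sample data written to a temporary CSV file
sample_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(project = c("ABC_19T228z1xx", "ABC_19T228z1xx"),
             Subject = c("CTMS-7", "CTMS-42")),
  sample_file, row.names = FALSE
)

cleaned <- f(sample_file)
cleaned$Subject
# [1] "ABC-007" "ABC-042"
```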

Writing a csv file in R with parameter in the file name

I am doing a small log-processing project in R. I am trying to write a function that takes a data frame and writes it to a csv file whose name includes some parameters (data frame name, today's date, etc.).
I have made some progress but haven't managed to write the csv. I hope the code is reproducible and good.
library(dplyr)
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0('"',"./logs/",dfname, "_", Sys.Date(),'.csv"')
dfpath <- as.data.frame(dfpath)
df %>% write_excel_csv(dfpath)
}
wrt_csv(mtcars)
EDIT: this is the final version that works well. Thanks to Ronak Shah.
wd<- getwd()
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0(wd,'/logs/',dfname, '_', Sys.Date(),'.csv')
df %>% write_excel_csv(dfpath)
}
However, I now have a bunch of data frames that I want to run the function on. Should I put them in a list? This didn't quite work:
l <- list(df1,df2)
lapply(l , wrt_csv)
Any thoughts?
Thanks!
Keep dfpath as a string. Try:
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- paste0('./logs/',dfname, '_', Sys.Date(),'.csv')
write.csv(df, dfpath, row.names = FALSE)
#Or same as OP
#df %>% write_excel_csv(dfpath)
}
wrt_csv(mtcars)
We can also do
wrt_csv <- function(df) {
dfname <- deparse(substitute(df))
dfpath <- sprintf('./logs/%s_%s.csv', dfname, Sys.Date())
write.csv(df, dfpath, row.names = FALSE)
}
wrt_csv(mtcars)
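On the follow-up about lapply(l, wrt_csv): inside lapply, deparse(substitute(df)) no longer sees the original variable name (it typically yields something like "X[[i]]"), so the file names come out wrong. One way around this is to use a named list and pass each name explicitly. The sketch below is written under that assumption; write_named_csv and its dir argument are hypothetical, and it writes to a temporary folder so it can be run anywhere.

```r
# Hypothetical variant of wrt_csv that takes the name as an argument
write_named_csv <- function(df, name, dir = "logs") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, sprintf("%s_%s.csv", name, format(Sys.Date())))
  write.csv(df, path, row.names = FALSE)
  path
}

# A named list: the names become part of the file names
dfs <- list(mtcars = mtcars, iris = iris)

# Temporary folder used here only so the example does not touch ./logs
out_dir <- file.path(tempdir(), "logs")

# mapply() walks over the data frames and their names in parallel
paths <- mapply(write_named_csv, dfs, names(dfs),
                MoreArgs = list(dir = out_dir))
paths
```

The trade-off is that the name is no longer picked up automatically, but the function then works the same way whether it is called directly or from a loop.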

Merging multiple text files into one using R

I am trying to merge multiple text files into a csv and have done it successfully using the following code. I have one additional requirement: I need to add the name of the file in a separate column indicating where the data came from. Please suggest.
rm(list=ls())
setwd("D:/Cersai Rejection Reasons/IT_Oct18-Jun19")
file_list <- list.files()
df <- data.frame(file_list)
library(plyr)
library(dplyr)
files <- dir("D:/Cersai Rejection Reasons/IT_Oct18-Jun19",
full.names = TRUE)
df <- lapply(files, function(x)
read.table(x, sep = '\t', header = FALSE)) %>%
plyr::ldply()
write.csv(df, file="D:/consolidatetext.csv")
library(dplyr)
df <- lapply(files, function(x) {
df <- read.table(x, sep = '\t', header = FALSE, stringsAsFactors = FALSE)
df$source <- x
return(df)
}) %>%
bind_rows()
write.csv(df, file="D:/consolidatetext.csv")
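The answer above stores the full path in the source column. If only the file name is wanted, basename() can be applied first. The following is a minimal base R sketch of the same idea on made-up files in a temporary folder; read_with_source and the demo file names are hypothetical, and rbind is swapped in for bind_rows, which assumes all files have the same columns.

```r
# Hypothetical helper: read one tab-separated file and tag its rows
read_with_source <- function(path) {
  df <- read.table(path, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
  df$source <- basename(path)  # file name only, without the folder
  df
}

# Made-up input files in a temporary folder
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("a\t1", "b\t2"), file.path(demo_dir, "first.txt"))
writeLines("c\t3", file.path(demo_dir, "second.txt"))

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file and stack the rows; each row keeps its origin
merged <- do.call(rbind, lapply(files, read_with_source))
merged
```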

read tables and assign to a string in a loop in R

I'm sure there is a trivial answer to this but I can't seem to find the right code. I have a list of files and a list of strings, and I would like to assign the contents of each file, as a dataframe, to a variable named after the corresponding string. Then I would like to perform other things on the dataframes within the same loop. I also need to keep each dataframe for downstream work. Here is my code:
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
fc_files <- c('fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt','fc21_g21_full_annot_uniq.txt')
# make dataframes
for (file in fc_files)
{ fc_n <- 1
g_n <- 1
print(file);
# THE BIT THAT DOESN'T WORK
assign(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T));
# HERE I EXPECT THE TOP OF MY DF TO BE PRINTED BUT IT ISN'T
head(data_fc14);
# I TRY THIS INSTEAD
do.call("<-",list(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T)))
# I TRY TO PRINT THE DF AGAIN BUT STILL NO LUCK
head(paste("data", fc_samples[fc_n], sep='_'))
# FIRST DOWNSTREAM THING I WOULD LIKE TO DO,
# WON'T WORK UNTIL I SOLVE THE DF ASSIGNMENT ISSUE
names(paste("data", fc_samples[fc_n], sep='_'))[names(paste("data", fc_samples[fc_n], sep='_'))==c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc','DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc',
# 'NOVEL_fc')] <- c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ','GENE','AFFECTS','dbSNP','NOVEL')
# ITERATE TO THE NEXT FILE
fc_n <- fc_n+1
}
I tried solutions from here and here but it didn't help.
If anyone has an elegant solution to this then that would be great! Thanks in advance!
Fixing your code:
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
# Make dummy example files
fc_files <- file.path("example-data", c(
'fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt',
'fc21_g21_full_annot_uniq.txt'))
set.seed(123) ; dummy_df <-
setNames(
as.data.frame(replicate(12, rnorm(7))),
c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc',
'DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc','NOVEL_fc')
)
if (!dir.exists("./example-data")) dir.create("example-data")
invisible({
lapply(fc_files, write.table, x = dummy_df, sep = "\t")
})
# "fc_n <- 1" should be outside the loop:
fc_n <- 1
for (file in fc_files) {
g_n <- 1
assign(paste("data", fc_samples[fc_n], sep='_'),
read.table(file,sep = "\t", header=T))
# Copy data to be able to change its names
f <- get(paste("data", fc_samples[fc_n], sep='_'))
names(f)[names(f) == c('SAMPLE_fc','CHROM_fc','START_fc',
'REF_fc','ALT_fc','REGION_fc',
'DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc',
'dbSNP_fc','NOVEL_fc')] <-
c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ',
'GENE','AFFECTS','dbSNP','NOVEL')
# Assign it back, now that names have been changed
assign(paste("data", fc_samples[fc_n], sep='_'), f)
fc_n <- fc_n+1
}
A "more elegant" way:
Using assign() is not considered best practice; it is better to work with lists.
That said, I occasionally use it myself, as there are sometimes good reasons to.
# For the '%>%' pipe
library(magrittr)
data <-
samples %>%
grep(pattern = "fc", value = TRUE) %>%
setNames(nm = .) %>%
lapply(grep, x = fc_files, value = TRUE) %>%
lapply(read.table, sep = "\t", header = TRUE) %>%
lapply(function(f) setNames(f, sub("_fc", "", names(f))))
identical(data_fc14, data$fc14)
# [1] TRUE
identical(data_fc18, data$fc18)
# [1] TRUE
identical(data_fc21, data$fc21)
# [1] TRUE
# Clean up
print(unlink("example-data", recursive = TRUE))
samples <- c('fc14','g14','fc18','g18','fc21','g21')
fc_samples <- grep("fc", samples, value=TRUE)
fc_files <- c('fc14_g14_full_annot_uniq.txt','fc18_g18_full_annot_uniq.txt','fc21_g21_full_annot_uniq.txt')
g_files <- c('g14_full_annot_uniq.txt','g18_full_annot_uniq.txt','g21_full_annot_uniq.txt')
# make dataframes
df_names <- c("data_fc14","data_fc18","data_fc21")
fc_n <- 1
for (file in fc_files)
{
assign(df_names[fc_n], read.table(file,sep = "\t", header=T)); #WORKS
#do.call("<-",list(paste("data", fc_samples[fc_n], sep='_'), read.table(file,sep = "\t", header=T))); #ALSO WORKS
print(head(df_names[fc_n]))
print(head(eval(as.symbol(df_names[fc_n]))))
df <- eval(as.symbol(df_names[fc_n]))
names(df)[names(df) == c('SAMPLE_fc','CHROM_fc','START_fc','REF_fc','ALT_fc','REGION_fc','DP_fc','FREQ_fc','GENE_fc','AFFECTS_fc','dbSNP_fc',
'NOVEL_fc')] <- c('SAMPLE','CHROM','START','REF','ALT','REGION','DP','FREQ','GENE','AFFECTS','dbSNP','NOVEL')
assign(df_names[fc_n], df)
print(head(eval(as.symbol(df_names[fc_n]))))
print(file);
fc_n <- fc_n+1
}
Thanks to all who helped. I solved it using the advice from "apom" in the end, as it is the most intuitive for more novice R users.
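Two points from the answers above can be shown in isolation: a variable created with assign() is read back by name with get(), and inside a for loop a value is only displayed when it is wrapped in print(). This is a minimal sketch with made-up two-column data frames in place of the files; the column values are hypothetical.

```r
sample_names <- c("fc14", "fc18")

for (i in seq_along(sample_names)) {
  var_name <- paste("data", sample_names[i], sep = "_")

  # Create a variable whose name is built from a string
  assign(var_name, data.frame(SAMPLE_fc = i, CHROM_fc = "chr1"))

  # Fetch the data frame by name, rename its columns, store it back
  df <- get(var_name)
  names(df) <- sub("_fc$", "", names(df))
  assign(var_name, df)

  # Inside a loop nothing is auto-printed, so print() is needed
  print(head(get(var_name)))
}

names(data_fc14)
# [1] "SAMPLE" "CHROM"
```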
