for loop with dplyr - r

I have a bunch of files I read in manually as such:
# gel above replicates
A_gel <-read.delim("XL1_3_S35_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
B_gel <-read.delim("XL2_3_S37_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
C_gel <- read.delim("XL2_3_S37_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
D_gel <- read.delim("XL1_3_S35_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
# gel below replicates
A_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
B_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
C_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
D_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
I would like to change all the columns of these files and arrange by the start column with something like this:
colnames(A_gel) <- c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")
A_gel <- A_gel %>%
arrange(A_gel$Start)
Instead, I would like to use a for loop for all files using R.

Never create multiple variables following the same pattern. The properly supported solution for this general problem is the use of lists (i.e. instead of having variables A_gel, B_gel, …, you have one variable gel, which is a list that contains your individual data.frames; you can also assign names to these individual items, though in your case that doesn’t seem necessary).
Then you can use e.g. lapply to run over your file paths and read the data of the different files into that list:
gel = lapply(gel_filenames, read.delim)
below_gel = lapply(below_gel_filenames, read.delim)
… and likewise you can put your arrangement code into a function and apply that, changing the above to:
read_bed = function (filename) {
read.delim(filename) %>%
setNames(c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")) %>%
arrange(Start)
}
# …
gel = lapply(gel_filenames, read_bed)
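A self-contained sketch of how these pieces might fit together. The temporary directory, the short file names and the toy values are made up for illustration. It also assumes the files have a header row, because read.delim treats the first line as column names by default.

```r
library(dplyr)

read_bed <- function(filename) {
  read.delim(filename) %>%
    setNames(c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")) %>%
    arrange(Start)
}

# Hypothetical stand-in files written to a temporary directory.
bed_dir <- file.path(tempdir(), "bed_demo")
dir.create(bed_dir, showWarnings = FALSE)
toy <- data.frame(
  chr = "chr1",
  start = c(300L, 100L, 200L),
  end = c(350L, 150L, 250L),
  p = c(2.1, 3.4, 1.2),
  fc = c(0.5, 1.5, 1.0),
  strand = c("+", "-", "+")
)
write.table(toy, file.path(bed_dir, "A.compressed.bed"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(toy[c(2, 1, 3), ], file.path(bed_dir, "B.compressed.bed"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Collect the paths instead of typing each one, then read them into one list.
gel_filenames <- list.files(bed_dir, pattern = "\\.compressed\\.bed$",
                            full.names = TRUE)
gel <- lapply(gel_filenames, read_bed)
names(gel) <- basename(gel_filenames)
```

Individual tables are then available as gel[[1]] or gel[["A.compressed.bed"]].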
Better yet, use purrr::map_dfr to read all data into a single combined table:
gel = gel_filenames %>%
setNames(., .) %>%
map_dfr(read_bed, .id = 'Filename')
(The setNames(., .) step is necessary since map_dfr assigns the names of the input vector to the added ID column.)
This will create one master table for the “GEL” data, which has an added ID column for the original filename (you’ll probably want to extract just some ID from that, using tidyr::extract).
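A minimal sketch of that tidyr::extract step, run on a toy table rather than the real data. The output column names IdA and IdB are hypothetical, and the regular expression only assumes the "XL…_w_XL…" pattern visible in the file names from the question.

```r
library(tidyr)

# Toy stand-in for the combined table; only Filename matters here.
gel <- data.frame(
  Filename = c(
    "XL1_3_S35_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed",
    "XL2_3b_S38_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed"
  ),
  Start = c(100L, 200L),
  stringsAsFactors = FALSE
)

# Capture the identifier before "_w_" and the one after it.
# IdA and IdB are illustrative names, not part of the original data.
gel_ids <- tidyr::extract(
  gel, Filename,
  into = c("IdA", "IdB"),
  regex = "^(XL[0-9]+_[0-9]+b?)_.*_w_(XL[0-9]+_[0-9]+b?)_",
  remove = FALSE
)
```

With remove = FALSE the original Filename column is kept next to the two new columns.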

Related

Changing dataframes in bulk? How to apply a list of operations to multiple dataframes?

So, I have 6 data frames, all look like this (with different values):
Now I want to create a new column in all the data frames for the country. Then I want to convert it into a long df. This is how I am going about it.
dlist<- list(child_mortality,fertility,income_capita,life_expectancy,population)
convertlong <- function(trial){
trial$country <- rownames(trial)
trial <- melt(trial)
colnames(trial)<- c("country","year",trial)
}
for(i in dlist){
convertlong(i)
}
After running this I get:
Using country as id variables
Error in names(x) <- value :
'names' attribute [5] must be the same length as the vector [3]
That's all; it doesn't do the operations on the data frames. I am pretty sure I'm making a stupid mistake, but I looked on online forums and cannot figure it out.
maybe you can replace
trial$country <- rownames(trial)
by
trial <- cbind(trial, rownames(trial))
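Reading the posted code, two separate things seem to go wrong: c("country","year",trial) puts the data frame itself into the vector of names (hence the length error), and the loop never stores what the function returns. Below is a base-R sketch that addresses both. The toy numbers are made up, and the reshaping is done by hand instead of with melt, so treat it as one possible approach rather than a drop-in fix.

```r
# Toy stand-ins for two of the data frames (values are invented).
child_mortality <- data.frame(`1990` = c(10.2, 35.1), `2000` = c(7.4, 28.9),
                              row.names = c("Norway", "India"),
                              check.names = FALSE)
fertility <- data.frame(`1990` = c(1.9, 4.0), `2000` = c(1.8, 3.3),
                        row.names = c("Norway", "India"),
                        check.names = FALSE)
dlist <- list(child_mortality = child_mortality, fertility = fertility)

convertlong <- function(trial, value_name) {
  long <- data.frame(
    country = rep(rownames(trial), times = ncol(trial)),
    year = rep(colnames(trial), each = nrow(trial)),
    value = unlist(trial, use.names = FALSE),  # columns stacked in order
    stringsAsFactors = FALSE
  )
  names(long)[3] <- value_name  # a string, not the data frame itself
  long                          # return the reshaped data frame
}

# Map passes each data frame together with its name, and the result is stored.
dlist <- Map(convertlong, dlist, names(dlist))
```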
Here's a tidyverse attempt -
library(tidyverse)
#Put the dataframes in a named list.
dlist<- dplyr::lst(child_mortality, fertility,
income_capita, life_expectancy,population)
#lst is not a typo!!
#Write a function which creates a new column with rowname
#and gets the data in long format
#The column name for 3rd column is passed separately (`col`).
convertlong <- function(trial, col){
trial %>%
rownames_to_column('country') %>%
pivot_longer(cols = -country, names_to = 'year', values_to = col)
}
#Use `imap` to pass dataframe as well as its name to the function.
dlist <- imap(dlist, convertlong)
#If you want the changes to be reflected for dataframes in global environment.
list2env(dlist, .GlobalEnv)
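To see what this produces, here is the same approach run on two toy data frames (invented values). As an assumption for the demo, the results are written into a separate environment, here called demo_env, instead of .GlobalEnv, so nothing in the workspace is overwritten.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)

# Toy inputs: countries as row names, years as columns.
child_mortality <- data.frame(`1990` = c(10.2, 35.1), `2000` = c(7.4, 28.9),
                              row.names = c("Norway", "India"),
                              check.names = FALSE)
fertility <- data.frame(`1990` = c(1.9, 4.0), `2000` = c(1.8, 3.3),
                        row.names = c("Norway", "India"),
                        check.names = FALSE)

dlist <- dplyr::lst(child_mortality, fertility)  # names taken from the objects

convertlong <- function(trial, col) {
  trial %>%
    rownames_to_column("country") %>%
    pivot_longer(cols = -country, names_to = "year", values_to = col)
}

# imap calls convertlong(<data frame>, <its name in the list>).
dlist <- imap(dlist, convertlong)

# demo_env is a hypothetical scratch environment used only for this example.
demo_env <- new.env()
list2env(dlist, envir = demo_env)
```

Each element ends up with the columns country, year and a value column named after the data frame.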

R - Combine multiple data frames according to the pattern in their name

I would like to combine data frames in the global environment according to the pattern in their name, and simultaneously add the name of the file they are originally from.
My problem is that I have originally a zip file, with over 20 text files in the main folder and sub-folders, which observe mainly two different scenarios: "test" and "train". Hence, I decided to first read ALL of the txt files into R, create two different lists of df names which either have "test" or "train" pattern and using those lists merge the dataframes into two main dataframes. Now, I need to combine those dataframes according to the names in the list, but the rbind just creates another list of their names - how to make rbind treat inputs as objects from the name list, not strings?
Moreover, rbind would combine the dfs without any way to record which df each row came from. Is there a solution that combines the dfs and adds each df's name as a column at the same time?
What I did so far:
#loading the necessary libraries
library(dplyr)
library(readr)
library(easycsv)
#setting url and directory of the data file
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
destination <- "accelerometer_data.zip"
#downloading the file and storing it into computer memory
download.file(url, destfile = destination)
#read all txt files into R
test_folder <- easycsv::fread_zip(file = destination,
extension = "TXT")
#create a list of "test" data frames
list_test <- as.list(
do.call(cbind, ls(
grep(pattern = "^UCI+(.*)test",
x = ls(),
value = TRUE)
)
)
)
)
#bind dfs as named in list_test
test_df <- lapply(list_test, FUN = function(x) {
rbind(
eval(
parse(text = x)
)
)
}
)
You can use mget to get all the data with specific pattern in a list, then use dplyr::bind_rows to combine them into one dataframe and use .id parameter to include the file name as a separate column.
library(dplyr)
test_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)test", x = ls(),
value = TRUE)), .id = 'filename')
train_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)train", x = ls(),
value = TRUE)), .id = 'filename')
However, the 'test' and 'train' files have dataframes with different number of columns hence you have certain columns with only NAs for some files. Maybe you need to update the pattern and make the pattern more strict?
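A small sketch of the mget plus bind_rows pattern, including the NA-filling behaviour just described. The object names and values are invented, and the objects live in a scratch environment (env) rather than the global one so the example does not depend on what is in your workspace.

```r
library(dplyr)

# Scratch environment with three made-up data frames.
env <- new.env()
assign("UCI_test_X", data.frame(a = 1:2, b = 3:4), envir = env)
assign("UCI_test_y", data.frame(a = 5:6), envir = env)
assign("UCI_train_X", data.frame(a = 7:8, b = 9:10), envir = env)

# Select object names by pattern, fetch the objects, and stack them.
test_names <- grep("^UCI.*test", ls(env), value = TRUE)
test_data <- bind_rows(mget(test_names, envir = env), .id = "filename")
```

Rows from UCI_test_y have NA in column b because that data frame has no such column. A stricter pattern that selects only data frames with the same layout would avoid this.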

How to clean multiple excel files in one time in R?

I have more than one hundred Excel files that need cleaning, and they all share the same data structure. The code below is what I use to clean a single Excel file. The file names all follow the pattern 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
return(df[start:end,])
}
dividers = which(df$Product_Models %in% 'SKU' == TRUE)
df <- lapply(1:(length(dividers)-1), function(x) in_between(df, start =
dividers[x], end = dividers[x+1]))
df <-do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame.
Can you help me write code that cleans all the Excel files in one go, so that I do not need to run this code a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")  # sheet name as used in the question
  # ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
X__1="New.1",
X__5="New.5",
X__9="New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
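A rough sketch of the two-step read described above, using the datasets.xlsx example workbook that ships with readxl because the real files are not available. The cell range and the header_offset value are assumptions for illustration; in your own files they would point at the project-name cells and at the number of rows above the header row.

```r
library(readxl)

# Example workbook bundled with readxl; stands in for one of your files.
path <- readxl_example("datasets.xlsx")

# Step 1: read only a few cells from the top of the sheet.
top_cells <- read_excel(path, sheet = "mtcars", range = "A1:C2",
                        col_names = FALSE)

# Step 2: read the table itself, skipping any rows above the header.
header_offset <- 0  # hypothetical: rows above the header row in your files
body <- read_excel(path, sheet = "mtcars", skip = header_offset, n_max = 5)
```

In this example workbook the header is the first row, so header_offset is 0. With a larger offset the column names come directly from the header row in the file.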

import multiple files and extract specific column in r

I have 20 data files (.txt). My end goal is to choose a specific column (let's say V3) from each of the 20 files and make a new file.
I tried
temp <- list.files(pattern='*.snp.blp')
How can I extract V3 from each of the 20 files and combine (cbind) them in R?
We can use fread from data.table, which has a select argument for reading only the columns we need instead of the whole file:
library(data.table)
library(purrr)
library(dplyr)
map(temp, fread, select = 'V3') %>%
bind_cols
If the number of rows is not the same, then use cbind.fill
out <- map(temp, fread, select = 'V3')
do.call(rowr::cbind.fill, c(out, fill = NA))
data
set.seed(24)
invisible(map(paste0('snp.blp', 1:3, '.csv'), ~
matrix(sample(1:10, 10 * 3, replace = TRUE), ncol = 3,
dimnames = list(NULL, paste0("V", 1:3))) %>%
as_tibble %>%
readr::write_csv(., path = .x)))
temp <- list.files(pattern='snp.blp')
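For the unequal-rows case, a base-R alternative to cbind.fill is to pad each column with NA up to the longest length before combining. This is a sketch on made-up vectors standing in for the V3 columns, not the method from the answer above.

```r
# Made-up stand-ins for the V3 column read from two files.
cols <- list(file1 = c(4L, 8L, 15L),
             file2 = c(16L, 23L, 42L, 7L, 1L))

# Extending a vector's length pads it with NA.
n_rows <- max(lengths(cols))
padded <- lapply(cols, function(x) {
  length(x) <- n_rows
  x
})

combined <- as.data.frame(padded)
```

The shorter column ends with NA values, so rows no longer line up across files beyond the shortest one.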
Arguably it's better to rbind() the rows of the same variable across multiple files than cbind() them, especially since cbind() fails when the files have different numbers of rows.
In the situation where we need to combine only a single column from multiple files, we can also use unlist() instead of rbind().
A complete, working example combining rows using base R can be accomplished with lapply(), an anonymous function, and unlist(). We'll use data from Alex Barradas' Pokémon Stats database from kaggle.com, where I've restructured the data into 6 CSV files, one for each of the first six generations of Pokémon.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
"pokemonData.zip",
method="wininet",mode="wb")
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
attackStats <- lapply(thePokemonFiles,function(x) {
# read data and subset to Attack stat using the extract operator [
read.csv(x)["Attack"]
})
# unlist to combine into a vector
attackStats <- unlist(attackStats)
# use the data in another R function
hist(attackStats)

Combine some csv files into one - different number of columns

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))
or
list_of_data = lapply(tbl, read.csv)
This is what it looks like:
> head(tbl)
[1] "F1.csv" "F10_noS3.csv" "F11.csv" "F12.csv" "F12_noS7_S8.csv"
[6] "F13.csv"
I have to combine all of those files into one. Let's call it a master file, but let's start by making one table with all of the names.
In all of those csv files is a column called "Accession". I would like to make a table of all "names" from all of those csv files. Of course many of the accessions can be repeated in different csv files. I would like to keep all of the data corresponding to the accession.
Some problems:
Some of those "names" are the same and I don't want to duplicate them
Some of those "names" are ALMOST the same. The only difference is the dot and the number that follow the name.
The number of columns can differ between those csv files.
That's the screenshot showing how those data looks like:
http://imageshack.com/a/img811/7103/29hg.jpg
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Is this possible?
I couldn't do a dput(head) because the data set is too big even for that.
I tried to use such code:
all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) :
The number of columns is not correct.
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
I tried to do it for almost 2 weeks and I am not able to. So please help me.
Your question seems to contain multiple subquestions. I encourage you to separate them.
The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package:
library(plyr)
all_data = do.call(rbind.fill, list_of_data)
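A sketch of the combine-then-clean steps on toy data. It swaps in dplyr::bind_rows for rbind.fill (both fill missing columns with NA), and the accession values and the columns S1 to S3 are invented. The cleaning step assumes the suffix is always a dot followed by digits at the end of the name.

```r
library(dplyr)

# Two toy tables with different columns, standing in for the csv files.
list_of_data <- list(
  data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"),
             S1 = c(1.5, 2.0), stringsAsFactors = FALSE),
  data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"),
             S2 = c(0.3, 0.9), S3 = c(4, 5), stringsAsFactors = FALSE)
)

# Stack the tables; columns missing from a table are filled with NA.
all_data <- bind_rows(list_of_data)

# Drop the trailing ".<number>" so both variants share one identifier.
all_data$CleanedAccession <- sub("\\.[0-9]+$", "", all_data$Accession)

# One entry per accession, without discarding any rows of all_data.
unique_accessions <- unique(all_data$CleanedAccession)
```

Keeping every row and grouping by CleanedAccession afterwards preserves all the data for an accession, whereas filtering with !duplicated(...) would keep only the first row of each.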
Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:
library(tidyverse)
# specify the target directory
dir_path <- '~/test_dir/'
# specify the naming format of the files.
# in this case csv files that begin with 'test' and a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'
# create sample data with some missing columns
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)
# custom function that takes the target directory and file name pattern as arguments
read_dir <- function(dir_path, file_name){
x <- read_csv(paste0(dir_path, file_name)) %>%
mutate(file_name = file_name) %>% # add the file name as a column
select(file_name, everything()) # reorder the columns so file name is first
return(x)
}
# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
list.files(dir_path, pattern = re_file) %>%
map_df(~ read_dir(dir_path, .))
# files with missing columns are filled with NAs.
