How to iterate a function through a list of matrices - r

I have a folder full of csv files that I have read and turned into matrices.
setwd("~/Desktop/EMD Test")
FilesToProcess <- list.files(pattern = "csv")
listOfFiles <- lapply(FilesToProcess, function(x) {
  out <- read.csv(x, header = FALSE, stringsAsFactors = FALSE)
  as.matrix(out)
})
Now I need to do an EMD calculation comparing all the files to the first one. Manually it looks like this:
emd(listOfFiles[[1]], listOfFiles[[2]])
What I would like to do is run this command against every file in listOfFiles, like
emd(listOfFiles[[1]], listOfFiles[[x]])
I have tried several things with lapply and for loops but nothing has worked.

We can use a nested lapply if we want to do the pairwise emd on all combinations of list elements:
lapply(seq_along(listOfFiles), function(i) lapply(seq_along(listOfFiles),
    function(j) emd(listOfFiles[[i]], listOfFiles[[j]])))
Or another option is combn, which could be more efficient because the number of combinations is reduced:
combn(listOfFiles, 2, FUN = function(x) emd(x[[1]], x[[2]]), simplify = FALSE)
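If you also want each pairwise result labelled with the pair of files it came from, you can run combn over the indices instead; a minimal sketch, assuming FilesToProcess is in the same order as listOfFiles:
pairs <- combn(seq_along(listOfFiles), 2, simplify = FALSE)
res <- lapply(pairs, function(ix) emd(listOfFiles[[ix[1]]], listOfFiles[[ix[2]]]))
names(res) <- sapply(pairs, function(ix) paste(FilesToProcess[ix], collapse = " vs "))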

If you want to compare the first file with all the other files, you can use a for loop in this way:
FilesToProcess <- list.files(pattern = "\\.csv$")
result <- vector('list', length(FilesToProcess) - 1)
for(i in 2:length(FilesToProcess)) {
  result[[i - 1]] <- emd(listOfFiles[[1]], listOfFiles[[i]])
}
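The same comparison can also be written without an explicit loop; a minimal lapply sketch, assuming listOfFiles has been built as in the question:
result <- lapply(listOfFiles[-1], function(m) emd(listOfFiles[[1]], m))
names(result) <- FilesToProcess[-1]  # label each result by the file it was compared against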

Related

Reading several large files in a loop

I am trying to read several large files in a loop. So instead of doing:
library(fst)
df1 <-read_fst("C:/data1.fst", c(1:2), from = 1, to = 1000)
df2 <-read_fst("C:/data2.fst", c(1:2), from = 1, to = 1000)
df3 <-read_fst("C:/data3.fst", c(1:2), from = 1, to = 1000)
I would like to do something like this:
for(i in 1:3){
df_i <- read_fst("C:/data_i.fst", c(1:2), from = 1, to = 1000)
}
You can use list.files to generate all .fst files in a given dir and then loop through them:
files <- list.files(pattern = ".fst") # .fst files in your current directory
df_list <- rep(list(NA), length(files)) # Init list of DFs
for (i in seq_along(files))
df_list[[i]] <- fst::read_fst(files[i], ...)
You could refine the pattern argument in list.files to match a certain pattern, e.g. pattern = "data_\\d+.fst" to match data_i.fst.
You can also specify the directory to look in via the path argument and return the full file names via full.names = TRUE.
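Putting those arguments together, a minimal sketch might look like this (the directory, file-name pattern and from/to values are assumptions based on the question):
files <- list.files(path = "C:/", pattern = "^data_\\d+\\.fst$", full.names = TRUE)
df_list <- lapply(files, fst::read_fst, from = 1, to = 1000)  # read rows 1-1000 of each file
names(df_list) <- basename(files)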
It is better to collect the loop output in a list, like this. You can create a vector holding the paths where the files are stored (I did this in myvec; you can change 1:3 to 1:n, where n can be a larger number). With that done, all the results from the loop will be in List. Here is the code:
library(fst)
#Create empty list
List <- list()
#Vector
myvec <- paste0("C:/data",1:3,".fst")
#Loop
for(i in 1:length(myvec))
{
List[[i]] <- read_fst(myvec[i], c(1:2), from = 1, to = 1000)
}

Create a list of tibbles with unique names using a for loop

I'm working on a project where I want to create a list of tibbles containing data that I read in from Excel. The idea will be to call on the columns of these different tibbles to perform analyses on them. But I'm stuck on how to name tibbles in a for loop with a name that changes based on the for loop variable. I'm not certain I'm going about this the correct way. Here is the code I've got so far.
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for(i in filenames) {
RawData <- list(i <- tibble(read_xlsx(path = i, col_names = c('time', 'intesity'))))
}
I've also got the issue where, right now, the for loop overwrites RawData with each turn of the loop, but I think that is something I can remedy if I can get the naming convention to work. If there is another method or data structure that would better suit this task, I'm open to suggestions.
Cheers,
Your code overwrites RawData in each iteration. To append the new tibble to the list you should use something like RawData <- c(RawData, list(read_xlsx(...))).
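A minimal sketch of the corrected loop, which also names each list element after its source file so you can later refer to the tibbles by name (assuming readxl is loaded and the files sit in the working directory):
library(readxl)
filenames <- list.files(path = getwd(), pattern = "xlsx")
RawData <- list()
for (i in filenames) {
  # read_xlsx() already returns a tibble; indexing the list by name
  # both stores it and names the element after its source file
  RawData[[i]] <- read_xlsx(path = i, col_names = c('time', 'intesity'))
}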
A simpler way would be to use lapply instead of a for loop :
RawData <-
lapply(
filenames,
read_xlsx,
col_names = c('time', 'intesity')
)
Here is an approach with map from package purrr
library(tidyverse)
filenames <- list.files(path = getwd(), pattern = "xlsx")
mylist <- map(filenames, ~ read_xlsx(.x, col_names = c('time', 'intesity'))) %>%
  set_names(filenames)
Similar to the answer by #py_b, but add a column with the original file name to each element of the list.
filenames <- list.files(path = getwd(), pattern = "xlsx")
Raw_Data <- lapply(filenames, function(x) {
out_tibble <- read_xlsx(path = x, col_names = c('time', 'intesity'))
out_tibble$source_file <- basename(x) # add a column with the excel file name
return(out_tibble)
})
If you want to merge the list of tibbles into one big one you can use do.call('rbind', Raw_Data)
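As an alternative, assuming the tibbles all share the same columns, dplyr's bind_rows() does the same merge and keeps the source_file column:
library(dplyr)
combined <- bind_rows(Raw_Data)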

Using lapply variable in read.csv

I'm just getting used to using lapply and I've been trying to figure out how I can use names from a vector both inside the filenames I am reading in and for naming the new data frames. I understand I can use paste to build the file names of interest, but I'm not sure how to create the new data frames with the _var name appended.
site_list <- c("site_a","site_b", "site_c","site_d")
lapply(site_list,
function(var) {
all_var <- read.csv(paste("I:/Results/",var,"_all.csv"))
tbl_var <- read.csv(paste("I:/Results/",var,"_tbl.csv"))
rsid_var <- read.csv(paste("I:/Results/",var,"_rsid.csv"))
return(var)
})
Generally, with lapply it makes more sense to apply a function to the list elements and return a list, in which your results are stored and can be named. Example (edit: use split to process the files for each site together):
files <- list.files(path= "I:/Results/", pattern = "site_[abcd]_.*csv", full.names = TRUE)
files <- split(files, gsub(".*site_([abcd]).*", "\\1", files))
processFiles <- function(x){
all <- read.csv(x[grep("_all.csv", x)])
rsid <- read.csv(x[grep("_rsid.csv", x)])
tbl <- read.csv(x[grep("_tbl.csv", x)])
# do more stuff, generate df, return(df)
}
res <- lapply(files, processFiles)
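res is then a named list keyed by the site letter captured by the gsub call, so, for example, the result built from the site_a files can be pulled out with:
res[["a"]]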

Rbind multiple Data Frames in a loop

I have a bunch of data frames that are named in the same pattern "dfX.csv", where X represents a number from 1 to 67. I loaded them into separate data frames using the following piece of code:
folder <- mypath
file_list <- list.files(path=folder, pattern="*.csv")
for (i in 1:length(file_list)){
  assign(file_list[i],
         read.csv(file.path(folder, file_list[i]), header = TRUE))
}
What I'm trying to do is merge/rbind them into a single huge dataframe.
for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv)
}
However using that I'm getting an error:
Error: unexpected symbol in:
"for (i in 1:length(file_list)){
df_main <- rbind(df_main, df[[i]].csv"
Any idea what might be causing the issue, and whether there's a simpler way of doing this?
If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of
do.call(rbind.data.frame, mget(ls(pattern = "^df\\d+\\.csv$")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))
The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab them in the correct order, so consider using sort or somehow ordering them yourself.
mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.
do.call(rbind.data.frame, ...) does one call to rbind, which is much much faster than iteratively rbinding.
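A minimal sketch of that ordering step, assuming the objects really are named df1.csv through df67.csv in the environment:
nms <- nms <- ls(pattern = "^df\\d+\\.csv$")
nms <- nms[order(as.integer(gsub("\\D", "", nms)))]  # order by the numeric suffix
df_main <- do.call(rbind.data.frame, mget(nms))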
Here I use map() to iterate over your files, reading each one into a list of data frames, and bind_rows() is used to bind them all together:
library(tidyverse)
df <- map(list.files(), read_csv) %>%
bind_rows()
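As a compact variant, assuming a purrr version that provides map_dfr(), the read-and-bind step can be collapsed into one call:
df <- map_dfr(list.files(), read_csv)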
If you have a lot of data (lot of rows), here's a data.table approach that works great:
library(data.table)
basedir <- choose.dir() # directory with all the csv files
file_names <- list.files(path = basedir, pattern= '*.csv', full.names = F, recursive = F)
big_list <- lapply(file_names, function(file_name){
dat <- fread(file = file.path(basedir, file_name), header = T)
# Add a 'filename' column to each data.table to back-track where it was read from
# this is why we set full.names = F in the list.files line above
dat$filename <- gsub('.csv', '', file_name)
return(dat)
})
big_data <- rbindlist(l = big_list, use.names = T, fill = T)
If you want to read only some columns and not all, you can use the select argument in fread - helps improve speed since empty columns are not read in, similarly skip lets you skip reading in a bunch of rows.
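For example (the column names and row counts below are illustrative assumptions, not taken from the question):
# read only two columns from the first file
dat_cols <- fread(file.path(basedir, file_names[1]), select = c('id', 'value'))
# skip the first 100 lines of the file before reading
dat_skip <- fread(file.path(basedir, file_names[1]), skip = 100)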

Extract data from text files using for loop

I have 40 text files with names :
[1] "2006-03-31.txt" "2006-06-30.txt" "2006-09-30.txt" "2006-12-31.txt" "2007-03-31.txt"
[6] "2007-06-30.txt" "2007-09-30.txt" "2007-12-31.txt" "2008-03-31.txt" etc...
I need to extract one specific value from each file; I know how to do it individually, but this takes a while:
m_value1 <- `2006-03-31.txt`$Marknadsvarde_tot[1]
m_value2 <- `2006-06-30.txt`$Marknadsvarde_tot[1]
m_value3 <- `2006-09-30.txt`$Marknadsvarde_tot[1]
m_value4 <- `2006-12-31.txt`$Marknadsvarde_tot[1]
Can someone help me with a for loop which would extract the data from a specific column and row through all the different text files please?
Assuming your files are all in the same folder, you can use list.files to get the names of all the files, then loop through them and get the value you need. So something like this?
m_value<-character() #or whatever the type of your variable is
filelist<-list.files(path="...", all.files = TRUE)
for (i in 1:length(filelist)){
  df <- read.table(filelist[i], h = T)
  m_value[i] <- df$Marknadsvarde_tot[1]
}
EDIT:
In case you have imported already all the data you can use get:
txt_files <- list.files(pattern = "*.txt")
for(i in txt_files) {
  x <- read.delim(i, header = TRUE)
  assign(i, x)
}
m_value<-character()
for(i in 1:length(txt_files)) {
m_value[i] <- get(txt_files[i])$Marknadsvarde_tot[1]
}
You could utilize the select-parameter from fread of the data.table-package for this:
library(data.table)
file.list <- list.files(pattern = '.txt')
lapply(file.list, fread, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)
This will result in a list of datatables/dataframes. If you just want a vector with all the values:
sapply(file.list, function(x) fread(x, select = 'Marknadsvarde_tot', nrows = 1, header = TRUE)[[1]])
temp = list.files(pattern="*.txt")
library(data.table)
list2env(
  lapply(setNames(temp, make.names(gsub("\\.txt$", "", temp))),
         fread), envir = .GlobalEnv)
Added data.table to an existing answer at Importing multiple .csv files into R
After you read in all your files you can get data from the data.tables using DT[i, j, by], where i is your row condition.
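For example, once the tables are in the global environment (make.names turns a file name like 2006-03-31.txt into the object name X2006.03.31), the first Marknadsvarde_tot value of one table can be taken with:
X2006.03.31[1, Marknadsvarde_tot]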
