Avoid repeating statements when importing data - r

I've written the following code to import data into R:
## specify where all the data files are stored
DataFolder <- "DataFolder"
## obtain the name of each file in DataFolder
files <- list.files(DataFolder)
## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names
for (i in 1:length(LocNames)){
#
car <- read.table(paste(DataFolder, paste(LocNames[i], ".car", sep=""), sep="/"),
header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
car <- aggregate(car[colnames(car)[2:length(colnames(car))]],list(dateTime = cut(car$dateTime,breaks = "hour")),mean, na.rm = TRUE)
#
light <- read.table(paste(DataFolder, paste(LocNames[i], ".light", sep=""), sep="/"),
header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
light <- aggregate(light[colnames(light)[2]],list(dateTime = cut(light$dateTime, breaks = "hour")),mean, na.rm = TRUE)
}
So, here I have a DataFolder where all of my files are stored. The files are named after the location where the data was recorded, and the file extension gives the name of the variable measured. Here we have car sales and light as examples.
From here I would like to reduce the amount of code inside the loop. Instead of repeating the same steps for one variable after the other, I want to write only the variable names (e.g. car, light) and get the same result as the script shown.
Please let me know if my intentions have not been clear.

Just use a function. Something to the effect of
## specify where all the data files are stored
DataFolder <- "DataFolder"
## obtain the name of each file in DataFolder
files <- list.files(DataFolder)
## read one file (location + extension) and aggregate it to hourly means
readMyFiles <- function(DataFolder, LocName, extension){
data <- read.table(paste(DataFolder, paste(LocName, ".", extension, sep=""), sep="/"),
header = TRUE, sep = "\t", colClasses=c(dateTime="POSIXct"))
data <- aggregate(data[colnames(data)[2:length(colnames(data))]],list(dateTime = cut(data$dateTime,breaks = "hour")),mean, na.rm = TRUE)
data
}
## obtain name of each file
LocNames <- unique(sub("^([^.]*).*", "\\1", files)) # this removes the extension and keeps the unique names
for (i in seq_along(LocNames)){
## pass the extension without the leading dot; the function adds it
car <- readMyFiles(DataFolder, LocNames[i], "car")
light <- readMyFiles(DataFolder, LocNames[i], "light")
}
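To avoid naming each variable by hand at all, the same idea can be applied over a vector of extensions. This is only a minimal sketch, assuming the layout from the question (tab-separated files whose first column is dateTime); read_hourly, read_all and the demo files are hypothetical names made up for illustration.

```r
# Hypothetical helper: read one location/variable file and average it by hour.
read_hourly <- function(folder, loc, extension) {
  path <- file.path(folder, paste0(loc, ".", extension))
  dat <- read.table(path, header = TRUE, sep = "\t",
                    colClasses = c(dateTime = "POSIXct"))
  # Assumes dateTime is the first column; every other column is averaged.
  aggregate(dat[-1],
            list(dateTime = cut(dat$dateTime, breaks = "hour")),
            mean, na.rm = TRUE)
}

# Hypothetical wrapper: one list entry per location, one sub-entry per variable.
read_all <- function(folder, extensions) {
  locs <- unique(sub("^([^.]*).*", "\\1", list.files(folder)))
  out <- lapply(locs, function(loc) {
    res <- lapply(extensions, function(ext) read_hourly(folder, loc, ext))
    names(res) <- extensions
    res
  })
  names(out) <- locs
  out
}

# Small demo with made-up data written to a temporary folder.
demo_dir <- file.path(tempdir(), "DataFolder_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo <- data.frame(dateTime = c("2020-01-01 00:10:00", "2020-01-01 00:40:00",
                                "2020-01-01 01:10:00"),
                   value = c(1, 3, 5))
write.table(demo, file.path(demo_dir, "siteA.car"), sep = "\t",
            row.names = FALSE, quote = FALSE)
write.table(demo, file.path(demo_dir, "siteA.light"), sep = "\t",
            row.names = FALSE, quote = FALSE)

all_data <- read_all(demo_dir, c("car", "light"))
all_data$siteA$car   # hourly means for one location and variable
```

Keeping the results in a nested list also means that data from one location is not overwritten by the next iteration, which happens when car and light are reassigned inside a for loop.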

Related

How to loop over on different files and save the output with filename in R?

I have several files with the names RTDFE, TRYFG, FTYGS, WERTS...like 100 files in txt format. For each file, I'm using the following code and writing the output in a file.
name = c("RTDFE")
file1 <- paste0(name, "_filter",".txt")
file2 <- paste0(name, "_data",".txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
nrow(C)
145
Output:
Samples Common
RTDFE 145
Each time, I assign a file name to the variable name, run my code, and write the output to a file. Instead, I want the code to run on all the files in one go and produce the following output. Common is the number of rows of the merged data frame C.
The output I need:
Samples Common
RTDFE 145
TRYFG ...
FTYGS ...
WERTS ...
How can I do this? Any help is appreciated.
How about putting all your names in a single vector, called names, like this:
names<-c("TRYFG","RTDFE",...)
and then feeding each one to a function that reads the files, merges them, and returns the rows
f<-function(n) {
fs = paste0(n,c("_filter", "_data"),".txt")
C = merge(
read.delim(fs[1],sep="\t", header=F),
read.delim(fs[2],sep="\t", header=F), by="XYZ")
data.frame(Samples=n,Common=nrow(C))
}
Then just call this function f on each of the values in names, row-binding the results together
do.call(rbind, lapply(names, f))
An easy way to create the vector names is like this:
p = "_(filter|data).txt"
names = unique(gsub(p,"",list.files(pattern = p)))
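The approach can be tried end to end on throwaway files. The sketch below is only an illustration under stated assumptions: the demo files are made up, count_common is a hypothetical name, and the files are assumed to have a header row containing a column called XYZ. With header = FALSE, as in the question, read.delim names the columns V1, V2, ..., so by = "XYZ" would find no matching column.

```r
# Made-up demo files in a temporary folder; each has a header with column XYZ.
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(XYZ = c("a", "b", "c")),
            file.path(demo_dir, "RTDFE_filter.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(XYZ = c("b", "c", "d"), value = 1:3),
            file.path(demo_dir, "RTDFE_data.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Sample names: strip the suffix, with the dot escaped and the pattern anchored.
p <- "_(filter|data)\\.txt$"
samples <- unique(sub(p, "", list.files(demo_dir, pattern = p)))

# Hypothetical helper: count the rows shared by the two files of one sample.
count_common <- function(n, dir) {
  fs <- file.path(dir, paste0(n, c("_filter", "_data"), ".txt"))
  merged <- merge(read.delim(fs[1]), read.delim(fs[2]), by = "XYZ")
  data.frame(Samples = n, Common = nrow(merged))
}

result <- do.call(rbind, lapply(samples, count_common, dir = demo_dir))
result
```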
I am making some assumptions here.
The first assumption is that you have all these files in a folder with no other text files (.txt) in this folder.
If so you can get the list of files with the command list.files.
But when doing so you will get both the "_data.txt" and the "_filter.txt" files.
We need a way to extract the base part of the name.
I use "str_replace" to remove the "_data.txt" and the "_filter.txt" from each file name.
But when doing so you will get every name twice. Therefore I use the "unique" command.
I store this in "lfiles", which will now contain "RTDFE, TRYFG, FTYGS, WERTS..." and any other file that satisfies the conditions.
After this I run a for loop on this list.
I reopen the files similarly as you do.
I merge by XYZ and I immediately put the results in a data frame.
By using rbind I keep adding results to the data frame "res".
library(stringr)
lfiles=list.files(path = ".", pattern = ".txt")
## we strip, from the files, the "_filter and the data
lfiles=unique( sapply(lfiles, function(x){
x=str_replace(x, "_data.txt", "")
x=str_replace(x, "_filter.txt", "")
return(x)
} ))
res=NULL
for(i in lfiles){
file1 <- paste0(i, "_filter.txt")
file2 <- paste0(i, "_data.txt")
### One
A <- read.delim(file1, sep = "\t", header = FALSE)
#### two
B <- read.delim(file2, sep = "\t", header = FALSE)
res=rbind(res, data.frame(Samples=i, Common=nrow(merge(A, B, by="XYZ")))) # append to the rows collected so far
}
OK, I will assume you have a folder called "data" with files named "RTDFE_filter.txt", "RTDFE_data.txt", "TRYFG_filter.txt", "TRYFG_data.txt", etc. (only and exactly these files).
This code shows one possible way:
# save the file names (with their path, so they can be read from any working directory)
files = list.files("data", full.names = TRUE)
# get indexes for "data" (for "filter" indexes, add 1)
files_data_index = seq(1, length(files), 2) # 1, 3, 5, ...
# loop on indexes
results = lapply(files_data_index, function(i) {
A <- read.delim(files[i+1], sep = "\t", header = FALSE)
B <- read.delim(files[i], sep = "\t", header = FALSE)
C <- merge(A, B, by="XYZ")
samp = strsplit(basename(files[i]), "_")[[1]][1]
com = nrow(C)
return(c(Samples = samp, Common = com))
})
# combine results
do.call(rbind, results)

For-loop in R to create a new file (but gives incorrect/unexpected output)

I'm currently busy with some data and I need to check their validity.
Therefore, I would like to use a for-loop to go through all my data files.
In this for-loop, I would like to calculate some things (like mean, min,max...).
My code below runs, but it produces an incorrectly written CSV file. The calculations themselves finish; the problem occurs when the CSV file is created. The CSV looks like this:
"c.1..1..1004.89081855716..630.174466667434..461.738905906677.." "c.1..1..950.990843858612..479.98560814955..517.955102920532.."
1 1
1 1
1004.89081855716 950.990843858612
630.174466667434 479.98560814955
461.738905906677 517.955102920532
1535.86795806885 1452.30199813843
-13.3948961645365 3.72026950120926
1259.26423788071 1159.17089223862
Approach/What I'm expecting:
So I start from some data files with eye tracking data in it.
As you can see at the beginning of the code, I try to get some values out of this eye tracking data (validity, new file with only validity == 1 data...). Once I created the filtered_data dataframe, I want to calculate some extra values out of it (mean, sd, min/max).
My plan is to create a new csv file (validity_loop.csv) in which I can find all my calculations (validity_left, validity_right,mean_eye_x, mean_eye_y, min_eye_x,max_eye_x,min_eye_y,max_eye_y). All in a row. One row for each data set (file_list[i]).
Can someone help me in how to tackle and solve this issue?
Here is my code:
set <- setwd("/Users/Sarah/Documents")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- data_list <- vector("list", "length" = length(file_list))
for(i in seq_along(file_list)){
filename = file_list[i]
#read files
data_frame = read.csv(filename, sep = ",", dec = ".",
header = TRUE,
stringsAsFactors = FALSE)
#what has to be done
#validity
validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
validity_right <-mean(is.numeric(data_frame$right_gaze_point_validity))
#Zuiver dataframe (validity ==1)
to_keep = which(data_frame$left_gaze_point_validity == 1 &
data_frame$right_gaze_point_validity==1)
filtered_data = data_frame[to_keep,]
filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
#1 eye-data
filtered_data$eye_x <- (filtered_data$left_eye_x+filtered_data$right_eye_x)/2
filtered_data$eye_y <- (filtered_data$left_eye_y+filtered_data$right_eye_y)/2
#Pixels
filtered_data$eye_x <- (filtered_data$eye_x)*1920
filtered_data$eye_y <- (filtered_data$eye_y)*1080
#SD and Mean + min-max
mean_eye_x<- mean(filtered_data$eye_x)
mean_eye_y <- mean(filtered_data$eye_y)
sd_eye_x <- sd(filtered_data$eye_x)
sd_eye_y <- sd(filtered_data$eye_y)
min_eye_x <- min(filtered_data$eye_x)
min_eye_y <- min(filtered_data$eye_y)
max_eye_x <- max(filtered_data$eye_x)
max_eye_y <- max(filtered_data$eye_y)
#add everything to new file
validity_list[[i]] <- c(validity_left, validity_right,
mean_eye_x, mean_eye_y,
min_eye_x, min_eye_y,
max_eye_x, max_eye_y)
}
#new document
write.table(validity_list,
file = "Master T&O/Thesis /Loop/Validity/validity_loop.csv",
col.names = TRUE, row.names = FALSE)
I managed to get a new object in R that contains the values of my validity_list in matrix form.
#FOR LOOP poging 2
set <- setwd("/Users/Sarah/Documents/Master T&O/Thesis /Loop")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- vector("list", "length" = length(file_list))
for(i in seq_along(file_list)){
filename = file_list[i]
#read files
data_frame = read.csv(filename, sep = ",", dec = ".", header = TRUE, stringsAsFactors = FALSE)
#what has to be done
#validity
validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
validity_right <-mean(is.numeric(data_frame$right_gaze_point_validity))
#Zuiver dataframe (validity ==1)
to_keep = which(data_frame$left_gaze_point_validity == 1 & data_frame$right_gaze_point_validity==1)
filtered_data = data_frame[to_keep,]
filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
#1 eye-data
filtered_data$eye_x <- (filtered_data$left_eye_x+filtered_data$right_eye_x)/2
filtered_data$eye_y <- (filtered_data$left_eye_y+filtered_data$right_eye_y)/2
#Pixels
filtered_data$eye_x <- (filtered_data$eye_x)*1920
filtered_data$eye_y <- (filtered_data$eye_y)*1080
#SD and Mean + min-max
mean_eye_x<- mean(filtered_data$eye_x)
mean_eye_y <- mean(filtered_data$eye_y)
sd_eye_x <- sd(filtered_data$eye_x)
sd_eye_y <- sd(filtered_data$eye_y)
min_eye_x <- min(filtered_data$eye_x)
min_eye_y <- min(filtered_data$eye_y)
max_eye_x <- max(filtered_data$eye_x)
max_eye_y <- max(filtered_data$eye_y)
#add everything to new file
validity_list[[i]] <- c(validity_left, validity_right,mean_eye_x, mean_eye_y, min_eye_x,max_eye_x,min_eye_y,max_eye_y)
validity_matrix <- matrix(unlist(validity_list), ncol = 8, byrow = TRUE)
}
#new document
write.table(validity_matrix, file = "/Users/Sarah/Documents/Master T&O/Thesis /Loop/Validity/validity_loop.csv", dec = ".")
The only problem I have now, is the fact that my values for the validity_list items are wrong, but that's another problem and I'm trying to fix it!
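On the wrong validity values: is.numeric() returns a single TRUE or FALSE for the whole column, so mean(is.numeric(x)) can only ever be 1 or 0. The sketch below shows one way to get the share of valid samples instead, assuming the validity columns are coded 0/1 (an assumption, since the question does not show the data). It also shows why the first CSV came out sideways: write.table turns each element of a list into a column, whereas binding named vectors row by row gives one row per file. All values here are made up.

```r
validity <- c(1, 0, 1, 1)                          # made-up example values
whole_column <- mean(is.numeric(validity))         # 1: describes the vector as a whole
share_valid <- mean(validity == 1, na.rm = TRUE)   # 0.75: share of samples flagged valid

# One named vector per file (made-up numbers), bound together row by row.
rows <- list(c(validity_left = 0.9, mean_eye_x = 1004.9),
             c(validity_left = 0.8, mean_eye_x = 951.0))
summary_df <- as.data.frame(do.call(rbind, rows))
summary_df   # one row per file, one named column per statistic
```

A data frame built this way can be passed to write.csv with row.names = FALSE to get a header line followed by one line per file.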
If I understand correctly, the following line gathers all your data together:
validity_list[[i]] <- c(validity_left, validity_right, mean_eye_x,
mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
If it were Python, I would have:
validity_list = (validity_left, validity_right, mean_eye_x,
mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
Here the parentheses after the '=' tell the interpreter that everything inside them is a tuple, which makes it one single dataset; if I then wrote it out, it would end up in one column. If you picked the items out with a for-loop, "validity_left" would be written to a separate column. Would adding something like this to your code be an option?
for item in validity_list:
function to process item..etc.

How can I quickly find all the files in a directory that are missing a first row?

I have a folder of files in .csv format. They contain blank lines that are necessary (a blank line indicates the absence of a measurement from a LiDAR unit, which is good and needs to stay in). But occasionally the first row is empty; this throws off the code and the package, and everything aborts.
Right now I have to open each .csv and see if the first line is empty.
I would like to do one of the following, but am at a loss how to:
1) write code that quickly scans through all of the files in the directory and tells me which ones are missing the first line
2) be able to skip the empty lines that are only at the beginning, which can vary; sometimes more than one line is empty
3) have code that cycles through all of the .csv files and inserts a dummy first line of numbers so the files all import without problems.
Thanks!
Here's a bit of code that does 1 and 2 above. I'm not sure why you'd want to insert dummy line(s) given the ability to do 1 and 2; it's straightforward to do, but usually it's not a good idea to modify raw data files.
# Create some test files
cat("x,y", "1,2", sep="\n", file = "blank0.csv")
cat("", "x,y", "1,2", sep="\n", file = "blank1.csv")
cat("", "", "x,y", "1,2", sep="\n", file = "blank2.csv")
files <- list.files(pattern = "\\.csv$", full.names = TRUE) # pattern is a regular expression, not a glob
for(i in seq_along(files)) {
filedata <- readLines(files[i])
lines_to_skip <- min(which(filedata != "")) - 1
cat(i, files[i], lines_to_skip, "\n")
x <- read.csv(files[i], skip = lines_to_skip)
}
This prints
1 ./blank0.csv 0
2 ./blank1.csv 1
3 ./blank2.csv 2
and reads in each dataset correctly.
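The same idea can be wrapped in a small helper that reports which files start with empty lines before anything is imported. This is just a sketch: leading_blank_lines is a hypothetical name, the demo files are made up, and lines containing only whitespace are treated as empty, which is an assumption about the data.

```r
# Hypothetical helper: number of empty (or whitespace-only) lines at the top of a file.
leading_blank_lines <- function(path) {
  lines <- readLines(path, warn = FALSE)
  first_content <- which(trimws(lines) != "")[1]
  if (is.na(first_content)) length(lines) else first_content - 1L
}

# Made-up demo files in a temporary folder.
demo_dir <- file.path(tempdir(), "blank_demo")
dir.create(demo_dir, showWarnings = FALSE)
cat("x,y", "1,2", sep = "\n", file = file.path(demo_dir, "blank0.csv"))
cat("", "", "x,y", "1,2", sep = "\n", file = file.path(demo_dir, "blank2.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
skips <- vapply(files, leading_blank_lines, integer(1))
names(skips) <- basename(files)
skips[skips > 0]   # the files whose first row is empty

# Read every file, skipping only its leading empty lines.
data_list <- Map(function(f, s) read.csv(f, skip = s), files, skips)
```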
I believe that the two functions that follow can do what you want/need.
First, a function to determine the files with a second line blank.
second_blank <- function(path = ".", pattern = "\\.csv"){
# full.names = TRUE so the files can be read when path is not the working directory
fls <- list.files(path = path, pattern = pattern, full.names = TRUE)
second <- sapply(fls, function(f) readLines(f, n = 2)[2])
which(nchar(gsub(",", "", second)) == 0)
}
Then, a function to read in the files with such lines, one at a time. Note that I assume that the first line is the columns header and that at least the second line is left blank. There is a dots argument, ..., for you to pass other arguments to read.table, such as stringsAsFactors = FALSE.
skip_blank <- function(file, ...){
header <- readLines(file, n = 1)
header <- strsplit(header, ",")[[1]]
count <- 1L
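while(TRUE){
# look at the line right after the `count` lines accepted so far
txt <- readLines(file, n = count + 1L)[count + 1L]
# stop at the first non-blank line (or at the end of the file)
if(is.na(txt) || nchar(gsub(",", "", txt)) > 0) break
count <- count + 1L
}
# the header line is among the skipped lines, so the first line read is data
dat <- read.table(file, skip = count, header = FALSE, sep = ",", dec = ".", fill = TRUE, ...)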
names(dat) <- header
dat
}
Now, an example usage.
second_blank(pattern = "csv") # a first run as an example usage
inx <- second_blank() # this will be needed later
fl_names <- list.files(pattern = "\\.csv") # get all the CSV files
df_list <- lapply(fl_names[inx], skip_blank) # read the problem ones
names(df_list) <- fl_names[inx] # tidy up the result list
df_list

write results sequentially in a loop in r

I have a bunch of individual files to which I need to apply a test. I need a way to automatically write the results for each file into one file. Here is what I do:
library(ape)
stud_files <- list.files("path/dir/data",full.names = T)
for (f in stud_files) {
df <- read.table(f, header=TRUE, sep=";")
df_xts <- as.xts(df$cola, order.by = as.Date(df$colb,"%m/%d/%Y"))
pet <- testa(df_xts)
res <- data.frame(estimate = pet$estimate,
p.value=pet$p.value,
logi = pet$alternative)
write.dna(res,file = "res_testa.xls",format = "sequential")
}
This loop works well, except for the last command, which is meant to write the results of each file consecutively: it saves only the last result. The results are also saved as a string, not as a table as I defined above (data.frame). Any ideas? Thanks in advance.
Check help(write.dna).
write.dna(x, file, format = "interleaved", append = FALSE,
nbcol = 6, colsep = " ", colw = 10, indent = NULL,
blocksep = 1)
append a logical, if TRUE the data are appended to the file without
erasing the data possibly existing in the file, otherwise the file (if
it exists) is overwritten (FALSE the default).
Set append = TRUE and you should be all set.
As some of the comments point out, however, you are probably better off generating your table, and then writing it all at once to a file. Unless you have billions of files, you likely won't run out of memory.
Here is how I would approach this.
library(ape)
library(data.table)
library(xts) # provides as.xts
stud_files <- list.files("path/dir/data",full.names = T)
sumfunc <- function(f) {
df <- read.table(f, header=TRUE, sep=";")
df_xts <- as.xts(df$cola, order.by = as.Date(df$colb,"%m/%d/%Y"))
pet <- testa(df_xts)
res <- data.table(estimate = pet$estimate,
p.value=pet$p.value,
logi = pet$alternative)
return(res)
}
lres <- lapply(stud_files, sumfunc)
dat <- rbindlist(lres)
write.table(dat,
file = "res_testa.csv",
sep = ",",
quote = FALSE,
row.names = FALSE)
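If writing inside the loop is still preferred, base R's write.table also takes an append argument. The sketch below is only an illustration with made-up results and a temporary output file (out_file is a hypothetical name); it writes the header once and appends the later rows without it.

```r
out_file <- file.path(tempdir(), "res_demo.csv")
if (file.exists(out_file)) file.remove(out_file)   # start from a clean file

for (i in 1:3) {
  # Made-up stand-in for the per-file test result.
  res <- data.frame(file = paste0("file", i), estimate = i / 10)
  first <- !file.exists(out_file)
  # Header only on the first write; every later write appends a row.
  write.table(res, out_file, sep = ",", row.names = FALSE, quote = FALSE,
              col.names = first, append = !first)
}

written <- read.csv(out_file)
written
```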

With R, loop over two files at a time

Hello my favourite coding experts,
I am trying to loop through two files at a time in R: i.e. take one 'case' file and another 'control' file, create a graph and dump it into a pdf, then take another set of 2 files and do the same and so on. I have a list indicating which file is a case and which is a control, like this:
case control
A01 G01
A02 G02
A06 G03
and so on… which can be reproduced like this:
mylist<- data.frame(rbind(c("A01","G01"),c("A02","G02"),c("A06","G03")))
colnames(mylist)<- c('case', 'control')
I cannot find a way to specify which 2 files to loop through each time.
The files (each with many variables) are: "/Users/francy/Desktop/cc_files_A01", "/Users/francy/Desktop/cc_files_A02", "/Users/francy/Desktop/cc_files_A06", "/Users/francy/Desktop/cc_files_G01", "/Users/francy/Desktop/cc_files_G02", "/Users/francy/Desktop/cc_files_G03"
For each set of case and control, I would like to do this:
case<- read.table(file="/Users/francy/Desktop/case_files_A01.txt", sep = '\t', header = F)
case <- case[,c(1,2,19,20)]
colnames(case)<- c("ID", "fname", "lname", "Position")
control<- read.table(file="/Users/francy/Desktop/case_files_G01.txt", sep = '\t', header = F)
control <- control[,c(1,2,19,20)]
colnames(control)<- c("ID", "fname", "lname", "Position")
#t-test Position:
test<- t.test(case[20],control[20])
p.value= round(test$p.value, digits=3)
mean_case= round(mean(case[20], na.rm=T), digits=2)
mean_control= round(mean(control[20], na.rm=T), digits=2)
boxplot(c(case[20], control[20]), names=c(paste("case", "mean", mean_case, sep=":"),paste("control", "mean", mean_control, sep=":")))
And I want to create a PDF file with all the boxplots.
This is what I have for now:
myFiles <- list.files(path= "/mypath/", pattern=".txt")
pdf('/home/graph.pdf')
for (x in myFiles) {
control <- read.table(file = myFiles[x], sep = '\t', header = F)
## How do I specify that is the other file here, and which file it is?
case <- read.table(file = myFiles[x], sep = '\t', header = F)
}
Any help is very appreciated. Thank you!
Why not just pass the pairs of files to the loops via a list?
files <- list(
c("fileA","fileB"),
c("fileC","fileD")
)
for( f in files ) {
cat("~~~~~~~~\n")
cat("f[1] is",f[1],"~ f[2] is",f[2],"\n")
}
The first time the loop runs, f contains the 1st element of the list files. Since the first element is a character vector of length two, f[1] contains the first file name of the pair, and f[2] contains the second. See the printed output of the above code, which should hopefully make it clear.
What probably makes more sense in this case, is building up the two filenames from your "list" (a data.frame?) of cases and controls.
If this "list" is present in a data.frame lcc, you could do something like:
for(i in seq(nrow(lcc)))
{
currentcase<-lcc$case[i]
currentcontrol<-lcc$control[i]
currentcasefilename<-paste0("someprefix_", currentcase, "_somepostfix.txt") # paste0: no spaces between the parts
currentcontrolfilename<-paste0("someprefix_", currentcontrol, "_somepostfix.txt")
#now open and process both files...
}
Assuming your list of cases and controls is in an R object (dataframe or matrix) called mylist:
for (x in seq_len(nrow(mylist))) {
case <- read.table(file = paste("/my/path/", mylist[x, "case"], ".txt", sep = ""),
sep = "\t", header = F)
control <- read.table(file = paste("/my/path/", mylist[x, "control"], ".txt", sep = ""),
sep = "\t", header = F)
## your code here ##
}
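Putting the pieces together, the sketch below loops over the case/control pairs and writes one boxplot page per pair to a single PDF. It is only an illustration under stated assumptions: the demo files, their two-column layout, and the names pairs, read_position and pdf_file are all made up, so the paths and the column index need adapting to the real files.

```r
# Made-up demo data: one tab-separated file per sample, no header, Position in column 2.
demo_dir <- file.path(tempdir(), "cc_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
for (id in c("A01", "A02", "G01", "G02")) {
  write.table(data.frame(ID = 1:10, Position = rnorm(10)),
              file.path(demo_dir, paste0(id, ".txt")),
              sep = "\t", row.names = FALSE, col.names = FALSE)
}

# Which case goes with which control.
pairs <- data.frame(case = c("A01", "A02"), control = c("G01", "G02"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: read one file and return its Position values.
read_position <- function(id) {
  dat <- read.table(file.path(demo_dir, paste0(id, ".txt")),
                    sep = "\t", header = FALSE)
  dat[[2]]   # assumed location of the Position column in the demo files
}

pdf_file <- file.path(tempdir(), "boxplots_demo.pdf")
pdf(pdf_file)   # each boxplot() call below starts a new page
p_values <- numeric(nrow(pairs))
for (i in seq_len(nrow(pairs))) {
  case <- read_position(pairs$case[i])
  control <- read_position(pairs$control[i])
  p_values[i] <- round(t.test(case, control)$p.value, digits = 3)
  boxplot(list(case = case, control = control),
          main = paste(pairs$case[i], "vs", pairs$control[i], "p =", p_values[i]))
}
dev.off()
```

Opening the PDF device once before the loop and closing it once afterwards is what collects all the plots in one file.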
