Convert R plotting commands into loop for many files - r

Currently I have been using R to read in a table and plot some of the data which I save as a png file. Now I have 100 files and would like this process to be automated rather than manually changing the path 100 times.
Additionally, I would like to join the 100 files into one table in R that I can subsequently analyse. The join would be in the format of dplyr's bind_rows, as all files have the same column headers. I've done this when I have two tables in R, but not when I am using a loop to read files in sequentially. What would be the best way to do this in R? Thanks in advance for any suggestions or help.
my_data <- read.table('/path/to/data/results/sample_1.txt', header = TRUE, sep = "\t")
ggplot(my_data, aes(x=alt_freq)) + geom_histogram(color="black", fill="white", bins = 20) + xlim(c(0,1))
ggsave("/path/to/plots/sample_1.png", plot = last_plot(),width = 16, height = 9)
#append table to one large table in the format of dplyr::bind_rows(y, z)
Input files are all named with the same naming convention:
sample_1.txt
sample_2.txt
sample_3.txt
The files look like:
sample_name position alt_freq ref_freq sample_1_counts
sample 1 10 0.5 0.5 2
sample 1 20 0.25 0.75 4
All txt files are in the same directory and all txt files are of interest.

First collect the complete path of the files of interest
library(ggplot2)
all_files <- list.files("/path/to/data/results", pattern = "sample_\\d+\\.txt$",
full.names = TRUE)
Then create a function to apply to each file
new_fun <- function(path_of_file) {
my_data <- read.table(path_of_file, header = TRUE, sep = "\t")  # tab-separated, as in the question
ggplot(my_data, aes(x=alt_freq)) +
geom_histogram(color="black", fill="white", bins = 20) + xlim(c(0,1))
ggsave(paste0(dirname(path_of_file), "/", sub("txt$", "png",
basename(path_of_file))), plot = last_plot(),width = 16, height = 9)
}
We use paste0 to build the output path dynamically: dirname gives the directory of the input file and sub replaces the trailing txt with png, so each plot is saved next to its input file.
Then use lapply/map/for loop to apply new_fun to each file
lapply(all_files, new_fun)
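The path construction can be checked on its own before running the whole loop. Below is a minimal sketch, assuming a hypothetical out_dir for the plots (the question saves them to a separate folder). Using file.path and the anchored pattern "\\.txt$" is my choice here, not part of the answer above.

```r
# Hypothetical input paths and output folder, for illustration only.
all_files <- c("/path/to/data/results/sample_1.txt",
               "/path/to/data/results/sample_2.txt")
out_dir <- "/path/to/plots"

# Swap the extension on the bare file name, then prepend the output folder.
png_names <- sub("\\.txt$", ".png", basename(all_files))
png_paths <- file.path(out_dir, png_names)
print(png_paths)
```

Each element of png_paths could then be passed to ggsave in place of the paste0 expression.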
To combine all the files into one dataframe we can do
combined_data <- do.call(rbind, lapply(all_files, read.table, header = TRUE, sep = "\t"))
If the header is different for one column we can change the column name for that particular column and then rbind. So for example, if the header information for column 1 is different, we can do
combined_data <- do.call(rbind, lapply(all_files, function(x) {
df <- read.table(x, header = TRUE, sep = "\t")
names(df)[1] <- "new_header"
df$filename <- basename(x)
df
}))
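Since the question asks specifically about combining tables while reading them in a loop, here is a rough sketch of that pattern in base R. It collects each table in a pre-allocated list and binds once at the end, rather than growing a data frame with rbind on every iteration. The toy files, the temporary folder, and the filename column are assumptions made for illustration.

```r
# Write three small tab-separated files to a temporary folder (toy data).
in_dir <- file.path(tempdir(), "results_demo")
dir.create(in_dir, showWarnings = FALSE)
for (k in 1:3) {
  toy <- data.frame(sample_name = paste("sample", k),
                    position = c(10, 20),
                    alt_freq = c(0.5, 0.25))
  write.table(toy, file.path(in_dir, paste0("sample_", k, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

all_files <- list.files(in_dir, pattern = "^sample_\\d+\\.txt$",
                        full.names = TRUE)

# Pre-allocate a list, fill it inside the loop, and bind once at the end.
tables <- vector("list", length(all_files))
for (i in seq_along(all_files)) {
  tables[[i]] <- read.table(all_files[i], header = TRUE, sep = "\t")
  tables[[i]]$filename <- basename(all_files[i])  # remember the source file
}
combined_data <- do.call(rbind, tables)
str(combined_data)
```

If dplyr is loaded, dplyr::bind_rows(tables) should be usable in place of the do.call(rbind, ...) line.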

I would do something like the following.
Change these to their real values.
in_dir <- '/path/to/data/results'
out_dir <- '/path/to/plots'
Now the plots and binding the tables.
library(ggplot2)
old_dir <- getwd()
setwd(in_dir)
flnames <- list.files(pattern = '^sample_[[:digit:]]+\\.txt$')
data_list <- lapply(flnames, read.table, header = TRUE, sep = '\t')
lapply(seq_along(data_list), function(i){
ggplot(data_list[[i]], aes(x = alt_freq)) +
geom_histogram(color = "black", fill = "white", bins = 20) +
xlim(c(0, 1))
f <- sub('txt$', 'png', flnames[i])
outfile <- paste(out_dir, f, sep = '/')
ggsave(outfile, plot = last_plot(),width = 16, height = 9)
})
data_all <- dplyr::bind_rows(data_list)
Final cleanup.
setwd(old_dir)
## NOT RUN
#rm(data_list)

Related

R plot for multiple files in folder

Using R, I want to read the files in a folder one at a time, perform some operations on each, plot the result, and save it to a CSV file (csv1).
For each subsequent file I want to perform the same operations and append the modified data frame to csv1 with rbind. I want a single plot covering all the files read in the for loop, saved as a PDF.
Right now I am using the following code, but my system crashes because it runs out of RAM.
all_paths <-
list.files(path = "/work/newplots",
pattern = "*.*",
full.names = TRUE)
all_filenames <- all_paths %>%
basename() %>%
as.list()
all_content <-
all_paths %>%
lapply(read.table,
header = TRUE,
skip=60,
sep=',',
encoding = "UTF-8")
file <- data.frame()
for (i in 1:length(all_filenames)) {
all_lists <- mapply(c, all_content, i, SIMPLIFY = FALSE)
data <- rbindlist(all_lists, fill = T)
names(data)[1] <- "File.Path"
x1 <- data %>% select(V1) %>% unique()
data <- data %>% data.frame(str_split_fixed(data$File.Path, " ", 23))%>% select(-c(File.Path))%>% filter(X1=='Interactions')
data<- cbind(x1,data)
data <- data %>% select(-c(2)) %>%select(V1,X2)
data$X2 <-as.numeric(data$X2)
file <- write.table(data,"/work/con1_10.csv",row.names = FALSE)
file <- append(file,data)
p<-plot(data$X2, xlab="Cycle number",ylab="Interactions",type = "p")
print(p)
Z<- (2*data$X2)/20006
px<-plot(Z, xlab="Cycle number", ylab="Z")
print(px)
}
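One way to keep memory use down is to hold only one file in memory at a time and append each processed table to the output CSV, instead of reading every file up front. The sketch below uses toy files. The helper process_one, the column names, and the output name combined_demo.csv are hypothetical, and the skip = 60 and string-splitting steps from the code above are not reproduced.

```r
# Toy input: three small CSV files in a temporary folder.
in_dir <- file.path(tempdir(), "newplots_demo")
dir.create(in_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(cycle = 1:4, interactions = k * (1:4)),
            file.path(in_dir, paste0("run_", k, ".csv")), row.names = FALSE)
}

# Hypothetical per-file step; replace with the real processing.
process_one <- function(path) {
  d <- read.csv(path)
  d$Z <- 2 * d$interactions / 20006
  d$source <- basename(path)
  d
}

out_csv <- file.path(tempdir(), "combined_demo.csv")
if (file.exists(out_csv)) file.remove(out_csv)

for (path in list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)) {
  d <- process_one(path)
  first <- !file.exists(out_csv)
  # Append each processed table; write the header only for the first file.
  write.table(d, out_csv, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  rm(d)  # drop this file's data before reading the next one
}

# Read the (much smaller) processed result back once, e.g. for a single plot.
combined <- read.csv(out_csv)
```

The single plot can then be drawn from combined after the loop, between pdf(...) and dev.off(), rather than once per iteration.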

Transposing csv files before saving them in the environment in R

I am working with multiple csv files in long format. Each file has a different number of columns but the same number of rows. I tried to read all the files and merge them into one data frame, but I could not get it to work.
So far I use this code to read each file individually:
try <- read.table('input/SMPS/new_format/COALA_SMPS_20200218.txt', #set the file to read
sep = ',', #separator
header = F, # do not read the header
skip = 17, # skip 17 first lines of information
fill = T) %>% #fill all empty spaces in the df
t()%>% #transpose the data
data.frame()%>% #make it a df
select(1:196) #select the useful data
My plan was to use something similar to this code but I don't know where to include the transpose function to make it work.
smps_files_new <- list.files(pattern = '*.txt',path = 'input/SMPS/new_format/')#Change the path where the files are located
myfiles <-do.call("rbind", ##Apply the bind to the files
lapply(smps_files_new, ##call the list
function(x) ##apply the next function
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',', #separator
header = F, # do not read the header
skip = 17, # skip 17 first lines of information
stringsAsFactors = F,
fill = T))) ##
Inside lapply, use the same code that you used for the individual files:
do.call(rbind, ##Apply the bind to the files
lapply(smps_files_new, ##call the list
function(x) ##apply the next function
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',',
header = F, # do not read the header
skip = 17, # skip 17 first lines of information
stringsAsFactors = FALSE,
fill = TRUE) %>%
t()%>%
data.frame()%>%
select(1:196)))
Another way would be to use purrr::map_df or map_dfr instead of lapply + do.call(rbind, ...):
purrr::map_df(smps_files_new,
function(x)
read.csv(paste("input/SMPS/new_format/", x, sep=''),sep = ',',
header = F,
skip = 17,
stringsAsFactors = FALSE,
fill = TRUE) %>%
t()%>%
data.frame()%>%
select(1:196))
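The order of the steps is what matters: because the files have different numbers of columns but the same number of rows, each table has to be transposed before binding. A small base R sketch with made-up data frames standing in for the files:

```r
# Two toy "files": same number of rows, different numbers of columns.
raw_list <- list(
  a = data.frame(V1 = 1:3, V2 = 4:6),
  b = data.frame(V1 = 7:9, V2 = 10:12, V3 = 13:15)
)

# Transpose each table on its own, so every result has 3 columns,
# and only then bind the pieces together.
transposed <- lapply(raw_list, function(d) data.frame(t(d)))
merged <- do.call(rbind, transposed)
dim(merged)
```

Note that t() goes through a matrix, so a table with mixed column types comes back as character and may need converting afterwards.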

Import multiple csv files into R and plot pairs of data from each csv file on the same graph

I want to import multiple csv files (12) that contain data on time t and pressure P. The headers for each file are the same (t,P) but the time values are not common, so each t,P pair from each csv file is unique.
I can import the files without a problem:
filenames <- list.files(path = ("C:/Users/K125763/Documents/Mars"),
# Follows a regular expression that matches:
pattern = "mars-[0-12]{2}.csv",
full.names = TRUE)
filenames <- filenames[1:12]
But the loop below plots each data pair on a separate graph:
analyze.p <- function(filename) {
dat <- read.csv(file = filename, header = TRUE)
plot(dat$P~dat$t)
}
for (f in filenames) {
print(f)
analyze.p(f)
}
How can I get all the pairs of data on the same graph?
I'll suggest a functional programming approach + ggplot2:
library(purrr) # functional programming tools
library(dplyr) # data wrangling
library(ggplot2) # plotting
filenames <- list.files(path = ("C:/Users/K125763/Documents/Mars"),
# Follows a regular expression that matches:
pattern = "mars-[0-9]{2}\\.csv$",  # [0-12] would only match the digits 0, 1 and 2
full.names = TRUE)
filenames <- filenames[1:12]
## Read and concatenate CSV files
data <- filenames %>%
map_df(~mutate(read.csv(file = .x, header = TRUE), source_file = .x))
## Plot in a single chart
ggplot(data, aes(t, P)) +
geom_point() +
facet_wrap('source_file')
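facet_wrap draws one panel per file. If all the t,P pairs should share a single set of axes, one option is base graphics (swapped in here for ggplot2): open one empty plot whose limits cover every file, then add each file's points to it. This is a sketch with toy data frames in place of the CSV contents.

```r
# Toy stand-ins for the data frames read from each CSV (columns t and P).
set.seed(1)
runs <- lapply(1:3, function(k) {
  t <- sort(runif(10, 0, 100))
  data.frame(t = t, P = k + 0.01 * t)
})

# Axis limits must cover every file, not just the first one plotted.
t_range <- range(unlist(lapply(runs, `[[`, "t")))
p_range <- range(unlist(lapply(runs, `[[`, "P")))

# Open one empty plot, then add each file's points to it.
plot(t_range, p_range, type = "n", xlab = "t", ylab = "P")
for (k in seq_along(runs)) {
  points(runs[[k]]$t, runs[[k]]$P, col = k, pch = 16)
}
legend("topleft", legend = paste("file", seq_along(runs)),
       col = seq_along(runs), pch = 16)
```

With ggplot2, mapping the source_file column to colour and dropping facet_wrap should have a similar effect.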

R: Split and write very large data frame into slices

I have a large dataframe my_df in R containing 1983000 records. The following lines of sample code take the chunk of 1000 rows starting from 25001, do some processing, and write the processed data into a file to the local disk.
my_df1 <- my_df[25001:26000,]
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
str_data <- my_df1
setwd("path_to_local_dir/data25001_26000")
write.table(str_data, file = "data25001-26000.csv",row.names = F,col.names = F,quote = F)
and so on like this:
my_df2 <- my_df[26001:27000,]
...
I would like to automate this task so that each chunk of 1000 records is processed and written to a new directory. Any advice on how this could be done?
Consider generalizing your process in a function, data_to_disk, and call function with an iterator method like lapply passing a sequence of integers with seq() for each subsequent thousand. Also, incorporate a dynamic directory creation (but maybe dump all 1,000+ files in one directory instead of 1,000+ dirs?).
data_to_disk <- function(num) {
  # Take rows num..num+999 (note the trailing comma: rows, not columns)
  str_data <- within(my_df[num:(num + 999), ], {
    end <- as.POSIXct(paste0(end, "23:59"))
    year <- lubridate::year(start)
  })
  my_dir <- paste0("path_to_local_dir/data", num, "_", num + 999)
  if(!dir.exists(my_dir)) dir.create(my_dir)
  write.table(str_data, file = paste0(my_dir, "/", "data", num, "-", num + 999, ".csv"),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
  # Return the processed chunk, not the whole input data frame
  return(str_data)
}
seqs <- seq(25001, nrow(my_df), by=1000)
head(seqs)
# [1] 25001 26001 27001 28001 29001 30001
tail(seqs)
# [1] 1977001 1978001 1979001 1980001 1981001 1982001
# LIST OF 1,958 DATA FRAMES
df_list <- lapply(seqs, data_to_disk)
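An alternative sketch of the same chunking idea uses split() with a computed chunk label, which also handles a final chunk shorter than 1000 rows. The toy my_df, the id column, and the temporary output folder are assumptions for illustration.

```r
# Toy data frame standing in for my_df (2,500 rows).
my_df <- data.frame(id = seq_len(2500), value = seq_len(2500) / 10)
chunk_size <- 1000

# Label rows 1..1000 as chunk 1, 1001..2000 as chunk 2, and so on.
chunk_id <- ceiling(seq_len(nrow(my_df)) / chunk_size)
chunks <- split(my_df, chunk_id)

out_dir <- file.path(tempdir(), "chunks_demo")
dir.create(out_dir, showWarnings = FALSE)

# Write each chunk to a file named after its first and last row.
for (ch in chunks) {
  fname <- sprintf("data%d-%d.csv", min(ch$id), max(ch$id))
  write.table(ch, file.path(out_dir, fname),
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}
list.files(out_dir)
```

Any per-chunk processing (such as the end and year columns) would go inside the loop before write.table.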
Here is my code doing the sliced loop:
step1 = 1000
runto = nrow(my_df)
nsteps = ceiling(runto/step1)
for( part in seq_len(nsteps) ) { # part = 1
cat( part, 'of', nsteps, '\n')
fr = (part-1)*step1 + 1
to = min(part*step1, runto)
my_df1 = my_df[fr:to,]
# ...
write.table(str_data, file = paste0("data",fr,"-",to,".csv"))
}
rm(part, step1, runto, nsteps, fr, to)
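You can add a grouping variable to your data first (e.g., to identify every 1000 rows), then use d_ply() from plyr to split the data and write to file. In the code below, filter() and distinct() come from dplyr and write_csv() from readr, so those packages need to be loaded as well.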
df <- data.frame(var=runif(1000000))
df$fold <- cut(seq(1,nrow(df)),breaks=100,labels=FALSE)
df %>% filter(fold<=2) %>% # only writes first two files
d_ply(.,.(fold), function(i){
# make filenames 'data1.csv', 'data2.csv'
write_csv(i,paste0('data',distinct(i,fold),'.csv'))
})
This is similar to @Parfait's answer but takes most of the work out of the function. Specifically, it makes a copy of the entire dataset and applies the time manipulations once, before any slicing.
my_df1 <- my_df
my_df1$end <- as.POSIXct(paste(my_df1$end,"23:59",sep = ""))
my_df1$year <- lubridate::year(my_df1$start)
lapply(seq(25001, nrow(my_df1), by = 1000),
       # parentheses are needed: i:i+1000-1 would be parsed as (i:i) + 999
       function(i) write.table(my_df1[i:(i + 1000 - 1), ],
                               file = paste0('path_to_local_dir/data',
                                             i, '-', i + 1000 - 1, '.csv'),
                               row.names = F, col.names = F, quote = F)
)
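When writing index ranges like the one above, keep in mind that `:` binds more tightly than `+` and `-` in R, so the upper bound needs its own parentheses. A quick check:

```r
i <- 25001
length(i:i + 1000 - 1)    # 1: parsed as (i:i) + 999, a single value
length(i:(i + 1000 - 1))  # 1000: the intended range of rows
```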
For me, I'd probably just do:
write.table(my_df1, file = ...)
and be done with it. I don't see the advantages of splitting it up - 1 million rows really isn't that many.

For-loop in R to create a new file (but gives incorrect/unexpected output)

I'm currently busy with some data and I need to check their validity.
Therefore, I would like to use a for-loop to go through all my data files.
In this for-loop, I would like to calculate some things (like mean, min,max...).
My code below works but produced an incorrectly written csv file. The problem occurs after the calculations (and their values) are done during csv file creation. CSV:
"c.1..1..1004.89081855716..630.174466667434..461.738905906677.." "c.1..1..950.990843858612..479.98560814955..517.955102920532.."
1 1
1 1
1004.89081855716 950.990843858612
630.174466667434 479.98560814955
461.738905906677 517.955102920532
1535.86795806885 1452.30199813843
-13.3948961645365 3.72026950120926
1259.26423788071 1159.17089223862
Approach/What I'm expecting:
So I start from some data files with eye tracking data in it.
As you can see at the beginning of the code, I try to get some values out of this eye tracking data (validity, new file with only validity == 1 data...). Once I created the filtered_data dataframe, I want to calculate some extra values out of it (mean, sd, min/max).
My plan is to create a new csv file (validity_loop.csv) in which I can find all my calculations (validity_left, validity_right,mean_eye_x, mean_eye_y, min_eye_x,max_eye_x,min_eye_y,max_eye_y). All in a row. One row for each data set (file_list[i]).
Can someone help me in how to tackle and solve this issue?
Here is my code:
set <- setwd("/Users/Sarah/Documents")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- data_list <- vector("list", "length" = length(file_list))
for(i in seq_along(file_list)){
filename = file_list[i]
#read files
data_frame = read.csv(filename, sep = ",", dec = ".",
header = TRUE,
stringsAsFactors = FALSE)
#what has to be done
#validity
validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
validity_right <-mean(is.numeric(data_frame$right_gaze_point_validity))
#Zuiver dataframe (validity ==1)
to_keep = which(data_frame$left_gaze_point_validity == 1 &
data_frame$right_gaze_point_validity==1)
filtered_data = data_frame[to_keep,]
filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
#1 eye-data
filtered_data$eye_x <- (filtered_data$left_eye_x+filtered_data$right_eye_x)/2
filtered_data$eye_y <- (filtered_data$left_eye_y+filtered_data$right_eye_y)/2
#Pixels
filtered_data$eye_x <- (filtered_data$eye_x)*1920
filtered_data$eye_y <- (filtered_data$eye_y)*1080
#SD and Mean + min-max
mean_eye_x<- mean(filtered_data$eye_x)
mean_eye_y <- mean(filtered_data$eye_y)
sd_eye_x <- sd(filtered_data$eye_x)
sd_eye_y <- sd(filtered_data$eye_y)
min_eye_x <- min(filtered_data$eye_x)
min_eye_y <- min(filtered_data$eye_y)
max_eye_x <- max(filtered_data$eye_x)
max_eye_y <- max(filtered_data$eye_y)
#add everything to new file
validity_list[[i]] <- c(validity_left, validity_right,
mean_eye_x, mean_eye_y,
min_eye_x, min_eye_y,
max_eye_x, max_eye_y)
}
#new document
write.table(validity_list,
file = "Master T&O/Thesis /Loop/Validity/validity_loop.csv",
col.names = TRUE, row.names = FALSE)
I managed to get a new data frame in R, which contains the values of my validity_list in matrix form.
#FOR LOOP poging 2
set <- setwd("/Users/Sarah/Documents/Master T&O/Thesis /Loop")
file_list <- list.files(set, pattern = ".csv", all.files = TRUE)
validity_list <- vector("list", "length" = length(file_list))
for(i in seq_along(file_list)){
filename = file_list[i]
#read files
data_frame = read.csv(filename, sep = ",", dec = ".", header = TRUE, stringsAsFactors = FALSE)
#what has to be done
#validity
validity_left <- mean(is.numeric(data_frame$left_gaze_point_validity))
validity_right <-mean(is.numeric(data_frame$right_gaze_point_validity))
#Zuiver dataframe (validity ==1)
to_keep = which(data_frame$left_gaze_point_validity == 1 & data_frame$right_gaze_point_validity==1)
filtered_data = data_frame[to_keep,]
filtered_data$left_eye_x = as.numeric(filtered_data$left_eye_x)
filtered_data$left_eye_y = as.numeric(filtered_data$left_eye_y)
filtered_data$right_eye_x = as.numeric(filtered_data$right_eye_x)
filtered_data$right_eye_y = as.numeric(filtered_data$right_eye_y)
#1 eye-data
filtered_data$eye_x <- (filtered_data$left_eye_x+filtered_data$right_eye_x)/2
filtered_data$eye_y <- (filtered_data$left_eye_y+filtered_data$right_eye_y)/2
#Pixels
filtered_data$eye_x <- (filtered_data$eye_x)*1920
filtered_data$eye_y <- (filtered_data$eye_y)*1080
#SD and Mean + min-max
mean_eye_x<- mean(filtered_data$eye_x)
mean_eye_y <- mean(filtered_data$eye_y)
sd_eye_x <- sd(filtered_data$eye_x)
sd_eye_y <- sd(filtered_data$eye_y)
min_eye_x <- min(filtered_data$eye_x)
min_eye_y <- min(filtered_data$eye_y)
max_eye_x <- max(filtered_data$eye_x)
max_eye_y <- max(filtered_data$eye_y)
#add everything to new file
validity_list[[i]] <- c(validity_left, validity_right,mean_eye_x, mean_eye_y, min_eye_x,max_eye_x,min_eye_y,max_eye_y)
validity_matrix <- matrix(unlist(validity_list), ncol = 8, byrow = TRUE)
}
#new document
write.table(validity_matrix, file = "/Users/Sarah/Documents/Master T&O/Thesis /Loop/Validity/validity_loop.csv", dec = ".")
The only problem I have now, is the fact that my values for the validity_list items are wrong, but that's another problem and I'm trying to fix it!
If I understand correctly, the following line gathers all your values together:
validity_list[[i]] <- c(validity_left, validity_right, mean_eye_x,
mean_eye_y, min_eye_x, max_eye_x, min_eye_y, max_eye_y)
If it were Python, I would have:
validity_list = (validity_left, validity_right,mean_eye_x,
mean_eye_y, min_eye_x,max_eye_x,min_eye_y,max_eye_y)
where everything after the '=' is a tuple, i.e. one single dataset, and if I then wrote it out it would end up in one column. If you pick the items out with a for loop instead, "validity_left" would be written to a separate column. Would adding something like this to your code be an option?
for item in validity_list:
function to process item..etc.
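In R, one way to get one row per file with named columns is to have each iteration return a one-row data frame and bind the rows at the end. The sketch below uses toy data; the helper summarise_file is hypothetical and only a subset of the statistics is shown. It also computes validity as the fraction of samples equal to 1, which is an assumption about what was intended: mean(is.numeric(x)) can only ever be 0 or 1, because is.numeric returns a single TRUE or FALSE for the whole column.

```r
# Toy stand-ins for two eye-tracking files (column names follow the question).
files <- list(
  rec_a = data.frame(left_gaze_point_validity  = c(1, 1, 0, 1),
                     right_gaze_point_validity = c(1, 0, 0, 1),
                     left_eye_x  = c(0.50, 0.52, NA, 0.48),
                     right_eye_x = c(0.54, 0.50, NA, 0.52)),
  rec_b = data.frame(left_gaze_point_validity  = c(1, 1),
                     right_gaze_point_validity = c(1, 1),
                     left_eye_x  = c(0.40, 0.60),
                     right_eye_x = c(0.40, 0.60))
)

# Hypothetical helper: summarise one recording as a one-row data frame.
summarise_file <- function(d) {
  ok <- d$left_gaze_point_validity == 1 & d$right_gaze_point_validity == 1
  eye_x <- (d$left_eye_x[ok] + d$right_eye_x[ok]) / 2 * 1920  # pixels
  data.frame(validity_left  = mean(d$left_gaze_point_validity == 1),
             validity_right = mean(d$right_gaze_point_validity == 1),
             mean_eye_x = mean(eye_x),
             min_eye_x  = min(eye_x),
             max_eye_x  = max(eye_x))
}

# One row per file; the column names survive the bind and the write.
validity_table <- do.call(rbind, lapply(files, summarise_file))
validity_table$file <- names(files)

out_csv <- file.path(tempdir(), "validity_loop.csv")
write.csv(validity_table, out_csv, row.names = FALSE)
```

With real data, files would be replaced by lapply(file_list, read.csv) or the body of the existing for loop.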
