Print each line of a merged data frame in R

I am trying to merge the content of 2 text files and print the merged output using R. My code is given below:
setwd("C:\\Documents and Settings\\Administrator\\Desktop\\Test")
file_list <- list.files()
for (file in file_list){
  print(file)
  # create merged dataset if it does not exist
  if(!exists("dataset")){
    dataset <- read.table(file, header=TRUE, sep="\t")
  }
  # else, append to it
  if(exists("dataset")){
    temp_dataset <- read.table(file, header=TRUE, sep="\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
dataset
There are 2 text files in the "Test" folder - file1.txt, and file2.txt. file1.txt has the line ABC, and file2.txt DEF. However, when running the script, only ABC is being printed, and not DEF. I cannot figure out why. I am new to R scripting, and hence might be making basic errors. Please help.

Or use rbindlist from data.table
library(data.table)
file_list <- list.files() # 2 rows per file
rbindlist(lapply(file_list, fread))
# A B C
# 1: 4 12 18
# 2: 3 5 6
# 3: 4 14 25
# 4: 3 13 28

You don't need that loop; you can just lapply across your files, then rbind the resulting list of data.frames into a single one:
file_list <- list.files()
table_list <- lapply(file_list, read.delim)
Single_table <- do.call(rbind, table_list)

Use of ldply (package "plyr") to import multiple csv files from a folder: header faith, and how to do it for multiple folders?
Set-up:
- Desktop: MacBook Pro (Early 2011) with macOS 10.13.6
- Software version: R version 3.5.1 (2018-07-02) -- "Feather Spray"
- RStudio: Version 1.1.456
I would like to import multiple csv files from specific folders and merge them into one file with 5 columns: Variable1/Variable2/file_name/experiment_nb/pulse_nb
Based on previous similar questions on StackOverflow, I have managed to import all files from the same folder into one data.frame, but I am not sure how to do it for several folders, or what happens to the header of each file after the merge. As the files are too big to check manually (200 000 lines per file), I want to make sure there is no mistake that would cause all subsequent analysis to fail, such as a file's header line ending up in the middle of the imported data.
The csv file names look like this: "20190409-0001_002.csv", with the date, followed by the name of the experiment (0001 in the example) and the number of the pulse (002).
#setting package and directory
library(plyr)
library(stringr)
setwd("/Users/macbook/Desktop/Project_Folder/File_folder1")
#Creating a list of all the filenames:
filenames <- list.files(path = "/Users/macbook/Desktop/Project_Folder/File_folder1")
#creating a function to read csv and in the same time adding an additional column with the name of the file
read_csv_filename <- function(filename) {
  ret <- read.csv(filename, header=TRUE, sep=",")
  ret$Source <- filename #EDIT
  ret
}
#importing
import <- ldply(filenames, read_csv_filename)
#making a copy of import
data<-import
#modifying the file name so it removes ".csv" and change the header
data$Source<-str_sub(data$Source, end=-5)
data[1,3]<-"date_expnb_pulsenb"
t<-substr(data[1,3],1,3)
head(data, n=10)
#create a column with the experiment number, extracted from the file name
data$expnb<-substr(data$Source, 10, 13)
data$expnb<-as.numeric(data$expnb)
head(data, n=10)
tail(data, n=10)
1° Now I need to import all the other folders into the same data.frame. I could eventually do this manually, because the number of folders is manageable (9-10), but I am considering writing code for this as well, for future experiments with a large number of folders. How do I do that: first list all folders, then list all files from those folders, and then regroup them into one file list? Is this doable with list.files?
The folder names will look like this: "20190409-0001".
2° The result from the code above (head(data, n=10)) looks like this:
> head(data, n=10)
Time Channel.A Source pulsenb expnb
1 (us) (A) expnb_pulsenb NA NA
2 -20.00200030 -0.29219970 20190409-0001_002 2 1
3 -20.00100030 -0.29219970 20190409-0001_002 2 1
and
> tail(data, n=10)
Time Channel.A Source pulsenb expnb
20800511 179.99199405 -0.81815930 20190409-0001_105 105 1
20800512 179.99299405 -0.81815930 20190409-0001_105 105 1
I would like to run extensive data analysis on the now-large data.frame, and I am wondering how to check that there are no lines with file headers hidden in the middle of it. Since the headers are the same in every csv file, does the ldply function already take them into account? Would the file headers end up as separate lines in the "import" data frame? How can I check for that? (Unfortunately, there are around 200 000 lines in each file, so I cannot really check for headers manually.)
I hope I have added all the required details and put the questions in the right format as it is my first time posting here :)
Thank you guys in advance for your help!
I have created a mock environment of folders and files, assuming that you would logically regroup all your files and folders.
# ---
# set up folders and data
lapply( as.list(paste0("iris", 1:3)), dir.create )
iris_write <- function(name) write.csv(x = iris, file = name)
lapply( as.list(paste0("iris", 1:3, "/iris", 1:3, ".csv")), iris_write)
# Supposing you got them all in one folder, one level up
ldir <- list.dirs()
ldir <- ldir[stringr::str_detect(string = ldir, pattern = "iris")] # use 20190409-0001 in your case
# Getting all files
lfiles <- lapply( as.list(ldir), list.files )
# Getting all path
path_fun <- function(dirname) paste0(dirname, "/", list.files(dirname) )
lpath <- lapply( as.list(ldir), path_fun )
Using base R and/or the data.table package
# ---
# --- Import, with functions that detect automatically headers, sep + are way faster to read data
# *** Using data.table
library(data.table)
read_csv_filename <- function(filename){
  ret <- fread(filename)
  ret$Source <- filename #EDIT
  ret
}
ldata <- lapply( unlist(lpath), read_csv_filename ) # unlist so each call gets a single path
# --- if you want to regroup them
# with r base
df_final <- do.call("rbind", ldata)
# using data.table
df_final <- rbindlist(ldata)
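fread detects each file's header automatically, but if you want to double-check that no stray header lines ended up inside the merged data (one of the concerns in the question), a quick sanity check could look like this -- a minimal sketch, assuming a numeric column named Time as in the question's output:
# count rows where Time does not parse as a number -- should be 0
sum(is.na(suppressWarnings(as.numeric(df_final$Time))))
# inspect any offending rows directly
df_final[is.na(suppressWarnings(as.numeric(df_final$Time))), ]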
Using package dplyr
# *** using dplyr
library(dplyr)
library(readr) # for read_csv
read_csv_filename2 <- function(filename){
  ret <- read_csv(filename)
  ret$Source <- filename #EDIT
  ret
}
ldata <- lapply( unlist(lpath), read_csv_filename2 )
df_final <- bind_rows(ldata)
# you may do this with plyr::ldply also
df_final2 <- plyr::ldply(ldata, data.frame)
# *** END loading
Last suggestion: file_path_sans_ext from the tools package
# modifying the file name so it removes ".csv" and change the header
library(tools)
data$Source <- tools::file_path_sans_ext( data$Source )
#create a column with the experiment number, extracted from the file name
data$expnb <- substr(data$Source, 10, 13)
data$expnb <- as.numeric(data$expnb)
Hope this helps :)
I'll add my solution too, using purrr's map_dfr.
Generate Data
This will just generate a lot of csv files in a temp directory for us to manipulate. This is a good approach for helping us answer questions for you.
library(tidyverse)
library(fs)
temp_directory <- tempdir()
library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ {
    str(.x$carrier[[1]])
    vroom::vroom_write(.x, paste0(temp_directory, "/", glue::glue("flights_{.y}.csv")), delim = ",")
  }
)
Custom Function
It looks like you have a custom function to read in some information because the format of the file might be different. Here's my hack at what you were doing.
# List of files
my_files <- fs::dir_ls(temp_directory, glob = "*.csv")
custom_read_csv <- function(file){
  # Read without colnames
  ret <- read_csv(file, col_names = FALSE)
  # Pull out column names
  my_colnames <- unlist(ret[1,])
  # Remove the row
  ret <- ret[-1,]
  # Assign the column names
  colnames(ret) <- my_colnames
  # Trick to remove the alpha in a row you know should be time
  ret <- filter(ret, !is.na(as.numeric(T)))
  ret
}
Now you can read in all of your files with the custom function and combine into a single dataframe using map_dfr:
all_files <- map_dfr(my_files, custom_read_csv, .id = "filename")
head(all_files)
Which looks like this:
> head(all_files)
# A tibble: 6 x 20
filename year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 C:/User~ 2013 1 1 810 810 0 1048 1037 11 9E 3538 N915XJ
2 C:/User~ 2013 1 1 1451 1500 -9 1634 1636 -2 9E 4105 N8444F
3 C:/User~ 2013 1 1 1452 1455 -3 1637 1639 -2 9E 3295 N920XJ
4 C:/User~ 2013 1 1 1454 1500 -6 1635 1636 -1 9E 3843 N8409N
5 C:/User~ 2013 1 1 1507 1515 -8 1651 1656 -5 9E 3792 N8631E
Then you could remove the root path using the following syntax (my path is in there now):
all_files %>%
mutate(filename = str_remove(filename, "C:/Users/AppData/Local/Temp/RtmpkdmJCE/"))
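If only the bare file name is needed, basename() (or fs::path_file()) would avoid hard-coding the temp path; a small alternative sketch:
all_files %>%
  mutate(filename = basename(filename))   # keep just the file name, drop the directory part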

Need to run R code on all text files in a folder

I have a text file. I wrote some R code to extract a certain line of information from it.
###Read file and format
txt_files <- list.files(pattern = '*.txt')
text <- lapply(txt_files, readLines)
text <- sapply(text, function(x) iconv(x, "latin1", "ASCII", sub=""))
###Search and store grep
l =grep("words" ,text)
(k<- length(l))
###Matrix to store data created
mat <- matrix(data = NA, nrow = k, ncol = 2)
nrow(mat)
###Main
for(i in 1:k){
  u <- 1
  while(text[(l[i])-u] != ""){
    line.num <- u
    u <- u+1
  }
  mat[i,2] <- text[(l[i])-u-1]
  mat[i,1] <- i
}
###Write the output file
write.csv(mat, file = "Evalutaion.csv")
It runs on one file at a time. I need to run it on many files and append all the results into a single file, with an additional column that tells me the name of the file each result came from. I am unable to come up with a solution. What changes do I make?
Applying your operations to all files in a folder:
txt_files <- list.files(pattern = '*.txt')
# Apply all your operations to every file with a for loop; inside the loop,
# use the index (txt_files[i]) wherever you were using a single file
for (i in 1:length(txt_files)) {
  # Operation 1
  # Operation 2
  # Operation 3
  write.table(mat, file=paste0("./", sub(".txt","",txt_files[i]), ".csv"), row.names=F, quote=F, sep=",")
}
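If instead you want all results appended into one output file with an extra column naming the source file, as asked, one option is to collect the per-file results in a list and bind them at the end. A minimal sketch that reuses the matrix-building logic from the question (the output column names are just illustrative):
results <- list()
for (f in txt_files) {
  text <- readLines(f)
  text <- iconv(text, "latin1", "ASCII", sub="")
  l <- grep("words", text)
  if (length(l) == 0) next
  mat <- matrix(data = NA, nrow = length(l), ncol = 2)
  for (i in seq_along(l)) {
    u <- 1
    while (text[(l[i])-u] != "") {
      u <- u + 1
    }
    mat[i,2] <- text[(l[i])-u-1]
    mat[i,1] <- i
  }
  results[[f]] <- data.frame(file = f, match = mat[,1], value = mat[,2])
}
# one combined file, with the source file name in the first column
write.csv(do.call(rbind, results), file = "Evaluation_all.csv", row.names = FALSE)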
Merging files with the same headers: I have two csv files with the same headers, Data and Value. The file names are File1.csv and File2.csv, inside a Header folder, and I am merging them together to get one header and all rows and columns. Make sure both files have the same number of columns and the same headers in the same order.
## Read into a list of files, an Example below
library(plyr)
library(gdata)
setwd("./Header") # CSV Files to be merged are in this direcory
## Read into a list of files:
filenames <- list.files(path="./",pattern="*.csv")
fullpath=file.path("./",filenames)
print (filenames)
print (fullpath)
dataset <- do.call("rbind",lapply(filenames,FUN=function(files){read.table(files,sep=",",header=T)}))
dataset
# Data Value
# 1 ABC 23
# 2 PQR 33
# 3 MNP 43 # Till here was File1.csv
# 4 AC 24
# 5 PQ 34
# 6 MN 44 # Till here was File2.csv
write.table(dataset,file="dataset.csv",sep=",",quote=F,row.names=F,col.names=T)

Merging a bunch of csv files into one with headers

I have a couple of csv files I want to combine as a list then output as one merged csv. Suppose these files are called file1.csv, file2.csv, file3.csv, etc...
file1.csv # example of what each might look like
V1 V2 V3 V4
12 12 13 15
14 12 56 23
How would I create a list of these csvs so that I can output a merged csv that would have headers as the file names and the column names at the top as comments? So a csv that would look something like this in Excel:
# 1: V1
# 2: V2
# 3: V3
# 4: V4
file1.csv
12 12 13 15
14 12 56 23
file2.csv
12 12 13 15
14 12 56 23
file3.csv
12 12 13 15
14 12 56 23
I am trying to use the list function inside of a double for loop to merge these csvs together, write each list to a variable, and write each variable to a table output. However, this does not work as intended.
# finding the correct files in the directory
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
levels <- unique(gsub( "-.*$", "", files_filter))
# merging
for(i in 1:length(levels)){
  level_specific <- files_filter[grepl(levels[i], files_filter)]
  bindme
  for(j in 1:length(level_specific)){
    bindme2 <- read.csv(paste("test files/shortened/", level_specific[j], sep=""))
    bindme <- list(bindme, bindme2)
    assign(levels[i], bindme)
  }
  write.table(levels[i], file = paste(levels[i], "-output.csv", sep=""), sep=",")
}
Looking at your code, I think you don't need a for-loop. With the data.table package you could do it as follows:
filenames <- list.files(pattern="*.csv")
files <- lapply(filenames, fread) # fread is the fast reading function from the data.table package
merged_data <- rbindlist(files)
write.csv(merged_data, file="merged_data_file.csv", row.names=FALSE)
If at least one of the csvs has column names set, they will be used in the resulting data.table.
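If you also want to keep track of which file each row came from (closer to the per-file sections in your desired output), rbindlist can add that for you via idcol; a small sketch building on the same objects (the source_file column name is just a choice):
names(files) <- filenames                               # name each list element by its file
merged_data <- rbindlist(files, idcol = "source_file")  # adds a column with the file name
write.csv(merged_data, file="merged_data_file.csv", row.names=FALSE)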
Considering your code, it could be improved considerably. This:
files <- dir("test files/shortened")
files_filter <- files[grepl("*\\.csv", files)]
can be replaced by just:
filenames <- list.files(pattern="*.csv")
In your for-loop, the first time you reference bindme it hasn't been defined yet, so that line isn't doing anything. What is it supposed to be? A list? A data.frame? You could initialize it with something like:
bindme <- data.table() # or data.frame()
Furthermore, the part:
write.table(levels[i],file = paste(levels[i],"-output.csv",sep=""),sep=",")
will generate several csv-files, but you wanted just one merged file.
Would this help?
library(plyr) # for ldply
mergeMultipleFiles <- function(dirPath, nameRegex, outputFilename){
  filenames <- list.files(path=dirPath, pattern=nameRegex, full.names=TRUE, recursive=T)
  dataList <- lapply(filenames, read.csv, header=T, check.names=F)
  combinedData <- ldply(dataList, rbind)
  write.csv(combinedData, outputFilename)
}
ps: There is a regex thrown in for filenames. Just in case you want to only merge certain "pattern" of files.
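For example, calling it on the folder from the question (the pattern and output name below are just illustrative):
mergeMultipleFiles(dirPath = "test files/shortened",
                   nameRegex = "\\.csv$",
                   outputFilename = "merged-output.csv")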
Modify this example. If I understood your question correctly it will help you.
# get the names of the csv files in your current directory
file_names = list.files(pattern = "[.]csv$")
# for every name you found go and read the csv with that name
# (this creates a list of files)
import_files = lapply(file_names, read.csv)
# append those files one after the other (collapse list elements to one dataset) and save it as d
d=do.call(rbind, import_files)

Is there an efficient way to append to an existing csv file without duplicates in R?

A data.frame is appended to an existing file. When it is appended with the write.table function, it might introduce duplicated records into the file. Here is the sample code:
df1<-data.frame(name=c('a','b','c'), a=c(1,2,2))
write.csv(df1, "export.csv", row.names=FALSE, na="NA");
#"export.csv" keeps two copies of df1
write.table(df1,"export.csv", row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F);
So ideally the output file should only keep one copy of df1. But the write.table function doesn't have any parameter for duplicate check.
Thank you for any suggestion in advance.
You could read the data.frame from file, rbind it with the new data.frame and check for duplicate values. For writing efficiency, append only the non-duplicate rows.
If you came up with this question because you are working with big data sets and read/write time is of concern, take a look at the data.table package and its fread function.
# initial data.frame
df1<-data.frame(name=c('a','b','c'), a=c(1,2,2))
write.csv(df1, "export.csv", row.names=FALSE, na="NA")
# a new data.frame with a couple of duplicate rows
df2<-data.frame(name=c('a','b','c'), a=c(1,2,3))
dfRead<-read.csv("export.csv") # read the file
all<-rbind(dfRead, df2) # rbind both data.frames
# get only the non duplicate rows from the new data.frame
nonDuplicate <- all[!duplicated(all)&c(rep(FALSE, dim(dfRead)[1]), rep(TRUE, dim(df2)[1])), ]
# append the file with the non duplicate rows
write.table(nonDuplicate,"export.csv", row.names=F,na="NA",append=T, quote= FALSE, sep=",", col.names=F)
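For bigger files, the same read / de-duplicate / append pattern can be written with data.table's fread and fwrite, which are much faster; a minimal sketch, not a drop-in for the exact code above:
library(data.table)
dfRead <- fread("export.csv")   # read the existing file
all <- rbind(dfRead, df2)       # rbind with the new data.frame
# keep only the new, non-duplicate rows
nonDuplicate <- all[!duplicated(all) & c(rep(FALSE, nrow(dfRead)), rep(TRUE, nrow(df2)))]
# append them without repeating the header
fwrite(nonDuplicate, "export.csv", append = TRUE, col.names = FALSE)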
> # Original Setup ----------------------------------------------------------
> df1 <- data.frame(name = c('a','b','c'), a = c(1,2,2))
> write.csv(df1, "export.csv", row.names=FALSE, na="NA");
>
> # Add Some Data -----------------------------------------------------------
> df1[,1] <- as.character(df1[,1])
> df1[,2] <- as.numeric(df1[,2])
> df1[4,1] <- 'd'
> df1[4,2] <- 3
>
> # Have a Look at It -------------------------------------------------------
> head(df1)
name a
1 a 1
2 b 2
3 c 2
4 d 3
>
> # Write It Out Without Duplication ----------------------------------------
> write.table(df1, "export.csv", row.names=F, na="NA",
+ append = F, quote= FALSE, sep = ",", col.names = T)
>
> # Proof It Works ----------------------------------------------------------
> proof <- read.csv("export.csv")
> head(proof)
name a
1 a 1
2 b 2
3 c 2
4 d 3
You could alternatively follow the comment on your question that recommended rbind, or simply use write.csv or write.table with an append = TRUE option, making sure to properly handle row and column names.
However, I would also recommend using readRDS and saveRDS and simply overwriting the rds object rather than appending, as a best practice. The use of RDS is recommended by Hadley and other prominent names in R.
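A minimal sketch of that rds workflow (the file name is just a placeholder):
saveRDS(df1, "export.rds")                       # first write
combined <- rbind(readRDS("export.rds"), df2)    # later: read and add the new rows
combined <- combined[!duplicated(combined), ]    # drop any duplicates
saveRDS(combined, "export.rds")                  # overwrite the whole object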

Making the same calculation for multiple files

I have different csv files with different names. I want to make some calculations and after that I want to save the results into one csv file.
My data of two csv files have this format:
File 1:
day price
2000-12-01 00:00:00 2
2000-12-01 06:00:00 3
2000-12-01 12:00:00 NA
2000-12-01 18:00:00 3
File 2:
day price
2000-12-01 00:00:00 12
2000-12-01 06:00:00 NA
2000-12-01 12:00:00 14
2000-12-01 18:00:00 13
To read the files I use this:
file1 <- read.csv(path_for_file1, header=TRUE, sep=",")
file2 <- read.csv(path_for_file2, header=TRUE, sep=",")
An example of calculation process:
library(xts)
file1 <- na.locf(file1)
file2 <- na.locf(file2)
And save the results into a csv where the timestamp is the same for the csv files:
merg <- merge(x = file1, y = file2, by = "day", all = TRUE)
write.csv(merg, file='path.csv', row.names=FALSE)
This is what I tried for reading multiple files. Any ideas how I can extend the process from 2 files to n files?
You say that your data are comma-separated, but you show them as space-separated. I'm going to assume that your data are truly comma-separated.
Rather than reading them into separate objects, it's easier to read them into a list. It's also easier to use read.zoo instead of read.csv because merging time-series is a lot easier with xts/zoo objects.
# get list of all files (change pattern to match your actual filenames)
files <- list.files(pattern="file.*csv")
# loop over each file name and read data into an xts object
xtsList <- lapply(files, function(f) {
  d <- as.xts(read.zoo(f, sep=",", header=TRUE, FUN=as.POSIXct))
  d <- align.time(d, 15*60)
  ep <- endpoints(d, "minutes", 15)
  period.apply(d, ep, mean)
})
# set the list names to the file names
names(xtsList) <- files
# merge all the file data into one object, filling in NA with na.locf
x <- do.call(merge, c(xtsList, fill=na.locf))
# write out merged data
write.zoo(x, "path.csv", sep=",")
