I have a list of files that are all named similarly: "FlightTrackDATE.txt", where the date is expressed as YYYYMMDD. I list them all with list.files(), but that gives me every file in the folder (only flight track files are in this folder). What I would like to do is create a new file that combines all the files from the last 90 days (or three months, whichever is easier) and ignores the others.
You can try this:
# date from which you want to consolidate (replace with the required date)
fromDate <- as.Date("2015-12-23")

for (filename in list.files()) {
  # extract the date from the filename (characters 12-19)
  filenameDate <- as.Date(substr(filename, 12, 19), format = "%Y%m%d")
  # read and consolidate if the file date is on or after fromDate
  if (filenameDate >= fromDate) {
    if (!exists("consolidated")) {
      # start the consolidated data with the first file
      consolidated <- read.table(filename, header = TRUE)
    } else {
      # row-bind every later file onto it
      data <- read.table(filename, header = TRUE)
      consolidated <- rbind(consolidated, data)
    }
  }
}
OUTPUT:
I have three sample files:
FlightTrack20151224.txt
FlightTrack20151223.txt
FlightTrack20151222.txt
Sample data:
Name Speed
AA101 23
Consolidated data:
Name Speed
1 AA102 24
2 AA101 23
Note:
1. Create the from-date by subtracting from the current date (see the sketch below) or by using a fixed date as above.
2. Remember to clean up the existing consolidated data if you are running the script again. Data duplication might occur otherwise.
3. Save consolidated to file :)
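For example, a minimal sketch of notes 1 and 3 (the output file name is only an illustration):
# rolling window: consolidate everything from the last 90 days
fromDate <- Sys.Date() - 90
# ... run the loop above, then save the result
write.table(consolidated, "FlightTrackLast90Days.txt", row.names = FALSE)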
Consider an lapply() solution without the need for list.files(), since you know the directory and file-name structure ahead of time:
path = "C:/path/to/txt/files"
# LIST OF ALL LAST 90 DATES IN YYYYMMDD FORMAT
dates <- lapply(0:90, function(x) format(Sys.Date()-x, "%Y%m%d"))
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES
dfList <- lapply(paste0(path, "FlightTrack", dates, ".txt"),
function(x) if (file.exists(x)) {read.table(x)})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# OUTPUT FINAL FILE TO TXT
write.table(df, paste0(path, "FlightTrack90Days.txt"), sep = ",", row.names = FALSE)
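Missing days are not a problem here: when a file does not exist, the anonymous function returns NULL for that element, and do.call(rbind, dfList) silently drops the NULL entries.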
I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts: two letters, four digits, 'y', and '.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the two letters and the four digits change across the .dta files.
I have written a code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset(db, year == max(db$year)) # keep the last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
for (i in list.name) {
  obj_name <- paste(i, '2015y', sep = '')
  file_name <- file.path('E:/Folder', paste(obj_name, 'dta', sep = '.'))
  input <- read.dta13(file_name)
  assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind(at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, es2015y, fi2015y,
                fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
                nl2015y, no2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column whose values are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
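Since in_files holds the full paths, you can afterwards derive, say, the country code from the id column. A minimal sketch (the country column is just an illustration):
df2015$country <- substr(basename(df2015$id), 1, 2)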
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern='*.dat', full.names = TRUE)
determine which files to read (NOTE: check that these are the correct substr positions; they are an estimate on my side)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the list names
names(df.list) <- gsub("E:/Folder/", "", names(df.list))
bind the data frames together into one data.frame/data.table and create an id-column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
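If you also want the country and year as separate columns, you can parse them out of the id column. A minimal sketch, assuming the file-name pattern from the question ('de2015y.dta' and so on):
df$country <- substr(df$id, 1, 2)
df$year <- as.integer(substr(df$id, 3, 6))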
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="*.dta")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
# (all data frames need the same row count for this last step to work)
mydatafrm = data.frame(files)
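If you instead want all file contents stacked by rows into a single data frame (the columns just have to match), a row-bind avoids the equal-row-count restriction:
mydatafrm = do.call(rbind, files)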
Before I dive into the question, here is a similar problem that was asked earlier but has no solution yet.
I am working in R, and there is a folder in my working directory called columns that contains 198 similar .csv files, each named with a 6-digit integer (e.g. 100000) that increases inconsistently (the file names are actually the names of the variables).
Now I would like to full-join them, but somehow I have to import all of those files into R first and then join them. Naturally, I thought about using a list to contain those files and then a loop to join them. This is the code I tried:
#These are the first 3 columns containing identifiers
matrix_starter <- read_csv("files/matrix_starter.csv")
## import_multiple_csv_files_to_R
# Purpose: Import multiple csv files to the Global Environment in R
# set working directory
setwd("columns")
# list all csv files from the current directory
list.files(pattern=".csv$") # use the pattern argument to define a common pattern for import files with regex. Here: .csv
# create a list from these files
list.filenames <- list.files(pattern=".csv$")
#list.filenames
# create an empty list that will serve as a container to receive the incoming files
list.data <- list()
# create a loop to read in your data
for (i in 1:length(list.filenames)) {
  list.data[[i]] <- read.csv(list.filenames[i])
  list.data[[i]] <- list.data[[i]] %>%
    select(`Occupation.Title`, `X2018.Employment`) %>%
    rename(`Occupation title` = `Occupation.Title`) #%>%
    #rename(list.filenames[i] = `X2018.Employment`)
}
# add the names of your data to the list
names(list.data) <- list.filenames
# now you can index one of your tables like this
list.data$`113300.csv`
# or this
list.data[1]
# source: https://www.edureka.co/community/1902/how-can-i-import-multiple-csv-files-into-r
The chunk above solves the importing part. Now I have a list of .csv files. Next, I would like to join them:
for (i in 1:length(list.filenames)){
matrix_starter <- matrix_starter %>% full_join(list.data[[i]], by = "Occupation title")
}
However, this does not work nicely: I end up with around 47,000 rows, whereas I expect only around 1,700. Please let me know your opinion.
Reading the files into R as a list and including the file name as a column can be done like this:
library(readxl)    # read_xls() comes from here

path <- "./files"  # assumption: the folder holding the input files

files <- list.files(path = path,
                    full.names = TRUE,
                    all.files = FALSE)
files <- files[!file.info(files)$isdir]

data <- lapply(files,
               function(x) {
                 data <- read_xls(x, sheet = 1)
                 data$File_name <- x
                 data
               })
I am assuming now that all your excel files have the same structure: the same columns and column types.
If that is the case you can use dplyr::bind_rows to create one combined data frame.
You could of course also loop through the list and left_join the list elements, e.g. by using Reduce and merge, as sketched below.
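A minimal sketch of both options, using the data list from above and the Occupation title key from the question:
library(dplyr)
# stack the list of data frames into one (same columns assumed)
combined <- bind_rows(data)
# or: left-join the list elements pairwise on a shared key column
combined <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all.x = TRUE),
                   data)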
Update based on mihndang's comment: is this what you are after when you say "Is there a way to use the file name to name the column and also not include the column of file names?"
library(dplyr)
library(stringr)
path <- "./files"
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
               function(x) {
                 read.csv(x, stringsAsFactors = FALSE)
               })
col1 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Values")
col2 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Character")
df1 <- data[[1]] %>%
  rename(!!col1 := Value,
         !!col2 := Character)
I created two simple .csv files in ./files: file1.csv and file2.csv. I read them into a list, extract the first list element (the data frame), and build the new column names in two variables. I then rename the columns of the data frame by passing those two variables in; the new column names include the file name.
Result:
> View(df1)
> df1
file1: Values file1: Character
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
I guess you are looking for:
result <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all = TRUE), list.data)
which can be done using purrr's reduce() as well:
result <- purrr::reduce(list.data, dplyr::full_join, by = "Occupation title")
When you do a full join, it adds every combination of matching rows, which inflates the table. If you are looking for unique records, you might want a left join instead: keep the data frame whose rows you want as the reference on the left, and join the file you want to add on the right (see the sketch below).
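A minimal sketch with dplyr, starting from the matrix_starter and list.data objects in the question:
library(dplyr)
# keep matrix_starter's rows as the reference; add matching columns from each file
result <- purrr::reduce(list.data, left_join, .init = matrix_starter, by = "Occupation title")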
Hope this helps.
I am working in a folder (directory1) and I need to first modify and then use .csv files present in another folder (directory2).
First, I would like to insert values into a column based on the file name, and I would like to do this in a loop for all subjects.
I know how to do it for single files, but not sure how to create the loop.
#Choose directory with .csv files to read
setwd("/Users/R/directory2")
d = read.table("ppt01_EvF.csv", sep=",")
#Change columns names
colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
#Read file name
filenames <- "ppt01_EvF.csv"
# Remove ".csv"
filenames2 <- sub(".csv", "", filenames)
# Split the string by "_"
filenames_vec <- strsplit(filenames2, split = "_")[[1]]
# Create new column to store the information
d$PPT_N_NUMBER <- filenames_vec[1]
Second, I would like to save all the .csv files as one big file containing all the participants, with just one row of column names at the top of the new big file.
Last, I would like to save this new big file (.csv) in the folder I am working in (directory1), i.e. a different directory than the one where the single files are stored.
I would appreciate if someone could help me to understand the best way to do this.
It should be something like this:
setwd("/Users/R/directory2")
files <- list.files()
library(data.table)
data_list <- list()
for (i in 1:length(files)) {
  file_name <- files[i]
  d <- fread(file_name, sep = ",")
  # change column names
  colnames(d) <- c("Order","Condition","Press","Response","Time","Time2")
  # split the file name by "_"
  filenames_vec <- strsplit(file_name, split = "_")[[1]]
  # create a new column to store the participant number
  d$PPT_N_NUMBER <- filenames_vec[1]
  data_list[[i]] <- d
}
all_data <- rbindlist(data_list)
fwrite(all_data, '../directory1/all_data.csv')
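Note that fwrite() writes the header row only once, so the combined file has a single row of column names at the top, and the relative path '../directory1/all_data.csv' places the output in directory1, next to directory2.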
I have a series of data files that look like sales20160101.txt, sales20160102.txt, ..., sales20171231.txt.
Now I want to read them all and combine them, but I also need a date variable to help me identify their occurrence time, so the date variable will be 20160101, 20160102, ..., 20171231.
My idea is:
split the filename into "sales" + "time"
replicate the time value to match the data length whenever I read a file
cbind the data and the time.
Thanks a lot.
We could do this with fread and rbindlist from data.table
library(data.table)
#find the files that have names starting as 'sales' followed by numbers
#and have .txt extension
files <- list.files(pattern = "^sale.*\\d+\\.txt", full.names = TRUE)
#get the dates
dates <- readr::parse_number(basename(files))
#read the files into a list and rbind it
dt <- rbindlist(setNames(lapply(files, fread), dates), idcol = 'date')
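The date column created by idcol = 'date' is character; if you need an actual Date class, one option is a data.table in-place conversion:
dt[, date := as.IDate(date, "%Y%m%d")]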
I usually would do a variation of the following:
# find the files
ls <- list.files(pattern = '^sales')
# Get the dates
dates <- gsub('sales', '', tools::file_path_sans_ext(ls))
# read in the data
dfs <- lapply(ls, read.table)
# match the dates
names(dfs) <- dates
# bind all data together and include the date as a column
df <- dplyr::bind_rows(dfs, .id = 'date')
I have a folder full of .txt files that I want to loop through and compress into one data frame, but each .txt file holds data for one subject, and there are no columns in the text files that indicate the subject number or the time point in the study (e.g. 1-5). I need to add a line or two of code to my loop that looks for a string of four numbers (each file is labeled something like "4325.5_ERN_No_Startle"), puts 4325 in one column and 5 in another, and repeats those values for every data point for that subject until the loop gets to the next one. I have been looking for a while but am still coming up empty; any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern =".txt")
for (i in 1:length(file.names)) {
  file <- read.table(file.names[i], header = FALSE, fill = TRUE)
  out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse the file name for the study period and subject, both of which are then column-bound inside an lapply() over list.files():
path = "path/to/text/files"
# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern="[0-9]{4}\\.[0-9]{1}.*\\.txt$", full.names=TRUE)
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BINDS THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
  if (file.exists(x)) {
    data.frame(period = regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
               subject = regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
               read.table(x, header = FALSE, fill = TRUE),
               stringsAsFactors = FALSE)
  }
})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# REMOVE THE PERIOD FROM SUBJECT (IT WAS NEEDED EARLIER TO ANCHOR THE SUBJECT DIGIT)
df$subject <- gsub("\\.", "", df$subject)
You can try to use tryCatch, which basically gives you a NULL instead of an error.
file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE), error = function(e) NULL)
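Since rbind() ignores NULL arguments, the loop then simply skips any file that could not be read (such as an empty one) instead of stopping with an error.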