Read in a file which has part of the file name changing - r

I have some file names which look like the following:
Year 1:
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
These are all located in the same file path. However, in a different file path for a different year, I have very similar files:
Year 2:
blds_PANEL_MHG_9817_9876
blds_PANEL_HKG_9817_9876
blds_PANEL_DPR_9817_9876
Some of the files have the same names as the previous years yet some of the names change. The only part of the name which changes is the MHG, HKG, DPR sections of the name, the blds_PANEL_ stays the same along with 9817_9876.
I have created a paste0() call:
file_path <- "C:/Users..."
product <- "blds"
part_which_keeps_changing <- "HKG"
weeks <- "9817_9876"
read.csv(paste0(file_path, product, product, part_which_keeps_changing, weeks, ".DAT"), header = TRUE)
It was working well for one product, however for new products I am running into some errors. So I am trying to load in data which perhaps ignores this part of the file name.
EDIT: This seems to solve what I am looking to do
temp <- list.files(paste0(files, product), pattern = "*.DAT")
location <- paste0(files, product, temp)
myfiles = lapply(location, read.csv)
library(plyr)
df <- ldply(myfiles, data.frame)
However, I am running into a slightly different problem for some of the files.
If I have the following:
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
It is possible that one of the files contains no information; when that happens, lapply breaks and stops loading the data.
Is it possible to skip over these files? Here's the error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
EDIT 2:
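This seems to override the lapply errors:
lapply_with_error <- function(X, FUN, ...) {
  # return NULL for any element where FUN fails instead of stopping;
  # the extra arguments in ... are forwarded to FUN
  lapply(X, function(x) tryCatch(FUN(x, ...),
                                 error = function(e) NULL))
}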
myfiles = lapply_with_error(location, read.delim)
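As a minimal sketch of the whole workflow (not the asker's actual data), the example below matches the changing part of the file name with a regular expression, skips files that cannot be read, and drops the resulting NULL entries before combining. The helper name `read_or_null`, the temporary directory, and the two-column file contents are assumptions made for illustration.

```r
# Hypothetical helper: read one file, or return NULL (with a message) on failure.
read_or_null <- function(path, ...) {
  tryCatch(
    read.csv(path, ...),
    error = function(e) {
      message("Skipping ", basename(path), ": ", conditionMessage(e))
      NULL
    }
  )
}

# Build a throwaway folder with two readable files and one empty file.
panel_dir <- tempfile("panel_")
dir.create(panel_dir)
writeLines(c("id,value", "1,10", "2,20"),
           file.path(panel_dir, "blds_PANEL_DPK_8237_8283.DAT"))
file.create(file.path(panel_dir, "blds_PANEL_DPR_8237_8283.DAT"))  # zero bytes
writeLines(c("id,value", "3,30"),
           file.path(panel_dir, "blds_PANEL_MWK_8237_8283.DAT"))

# ".*" stands in for the part of the name that keeps changing.
location <- list.files(panel_dir,
                       pattern = "^blds_PANEL_.*_8237_8283\\.DAT$",
                       full.names = TRUE)

myfiles <- lapply(location, read_or_null)
myfiles <- Filter(Negate(is.null), myfiles)  # drop the files that failed
df <- do.call(rbind, myfiles)
```

Using `full.names = TRUE` avoids pasting the folder and file name together by hand, and the message makes it visible which files were skipped.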

Related

My R script not picking up all the files in the folder

My R script is trying to aggregate Excel spreadsheets that are in different folders within the Concerned Files folder (shown in the directory below) and put all the data into one master file. However, the script seems to pick up only some of the files, and when I run the code the following error appears, so I assume this is why it is not reading every file in the folder:
all_some_data <- rbind(all_some_data, temp)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The whole code:
#list of people's name it has to search the folders for. For our purposes, i am only taking one name
managers <- c("Name")
#directory of all the files
directory = 'C:/Users/Username/OneDrive/Desktop/Testing/Concerned Files/'
#Create an empty dataframe
all_some_data <-
setNames(
data.frame(matrix(ncol = 8, nrow = 0)),
c("Employee", "ID", "Overtime", "Regular", "Total", "Start", "End", "Manager")
)
str(files)
#loop through managers to get time sheets and then add file to combined dataframe
for (i in managers){
#a path to find all the extract files
files <-
list.files(
path = paste(directory, i, "/", sep = ""),
pattern = "*.xls",
full.names = FALSE,
recursive = FALSE
)
#for each file, get a start and end date of period, remove unnecessary columns, rename columns and add manager name
for (j in files){
temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
#a bunch of manipulations with the data being copied over. Code not relevant to the problem
all_some_data <- rbind(all_some_data, temp)
}
}
The most likely cause of your problem is an extra column in one or more of your files.
A potential solution, which also improves performance, is the bind_rows function from the dplyr package. Unlike base R's rbind, it matches columns by name and fills columns that are missing from a data frame with NA instead of stopping when the column counts differ.
Wrap your loop in an lapply call, then call bind_rows once on the entire list of data frames.
output <-lapply(files, function(j) {
temp <- read_excel(paste(directory, i, "/", j, sep = ""), skip = 8)
#a bunch of manipulations with the data being copied over.
# Code not relevant to the problem
temp #this is the returned value to the list
})
all_some_data <- dplyr::bind_rows(output)
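To confirm the "extra column" diagnosis before combining anything, one option is to compare the column counts of the data frames in the list. The sketch below uses base R only; the three small data frames and their file names are made up for illustration and stand in for the `output` list above.

```r
# Hypothetical stand-in for the list returned by lapply(); names are file names.
output <- list(
  "week1.xls" = data.frame(Employee = "x", ID = 1),
  "week2.xls" = data.frame(Employee = "y", ID = 2),
  "week3.xls" = data.frame(Employee = "z", ID = 3, Notes = "extra")
)

# Number of columns in each data frame.
n_cols <- vapply(output, ncol, integer(1))

# Treat the most frequent column count as the expected one.
expected <- as.integer(names(which.max(table(n_cols))))

# Files whose column count differs from the expected count.
odd_files <- names(n_cols)[n_cols != expected]
print(odd_files)
```

Any file listed in `odd_files` is worth opening by hand to see which column is extra or missing.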

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
destfile = temp, mode = "wb")
processFile = function(filepath) {
dat_list <- list()
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
append(dat_list, line)
}
close(con)
return(dat_list)
}
dat_list <- processFile(temp)
Here's a possible alternative
processFile = function(filepath, header=TRUE, ...) {
lines <- readLines(filepath)
comments <- which(grepl("^#", lines))
header_row <- gsub("^#","",lines[tail(comments,1)])
data <- read.table(text=c(header_row, lines[-comments]), header=header, ...)
return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
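As for why the original function returns an empty list: append() returns a modified copy and does not change dat_list in place, so its result has to be assigned back. A minimal sketch of the corrected loop follows; the function name `processFileLines` and the four-line sample file are assumptions for illustration, not the NOAA data.

```r
processFileLines <- function(filepath) {
  dat_list <- list()
  con <- file(filepath, "r")
  on.exit(close(con))            # close the connection even if an error occurs
  while (TRUE) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break
    dat_list <- append(dat_list, line)  # keep the value that append() returns
  }
  dat_list
}

# Small made-up file: two comment lines followed by two data lines.
sample_path <- tempfile(fileext = ".txt")
writeLines(c("# comment", "# year month", "2010 1", "2010 2"), sample_path)

all_lines <- processFileLines(sample_path)
data_lines <- Filter(function(l) !startsWith(l, "#"), all_lines)
```

Growing a list one element at a time is slow for long files, which is one reason to prefer a single readLines(filepath) call as in the alternative above.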
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
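If a shell with grep is not available (for example on Windows), the comment lines can be counted from R instead. This is a sketch under two assumptions: the comment block is no longer than 100 lines, and its last line holds the column names. The sample file below is made up and only mimics the layout described in the question.

```r
# Made-up file with the same layout: comment lines, then whitespace-separated data.
path <- tempfile(fileext = ".txt")
writeLines(c("# Some description",
             "#",
             "# year month day cycle trend",
             "2010 1 1 387.82 388.50",
             "2010 1 2 387.83 388.51"), path)

# Read only the top of the file and count the lines that start with "#".
head_lines <- readLines(path, n = 100)
n_comments <- sum(grepl("^#", head_lines))

# The last comment line holds the column names; [-1] drops the leading "#".
hdrs <- scan(path, skip = n_comments - 1, nlines = 1,
             what = "character", quiet = TRUE)[-1]

# read.table() skips "#" lines by default (comment.char = "#").
co2 <- read.table(path, col.names = hdrs)
```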
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)
lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)
comments <- lines %>%
filter(stringr::str_detect(text, "^#"))
hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]
lines %>%
filter(!stringr::str_detect(text, "^#")) %>%
mutate(text = trimws(text)) %>%
tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
mutate_all(as.numeric)

Trouble reading content of .txt files in multiple subfolders into R

I have data of the structure:
Main_Text
Sub1_text
Sub2_text
Etc (I have several hundred subfolders)
Each subfolder contains multiple .txt files.
I want to read all of the files into R, to create a data frame that looks like this:
Filename | Text
Name of file | Content of .txt file
I've tried the following two approaches, and neither quite works. Any help would be appreciated.
1) Using the readtext package: although this package supposedly loops through subfolders, I cannot get it to do so. The code to loop through the files in the readtext vignette should work like this:
dir <- "/Users/Main_Folder"
text = readtext(paste0(dir, "/Main_Text/*.txt"))
This only produces an error:
Error in listMatchingFiles(i, ignoreMissing = ignoreMissing, lastRound = T) : File '' does not exist.
It works, however, if I specify the subfolder, i.e.
text = readtext(paste0(dir, "/Main_Text/Sub1_text*.txt"))
but given that I have several hundred subfolders, I need a more recursive solution.
2) I've also tried the following two step solution, where I create a list of the files first and then attempt to read in the text, which is also resulting in an error:
This generates an accurate list of all my files, but obviously doesn't include a content generating step:
setwd("/Users/Main_Folder")
dat = basename(list.files(pattern = ".txt$", recursive = TRUE, full.names=TRUE, include.dirs=TRUE))
So I also tried:
mypath="/Users/Main_Folder/"
txt_files_ls = list.files(path=mypath, recursive=T, pattern="*.txt")
Which works, however:
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = F, fill=T, sep =",")})
Throws an error:
Error in read.table(file = x, header = F, fill = T, sep = ",") :
no lines available in input
In addition: There were 42 warnings (use warnings() to see them)
If I specify
header=T
I get a different error:
Error in read.table(file = x, header = T, fill = T, sep = ",") :
more columns than column names
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
So I can't even get to the final step of combining them using something like
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
I have a sense of why this is, given that the text files themselves don't have headers, and have random formatting (they're press releases). Here's a sample of one of my .txt files:
cat(readLines("Aderholt_text/Aderholt1-28-11.txt"), sep = "\n")
Friday January 28, 2011 Contact: Darrell "DJ" Jordan 202-225-4876 CONGRESSMAN ROBERT ADERHOLT STATEMENT ON THE VIOLENCE IN ALBANIA Washington, DC - Congressman Robert Aderholt (R-Alabama) today issued th
I'm sure I'm missing something small, but can anyone help illuminate how to correctly read in the filenames + text, either using one of the half-working solutions I've tried, or something else?
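One possible approach, sketched here in base R, is to skip read.table altogether (press releases are free text, not tables) and read each file with readLines. Two details matter: list.files() needs full.names = TRUE so the returned paths can be opened from any working directory, and the pattern should be a regular expression such as "\\.txt$". The function name `read_text_tree` and the sample folders are assumptions for illustration.

```r
# Hypothetical helper: one row per .txt file found anywhere below `root`.
read_text_tree <- function(root) {
  paths <- list.files(root, pattern = "\\.txt$",
                      recursive = TRUE, full.names = TRUE)
  text <- vapply(
    paths,
    function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(Filename = basename(paths), Text = text,
             stringsAsFactors = FALSE)
}

# Throwaway folder tree that mimics Main_Text/Sub1_text, Main_Text/Sub2_text.
root <- tempfile("Main_Text_")
dir.create(file.path(root, "Sub1_text"), recursive = TRUE)
dir.create(file.path(root, "Sub2_text"), recursive = TRUE)
writeLines(c("first line", "second line"), file.path(root, "Sub1_text", "a.txt"))
writeLines("another release", file.path(root, "Sub2_text", "b.txt"))

releases <- read_text_tree(root)
```

An empty file simply yields an empty string in the Text column rather than an error.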

R rename files keeping part of original name

I'm trying to rename all files in a folder (about 7,000 files) with just a portion of their original name.
The initial fip code is a 4 or 5 digit code that identifies counties, and is different for every file in the folder. The rest of the name in the original files is the state_county_lat_lon of every file.
For example:
Original name:
"5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth"
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth"
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth"
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth"
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth"
And I need it to rename with just the initial code (fips):
"5081.wth"
"7083.wth"
"11085.wth"
"13087.wth"
"17089.wth"
I've tried using the list.files and file.rename functions, but I do not know how to extract the code from the full name. Some kind of "wildcard" could work, but I don't know how to apply one properly, because the names all follow the same pattern but differ in content.
This is what I've tried this far:
setwd("C:/Users/xxx")
Files <- list.files(path = "C:/Users/xxx", pattern = "fips_*.wth" all.files = TRUE)
newName <- paste("fips",".wth", sep = "")
for (x in length(Files)) {
file.rename(nFiles,newName)}
I've also tried with the "sub" function as follows:
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("_*", ".wth", Files)}
but get Error in as.character(x) :
cannot coerce type 'closure' to vector of type 'character'
OR
setwd("C:/Users/xxxx")
Files <- list.files(path = "C:/Users/xxxx", all.files = TRUE)
for (x in length(Files)) {
sub("^(\\d+)_.*", "\\1.wth", file)}
Which runs without errors but does nothing to the names in the file.
I could use any help.
Thanks
Here is my example.
First, create some test files to work with:
dir.create("test_dir")
data_sets <- c("5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth",
"7083_Illinois_Jersey_-90.3424_39.0953_-90.25_39.25.wth",
"11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth",
"13087_Illinois_Johnson_-88.8788_37.4559_-88.75_37.25.wth",
"17089_Illinois_Kane_-88.4342_41.9418_-88.25_41.75.wth")
setwd("test_dir")
file.create(data_sets)
Rename the files:
Files <- list.files(all.files = TRUE, pattern = "\\.wth$")
newName <- sub("^(\\d+)_.*", "\\1.wth", Files)
file.rename(Files, newName)
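With about 7,000 files it may be worth doing a dry run first, since two files that map to the same new name would otherwise overwrite or fail silently. The following is a sketch of that check in a temporary directory; the directory, the `plan` data frame and the two sample names are for illustration only.

```r
# Work in a throwaway directory so nothing real is renamed.
test_dir <- tempfile("wth_")
dir.create(test_dir)
old_names <- c("5081_Illinois_Jefferson_-88.9255_38.3024_-88.75_38.25.wth",
               "11085_Illinois_Jo_Daviess_-90.196_42.3686_-90.25_42.25.wth")
file.create(file.path(test_dir, old_names))

files <- list.files(test_dir, pattern = "\\.wth$")
new_names <- sub("^(\\d+)_.*", "\\1.wth", files)  # keep only the leading digits

# Dry run: inspect the mapping and refuse to continue if two names collide.
plan <- data.frame(from = files, to = new_names, stringsAsFactors = FALSE)
print(plan)
stopifnot(anyDuplicated(new_names) == 0)

# file.rename() is vectorised and returns TRUE/FALSE for each file.
renamed <- file.rename(file.path(test_dir, files),
                       file.path(test_dir, new_names))
```

Note that sub() is vectorised, so no for loop is needed, and its result has to be assigned and passed to file.rename() for anything to change on disk.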

Merge multiple CSV files and remove duplicates in R

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions; however, the number of files is usually quite small. I hope you can help me write code in R that does this job both efficiently and effectively.
The CSV files have the following format:
Image of CSV format:
I changed (in column 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.
Raw text file:
"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 #A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 #B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 #C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 #D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 #E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27"
Somehow my headers are messed up, they obviously should move one column to the right. Each CSV file contains up to 1500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets) simply because these should be unique and the author columns can be similar (e.g. one author posting multiple tweets).
Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I included two links to blog posts by Hayward Godwin that discuss three approaches for merging CSV files.
http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!
Although I have tried some code I found on the web, this didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):
filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This results in the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file '..': No such file or directory
Update
I have tried the following code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';', col.names=c('ID','tweet','author','local.time'), colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
But I run into the following errors:
After the 3rd line I get:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
After the 4th line I get:
Error: object 'my.df' not found
I suspect that these errors are caused by mistakes made when the CSV files were written, since in some cases the author/local.time values are in the wrong column, either to the left or to the right of where they are supposed to be, which results in an extra column. I manually corrected 5 files and tested the code on them, and I didn't get any errors. However, it seemed like nothing happened at all. I didn't get any output from R?
To solve the extra column problem I adjusted the code slightly:
#grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';', col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
I tried this code on all the files, although R clearly started processing, I eventually got the following errors:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
Error: object 'my.df' not found
What did I do wrong?
First, simplify matters by being in the folder where the files are and try setting the pattern to read only files with the file ending '.csv', so something like
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This should get you a data.frame with the contents of all the tweets
A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:
read.csv('fred.csv', header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4))
Nb. changed so all columns are character, and ';' separated
I'd parse out the time later if it was needed...
A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like
my.new.df <- my.df[!duplicated(my.df$tweet),]
For unique by author, I'd append the two fields - hard to know what works without the real data though!
my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
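To make the difference between the two variants concrete, here is a small sketch on made-up data. It also shows that duplicated() accepts a data frame of several columns, which avoids pasting fields together; the three-row `tweets` data frame is purely illustrative.

```r
# Made-up data: the same text posted by two authors, plus one other tweet.
tweets <- data.frame(
  tweet  = c("hello", "hello", "bye"),
  author = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Globally unique: only the first "hello" survives.
unique_global <- tweets[!duplicated(tweets$tweet), ]

# Unique per author: both "hello" rows survive, since the authors differ.
unique_by_author <- tweets[!duplicated(tweets[c("tweet", "author")]), ]
```

In both cases duplicated() keeps the first occurrence, so the order in which the files are read decides which copy is retained.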
So bringing it all together and assuming a few things along the way...
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
Based on the revised warnings after line 3, it's a problem with files with different numbers of columns. This is not easy to fix in general except as you have suggested by having too many columns in the specification. If you remove the specification then you will run into problems when you try to rbind() the data.frames together...
Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header=FALSE, skip=1, sep=';',
                     col.names=c('ID','tweet','author','local.time','extra'),
                     colClasses=rep('character', 5))
  if (exists('tmp.df') && (nrow(tmp.df) > 0)) {
    cat(' successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1 # count this file
    # now lets append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam
    # now lets rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat(' read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
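Another way to find the broken files without reading them fully is count.fields() from the utils package, which reports the number of fields on each line. The sketch below flags files where any line has more than the expected four fields; the two sample files are made up, and comment.char is set to "" because the tweets contain "#" characters.

```r
# Two made-up files: one well formed, one with stray extra fields.
tweet_dir <- tempfile("tweets_")
dir.create(tweet_dir)
writeLines(c('"tweet";"author";"local.time"',
             '"1";"some text #tag";"A (A1)";"2012-06-05 00:01:45"'),
           file.path(tweet_dir, "good.csv"))
writeLines(c('"tweet";"author";"local.time"',
             '"1";"some text";"B (B1)";"2012-06-05 00:01:41";"stray";"field"'),
           file.path(tweet_dir, "bad.csv"))

filenames <- list.files(tweet_dir, pattern = "\\.csv$", full.names = TRUE)

# Largest number of fields found on any data line of each file (header skipped).
max_fields <- vapply(filenames, function(f) {
  n <- count.fields(f, sep = ";", quote = "\"", skip = 1, comment.char = "")
  max(n, na.rm = TRUE)
}, numeric(1))

# Files with more than the 4 expected fields (ID, tweet, author, local.time).
suspect <- basename(filenames[max_fields > 4])
print(suspect)
```

Files listed in `suspect` can then be repaired by hand or read with a wider col.names specification.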

Resources