Trouble reading content of .txt files in multiple subfolders into R

I have data of the structure:
Main_Text
Sub1_text
Sub2_text
Etc. (I have several hundred subfolders.)
Each subfolder contains multiple .txt files.
I want to read all of the files into R, to create a data frame that looks like this:
Filename | Text
Name of file | Content of .txt file
I've tried the following two approaches, and neither quite works. Any help would be appreciated.
1) Using the readtext package: although this package supposedly loops through subfolders, I cannot get it to do so. According to the readtext vignette, the code to loop through the files should work like this:
dir <- "/Users/Main_Folder"
text = readtext(paste0(dir, "/Main_Text/*.txt"))
This only produces an error:
Error in listMatchingFiles(i, ignoreMissing = ignoreMissing, lastRound = T) : File '' does not exist.
It works, however, if I specify the subfolder, i.e.
text = readtext(paste0(dir, "/Main_Text/Sub1_text*.txt"))
but given that I have several hundred subfolders, I need a more recursive solution.
2) I've also tried the following two step solution, where I create a list of the files first and then attempt to read in the text, which is also resulting in an error:
This generates an accurate list of all my files, but obviously doesn't include a step that reads the content:
setwd("/Users/Main_Folder")
dat = basename(list.files(pattern = ".txt$", recursive = TRUE, full.names=TRUE, include.dirs=TRUE))
So I also tried:
mypath="/Users/Main_Folder/"
txt_files_ls = list.files(path=mypath, recursive=T, pattern="*.txt")
Which works, however:
txt_files_df <- lapply(txt_files_ls, function(x) {read.table(file = x, header = F, fill=T, sep =",")})
Throws an error:
Error in read.table(file = x, header = F, fill = T, sep = ",") :
  no lines available in input
In addition: There were 42 warnings (use warnings() to see them)
If I specify
header=T
I get a different error:
Error in read.table(file = x, header = T, fill = T, sep = ",") :
  more columns than column names
In addition: Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
So I can't even get to the final step of combining them using something like
combined_df <- do.call("rbind", lapply(txt_files_df, as.data.frame))
I have a sense of why this is, given that the text files themselves don't have headers, and have random formatting (they're press releases). Here's a sample of one of my .txt files:
cat(readLines("Aderholt_text/Aderholt1-28-11.txt"), sep = "\n")
Friday January 28, 2011 Contact: Darrell "DJ" Jordan 202-225-4876 CONGRESSMAN ROBERT ADERHOLT STATEMENT ON THE VIOLENCE IN ALBANIA Washington, DC - Congressman Robert Aderholt (R-Alabama) today issued th
I'm sure I'm missing something small, but can anyone help illuminate how to correctly read in the filenames + text, either using one of the half-working solutions I've tried, or something else?
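For reference, here is a minimal base-R sketch of one way to build the Filename | Text data frame (assuming the files live under /Users/Main_Folder/Main_Text and are plain text; readLines sidesteps the column parsing that trips up read.table on free-form press releases):

files <- list.files("/Users/Main_Folder/Main_Text", pattern = "\\.txt$",
                    recursive = TRUE, full.names = TRUE)

# Collapse each file's lines into a single string; warn = FALSE tolerates
# files that lack a trailing newline
texts <- vapply(files,
                function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                character(1))

combined_df <- data.frame(Filename = basename(files),
                          Text = texts,
                          stringsAsFactors = FALSE,
                          row.names = NULL)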

Related

Read in a file which has part of the file name changing

I have some file names which look like the following:
Year1:
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
These are all located in the same file path. However, in a different file path for a different year, I have very similar files:
Year 2:
blds_PANEL_MHG_9817_9876
blds_PANEL_HKG_9817_9876
blds_PANEL_DPR_9817_9876
Some of the files have the same names as the previous years yet some of the names change. The only part of the name which changes is the MHG, HKG, DPR sections of the name, the blds_PANEL_ stays the same along with 9817_9876.
I have created a paste0() call:
file_path = "C:/Users..."
product = "blds"
part_which_keeps_changing = "HKG"
weeks = "9817_9876"
read.csv(paste0(file_path, product, product, part_which_keeps_changing, weeks, ".DAT"), header = TRUE)
It was working well for one product; however, for new products I am running into some errors. So I am trying to load in the data in a way that perhaps ignores this part of the file name.
EDIT: This seems to solve what I am looking to do
temp <- list.files(paste0(files, product), pattern = "*.DAT")
location <- paste0(files, product, temp)
myfiles = lapply(location, read.csv)
library(plyr)
df <- ldply(myfiles, data.frame)
However, I am running into a slightly different problem for some of the files.
If I have the following;
blds_PANEL_DPK_8237_8283
blds_PANEL_DPR_8237_8283
blds_PANEL_MWK_8237_8283
It is possible that one of the files contains no information, and when I apply lapply it breaks and stops loading in the data.
Is it possible to skip over these files? Here's the error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
EDIT 2:
This seems to override the lapply errors:
lapply_with_error <- function(X, FUN, ...) {
  lapply(X, function(x, ...) tryCatch(FUN(x, ...),
                                      error = function(e) NULL))
}
myfiles = lapply_with_error(location, read.delim)
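A follow-up sketch (assuming the plyr workflow from the first edit): files that failed to read come back as NULL entries, so dropping them explicitly before binding keeps the combined data frame clean:

# Drop entries for files that errored out (they are NULL)
myfiles <- Filter(Negate(is.null), myfiles)

library(plyr)
df <- ldply(myfiles, data.frame)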

Issues reading data as csv in R

I have a large data set (~20000 x 1). Not all the fields are filled; in other words, the data has missing values. Each feature is a string.
I have run the following code:
Input:
data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)
Output for the second code:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 80 elements
Input:
datar <- read.csv("data.csv", header = TRUE, na.strings = NA)
Output:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
I run into essentially four problems, as far as I can see. Two of the problems are the error messages stated above. The third is that, when no error message is produced, the global environment window shows that not all my rows are accounted for: roughly 14000 samples are missing, although the feature number is right. The fourth is that, again, not all the samples are accounted for, and the feature number is not correct either.
How can I solve this?
Try the argument comment.char = "" as well as quote. The hash (#) is being read by R as a comment and will cut the line short.
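A sketch of that advice applied to the read.table call from the question (adding sep = "," since the file is a CSV; quote = "" and comment.char = "" are the arguments being suggested):

datan <- read.table("data.csv", header = TRUE, fill = TRUE, sep = ",",
                    quote = "",         # don't treat quote characters as text delimiters
                    comment.char = "",  # don't let '#' cut lines short
                    stringsAsFactors = FALSE)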
Can you open the CSV using Notepad++? This will allow you to see 'invisible' characters and any other non-printable characters. That file may not contain what you think it contains! When you get the sourcing issue resolved, you can choose the CSV file with a selector tool.
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Or, hard-code the path, and read the data into R.
# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")

Problems reading in table with unclear line-end symbol

I am currently trying to read in a .txt file.
I have researched here and found Error in reading in data set in R - however, it did not solve my problem.
The data are political contributions listed by the Federal Election Commission of the U.S. at ftp://ftp.fec.gov/FEC/2014/webk14.zip
Upon inspection of the .txt, I realized that the data is weirdly structured. In particular, the end of any line is not separated at all from the first cell of the next line (not by a "|", not by a space).
Strangely enough, import via Excel and Access seems to work just fine. However, R import does not work.
To avoid the Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 90 did not have 27 elements error, I use the following command:
webk14 <- read.table(file = "webk14.txt", header = FALSE, fill = TRUE,
                     colClasses = "character", sep = "|",
                     stringsAsFactors = FALSE, dec = ".",
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts", "trans_from_aff",
                                   "indv_contrib", "other_pol_cmte_contrib", "cand_contrib",
                                   "cand_loans", "ttl_loans_received", "ttl_disb",
                                   "tranf_to_aff", "indv_refunds", "other_pol_cmte_refunds",
                                   "cand_loan_repay", "loan_repay", "coh_bop", "coh_cop",
                                   "debts_owed_by", "nonfed_trans_received",
                                   "contrib_to_other_cmte", "ind_exp", "pty_coord_exp",
                                   "nonfed_share_exp", "cvg_end_dt"))
This does not result in an error; however, the results (a) have a different line count than the Excel import and (b) fail to correctly separate columns (which is probably the reason for (a)).
I would like not to do a detour via Excel and directly import into R. Any ideas what I am doing wrong?
It might be related to comment symbols inside the values, so turn off interpretation of these using comment.char = "", which gives you:
webk14 <- read.table(file = "webk14.txt", header = FALSE, fill = TRUE,
                     colClasses = "character", comment.char = "", sep = "|",
                     stringsAsFactors = FALSE, dec = ".",
                     col.names = c("cmte_id", "cmte_nm", "cmte_tp", "cmte_dsgn",
                                   "cmte_filing_freq", "ttl_receipts", "trans_from_aff",
                                   "indv_contrib", "other_pol_cmte_contrib", "cand_contrib",
                                   "cand_loans", "ttl_loans_received", "ttl_disb",
                                   "tranf_to_aff", "indv_refunds", "other_pol_cmte_refunds",
                                   "cand_loan_repay", "loan_repay", "coh_bop", "coh_cop",
                                   "debts_owed_by", "nonfed_trans_received",
                                   "contrib_to_other_cmte", "ind_exp", "pty_coord_exp",
                                   "nonfed_share_exp", "cvg_end_dt"))
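As a quick sanity check on the import (a sketch, assuming webk14.txt sits in the working directory), compare the raw line count with the parsed row count; a large gap suggests rows are still being merged or split during parsing:

length(readLines("webk14.txt"))  # raw lines in the file
nrow(webk14)                     # rows actually parsed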

Converting twitteR results to data frame

I have a simple for loop to write the past 100 tweets of a few usernames to .csv files:
library(twitteR)
mclist <- read.table('usernames.txt')
for (mc in mclist)
{
  tweets <- userTimeline(mc, n = 100)
  df <- do.call("rbind", lapply(tweets, as.data.frame))
  write.csv(df, file = paste("Desktop/", mc, ".csv", sep = ""), row.names = F)
}
I mostly followed what I've read on StackOverflow but I continue to get this error message:
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
the condition has length > 1 and only the first element will be used
Where did I go wrong?
I just cleaned up the code a bit, and everything started working.
Step 1: Let's set the working directory and load the 'twitteR' package.
library(twitteR)
setwd("C:/Users/Dinre/Desktop") # Replace with your desired directory
Step 2: First, we need to load a list of user names from a flat text file. I'm assuming that each line in the text file has one username, like so:
[contents of usernames.txt]
edclef
notch
dkanaga
Let's load it using the 'scan' function to read each line into an array:
mclist <- scan("usernames.txt", what="", sep="\n")
Step 3: We'll loop through the usernames, just like you did before, but we're not going to refer to the directory, since we're going to use the same directory for output as input. The original code had a syntax error in attempting to refer to the desktop directory, and we're just going to sidestep that.
for (mc in mclist) {
  tweets <- userTimeline(mc, n = 100)
  df <- do.call("rbind", lapply(tweets, as.data.frame))
  write.csv(df, file = paste(mc, ".csv", sep = ""), row.names = F)
}
I end up with three files on the desktop, and all the data seems to be correct.
edclef.csv
notch.csv
dkanaga.csv
Update: If you really want to refer to different directories within your code, use the '.' character to refer to the current working directory. For instance, if your working directory is your Windows user profile, you would refer to the 'Desktop' folder like so:
setwd("C:/Users/Dinre")
...
write.csv(df, file = paste("./Desktop/", mc, ".csv", sep = ""), row.names = F)
There's a convenience function in the package, twListToDF, which will handle the conversion of the list of tweets to a data.frame.
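A sketch of the earlier loop with twListToDF swapped in for the do.call("rbind", ...) idiom (same working-directory assumptions as above):

for (mc in mclist) {
  tweets <- userTimeline(mc, n = 100)
  df <- twListToDF(tweets)  # convert the list of status objects in one call
  write.csv(df, file = paste0(mc, ".csv"), row.names = FALSE)
}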
Since your mclist is a data.frame, you can replace your for loop with apply:
apply(mclist, 1, function(mc) {
  tweets <- userTimeline(mc, n = 100)
  df <- do.call("rbind", lapply(tweets, as.data.frame))
  write.csv(df, file = paste("Desktop/", mc, ".csv", sep = ""), ##!! Change Desktop to
            ## something like Desktop/tweets/
            row.names = F)
})
PS: The userTimeline function will only work if the requested user has a public timeline, or if you have previously registered an OAuth object using registerTwitterOAuth.
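For reference, later versions of the twitteR package authenticate with setup_twitter_oauth instead (a sketch with placeholder credentials obtained from a Twitter developer app):

library(twitteR)

# Placeholder credentials; substitute the values from your own app
setup_twitter_oauth(consumer_key = "KEY", consumer_secret = "SECRET",
                    access_token = "TOKEN", access_secret = "TOKEN_SECRET")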

Merge multiple CSV files and remove duplicates in R

I have almost 3,000 CSV files (containing tweets) with the same format. I want to merge these files into one new file and remove the duplicate tweets. I have come across various topics discussing similar questions; however, the number of files is usually quite small. I hope you can help me write code within R that does this job both efficiently and effectively.
The CSV files have the following format:
[Image of the CSV format omitted.] I changed (in columns 2 and 3) the usernames (on Twitter) to A-E and the 'actual names' to A1-E1.
Raw text file:
"tweet";"author";"local.time"
"1";"2012-06-05 00:01:45 #A (A1): Cruijff z'n met-zwart-shirt-zijn-ze-onzichtbaar logica is even mooi ontkracht in #bureausport.";"A (A1)";"2012-06-05 00:01:45"
"2";"2012-06-05 00:01:41 #B (B1): Welterusten #BureauSport";"B (B1)";"2012-06-05 00:01:41"
"3";"2012-06-05 00:01:38 #C (C1): Echt ..... eindelijk een origineel sportprogramma #bureausport";"C (C1)";"2012-06-05 00:01:38"
"4";"2012-06-05 00:01:38 #D (D1): LOL. \"Na onderzoek op de Fontys Hogeschool durven wij te stellen dat..\" Want Fontys staat zo hoog aangeschreven? #bureausport";"D (D1)";"2012-06-05 00:01:38"
"5";"2012-06-05 00:00:27 #E (E1): Ik kijk Bureau sport op Nederland 3. #bureausport #kijkes";"E (E1)";"2012-06-05 00:00:27"
Somehow my headers are messed up; they obviously should move one column to the right. Each CSV file contains up to 1500 tweets. I would like to remove the duplicates by checking the 2nd column (containing the tweets), simply because these should be unique, whereas the author columns can be similar (e.g. one author posting multiple tweets).
Is it possible to combine merging the files and removing the duplicates, or is this asking for trouble and should the processes be separated? As a starting point I included links to two blog posts from Hayward Godwin that discuss three approaches for merging CSV files.
http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/
http://psychwire.wordpress.com/2011/06/05/testing-different-methods-for-merging-a-set-of-files-into-a-dataframe/
Obviously there are some topics related to my question on this site as well (e.g. Merging multiple csv files in R) but I haven't found anything that discusses both merging and removing the duplicates. I really hope you can help me and my limited R knowledge deal with this challenge!
Although I have tried some code I found on the web, this didn't actually result in an output file. The approximately 3,000 CSV files have the format discussed above. I mainly tried the following code (for the merge part):
filenames <- list.files(path = "~/")
do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This results in the following error:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file '..': No such file or directory
Update
I have tried the following code:
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header = FALSE, skip = 1, sep = ';',
           col.names = c('ID', 'tweet', 'author', 'local.time'),
           colClasses = rep('character', 4))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
But I run into the following errors:
After the 3rd line I get:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
After the 4th line I get:
Error: object 'my.df' not found
I suspect that these errors are caused by some failures in the writing process of the csv files, since there are some cases of the author/local.time being in the wrong column, either to the left or the right of where they are supposed to be, which results in an extra column. I manually adapted 5 files and tested the code on them; I didn't get any errors. However, it seemed like nothing happened at all: I didn't get any output from R.
To solve the extra column problem I adjusted the code slightly:
#grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) {
  read.csv(fnam, header = FALSE, skip = 1, sep = ';',
           col.names = c('ID', 'tweet', 'author', 'local.time', 'extra'),
           colClasses = rep('character', 5))
}
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
I tried this code on all the files, although R clearly started processing, I eventually got the following errors:
Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_22_30 2012 .csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, : incomplete final line found by readTableHeader on 'Twitts - di mei 29 19_24_31 2012 .csv'
Error: object 'my.df' not found
What did I do wrong?
First, simplify matters by being in the folder where the files are, and try setting the pattern to read only files ending in '.csv', so something like:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
my.df <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
This should get you a data.frame with the contents of all the tweets.
A separate issue is the headers in the csv files. Thankfully you know that all files are identical, so I'd handle those something like this:
read.csv('fred.csv', header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4))
N.B. changed so all columns are character, and ';'-separated.
I'd parse out the time later if it was needed...
A further separate issue is the uniqueness of the tweets within the data.frame - but I'm not clear if you want them to be unique to a user or globally unique. For globally unique tweets, something like
my.new.df <- my.df[!duplicated(my.df$tweet),]
For unique by author, I'd append the two fields - hard to know what works without the real data though!
my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
So bringing it all together and assuming a few things along the way...
# grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
# write a special little read.csv function to do exactly what we want
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';',
col.names=c('ID','tweet','author','local.time'),
colClasses=rep('character', 4)) }
# read in all those files into one giant data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
# remove the duplicate tweets
my.new.df <- my.df[!duplicated(my.df$tweet),]
Based on the revised warnings after line 3, it's a problem with files with different numbers of columns. This is not easy to fix in general except as you have suggested by having too many columns in the specification. If you remove the specification then you will run into problems when you try to rbind() the data.frames together...
Here is some code using a for() loop and some debugging cat() statements to make more explicit which files are broken so that you can fix things:
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
n.files.processed <- 0 # how many files did we process?
for (fnam in filenames) {
  cat('about to read from file:', fnam, '\n')
  if (exists('tmp.df')) rm(tmp.df)
  tmp.df <- read.csv(fnam, header = FALSE, skip = 1, sep = ';',
                     col.names = c('ID', 'tweet', 'author', 'local.time', 'extra'),
                     colClasses = rep('character', 5))
  if (exists('tmp.df') & (nrow(tmp.df) > 0)) {
    cat('  successfully read:', nrow(tmp.df), ' rows from ', fnam, '\n')
    n.files.processed <- n.files.processed + 1  # count this file as processed
    # now lets append a column containing the originating file name
    # so that debugging the file contents is easier
    tmp.df$fnam <- fnam
    # now lets rbind everything together
    if (exists('my.df')) {
      my.df <- rbind(my.df, tmp.df)
    } else {
      my.df <- tmp.df
    }
  } else {
    cat('  read NO rows from ', fnam, '\n')
  }
}
cat('processed ', n.files.processed, ' files\n')
my.new.df <- my.df[!duplicated(my.df$tweet),]
