I have a batch of text files that I need to read into R to do text mining.
So far I have tried read.table, readLines, lapply, and mcsv_r from the qdap package, to no avail. I have also tried to write a loop that reads the files, but I have to specify the file name, which changes on every iteration.
Here is what I have tried:
# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"
# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")
for(i in 1:length(speeches))
{
text_df <- do.call(rbind, lapply(speeches[i], read.csv))
}
Moreover, I have tried the following:
library(data.table)
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )
And it is giving me this error, even though inaugAbrahamLincoln-1.csv clearly exists in the folder:
files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) :
File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
>
But that approach only works on .csv files, not on .txt files.
Is there a simpler way to do text mining across multiple source files? If so, how?
Thanks
I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:
textreadr::read_dir("../data/InauguralSpeeches/")
Your example is not reproducible, so I make one below (please make your example reproducible in the future).
library(textreadr)
## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')
## the read in of a directory
read_dir('delete_me')
Output:
The output below shows a tibble, with each document registered in the document column. For every line in a document there is one row in the tibble. Depending on what's in the csv files, this may not be fine-grained enough.
## document content
## 1 0_9 Bromwell High is a cartoon comedy. It ra
## 2 00_00 test
## 3 00_00
## 4 00_00 testing
## 5 00_00
## 6 00_00 tester
## 7 1_7 If you like adult comedy cartoons, like
## 8 10_9 I'm a male, not given to women's movies,
## 9 11_9 Liked Stanley & Iris very much. Acting w
## 10 12_9 Liked Stanley & Iris very much. Acting w
## .. ... ...
## 141 mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
Here is code that will read all the *.csv files in a directory into a single data.frame:
dir <- '~/Desktop/testcsv/'
files <- list.files(dir,pattern = '*.csv', full.names = TRUE)
data <- lapply(files, read.csv)
df <- do.call(rbind, data)
Notice that I added the argument full.names = TRUE. This gives you the full paths rather than the bare file names, which is why you're getting an error for "inaugAbrahamLincoln-1.csv" even though it exists: without the path, fread looks for it in your current working directory.
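With that fix, a minimal working sketch of your data.table attempt (assuming the same folder of comma-separated files from your second attempt) would be:
library(data.table)
# full.names = TRUE returns paths that fread can resolve from any working directory
files <- list.files(path = "../data/InauguralSpeeches/", pattern = "\\.csv$",
                    full.names = TRUE)
temp <- lapply(files, fread, sep = ",")
data <- rbindlist(temp)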
Here is one way to do it.
library(data.table)
setwd("C:/Users/Excel/Desktop/CSV Files/")
WD = "C:/Users/Excel/Desktop/CSV Files/"
# read headers
data <- data.table(read.csv(text = "CashFlow,Cusip,Period"))
csv.list <- list.files(WD)
k = 1
for (i in csv.list) {
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  if (k %% 100 == 0)
    print(k / length(csv.list))
  k <- k + 1
}
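Note that rbind-ing inside the loop copies the accumulated table on every pass, which gets slow with many files. A common faster pattern (a sketch, using the same folder) is to read everything into a list and bind once:
library(data.table)
csv.list <- list.files("C:/Users/Excel/Desktop/CSV Files/", pattern = "\\.csv$",
                       full.names = TRUE)
# read every file into a list, then bind the list in a single call
data <- rbindlist(lapply(csv.list, fread))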
Sorry for the generic question. I'm looking for pointers for sorting out a data folder in which I have numerous .txt files. All of them have different titles, and for the vast majority the files have the same dimensions, that is, the number of columns is the same. The pain, however, is that some of the files, despite having the same number of columns, have different column names; in those files, other variables were measured.
I want to weed out these files, and I cannot do it by simply comparing column numbers. Is there a method where I can pass the name of a column and check how many files in the directory have that column, so that I can move them into a different folder?
UPDATE:
I have created a dummy folder with files that reflect the problem.
Please see the link below to access the files on my Google Drive. In this folder I have put 4 files that have the problem columns.
https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing
The problem is that the code seems to be able to find files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real index of such files in the list. Any pointers?
library(data.table)
#read in the example file that has the problem column content
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")
#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")
#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)
same.titles <- var.names %in% standar.names
dff.titles <- !var.names %in% standar.names
#confirm that the only 3 problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])
#visually check the names of the problematic columns
mismatched.names
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique problem column names
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements
#but I have fewer than 10 files, while the numbers in to_keep are around 1000
#this is probably because it's matching against the index in the unlisted list
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector
to_keep <- which(unlist(column_names) %in% unique_names[1])
#now if I want to slice the files using to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]
#once I have the list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )
If you can distinguish the files you'd like to keep from those you'd like to drop depending on the column names, you could use something along these lines:
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}
# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])
files_to_keep <- files_in_wd[to_keep]
If you have many files, you should probably avoid the loop, or read in only the header of each file.
edit after your comment:
By adding nrows = 2 the code reads only the first 2 rows plus the header.
I assume that the first file in the folder has the structure you'd like to keep; that's why column_names is checked against unique_names[1].
files_to_keep contains the names of the files you'd like to keep.
You could try to run this on a subset of your data, see if it works, and worry about efficiency later. A vectorized approach might work better, I think.
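As a sketch of the header-only idea (comparing the raw first line of each file against the first file's, so it assumes consistent delimiters and encoding):
# read just the header line of each file
headers <- vapply(files_in_wd, function(f) readLines(f, n = 1), character(1))
# keep files whose header matches the first file's header
to_keep <- which(headers == headers[1])
files_to_keep <- files_in_wd[to_keep]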
edit:
This code works with your dummy-data.
library(filesstrings)
# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")
# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)
# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE)
}
# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok
# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
                            'filename' = files_in_wd,
                            'keep' = NA)
for(i in 2:length(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}
df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns
# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept
file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")
Due to the large number and size of files, it might be worth looking at alternatives to R, e.g. bash:
for f in ctrl*.txt
do
if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 $f | md5)" ]]
then echo "$f"
fi
done
This loop compares the header line of the 'good file' to that of every file and prints the names of the files that do not match.
I would like to:
get certain data on page 2 for every element in a list I created (pdf files)
the data from page 2 (for Bond Futures CGB ... columns 2, 11 and 16)
create a data frame aggregating all this data
Year | Month | Metric
2013 January Monthly Volume
2013 January Month End Open Interest
2013 January Transactions
I have tried the following but haven't gotten far at all - my apologies.
library(rvest)
library(pdftools)
library(tidyverse)
filepath <- "~R Working Directory/CanadianFutures"
files <- list.files(path = filepath, pattern = '*.pdf')
The variable files contains the list:
[1] "1301_stats_en.pdf" "1302_stats_en.pdf" "1303_stats_en.pdf" "1304_stats_en.pdf" "1305_stats_en.pdf" "1306_stats_en.pdf"
[7] "1307_stats_en.pdf" "1308_stats_en.pdf" "1309_stats_en.pdf" "1310_stats_en.pdf" "1311_stats_en.pdf" "1312_stats_en.pdf"
[13] "1401_stats_en.pdf" "1402_stats_en.pdf" "1403_stats_en.pdf" "1404_stats_en.pdf" "1405_stats_en.pdf" "1406_stats_en.pdf".....[61] "1801_stats_en.pdf" "1802_stats_en.pdf" "1803_stats_en.pdf" "1804_stats_en.pdf" "1805_stats_en.pdf"
I have tried the following to get page 2 of each pdf but am totally lost:
all <- lapply(files, function(x) {
  txt <- pdf_text(x)
  page_2 <- txt[2]
})
I get the following:
Error in normalizePath(pdf, mustWork = TRUE) :
path[1]="1301_stats_en.pdf": No such file or directory
All the pdfs in my list have the same consistent formatting.
Here is an example of the pdf https://www.m-x.ca/f_stat_en/1401_stats_en.pdf
Thank you
Make sure your working directory is the same as where you stored your files:
getwd()
Another option is to build your list of files as complete paths:
files <- list.files(filepath, pattern = '*.pdf', full.names = T)
>files
[1] "Downloads/naamloze map//1401_stats_en-2.pdf"
[2] "Downloads/naamloze map//1401_stats_en.pdf"
PDFreader <- function(x){
  txt <- pdf_text(x)
  txt[2]  # keep only page 2
}
lapply(files, PDFreader)
returns
[[1]]
[1]..... text....
[[2]]
[1]..... text....
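If you also want to remember which file each page came from, a small addition (sketch) is:
pages <- lapply(files, PDFreader)
# name each result by its source file
names(pages) <- basename(files)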
Good luck
I would like to import and bind together, into a single R data frame, specific csv files named "number.CSV" (e.g. 3437.CSV), which sit in a folder along with other csv files that I do not want to import.
How can I select only the ones that interest me?
I have a list of all the csv files that I need; the following column contains some of them.
CODE
49002
47001
64002
84008
46003
45001
55008
79005
84014
84009
45003
45005
51001
55012
67005
19004
7003
55023
55003
76004
21013
I have 364 csv files to read and bind.
N.B. I can't select all the "*.csv" files from my folder because there are other files that I do not need.
Thanks
You could iterate over the list of CSV files of interest, read in each one, and bind it to a common data frame:
path <- "path/to/folder/"
ROOT <- c("49002", "47001", "21013")
files <- paste0(path, ROOT, ".csv")
# read each file, then bind all of them into one data frame
all_files_df <- do.call(rbind, lapply(files, read.csv))
Just make file names out of your numeric codes (assuming they are in a vector code):
filenames = paste(code, 'csv', sep = '.')
# [1] "49002.csv" "47001.csv" "64002.csv" …
You might need to specify the full path to the files as well:
directory = '/example/path'
filenames = file.path(directory, filenames)
# [1] "/example/path/49002.csv" "/example/path/47001.csv" "/example/path/64002.csv" …
And now you can simply read them into R in one go:
data = lapply(filenames, read.csv)
Or, if your CSV files don't have column headers (this is the case, in particular, when a file's lines have different numbers of items!):
data = lapply(filenames, read.csv, header = FALSE)
This will give you a list of data.frames. If you want to bind them all into one table, use
data = do.call(rbind, data)
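One caveat: do.call(rbind, data) requires every data frame to have exactly the same columns. If some of your files differ, dplyr::bind_rows (assuming you have dplyr installed) fills missing columns with NA instead of erroring:
library(dplyr)
# tolerant alternative: columns missing from a file become NA
data = bind_rows(lapply(filenames, read.csv))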
I don't know if you can do that directly from the .CSV files. What you can do is read in all your data and then use the cbind command.
For example:
data1 <- read.table("~/YOUR/DATA", quote="\"", comment.char="")
data2 <- read.table("~/YOUR/DATA", quote="\"", comment.char="")
data3 <- read.table("~/YOUR/DATA", quote="\"", comment.char="")
And then:
df <- cbind(data1$Col1, data2$col3...)
Where Col1 and col3 are the names of the columns that you want.
I have a text file, and I wrote R code to extract a certain line of information from it.
###Read file and format
txt_files <- list.files(pattern = '*.txt')
text <- lapply(txt_files, readLines)
text <- sapply(text, function(x) iconv(x, "latin1", "ASCII", sub=""))
###Search and store grep
l =grep("words" ,text)
(k<- length(l))
###Matrix to store data created
mat <- matrix(data = NA, nrow = k, ncol = 2)
nrow(mat)
###Main
for (i in 1:k) {
  u = 1
  while (text[(l[i]) - u] != "") {
    line.num = u
    u = u + 1
  }
  mat[i, 2] <- text[(l[i]) - u - 1]
  mat[i, 1] <- i
}
###Write the output file
write.csv(mat, file = "Evaluation.csv")
It runs on one file at a time. I need to run it on many files and append all the results into a single file, with an additional column that tells me the name of the file each result came from. I am unable to come up with a solution. What changes do I make?
Applying your operations to all files in a folder:
txt_files <- list.files(pattern = '*.txt')
# Apply all your operations to all txt_files using a for loop; use the index (txt_files[i]) wherever you refer to the current file
for (i in 1:length(txt_files)) {
  # Operation 1
  # Operation 2
  # Operation 3
  write.table(mat, file = paste0("./", sub(".txt", "", txt_files[i]), ".csv"),
              row.names = F, quote = F, sep = ",")
}
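If instead you want everything appended into a single file with a column recording the source, one option (a sketch, assuming each iteration builds the matrix mat as in your code, and writing to a hypothetical combined file) is:
results <- vector("list", length(txt_files))
for (i in seq_along(txt_files)) {
  # ... your operations that build mat for txt_files[i] ...
  results[[i]] <- cbind(source_file = txt_files[i], mat)
}
write.csv(do.call(rbind, results), file = "Evaluation_all.csv", row.names = FALSE)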
Merging files with the same headers: I have two csv files with the same headers, Data and Value; the file names are File1.csv and File2.csv, inside a Header folder. I merge them to get one header and all the rows and columns. Make sure all the files have the same number of columns, with the same headers in the same order.
## Read into a list of files, an Example below
library(plyr)
library(gdata)
setwd("./Header") # CSV Files to be merged are in this direcory
## Read into a list of files:
filenames <- list.files(path="./",pattern="*.csv")
fullpath=file.path("./",filenames)
print (filenames)
print (fullpath)
dataset <- do.call("rbind", lapply(filenames, FUN = function(files) {
  read.table(files, sep = ",", header = T)
}))
dataset
# Data Value
# 1 ABC 23
# 2 PQR 33
# 3 MNP 43 # Till here was File1.csv
# 4 AC 24
# 5 PQ 34
# 6 MN 44 # Till here was File2.csv
write.table(dataset,file="dataset.csv",sep=",",quote=F,row.names=F,col.names=T)
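To verify up front that the files really do share the same headers in the same order, a quick sketch using the filenames vector from above:
# read only the first data row of each file and compare its column names against the first file's
headers <- lapply(filenames, function(f) names(read.csv(f, nrows = 1)))
stopifnot(all(vapply(headers, identical, logical(1), headers[[1]])))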
Let's take the following simplified version of a dataset that I import using read.table:
a<-as.data.frame(c("M","M","F","F","F"))
b<-as.data.frame(c(25,22,33,17,18))
df<-cbind(a,b)
colnames(df)<-c("Sex","Age")
In reality my dataset is extremely large, and I'm only interested in a small proportion of the data, i.e. the data concerning females aged 18 or under. In the example above this would be just the last 2 observations.
My question is: can I import just those observations immediately, without importing the rest of the data and then using subset to refine my data frame? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time-consuming.
Is there a better solution?
Some approaches that might work:
1 - Use a package like ff that can help you with RAM issues.
2 - Use other tools/languages to clean your data before loading it into R.
3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to an .RData file and read from that file instead of calling read.table:
# save each txt file once...
save.rdata = function(filepath, filebin) {
  dataset = read.table(filepath)
  save(dataset, file = paste(filebin, ".RData", sep = ""))
}
# then read from the .RData
get.dataset = function(filebin) {
  load(filebin)
  return(dataset)
}
This is much faster than reading from a txt file, but I'm not sure if it applies to your case.
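For example, with a hypothetical big_file.txt, the round trip would look like:
save.rdata("big_file.txt", "big_file")   # one-time conversion
dataset = get.dataset("big_file.RData")  # fast reload in later sessions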
There should be several ways to do this. Here is one using SQL.
library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")
> result
Sex Age
1 F 17
2 F 18
There is also a read.csv.sql function, which you can combine with the above statement to filter while reading and avoid pulling the whole text file into memory!
This is almost the same as @Drew75's answer, but I'm including it to illustrate some gotchas with SQLite:
# example: large-ish data.frame
df <- data.frame(Sex=sample(c("M","F"),1e6,replace=T),
Age=sample(18:75,1e6,replace=T))
write.csv(df, "myData.csv", quote=F, row.names=F) # note: non-quoted strings
library(sqldf)
myData <- read.csv.sql(file="myData.csv", # looks for char M (no qoutes)
sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127
write.csv(df, "myData.csv", row.names=F) # quoted strings...
myData <- read.csv.sql(file="myData.csv", # this fails
sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0
myData <- read.csv.sql(file="myData.csv", # need quotes in the char literal
sql="select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127