I am trying to read several text documents into a data frame in R. My desired output is a data frame with two columns:
DOCUMENT     CONTENT
Document A   This is the content.
Document B   This is the content.
Document C   This is the content.
Within the column "CONTENT", all the text from the corresponding text document (a 10-K report) should be shown.
> setwd("C:/Users/folder")
> folder <- getwd()
> corpus <- Corpus(DirSource(directory = folder, pattern = "*.txt"))
This creates a corpus that I can tokenize, but I don't manage to convert it into a data frame or into my desired output.
Can somebody help me?
If you're only working with .txt files and your end goal is a data frame, then I think you can skip the corpus step and simply read in all your files as a list. The hard part is getting the names of the .txt files into a column called DOCUMENT, but this can be done in base R.
# make a reproducible example
a <- "this is a test"
b <- "this is a second test"
c <- "this is a third test"
write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")
# get working dir
folder <- getwd()
# get names/locations of all files
filelist <- list.files(path = folder, pattern = "*.txt", full.names = FALSE)
# read in the files and put them in a list
lst <- lapply(filelist, readLines)
# extract the names of the files without the `.txt` stuff
names(lst) <- filelist
namelist <- fs::path_file(filelist)
namelist <- unlist(lapply(namelist, sub, pattern = ".txt", replacement = ""),
use.names = FALSE)
# give every matrix in the list its own name, which was its original file name
lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)
# combine into a dataframe
x <- do.call(rbind.data.frame, lst)
# a small amount of clean-up
rownames(x) <- NULL
names(x)[names(x) == "V1"] <- "CONTENT"
x <- x[,c(2,1)]
x
#> DOCUMENT CONTENT
#> 1 a this is a test
#> 2 b this is a second test
#> 3 c this is a third test
I have radiotelemetry data that is downloaded as a series of text files. I was provided with code in 2018 that looped through all the text files and converted them into CSV files. Up until 2021 this code worked. However, the code below (specifically the lapply loop) now returns the following error:
"Error in setnames(x, value) :
Can't assign 1 names to a 4 column data.table"
# set the working directory to the folder that contain this script, must run in RStudio
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))
# get the path to the master data folder
path_to_data <- paste(getwd(), "data", sep = "/", collapse = NULL)
# extract .TXT file
files <- list.files(path=path_to_data, pattern="*.TXT", full.names=TRUE, recursive=TRUE)
# regular expression of the record we want
regex <- "^\\d*\\/\\d*\\/\\d*\\s*\\d*:\\d*:\\d*\\s*\\d*\\s*\\d*\\s*\\d*\\s*\\d*"
# vector of column names, no whitespace
columns <- c("Date", "Time", "Channel", "TagID", "Antenna", "Power")
# loop through all .TXT files, extract valid records and save to .csv files
lapply(files, function(file){
  df <- read_table(file) # read the .TXT file to a DataFrame
  dt <- data.table(df) # convert the dataframe to a more efficient data structure
  colnames(dt) <- c("columns") # modify the column name
  valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
  valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
  towner_name <- str_sub(basename(file), start = 1, end = 2) # extract tower name
  valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
  file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep=""))
  write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I looked up possible fixes for this and found that using "setnames(skip_absent=TRUE)" in the loop resolved the setnames error, but it instead gave the error "Error in is.data.frame(x) : argument "x" is missing, with no default"
lapply(files, function(file){
  df <- read_table(file) # read the .TXT file to a DataFrame
  dt <- data.table(df) # convert the dataframe to a more efficient data structure
  setnames(skip_absent = TRUE)
  colnames(dt) <- c("col") # modify the column name
  valid <- dt %>% filter(str_detect(col, regex)) # filter based on regular expression
  valid <- separate(valid, col, into = columns, sep = "\\s+") # split into columns
  towner_name <- str_sub(basename(file), start = 1, end = 2) # extract tower name
  valid$Tower <- rep(towner_name, nrow(valid)) # add Tower column
  file_path <- file.path(dirname(file), paste(str_sub(basename(file), end = -5), ".csv", sep=""))
  write.csv(valid, file = file_path, row.names = FALSE, quote = FALSE) # save to .csv
})
I'm confused as to why this code is no longer working despite working fine last year. Any help would be greatly appreciated!
The error occurred at the line colnames(dt) <- c("columns"), where you provided only one name for a (supposedly) 4-column data.table. If you meant to rename a particular column, you can use
colnames(dt)[i] <- c("columns")
where i is the index of the column you are renaming. Alternatively, provide a vector with 4 new names.
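For the second option, here is a minimal sketch; the four names are placeholders, since the actual columns of the .TXT files aren't shown in the question:
colnames(dt) <- c("col1", "col2", "col3", "col4")    # base R style
setnames(dt, c("col1", "col2", "col3", "col4"))      # or data.table's own helper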
So I am tasked with building an Excel spreadsheet cataloging a drive with various nested folders and files.
This SO question gets me somewhat there, but I am confused about how to get my desired output.
I know that there might be a command to get file info, and that I can break that info into these columns.
Apart from splitting the directories into sub-directory columns, an adaptation of the function from the question's link (Stibu's answer) might be of help.
rfl <- function(path) {
  folders <- list.dirs(path, recursive = FALSE, full.names = FALSE)
  if (length(folders) == 0) {
    # leaf folder: collect the file metadata into a data frame
    files <- list.files(path, full.names = TRUE)
    finfo <- file.info(files)
    Filename <- basename(files)
    FileType <- tools::file_ext(files)
    DateModified <- finfo$mtime
    FullFilePath <- dirname(files)
    size <- finfo$size
    data.frame(Filename, FileType, DateModified, FullFilePath, size)
  } else {
    # otherwise recurse into each sub-folder and keep the results as a named list
    sublist <- lapply(paste0(path, "/", folders), rfl)
    setNames(sublist, folders)
  }
}
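A hedged usage sketch (the path is only an illustration): with a single level of sub-folders, rfl() returns a named list of data frames that can be stacked into one data frame; deeper nestings would need a recursive flatten first.
catalog <- rfl("I:/Administration")     # hypothetical path
catalog_df <- do.call(rbind, catalog)   # stack the per-folder data frames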
If you have the full path and file names then you can loop through that and parse it into these columns. You can get more file info with file.info:
files <- c("I:/Administration/Budget/2015-BUDGET DOCUMENT.xlsx",
"I:/Administration/Budget/2014-2015 Budget/BUDGET DOCUMENT.xlsx")
# files <- list.files("I:", recursive = T, full.names = T) # this could take a while to run
file_info <- vector(mode = "list", length = length(files))  # pre-allocate one list slot per file
for (i in seq_along(files)){
  fullpath <- dirname(files[i])
  fullname <- basename(files[i])
  file_ext <- unlist(strsplit(fullname, ".", fixed = T))
  file_meta <- file.info(files[i])[c("size", "mtime")]
  path <- unlist(strsplit(fullpath, "/", fixed = T))[-1]
  file_info[[i]] <- unlist(c(file_ext, file_meta, fullpath, path))
}
l <- lapply(file_info, `length<-`, max(lengths(file_info)))
df <- data.frame(do.call(rbind, l))
names(df) <- c("filename", "extension", "size", "modified", paste0("sub", 1:(ncol(df) - 4)))
rownames(df) <- NULL
df$modified <- as.POSIXct.numeric(as.numeric(df$modified), origin = "1970-01-01")
df$size <- as.numeric(df$size)
If you do not have the files you can recursively search the drive using list.files() with recursive = T: list.files("I:", recursive = T, full.names = T)
Note:
l <- lapply(file_info, `length<-`, max(lengths(file_info))) sets the vector length of each list element to be the same. This is necessary because otherwise, when vectors of unequal length are stacked, values get recycled. A simple example of this is rbind(1:3, 1:5).
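Running that example shows the recycling directly: the shorter vector is reused to fill the extra columns, and R also warns that the result's column count is not a multiple of the shorter vector's length.
rbind(1:3, 1:5)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    2    3    1    2
#> [2,]    1    2    3    4    5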
The output of unlist(c(file_ext, file_meta, fullpath, path)) is a vector and vectors in R are atomic, meaning all elements have to be the same class. That means everything gets converted to character in this case, which is why we have the lines df$modified <- ... and df$size <- ... at the end to convert them to their appropriate type.
If you want to output this data frame to Excel, check out xlsx::write.xlsx or openxlsx::write.xlsx. If you don't have those libraries installed you'll need to use install.packages() first.
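For example (the output file name here is only an illustration):
openxlsx::write.xlsx(df, "file_catalog.xlsx")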
Output
Because these files/locations don't actually exist on my computer there are NA values in the size and date modified fields:
filename extension size modified sub1 sub2 sub3 sub4
1 2015-BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget Administration Budget <NA>
2 BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget/2014-2015 Budget Administration Budget 2014-2015 Budget
I am comparing two csv files at a time. The files each end with a number, like cars_file2.csv, Lorries_file3.csv, computers_file4.csv, phones_file5.csv. I have about 70 files per folder, and I compare them in consecutive pairs: cars_file2.csv against Lorries_file3.csv, then Lorries_file3.csv against computers_file4.csv, so the pattern of numbers is 2,3, 3,4, 4,5, and so on. Is there a smart way to handle this instead of manually coming back and changing the file names the way I am reading them here? Can I use the last number in each csv name to read them automatically? NOTE: the files share the same suffix _file:
library(daff)
setwd("path")
# Load csvs to compare into data frames
x_original <- read.csv("cars_file2.csv", strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv("Lorries_file3.csv", strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
My intention is to compare each pair of csv files and record the field additions, deletions, and modifications.
You may want to load all files at once and do your comparison with a full list of files.
This may help:
# your path
path <- "insert your path"
# get folders in this path
dir_data <- as.list(list.dirs(path))
# get all filenames
dir_data <- lapply(dir_data, function(x){
  # list of folders
  files <- list.files(x)
  files <- paste(x, files, sep="/")
  # only .csv files
  files <- files[substring(files, nchar(files)-3, nchar(files)) %in% ".csv"]
  # remove possible errors
  files <- files[!is.na(files)]
  # save if there are files
  if(length(files) >= 1){
    return(files)
  }
})
# delete NULL-values (compact() comes from the purrr package)
dir_data <- purrr::compact(dir_data)
# make it a named vector
dir_data <- unique(unlist(dir_data))
names(dir_data) <- sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(dir_data))
names(dir_data) <- as.numeric(substring(names(dir_data),nchar(names(dir_data)),nchar(names(dir_data))))
# remove possible NULL-values
dir_data <- dir_data[!is.na(names(dir_data))]
# make it a list again
dir_data <- as.list(dir_data)
# load data
data_upload <- lapply(dir_data, function(x){
  if(file.exists(x)){
    data <- read.csv(x, header=T, sep=";")
  }else{
    data <- "file not found"
  }
  return(data)
})
# setup for comparison
diffs <- lapply(as.character(sort(as.numeric(names(data_upload)))), function(x){
  # check if the second dataset exists
  if(as.character(as.numeric(x)+1) %in% names(data_upload)){
    # first dataset
    print(data_upload[[x]])
    # second dataset
    print(data_upload[[as.character(as.numeric(x)+1)]])
    # do your operations here
    comparison <- render(diff_data(data_upload[[x]],
                                   data_upload[[as.character(as.numeric(x)+1)]],
                                   ignore_whitespace=T, count_like_a_spreadsheet = F))
    numbers <- c(x, as.numeric(x)+1)
    # save both the comparison data and the numbers of the datasets
    return(list(comparison, numbers))
  }
})
# you can find the differences here
diffs
This script loads all csv files in a folder and its sub-folders and puts them into a list keyed by their numbers. As long as there are no duplicate numbers, this will work. If you do have duplicates, you will have to adjust the part where the vector is named so that you can still index the full file names afterwards.
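One more caveat: the naming step above keeps only the last character of each file name, so file numbers with more than one digit (e.g. cars_file12.csv) would be truncated. A hedged alternative for that substring() line is to extract all trailing digits with a regular expression:
names(dir_data) <- sub(".*_file(\\d+)$", "\\1", names(dir_data))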
A simple for-loop using paste0 will read in the pairs:
for (i in 1:70) { # assuming the last pair is cars_file70.csv and Lorries_file71.csv
  x_original <- read.csv(paste0("cars_file", i, ".csv"), strip.white = TRUE, stringsAsFactors = FALSE)
  x_changed <- read.csv(paste0("Lorries_file", i + 1, ".csv"), strip.white = TRUE, stringsAsFactors = FALSE)
  render(diff_data(x_original, x_changed, ignore_whitespace = TRUE, count_like_a_spreadsheet = FALSE))
}
For simplicity I used 2 .csv files.
csv_1
1,2,4
csv_2
1,8,10
Load all the .csv files from folder,
files <- dir("Your folder path", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
# create an empty list to store comparison output
diff <- list()
Loop through all loaded files and compare,
for (pos in 1:length(tables)) {
  if (pos != length(tables)) { # ignore the last one, it has no successor
    # save comparison output
    diff[[pos]] <- diff_data(as.data.frame(tables[pos]), as.data.frame(tables[pos + 1]),
                             ignore_whitespace = TRUE, count_like_a_spreadsheet = FALSE)
  }
}
The comparison output stored in diff:
[[1]]
Daff Comparison: ‘as.data.frame(tables[pos])’ vs. ‘as.data.frame(tables[pos + 1])’
+++ +++ --- ---
## X1 X8 X10 X2 X4
I'm new to R programming and wondering how I can take the contents of 1,172 text files and create a data frame with the contents of each text file in individual rows in the data frame.
So I want to go from having 1,172 text documents to having a data frame with 1,172 rows and 1 column, with each row having the contents of each individual text file. So the fifth row of the data frame would include the text from the fifth text document in the list I feed into R.
Thanks,
Tyler
# get all files with extension "txt" in the current directory
file.list <- list.files(path = ".", pattern="*.txt", full.names=TRUE)
# this creates a vector where each element contains one file
all.files <- sapply(file.list, FUN = function(x)readChar(x, file.info(x)$size))
# create a dataframe
df <- data.frame( files= all.files, stringsAsFactors=FALSE)
The last 2 steps could be combined into one to avoid creating an extra vector:
df <- data.frame( files= sapply(file.list,
FUN = function(x)readChar(x, file.info(x)$size)),
stringsAsFactors=FALSE)
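If you also want the file names as a proper column rather than as row names, a small optional addition (using the row names that data.frame() takes from file.list):
df$document <- basename(rownames(df))
rownames(df) <- NULL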
I just tested this and it worked fine for me.
# set the working directory (where files are saved)
setwd("C:/your_path_here/")
file_names = list.files(getwd())
file_names = file_names[grepl(".TXT",file_names)]
# print file_names vector
file_names
files = lapply(file_names, read.csv, header=F, stringsAsFactors = F)
files = do.call(rbind,files)
I would like to be able to import PDF documents into R and classify them as either:
Relevant (contains a specific string, for example, "tacos", within the first 100 words)
Irrelevant (DOES NOT contain "tacos" within the first 100 words)
To be more specific, I would like to address the following questions:
Does a package(s) exist in R to perform this basic classification?
If so, is it possible to generate a dataset that would look something like this in R if I had 2 PDF documents, with Paper1 containing at least one instance of the string "tacos" in the first 100 words and Paper2 NOT containing the string "tacos" in the first 100 words:
Any references to documentation/R packages/sample R code or mock examples related to this type of classification using R would be greatly appreciated! Thanks!
You can use the pdftools library and do something like this:
First, load the library and grab some pdf file names:
library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
Then define a function that reads a PDF file in as text and looks up the first n words. (It might be useful to check for errors, like an unknown password or things like that - my example function returns NA for such cases.)
isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    any(grepl(needle, txt[1:n], ...))
  }, silent = TRUE)
  if (inherits(res, "try-error")) NA else res
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case=TRUE)
Finally, wrap it up and put it into a data frame:
data.frame(
Document = basename(fns),
Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
# Document Classification
# 1 a.pdf relevant
# 2 b.pdf not relevant
# 3 c.pdf relevant
# 4 d.pdf not relevant
# 5 e.pdf relevant
While @lukeA beat me to it, I wrote another small function that uses pdftools as well. The only real difference is that lukeA looks at the first n characters, and my script looks at the first n words.
This is how my approach looks:
library(pdftools)
library(dplyr) # for data_frames and bind_rows
# to find the files better
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = T)
# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # use the library pdftools::pdf_text to extract the text from the pdf
    content <- pdf_text(file)
    # do some cleanup, i.e., remove punctuation, new-lines and lower all letters
    content2 <- tolower(content)
    content2 <- gsub("\\n", "", content2)
    content2 <- gsub("[[:punct:]]", "", content2)
    # split up the text by spaces
    content_vec <- strsplit(content2, " ")[[1]]
    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]
    # create a data_frame that holds our data
    res <- data_frame(file = file,
                      relevance = ifelse(found,
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  })
  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
# file relevance
# <chr> <chr>
# 1 pdfs//pdf_empty.pdf Irrelevant
# 2 pdfs//pdf_taco1.pdf Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant