I would like to:
- get certain data from page 2 of every PDF file in a list I created
- specifically, the data from page 2 for Bond Futures CGB ... (columns 2, 11 and 16)
- create a data frame aggregating all this data:
Year | Month   | Metric
---- | ------- | ------------------------
2013 | January | Monthly Volume
2013 | January | Month End Open Interest
2013 | January | Transactions
I have tried the following but haven't gotten far at all - my apologies.
library(rvest)
library(pdftools)
library(tidyverse)
filepath <- "~R Working Directory/CanadianFutures"
files <- list.files(path = filepath, pattern = '*.pdf')
The variable files contains the list:
[1] "1301_stats_en.pdf" "1302_stats_en.pdf" "1303_stats_en.pdf" "1304_stats_en.pdf" "1305_stats_en.pdf" "1306_stats_en.pdf"
[7] "1307_stats_en.pdf" "1308_stats_en.pdf" "1309_stats_en.pdf" "1310_stats_en.pdf" "1311_stats_en.pdf" "1312_stats_en.pdf"
[13] "1401_stats_en.pdf" "1402_stats_en.pdf" "1403_stats_en.pdf" "1404_stats_en.pdf" "1405_stats_en.pdf" "1406_stats_en.pdf".....[61] "1801_stats_en.pdf" "1802_stats_en.pdf" "1803_stats_en.pdf" "1804_stats_en.pdf" "1805_stats_en.pdf"
I have tried the following to get page 2 of each pdf, but I am totally lost:
all <- lapply(files, function(x) {
  txt <- pdf_text(x)
  page_2 <- txt[2]
})
I get the following:
Error in normalizePath(pdf, mustWork = TRUE) :
path[1]="1301_stats_en.pdf": No such file or directory
All the pdfs in my list have the same consistent formatting.
Here is an example of the pdf https://www.m-x.ca/f_stat_en/1401_stats_en.pdf
Thank you
Make sure your working directory is the same as where you stored your files:
getwd()
Another option is to build your list of files with their full paths included:
files <- list.files(filepath, pattern = '*.pdf', full.names = TRUE)
> files
[1] "Downloads/naamloze map//1401_stats_en-2.pdf"
[2] "Downloads/naamloze map//1401_stats_en.pdf"
PDFreader <- function(x){
  t <- pdf_text(x)
  page_2 <- t[2]
  page_2
}
lapply(files, PDFreader)
returns
[[1]]
[1]..... text....
[[2]]
[1]..... text....
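If the page-2 layout is as consistent as described, you could go further and build the requested data frame in one pass. Below is a rough sketch, not a tested solution: the whitespace splitting, the field positions (2, 11 and 16), and the read_cgb helper are all assumptions about the page's text layout, so verify them against one of the actual PDFs first.
library(pdftools)
library(tidyverse)
# hypothetical helper; the field positions are assumed, check them against a real file
read_cgb <- function(file) {
  page_2 <- pdf_text(file)[2]
  lines  <- str_split(page_2, "\n")[[1]]
  cgb    <- str_squish(lines[str_detect(lines, "CGB")][1])  # first CGB row on the page
  fields <- str_split(cgb, " ")[[1]]
  ym     <- str_match(basename(file), "^(\\d{2})(\\d{2})")  # "1301_..." -> "13" and "01"
  tibble(
    Year                      = 2000L + as.integer(ym[2]),
    Month                     = month.name[as.integer(ym[3])],
    `Monthly Volume`          = fields[2],
    `Month End Open Interest` = fields[11],
    Transactions              = fields[16]
  )
}
files  <- list.files(filepath, pattern = "\\.pdf$", full.names = TRUE)
result <- map_dfr(files, read_cgb)
Reshape with tidyr::pivot_longer() afterwards if you want the long Year/Month/Metric layout shown in the question.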
Good luck
I am trying to load several text documents into a data frame in R. My desired output is a matrix with two columns:

DOCUMENT   | CONTENT
:--------- | :-------------------:
Document A | This is the content.
Document B | This is the content.
Document C | This is the content.
Within the column "CONTENT", all the text information from the text document (a 10-K report) should be shown.
> library(tm)
> setwd("C:/Users/folder")
> folder <- getwd()
> corpus <- Corpus(DirSource(directory = folder, pattern = "*.txt"))
This creates a corpus that I can tokenize, but I can't manage to convert it to a data frame or my desired output.
Can somebody help me?
If you're only working with .txt files and your end goal is a data frame, then I think you can skip the corpus step and simply read in all your files as a list. The hard part is to get the names of the .txt files into a column called DOCUMENT, but this can be done in base R.
# make a reproducible example
a <- "this is a test"
b <- "this is a second test"
c <- "this is a third test"
write(a, "a.txt"); write(b, "b.txt"); write(c, "c.txt")
# get working dir
folder <- getwd()
# get names/locations of all files
filelist <- list.files(path = folder, pattern = "*.txt", full.names = FALSE)
# read in the files and put them in a list
lst <- lapply(filelist, readLines)
# extract the names of the files without the `.txt` stuff
names(lst) <- filelist
namelist <- fs::path_file(filelist)
namelist <- unlist(lapply(namelist, sub, pattern = ".txt", replacement = ""),
use.names = FALSE)
# give every matrix in the list its own name, which was its original file name
lst <- mapply(cbind, lst, "DOCUMENT" = namelist, SIMPLIFY = FALSE)
# combine into a dataframe
x <- do.call(rbind.data.frame, lst)
# a small amount of clean-up
rownames(x) <- NULL
names(x)[names(x) == "V1"] <- "CONTENT"
x <- x[,c(2,1)]
x
#> DOCUMENT CONTENT
#> 1 a this is a test
#> 2 b this is a second test
#> 3 c this is a third test
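For comparison, a compact tidyverse sketch of the same idea; note that it collapses each file to a single row instead of one row per line, and assumes the tidyverse is installed:
library(tidyverse)
x <- tibble(
  DOCUMENT = sub("\\.txt$", "", basename(filelist)),
  CONTENT  = map_chr(filelist, ~ paste(readLines(.x), collapse = " "))
)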
I have a folder with multiple raster .tif files from January 1950 to December 2018. However, they are named with the month first and then the year (see below):
[1] "./WI_only_cmi60_01_1950.tif" "./WI_only_cmi60_01_1951.tif" "./WI_only_cmi60_01_1952.tif"
[4] "./WI_only_cmi60_01_1953.tif" "./WI_only_cmi60_01_1954.tif" "./WI_only_cmi60_01_1955.tif"
[7] "./WI_only_cmi60_01_1956.tif" "./WI_only_cmi60_01_1957.tif" "./WI_only_cmi60_01_1958.tif"
...
[820] "./WI_only_cmi60_12_2010.tif" "./WI_only_cmi60_12_2011.tif" "./WI_only_cmi60_12_2012.tif"
[823] "./WI_only_cmi60_12_2013.tif" "./WI_only_cmi60_12_2014.tif" "./WI_only_cmi60_12_2015.tif"
[826] "./WI_only_cmi60_12_2016.tif" "./WI_only_cmi60_12_2017.tif" "./WI_only_cmi60_12_2018.tif"
When I bring these into R and use the raster package to stack them:
# list tif files in working directory
tifs <- list.files(pattern = ".tif$", full.names = TRUE)
# stack tifs in working directory
rstack <- stack(tifs)
They are ordered with all the January .tif files first, followed by all the February .tif files, and so on, when I need them ordered by year and then by month (so chronologically from January 1950 to December 2018).
Is there a way to rename these files so that characters 15 and 16 of each filename (the month) are moved after the year characters (18 to 21)?
i.e. the first filename listed would change from
"./WI_only_cmi60_01_1950.tif"
to
"./WI_only_cmi60_1950_01.tif"
I would not rename the files, but instead sort the filenames appropriately. That should be a better approach in the long run for reproducibility and updating.
Example (unsorted)
ff <- c("./WI_only_cmi60_01_1950.tif","./WI_only_cmi60_01_1951.tif", "./WI_only_cmi60_01_1952.tif",
"./WI_only_cmi60_06_1950.tif", "./WI_only_cmi60_06_1951.tif", "./WI_only_cmi60_06_1952.tif",
"./WI_only_cmi60_12_1950.tif", "./WI_only_cmi60_12_1951.tif", "./WI_only_cmi60_12_1952.tif")
Using Akrun's expression
i <- sub("(\\d+)_(\\d+)(\\.tif)", "\\2_\\1\\3", ff)
fs <- ff[order(i)]
fs
#[1] "./WI_only_cmi60_01_1950.tif" "./WI_only_cmi60_06_1950.tif"
#[3] "./WI_only_cmi60_12_1950.tif" "./WI_only_cmi60_01_1951.tif"
#[5] "./WI_only_cmi60_06_1951.tif" "./WI_only_cmi60_12_1951.tif"
#[7] "./WI_only_cmi60_01_1952.tif" "./WI_only_cmi60_06_1952.tif"
#[9] "./WI_only_cmi60_12_1952.tif"
A more basic approach to achieve the same:
x <- gsub("WI_only_cmi60_", "", basename(ff))
d <- paste(substr(x, 4, 7), substr(x, 1, 2), sep="-")
i <- order(d)
ff[i]
Given that the pattern seems to be rather simple (69 years, 12 months each) you could also do (with all your files)
i <- rep(1:69, 12)
fs <- ff[order(i)]
(always double check the results!)
We could capture the components as groups and rearrange the backreferences:
sub("(\\d+)_(\\d+)(\\.tif)", "\\2_\\1\\3", "./WI_only_cmi60_01_1950.tif")
Output:
[1] "./WI_only_cmi60_1950_01.tif"
Using strsplit:
x <- "./WI_only_cmi60_01_1950.tif"
revfun <- function(x) {
  # split into single characters and reverse them
  r <- rev(el(strsplit(x, '')))
  # positions 5:9 hold the reversed "_1950" and 10:12 the reversed "_01";
  # swapping those blocks and reversing back puts the year before the month
  Reduce(paste0, rev(r[c(1:4, 10:12, 5:9, 13:length(r))]))
}
revfun(x)
# [1] "./WI_only_cmi60_1950_01.tif"
The code I have used is below - via the answers given by akrun and Robert Hijmans - but I wanted to clarify how I used these answers to read all of the .tif files within a working directory and stack them:
setwd("C:/...")
# list tif files in working directory
ff <- list.files(pattern = ".tif$", full.names = TRUE)
i <- sub("(\\d+)_(\\d+)(\\.tif)", "\\2_\\1\\3", ff)
fs <- ff[order(i)]
library(raster)
# create stack of tif files
rstack <- stack(fs)
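Optionally, you could give the layers chronological labels derived from the sorted filenames. A sketch, assuming the WI_only_cmi60_MM_YYYY.tif naming shown above (raster may prepend an "X" to make the names syntactically valid):
# "WI_only_cmi60_01_1950.tif" -> "1950_01"
names(rstack) <- sub(".*cmi60_(\\d+)_(\\d+)\\.tif$", "\\2_\\1", basename(fs))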
I have a batch of text files that I need to read into R to do text mining.
So far, I have tried read.table, readLines, lapply, and mcsv_r from the qdap package, to no avail. I have tried to write a loop to read the files, but I have to specify the name of the file, which changes in every iteration.
Here is what I have tried:
# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"
# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")
for(i in 1:length(speeches))
{
text_df <- do.call(rbind,lapply(speeches[i],read.csv))
}
Moreover, I have tried the following:
library(data.table)
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )
And it is giving me this error when inaugAbrahamLincoln-1.csv clearly exists in the folder:
files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) :
File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
>
But it only works on .csv files, not on .txt files.
Is there a simpler way to do text mining from multiple source files? If so, how?
Thanks
I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:
textreadr::read_dir("../data/InauguralSpeeches/")
Your example is not reproducible, so I create one below (please make your example reproducible in the future).
library(textreadr)
## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')
## the read in of a directory
read_dir('delete_me')
Output
The output below is a tibble with each document registered in the document column; for every line in a document there is one row. Depending on what's in the csv files, this may not be fine-grained enough.
## document content
## 1 0_9 Bromwell High is a cartoon comedy. It ra
## 2 00_00 test
## 3 00_00
## 4 00_00 testing
## 5 00_00
## 6 00_00 tester
## 7 1_7 If you like adult comedy cartoons, like
## 8 10_9 I'm a male, not given to women's movies,
## 9 11_9 Liked Stanley & Iris very much. Acting w
## 10 12_9 Liked Stanley & Iris very much. Acting w
## .. ... ...
## 141 mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
Here is code that will read all the *.csv files in a directory to a single data.frame:
dir <- '~/Desktop/testcsv/'
files <- list.files(dir,pattern = '*.csv', full.names = TRUE)
data <- lapply(files, read.csv)
df <- do.call(rbind, data)
Notice that I added the argument full.names = TRUE. This gives you the full paths to the files; without them R looks in the working directory, which is why you were getting an error for "inaugAbrahamLincoln-1.csv" even though it exists.
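For the .txt speeches from the question, a similar sketch using readLines, collapsing each file to one row (folder.path as defined in the question):
files <- list.files(folder.path, pattern = "\\.txt$", full.names = TRUE)
texts <- vapply(files, function(f) paste(readLines(f), collapse = " "), character(1))
df    <- data.frame(file = basename(files), text = texts, row.names = NULL)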
Here is one way to do it.
library(data.table)
setwd("C:/Users/Excel/Desktop/CSV Files/")
WD <- "C:/Users/Excel/Desktop/CSV Files/"
# read headers
data <- data.table(read.csv(text = "CashFlow,Cusip,Period"))
csv.list <- list.files(WD)
k <- 1
for (i in csv.list){
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  if (k %% 100 == 0)
    print(k / length(csv.list))
  k <- k + 1
}
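Growing the table with rbind inside a loop gets slow as the number of files rises; a more idiomatic data.table sketch of the same consolidation, assuming all the CSVs share the same columns:
library(data.table)
files <- list.files(WD, pattern = "\\.csv$", full.names = TRUE)
data  <- rbindlist(lapply(files, fread))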
I have a list of files that are all named similarly: "FlightTrackDATE.txt" where the date is expressed in YYYYMMDD. I read in all the files with the list.files() command, but this gives me all the files in that folder (only flight track files are in this folder). What I would like to do is create a new file that will combine all the files from the last 90 days (or three months, whichever is easier) and ignore the other files.
You can try this:
# date from which you want to consolidate (replace with the required date)
fromDate = as.Date("2015-12-23")
for (filename in list.files()){
  # extract the date from the filename using substr (characters 12-19)
  filenameDate = as.Date(substr(filename, 12, 19), format = "%Y%m%d")
  # read and consolidate if the file date is on or after the from date
  if ((filenameDate - fromDate) >= 0){
    # create the consolidated data from the first file
    if (!exists('consolidated')){
      consolidated <- read.table(filename, header = TRUE)
    } else {
      data = read.table(filename, header = TRUE)
      # row bind to consolidate
      consolidated = rbind(consolidated, data)
    }
  }
}
OUTPUT:
I have three sample files:
FlightTrack20151224.txt
FlightTrack20151223.txt
FlightTrack20151222.txt
Sample data:
Name Speed
AA101 23
Consolidated data:
Name Speed
1 AA102 24
2 AA101 23
Note:
1. Create the from date by subtracting from the current date, or use a fixed date as above.
2. Remember to clean up the existing consolidated data if you are running the script again; data duplication might occur otherwise.
3. Save consolidated to a file :) (see the sketch below)
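A small sketch covering those notes; the 90-day offset and the output filename are only examples:
# from date, e.g. 90 days before today (note 1)
fromDate <- Sys.Date() - 90
# reset any earlier result so reruns don't duplicate rows (note 2)
if (exists("consolidated")) rm(consolidated)
# ... run the consolidation loop above ...
# save the result (note 3)
write.table(consolidated, "FlightTrackConsolidated.txt", row.names = FALSE)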
Consider an lapply() solution without a need for list.files() since you know ahead of time the directory and file name structure:
path = "C:/path/to/txt/files/"
# LAST 90 DATES IN YYYYMMDD FORMAT
dates <- sapply(0:89, function(x) format(Sys.Date()-x, "%Y%m%d"))
# IMPORT ALL FILES INTO A LIST OF DATAFRAMES
dfList <- lapply(paste0(path, "FlightTrack", dates, ".txt"),
function(x) if (file.exists(x)) {read.table(x)})
# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)
# OUTPUT FINAL FILE TO TXT
write.table(df, paste0(path, "FlightTrack90Days.txt"), sep = ",", row.names = FALSE)
Being relatively new to R programming, I am struggling with a huge data set of 16 text files (comma separated) saved in one directory. All the files have the same number of columns and the same naming convention, for example file_year_2000, file_year_2001 etc. I want to create a list in R where I can access each file individually by accessing the list elements. By searching the web I found some code and tried the following, but as a result I get one huge list (16.2 MB) where the output is just strange. I would like to have 16 elements in the list, each representing one file read from the directory. I tried the following code but it does not work as I want:
path = "~/.../.../.../Data_1999-2015"
list.files(path)
file.names <- dir(path, pattern =".txt")
length(file.names)
df_list = list()
for( i in length(file.names)){
file <- read.csv(file.names[i],header=TRUE, sep=",", stringsAsFactors=FALSE)
year = gsub('[^0-9]', '', file)
df_list[[year]] = file
}
Any suggestions?
Thanks in advance.
Just to give more details, here is a corrected version:
path = "~/.../.../.../Data_1999-2015"
list.files(path)
file.names <- dir(path, pattern =".txt")
length(file.names)
df_list = list()
for(i in seq(length(file.names))){
year = gsub('[^0-9]', '', file.names[i])
df_list[[year]] = read.csv(file.names[i],header=TRUE, sep=",", stringsAsFactors=FALSE)
}
Maybe it would be worth joining the data frames into one big data frame with an additional column being the year?
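A sketch of that join, using the df_list built in the loop above (its names are the years; dplyr assumed available):
library(dplyr)
# the list names ("1999", "2000", ...) become a Year column
big_df <- bind_rows(df_list, .id = "Year")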
I assume that instead of "access each file individually" you mean you want to access the data in each file individually.
Try something like this (untested):
path = "~/.../.../.../Data_1999-2015"
file.names <- dir(path, pattern =".txt")
# create a list of data frames with the correct length
df_list = vector("list", length(file.names))
# give it empty names to begin with
names(df_list) <- rep("", length(df_list))
for (i in seq_along(file.names)) {
  # now i = 1,2,...,16
  file <- read.csv(file.names[i], header=TRUE, sep=",", stringsAsFactors=FALSE)
  # save the data
  df_list[[i]] = file
  year = gsub('[^0-9]', '', file.names[i])
  names(df_list)[i] <- year
}
Now you can use either df_list[[1]] or df_list[["2000"]] for year 2000 data.
I am uncertain whether you are reading your csv files from the right directory. If not, use
file <- read.csv(file.path(path, file.names[i]), header=TRUE, sep=",", stringsAsFactors=FALSE)
when reading the file.