How to read multiple PDF files in R?

I have a script that I am using to read multiple PDF files. Here is my code:
corpus_raw <- data.frame("company" = c(), "text" = c(), check.names = FALSE)
for (i in 1:length(pdf_list)) {
  print(i)
  document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i], sep = "")) %>%
    strsplit("\r\n")
  document <- data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
                         "text" = document_text, stringsAsFactors = FALSE, check.names = FALSE)
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
I get the following error message:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 79, 56
I even tried keeping check.names = FALSE, but it seems like I am doing something wrong. Any help will be appreciated. Thanks

I knew I was doing something stupid. Anyway, I was able to figure out the answer on my own: wrapping the text in I() keeps document_text as a list column (one row per PDF page), so data.frame() no longer tries to turn each page's vector of lines into its own column, which is what caused the "differing number of rows" error.
for (i in 1:length(pdf_list)) {
  print(i)
  document_text <- pdf_text(paste("V:/CodingProject2_FundOverview/", pdf_list[i], sep = "")) %>%
    strsplit("\r\n")
  document <- data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
                         "text" = I(document_text), stringsAsFactors = FALSE, check.names = FALSE)
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}
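For reference, an alternative sketch (my own, not from the original post) that flattens each PDF to one row per line instead of keeping a list column; the directory path is taken from the question, but the pdf_list construction is an assumption:

library(pdftools)   # provides pdf_text()

pdf_dir  <- "V:/CodingProject2_FundOverview"           # path from the question
pdf_list <- list.files(pdf_dir, pattern = "\\.pdf$")   # assumed construction

corpus_raw <- do.call(rbind, lapply(pdf_list, function(f) {
  # pdf_text() returns one string per page; split pages into lines and flatten
  lines <- unlist(strsplit(pdf_text(file.path(pdf_dir, f)), "\r\n"))
  data.frame(company = sub("\\.pdf$", "", f),
             text    = lines,
             stringsAsFactors = FALSE)
}))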

Related

evol.distinct function (picante) in parallel

I am currently trying to run a loop over the evol.distinct function (from the picante package in R) in parallel, with no success.
The original code is:
# read trees
phy <- read.nexus("output.nex")
# create matrix
ed_calc <- matrix(nrow = length(phy[[1]]$tip.label), ncol = length(phy))
# calculate ED for all the trees (the loop I want to run in parallel)
for (i in 1:length(phy)) {
  print(i)
  newdata <- evol.distinct(phy[[i]], type = "fair.proportion")
  ed_calc[, i] <- newdata[order(newdata[, 1]), ][, 2]
  # names(ed_calc)[, i] <- i
}
# convert to data frame
ed_calc <- as.data.frame(ed_calc)
# species names as row names
rownames(ed_calc) <- newdata[order(newdata[, 1]), ][, 1]
write.table(ed_calc, file = "ED_Allpar.txt", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
ed_calc_all <- rowMedians(as.matrix(ed_calc))   # rowMedians() is from the matrixStats package
write.table(ed_calc_all, file = "ED_Medianpar.txt", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
I've tried with foreach, with no success :'(. Please help. The data are at https://www.dropbox.com/sh/uftq7iu1qv9h3te/AACF7jWh0HA8mv1hN8WhItERa?dl=0
Thanks for any help you could give me.
Edit:
My OS is Manjaro Linux. The code I used to parallelize is:
library(parallel)
library(doParallel)

parallel::detectCores()
## [1] 8
n.cores <- parallel::detectCores()
# create the cluster
my.cluster <- parallel::makeCluster(
  n.cores - 6,
  type = "FORK"
)
# check cluster definition (optional)
print(my.cluster)
# register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)
# check that it is registered (optional)
foreach::getDoParRegistered()
foreach::getDoParWorkers()

foreach(i = 1:length(phy), combine = 'rbind') %dopar% {
  print(i)
  newdata <- evol.distinct(phy[[i]], type = "fair.proportion")
  ed_calc[, i] <- newdata[order(newdata[, 1]), ][, 2]
  ed_calc <- as.data.frame(ed_calc)
}
stopCluster(my.cluster)
# convert to data frame
ed_calc <- as.data.frame(ed_calc)
# species names as row names
rownames(ed_calc) <- newdata[order(newdata[, 1]), ][, 1]
write.table(ed_calc, file = "ED_Allpar.txt", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
ed_calc_all <- rowMedians(as.matrix(ed_calc))
write.table(ed_calc_all, file = "ED_Medianpar.txt", append = FALSE, quote = TRUE, sep = " ",
            eol = "\n", na = "NA", dec = ".", row.names = TRUE,
            col.names = TRUE, qmethod = c("escape", "double"),
            fileEncoding = "")
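No accepted answer is shown here, but two things stand out in the attempt: the foreach option is .combine (with a leading dot), so combine = 'rbind' is silently ignored, and %dopar% workers run in separate processes, so assignments into a shared ed_calc are lost. Each iteration should return its result and let .combine assemble them. A hedged sketch of that pattern, untested against the linked data:

library(picante)      # evol.distinct()
library(foreach)
library(doParallel)

my.cluster <- parallel::makeCluster(parallel::detectCores() - 6, type = "FORK")  # FORK is fine on Linux
doParallel::registerDoParallel(cl = my.cluster)

# Each iteration returns a vector of ED values; .combine = cbind assembles
# them into a matrix (one column per tree), replacing the shared-state writes.
ed_calc <- foreach(i = seq_along(phy), .combine = cbind) %dopar% {
  newdata <- evol.distinct(phy[[i]], type = "fair.proportion")
  newdata[order(newdata[, 1]), ][, 2]
}
parallel::stopCluster(my.cluster)

ed_calc <- as.data.frame(ed_calc)
rownames(ed_calc) <- sort(phy[[1]]$tip.label)  # assumes all trees share the same tip labels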

Exporting data frame to TSV, but row.names are missing?

b <- data.frame(var1 = c(9.2, 3.5, 5.5), var2 = 1:3, row.names = c("a", "b", "c"))
write_tsv(b, path = result_path, na = "NA", append = T, col_names = T, quote_escape = "double")
b is exported as a TSV, but the row names are missing, and row.names = TRUE is not an argument for write_tsv().
What can I do to keep the row names?
Row names are never kept by any of the readr write_delim() functions. You can either add the row names to the data or use write.table().
Add row names:
library(tibble)
library(magrittr)   # for %>% (assuming it isn't already loaded)
write_tsv(b %>% rownames_to_column(), path = result_path, na = "NA",
          append = T, col_names = T, quote_escape = "double")
Or:
write.table(b, result_path, na = "NA", append = TRUE, col.names = TRUE, row.names = TRUE, sep = "\t", quote = TRUE)

Set parameters for multiple read.table() calls on identical files as a variable

I am often in the situation where I have multiple files with identical structure but different content, which leads to ugly, repetitive read.table() lines. For example:
df1 <- read.table("file1.tsv", fill = T, header = T, stringsAsFactors = F, quote = "", sep = "\t")
df2 <- read.table("file2.tsv", fill = T, header = T, stringsAsFactors = F, quote = "", sep = "\t")
df3 <- read.table("file3.tsv", fill = T, header = T, stringsAsFactors = F, quote = "", sep = "\t")
df4 <- read.table("file4.tsv", fill = T, header = T, stringsAsFactors = F, quote = "", sep = "\t")
Is there a way to store the parameters as a variable, or somehow set a default, to avoid this repetitiveness? (Maybe not, and I've been writing too much python lately).
Naively I tried
read_parameters <- c(fill = T, header = T, stringsAsFactors = F, quote = "", sep = "\t")
df1 <- read.table("file1.tsv", read_parameters)
but this gives the error Error in !header : invalid argument type.
Alternatively I could run a loop over the files, but I have never found out how to iteratively name data frames in a loop in R. In any case, I think an answer to this question would be useful to the community, as this seems like a common situation.
You could write a wrapper function for read.table() and set the default parameters as you need them:
my.read.table <- function(temp.source, fill = T, header = T, stringsAsFactors = F,
                          quote = "", sep = "\t") {
  return(read.table(temp.source, fill = fill, header = header,
                    stringsAsFactors = stringsAsFactors, quote = quote, sep = sep))
}
Then you can call this function simply with
df <- my.read.table("file1.tsv")
Or you could use lapply() to call the same function on every source string:
sources.to.load <- c("file1.tsv", "file2.tsv", "file3.tsv")
df_list <- lapply(sources.to.load, read.table, fill = T, header = T,
                  stringsAsFactors = F, quote = "", sep = "\t")
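The question also asks how to name data frames created in a loop; a small sketch (my addition, not part of the original answer) using base R:

# Name the list elements after their source files (usually preferable to loose variables)
names(df_list) <- sources.to.load
# Or promote them to individual data frames df1, df2, df3 in the workspace
list2env(setNames(df_list, paste0("df", seq_along(df_list))), envir = .GlobalEnv)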
Edit:
If you want to keep the parameter vector method as well, you could add it to your wrapper function.
my.read.table2 <- function(temp.source, fill = T, header = T, stringsAsFactors = F,
                           quote = "", sep = "\t", parameterstring) {
  if (!missing(parameterstring)) {  # exists("parameterstring") would always be TRUE inside the function
    fill <- as.logical(parameterstring[1])
    header <- as.logical(parameterstring[2])
    stringsAsFactors <- as.logical(parameterstring[3])
    quote <- parameterstring[4]
    # to be more "strict" about the parameter names in the supplied vector:
    # sep <- parameterstring[which(names(parameterstring) == "sep")]
    sep <- parameterstring[5]
  }
  return(read.table(temp.source, fill = fill, header = header,
                    stringsAsFactors = stringsAsFactors, quote = quote, sep = sep))
}
Then you can call this function simply with:
df <- my.read.table2("file1.tsv")  # calls the function with the default settings
df2 <- my.read.table2("file1.tsv", parameterstring = read_parameters)  # overrides the defaults with the parameters stored in read_parameters
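For completeness, a sketch of the do.call() route (my addition): the original attempt failed because c() coerces the mixed logical and character values to character, and the vector is passed positionally as header, so read.table() ends up evaluating !"TRUE" and throws invalid argument type. Storing the parameters in a list preserves their types:

read_parameters <- list(fill = TRUE, header = TRUE, stringsAsFactors = FALSE,
                        quote = "", sep = "\t")
# c(list(...), read_parameters) prepends the file name as the first (unnamed) argument
df1 <- do.call(read.table, c(list("file1.tsv"), read_parameters))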

Import big CSV files at once in R

I've got 70 CSV files with the same columns in a folder, each of them 0.5 GB. I want to import them into a single data frame in R.
Normally I import each of them correctly as below:
df <- read_delim("file.csv", "|", escape_double = FALSE,
                 col_types = cols(pc_no = col_character(), id_key = col_character()),
                 trim_ws = TRUE)
To import all of them at once, I wrote the code below, which fails with the error argument "delim" is missing, with no default:
tbl <-
  list.files(pattern = "*.csv") %>%
  map_df(~read_delim("|", escape_double = FALSE,
                     col_types = cols(pc_no = col_character(), id_key = col_character()),
                     trim_ws = TRUE))
With read_csv the files are imported, but only one column appears, containing all the columns and values:
tbl <-
  list.files(pattern = "*.csv") %>%
  map_df(~read_csv(., col_types = cols(.default = "c")))
In your second block of code, you're missing the ., so read_delim is interpreting your arguments as read_delim(file="|", delim=<nothing provided>, ...). Try:
tbl <- list.files(pattern = "*.csv") %>%
  map_df(~ read_delim(., delim = "|", escape_double = FALSE,
                      col_types = cols(pc_no = col_character(), id_key = col_character()),
                      trim_ws = TRUE))
I explicitly specified delim= here, but it's not strictly necessary. Had you done that in your first attempt, however, you would have seen
readr::read_delim(delim = "|", escape_double = FALSE,
                  col_types = cols(pc_no = col_character(), id_key = col_character()),
                  trim_ws = TRUE)
# Error in read_delimited(file, tokenizer, col_names = col_names, col_types = col_types, :
#   argument "file" is missing, with no default
which is more indicative of the actual problem.
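One optional refinement (my addition, not from the answer): with 70 files it can help to record which file each row came from. Naming the list and passing .id to map_df() adds a source column:

library(tidyverse)   # readr + purrr, assumed already in use

tbl <- list.files(pattern = "*.csv") %>%
  purrr::set_names() %>%                     # file names become the .id values
  map_df(~ read_delim(., delim = "|", escape_double = FALSE,
                      col_types = cols(pc_no = col_character(),
                                       id_key = col_character()),
                      trim_ws = TRUE),
         .id = "source_file")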

ERREUR : <text> in R

I'm trying to import data into R.
When I submit
Dataset <- read.table("Data.txt", header = TRUE, sep = "\t",
                      na.strings = "NA", dec = ".", strip.white = TRUE)
it works, but when I add row.names = 1 and submit
Dataset <- read.table("Data.txt", header = TRUE, sep = "\t",
                      na.strings = "NA", dec = ".", row.names = 1, strip.white = TRUE)
I get ERREUR : <text> (ERREUR is French for ERROR).
If your first version works, perhaps the easiest way would be simply:
Dataset <- read.table("Data.txt", header = TRUE, sep = "\t",
                      na.strings = "NA", dec = ".", strip.white = TRUE)
rownames(Dataset) <- Dataset[, 1]
Dataset <- Dataset[, -1]
The first column of Data.txt then becomes the row names of Dataset. Note that row names must be unique, so both row.names = 1 and this assignment will fail if that column contains duplicates.
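If duplicates are indeed the cause, a hedged workaround (my addition) is to disambiguate the values with make.unique() before assigning them:

Dataset <- read.table("Data.txt", header = TRUE, sep = "\t",
                      na.strings = "NA", dec = ".", strip.white = TRUE)
# make.unique() appends .1, .2, ... to repeated values so they can serve as row names
rownames(Dataset) <- make.unique(as.character(Dataset[, 1]))
Dataset <- Dataset[, -1]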
