I have a large csv file and would like to read only certain lines, defined by a vector of row numbers to be read. Is there any way to read these rows without reading the whole csv into memory?
The only solutions I've found seem to allow reading consecutive lines (e.g. lines 2-100).
A simple example of how you might wrap the sed approach I linked to in an R function:
read_rows <- function(file, rows, ...){
  tmp <- tempfile()
  # build e.g. "3p;6p;7p" so sed prints only those line numbers
  row_cmd <- paste(paste0(rows, "p"), collapse = ";")
  # e.g. sed -n '3p;6p;7p' file > tmp
  cmd <- sprintf("sed -n '%s' %s > %s", row_cmd, file, tmp)
  system(command = cmd)
  read.table(file = tmp, ...)
}
write.csv(x = mtcars,file = "~/Desktop/scratch/mtcars.csv")
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(3,6,7),sep = ",",header = FALSE,row.names = 1)
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> read_rows(file = "~/Desktop/scratch/mtcars.csv",rows = c(1,5,9),sep = ",",header = TRUE,row.names = 1)
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Note the difference with row 1 as the column headers.
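If calling out to sed is not an option (e.g. on Windows), here is a minimal pure-R sketch of the same idea, assuming the mtcars.csv file written above; it reads the file line by line, so the whole csv never enters memory:
wanted <- c(3, 6, 7)
con <- file("~/Desktop/scratch/mtcars.csv", open = "r")
kept <- character(0)
i <- 0L
while (length(line <- readLines(con, n = 1L)) > 0) {
  i <- i + 1L
  if (i %in% wanted) kept <- c(kept, line)  # keep only the wanted physical lines
  if (i >= max(wanted)) break               # stop after the last wanted line
}
close(con)
read.table(text = kept, sep = ",", header = FALSE, row.names = 1)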
sqldf will read the file into a database (which it creates and then deletes for you) and then read only the rows you want into R; rowid is SQLite's built-in 1-based row number. Assuming the csv file created in the Note at the end, define the desired Rows and then use read.csv.sql. We have used a temporary file for the database, but if the data is sufficiently small you can omit the dbname argument and the database will be kept in memory.
library(sqldf)
Rows <- c(3, 5, 10)
s <- toString(Rows)
fn$read.csv.sql("Letters.csv", "select * from file where rowid in ($s)",
                dbname = tempfile())
giving:
X Letters
1 "3" "c"
2 "5" "e"
3 "10" "j"
If the number of rows desired is very large, then rather than embedding the row numbers in the SQL statement, create a data frame from them and join to it:
library(sqldf)
Rows <- c(3, 5, 10)
RowsDF <- data.frame(Rows)
fn$read.csv.sql("Letters.csv",
                "select file.* from file join RowsDF on file.rowid = RowsDF.Rows",
                dbname = tempfile())
Note
Letters <- data.frame(Letters = letters, stringsAsFactors = FALSE)
write.csv(Letters, "Letters.csv")
I am using the tesseract library in R to convert PDF files into text, as shown here: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
library(pdftools)
library(tesseract)
pngfile <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
The above code works perfectly. Now I am trying to "mass upload" a large number of PDF files and convert them into text. Currently, I have figured out how to do this manually:
#import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)
#import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)
# etc.
I copied/pasted the above code 50 times (changing the index each time, i.e. pngfile_i, text_i) and was able to accomplish what I wanted. However, I am looking for a somewhat "automatic" way to import and convert all the PDF files.
At the moment, all my pdf files are in the following folder:
"C:/Users/me/Documents/mypdfs"
I found the following code which can be used to "mass import" pdf files into R:
library(dplyr)
library(purrr)      # map_df() is from purrr
library(data.table) # fread()
tbl_fread <-
  list.files(pattern = "*.pdf") %>%
  map_df(~fread(.))
But I am not sure how to instruct this code to import all PDFs from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to instruct R to rename each imported PDF as pdf_1, pdf_2, etc.
If all the pdf files were correctly imported and created, I could then write a "loop" and execute the desired commands, e.g.
# "n" would be the total number of pdf files
for (i in 1:n)
{
pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
text_i <- tesseract::ocr(pngfile_i)
}
Can someone please show me how to do this?
Thanks
You can add full.names = TRUE to your list.files() call, but this assumes that "C:/Users/me/Documents/mypdfs" is contained within your project.
Alternatively, you can use path = "Documents/mypdfs" together with full.names = TRUE, which will direct the path to mypdfs.
list.files(
  path = "Documents/mypdfs",
  full.names = TRUE,
  pattern = "*.pdf"
)
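For the OCR step itself, a minimal sketch (reusing the pdftools/tesseract calls from the question; the folder path is the one above) could loop over the discovered files:
library(pdftools)
library(tesseract)
# all PDFs under the target folder, with full paths
pdf_paths <- list.files(path = "Documents/mypdfs",
                        pattern = "\\.pdf$", full.names = TRUE)
# one OCR result per file, named pdf_1, pdf_2, ... regardless of the
# original file names
texts <- lapply(pdf_paths, function(p) {
  png <- pdftools::pdf_convert(p, dpi = 600)
  tesseract::ocr(png)
})
names(texts) <- paste0("pdf_", seq_along(texts))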
To save them according to your pdf_n naming, you can use paste along with map. Here I used data.frames to provide an example, as I do not work with PDFs and have none in bulk that I am willing to process.
library(tidyverse)
seq_along(tbl_fread) %>% map(
.f = function(i) {
# Your regular function
# related to PDF
# Saving according to desired names
write.table(
tbl_fread[[i]],
file = paste0("pdf_", i, ".csv")
)
}
)
To verify it works as intended, we can read one of the files back:
read.table(
file = "pdf_1.csv"
)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I want to know how to use a short script to eliminate all but one of several duplicate column variables, based on the prefix of the column name, without typing the variables to remove by hand.
For example, I created repeats of the mtcars$am variable, called am1, am2, am3, and am4, in a data frame called mtcars_example_2, and removed the original am variable from that data frame.
I can use the script below to eliminate every variable with the prefix "am" except am1, saving the result to a new data frame called mtcars_example_3; this types out every variable to remove by hand:
## long way of removing all variables with the am prefix that are not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
But this seems like the long way of doing this. Is there a faster way that does not require me to individually type the name of each variable I want to remove from the data?
Is this possible? If so, how can this be done?
Thanks ahead of time.
Here is the code for the example:
# example data
## loads packages
library(tidyverse)
## creates mtcars_example data
mtcars_example_1 <- data.frame(mtcars)
mtcars_example_2 <- data.frame(mtcars_example_1)
## creates duplicate variables, based on am variable
mtcars_example_2$am1 <- mtcars_example_1$am
mtcars_example_2$am2 <- mtcars_example_1$am
mtcars_example_2$am3 <- mtcars_example_1$am
mtcars_example_2$am4 <- mtcars_example_1$am
## removes original variable
mtcars_example_2 <-
mtcars_example_2 %>%
select(
-c(
"am"
)
)
## long way of removing all variables with the am prefix that are not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
You can remove all the variables that start with am but keep am1:
library(dplyr)
mtcars_example_2 %>% select(-starts_with('am'), am1) %>% head
# mpg cyl disp hp drat wt qsec vs gear carb am1
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 4 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 4 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 4 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 0
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 3 2 0
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1 0
Depending on your actual scenario, you can also use a regex to remove columns:
mtcars_example_2 %>% select(-matches('am[2-4]')) %>% head
We could also use contains(), though note that it matches "am" anywhere in the name, not only as a prefix:
library(dplyr)
mtcars_example_2 %>%
select(-contains('am'), am1)
I'm writing a function that takes a data.table as an argument. The column names of the data.table are partially specified as arguments, but not all column names are specified, and all original columns need to be maintained. Inside the function, some columns need to be added to the data.table. Even if the data.table is copied inside the function, I want to add these columns in a way that is guaranteed not to overwrite existing columns. What's the best way to ensure I'm not overwriting columns, given that the column names are not known?
Here's one approach:
library(data.table)

# x is a data.table and knownvar is a column name of that data.table
f <- function(x, knownvar){
  x <- copy(x)
  # prefix "i." until the candidate name is not already a column of x
  tempcol <- "z"
  while(tempcol %in% names(x))
    tempcol <- paste0("i.", tempcol)
  tempcol2 <- "q"
  while(tempcol2 %in% names(x))
    tempcol2 <- paste0("i.", tempcol2)
  x[, (tempcol) := 3]
  eval(parse(text = paste0("x[,(tempcol2):=", tempcol, "+4]")))
  x
}
Note that even though I'm copying x here, I still need this process to be memory efficient. Is there an easier way of doing this? Possibly without using eval(parse(text = ...))?
Obviously I could just create a local variable (e.g. a vector) in the function environment (rather than adding it explicitly as column of the data.table), but this wouldn't work if I then need to sort/join the data.table. Plus I may want to explicitly return a data.table that contains both the original variables and the new column.
Here is one way to write the function using set and non-standard evaluation with substitute() + eval().
Note 1: if new columns are created based on the column names in newcols (instead of the column name in knownvar), the character names in newcols are converted to symbols with as.name() (or equivalently as.symbol()).
Note 2: new columns in newvals can only be added in a sensible order, i.e. if column q requires column z, column z should be added before column q.
library(data.table)
f <- function(x, knownvar) {
## remove if x should be modified in-place
x <- copy(x)
## new column names
newcols <- setdiff(make.unique(c(names(x), c("z", "q"))), names(x))
## new column values based on knownvar or new column names
zcol <- as.name(newcols[1])
newvals <- list(substitute(3 * knownvar), substitute(zcol + 4))
for(i in seq_along(newvals)) {
set(x, j = newcols[i], value = eval(newvals[[i]], envir = x))
}
return(x)
}
## example data
x <- as.data.table(mtcars)
x[, c("q", "q.1") := .(mpg, 2 * mpg)]
head(f(x, mpg))
#> mpg cyl disp hp drat wt qsec vs am gear carb q q.1 z q.2
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 21.0 42.0 63.0 67.0
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 21.0 42.0 63.0 67.0
#> 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 22.8 45.6 68.4 72.4
#> 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.4 42.8 64.2 68.2
#> 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 18.7 37.4 56.1 60.1
#> 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 18.1 36.2 54.3 58.3
I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
library(data.table)

mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg", "hp")

# toy function: multiplies the first .SD column by the second
myfunc <- function(data) {
  print(data)
  return(data[, 1] * data[, 2])
}
This obviously works:
mtcars[, eval(returnNames) := myfunc(.SD), .SDcols = SDnames, by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[, eval(returnNames) := lapply(.SD, eval(functionCall)), .SDcols = SDnames, by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to feed such a data.table into the function as one object.
To get the function, you need get instead of eval.
On the left-hand side of :=, you can just put the character vector in parentheses; eval isn't needed.
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following setup code:
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
functionCall <- "myfunc"
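As an aside (not part of the original answer), base R's match.fun() is another way to look a function up by name and works the same way here:
# alternative lookup: match.fun() retrieves a function from a string
mtcars[, (returnNames) := match.fun(functionCall)(.SD)
       , .SDcols = SDnames
       , by = cyl]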
I'm looking for a way to put a bottom line under a data frame that I print out in R.
(The original post shows a screenshot of the output here.)
I want to print a bottom line as long as the data frame output, but the width varies.
Any idea?
EDIT
I'm trying to get rid of the hard-coded
cat("---------------------------------------\n")
and want to make it dynamic to the output width of a given data frame. The "-----" line should be no longer and no shorter than the data frame.
Use getOption("width"):
> getOption("width")
[1] 80
You can see a description of this option via ?options which states
‘width’: controls the maximum number of columns on a line used in
printing vectors, matrices and arrays, and when filling by
‘cat’.
That doesn't mean that the entire 80 (in my case) characters are used, but R's printing shouldn't extend beyond that so it should be an upper limit.
You should probably also check this in IDE or other front-ends to R. For example, RStudio might do something different depending on the width of the console widget in their app.
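As a quick illustration of the upper-bound approach (a sketch, not exact-width):
# a rule as wide as R will ever print; it may overshoot the actual
# data.frame output, so treat it as an upper bound
cat(paste(rep("-", getOption("width")), collapse = ""), "\n")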
To actually format exactly the correct width for the data frame, you'll need to process the data frame into character strings for each line (much as print.data.frame does via its format method). Something like:
df <- data.frame(Price = round(runif(10), 2),
Date = Sys.Date() + 0:9,
Subject = rep(c("Foo", "Bar", "DJGHSJIBIBFUIBSFIUBFUIS"),
length.out = 10),
Category = rep("Media", 10))
class(df) <- c("MyDF", "data.frame")
print.MyDF <- function(x, ...) {
  fdf <- format(x)
  # first formatted entry of each column; format() pads every entry in a
  # column to the same width, so these carry the column widths
  strings <- apply(x, 2, function(x) unlist(format(x)))[1, ]
  rowname <- format(rownames(fdf))[[1]]
  strings <- c(rowname, strings)
  names <- c("", colnames(x))
  # each column is as wide as its widest entry or its header
  widths <- pmax(nchar(strings), nchar(names))
  # columns are separated by a single space
  csum <- sum(widths + 1) - 1
  print.data.frame(x)
  writeLines(paste(rep("-", csum), collapse = ""))
  writeLines("Balance: 48") ## FIXME !!
  invisible(x)
}
which gives:
> df
Price Date Subject Category
1 0.73 2015-06-29 Foo Media
2 0.11 2015-06-30 Bar Media
3 0.19 2015-07-01 DJGHSJIBIBFUIBSFIUBFUIS Media
4 0.54 2015-07-02 Foo Media
5 0.04 2015-07-03 Bar Media
6 0.37 2015-07-04 DJGHSJIBIBFUIBSFIUBFUIS Media
7 0.59 2015-07-05 Foo Media
8 0.85 2015-07-06 Bar Media
9 0.15 2015-07-07 DJGHSJIBIBFUIBSFIUBFUIS Media
10 0.05 2015-07-08 Foo Media
----------------------------------------------------
Balance: 48
See how this works: very simple counting of characters, no bells and whistles, but it should do the expected job:
EDIT: Printing of the data.frame and the line done in the function.
# create a function that prints the data.frame with the line we want
lineLength <- function( testDF )
{
# start with the characters in the row names,
# plus empty space between columns
dashes <- max( nchar( rownames( testDF ) ) ) + length ( testDF )
# loop finding the longest string in each column, including header
for( i in 1 : length ( testDF ) )
{
x <- nchar( colnames( testDF ) )[ i ]
y <- max( nchar( testDF[ , i ] ) )
if( x > y ) dashes <- dashes + x else dashes <- dashes + y
}
myLine <- paste( rep( "-", dashes ), collapse = "" )
print( testDF )
cat( myLine, "\n" )
}
# sample data
data( mtcars )
# see how it works
lineLength( head( mtcars ) )
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
--------------------------------------------------------------------