I am trying to read a table in R using an argument passed from a PHP script, like this:
echo exec("Rscript /var/www/html/genome/coex.R $db");
In R,
args = commandArgs(trailingOnly=FALSE)
db <- args[1]
db <- paste(db, "txt", sep=".")
dat <- as.matrix(read.table("/var/www/html/genome/coex_db/db", header = TRUE, fill = TRUE))
I am not sure how the filename built from the db argument can be used to read the table contents into dat.
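One possible way to wire this up (a minimal sketch, assuming the PHP call above passes only the database name as the single trailing argument; the paths are taken from the question):
args <- commandArgs(trailingOnly = TRUE)  # only the arguments after --args, so args[1] is $db from PHP
db <- args[1]

# build the full path from the argument instead of the literal string "db"
filename <- paste(db, "txt", sep = ".")
path <- file.path("/var/www/html/genome/coex_db", filename)

dat <- as.matrix(read.table(path, header = TRUE, fill = TRUE))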
I'm trying to use a for loop in R to iterate over a table of inputs with multiple columns, where each column corresponds to a specific variable.
Let me know if you have any questions.
Thanks in advance!
Here is the script that I wrote:
## Call server
library(RCurl)
library(tidyverse)
library(stringr)
library(redcapAPI)

TableForR <- read.csv("//usershare/users$/aweible1/Profile_Data/Desktop/REDCap Monthly Check/TableForR.csv")

for (j in seq_len(nrow(TableForR))) {
  labelFile <- RCurl::postForm(
    uri = "https://redcap.uhhospitals.org/redcap/api/",
    token = TableForR[j, 1],
    content = 'report',
    report_id = TableForR[j, 2],
    format = 'csv',
    type = 'flat',
    rawOrLabel = 'label',
    exportDataAccessGroups = 'true',
    .opts = curlOptions(ssl.verifypeer = FALSE)
  )
  backup <- read.csv(text = labelFile, stringsAsFactors = FALSE)
  write.csv(backup, file = TableForR[j, 3], row.names = FALSE)
}
I was expecting it to write 273 files to a folder on my desktop, based on the inputs in the CSV file. This would help me tremendously because I'm trying to automate an API workflow.
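One way to see which rows fail, rather than having one bad token or report ID stop the whole run, is to wrap the request in tryCatch. This is a hedged sketch, not from the original post; it only adds error reporting around the existing loop body:
for (j in seq_len(nrow(TableForR))) {
  tryCatch({
    labelFile <- RCurl::postForm(
      uri = "https://redcap.uhhospitals.org/redcap/api/",
      token = TableForR[j, 1],
      content = 'report',
      report_id = TableForR[j, 2],
      format = 'csv',
      type = 'flat',
      rawOrLabel = 'label',
      exportDataAccessGroups = 'true',
      .opts = curlOptions(ssl.verifypeer = FALSE)
    )
    backup <- read.csv(text = labelFile, stringsAsFactors = FALSE)
    write.csv(backup, file = TableForR[j, 3], row.names = FALSE)
  }, error = function(e) {
    # report which row failed instead of aborting the whole run
    message("Row ", j, " failed: ", conditionMessage(e))
  })
}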
I am trying to import a dataset (spread across many CSV files) into R and afterwards write the data into a table in a PostgreSQL database.
I successfully connected to the database, created a loop to import the CSV files, and tried to run the import.
R then returns an error because my PC runs out of memory.
My question is:
Is there a way to create a loop that imports the files one after another, writes them to the PostgreSQL table, and removes them from memory afterwards?
That way I would not run out of memory.
Code that returns the memory error:
# required packages (RPostgreSQL for the driver, readr/dplyr for the import)
library(RPostgreSQL)
library(readr)
library(dplyr)

# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db     <- 'localhost'
db_port     <- '5432'
db_user     <- 'postgres'
db_password <- 'xxx'

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)

# check if the connection was successful
dbExistsTable(con, "prices")

# create a function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
Thanks in advance for the feedback!
If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, function(x){
    prices <- read.csv(x)
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
  })
}
I assume the CSV files you are importing to your database are very large? As far as I know, with the code you have written R first stores all of the data in a data frame, i.e. entirely in memory. The alternative is to read each CSV file in chunks, as you would with Python's pandas.
Calling ?read.csv shows the following arguments:
nrows: the maximum number of rows to read in. Negative and other invalid values are ignored.
skip: the number of lines of the data file to skip before beginning to read data.
Why not read, say, 5000 rows at a time into a data frame, write them to the PostgreSQL database, and repeat until the file is exhausted?
For example, for each file do the following:
number_of_lines = 5000   # number of rows to read at a time
row_skip = 0             # number of data rows to skip initially
keep_reading = TRUE      # we will change this value to stop the while loop

# read the header once so every chunk keeps the same column names
col_names <- names(read.csv(x, nrows = 1))

while (keep_reading) {
  # skip the header line plus the rows already written, then read the next chunk
  my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                      header = FALSE, col.names = col_names)
  dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)  # write to the DB
  row_skip = row_skip + number_of_lines
  # exit condition: the last chunk read fewer rows than number_of_lines
  if (nrow(my_data) < number_of_lines) {
    keep_reading = FALSE
  } # end-if
} # end-while
By doing this you break each CSV up into smaller parts. You can play around with the number_of_lines variable to reduce the number of iterations. It may seem a bit hacky with a loop involved, but it should work.
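A hedged sketch of how this chunked approach could be applied to every file found by the question's path listing. The wrapper name import_csvfiles_chunked and the chunk_size argument are assumptions for illustration, not from the original answer, and the existing connection object con from the question is assumed to be open:
# hypothetical wrapper: stream every CSV under `path` into the prices table in chunks
import_csvfiles_chunked <- function(path, chunk_size = 5000) {
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  for (x in files) {
    col_names <- names(read.csv(x, nrows = 1))   # header of the current file
    row_skip <- 0
    repeat {
      my_data <- read.csv(x, nrows = chunk_size, skip = 1 + row_skip,
                          header = FALSE, col.names = col_names)
      dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
      row_skip <- row_skip + chunk_size
      if (nrow(my_data) < chunk_size) break   # last chunk of this file
    }
  }
}

import_csvfiles_chunked("path...")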
I'm trying to loop through all the CSV files on an FTP site and upload the contents of CSVs with a certain filename to a database.
So far I've been able to
access the FTP using...
getURL(url, userpwd = userpwd, ftp.use.epsv = FALSE, dirlistonly = TRUE)
get a list of the filenames using...
unlist(strsplit(filenames, "\r\n"))
and create a dataframe with a list of the full URLs (e.g. ftp://sample#ftpserver.name.com/samplename.csv) using...
for (i in seq_along(myfiles)) {
  url_list[i, ] <- paste(url, myfiles[i], sep = '')
}
How do I loop through this dataframe, filtering for certain filenames, in order to create a new dataframe with all of the data from the relevant CSVs? (Half the files are named Type1SampleName and half Type2SampleName.)
I would then upload this data to the database.
Thanks!
Since RCurl::getURL returns the raw response directly (here, the content of each CSV), consider extending your lapply call to pass the result into read.csv using its text argument:
# VECTOR OF URLs
urls <- paste0(url, myfiles[grep("Type1", myfiles)])

# LIST OF DATA FRAMES FROM EACH CSV
mydata <- lapply(urls, function(url) {
  resp <- getURL(url, userpwd = userpwd, connecttimeout = 60)
  read.csv(text = resp)
})
Alternatively, getURL supports a callback function via its write argument:
Alternatively, if a value is supplied for the write parameter, this is returned. This allows the caller to create a handler within the call and get it back. This avoids having to explicitly create and assign it and then call getURL and then access the result. Instead, the 3 steps can be inlined in a single call.
# USER DEFINED METHOD
import_csv <- function(resp) read.csv(text = resp)
# LONG FORM NOTATION
mydata <- lapply(urls, function(url)
getURL(url, userpwd = userpwd, connecttimeout = 60, write = import_csv)
)
# SHORT FORM NOTATION
mydata <- lapply(urls, getURL, userpwd = userpwd, connecttimeout = 60, write = import_csv)
Just an update on how I finished this off and what worked for me in the end...
mydata <- lapply(urls, getURL, userpwd = userpwd, connecttimeout = 60)
Following on from the above...
i <- 1
while (i <= length(mydata)) {
  mydata1 <- paste0(mydata[[i]])
  bin <- read.csv(text = mydata1, header = FALSE, skip = 1)
  # Column renaming and formatting here
  # Uploading to database using RODBC here
  i <- i + 1
}
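For the RODBC upload step that the comment leaves out, a minimal hedged sketch could look like the following. The DSN name, credentials, and table name are placeholders for illustration, not from the original post:
library(RODBC)

# connect via an existing ODBC data source name (replace with your own DSN/credentials)
channel <- odbcConnect("my_dsn", uid = "user", pwd = "password")

# append the cleaned chunk to an existing table
sqlSave(channel, bin, tablename = "my_table", append = TRUE, rownames = FALSE)

odbcClose(channel)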
Thanks for the pointers @Parfait - really appreciated.
Like most problems it looks straightforward after you've done it!
Some background for my question: This is an R script that a previous research assistant wrote, but he did not provide any guidance to me on using it for myself. After working through an R textbook, I attempted to use the code on my data files.
What this code is supposed to do is load multiple .csv files, delete certain items/columns from them, and then write the new cleaned .csv files to a specified directory.
Currently, the files are created in the right directory with the right file names, but the .csv files themselves are empty.
I am currently getting the following error message:
Warning in fread(input = paste0("data/", str_match(pattern = "CAFAS|PECFAS", : Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: (variable names).
This is my code:
library(data.table)
library(magrittr)
library(stringr)
# create a function to delete unnecessary variables from a CAFAS or PECFAS
# data set and save the reduced copy
del.items <- function(file){
  data <- fread(input = paste0("data/",
                               str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower,
                               "/raw/", file),
                sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE,
                skip = 0, colClasses = "character", data.table = FALSE)
  data <- data[-grep(pattern = "^(CA|PEC)FAS_E[0-9]+(TR?(Initial|[0-9]+|Exit)|SP[a-z])_(G|S|Item)[0-9]+$",
                     x = names(data))]
  write.csv(data,
            file = paste0("data/",
                          str_match(pattern = "CAFAS|PECFAS", string = file) %>% tolower,
                          "/items-del/",
                          sub(pattern = "ExportData_", x = file, replacement = "")) %>% tolower,
            row.names = FALSE)
}
# delete items from all cafas data sets
cafas.files <- list.files("data/cafas/raw", pattern = ".csv")
for (file in cafas.files){
  del.items(file)
}

# delete items from all pecfas data sets
pecfas.files <- list.files("data/pecfas/raw", pattern = ".csv")
for (file in pecfas.files){
  del.items(file)
}
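The warning suggests fread is rejecting the first line of the file as a header, so the columns may be misnamed and the filtered result can come out empty. A small hedged diagnostic, not part of the original script, is to inspect the first few raw lines of one problem file before changing the fread call (the file name below is hypothetical):
# look at the raw header of one exported file to see why fread rejects it
raw_lines <- readLines("data/cafas/raw/ExportData_example.csv", n = 3)  # hypothetical file name
print(raw_lines)

# compare the number of comma-separated fields in the header vs. the first data rows
lengths(strsplit(raw_lines, ","))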
I'm wondering if it would be possible to extract the filename from a file.choose() call embedded within a read.csv call. Right now I'm doing this in two steps, but the user has to select the same file twice in order to get both the data (CSV) and the filename for use in a function I run. I want the user to only have to select the file once, so that I can use both the data and the name of the file.
Here's what I'm working with:
data <- read.csv(file.choose(), skip = 1)
name <- basename(file.choose())
I'm running OS X if that helps, since I think file.choose() has different behavior depending on the OS. Thanks in advance.
Why use an embedded file.choose() call at all? Store the chosen path once and reuse it:
filename <- file.choose()
data <- read.csv(filename, skip=1)
name <- basename(filename)
Use this:
df = read.csv(file.choose(), sep = "<use relevant separator>", header = T, quote = "")
Separators are generally a comma (,) or a semicolon (;).
Example:
df = read.csv(file.choose(), sep = ",", header = T, quote = "")
Use:
df = df[,!apply(is.na(df), 2, all)] # works for every data format
to remove columns that are entirely blank (all NA).
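A small hedged illustration of what that line does (the toy data frame is made up for the example):
# toy data frame with one column that is entirely NA
df <- data.frame(a = 1:3, b = NA, c = c("x", "y", "z"))

apply(is.na(df), 2, all)                 # a: FALSE, b: TRUE, c: FALSE
df <- df[, !apply(is.na(df), 2, all)]
df                                       # only columns a and c remain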