Creating database with csv files in Rstudio - r

I was trying to create a database, and when I looked it up online, I found this tutorial (here).
The step it took was to use
my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)
When I do file.exists("database.sqlite"), it prints FALSE. I was wondering if there's a way to get "database.sqlite" so I can finish creating this database? Is it from a package? Any help would be appreciated!

The file that you created is portal-database-output.sqlite under the data/ directory (the path stored in my_db_file), not database.sqlite. If you run
file.exists("data/portal-database-output.sqlite")
then it should return TRUE.
You need to read in the data, create the database, and then add your data to it.
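library(tidyverse)
dir.create("data", showWarnings = FALSE) # make sure the data/ directory exists
download.file("https://ndownloader.figshare.com/files/3299483",
"data/species.csv") # save where read_csv() below looks for it
species <- read_csv("data/species.csv")
my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)
copy_to(my_db, species, temporary = FALSE) # copy the data frame that was just read; keep it in the file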
Output
my_db
src: sqlite 3.35.5 [portal-database-output.sqlite]
tbls: species, sqlite_stat1, sqlite_stat4
file.exists("data/portal-database-output.sqlite")
[1] TRUE
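
For reference, the same check can be sketched with DBI and RSQLite directly (src_sqlite() has been deprecated in newer dplyr releases). This is only a minimal sketch, not the tutorial's method: the temporary path and the two toy rows are made up for illustration.

```r
library(DBI)

# Hypothetical path in a temporary directory; the tutorial uses
# "data/portal-database-output.sqlite" instead
db_file <- tempfile(fileext = ".sqlite")

# Connecting to a path that does not exist yet starts a new database there
con <- dbConnect(RSQLite::SQLite(), db_file)

# Toy rows standing in for the contents of species.csv
species <- data.frame(species_id = c("S1", "S2"),
                      genus = c("GenusA", "GenusB"))
dbWriteTable(con, "species", species)

tables <- dbListTables(con)
dbDisconnect(con)

# The file to check is the one whose path was passed to dbConnect()
db_exists <- file.exists(db_file)
```

The point is the same as above: file.exists() has to be given the path that was used to create the database, not an unrelated name such as "database.sqlite".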

Related

Using unz() to read in SAS data set into R

I am trying to read in a data set from SAS using the unz() function in R. I do not want to unzip the file. I have successfully used the following to read one of them in:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files))
That works great. I'm able to read the data set in and conduct the analysis. When I try to read in another data set, though, I encounter an error:
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files))
Error in read_connection_(con, tempfile()) :
Evaluation error: error reading from the connection.
I have read other questions on here saying that the file path may be incorrectly specified. Some answers mentioned submitting list.files() to the console to see what is listed.
list.files()
[1] "example_data.zip" "data.zip"
As you can see, I can see the folders, and I was successfully able to read the data set in from "example_data.zip", but I cannot access the data.zip folder.
What am I missing? Thanks in advance.
Your "dir2_files" is a character vector holding the names of all the files in "data.zip", while unz() needs the name of a single file. So, for example, if the files that you want to read have their names at position "k" in "dir_files" and position "j" in "dir2_files", update your script like this:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files[k]))
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files[j]))
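
If the position is not known in advance, the index can be found by matching on the file name. A minimal sketch, assuming the archive holds exactly one .sas7bdat file; the listing below is hypothetical and stands in for unzip("data.zip", list = TRUE)$Name.

```r
# Hypothetical archive listing (stand-in for unzip("data.zip", list = TRUE)$Name)
dir2_files <- c("readme.txt", "survey.sas7bdat", "codebook.pdf")

# Positions of the entries whose names end in .sas7bdat
j <- grep("\\.sas7bdat$", dir2_files)

# unz() takes a single file name, so stop early if the match is ambiguous
stopifnot(length(j) == 1)

target <- dir2_files[j]
```

The resulting target would then be passed to unz() in place of the whole vector.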

How to use Rscript with readr to get data from aws s3

I have some R code with readr package that works well on a local computer - I use list.files to find files with a specific extension and then use readr to operate on those files found.
My question: I want to do something similar with files in AWS S3 and I am looking for some pointers on how to use my current R code to do the same.
Thanks in advance.
What I want:
Given AWS folder/file structure like this
- /folder1/subfolder1/quant.sf
- /folder1/subfolder2/quant.sf
- /folder1/subfolder3/quant.sf
and so on where every subfolder has the same file 'quant.sf', I would like to get a data frame which has the S3 paths and I want to use the R code shown below to operate on all the quant.sf files.
Below, I am showing R code that works currently with data on a Linux machine.
get_quants <- function(path1, ...) {
  additionalPath = list(...)
  suppressMessages(library(tximport))
  suppressMessages(library(readr))
  salmon_filepaths=file.path(path=path1,list.files(path1,recursive=TRUE, pattern="quant.sf"))
  samples = data.frame(samples = gsub(".*?quant/salmon_(.*?)/quant.sf", "\\1", salmon_filepaths) )
  row.names(samples)=samples[,1]
  names(salmon_filepaths)=samples$samples
  # IF no tx2Gene available, we will only get isoform level counts
  salmon_tx_data = tximport(salmon_filepaths, type="salmon", txOut = TRUE)
  ## Get transcript count summarization
  write.csv(as.data.frame(salmon_tx_data$counts), file = "tx_NumReads.csv")
  ## Get TPM
  write.csv(as.data.frame(salmon_tx_data$abundance), file = "tx_TPM_Abundance.csv")
  if(length(additionalPath) > 0) {
    tx2geneFile = additionalPath[[1]]
    my_tx2gene=read.csv(tx2geneFile,sep = "\t",stringsAsFactors = F, header=F)
    salmon_tx2gene_data = tximport(salmon_filepaths, type="salmon", txOut = FALSE, tx2gene=my_tx2gene)
    ## Get Gene count summarization
    write.csv(as.data.frame(salmon_tx2gene_data$counts), file = "tx2gene_NumReads.csv")
    ## Get TPM
    write.csv(as.data.frame(salmon_tx2gene_data$abundance), file = "tx2gene_TPM_Abundance.csv")
  }
}
I find it easiest to use the aws.s3 R package for this. In this case, you would use the s3read_using() and s3write_using() functions to read from and write to S3. Like this:
library(aws.s3)
my_tx2gene=s3read_using(FUN=read.csv, object="[path_in_s3_to_file]",sep = "\t",stringsAsFactors = F, header=F)
It basically is a wrapper around whatever function you want to use for file input/output. Works great with read_json, saveRDS, or anything else!
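
For the "data frame of S3 paths" part of the question, one possible approach is to filter a listing of object keys. This is a sketch under assumptions: the keys below are hypothetical, and in practice the listing might come from something like aws.s3::get_bucket_df() (bucket name and prefix would be your own). Sample names are taken from the parent folder here, which differs from the regex in the question.

```r
# Hypothetical object keys; a real listing might come from, for example,
# aws.s3::get_bucket_df(bucket = "my-bucket", prefix = "folder1/")$Key
keys <- c("folder1/subfolder1/quant.sf",
          "folder1/subfolder2/quant.sf",
          "folder1/subfolder2/logs/run.log",
          "folder1/subfolder3/quant.sf")

# Keep only objects whose file name is exactly quant.sf
quant_keys <- grep("(^|/)quant\\.sf$", keys, value = TRUE)

# One row per file: sample name from the parent folder, plus the key
quants <- data.frame(
  sample = basename(dirname(quant_keys)),
  key = quant_keys,
  stringsAsFactors = FALSE
)
```

Each key in quants$key could then be passed as the object argument of s3read_using().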

How to get a vector of the file names contained in a tempfile in R?

I am trying to automatically download a bunch of zip files using R. These archives contain a wide variety of files, but I only need to load one as a data.frame to post-process it. It has a unique name, so I could catch it with str_detect(). However, using tempfile(), I cannot get a list of all files within it using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt"), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
A zip archive is a single file rather than a directory, so list.files() returns character(0) for it. Following what was said before, download to a temporary file and list the archive's contents with unzip() instead:
temp <- tempfile(fileext = ".zip")
download.file("https://url/file.zip", destfile = temp)
files <- unzip(temp, list = TRUE)$Name

Can't append data to sqlite3 with dplyr db_write_table()

Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571 but the documentation says db_insert_into() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite( "my_db.sqlite3", create = TRUE) # create src
copy_to( my_db, iris, "my_table", temporary = FALSE) # create table
newdf = iris # create new data
db_write_table( con = my_db$con, table = "my_table", values = newdf) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table( con = my_db$con, table = "my_table", values = newdf, append = TRUE) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() instead of db_insert_into(), since it can't be generalized across backends.
And you shouldn't use the tidyverse versions rather than the relevant DBI:: versions, since the tidyverse helper functions are for internal use and are not designed to be robust enough for use by end users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
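
Calling DBI directly, an append can be sketched as follows. This is a minimal example under assumptions, not code from the linked thread: it uses an in-memory SQLite database via RSQLite instead of "my_db.sqlite3", and reuses iris as both the initial and the new data.

```r
library(DBI)

# In-memory database for illustration; pass a file path to persist it
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create the table, then append more rows to it
dbWriteTable(con, "my_table", iris)
dbWriteTable(con, "my_table", iris, append = TRUE)

# Count the rows to confirm that both writes landed
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM my_table")$n

dbDisconnect(con)
```

Here n_rows should be twice nrow(iris), since the same 150 rows were written twice.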

Load data from MongoDB into R

I am trying to query my MongoDB database from R.
I think I lost part of it in the process.
Does R have any limit, and how can I ensure all my records are loaded into R?
Code:
# inspect number of record in mongodb
db.complaints.count()
>395 853
# write a query to load data into R
library(dplyr)
complaints = data.frame(stringsAsFactors = FALSE)
db = "customers.complaints"
cursor = mongo.find(mongo, db)
i = 1
while (mongo.cursor.next(cursor))
{
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors=F)
  complaints = rbind.fill(complaints, tmp.df)
}
I get [1] 47077 15 after checking the loading in R with dim(complaints).
How can I make sure I get all the records from my collection into R?
http://www.analyticbridge.com/profiles/blogs/time-issue-in-creating-a-huge-data-frame-from-mongodb-collection
The code at the link above, which uses environment variables, might help you. Please comment here if you find a solution.
