Adding value to existing database table in RSQLite - r

I am new to RSQLite.
I have an input document in text format in which values are seperately by '|'
I created a table with the required variables (dummy code as follows)
db<-dbconnect(SQLite(),dbname="test.sqlite")
dbSendQuery(conn=db,
"CREATE TABLE TABLE1(
MARKS INTEGER,
ROLLNUM INTEGER
NAME CHAR(25)
DATED DATE)"
)
However I am struck at how to import values into the created table.
I cannot use INSERT INTO Values command as there are thousands of rows and more than 20+ columns in the original data file and it is impossible to manually type in each data point.
Can someone suggest an alternative efficient way to do so?

You are using a scripting language. The deal of this is literally to avoid manually typing each data point. Sorry.
You have two routes:
1: You have corrected loaded a database connection and created an empty table in your SQLite database. Nice!
To load data into the table, load your text file into R using e.g. df <-
read.table('textfile.txt', sep='|') (modify arguments to fit your text file).
To have a 'dynamic' INSERT statement, you can use placeholders. RSQLite allows for both named or positioned placeholder. To insert a single row, you can do:
dbSendQuery(db, 'INSERT INTO table1 (MARKS, ROLLNUM, NAME) VALUES (?, ?, ?);', list(1, 16, 'Big fellow'))
You see? The first ? got value 1, the second ? got value 16, and the last ? got the string Big fellow. Also note that you do not enclose placeholders for text in quotation marks (' or ")!
Now, you have thousands of rows. Or just more than one. Either way, you can send in your data frame. dbSendQuery has some requirements. 1) That each vector has the same number of entries (not an issue when providing a data.frame). And 2) You may only submit the same number of vectors as you have placeholders.
I assume your data frame, df contains columns mark, roll, and name, corrsponding to the columns. Then you may run:
dbSendQuery(db, 'INSERT INTO table1 (MARKS, ROLLNUM, NAME) VALUES (:mark, :roll, :name);', df)
This will execute an INSERT statement for each row in df!
TIP! Because an INSERT statement is execute for each row, inserting thousands of rows can take a long time, because after each insert, data is written to file and indices are updated. Insert, enclose it in an transaction:
dbBegin(db)
res <- dbSendQuery(db, 'INSERT ...;', df)
dbClearResult(res)
dbCommit(db)
and SQLite will save the data to a journal file, and only save the result when you execute the dbCommit(db). Try both methods and compare the speed!
2: Ah, yes. The second way. This can be done in SQLite entirely.
With the SQLite command utility (sqlite3 from your command line, not R), you can attach a text file as a table and simply do a INSERT INTO ... SELECT ... ; command. Alternately, read the text file in sqlite3 into a temporary table and run a INSERT INTO ... SELECT ... ;.
Useful site to remember: http://www.sqlite.com/lang.html

A little late to the party, but DBI provides dbAppendTable() which will write the contents of a dataframe to an SQL table. Column names in the dataframe must match the field names in the database. For your example, the following code would insert the contents of my random dataframe into your newly created table.
library(DBI)
db<-dbConnect(RSQLite::SQLite(),dbname=":memory")
dbExecute(db,
"CREATE TABLE TABLE1(
MARKS INTEGER,
ROLLNUM INTEGER,
NAME TEXT
)"
)
df <- data.frame(MARKS = sample(1:100, 10),
ROLLNUM = sample(1:100, 10),
NAME = stringi::stri_rand_strings(10, 10))
dbAppendTable(db, "TABLE1", df)

I don't think there is a nice way to do a large number of inserts directly from R. SQLite does have a bulk insert functionality, but the RSQLite package does not appear to expose it.
From the command line you may try the following:
.separator |
.import your_file.csv your_table
where your_file.csv is the CSV (or pipe delimited) file containing your data and your_table is the destination table.
See the documentation under CSV Import for more information.

Related

Is it possible to import a CSV file to an existing table without the headers being included?

I'm trying to import a CSV file to a table that is empty but already exists in an SQLite database. For example:
sqlite> CREATE TABLE data (...);
sqlite> .mode csv
sqlite> .import mydata.csv data
I have created the table in advance because I'd like to specify a primary key, data types, and foreign key constraints. This process works as expected, but it unfortunately includes the header row from the CSV file in the table.
Here's what I've learned from the SQLite docs regarding CSV imports:
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
So basically, I get extra data because I've created the table in advance. Is there a flag to change this behavior? If not, what's the best workaround?
The sqlite3 command-line shell has no such flag.
If you have a sufficiently advanced OS, you can use an external tool to split off the first line:
sqlite> .import "|tail -n +2 mydata.csv" data
You can also use the --skip 1 option with .import as documented on the sqlite3 website and this SO Answer. So, you can use the following command
.import --csv --skip 1 mydata.csv data

sqlloader: how to use substr in when-clause correctly?

I'm having a problem that I thought to be rather common, but trying to look it up in the "Oracle Database 10g2 Utilities_b14215.pdf" didn't help. After that I've surfed through numerous threads but no luck so far.
I'm having a tab-delimited file (x'09') e. g. name, userid, persnr. The values for the userids begin with either P, R or T e. g. P2198, P2199, R7288, T1229.
I want to load only the records with userids beginning with P.
Isolating a single record with a controlfile like this works splendidly:
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN USERID = 'P2198'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
But every attempt at using SUBSTR in the when-clause fails.
This:
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN SUBSTR(USERID, 1, 1) = 'P'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
ends in an SQL*Loader-350: Syntax-Error.
This
OPTIONS (SKIP=1)
LOAD DATA
INFILE UserlistLoader.dat
APPEND
INTO TABLE Z_USERLIST
WHEN "SUBSTR(:USERID, 1, 1)" = 'P'
FIELDS TERMINATED BY x'09'
TRAILING NULLCOLS
(name, userid, persnr)
ends in an SQL*Loader-403: Referenced column USERID not present in table Z_USERLIST.
But IT IS PRESENT - as the first example proves. I've found that the column should be preceded by : but that obviously isn't the issue.
What am I doing wrong?
From SQL Loader docs the left-hand side of a WHEN condition can only be a full field name e.g. USERID or a position spec e.g. (3:5).
The docs aren't very clear though on what is allowed - e.g. can LIKE be used as the operator?
USERID LIKE 'P%'
I strongly suspect it can't though.
I would load the entire file into a staging table that matches the file layout, then run a procedure that inserts the rows you want from there into the production table. That is a more common way to handle loads with criteria like this without having to edit source data.
If you can preprocess the source file, move the userid to the first field or copy the first letter of the userid to it's own field and construct the WHEN like this so sqlldr looks at the first position (this will cause sqlldr to return non-zero though, as not all rows meet WHEN clause criteria):
WHEN (1) = 'P'

How to pull column names from multiple tables using R

Sorry in advance due to being new to Rstudio...
There are two parts to this question:
1) I have a large database that has almost 6,000 tables in it. Many of these tables have no data in them. Is there a code using R to only pull a list of tables names that have data in them?
I know how to pull a list of all table names and how to pull specific table data using the code below..
test<-odbcDriverConnect('driver={SQL Server};server=(SERVER);database=(DB_Name);trusted_connection=true')
rest<-sqlQuery(test,'select*from information_schema.tables')
Table1<-sqlFetch(test, "PROPERTY")
Above is the code I use to access the database and tables.
"test" is the connection
"rest" shows the list of 5,803 tables names.. one of which is called "PROPERTY"
"Table1" is simply pulling one of the tables named "PROPERTY".
I am looking to make "rest" only show the data tables that have data in them.
2) My ultimate goal, which leads to the second question, is to create a table that shows a list of every table from this database in column#1 and then column 2,3,4,etc... would include every one of the column headers that is contained in each table. Any idea how do to that?
Thanks so much!
The Tables object below returns a data frame giving all of the tables in the database and how many rows are in each table. As a condition, it requires that any table selected have at least one record. This is probably the fastest way to get your list of non-empty tables. I pulled the query to get that information from https://stackoverflow.com/a/14163881/1017276
My only reservation about that query is that it doesn't give the schema name, and it is possible to have tables with the same name in different schemas. So this is likely only going to work well within one schema at a time.
library(RODBCext)
Tables <-
sqlExecute(
channel = test,
query = "SELECT T.name TableName, I.rows Records
FROM sysobjects t, sysindexes i
WHERE T.xtype = ? AND I.id = T.id AND I.indid IN (0,1) AND I.rows > 0
ORDER BY TableName;",
data = list(xtype = "U"),
fetch = TRUE,
stringsAsFactors = FALSE
)
This next part uses the tables you found above and then gets the column information from each of those tables. Lastly, it makes on single data frame with all of the column names.
Columns <-
lapply(Tables$TableName,
function(x) sqlColumns(test, x))
Columns <- do.call("rbind", Columns)
sqlColumns is a function in RODBC.
sqlExecute is a function in RODBCext that allows for parameterized queries. I tend to use that anytime I need to use quoted strings in a query.

SQLite GROUP BY

I am using DB Browser for SQLite to extract some interesting data from my database but I encountered one big problem with GROUP BY statement.
Even the most basic SELECT I can imagine is not working properly.
(Filename nvarchar(2147483647))
SELECT FileName FROM TableName WHERE FileName LIKE '%Nieminen%' GROUP BY FileName gives 5 rows even though I know that there are 9 distinct FileNames containing the phrase 'Nieminen' (I've browsed it).
Can it be possible that GROUP BY in sqlite compares only N (e.g. 10) initial characters? From my observation it might be true...
Any clues?
why don't you try out the following:
SELECT DISTINCT FileName FROM TableName WHERE FileName LIKE '%Nieminen%'

PostgreSQL, R and timestamps with no time zone

I am reading a big csv (>1GB big for me!). It contains a timestamp field.
I read it (100 rows to start with ) with fread from the excellent data.table package.
ddfr <- fread(input="~/file1.csv",nrows=100, header=T)
Problem 1 (RESOLVED): the timestamp fields (called "ts" and "update"), e.g. "02/12/2014 04:40:00 AM" is converted to string. I convert the fields back to timestamp with lubridate package mdh_hms. Splendid.
ddfr$ts <- data.frame( mdy_hms(ddfr$ts))
Problem 2 (NOT RESOLVED): The timestamp is created with time zone as per POSIXlt.
How do I create in R a timestamp with NO TIME ZONE? is it possible??
Now I use another (new) great package, PivotalR to write the dataframe to PostGreSQL 9.3 using as.db.data.frame. It works as a charm.
x <- as.db.data.frame(ddfr, table.name= "tbl1", conn.id = 1)
Problem 3 (NOT RESOLVED): As the original dataframe timestamp fields had time zones, a table is created with the fields "timestamp with time zone". Ultimately the data needs to be stored in a table with fields configured as "timestamp without time zone".
But in my table in Postgres the data is stored as "2014-02-12 04:40:00.0", where the .0 at the end is the UTC offset. I think I need to have "2014-02-12 04:40:00".
I tried
ALTER TABLE tbl ALTER COLUMN ts type timestamp without time zone;
Then I copied across. While Postgres accepts the ALTER COLUMN command, when I try to copy (using INSERT INTO tbls SELECT ...) I get an error:
"column "ts" is of type timestamp without time zone but expression is of type text
Hint: You will need to rewrite or cast the expression."
Clearly the .0 at the end is not liked (but why then Postgres accepts the ALTER COLUMN? boh!).
I tried to do what the error suggested using CAST in the INSERT INTO query:
INSERT INTO tbl2 SELECT CAST(ts as timestamp without time zone) FROM tbl1
But I get the same error (including the suggestion to use CAST aargh!)
The table directly created by PivotalR (based on the dataframe) has this CREATE script:
CREATE TABLE tbl2
(
businessid integer,
caseno text,
ts timestamp with time zone
)
WITH (
OIDS=FALSE
);
ALTER TABLE tbl1
OWNER TO mydb;
The table I'm inserting into has this CREATE script:
CREATE TABLE tbl1
(
id integer NOT NULL DEFAULT nextval('bus_seq'::regclass),
businessid character varying,
caseno character varying,
ts timestamp without time zone,
updated timestamp without time zone,
CONSTRAINT busid_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE tbl1
OWNER TO postgres;
My apologies for the convoluted explanation, but potentially a solution could be found at any step in the chain, so I preferred to put all my steps in one question. I am sure there has to be a simpler method...
I think you're confused about copying data between tables.
INSERT INTO ... SELECT without a column list expects the columns from source and destination to be the same. It doesn't magically match up columns by name, it'll just assign columns from the SELECT to the INSERT from left to right until it runs out of columns, at which point any remaining cols are assumed to be null. So your query:
INSERT INTO tbl2 SELECT ts FROM tbl1;
isn't doing this:
INSERT INTO tbl2(ts) SELECT ts FROM tbl1;
it's actually picking the first column of tbl2, which is businessid, so it's really attempting to do:
INSERT INTO tbl2(businessid) SELECT ts FROM tbl1;
which is clearly nonsense, and no casting will fix that.
(Your error in the original question doesn't match your tables and queries, so the details might be different as you've clearly made a mistake in mangling/obfuscating your tables or posted a newer version of the tables than the error. The principle remains.)
It's generally a really bad idea to assume your table definitions won't change and column order won't change anyway. So always be explicit about columns. In this case I think your intention might have actually been:
INSERT INTO tbl2(businessid, caseno, ts)
SELECT CAST(businessid AS integer), caseno, ts
FROM tbl1;
Note the cast, because the type of businessid is different between the two tables.

Resources