I have an RODBC connection to a database via odbcConnect, and I am reading a table into a variable.
This table has a text column, but when I subset on the text column, I get a factor.
I am looking for a stringsAsFactors=FALSE equivalent when reading a table using RODBC.
Any thoughts on how I might be able to achieve this?
Thanks!
Use the stringsAsFactors argument to sqlQuery(); see the RODBC documentation for details.
Modifying the basic example from the help page:
channel <- odbcConnect("test")
sqlSave(channel, USArrests, rownames = "State", verbose = TRUE)
options(dec=".") # optional, if DBMS is not locale-aware or set to ASCII
## note case of State, Murder, Rape are DBMS-dependent,
## and some drivers need column and table names double-quoted.
sqlQuery(channel, paste("select State, Murder from USArrests",
"where Rape > 30 order by Murder"),
stringsAsFactors=FALSE) ## your option here
close(channel)
As sqlFetch() passes arguments through via ..., it will work the same way. Just add stringsAsFactors=FALSE, or even set it globally via options().
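The argument controls the same character-to-factor conversion that data.frame() itself performs, so you can check the effect locally without a database connection:

```r
# stringsAsFactors governs whether character columns become factors
x <- data.frame(state = c("Alaska", "Nevada"), stringsAsFactors = TRUE)
class(x$state)  # "factor"

y <- data.frame(state = c("Alaska", "Nevada"), stringsAsFactors = FALSE)
class(y$state)  # "character"
```

(Since R 4.0 the global default is already FALSE, but RODBC's argument still lets you override it per call.)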
Related
I am connecting to a SQL Server database through the ODBC connection in R. I have two potential methods to get data and am trying to determine which would be more efficient. The data feeds a Shiny dashboard, so it needs to be pulled while the app is loading rather than queried on the fly as the user interacts with the app.
Method 1 is to use over 20 stored procedures to query all of the needed data and store them for use. Method 2 is to query all of the tables individually.
Here is the method I used to query one of the stored procedures:
get_proc_data <- function(proc_name, url, start_date, end_date){
  dbGetQuery(con, paste0(
    "EXEC dbo.", proc_name, " ",
    "@URL = N'", url, "', ",
    "@Startdate = '", start_date, "', ",
    "@enddate = '", end_date, "' "
  ))
}
data <- get_proc_data(proc_name, url, today(), today() %m-% years(5))
However, each of the stored procedures has a slightly different setup for the parameters, so I would have to define each of them separately.
I have started to implement Method 2, but have run into issues with iteratively querying each table.
# use dplyr create list of table names
db_tables <- dbGetQuery(con, "SELECT * FROM [database_name].INFORMATION_SCHEMA.TABLES;") %>% select(TABLE_NAME)
# use dplyr pull to create list
table_list <- pull(db_tables , TABLE_NAME)
# get a quick look at the first few rows
tbl(con, "[TableName]") %>% head() %>% glimpse()
# iterate through all table names, get the first five rows, and export to .csv
for (table in table_list){
  write.csv(
    tbl(con, table) %>% head(),
    str_glue("{getwd()}/00_exports/tables/{table}.csv")
  )
}
selected_tables <- db_tables %>% filter(TABLE_NAME %in% c("TableName1","TableName2"))
Ultimately this method was just to test how long it would take to iterate through the ~60 tables and perform the required function. I have tried putting this into a function instead but have not been able to get it to iterate through while also pulling the name of the table.
Pro/Con for Method 1: The stored procs are currently powering a metrics plug-in that was written in C++ and is displaying metrics on the webpage. This is for internal use to monitor website performance. However, the stored procedures are not all visible to me and the client needs me to extend their current metrics. I also do not have a DBA at my disposal to help with the SQL Server side, and the person who wrote the procs is unavailable. The procs are also using different logic than each other, so joining the results of two different procs gives drastically different values. For example, depending on the proc, each date will list total page views for each day or already be aggregated at the weekly or monthly scale then listed repeatedly. So joining and grouping causes drastic errors in actual page views.
Pro/Con for Method 2: I am familiar with dplyr and would be able to join the tables together to pull the data I need. However, I am not as familiar with SQL and there is no Entity-Relationship Diagram (ERD) of any sort to refer to. Otherwise, I would build each query individually.
Either way, I am trying to come up with a way to proceed with either a named function, lambda function, or vectorized method for iterating. It would be great to name each variable and assign them appropriately so that I can perform the data wrangling with dplyr.
Any help would be appreciated; I am overwhelmed with which direction to go. I researched the R equivalent of a Python list comprehension but have not been able to get a function in R to perform similarly.
> db_table_head_to_csv <- function(table) {
+ write.csv(
+ tbl(con, table) %>% head(), str_glue("{getwd()}/00_exports/bibliometrics_tables/{table}.csv")
+ )
+ }
>
> bibliometrics_tables %>% db_table_head_to_csv()
Error in UseMethod("as.sql") :
no applicable method for 'as.sql' applied to an object of class "data.frame"
Consider storing all table data in a named list (the counterpart to a Python dictionary) using lapply (the counterpart to Python's list/dict comprehension). And if you use its sibling, sapply, the elements of the character vector passed in become the names of the list elements:
# RETURN VECTOR OF TABLE NAMES
db_tables <- dbGetQuery(
con, "SELECT [TABLE_NAME] FROM [database_name].INFORMATION_SCHEMA.TABLES"
)$TABLE_NAME
# RETURN NAMED LIST OF DATA FRAMES FOR EACH DB TABLE
df_list <- sapply(db_tables, function(t) dbReadTable(con, t), simplify = FALSE)
You can extend the lambda function for multiple steps like write.csv, or use a defined method. Just be sure to return a data frame on the last line. Below uses the new pipe, |>, available in base R 4.1.0+:
db_table_head_to_csv <- function(table) {
  head_df <- dbReadTable(con, table) |> head()
  write.csv(
    head_df,
    file.path("00_exports", "bibliometrics_tables", paste0(table, ".csv"))
  )
  return(head_df)
}
df_list <- sapply(db_tables, db_table_head_to_csv, simplify = FALSE)
You lose no functionality of data frame object if stored in a list and can extract with $ or [[ by name:
# EXTRACT SPECIFIC ELEMENT
head(df_list$table_1)
tail(df_list[["table_2"]])
summary(df_list$`table_3`)
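The naming behavior can be seen without any database connection; a minimal sketch with made-up table names and local data frames:

```r
# sapply() with simplify = FALSE returns a list named by the input vector
tables <- c("table_1", "table_2")
df_list <- sapply(tables, function(t) data.frame(src = t, x = 1:3),
                  simplify = FALSE)
names(df_list)         # "table_1" "table_2"
nrow(df_list$table_1)  # 3, extracted by name as with a Python dict
```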
Looking into adding data to a table with dplyr, I saw https://stackoverflow.com/a/26784801/1653571 but the documentation says db_insert_into() is deprecated.
?db_insert_into()
...
db_create_table() and db_insert_into() have been deprecated in favour of db_write_table().
...
I tried to use the non-deprecated db_write_table() instead, but it fails both with and without the append= option:
require(dplyr)
my_db <- src_sqlite( "my_db.sqlite3", create = TRUE) # create src
copy_to( my_db, iris, "my_table", temporary = FALSE) # create table
newdf = iris # create new data
db_write_table( con = my_db$con, table = "my_table", values = newdf) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
db_write_table( con = my_db$con, table = "my_table", values = newdf, append = TRUE) # insert into
# Error: Table `my_table` exists in database, and both overwrite and append are FALSE
Should one be able to append data with db_write_table()?
See also https://github.com/tidyverse/dplyr/issues/3120
No, you shouldn't use db_write_table() as a replacement for db_insert_into(), since it can't be generalized across backends.
And you shouldn't use the tidyverse versions rather than the relevant DBI:: functions, since these tidyverse helpers are for internal use and are not designed to be robust enough for end users. See the discussion at https://github.com/tidyverse/dplyr/issues/3120#issuecomment-339034612 :
Actually, I don't think you should be using these functions at all. Despite that SO post, these are not user facing functions. You should be calling the DBI functions directly.
-- Hadley Wickham, package author.
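A sketch of the DBI route recommended above, using an in-memory RSQLite database (assuming the RSQLite package is installed); DBI::dbWriteTable() with append = TRUE does what the deprecated helpers were used for:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "my_table", iris)                 # create table
dbWriteTable(con, "my_table", iris, append = TRUE)  # insert into
n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM my_table")$n
dbDisconnect(con)
n  # 300: the 150 iris rows were appended once
```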
New to R programming.
I have a simple sql server query whose output looks like this :
EFFECTIVE_DATE NumberOfUser
2015-07-01 564
2015-07-02 433
2015-07-03 306
2015-07-04 50
Here's how I issue the query:
barData <- sqlQuery(sqlCon,
"select EFFECTIVE_DATE,COUNT(USER_ID) as NumberOfUser from UserTable where start_dt between '20150701' AND '20150704' group by EFFECTIVE_DATE order by EFFECTIVE_DATE")
Now I am running this query from R and want to do a barplot on this. What is the best way to do that?
Also, how do I convert any query result to a data.table with which I can do a barplot? When I try table(myList), it is showing a different format altogether.
The help on sqlQuery (I don't use ODBC generally) says "On success, a data frame…" is returned. That would mean you should be able to do something like:
barplot(barData$NumberOfUser, names.arg=barData$EFFECTIVE_DATE,
xlab="Effective Date", ylab="Number of Users")
But posting the output of a dput(barData) into your question would really make it easier to help you.
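In the meantime, a self-contained sketch hard-coding the four rows shown in the question (so it runs without a database):

```r
# Reconstruct the query result shown in the question
barData <- data.frame(
  EFFECTIVE_DATE = as.Date(c("2015-07-01", "2015-07-02",
                             "2015-07-03", "2015-07-04")),
  NumberOfUser   = c(564, 433, 306, 50)
)
barplot(barData$NumberOfUser,
        names.arg = format(barData$EFFECTIVE_DATE),
        xlab = "Effective Date", ylab = "Number of Users")
```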
Assuming that you used the sqldf package in R, an SQL query of the form SELECT EFFECTIVE_DATE, NUM_OF_USERS FROM USERTABLE is executed using the sqldf(x, stringsAsFactors = FALSE, ...) statement:
sql_string <- "select
effective_date
, num_of_users
from USRTABLE"
user_dates <- sqldf(sql_string, stringsAsFactors = FALSE)
resulting in a data.frame object. Use the data.table package to convert the data frame into a data table:
user_dates <- as.data.table(user_dates)
A new data frame, user_dates, will be created using the sqldf statement. The sqldf statement, at minimum, requires a character string with the SQL operation to be performed. The stringsAsFactors argument will force categorical variables to have the class character rather than factor.
EDIT : Sincere apologies, didn't see you stating the package name in the question. In case you decide to use the sqldf package, creating a bar plot is a straightforward call to the barplot(height, ...) function:
barplot(user_dates$num_of_users,names.arg=user_dates$effective_date)
Also please note that the result of the sqlQuery on successful execution is a data frame and not a list:
On success, a data frame (possibly with 0 rows) or character string. On error, if errors = TRUE, a character vector of error message(s); otherwise an invisible integer error code: -1 (general; call odbcGetErrMsg for details) or -2 (no data, which may not be an error, as some SQL statements do return no data).
I'm (very!) new to R and mysql and I have been struggling and researching for this problem for days. So I would really appreciate ANY help.
I need to complete a mathematical expression from 2 variables in two different tables. Essentially, I'm trying to figure out how old a subject was (DOB is in one table) when they were serviced (date of service is in another table). I have an identifying variable that is the same in both.
I have tried merging these:
age <- merge("tbl1", "tbl2", by = c("patient_id"), all = TRUE)
this returns:
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I have tried subsetting where I just keep the variables of interest, but it is not working because I believe subsetting only works for numbers, not characters... right?
Again, I would appreciate any help. Thanks in advance
Since you are new to databases, I think you should use dplyr here. It is an abstraction layer over many database backends, so you will not have to deal with backend-specific problems. Below is a simple example in which I:
Read the tables from MySQL
Merge the tables, assuming they share a unique key variable
The code:
library(dplyr)
library(RMySQL)
## create a connection
SDB <- src_mysql(host = "localhost", user = "foo", dbname = "bar", password = getPassword())
# reading tables
tbl1 <- tbl(SDB, "TABLE1_NAME")
tbl2 <- tbl(SDB, "TABLE2_NAME")
## merge: collect() pulls the data into R; the join can also be done with dplyr
age <- merge(collect(tbl1), collect(tbl2), all = TRUE)
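To illustrate the merge step itself with local, made-up patient data (once collected into R, the tbls behave just like these data frames):

```r
patients <- data.frame(patient_id = c(1, 2),
                       dob = as.Date(c("1980-03-01", "1992-11-15")))
visits <- data.frame(patient_id = c(1, 2, 2),
                     service_date = as.Date(c("2010-03-01", "2012-01-01",
                                              "2015-06-30")))
age <- merge(patients, visits, by = "patient_id", all = TRUE)
# age at service, in years (365.25 approximates leap years)
age$age_years <- as.numeric(age$service_date - age$dob) / 365.25
```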
This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package?
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal, the 'df[2]' is not interpreted by R as anything other than characters. There are going to be some ambiguities in my answer, because df in your Q is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up two ways: i) with i as a number which we use to reference the elements of df and out, or ii) with i as each element of df in turn (i.e. a, then b, ...). I will show both versions below.
## Version i
for(i in seq_along(df)) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]]  # one-row data frame; [[1]] extracts the value
}
dbDisconnect(con)  # disconnect once, after the loop
OR:
## Version ii
for(i in df) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]]
}
dbDisconnect(con)
Which you use will depend on personal taste. The second (ii) version requires the names set on the output vector out to match the elements of df, which we did above.
Having said all that, assuming your actual SQL query is similar to the one you posted, can't you do this in a single SQL statement, using a GROUP BY clause, to group the data before computing max(ID)? Doing simple things like this in the database will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is currently weak, so I can't give an example of this.
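As an aside, modern DBI also supports parameterized queries, which sidesteps the string-pasting entirely; a sketch against an in-memory RSQLite database (table and column names invented to mirror the question):

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "tbl",
             data.frame(ID = 1:6, columna = rep(c("a", "b", "c"), 2)))

df <- c("a", "b", "c")
out <- vapply(df, function(val) {
  # the ? placeholder is bound safely via params, no quoting needed
  as.numeric(dbGetQuery(con,
    "SELECT max(ID) AS m FROM tbl WHERE columna = ?",
    params = list(val))$m)
}, numeric(1))
dbDisconnect(con)
out  # a = 4, b = 5, c = 6
```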
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'", df[2]))
Something along those lines should work.