Selecting unique rows using sqldf package in R

I have a CSV file that has ~1.9M rows and 32 columns. I also have limited RAM, which makes loading it into memory very inconvenient. As a result I am thinking of using a database, but I do not have any in-depth knowledge of the subject, and I have looked around on this site but found no viable solutions so far.
The CSV file looks like this:
Case,Event,P01,P02,P03,P04,P05,P06,P07,P08,P09,P10,P11,P12,P13,P14,P15,P16,P17,P18,P19,P20,P21,P22,P23,P24,P25,P26,P27,P28,P29,P30
C000039,E97553,8,10,90,-0.34176313227395744,-5.581162038780728E-4,-0.12090388100201072,-1.5172412910939355,-0.9075283173030568,2.0571877671625742,-0.002902632819930783,-0.6761896565590585,-0.7258602353522214,0.8684602429202587,0.0023189312896576167,0.002318939470525324,-0.1881462494296103,-0.0014303471592995315,-0.03133299206977217,7.72338072867324E-4,-0.08952068388668191,-1.4536398437657685,-0.020065144945600275,-0.16276139919188118,0.6915962670997067,-1.593412697264055,-1.563877781707804,-1.4921751129092755,4.701551108078644,6,-0.688302560842075
C000039,E23039,8,10,90,-0.3420173545012358,-5.581162038780728E-4,-1.6563770995734233,-1.5386562526752448,-1.3604342580422861,2.1025445031625525,-0.0028504751366762804,-0.6103972392687121,-2.0390388918403284,-1.7249948885013526,0.00231891181914203,0.0023189141684282384,-0.18603688853814693,-0.0014303471592995315,-0.03182759137355937,0.001011754948131039,0.13009444290656555,-1.737249614361576,-0.015763602969926262,-0.16276139919188118,0.7133868949811379,-1.624962995908364,-1.5946762525901037,-1.5362787555380522,4.751479927607516,6,-0.688302560842075
C000039,E23039,35,10,90,-0.3593468363273839,-5.581162038780728E-4,-2.2590624066428937,-1.540784192984501,-1.3651511418164592,0.05539868728273849,-0.00225912499740972,0.20899232681704485,-2.2007336302050633,-2.518401278903022,0.0023189850665203673,0.0023189834133465186,-0.1386548782028836,-0.0013092574968056093,-0.0315006293688149,9.042390365542781E-4,-0.3514180333671346,-1.8007561969675518,-0.008593259125791147,-2.295351187387221,0.6329101442826701,-1.8095530459660578,-1.7748676145152822,-1.495347406256394,2.553693742122162,34,-0.6882806822066699
....
....
up to 1.9M rows
As you can see, the 'Case' column repeats itself, but I want to get only the unique records before importing the data into a data frame. So I used this:
f<-file("test.csv")
bigdf <- sqldf("select * from 'f' where Case in (select Case from 'f' group by Case having count(*) = 1)", dbname = tempfile(), file.format = list(header = T, row.names = F))
However I get this error:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near "in": syntax error)
Is there something obvious I am missing here?
Much thanks in advance.

CASE is a keyword, so you have to quote this column name as "Case" in your query.
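A minimal sketch of the quoted form on a toy data frame, assuming the sqldf package is installed. The data frame df and its values are made up for illustration; only the column names come from the question. Double quotes make SQLite read "Case" as a column name rather than the CASE keyword.

```r
library(sqldf)

# Toy data standing in for the CSV (hypothetical values).
df <- data.frame(
  Case  = c("C000039", "C000039", "C000040"),
  Event = c("E97553", "E23039", "E11111"),
  stringsAsFactors = FALSE
)

# "Case" is double-quoted everywhere it appears, so it is parsed as an identifier.
# The subquery keeps only the Case values that occur exactly once.
unique_cases <- sqldf('
  select * from df
  where "Case" in (select "Case" from df group by "Case" having count(*) = 1)
')
print(unique_cases)
```

The same quoting should carry over to the file-based call in the question, with the file connection in place of df.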

For those who want unique rows using sqldf, use DISTINCT:
newdf <- sqldf("SELECT DISTINCT * FROM df") # Get unique rows
sqldf uses SQLite Syntax by default.
newdf <- sqldf("SELECT DISTINCT name FROM df") # Get unique column values
newdf <- sqldf("SELECT *, COUNT(DISTINCT name) as NoNames FROM df GROUP BY whatever") # Get a count of unique names

If you use "Case" in sqldf in R, you should put a "," before "Case". Because the "Case" query is the whole line, you should make it separate.

Related

Using clinicaltrials.gov database in R

I am trying to use R to access the clinicaltrials.gov AACT database to create a list of facility_investigators for a specific topic.
The following code is an example of how to get a list of all clinical trials on the topic TP53
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'studies')
x = study_tbl %>% filter(official_title %like% '%TP53%') %>% collect()
Similarly, if I want a list of principal investigators,
library(dplyr)
library(RPostgreSQL)
aact = src_postgres(dbname = 'aact',
host = "aact-db.ctti-clinicaltrials.org",
user = 'aact',
password = 'aact')
study_tbl = tbl(src=aact, 'facility_investigators')
I am unable to make a list on only TP53 facility_investigators. Something like TP53 & facility_investigators. Any help would be appreciated
This is a link where some explanation is provided, but my problem is not resolved - http://www.cancerdatasci.org/post/2017/03/approaches-to-accessing-clinicaltrials.gov-data/
Is this what you're asking? You're pulling from two different tables in the same database: the first one is 'studies' and the second one is 'facility_investigators'. What you need to do is run head() on each of the tables (or glimpse() or str()) and see if the two tables have a common variable you can merge on after you load them into R. If they do, then you would do something like this:
library(dplyr)
# study_tbl is the facility_investigators table from the question; collect() loads it into R first
merged_table <- inner_join(x, collect(study_tbl), by = "common column")
If the columns have different names, it would look like:
library(dplyr)
merged_table <- inner_join(x, collect(study_tbl), by = c("x_column_name" = "study_tbl_column_name"))
From there you can limit your dataset to just facility investigators that have the characteristics you want.
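The filter-then-join idea can be sketched on toy data. This uses base R merge() instead of dplyr so that it runs without extra packages. The key column nct_id and all values here are assumptions for illustration; check the real tables for the actual shared column.

```r
# Toy stand-ins for the two tables (hypothetical rows; nct_id as the key is an assumption).
studies <- data.frame(
  nct_id         = c("N1", "N2", "N3"),
  official_title = c("TP53 in lung cancer", "Aspirin trial", "Mutant TP53 study"),
  stringsAsFactors = FALSE
)
investigators <- data.frame(
  nct_id = c("N1", "N1", "N2", "N4"),
  name   = c("Smith", "Jones", "Lee", "Kim"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the studies whose title mentions TP53.
tp53 <- studies[grepl("TP53", studies$official_title, fixed = TRUE), ]

# Step 2: inner join on the shared key; rows without a match on both sides are dropped.
merged <- merge(tp53, investigators, by = "nct_id")
print(merged)
```

Study N3 matches the filter but has no investigator rows, so it does not appear in the result; that is the defining behaviour of an inner join.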
If you want to do it in one PostgreSQL query you can do it like so. For more information about this syntax in particular see page 18:
con <- dbConnect() # use the same parameters you use above to connect
# String literals need single quotes in PostgreSQL; double quotes are for identifiers.
query <- dbSendQuery(con,
'select s.*, fi.*
from (select * from studies where official_title like \'%TP53%\')
as s
inner join facility_investigators as fi
on s."joining column" = fi."joining column"'
)
r_dataset <- fetch(query)
# I would just close the connection in RStudio using the connection tab.
The above query has an inner join in the main query and a subquery in the FROM clause. The subquery performs the filtering you were trying to do in R. It essentially allows you to select only from the table where the results are already filtered. An inner join combines all the records the two tables have in common and puts them into one table. If you need to join on more than one column, add an 'and' between the two conditions in the ON clause.

RODBC (SQL Server) giving inconsistent results for long character fields converted to numeric [duplicate]

I'm trying to import a SQL Server table into R. The first column of this table is a 17-digit ID.
library(RODBC)
channel <- odbcConnect("my_db", uid="my_id", pwd="my_pw")
options(digits=22)
sqlQuery(channel, "select ID from dbo.my_table where ID = 10000000047974745")
Output:
ID
1 10000000047974744
As you can see the last digit is 4 instead of 5.
I've tried to use cast(ID as char) in the select, but the result is the same. What could I do?
As joran said, using as.is = TRUE as an argument to sqlQuery() solves the problem.
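A likely reason this works: R stores ordinary numbers as doubles, which represent every integer exactly only up to 2^53 (about 9.007e15). A 17-digit ID can fall between representable values and get rounded, whereas as.is = TRUE keeps the column as text. The sketch below reproduces the rounding in base R without any database; the variable names are made up for illustration.

```r
# The ID from the question, once as a number and once as text.
id_num <- 10000000047974745      # parsed as a double
id_chr <- "10000000047974745"    # kept as a character string

# Above 2^53 doubles are spaced 2 apart, so an odd 17-digit ID cannot be stored exactly.
exact_limit <- 2^53

# Print the value that was actually stored.
shown <- sprintf("%.0f", id_num)
print(shown)    # last digit differs from the ID as typed
print(id_chr)   # text is preserved digit for digit
```

Keeping IDs as character also avoids accidental arithmetic on values that are labels rather than quantities.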

R convert query output to table

New to R programming.
I have a simple sql server query whose output looks like this :
EFFECTIVE_DATE NumberOfUser
2015-07-01 564
2015-07-02 433
2015-07-03 306
2015-07-04 50
Here's how I issue the query:
barData <- sqlQuery(sqlCon,
"select EFFECTIVE_DATE,COUNT(USER_ID) as NumberOfUser from UserTable where start_dt between '20150701' AND '20150704' group by EFFECTIVE_DATE order by EFFECTIVE_DATE")
Now I am running this query from R and want to do a barplot on this. What is the best way to do that?
Also, how do I convert any query result to a data.table that I can use for a barplot? When I try table(myList), it shows a different format altogether.
The help on sqlQuery (I don't use ODBC generally) says "On success, a data frame..." is returned. That would mean you should be able to do something like:
barplot(barData$NumberOfUser, names.arg=barData$EFFECTIVE_DATE,
xlab="Effective Date", ylab="Number of Users")
But posting the output of a dput(barData) into your question would really make it easier to help you.
Assuming that you used the sqldf package in R, an sql query of the form SELECT EFFECTIVE_DATE, NUM_OF_USERS FROM USERTABLE is executed using the sqldf(x, stringsAsFactors = FALSE,...) statement:
sql_string <- "select
effective_date
, num_of_users
from USERTABLE"
user_dates <- sqldf(sql_string, stringsAsFactors = FALSE)
resulting in a data.frame object. Use the data.table package to convert the data frame into a data table:
library(data.table)
user_dates <- as.data.table(user_dates)
The sqldf call creates a new data frame, user_dates. At minimum, sqldf requires a character string containing the SQL statement to run. Setting stringsAsFactors = FALSE makes categorical variables come back with class character rather than factor.
EDIT : Sincere apologies, didn't see you stating the package name in the question. In case you decide to use the sqldf package, creating a bar plot is a straightforward call to the barplot(height, ...) function:
barplot(user_dates$num_of_users,names.arg=user_dates$effective_date)
Also please note that the result of the sqlQuery on successful execution is a data frame and not a list:
On success, a data frame (possibly with 0 rows) or character string. On error, if errors = TRUE a character vector of error message(s), otherwise an invisible integer error code -1 (general, call odbcGetErrMsg for details) or -2 (no data, which may not be an error as some SQL statements do return no data).

How to handle column names not supported by sqldf in R

I have a data frame where some of the column names contain a dot, for example Company.1.
When I use that column in a sqldf call, it throws an error:
data = sqldf("select Company.1 from test")
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: near ".1": syntax error)
Is there any workaround so that I can use the column name as it is?
The dot has another meaning in SQL (e.g., separating table name from column name) and
is replaced by an underscore before sending the data to SQLite.
library(sqldf)
test <- data.frame( "Company.1" = 1:10 )
sqldf( 'SELECT Company_1 FROM test' )
This problem is about the . in your column name, if you change it to Company_1 it works:
data = sqldf("select Company_1 from test")
The solution for the latest update of sqldf is answered here
We only need to write the SQL statement between single quotes, and the
column names including dots between double quotes or
backticks/backquotes interchangeably.
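A minimal sketch of that quoting rule, assuming a recent sqldf version in which dots in column names are kept rather than translated to underscores. On older versions the Company_1 form shown earlier applies instead.

```r
library(sqldf)

# Data frame with a dotted column name, as in the question.
test <- data.frame(Company.1 = 1:3)

# SQL statement in single quotes, dotted column name in double quotes.
res <- sqldf('SELECT "Company.1" FROM test')
print(res)
```

Without the quotes SQLite reads Company.1 as table.column, which is where the syntax error near ".1" comes from.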

Add a dynamic value into RMySQL getQuery [duplicate]

This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package?
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal, the 'df[2]' is not interpreted by R as anything other than characters. There are going to be some ambiguities in my answer, because df in your Q is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up two ways: i) with i as a number which we use to reference the elements of df and out, or ii) with i as each element of df in turn (i.e. a, then b, ...). I will show both versions below.
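## Version i
for(i in seq_along(df)) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]] # first (only) column of the one-row result
}
dbDisconnect(con) # disconnect once, after the loop has finished
OR:
## Version ii
for(i in df) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]] # first (only) column of the one-row result
}
dbDisconnect(con) # disconnect once, after the loop has finished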
Which you use will depend on personal taste. The second (ii) version requires you to set names on the output vector out that are the same as the values in df.
Having said all that, assuming your actual SQL query is similar to the one you posted, can't you do this in a single SQL statement, using the GROUP BY clause, to group the data before computing max(ID)? Doing simple things like this in the database will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is weak currently, so I can't give an example of this.
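One possible shape for that single statement, sketched by building the SQL string in base R. It has not been run against a MySQL server; table, columna and ID are the placeholders from the question, and vals and max_id are names made up here.

```r
# Values to look up (same as the question's character vector).
vals <- c("a", "b", "c")

# Turn the vector into a quoted, comma-separated list: 'a', 'b', 'c'
in_list <- paste0("'", vals, "'", collapse = ", ")

# One query that returns max(ID) per value instead of one query per value.
sql <- paste0(
  "SELECT columna, max(ID) AS max_id FROM table ",
  "WHERE columna IN (", in_list, ") GROUP BY columna;"
)
cat(sql, "\n")

# With an open connection this would be sent as:
# res <- dbGetQuery(con, sql)
```

Pasting values straight into SQL is only reasonable for trusted input; for anything user-supplied, quoting through the database driver is the safer route.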
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'", df[2]))
Something along those lines should work.
