I've seen elsewhere on this site that a single variable can be passed to sqldf but I haven't been able to track down a way to loop though all the values in "myplace":
myplace <- ("place1", "place2", "place3", ...)
myquery <- sqldf(select things from mydata where place = myplace)
myplace.graph1 <- ggplot2(myquery)
My aim is to automate the production of a number of tables and graphs for each "myplace" element - I want to do this as my dataset is updated monthly and I have to report on it. (For this reason, I don't think grouping, as suggested by a similar query on this site, is the way to go although I stand to be corrected).
I am in the process of learning the R ropes in order to replace a bunch of muddled spreadsheets - my data is now in in sqlite but I couldn't see a way of looping through a "dbgetQuery" either.
It could well be that a completely different approach is required - I'm exploring R because the graphs look great, I can document my steps and it's open source - I would appreciate any advice.
Many thanks
1) A query is just a text string. You can manipulate that text string in all the usual ways before passing it to sqldf. See ?paste, ?paste0, ?sprintf, etc.
qry <- paste("select things from mydata where place =", myplace[1])
myplace.graph1 <- plot(sqldf(qry))
myplace.graph <- list()
for(i in seq_along(myplace))
{
qry <- paste("select things from mydata where place =", myplace[i])
myplace.graph[[i]] <- plot(sqldf(qry))
}
2) Or, without a loop:
myplace.graph <- lapply(myplace, function(x) {
qry <- paste("select things from mydata where place =", x)
plot(sqldf(qry))
}
3) Or using $.fn from the gsubfn package (which is automatically loaded by sqldf so is available) as in Example 5 on the sqldf home page:
sql <- "select things from mydata where place = '$p' "
lapply(myplace, function(p) plot(fn$sqldf(sql)))
Related
Suppose I have a data frame with one case of self-reported job titles (this is done in R):
x <- data.frame("job.title" = c("psychologist"))
I'd like to have this job title entered into a search engine on a website (this part I can do) in order to have data on these jobs pulled into a data frame (this part I can also do).
The following function does this for me:
onet.sum <- function(x) {
obj1 <- as.list(ONETr::keySearch(x)) # enter self-reported job title into ONET's search engine
job.title <- obj1[["title"]][1] # pull best-matching title
soc.code <- obj1[["code"]][1] # pull best matching title's SOC code
obj4 <- as.data.frame(cbind(job.title,soc.code))
return(obj4)
}
However, once I add a second job title in a second row...
x <- data.frame("job.title" = c("psychologist", "social worker"))
...I get this system error that I'm not sure how to diagnose.
Space required after the Public Identifier
SystemLiteral " or ' expected
SYSTEM or PUBLIC, the URI is missing
Any advice?
UPDATE
So it turns out that there are two solutions that work if I pass job titles that do not contain spaces:
Using lapply(). Make sure that the job titles do not contain spaces.
So this works:
final_data <- lapply(c("psychologist","socialworker"), onet.sum) %>%
bind_rows
...but this doesn't work:
final_data <- lapply(c("psychologist","social worker"), onet.sum) %>%
bind_rows
Use purrr's map_df() is more flexible.
result <- purrr::map_df(gsub('\s', '', x$job.title), onet.sum)
You can try with a lapply statement -
result <- do.call(rbind, lapply(x$job.title, onet.sum))
Using purrr::map_df might be shorter.
result <- purrr::map_df(x$job.title, onet.sum)
I am trying to optimize a simple R code I wrote on two aspects:
1) For loops
2) Writing data into my PostgreSQL database
For 1) I know for loops should be avoided at all cost and it's recommended to use lapply but I am not clear on how to translate my code below using lapply.
For 2) what I do below is working but I am not sure this is the most efficient way (for example doing this way versus rbinding all data into an R dataframe and then load the whole dataframe into my PostgreSQL database.)
EDIT: I updated my code with a reproducible example below.
for (i in 1:100){
search <- paste0("https://github.com/search?o=desc&p=", i, &q=R&type=Repositories)
download.file(search, destfile ='scrape.html',quiet = TRUE)
url <- read_html('scrape.html')
github_title <- url%>%html_nodes(xpath="//div[#class=mt-n1]")%>%html_text()
github_link <- url%>%html_nodes(xpath="//div[#class=mt-n1]//#href")%>%html_text()
df <- data.frame(github_title, github_link )
colnames(df) <- c("title", "link")
dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that should be completely thrashed that lapply is in any way faster than equivalent code using a for loop. For years this has been fixed, and for loops should in every case be faster than the equivalent lapply.
I will visualize using a for loop as you seem to find this more intuitive. Do however note that i work mostly in T-sql and there might be some conversion necessary.
n <- 1e5
outputDat <- vector('list', n)
for (i in 1:10000){
id <- element_a[i]
location <- element_b[i]
language <- element_c[i]
date_creation <- element_d[i]
df <- data.frame(id, location, language, date_creation)
colnames(df) <- c("id", "location", "language", "date_creation")
outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database.
##dbBegin(con) #<= might speed up might not.
dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up might not.
Using Transact-SQL you could alternatively combine the entire string into a single insert into statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
#Create the statements. here
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up might not.
dbExecute(con, statement <- paste0(
'
/*
SET NOCOCUNT ON seems to be necessary in the DBI API.
It seems to react to 'n rows affected' messages.
Note only affects this method, not the one using dbWriteTable
*/
--SET NOCOUNT ON
INSERT INTO [my table] values ', statement))
##dbCommit(con) #<= might speed up might not.
Note as i comment, this might simply fail to properly upload the table, as the DBI package seems to sometimes fail this kind of transaction, if it results in one or more messages about n rows affected.
Last but not least once the statements are made, this could be copied and pasted from R into any GUI that directly access the database, using for example writeLines(statement, 'clipboard') or writing into a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if for whatever reason DBI or alternative R packages seem to run overly slow without reason. As this seems to be somewhat of a personal project, this might be sufficient for your use.
I am working on a project where I have to map firms that have an SIC industry classification to the corresponding Fama-French industry classification. I have found that Ian Gow has gracefully created the script to do this. The script is available from the following url: https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
However, there is a glitch in the script or in the data set and for some reason, it does not work with “Siccodes30.txt”. More specifically, it does not produce the correct result (mapping) for lines related to “6726-6726 Unit inv trusts, closed-end” from the “Siccodes30.txt”. I have been trying to figure out the source of the problem, but I have not been successful.
In the post below, I have included the original script (there is some room to make it more efficient) and I have added a few lines at the end to make it work with an online example.
Original Script (I have removed comments to makes the post shorter). Again, this is not my script (the original script is in https://iangow.wordpress.com/2011/05/17/getting-fama-french-industry-data-into-r/
url4FF <- paste("http://mba.tuck.dartmouth.edu",
"pages/faculty/ken.french/ftp",
"Industry_Definitions.zip", sep="/")
f <- tempfile()
download.file(url4FF, f)
fileList <- unzip(f,list=TRUE)
trim <- function(string) {
ifelse(grepl("^\\s*$", string, perl=TRUE),"",
gsub("^\\s*(.*?)\\s*$","\\1",string,perl=TRUE))
}
extract_ff_ind_data <- function (file) {
ff_ind <- as.vector(read.delim(unzip(f, files=file), header=FALSE,
stringsAsFactors=FALSE))
ind_num <- trim(substr(ff_ind[,1],1,10))
for (i in 2:length(ind_num)) {
if (ind_num[i]=="") ind_num[i] <- ind_num[i-1]
}
sic_detail <- trim(substr(ff_ind[,1],11,100))
is.desc <- grepl("^\\D",sic_detail,perl=TRUE)
regex.ind <- "^(\\d+)\\s+(\\w+).*$"
ind_num <- gsub(regex.ind,"\\1",ind_num,perl=TRUE)
ind_abbrev <- gsub(regex.ind,"\\2",ind_num[is.desc],perl=TRUE)
ind_list <- data.frame(ind_num=ind_num[is.desc],ind_abbrev,
ind_desc=sic_detail[is.desc])
regex.sic <- "^(\\d+)-(\\d+)\\s*(.*)$"
ind_num <- ind_num[!is.desc]
sic_detail <- sic_detail[!is.desc]
sic_low <- as.integer(gsub(regex.sic,"\\1",sic_detail,perl=TRUE))
sic_high <- as.integer(gsub(regex.sic,"\\2",sic_detail,perl=TRUE))
sic_desc <- gsub(regex.sic,"\\3",sic_detail,perl=TRUE)
sic_list <- data.frame(ind_num, sic_low, sic_high, sic_desc)
return(merge(ind_list,sic_list,by="ind_num",all=TRUE))
}
FFID_30 <- extract_ff_ind_data("Siccodes30.txt")
I have added the following lines to allow testing the script:
library(gsheet)
url <-"https://docs.google.com/spreadsheets/d/1QRv8YmJv0pdhIVmkXMQC7GQuvXV21Kyjl9pVZsSPEAk/gid=1758600626"
companiesSIC <- read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE)
names(companiesSIC)
library(sqldf)
companiesFFID_30 <- sqldf("SELECT a.gvkey, a.SIC, b.ind_desc AS FF30,
b.ind_num as FFIndNUm30
FROM companiesSIC AS a
LEFT JOIN FFID_30 AS b
ON a.sic BETWEEN b.sic_low AND b.sic_high")
companiesFFID_30
Results on rows 141 and 142 are wrong. Instead of an industry number the provide a string.
Thanks
PS As I said there is room to make the script shorter (e.g., you don't need to create a separate function to remove white space, you can use trimws) but to give credit to the original author, I kept the script in its original form. However, if someone can solve the problem should also try to update the rest of the script too.
There is nothing wrong with the script. The problem is in the formatting of the two lines (141 and 142) of the txt file.
I opened the text file with a text editor, deleted and re-typed the content of these two lines. When I re-run the R script the problem was gone.
I'm having an issue running sqldf in R and special characters.
Here is a small detail of the code I'm running to give you an idea of what's going on:
First I read my data from an excel sheet (using R's xlsx package), the method xlsx2 seems to get the data correctly and characters seem to be showing special characters such as 'Ñ'
verif_oblig <- try(read.xlsx2("My computer/Filename.xlsx", sheetName = 'VERIF_OBLIG'))
if("try-error" %in% class(verif_oblig))
verif_oblig <- Empty()
Then I start running my sql query using sqldf and the resulting table seems to replace Ñ characters for Ñ. Here's the query:
verif_oblig_v2 <- sqldf("
select
a.*,
case when b.Estado is null then 'NO GENERADO'
else b.Estado end as ESTADO,
case when resultado_operacion in ('EXITO','CORRECTO')
then 'EXITO'
else 'SIN EXITO'
end as RESULTADO_ACTUAL
from
verif_oblig a left join fin2016 b
on
a.CUPS = b.CUPS_Largo and a.DIVISION = b.DIVISION")
Can anyone help me find a solution for this?
Thank you very much
I ended up solving it by simply replacing the characters that were bugging after the sql query using gsub like this:
clear_errors <- function(table, campo){
table <- as.data.frame(table)
table[,campo] <- c(gsub("Ã'","Ñ",c(tabla_entrada[,campo])))
table[,campo]<- c(gsub("é","é",c(tabla_entrada[,campo])))
table[,campo]<- c(gsub("ó", "ó",c(tabla_entrada[,campo])))
table[,campo] <- c(gsub("ú","ú",c(tabla_entrada[,campo])))
table[,campo] <- c(gsub("ñ","ñ",c(tabla_entrada[,campo])))
table[,campo] <- c(gsub("Ã","í",c(tabla_entrada[,campo])))
table[,campo] <- c(gsub("O","A",c(tabla_entrada[,campo])))
return(table)
}
It isn't the most elegant solution but it works.
I think the issue happens because xlsx formats characters as factors and probably uses a different encoding than sqldf does for them. If someone can find out exactly what's going on I'd very much appreciate it (just out of curiosity)
This question already has answers here:
Dynamic "string" in R
(4 answers)
Closed 5 years ago.
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package.
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal, the 'df[2]' is not interpreted by R as anything other than characters. There are going to be some ambiguities in my answer, because df in your Q is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up two ways: i) with i as a number which we use to reference the elements of df and out, or ii) with i as each element of df in turn (i.e. a, then b, ...). I will show both versions below.
## Version i
for(i in seq_along(df)) {
SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
out[i] <- dbGetQuery(con, SQL)
dbDisconnect(con)
}
OR:
## Version ii
for(i in df) {
SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
out[i] <- dbGetQuery(con, SQL)
dbDisconnect(con)
}
Which you use will depend on personal taste. The second (ii) version requires you to set names on the output vector out that are the same as the data inside out.
Having said all that, assuming your actual SQL Query is similar to the one you post, can't you do this in a single SQL statement, using the GROUP BY clause, to group the data before computing max(ID)? Doing simple things in the data base like this will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is weak currently, so I can't given an example of this.
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'"),df())
Something along those lines should work.