I'm having an issue with sqldf in R and special characters.
Here is a small excerpt of the code I'm running to give you an idea of what's going on.
First I read my data from an Excel sheet (using R's xlsx package). read.xlsx2 seems to load the data correctly, and special characters such as 'Ñ' display fine:
verif_oblig <- try(read.xlsx2("My computer/Filename.xlsx", sheetName = 'VERIF_OBLIG'))
if("try-error" %in% class(verif_oblig))
verif_oblig <- Empty()
Then I run my SQL query using sqldf, and in the resulting table the Ñ characters appear to have been replaced with Ã'. Here's the query:
verif_oblig_v2 <- sqldf("
select
a.*,
case when b.Estado is null then 'NO GENERADO'
else b.Estado end as ESTADO,
case when resultado_operacion in ('EXITO','CORRECTO')
then 'EXITO'
else 'SIN EXITO'
end as RESULTADO_ACTUAL
from
verif_oblig a left join fin2016 b
on
a.CUPS = b.CUPS_Largo and a.DIVISION = b.DIVISION")
Can anyone help me find a solution for this?
Thank you very much
I ended up solving it by simply replacing the garbled characters after the SQL query using gsub, like this:
clear_errors <- function(table, campo){
  table <- as.data.frame(table)
  table[,campo] <- gsub("Ã'", "Ñ", table[,campo])
  table[,campo] <- gsub("Ã©", "é", table[,campo])
  table[,campo] <- gsub("Ã³", "ó", table[,campo])
  table[,campo] <- gsub("Ãº", "ú", table[,campo])
  table[,campo] <- gsub("Ã±", "ñ", table[,campo])
  table[,campo] <- gsub("Ã", "í", table[,campo])  # í turns into "Ã" plus an invisible character, so plain "Ã" goes last
  return(table)
}
It isn't the most elegant solution but it works.
I think the issue happens because xlsx formats characters as factors and probably uses a different encoding than sqldf does for them. If someone can find out exactly what's going on I'd very much appreciate it (just out of curiosity)
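On the curiosity part: the garbling pattern is consistent with text stored as UTF-8 bytes being re-read as Latin-1 somewhere between read.xlsx2 and the SQLite layer that sqldf uses. A rough sketch of the mechanism (this is a guess at the cause, and the exact printed output depends on your locale):
x <- enc2utf8("ESPAÑA")   # "Ñ" is stored as the two UTF-8 bytes 0xC3 0x91
Encoding(x) <- "latin1"   # re-declare the same bytes as Latin-1 without converting them
x                         # now prints as something like "ESPAÃ'A" - the corruption seen after sqldf
Encoding(x) <- "UTF-8"    # re-declaring the same bytes as UTF-8 undoes it
x                         # "ESPAÑA" again
If that is what is happening, re-declaring (or iconv-ing) the affected character columns after the query may be enough, instead of gsub-ing the symptoms one by one.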
So my task is very simple, and I would like to use R to solve it. I have hundreds of Excel files (.xlsx) in a folder and I want to replace a specific piece of text without altering the formatting of the worksheet, preserving the rest of the text in the cell. For example:
Text to look for:
F13 A
Replace for:
F20
Text in a current cell:
F13 A Year 2019
Desired result:
F20 Year 2019
I have googled a lot and haven't found anything appropriate, even though this seems like a common task. I have a solution using PowerShell, but it is very slow, and I can't believe there is no simple way using R. I'm sure someone has had the same problem before; I'll take any suggestions.
You can try:
text_to_look <- 'F13 A'
text_to_replace <- 'F20'
all_files <- list.files('/path/to/files', pattern = '\\.xlsx$', full.names = TRUE)
lapply(all_files, function(file) {
  df <- openxlsx::read.xlsx(file)
  #Or use the readxl package
  #df <- readxl::read_excel(file)
  #Replace the matched text inside each cell, keeping the rest of the cell intact
  df[] <- lapply(df, function(col) gsub(text_to_look, text_to_replace, col, fixed = TRUE))
  openxlsx::write.xlsx(df, basename(file))
})
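As a quick check that the cell-level replacement keeps the rest of the cell intact, using the example strings from the question:
gsub('F13 A', 'F20', 'F13 A Year 2019', fixed = TRUE)
# [1] "F20 Year 2019"
Note that this approach reads the values into a data frame and writes a fresh file, so the resulting workbook's formatting is whatever write.xlsx produces rather than the original worksheet's.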
I am trying to optimize a simple piece of R code I wrote, in two respects:
1) For loops
2) Writing data into my PostgreSQL database
For 1), I know for loops should be avoided at all costs and it's recommended to use lapply, but I am not clear on how to translate my code below using lapply.
For 2), what I do below works, but I am not sure it is the most efficient way (for example, doing it this way versus rbinding all the data into an R data frame and then loading the whole data frame into my PostgreSQL database).
EDIT: I updated my code with a reproducible example below.
library(rvest)  # read_html(), html_nodes(), html_text()
library(DBI)    # dbWriteTable()
for (i in 1:100){
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  download.file(search, destfile = 'scrape.html', quiet = TRUE)
  url <- read_html('scrape.html')
  github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link  <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  df <- data.frame(github_title, github_link)
  colnames(df) <- c("title", "link")
  dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
  cat(i)
}
Thanks a lot for all your inputs!
First of all, it is a myth that should be put to rest that lapply is in any way faster than equivalent code using a for loop. This was fixed years ago, and a for loop should in every case be at least as fast as the equivalent lapply.
I will illustrate using a for loop, as you seem to find that more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion might be necessary.
n <- 1e5
outputDat <- vector('list', n)
# element_a .. element_d are placeholders standing in for the vectors scraped in the question.
for (i in seq_len(n)){
  id <- element_a[i]
  location <- element_b[i]
  language <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}
## Combine the data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database in one call.
##dbBegin(con) #<= might speed up, might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con) #<= might speed up, might not.
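For reference, since the question asked what the lapply translation would look like (not because it is any faster), the accumulate-then-bind part can be written as:
outputDat <- lapply(seq_len(n), function(i) {
  data.frame(id            = element_a[i],
             location      = element_b[i],
             language      = element_c[i],
             date_creation = element_d[i])
})
outputDat <- do.call('rbind', outputDat)
The key point in either form is the same: build the pieces in memory first, then call dbWriteTable once on the combined data frame rather than once per iteration.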
Using Transact-SQL you could alternatively combine everything into a single INSERT INTO statement string. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop is once again just as fast if done properly.
#Create the statements here.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: print a bit of the statement
# cat(substr(statement, 1, 2000))
##dbBegin(con) #<= might speed up, might not.
dbExecute(con, paste0(
  "
  /*
  SET NOCOUNT ON seems to be necessary with the DBI API.
  It seems to react to 'n rows affected' messages.
  Note this only affects this method, not the one using dbWriteTable.
  */
  --SET NOCOUNT ON
  INSERT INTO [my table] VALUES ", statement))
##dbCommit(con) #<= might speed up, might not.
Note that, as I comment in the code, this might simply fail to upload the table properly, as the DBI package sometimes seems to fail this kind of transaction if it results in one or more 'n rows affected' messages.
Last but not least, once the statements are made, they can be copied and pasted from R into any GUI that directly accesses the database, for example using writeLines(statement, 'clipboard') or by writing them to a text file (a file is more stable if your data contains a lot of rows). In rare outlier cases this last resort can be faster, if for whatever reason DBI or alternative R packages run overly slowly. As this seems to be somewhat of a personal project, that might be sufficient for your use.
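For example, a minimal sketch of that last-resort route (the 'clipboard' connection is Windows-only; a file works anywhere):
writeLines(statement, 'clipboard')             # paste straight into the database GUI
writeLines(statement, 'insert_statement.sql')  # or keep it in a file, safer for large data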
What I am doing is crosstabulating a range of variables in a dataset (kol) with one crossing variable (cross) and then exporting the result into a csv.
library(dplyr)
cross <- dataset$crossingvariable
kol <- select(dataset,var.first:var.last)
myfunction <- function(x) {
table1 <- table(cross, x)
table2 <- prop.table(table1, 1)
table3 <- data.frame(table2)
}
dtfr <- lapply(kol,FUN=myfunction)
dtfr2 <- do.call("rbind", dtfr)
write.csv(dtfr2, file="C:/Users/Gebruiker/Documents/statistics/test.csv")
The problem is that the two dtfr steps (dtfr and dtfr2) mess up the formatting. What needs to happen is that every element of dtfr ends up arranged vertically, one under another, when exported to CSV. dtfr2 does this, but in the process it also rearranges the tables. In the dtfr step the formatting is still correct if I use as.data.frame.matrix for table3 (second code block below); in the dtfr2 step the layout gets shuffled in the CSV. The dtfr2 step is, however, needed to get the tables vertically under one another in the CSV file.
When I change table3 to:
library(dplyr)
cross <- dataset$crossingvariable
kol <- select(dataset,var.first:var.last)
myfunction <- function(x) {
table1 <- table(cross, x)
table2 <- prop.table(table1, 1)
table3 <- as.data.frame.matrix(table2)
}
dtfr <- lapply(kol,FUN=myfunction)
dtfr2 <- do.call("rbind", dtfr)
write.csv(dtfr2, file="C:/Users/Gebruiker/Documents/statistics/test.csv")
dtfr now gets the correct output, but at this point dtfr2 throws the following error message when I run the script:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I assume it has something to do with unequal numbers of columns across the different variables I am cross-tabulating. Most of them are nominal or ordinal variables, so some have three categories and some have five.
I can't manage to figure out how to fix this, mostly due to my limited R knowledge. I googled the problem and looked on Stack Overflow, which yielded some similar problems, but the offered solutions didn't work in my case. Again, this has a lot to do with me not knowing enough about R yet, but some help pointing me in the right direction would be very much appreciated.
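For what it's worth, that assumption is easy to confirm with a minimal example (made-up category names): rbind() refuses to stack data frames whose column counts differ, which is exactly what happens when one variable has three categories and another has five.
a <- data.frame(cat1 = 0.5, cat2 = 0.3, cat3 = 0.2)
b <- data.frame(cat1 = 0.4, cat2 = 0.2, cat3 = 0.2, cat4 = 0.1, cat5 = 0.1)
do.call("rbind", list(a, b))
# Error in rbind(deparse.level, ...) :
#   numbers of columns of arguments do not match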
I've seen elsewhere on this site that a single variable can be passed to sqldf, but I haven't been able to track down a way to loop through all the values in "myplace":
myplace <- ("place1", "place2", "place3", ...)
myquery <- sqldf(select things from mydata where place = myplace)
myplace.graph1 <- ggplot2(myquery)
My aim is to automate the production of a number of tables and graphs for each "myplace" element. I want to do this because my dataset is updated monthly and I have to report on it. (For this reason, I don't think grouping, as suggested in reply to a similar query on this site, is the way to go, although I stand to be corrected.)
I am in the process of learning the R ropes in order to replace a bunch of muddled spreadsheets. My data is now in SQLite, but I couldn't see a way of looping through a dbGetQuery either.
It could well be that a completely different approach is required - I'm exploring R because the graphs look great, I can document my steps and it's open source - I would appreciate any advice.
Many thanks
1) A query is just a text string. You can manipulate that text string in all the usual ways before passing it to sqldf. See ?paste, ?paste0, ?sprintf, etc.
qry <- paste("select things from mydata where place =", myplace[1])
myplace.graph1 <- plot(sqldf(qry))
myplace.graph <- list()
for(i in seq_along(myplace))
{
qry <- paste("select things from mydata where place =", myplace[i])
myplace.graph[[i]] <- plot(sqldf(qry))
}
2) Or, without a loop:
myplace.graph <- lapply(myplace, function(x) {
  qry <- paste0("select things from mydata where place = '", x, "'")
  plot(sqldf(qry))
})
3) Or use fn$ from the gsubfn package (which is automatically loaded by sqldf, so it is available), as in Example 5 on the sqldf home page:
sql <- "select things from mydata where place = '$p' "
lapply(myplace, function(p) plot(fn$sqldf(sql)))
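If the data is sitting in SQLite anyway, the same looping idea also works directly through DBI, and dbGetQuery() takes a params argument so the values don't have to be quoted by hand (a sketch, assuming an open connection con and the made-up table/column names used above):
library(DBI)
myresults <- lapply(myplace, function(p) {
  dbGetQuery(con, "select things from mydata where place = ?", params = list(p))
})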
Is it possible to pass a value into the query in dbGetQuery from the RMySQL package?
For example, if I have a set of values in a character vector:
df <- c('a','b','c')
And I want to loop through the values to pull out a specific value from a database for each.
library(RMySQL)
res <- dbGetQuery(con, "SELECT max(ID) FROM table WHERE columna='df[2]'")
When I try to add the reference to the value I get an error. Wondering if it is possible to add a value from an R object in the query.
One option is to manipulate the SQL string within the loop. At the moment you have a string literal; the 'df[2]' is not interpreted by R as anything other than those characters. There are going to be some ambiguities in my answer, because df in your question is patently not a data frame (it is a character vector!). Something like this will do what you want.
Store the output in a numeric vector:
require(RMySQL)
df <- c('a','b','c')
out <- numeric(length(df))
names(out) <- df
Now we can loop over the elements of df to execute your query three times. We can set the loop up in two ways: i) with i as a number, which we use to index the elements of df and out, or ii) with i as each element of df in turn (i.e. 'a', then 'b', ...). I will show both versions below.
## Version i
for(i in seq_along(df)) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", df[i], "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]]  ## the query returns a 1x1 data frame; keep just the value
}
dbDisconnect(con)
OR:
## Version ii
for(i in df) {
  SQL <- paste("SELECT max(ID) FROM table WHERE columna='", i, "';", sep = "")
  out[i] <- dbGetQuery(con, SQL)[[1]]
}
dbDisconnect(con)
Which version you use will depend on personal taste. The second (ii) version requires the output vector out to have names matching the values of df, which is why we set names(out) <- df above.
Having said all that, assuming your actual SQL query is similar to the one you posted, can't you do this in a single SQL statement, using a GROUP BY clause to group the data before computing max(ID)? Doing simple things like this in the database will likely be much quicker. Unfortunately, I don't have a MySQL instance around to play with and my SQL-fu is currently weak, so I can't give an example of this.
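For reference, a sketch of what that single grouped query might look like, reusing the question's placeholder table and column names:
SQL <- paste0("SELECT columna, MAX(ID) AS max_id FROM table WHERE columna IN ('",
              paste(df, collapse = "','"),
              "') GROUP BY columna;")
out <- dbGetQuery(con, SQL)
This returns one row per value of columna instead of requiring a separate query per element of df.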
You could also use the sprintf command to solve the issue (it's what I use when building Shiny Apps).
df <- c('a','b','c')
res <- dbGetQuery(con, sprintf("SELECT max(ID) FROM table WHERE columna='%s'", df[2]))
Something along those lines should work.