I have a problem with some data I´m working.
I extrac data from SQL SERVER and with R I work them, but for some fields of names, some names have instead of a letter the REPLACEMENT CHARACTER (Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD)), is the one enter image description here
I don´t want to use the replace function, to change the entire name.
Some ideas?
example the name MAGAÑA: MAGA�A
I use the following code to the connection and query:
library(odbc)
library(tidyverse)
library(dgof)
library(pROC)
library(ggplot2)
library(dbplyr)
library(dplyr)
library(lubridate)
library(janitor)
library(DBI)
library(readxl)
library(data.table)
## Connection
conex1 <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "xxx.xxx.xxx.xx",
Database = "xxxxxxxx",
UID = "xxxxxxx",
PWD = "xxxxxxxxx",
Port = 1433)
# Query
Fecha_nac<- dbSendQuery(conex1, "SELECT id_orden,
fecha_nacimiento
FROM zzgm_clientes_xxxxxxx") %>%
dbFetch()
I think, iconv can help to you in this situation.
dataframe_with_right_symbols <- raw_dataframe %>%
mutate_if(is.character, function(col) iconv(col, to="UTF-8"))
Related
I connected R to SQL using the following:
library(dplyr)
library(dbplyr)
library(odbc)
library(RODBC)
library(DBI)
con <- dbConnect(odbc(),
Driver = "SQL Server",
Server = "srv name",
Database = "Warehouse")
I pull in the table I want using
data <- tbl(con, in_schema("prc", "PricingLawOfUniv")))
The following things show me what I expect to see (a 38 X 1000 table of data):
head(data)
colnames(data)
The following things behave as I expect:
In the Environment data is a "list of 2"
View(data) shows a list with "src" and "ops" - each of those is also a list of 2.
Ultimately I want to work with the 38 X 1000 table as a dataframe using dplyr. How can I do this? I tried data[1] and data[2] but neither worked. Where is the actual table I want hiding?
You could use DBI::Id to specify the table/schema, and then dbReadTable:
tbl <- DBI::Id(
schema = "prc",
table = "PricingLawOfUniv"
)
data <- DBI::dbReadTable(con, tbl)
In another project working with Amazon Athena I could do this:
con <- DBI::dbConnect(odbc::odbc(), Driver = "path-to-driver",
S3OutputLocation = "location",
AwsRegion = "eu-west-1", AuthenticationType = "IAM Profile",
AWSProfile = "profile", Schema = "prod")
tbl(con,
# Run SQL query
sql('SELECT *
FROM TABLE')) %>%
# Without having collected the data, I could further wrangle the data inside the database
# using dplyr code
select(var1, var2) %>%
mutate(var3 = var1 + var2)
However, now using BigQuery I get the following error
con <- DBI::dbConnect(bigrquery::bigquery(),
project = "project")
tbl(con,
sql(
'SELECT *
FROM TABLE'
))
Error: dataset is not a string (a length one character vector).
Any idea if with BigQuery is not possible to do what I'm trying to do?
Not a BigQuery user, so can't test this, but from looking at this example it appears unrelated to how you are piping queries (%>%). Instead it appears BigQuery does not support receiving a tbl with an sql string as the second argument.
So it is likely to work when the second argument is a string with the name of the table:
tbl(con, "db_name.table_name")
But you should expect it to fail if the second argument is of type sql:
query_string = "SELECT * FROM db_name.table_name"
tbl(con, sql(query_string))
Other things to test:
Using odbc::odbc() to connect to BigQuery instead of bigquery::bigquery(). The problem could be caused by the bigquery package.
The second approach without the conversation to sql: tbl(con, query_string)
Data
I work with a large dataset (280 million rows) for which Spark and R seems to work nicely.
Problem
I'd had problems with SparkR's regexp_extract function. I thought it to work analogically to Stringr's str_detect but I haven't managed to get it to work. The documentation for regexp_extract is limited. Could you please give me a hand?
Reprex
Here is a reprex where I try to identify strings that do not have a space and paste " 00:01" as a suffix.
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken = ifelse(regexp_extract(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Error
error: org.apache.spark.sql.AnalysisException: cannot resolve '(NOT regexp_extract(df.sampletaken, ' ', 1))' due to data type mismatch: argument 1 requires boolean type, however, 'regexp_extract(df.sampletaken, ' ', 1)' is of string type.; line 1 pos 80;
Solution
# Load packages
library(tidyverse)
library(sparklyr)
library(SparkR)
# Create data
df <- data.frame(sampletaken = c("06/03/2013", "29/11/2005 8:30", "06/03/2013", "15/01/2007 12:25", "06/03/2013", "15/01/2007 12:25"))
# Create Spark connection
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
# Transfer data to Spark memory
df <- copy_to(sc, df, "df", overwrite = TRUE)
# Modify data
df1 <- df %>%
dplyr::mutate(sampletaken1 = ifelse(rlike(sampletaken, " "), sampletaken, paste(sampletaken, "00:01")))
# Collect data as dataframe
df1 <- df1 %>% as.data.frame()
head(df1$sampletaken)
Probably rlike is what you're after if you're looking for the analog to str_detect, see the SQL API docs:
str rlike regexp - Returns true if str matches regexp, or false otherwise.
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
true
on a Column (i.e., in R, rather than in SparkQL through sql()), it would be like:
rlike(Column, 'regex.*pattern')
# i.e., in magrittr form
Column %>% rlike('regex.*pattern')
Note that like is usually more efficient if you can use it since the set of valid like patterns is much smaller.
I'm not familiar with SparkR, but it seems that the function regex_extract returns a string (presumably the matched pattern in the string) instead of a boolean, as required by the function ifelse.
You may try to match the returned value against the empty string.
Here is my code
library(DBI)
library(dplyr)
con <- dbConnect(odbc::odbc(), some_credentials)
dbListTables(con, table_name = "Table_A")
The above code returns Table_A indicating presence of table. Now I am trying to query Table_A
df <- as.data.frame(tbl(con, "Table_A"))
and get back:
Error: <SQL> 'SELECT *
FROM "Table_A" AS "zzz18"
WHERE (0 = 1)'
nanodbc/nanodbc.cpp:1587: 42S02: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid object name 'Table_A'.
so dplyr does not see it. How can I reconcile. I already double checked spelling.
As mentioned, any object (table, stored procedure, function, etc.) residing in a non-default schema requires explicit reference to the schema. Default schemas include dbo in SQL Server and public in PostgreSQL. Therefore, as docs indicate, use in_schema in dbdplyr and Id or SQL in DBI:
# dbplyr VERSION
df <- tbl(con, in_schema("myschema", "Table_A"))
# DBI VERSION
t <- Id(schema = "myschema", table = "Table_A")
df <- dbReadTable(con, t)
df <- dbReadTable(con, SQL("myschema.Table_A"))
Without a reproducible example it is kinda hard but I will try my best. I think you should add the dbplyr package which is often used for connecting to databases.
library(DBI)
library(dbplyr)
library(tidyverse)
con <- dbConnect(odbc::odbc(), some_credentials)
df <- tbl(con, "Table_A") %>%
collect() #will create a dataframe in R and use dplyr
Here are some additional resources:
https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
Hope that can help!
I have a table saved in AWS redshift that has lots of rows and I want to collect only a subset of them using a "user_id" column. I am trying to use R with the dplyr library to accomplish this (see below).
conn_dplyr <- src_postgres('dev',
host = '****',
port = ****,
user = "****",
password = "****")
df <- tbl(conn_dplyr, "redshift_table")
However, when I try to subset over a collection of user ids it fails (see below). Can someone help me understand how I might be able to collect the data table over a collection of user id elements? The individual calls work, but when I combine them both it fails. In this case there are only 2 user ids, but in general it could be hundreds or thousands, so I don't want to do each one individually. Thanks for your help.
df_subset1 <- filter(df, user_id=="2239257806")
df_subset1 <- collect(df_subset1)
df_subset2 <- filter(df, user_id=="22159960")
df_subset2 <- collect(df_subset2)
df_subset_both <- filter(df, user_id==c("2239257806", "22159960"))
df_subset_both <- collect(df_subset_both)
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: operator does not exist: character varying = record
HINT: No operator matches the given name and argument type(s). You may need to add explicit type casts.
)
Try this:
df_subset_both <- filter(df, user_id %in% c("2239257806", "22159960"))
Also you can add condition in the query you uploaded from redshift.
install.packages("RPostgreSQL")
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
conn <-dbConnect(drv,host='host link',port='5439',dbname='dbname',user='xxx',password='yyy')
df_subset_both <- dbSendQuery(conn,"select * from my_table where user_id in (2239257806,22159960)")