One of the columns in R dataframe has "," (comma) in it and because of it, when I try to convert it into netezza data frame, it throws me below error:
Error in nzQuery(sqlCommandUpload) : HY008 51 Operation canceled
01000 1 Unable to write nzlog/bad files
01000 1 Unable to write nzlog/bad files
HY000 46 ERROR: External Table : count of bad input rows reached maxerrors limit
How can I achieve this without making any changes to data?
With a dataframe like this, everything works fine:
I get error when the dataframe is like this:
library(nzr)
library(forecast)
library (reshape2)
library(doBy)
nzDisconnect()
nzConnectDSN('DSNInfo', force=FALSE , verbose=TRUE)
#read file
test2<-read.csv("test_df.csv", stringsAsFactors = F)
# convert to nz dataframe, no error
#nzdf.test2<-as.nz.data.frame(test2)
nzdf.d<-as.nz.data.frame(d)
# copy
#test<-test2
testd<-d
#replace one of the values containing a ","
#test$Category[1]<-"a,b"
testd$Category[1]<-"Bed, Bath & Towels"
# converting to nz gives error
#nzdf.test<-as.nz.data.frame(test)
nzdf.testd<-as.nz.data.frame(testd)
#remove ","
test$Category <- gsub(",","",test$Category)
# converting to nz dataframe, gives no error
nzdf.test<-as.nz.data.frame(test)
Did you check if you have nulls (NAs) in your data? I have faced the same problem but when i checked Netezza-R documentation i found that you can not write Nulls into a Netezza tables from another system. there is a mention about using setOutputNull funciton in such cases.
So a workaround is replace nulls with the string "NULL" in your R-dataframe, this makes the numerical columns become varchar, mind you. But fortunately "NULL" becomes null in your netezza table automatically. Only extra effort is that you have to covnert the columns back to numeric later.
Hope this helps
Related
I was just going through a tremendous headache caused by read_csv messing up my data by substituting content with NA while reading simple and clean csv files.
I’m iterating over multiple large csv files that add up to millions of observations. Some columns contain quite some NA for some variables.
When reading a csv that contains NA in a certain column for the first 1000 + x observations, read_csv populates the entire column with NA and thus, the data is lost for further operations.
The warning message “Warning: x parsing failure” is shown, but as I’m reading multiple files I cannot check this file by file. Still, I would not know an automated fix for the parsing problem indicated also with problems(x)
Using read.csv instead of read_csv does not cause the problem, but it is slow and I run into encoding issues (using different encodings requires too much memory for large files).
An option to overcome this bug is to add a first observation (first row) to your data that contains something for each column, but still I need to read the file first somehow.
See a simplified example below:
##create a dtafrane
df <- data.frame( id = numeric(), string = character(),
stringsAsFactors=FALSE)
##poluate columns
df[1:1500,1] <- seq(1:1500)
df[1500,2] <- "something"
# variable string contains the first value in obs. 1500
df[1500,]
## check the numbers of NA in variable string
sum(is.na(df$string)) # 1499
##write the df
write_csv(df, "df.csv")
##read the df with read_csv and read.csv
df_readr <- read_csv('df.csv')
df_read_standard <- read.csv('df.csv')
##check the number of NA in variable string
sum(is.na(df_readr$string)) #1500
sum(is.na(df_read_standard$string)) #1499
## the read_csv files is all NA for variable string
problems(df_readr) ##What should that tell me? How to fix it?
Thanks to MrFlick for giving the answering comment on my questions:
The whole reason read_csv can be faster than read.csv is because it can make assumptions about your data. It looks at the first 1000 rows to guess the column types (via guess_max) but if there is no data in a column it can't guess what's in that column. Since you seem to know what's supposed to be in the columns, you should use the col_types= parameter to tell read_csv what to expect rather than making it guess. See the ?readr::cols help page to see how to tell read_csv what it needs to know.
Also guess_max = Inf overcomes the problem, but the speed advantage of read_csv seems to be lost.
I'm using R to tidy data supplied to me (in a SAS file) so that I can bulk insert it into a SQLserver database. The problem that I'm having is that sometimes numeric fields get transformed by R after I read them in eg.(the leading 0 gets dropped, some numeric fields convert to scientific notation, long ID numbers turn into gibberish after the 15th digit).
Reading all the data into R as character solves these issues. When I'm supplied a csv file I can use data.tables 'fread' function to specify colClasses = 'character' however as far as I'm aware something like this doesnt exist for the 'read_sas' function from the haven package.
Are there any workarounds or extra documentation on how I can better approach and solve this issue?
Edit to highlight issues (left values is numeric and what I want to avoid, right value is as character and what I want):
1.
postcode <- c(0629,'0629')
postcode
[1] "629" "0629"
2.
id <- c(12000000,'12000000')
id
[1] "1.2e+07" "12000000"
3.
options(scipen=999)
id <- c(123123123123123123123123,'123123123123123123123123')
id
[1] "123123123123123117883392" "123123123123123123123123"
How can I import the data directly from SAS so that all columns in the data frame are read in as character data type (in order to avoid data quality issues when I insert into SQLserver)
I was trying to read the Toxic Release Inventory (TRI) csv files which I downloaded from Here using the command tri2016 <- fread("TRI_2016_US.csv") but it gives me a warning about discarding line 1 has too few or too many items to be column names or data.
However, tri2016_1 <- read.csv("TRI_2016_US.csv") reads it without giving any errors and correct column names! Using tri2016_1 <- fread("TRI_2016_US.csv", header=TRUE) still generates the warning and still ignores the header.
The TRI files have 108 columns and the header row contains special characters. The list of columns are listed in Pdf file (Appendix A on pg 7).
Is there any way to get fread to read these csv files along with the header?
Or should I just stick with tri2016 <- as.data.table(read.csv("TRI_2016_US.csv")) and not worry about it?
The header line seems to have a trailing comma (one more than in the other rows) - tested with TRI_2016_US.csv - 111 columns.
If you remove that, the problem should be solved.
Try the readr package.
library(readr)
tri2016_1 <- readr::read_csv("TRI_2016_US.csv")
You'll get a warning saying
Warning messages:
1: Missing column names filled in: 'X112' [112]
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
I am trying to read data out of a csv-file.
The data consists of small integer numbers (53, 98 ...)
The csv was made with OpenOffice, the data stood there in the first column
one number in each row.
reading data was simple (no problem at all):
BirthNumbers <- read.csv(“/Users/.../RawData.csv”, header=FALSE)
Now I try to calculate mean(BirthNumbers) (for example),
but it is not possible, the error message:
x is not numeric
Where is my mistake?
Thanks for all help
Norbert
It's probably being read in as characters.
Try mean(as.numeric(BirthNumbers))
As per https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html (see Value section), read.csv returns a data frame.
You should be calling mean on the column of the data frame. Since you have no headers (given your header = FALSE), most likely the column is called V1 (verify by doing head(BirthNumbers) or colnames(BirthNumbers)), so you should do mean(BirthNumbers$V1).
I am trying to Reading CSV file in R . How can I read and format dates and times while reading and avoid missing values marked as ?. The data I load after reading should be clean.
I tried something like
data <- read.csv("Data.txt")
It worked, but the dates and times were as is.
Also how can I extract a subset of data from specific data range?
For this I tried something like
subdata <- subset(data,
Date== 01/02/2007 & Date==02/02/2007,
select = Date:Sub_metering_3)
I get error Error in eval(expr, envir, enclos) : object 'Date' not found
Date is the first column.
The functions read.csv() and read.table() are not set up to do detailed fancy conversion of things like dates that can have many formats. When these functions don't automatically do what's wanted, I find it best to read the data in as text and then convert variables after the fact.
data <- read.csv("Data.txt",colClasses="character",na.strings="?")
data$FixedDate <- as.Date(data$Date,format="%Y/%m/%d")
or whatever your date format is. The variable FixedDate will then be of type Date and you can use equality and other conditions to subset.
Also, in your example code you are putting 01/02/2007 as bare code, which results in dividing 1 by 2 and then by 2007 yielding 0.0002491281, rather than inserting a meaningful date. Consider as.Date("2007-01-02") instead.