Stop SAS from counting blank cells with COUNT function

I am writing a SAS query to QA some data views. Part of the QA is determining what percentage of the values are populated. Unfortunately, SAS is counting empty character cells as populated rather than as NULL/missing. For example, an ID field has some blank cells, and when I run COUNT() on it I get the same result as if I had run COUNT(*).
If I run a CASE WHEN statement to exclude "" values I get the correct results, but doing that for every single text field in the SAS query seems like overkill. I suspect there is some function, or some way to preprocess the data, that I'm not aware of that would keep the COUNT function from counting empty cells.
Some example data to illustrate the idea:
data QA_Test;
Input Name $ ID_Number;
Robert 1AY
Shirley ""
Tammy XB3

Use the DSD option when reading values that have quotes around them from a text file.
data QA_Test;
infile cards dsd dlm=' ' truncover;
input Name $ ID_Number $;
cards;
Robert 1AY
Shirley ""
Tammy XB3
;
Now ID_NUMBER will not contain the quotes.
Or use a period to represent the missing values in your text file.
data QA_Test;
input Name $ ID_Number $;
cards;
Robert 1AY
Shirley .
Tammy XB3
;
If you already have those '""' strings in your data and you don't want to count them, then use a different method of counting:
sum(not (id_number in (' ','""')))

You need to provide a reproducible example. Please follow the instructions here or use the hex example I previously showed.
So somewhat fixing the non-working code you posted, I did this:
data QA_Test;
Input Name $ ID_Number $;
cards;
Robert 1AY
Shirley ""
Tammy XB3
;;;;
run;
proc sql;
select count(*) as total_count, count(Id_number) as n_id
from QA_TEST;
quit;
Results:
total_count n_id
3 3
But this creates a data set with actual quotes in the cell; I'm assuming that isn't the case in your actual data? So if I read it in as missing:
data QA_Test;
infile cards truncover;
Input Name $ ID_Number $;
cards;
Robert 1AY
Shirley
Tammy XB3
;;;;
run;
proc sql;
select count(*) as total_count, count(Id_number) as n_id
from QA_TEST;
quit;
Results in:
total_count n_id
3 2
So I think SAS is right: your data quality tests are correct, and your data has quality issues that need to be resolved - specifically in this case, fields that likely have tabs or other invisible characters in them.
You can test this with the following and post your output here or on communities.sas.com.
proc freq data=qa_test;
table id_number / out=check missing;
format Id_number $hex.;
run;

You can also use COMPRESS inside the COUNT to remove them, if it's correct to leave them in the data but you don't want them counted:
proc sql;
select count(compress(id,'"'))
...
;
quit;

Related

Working on a loop and wanting some feedback; re-adding this to update the code and list the .csv

Access to
https://www.opendataphilly.org/dataset/shooting-victims/resource/a6240077-cbc7-46fb-b554-39417be606ee
I have gotten close and my loop runs, but I haven't gotten the output I want.
I want to split street at any '&' locations into a column called 'street2'.
Main objective explained: Let's deal with the streets with '&' separating their names. Create a new column named street2 and set it equal to NA.
Then, iterate over the data frame using a for loop, testing whether the street variable you created earlier contains an NA value.
Where this occurs, separate the names in block at the '&' delimiter into the fields street and street2 accordingly.
Output the first 5 lines of the data frame to the screen.
Hint: mutate(); for; if; :; nrow(); is.na(); strsplit(); unlist().
library('readr')
NewLocation$street2 <- 'NA'
#head(NewLocation)
Task7 <- unlist(NewLocation$street2)
for (row in seq(from=1,to=nrow(NewLocation))){
if (is.na(Task7[NewLocation$street])){
NewLocation$street2 <-strsplit(NewLocation$street,"&",(NewLocation[row]))
}
}
This is changing all of my street2 values to equal street and getting rid of my "NA"s.
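For comparison, a minimal sketch of one way the exercise could be done (assuming the data frame is NewLocation with block and street columns as described in the prompt; the trimws() cleanup is an extra assumption on my part):
library(dplyr)
# start street2 as a missing-value column, as the prompt asks
NewLocation <- NewLocation %>% mutate(street2 = NA_character_)
# where street is NA, split block on "&" into street and street2
for (row in seq_len(nrow(NewLocation))) {
  if (is.na(NewLocation$street[row])) {
    parts <- unlist(strsplit(NewLocation$block[row], "&", fixed = TRUE))
    NewLocation$street[row]  <- trimws(parts[1])
    NewLocation$street2[row] <- trimws(parts[2])
  }
}
head(NewLocation, 5)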

How to save R DataFrame to a file in MSSQL backup format?

I need to feed external MSSQL server with a large amount of data calculated in R.
No direct access to DB is possible, so it must be an interim export file.
Excel format cannot be utilised due to the number of data rows exceeding Excel's capacity.
CSV would be fine, but there are many obstacles in the data itself, like semicolons used in names, special characters, unclosed quotations (an odd number of ") and so on.
I am looking for a versatile method of transporting data from R to an MSSQL database, independent of the data content. If I were able to save the data frame as a database containing a single table in an MSSQL backup-format file, that would satisfy the need.
Any idea on how to achieve this? Any package available? Any suggestion would be appreciated.
I'm inferring you're hoping to bulk-insert the data using bcp or sqlcmd. While neither one deals well with embedded commas and embedded quotes, you can work around this by using a different field separator (one that is not contained within the data).
Setup:
evil_data <- data.frame(
id = 1:2,
chr = c('evil"string ,;\n\'', '",";:|"'),
stringsAsFactors = FALSE
)
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2test (id INT, chr nvarchar(64))")
# [1] 0
DBI::dbWriteTable(con, "r2test", evil_data, create = FALSE, append = TRUE)
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
First, I'll use \U001 as the field separator and \U002 as the row separator. Those two should be "good enough", but if you have non-printable characters in your data, then you might either change your separators to other values or look for encoding options for the data (e.g., base64, though it might need to be stored that way).
write.table(evil_data, "evil_data.tab", sep = "\U001", eol = "\U002", quote = FALSE)
# or data.table::fwrite or readr::write_delim
Since I'm using bcp, it can use a "format file" to indicate separators and which columns on the source correspond with columns in the database. See references for how to create this file, but for this example I'll use:
fmt <- c("12.0", "2",
"1 SQLCHAR 0 0 \"\001\" 1 id \"\" ",
"2 SQLCHAR 0 0 \"\002\" 2 chr SQL_Latin1_General_CP1_CI_AS")
writeLines(fmt, "evil_data.fmt")
From here, assuming bcp is in your PATH (you'll need an absolute path to bcp otherwise), run this in a terminal (I'm using git-bash on Windows, but this should be the same in others). The second line is all specific to my database connection; you'll need to omit or change it for your own connection. The first line is the part that applies to your data.
$ bcp [db_owner].[r2test] in evil_data.tab -F2 -f evil_data.fmt -r '\002' \
-S '127.0.0.1,21433' -U 'myusername' -d 'mydatabase' -P ***MYPASS***
Starting copy...
2 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 235 Average : (8.51 rows per sec.)
Proof that it worked:
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
# 3 1 1\001evil"string ,;\r\n'
# 4 2 2\001",";:|"
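If it's easier to launch the copy step from R rather than a separate terminal, a sketch along these lines could wrap the same bcp call with system2(); the table, server, user, database, and password values are the same placeholders as above:
# build the same bcp arguments used in the terminal example above
bcp_args <- c("[db_owner].[r2test]", "in", "evil_data.tab",
              "-F2", "-f", "evil_data.fmt", "-r", "\\002",
              "-S", "127.0.0.1,21433", "-U", "myusername",
              "-d", "mydatabase", "-P", "***MYPASS***")
system2("bcp", bcp_args)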
References:
Microsoft pages for bcp: windows and linux
non-XML format files
bcp and quoted-CSV

R bad row data not shown when read to data.table, but written to file

Sample input tab-delimited text file; note there is bad data in this source file: the closing " for line 3 is two lines further down. So there is one completely blank line, followed by a line containing just the double-quote character, and then the good data continues on the next line.
id ca cb cc cd
1 hi bye hey nope
2 ab cd ef "quoted text here"
3 gh ij kl "quoted text but end quote is 2 lines down
"
4 mn op qr lalalala
When I read this into R (I tried both read.csv and fread, with and without blank.lines.skip = TRUE for fread), I get the following data table:
id ca cb cc cd
1 1 hi bye hey nope
2 2 ab cd ef quoted text here
3 3 gh ij kl quoted text but end quote is 2 lines down
4 4 mn op qr lalalala
The data table does not show the 'bad' lines. OK, good! However, when I go to write this data table out (I tried both write.table and fwrite), those two bad lines of nothing, and the double-quote, are written out just as they appear in the input file!
I've tried doing:
dt[complete.cases(dt),],
dt[!apply(dt == "", 1, all),]
to clear out empty data before writing out, but it does nothing. The data table still only shows those 4 entries. Where is R keeping this 'missing' data? How can I clear out that bad data?
I hope this is a 'one-off' bad output from the source (good ol' US Govt!), but I think they saved this from an xls file, which had bad formatting in a column, causing the text file to contain this mistake, but they obviously did not check the output.
After sitting back and thinking through the reading functions: because that column's (cd) data is quoted, there are actually two newline characters at the end of the string, which are not shown in the data table element! So writing out that element results in writing those two line breaks.
All I needed to do was:
dt$cd <- gsub("[\r\n]", "", dt$cd)
and that fixed it; the output written to the file now has the correct rows of data.
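A quick check along these lines (just a sketch) can confirm that no embedded line breaks remain in the column before writing the file back out:
# should be FALSE after the gsub above
any(grepl("[\r\n]", dt$cd))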
I wish I could remove my question...but maybe someday someone will come across the same "issue". I should have stepped back and thought about it before posting the question.

Nesting Data Frames

I have a function wrapping RODBC::sqlQuery that takes a start and end date and returns 5 columns and roughly 1 million rows per call. I need to iterate through a list of about 60 dates, storing the function's resulting data frames in a list.
What I want to know is:
How to pass both start and end date arguments to the function in an apply-style fashion
How to store the resulting data frames neatly (like a table of |date|data.frame.pointer|)
Here's some of the code:
get.data <- function(date.start, date.end) { ... }
date.range <- seq(as.Date("2009-01-01"), Sys.Date(), by="1 month")
And sample output:
get.data(date.start="2009-01-01", date.end='2009-02-01')
date country oId eId pId
1 2009-01-01 Australia 12345 12345 12345
2 ... ... ... ... ...
Thank you for your help. I've been trying to figure out how to do this for hours to no avail.
For what you want, mapply will do the trick:
n <- length(date.range)
mapply(get.data, date.range[-n], date.range[-1])
This returns a list whose elements are the individual returned values from get.data. So in this case, you would get a list of data frames. That may well be the most appropriate way of storing the output, but it would depend on what you want to do with it.
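If it helps to keep the results keyed by date, one option (a sketch, reusing get.data, date.range, and n from above; Map is like mapply with SIMPLIFY = FALSE, so it always returns a list) is:
# one data frame per month, named by the start date of the interval
results <- Map(get.data, date.range[-n], date.range[-1])
names(results) <- as.character(date.range[-n])
# pull out a single month's data frame by its start date
results[["2009-01-01"]]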

Create flat file that 'flattens' multiple transaction rows into one line in SQL*Plus

As the title says, I need a method of flattening multiple rows into a one-line output per account. For example, the table looks like this:
Account Transaction
12345678 ABC
12345678 DEF
12346578 GHI
67891011 ABC
67891011 JKL
I need the output to be:
12345678|ABC|DEF|GHI
67891011|ABC|JKL
The number of transactions is unknown. For some accounts it could be 1 or 2, for others all the way up to hundreds.
You can do this using a customised version of Tom Kyte's STRAGG function, like this:
select account||'|'||stragg(transaction)
from mytable
where ...
group by account;
The function as given uses commas to separate the values, but you can easily change it to use '|'.
An example using EMP (and with commas still):
SQL> select deptno || '|' || stragg(ename) names
2 from emp
3 group by deptno;
NAMES
--------------------------------------------------------------------------------
10|CLARK,KING,FARMER,MILLER
20|JONES,FORD,SCOTT
30|ALLEN,TURNER,WARD,MARTIN,BLAKE
