When I run a proc print where segment ID equals 1234 the output shows segment ID 1235. SAS actually changes the last 4 digits of a 19 digit number. Contents shows the field in a num 8 formatted as a char 20. I just pull the data and print with no additional formatting or processing.
If I run a SQL statement in a different software package where segment ID equals 1234 (the exact same record) the results show 1234 (no change to the last 4). The other vars pulled with the query exactly match those of SAS except for the segment ID.
My best guess is it's a formatting issue even though the field should be large enough, 20 > 19.
Second guess is some sort of encryption on the field. Typically if I don't have proper access a field would be blank. But I am unfamiliar with this new data source.
I'll try adding a specific format to my SAS datapull for that field but would love to hear any other suggestions.
Thank you!
PROC PRINT is not the issue. You cannot store 19 decimal digits exactly as a number in SAS. SAS stores numbers as 64-bit floating point numbers. The maximum number of decimal digits that can be represented as consecutive integers is 15. After that the binary representation will not have enough bits to exactly represent every decimal string.
Check this description about precision from the documentation: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lrcon/p0ji1unv6thm0dn1gp4t01a1u0g6.htm
You should store such things as character strings. I doubt that you need to do any arithmetic with those values.
If you are getting the data from a remote database use the DBSASTYPE= dataset option to control what type of SAS variable is created.
Contents shows the field in a num 8 formatted as a char 20. I just pull the data and print with no additional formatting or processing.
That doesn't make sense. A numeric variable shouldn't have a character format.
I think you'll need to re-read the source field as a character from the source. Not sure where you "just pull the data" from, but you'll need to modify that step to ensure it gets brought over as a character in the first place otherwise you'll have that issue no matter what.
You cannot fix the issue with the data as is, as far as I know.
*issue with numeric;
data have;
input segmentID;
format segmentID 32.;
segmentIDChar=put(SegmentID, 32.);
cards;
1234567890123456789
;;;;
run;
proc print data=have;
run;
*no issue with character fields;
data have;
length segmentID $20.;
input segmentID $;
format segmentID $32.;
cards;
1234567890123456789
;;;;
run;
proc print data=have;
run;
Results:
Numeric Isssue
Obs segmentID segmentIDChar
1 1234567890123456768 1234567890123456768
Character Issue
Obs segmentID
1 1234567890123456789
Related
I am a somewhat newbie to SQLite (and KMyMoney). KMyMoney (an open source personal finance manager) allows one-click exporting data into an SQLite database.
On browsing the SQLite database output, the dollar amount data is stored in a table called kmmSplits as several text fields in a strange format based on “value” and “valueFormatted” (see screen shot below). The “value” field is apparently written as a division equation (in a text format) which apparently yields the “valueFormatted” field (again in text format). The “valueFormatted is the correct number amount but the problem is that parenthesis are used to indicate a negative number instead of a simple minus in front of the value. This is apparently an accounting number format, but I don’t know how to parse this into a float value for running calculated SQL queries, etc. The positive values (without parenthesis) are no problem to convert to FLOATS.
I’ve tried using the CAST to FLOAT function but this does not do the division math, nor does it convert parenthesis into negative values (see screen shot).
The basic question is: how to parse a text value containing parenthesis in the “valueFormatted field (accounting money format) into a common number format OR, alternatively, how to convert a division equation in the “value” field to an actual calculation.
Use a CASE expression to check if valueFormatted is a numeric value inside parentheses and if it is multiply -1 with the substring starting from the 2nd char (the closing parenthesis will be discarded by SQLite during this implicit type casting):
SELECT *,
CASE
WHEN valueFormatted LIKE '(%)' THEN (-1) * SUBSTR(valueFormatted, 2)
ELSE valueFormatted
END AS value
FROM kmmSQLite;
Or, replace '(' with ''-'' and add 0 to covert the result to a number:
SELECT *,
REPLACE(valueFormatted, '(', '-') + 0 AS value
FROM kmmSQLite;
I have a table with a numeric(6,4) column (x.xxxx)
In the select statement, I have to do math on this column which results in the value being typed as a REAL. I need to display the results using numeric(6,4) though and can't seem to see how to do this.
I tried "cast(column as numeric(6,4))" which works in other databases, but the displayed value is still REAL format with more than 4 digits to the right of the decimal. I need to get it back to the "x.xxxx" format from REAL.
I have to read data from an archaic fixed-width table. Some of the fields are in the IBM packed decimal Format, and I have no idea how to get the numbers.
for example, one line of my Input file might look like this:
140101 72 ‘ëm§11
where ‘ëm§1 is actually the packed decimal representation of the number 3153947. SAS can read this field correctly via INPUT gew pd4.4
Anyone any expericnes with this issue?
I have a problem with unformatted data and I don't know where, so I will post my entire workflow.
I'm integrating my own code into an existing climate model, written in fortran, to generate a custom variable from the model output. I have been successful in getting sensible and readable formatted output (values up to the thousands), but when I try to write unformatted output then the values I get are absurd (on the scale of 1E10).
Would anyone be able to take a look at my process and see where I might be going wrong?
I'm unable to make a functional replication of the entire code used to output the data, however the relevant snippet is;
c write customvar to file [UNFORMATTED]
open (unit=10,file="~/output_test_u",form="unformatted")
write (10)customvar
close(10)
c write customvar to file [FORMATTED]
c open (unit=10,file="~/output_test_f")
c write (10,*)customvar
c close(10)
The model was run twice, once with the FORMATTED code commented out and once with the UNFORMATTED code commented out, although I now realise I could have run it once if I'd used different unit numbers. Either way, different runs should not produce different values.
The files produced are available here;
unformatted(9kb)
formatted (31kb)
In order to interpret these files, I am using R. The following code is what I used to read each file, and shape them into comparable matrices.
##Read in FORMATTED data
formatted <- scan(file="output_test_f",what="numeric")
formatted <- (matrix(formatted,ncol=64,byrow=T))
formatted <- apply(formatted,1:2,as.numeric)
##Read in UNFORMATTED data
to.read <- file("output_test_u","rb")
unformatted <- readBin(to.read,integer(),n=10000)
close(to.read)
unformatted <- unformatted[c(-1,-2050)] #to remove padding
unformatted <- matrix(unformatted,ncol=64,byrow=T)
unformatted <- apply(unformatted,1:2,as.numeric)
In order to check the the general structure of the data between the two files is the same, I checked that zero and non-zero values were in the same position in each matrix (each value represents a grid square, zeros represent where there was sea) using;
as.logical(unformatted)-as.logical(formatted)
and an array of zeros was returned, indicating that it is the just the values which are different between the two, and not the way I've shaped them.
To see how the values relate to each other, I tried plotting formatted vs unformatted values (note all zero values are removed)
As you can see they have some sort of relationship, so the inflation of the values is not random.
I am completely stumped as to why the unformatted data values are so inflated. Is there an error in the way I'm reading and interpreting the file? Is there some underlying way that fortran writes unformatted data that alters the values?
The usual method that Fortran uses to write unformatted file is:
A leading record marker, usually four bytes, with the length of the following record
The actual data
A trailing record marker, the same number of bytes as the leading record marker, with the same information (used for BACKSPACE)
The usual number of bytes in the record marker is four bytes, but eight bytes have also been sighted (e.g. very old versions of gfortran for 64-bit systems).
If you don't want to deal with these complications, just use stream access. On the Fortran side, open the file with
OPEN(unit=10,file="foo.dat",form="unformatted",access="stream")
This will give you a stream-oriented I/O model like C's binary streams.
Otherwise, you would have to look at your compiler's documentation to see how exactly unformatted I/O is implemented, and take care of the record markers from the R side. A word of caution here: Different compilers have different methods of dealing with very long records of more than 2^31 bytes, even if they have four-byte record markers.
Following on from the comments of #Stibu and #IanH, I experimented with the R code and found that the source of error was the incorrect handling of the byte size in R. Explicitly specifying a bite size of 4, i.e
unformatted <- readBin(to.read,integer(),size="4",n=10000)
allows the data to be perfectly read in.
A tab-delimited text file, which is actually an export (using bcp) of a database table, is of that form (first 5 columns):
102 1 01 e113c 3224.96 12
102 1 01 e185 101127.25 12
102 2 01 e185 176417.90 12
102A 3 01 e185 26261.03 12
I tried to import it in R with a command like
data <- read.delim("C:\\test.txt", header = FALSE, sep = "\t")
The problem is that the 3rd column which is actually a varchar field (alphanumeric) is mistakenly read as integer (as there are no letters in the entire column) and the leading zeros disappeared. The same thing happened when I imported the data directly from the database, using odbcConnect. Again that column was read as integer.
str(data)
$ code: int 1 1 1 1 1 1 6 1 1 8 ...
How can I import such a dataset in R correctly, so as to be able to safely populate that db table again, after doing some data manipulations?
EDIT
I did it adding the following parameter in read.delim
colClasses = c("factor","integer","factor","factor","numeric","character","factor","factor","factor","factor","integer","character","factor")
Would you suggest "character" or "factor" for varchar fields?
Is it ok to use "character" for datetime ones?
What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?
I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.
Would you suggest "character" or "factor" for varchar fields?
As John mentioned, this depends upon usage. It is simple to switch between the two, so don't worry too much about it. If the column represents a categorical variable, it should eventually be considered as a factor. If you intend on mining the text (e.g. comments fields), then character makes more sense.
Is it ok to use "character" for datetime ones?
It's fine for storing the dates in a data frame, but if you want them to be treated correctly for analysis purposes, you'll have to convert it to Date or POSIXct/POSIXlt form.
What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?
Values are read in to usual double accuracy (about 15 sig figs); in this particular example, 540912.69 is the best accuracy you can achieve. Compare
print(540912.68999999994) # 540912.7
print(540912.68999999994, digits=22) # 540912.69
print(540912.6899999994) # 540912.7
print(540912.6899999994, digits=22) # 540912.6899999994
EDIT: If you need more precision for your numbers, use the Rmpfr package.
I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.
The default for colClasses (when you don't specify it) does a pretty good job of guessing what columns should be. If you are doing things like using 01 as a character, then there's no way round explicitly specifying it.
the character and factor question is something only you can answer. It depends if you need to use them later as factors or characters. It also depends whether you need to clean them up at all afterwards. For example, if you plan to apply a number of ifelse() modifications to a factor afterwards you might as well just read it in as a character now and turn it into a factor later. Or, if you want to specifically code the factor in some way you will likely be better off reading it in as character.
As an aside, the reason you use read.delim over read.table is because of the default settings therefore don't bother setting the sep to the same as the default.