I do some simple logging by means of:
println("*** ", Dates.now(), " READ CSV")
But depending on the timestamp, the trailing zero of the milliseconds can be dropped (see the third log message, where 470 is displayed as 47):
*** 2019-09-14T15:44:59.862 READ CSV
*** 2019-09-14T15:45:08.065 PARSE DATETIME
*** 2019-09-14T15:45:10.47 ROUND DOWN PRICES TO CONTRACT TICKS
While it is not a big deal, I am still wondering how this would be fixed in Julia.
The simplest fix is to use the rpad function:
println("*** ", rpad(Dates.now(), 23, "0"), " READ CSV")
I'm a fairly basic user and I'm having issues reading a .txt file into R in a neat manner to get an Excel-like table output.
My main issue stems from the fact that the "columns" in the .txt file are created by using a varying amount of spaces. So for example (periods representing spaces, imagining that the info lines up together):
Mister B Smith....Age 35.....Brooklyn
Mrs Smith.........Age 33.....Brooklyn
Child Smith.......Age 8......Brooklyn
Other Child Smith.Age 1......Brooklyn
Grandma Smith.....Age 829....Brooklyn
And there are hundreds of thousands of these rows, all with different spaces that line up to make "columns." Any idea on how I should go about inputting the data?
It appears that your file is not delimited at all, but in a fixed-width format. You focused on the number of spaces, when really it seems like the data have a varying number of characters in fields of the same fixed width. You'll need to verify this. But the first "column" seems to be exactly 19 characters long. Then comes the string Age (with a space at the end), and then a 7-character column with the age. Then a final column, and it's not clear at all how long it might be.
Of course this could be me overfitting to this small snippet. Check whether I have guessed correctly. If I have, you can use the base function read.fwf for files like this. Let's say the file name is foo.txt and you want to call the result my_foo. The Age column is redundant, so let's skip it. And let's say the final column actually has 8 characters (the number of characters in Brooklyn, but you'll need to check this):
my_foo <- read.fwf("foo.txt", c(19, -4, 7, 8))
might get you what you want. See ?read.fwf for details.
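If those guesses hold, a readr-based alternative might also work (just a sketch: it assumes the readr package, makes up column names, and since read_fwf has no negative-width skip, the redundant "Age " label is read in and then dropped):
library(readr)
# Same guessed widths: 19-char name, 4-char "Age " label, 7-char age, 8-char borough
my_foo <- read_fwf("foo.txt",
                   fwf_widths(c(19, 4, 7, 8),
                              c("name", "age_label", "age", "borough")))
my_foo$age_label <- NULL  # drop the redundant "Age " column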
If the delimiter is always a run of spaces, you can read in your .txt file and split each line into a vector using a regex that looks for more than one space:
x <- c("Mister B Smith Age 35 Brooklyn",
"Mrs Smith Age 33 Brooklyn")
stringr::str_split(x, " {2,}")
[[1]]
[1] "Mister B Smith" "Age 35" "Brooklyn"
[[2]]
[1] "Mrs Smith" "Age 33" "Brooklyn"
The only problem you might run into with this approach is if, due to the length of one field, there is only one space between fields (for example: "Mister B Smithees Age 35 Brooklyn"). In this case, #ngm's approach is the only possible option.
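If that edge case is not a concern, the split result can then be bound into a data frame, for example (a sketch reusing the two example rows above; the column names are made up):
# Split on runs of 2+ spaces and bind the pieces into a data frame
pieces <- stringr::str_split(x, " {2,}", simplify = TRUE)
smiths <- data.frame(name = pieces[, 1], age = pieces[, 2], borough = pieces[, 3],
                     stringsAsFactors = FALSE)
smiths
#             name    age  borough
# 1 Mister B Smith Age 35 Brooklyn
# 2      Mrs Smith Age 33 Brooklyn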
How can I increase the precision of FLOAT failed assertion messages in tSQLt?
For example
DECLARE @Expected FLOAT = -5.4371511392520810
PRINT STR(@Expected, 40, 20)
DECLARE @Actual FLOAT = @Expected - 0.0000000001
PRINT STR(@Actual, 40, 20)
EXEC tSQLt.AssertEquals @Expected, @Actual
gives
-5.4371511392520810
-5.4371511393520811
[UnitTest].[test A] failed: Expected: <-5.43715> but was: <-5.43715>
In most computer languages (including T-SQL), floating-point values are approximate, so comparing FLOAT variables for equality is often a bad idea (especially after doing some maths on them). For example, a FLOAT variable is only accurate to about 15 digits (by default).
You can see this by adding the following line at the end of your sample code:
PRINT STR((@Actual - @Expected) * 1000000, 40, 20)
which returns -0.0001000000082740
So you could either
Use the built-in SQL function ROUND to allow numbers that are approximately the same to be viewed as equal (EXEC cannot take expressions as arguments, so round into variables first):
DECLARE @ExpectedRounded FLOAT = ROUND(@Expected, 14), @ActualRounded FLOAT = ROUND(@Actual, 14)
EXEC tSQLt.AssertEquals @ExpectedRounded, @ActualRounded
Use an exact type for the variables, like NUMERIC(38, 19). Replacing every FLOAT in your example with NUMERIC(38, 19) seems to give the same result, but when you add the PRINT STR((@Actual - @Expected) * 1000000, 40, 20) mentioned above, it now prints exactly -0.0001000000000000, showing that there is an inaccuracy in the PRINT statement as well.
Of course your tSQLt.AssertEquals test will still fail, since the values are different in the 10th digit after the decimal point (one number is ...925... and the other is ...935...). If you want it to pass even then, round the values off to 9 digits with ROUND.
Further information:
See David Goldberg's excellent article What Every Computer Scientist Should Know About Floating-Point Arithmetic, under the heading Rounding Errors.
http://msdn.microsoft.com/en-us/library/ms173773.aspx
http://www.informit.com/library/content.aspx?b=STY_Sql_Server_7&seqNum=93
I want to write a script which makes R usable for "everybody" for this particular kind of analysis. Is there a possibility to create warnings?
time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3
For example, if the value is 0 at least 3 times in a row (or better, within a set period of time, e.g. 3 days), give a warning and name the date(s). Maybe create something like a report if I am combining conditions.
In general: measurement data are read via read.csv and the date is then set with as.POSIXct (xts/zoo). I want the "user" to get a clear message if the values are changing, if they are 0 for a long time, etc.
The second step would be sending emails - maybe running on a server later.
Additional Questions:
I now have a df in xts - is it possible to check whether the value is greater than a threshold value? It's not working because it's not an atomic vector.
Thanks
Try this.
x <- read.table(text = "time,value
2012-01-01,5
2012-01-02,0
2012-01-03,0
2012-01-04,0
2012-01-05,3", header = TRUE, sep = ",")
if(any(with(rle(x$value == 0), lengths >= 3 & values))) warning("I noticed some dates have value 0 at least three times.")
Warning message:
I noticed some dates have value 0 at least three times.
I'll leave it to you as a training exercise to paste a warning message that would also give you the date(s).
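For the record, one way that exercise might look (a sketch; it assumes the time column can simply be treated as character and that runs of zeros are what matters):
# Locate runs of zeros of length >= 3 and report the dates they cover
runs   <- rle(x$value == 0)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
bad    <- which(runs$values & runs$lengths >= 3)
if (length(bad) > 0) {
  dates <- unlist(lapply(bad, function(i) as.character(x$time[starts[i]:ends[i]])))
  warning("Value 0 at least three times in a row on: ", paste(dates, collapse = ", "))
}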
For our analysis we need to read raw data from CSV (xls) files and convert them into a SAS dataset before doing the analysis.
Now, the problem is that this raw data generally has 2 issues:
1. The ordering of columns changes sometimes. So, if in an earlier period we have columns in the order variable A, then B, then C, etc., it might change to B, then C, then A.
2. There are foreign elements like "#", or ".", or "some letters", etc.
Now, we have to clean the raw data before reading it into SAS. This takes a considerable amount of time. Is there any way we can clean the data within the SAS system itself, as part of reading it in? If we can rectify the data with SAS code, it will save quite a lot of time.
Here's the example:
Period 1: I got the data in Data1.csv in this format. In column B, which is numeric, I have "#" and ".". And in column C, which is also numeric, I have "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C will remain. The question is how to remove them. I can use an IF statement, but the problem is there are too many possible foreign elements (e.g. instead of "#", ".", "g", I might get other foreign elements like "$", "h", etc.). Is there any way to have code which detects and removes foreign elements without my having to specify them with an IF statement every time I import the raw data into SAS?
A B C
Name1 1 5
Name2 2 6
Name3 3 4
Name4 # g
Name5 5 3
Name6 . 6
Period 2: In this period I got DATA2.csv, which is given below. When I use the INFILE statement, I specify that A should be read first with its specific name, then B, then C. In the 2nd period B comes first in the data, so when SAS reads the data I get B instead of A. So I have to check the variable ordering against the previous period's data every time and correct it before reading the data using the INFILE statement. Since the number of variables is very large, it's very time consuming (and at times frustrating) to verify the column ordering in this fashion. Is there SAS code with which SAS will automatically read A, then B, then C, even though they are not in that order?
B A C
1 Name1 5
2 Name2 6
3 Name3 4
# Name4 g
5 Name5 3
. Name6 6
Even though I mainly use SAS for my analysis, I can use R to clean the data and then read it into SAS for further analysis. So R code can also be helpful.
Thanks.
In R you increase the speed of file reading when you specify that a column is a particular class. With the example provided (3 columns, with the middle one being "character"), you might use this code:
dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")
The "#" and "." would become NA values when encountered in the numeric columns. The above code removes the default specification of the comment character which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_, you could use this code:
dat <- read.csv( filename,
colClasses=c("numeric", "character", "numeric"),
comment.char="",
na.strings=c("NA", ".", "#") )
By default the header=TRUE setting is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There is further documentation, and there are worked examples of reading Excel data elsewhere; however, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would be well advised to change the date formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can also specify a "Date"-classed column and skip the process of turning character-classed columns in the default Excel formats (all of which are bad) into dates.
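For instance, if the dates are already stored as yyyy-mm-dd text, a sketch (with a made-up file name, and assuming a date / character / numeric column order) could be:
# First column parsed straight to Date; "#" and "." coerced to NA in the numeric column
dat <- read.csv("data1.csv",
                colClasses = c("Date", "character", "numeric"),
                comment.char = "",
                na.strings = c("NA", ".", "#"))
str(dat)  # the date column now has class "Date", with no extra conversion step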
Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.
A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:
proc import datafile='c:\temp\somefile.csv'
out=SASdata
dbms=csv replace;
run;
Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:
data SASdata;
infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
informat A $50.; /* A character variable with max length 50 */
informat B yymmdd10.; /* A date presented like 2012-08-25 */
informat C dollar12.; /* A number containing dollar sign, commas, or decimals */
input A B C; /* The order of the variables in the file */
if B = . then B = today(); /* A possible data cleaning statement */
run;
Note that the INPUT statement lists the variables in the order they exist in the file. The point is that the code you use must match the layout of each file you process.
These are just general comments. If you encounter problems, post back with a more specific question.
UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly, and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.
In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):
data temp;
infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
/* Define dataset variables */
informat A $50.;
informat B 12.;
informat C 12.;
/* Add a KEEP statement to keep only the variables you want */
keep A B C;
input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
informat A $50.;
informat B 12.;
informat C 12.;
input B A C;
run;
proc append base=ALLDATA data=temp;
run;
Notice that the "data definition" part of each data step is the same; the only difference is the order of variables listed in the INPUT statement. Notice that because the variables A and B are defined as numeric, when those invalid characters are read (# and g), the values are stored as missing values.
In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
I'm working with some UK postcode data, around 60,000 entries in a SQL 2008 database, and need to manipulate a string containing a postcode.
The original form data was collected with no validation, so the postcodes are held in different formats; CA12HW can also appear as CA1 2HW (correctly formatted).
UK postcodes vary in length and letter/number mix, with the only exception being that all codes finish with a space, then a number and two letters.
I am only interested in looking at the first part of the code, i.e. before the space. Therefore I am looking at writing a piece of code that does the following:
1. Check for a space 4th from the right.
2. If there is no space, insert one 4th from the right.
3. Split the string at the space.
So far I have:
PostCode = "CA30GX"
SpaceLocate = InStr(PostCode, " ")
If SpaceLocate = 0 Then 'Postcode needs a space
If the only constant is that there should be a space 4th from the right, how do I insert one?
Once the space is inserted I can split the code to use as I need.
PostcodeArray = Split(Postcode, " ")
Now PostcodeArray(0) is equal to "CA3", PostcodeArray(1) is equal to "0GX"
Any help would be appreciated.
You can just recreate the string:
PostCode = Left(PostCode, 3) & " " & Right(PostCode, 3)
PostcodeArray = Split(PostCode, " ")
Edit:
PostCode = Left(PostCode, Len(PostCode) - 3) & " " & Right(PostCode, 3)
You can use left and right string functions to do this:
newCode = left(Postcode, len(Postcode) - 3) & " " & right(Postcode, 3)