Modify the dates in a huge file (around 1000 rows) using script - unix

I need to subtract x days from the dates in a delimited file, excluding the first and last rows. If a specified field does not contain a date, it should be left unchanged.
For example, aaa.txt contains
header
abc|20160431|dhadjs|20160325|hjkkj|kllls
ddd||dhajded|20160320|dwdas|hfehf
footer
I want the modified file to have the dates subtracted by 10 days. Something like below:-
header
abc|20160421|dhadjs|20160315|hjkkj|kllls
ddd||dhajded|20160310|dwdas|hfehf
footer
I don't want to use a programming language like Java to read the file but rather use a scripting language on unix. Any suggestions on how this can be done?
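One possible approach, sketched in Python as a scripting option commonly available on Unix systems. The function names, the zero-based column positions (1 and 3), and the YYYYMMDD format are assumptions drawn from the sample rows, not a confirmed specification. Fields that are empty or do not parse as a real calendar date are left untouched; note that the sample value 20160431 falls into that category, since April has only 30 days.

```python
from datetime import datetime, timedelta


def shift_field(field, days):
    """Return the field moved back by `days` if it is a valid YYYYMMDD date, else unchanged."""
    if len(field) != 8 or not field.isdigit():
        return field
    try:
        parsed = datetime.strptime(field, "%Y%m%d")
    except ValueError:
        return field  # not a real calendar date, e.g. 20160431
    return (parsed - timedelta(days=days)).strftime("%Y%m%d")


def shift_dates(lines, days, columns=(1, 3), sep="|"):
    """Shift the given zero-based columns on every line except the first and last."""
    result = []
    last = len(lines) - 1
    for index, line in enumerate(lines):
        if index in (0, last):
            result.append(line)  # header and footer pass through untouched
            continue
        fields = line.split(sep)
        for col in columns:
            if col < len(fields):
                fields[col] = shift_field(fields[col], days)
        result.append(sep.join(fields))
    return result
```

To apply it to a file, the lines could be read with `open("aaa.txt").read().splitlines()`, passed to `shift_dates(lines, 10)`, and the result written back joined with newlines.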


Pasting SQL decimal columns into Excel

I have an issue with data formats between Excel and SQL.
I have a column in SQL of datatype DECIMAL(18,0), and when I paste the result into Excel, the last 3 digits of the SQL result are replaced by 0.
Example:
In SQL the result set has a column called session id and has decimal numbers like
119,597,417,242,309,670
329,621,151,415,350,454
134,460,940,261,658,890
but when I paste it in Excel the numbers look like:
I tried changing the format in Excel to paste as text; however, the whole format of the result set gets distorted (and only the first column is pasted properly, without the 0s).
I can't keep casting all columns in SQL from decimal to int as there are way too many columns.
Can you please guide me as to what I can do?
Numeric fields in Excel are limited to 15 digits precision.
In SQL Assistant under Tools / Options / Data Format you can ask to have large Decimal (and BIGINT) fields displayed as text for just this sort of copy / paste. Or you can tell SQL Assistant to Save As or Export to Excel format.
For other tools you can explicitly FORMAT and CAST the data to VARCHAR in your SELECT so it is retrieved as text.
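To see why the trailing digits turn into zeros, here is a small illustrative sketch. It assumes Excel keeps the first 15 significant digits and zeroes the rest; the helper name `excel_numeric_view` is hypothetical and only mimics that behaviour, it does not call Excel. Values kept as text are not affected, which is why retrieving the column as VARCHAR avoids the problem.

```python
def excel_numeric_view(value, max_digits=15):
    """Mimic keeping only the first `max_digits` significant digits of an integer."""
    digits = str(abs(value))
    if len(digits) <= max_digits:
        return value
    # Assumption: digits past the limit are replaced by zeros rather than rounded.
    kept = digits[:max_digits] + "0" * (len(digits) - max_digits)
    return int(kept) if value >= 0 else -int(kept)


session_ids = [119597417242309670, 329621151415350454, 134460940261658890]
for session_id in session_ids:
    # The text form keeps all 18 digits; the numeric view loses the last three.
    print(str(session_id), "->", excel_numeric_view(session_id))
```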
Several things you can do. I'll list 4.
Pick whatever suits you best.
1. First paste into a text editor (like Notepad), search/replace there, and then paste that into Excel.
2. Set the data range where you're going to paste to "text", and then paste. After that you can search/replace and change to the correct format.
3. Change the regional settings of Windows to match the data that you have.
4. Generate formulas from your SQL query instead of floating point numbers. For example, generate text like =5/10 instead of 0.5 or 0,5. Excel will pick it up correctly regardless of your regional settings.

fread - skip lines starting with certain character - "#"

I am using the fread function in R to read files into data.table objects.
However, when reading the file I'd like to skip lines that start with #. Is that possible?
I could not find any mention of that in the documentation.
fread can read from a piped command that filters out such lines, like this:
fread("grep -v '^#' filename")
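For readers unfamiliar with what the grep -v '^#' pre-filter does, the same idea can be sketched outside R. This is only an illustration in Python, with a hypothetical function name and an assumed comma separator: lines beginning with the comment character are dropped before the remaining text is handed to the parser.

```python
import csv


def read_without_comments(text, comment="#", sep=","):
    """Drop lines that start with `comment`, then parse the rest as delimited rows."""
    kept = [line for line in text.splitlines() if not line.startswith(comment)]
    return list(csv.reader(kept, delimiter=sep))
```

Like grep -v '^#', this only removes lines where # is the very first character; a # later in a line is kept.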
Not currently, but it's on the list to do.
Are the # lines at the top forming a header which is more than 30 lines long?
If so, that's come up before and the solution is :
fread("filename", autostart=60)
where 60 is chosen to be inside the block of data to be read.
From ?fread :
Once the separator is found on line autostart, the number of columns
is determined. Then the file is searched backwards from autostart
until a row is found that doesn't have that number of columns. Thus,
the first data row is found and any human readable banners are
automatically skipped. This feature can be particularly useful for
loading a set of files which may not all have consistently sized
banners. Setting skip>0 overrides this feature by setting
autostart=skip+1 and turning off the search upwards step.
The default autostart=30 might just need bumping up a bit in your case.
Or maybe skip=n or skip="string" helps :
If -1 (default) use the procedure described below starting on line autostart to find the first data row. skip>=0 means ignore autostart and take line skip+1 as the first data row (or column names according to header="auto"|TRUE|FALSE as usual). skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).

What would cause Microsoft Jet OLEDB SELECT to miss a whole column?

I'm importing an .xls file using the following connection string:
If _
SetDBConnect( _
"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & filepath & _
";Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1""", True) Then
This has been working well for parsing through several Excel files that I've come across. However, with this particular file, when I SELECT * into a DataTable, there is a whole column of data, Item Description, missing from the DataTable. Why?
Here are some things that may set this particular workbook apart from the others that I've been working with:
The workbook has a freeze pane consisting of the first 24 rows (however, all of these rows appear in the DataTable)
There is some weird cell highlighting going on throughout the workbook
That's pretty much it. I can't see anything that would make the Item Description column not import correctly. Its data is comprised of all Strings that really have no special characters apart from &. Additionally, each data entry in this column is a maximum of 20 characters. What is happening? Is there any other way I can get all of the data? Keep in mind I have to use the original file and I cannot alter it, as I want this to ultimately be an automated process.
Thanks!
Some initial thoughts/questions: Is the missing column the very first column? What happens if you remove the space within "Item Description"? Stupid question, but does that column have a column header?
-- EDIT 1 --
If you delete that column, does the problem move to another column (the new index 4), or is the file complete? My reason for asking: is the problem specific to the data in that column/header, or is it more general, tied to index 4?
-- EDIT 2 --
OK, so since we know it's that column, we know it's either the header or the rows. Let's concentrate on rows for now. Start with that ampersand; dump it, and see what happens. Next, work with the first 50% of rows. Does deleting that subset affect anything? What about the latter 50% of rows? If one of those subsets changes the result, you ought to be able to narrow it down to an individual row (hopefully not plural) by halving your selection each time.
My guess is that you're going to find a Unicode character or something else funky in one of the cells. Maybe there's a formula or, as you mentioned, some of that "weird cell highlighting."
It's been years since I worked with excel access, but I recall some problems with excel grouping content into some areas that would act as table inside each sheet. Try copy/paste the content from the problematic sheet to a new workbook and connect to that workbook. If this works you may be able to investigate a bit further about areas.

Stripping out time components for data in a csv file with | separated variables

A bit new to UNIX, but I have a question regarding altering csv files going into a datafeed.
There are a few | separated columns where the date has come back as (for example)
|07-04-2006 15:50:33:44:55:66|
and this needs to be changed to
|07-04-2006|
It doesn't matter if all the data gets written to another file. There are thousands of rows in these files.
Ideally, I'm looking for a way of going to the 3rd and 7th piped columns and taking the first 10 characters and removing anything else till the next |
Thanks in advance for your help.
What exactly do you want?
You can replace |07-04-2006 15:50:33:44:55:66| by |07-04-2006| using File IO.
This operates on all columns, but should do unless there are date columns which must not be changed:
sed 's/|\(..-..-....\) ..:..:..:..:..:..|/|\1|/g'
If you want to change the files in place, you can use sed's -i option.
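If only the 3rd and 7th columns should be touched, as the question asks, a field-based approach is one alternative to the pattern-based substitution. The sketch below is written in Python under two assumptions: columns are counted from 1, and lines do not begin with a leading | (which would shift the numbering by one). The function name is hypothetical.

```python
def trim_date_columns(line, columns=(3, 7), width=10, sep="|"):
    """Keep only the first `width` characters of the given 1-based columns."""
    fields = line.split(sep)
    for number in columns:
        index = number - 1  # convert the 1-based column number to a list index
        if index < len(fields):
            fields[index] = fields[index][:width]
    return sep.join(fields)
```

Applied line by line, this leaves every other column exactly as it was, including any date columns that must not be changed.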

Sequence number inside a txt file in UNIX

I want to generate a unique sequence number for each row in a file in Unix. I cannot use an identity column in the database, as other sources also insert data into it. I tried using NR in awk, but since I have filters in my script it may skip rows in the file, so I may not get sequential numbers.
My requirements are: the sequence number needs to be persistent, since I receive this file every day and numbering should start from where I left off. Also, the number needs to be preceded by "EMP_" for each line in the file.
Please suggest.
Thanks in advance.
To obtain a unique ID in UNIX you can use a file to store and read the value. However, this method is tedious and requires a file locking mechanism. The easiest way is to use the date and time to obtain a unique ID, for example:
#!/bin/sh
# No spaces around "=" in a shell assignment
uniqueVal=`date '+%Y%m%d%H%M%S'`
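For the file-based option mentioned above, which does give consecutive numbers that continue from one day to the next, a minimal sketch in Python could look like the following. The function name and the state file are hypothetical, and the sketch assumes a single process runs at a time, so it deliberately omits the file locking that concurrent writers would need.

```python
import os


def next_ids(count, state_path, prefix="EMP_"):
    """Reserve `count` consecutive IDs, continuing from the number saved in state_path."""
    last = 0
    if os.path.exists(state_path):
        with open(state_path) as handle:
            last = int(handle.read().strip() or 0)
    ids = [f"{prefix}{last + offset}" for offset in range(1, count + 1)]
    # Persist the highest number handed out so the next run continues from it.
    with open(state_path, "w") as handle:
        handle.write(str(last + count))
    return ids
```

Calling `next_ids` with the number of rows that survive the filters, rather than the raw line count, keeps the sequence free of gaps.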
