Data manipulation using a script - unix

I have data in the following format :
Key1:Value1 Key2:Value2 Key3:Value3
A
B
C
D
Key1:Value4 Key2:Value5 Key3:Value6
A1
B1
C1
Key1..
and so on. The number of keys is always three, and in this same order. No extra lines between the values A,B,C,D in the original data set.
I want to get output in the format
Value3, A B C D
Value6, A1 B1 C1
.
.
.
Any thoughts on a script I might be able to use to get this done?

Regular expressions can help, though the exact pattern depends on what the values look like. In general, match each header line with a pattern such as Key3:[pattern for the value] and grab that value. Then collect every following line until the next line that starts with Key1, emit the result, and repeat for each section.
Pseudocode:
current_key = ""
values = []
while !EOF:
    line = next_line()
    if line matches the regular expression for "Key3:Value":
        if current_key != "":
            print current_key + ", " + values joined by spaces
        current_key = Value
        values = []
    else:
        append line to values
if current_key != "":
    print current_key + ", " + values joined by spaces
There isn't much error checking but hopefully that helps get you going.
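As a rough illustration of the pseudocode above, here is a minimal Python sketch. It assumes that a header line is exactly three whitespace-separated key:value tokens and that value lines never look like that; the names is_header and regroup are made up for this example.

```python
def is_header(line):
    """Assumption: a header is exactly three whitespace-separated key:value tokens."""
    tokens = line.split()
    return len(tokens) == 3 and all(":" in token for token in tokens)


def regroup(lines):
    """Return rows such as 'Value3, A B C D', one per header line."""
    rows = []
    current_value, members = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if is_header(line):
            # A new header closes the previous section, if there was one.
            if current_value is not None:
                rows.append(f"{current_value}, {' '.join(members)}")
            # Keep only the value of the third key.
            current_value = line.split()[2].split(":", 1)[1]
            members = []
        elif current_value is not None:
            members.append(line)
    # Flush the last section at end of input.
    if current_value is not None:
        rows.append(f"{current_value}, {' '.join(members)}")
    return rows


sample = [
    "Key1:Value1 Key2:Value2 Key3:Value3",
    "A", "B", "C", "D",
    "Key1:Value4 Key2:Value5 Key3:Value6",
    "A1", "B1", "C1",
]
for row in regroup(sample):
    print(row)
```

To run it on a real file, the list could be replaced by an open file handle, since regroup only iterates over lines.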

Related

How can a formula be used in scalc (Apache OpenOffice) to count cells with content, while the count should stop at the first empty cell?

Here (https://www.howtoexcel.org/formulas/how-to-find-the-position-of-the-first-non-blank-cell-in-a-range/) is a source for Excel where the following formula is described:
{=MATCH(FALSE,ISBLANK(B3:B9),0)}
The count should stop at b because the following cell is empty or contains no printable character. The result should be 2.
=COUNTA(A30:A50)   (entered in A29)
a
b
(empty cell)
c
There is no need to use a count function because we can subtract 1 from the matched index to get the count. Also, to find the first blank cell, the search criterion should be true, not false.
=MATCH(TRUE();ISBLANK(A30:A50);0)-1
a
b
(empty cell)
c
Result: 2
In LibreOffice Calc, this works when entered normally. In Apache OpenOffice, it must be entered as an array formula with Ctrl+Shift+Enter.
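For readers who want to check the logic outside the spreadsheet, the same "position of the first blank, minus one" idea can be sketched in Python. This is only an illustration: count_until_blank is a made-up name, and unlike ISBLANK it also treats whitespace-only cells as blank, which is an assumption based on the "no printable character" wording in the question. It also returns the full length when no blank exists, where MATCH would return an error.

```python
def count_until_blank(cells):
    """Count leading non-blank cells, stopping at the first blank one."""
    for index, cell in enumerate(cells):
        # Assumption: None, "" and whitespace-only strings all count as blank.
        if cell is None or str(cell).strip() == "":
            return index
    return len(cells)


column = ["a", "b", "", "c"]
print(count_until_blank(column))  # 2
```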

How can I show only the unequal results and omit or hide the equal ones?

Here's my Robot Framework code.
There are 22 records in both tables, but only 1 record differs. I would like to show that result and omit or hide the results that are equal.
connect to database using custom params    cx_Oracle    ${DB_CONNECT_STRING}
@{queryResultsA}=    Query    Select count (*) from QA_USER.SealTest_Security_A order by SECURITY_ID
Log    ${queryResultsA}
@{queryResultsB}=    Query    Select count (*) from QA_USER.SealTest_Security_B order by SECURITY_ID
Log    ${queryResultsB}
should not contain match    ${queryResultsB}    ${queryResultsA}
Using For Loop
# Assuming your table has values like this
@{queryResultsA}=    Create List    a    b    c    d    e
@{queryResultsB}=    Create List    a    z    c    d    e
${Length}=    Get Length    ${queryResultsA}
${count}=    Set Variable
:FOR    ${count}    IN RANGE    ${Length}
\    Run Keyword If    '@{queryResultsA}[${count}]'!='@{queryResultsB}[${count}]'    Log To Console    @{queryResultsA}[${count}] @{queryResultsB}[${count}]
OUTPUT
b z
Using SET
${z}=    Evaluate    (set(${queryResultsA}) - set(${queryResultsB}))
Log    ${z}
OUTPUT
b
Note the difference: set B is subtracted from set A, so the output contains only the elements of A that have no match in B.
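The difference between the two approaches is easier to see in plain Python. The sketch below is illustrative only and uses the same sample lists as above; positional_mismatches and only_in_first are made-up names, not Robot Framework keywords.

```python
def positional_mismatches(rows_a, rows_b):
    """Pairs that differ at the same position (what the FOR loop reports)."""
    return [(a, b) for a, b in zip(rows_a, rows_b) if a != b]


def only_in_first(rows_a, rows_b):
    """Elements of A that never occur in B (what the set difference reports)."""
    return set(rows_a) - set(rows_b)


results_a = ["a", "b", "c", "d", "e"]
results_b = ["a", "z", "c", "d", "e"]
print(positional_mismatches(results_a, results_b))  # [('b', 'z')]
print(only_in_first(results_a, results_b))          # {'b'}
```

The loop shows both sides of each mismatch but depends on the rows being in the same order. The set difference ignores order and duplicates, and it reports only the side taken from A.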

Compare cell against series of cell pairs

I'm trying to make a LibreOffice spreadsheet formula that populates a column based on another input column, comparing each input with a series of range pairs defined in another sheet and finally outputting a symbol based on matched criteria. I have a series of ranges that specify a - output, and another series that corresponds to +, but not all inputs will fall into a category. I am using this trinary output later for another expression, which I already have in place.
My question becomes: how can I test the input against each range pair without spelling out the cell coordinates for each individual cell (i.e. OR(AND(">= $A$1", "< $B$1"), AND(">=$A$2", "<$B$2"), ...))? Ideally I could just specify an array to compare against, like $A$1:$B$4. Writing it as a Python macro would work too, since I don't plan on sharing this file.
I wrote a really quick list comprehension in Python to illustrate what I'm after. This snippet would be one half, such as testing for - qualification, and these values may be fed into a condition that outputs the symbol:
>>> def cmp(f, r):
...     return r[0] <= f < r[1]
>>> f = (1, 2, 3)
>>> ranges = ((2, 5), (4, 6), (3, 8))
>>> [any([cmp(i, r) for r in ranges]) for i in f]
[False, True, True]
Here is a small test example with real input and real ranges.
Change the range pairs so that they are in two columns starting from A13. Be sure that they are in sorted order (Data -> Sort).
A         B         C
~~~~~~~~  ~~~~~~~~  ~
145.1000  145.5000  -
146.0000  146.4000  +
146.6000  147.0000  -
147.0000  147.4000  +
147.6000  148.0000  -
440.0000  445.0000  +
In each row, specify whether it is negative or positive. To do this, I entered the following formula in C13 and filled down. If the range pairs are not consistent enough then enter values for C13 and below manually.
=IF(ISODD(ROW());"-";"+")
Now, enter the following formula in cell C3 and fill down.
=IFNA(IF(
VLOOKUP(A3;A$13:C$18;2;1) >= A3;
VLOOKUP(A3;A$13:C$18;3;1);
"None");"None")
The formula finds the closest pair and then checks if the number is inside that range or not. For better testing, I would also suggest using 145.7000 as input, which should result in no shift if I understood the question correctly.
The results in column C:
-
+
None
None
Documentation: VLOOKUP, IFNA, ROW.
EDIT:
The following formula produces correct results for the example data you gave, and it works for anything between 144.0 and 148.0.
=IFNA(VLOOKUP(A3;A$13:C$18;3;1); "None")
However, 150.0 produces - and 550.0 produces +. If that is not what you want, then use the formula above that has two VLOOKUP expressions.
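For the Python macro route mentioned in the question, the two-VLOOKUP formula can be sketched roughly as follows. This is an illustration, not a LibreOffice macro: RANGES copies the table above, classify is a made-up name, and bisect_right stands in for the sorted (approximate) VLOOKUP match, which picks the last range start that is less than or equal to the input.

```python
from bisect import bisect_right

# (start, end, symbol) rows copied from the table above, sorted by start.
RANGES = [
    (145.1, 145.5, "-"),
    (146.0, 146.4, "+"),
    (146.6, 147.0, "-"),
    (147.0, 147.4, "+"),
    (147.6, 148.0, "-"),
    (440.0, 445.0, "+"),
]
STARTS = [start for start, _, _ in RANGES]


def classify(value):
    """Return the symbol of the range containing value, or 'None'."""
    # Last row whose start is <= value, like VLOOKUP with sorted mode 1.
    index = bisect_right(STARTS, value) - 1
    if index < 0:
        return "None"  # below the first range, where VLOOKUP gives #N/A
    _, end, symbol = RANGES[index]
    # Same check as the formula: the range end must be >= the input.
    return symbol if end >= value else "None"


print(classify(145.2), classify(146.1), classify(145.7), classify(150.0))
```

As with the formula, this relies on the ranges being sorted and not overlapping.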

How to create an Excel formula that will add a number based on a specific digit of a multi-digit number

Ex: I enter the number 9876543210 in a cell.
I want to create an IF formula that adds a number to this, working only off the last digit (the zero in this example).
If the last digit is >= 3, then add 5; if the last digit is <= 2, then add 15.
Then have this formula repeat for 10 numbers. Is that possible?
So I input 9876543210
and it then shows:
9876543225
9876543230
9876543245
and so on
=IF((RIGHT(A1,1)/1)>2,A1+5,A1+15)
This assumes the starting number is in cell A1. Paste the formula above into A2 and copy it down as far as needed.
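To check the rule by hand, the same logic can be sketched in Python. This is only an illustration of the formula above; next_number and sequence are made-up names.

```python
def next_number(n):
    """Add 5 when the last digit is 3 or more, otherwise add 15."""
    return n + 5 if n % 10 > 2 else n + 15


def sequence(start, count):
    """Apply the rule repeatedly, like filling the formula down count rows."""
    values = []
    current = start
    for _ in range(count):
        current = next_number(current)
        values.append(current)
    return values


for value in sequence(9876543210, 3):
    print(value)
# 9876543225
# 9876543230
# 9876543245
```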
If this is Excel, you may want to use MOD (modulo or remainder) function to get the last digit and then perform an IF-THEN or nested IF-THEN to achieve this.
=IF(MOD(A1,10)=3, A1+15, IF(MOD(A1,10)=5, A1+20, A1+30))
This formula translates to the following decision tree:
IF the last digit of the value in cell A1 is 3 THEN
    Add 15 to it
ELSEIF the last digit of the value in cell A1 is 5 THEN
    Add 20 to it
ELSE
    Add 30 to it
END IF
Repeating the operation may require some VBA. If you already know how many times you need to repeat the operation, you can pre-populate formulas in subsequent rows/columns, each referring to the immediately preceding cell. For example, if you want to repeat it 5 times, compute the difference between the first two cells and then add that difference to the value of the immediately preceding row/column like this (assuming A1 has the original value, B1 has the formula I posted above, and C1 through G1 are the next 5 cells):
In C1: =B1 + ($B1 - $A1)
In D1: =C1 + ($B1 - $A1)
and so on...
Note the use of absolute and relative addresses in these formulae. You can copy/paste the formula in C1 to the subsequent cells and it will automatically adjust itself to refer to immediately preceding cell.
EDIT
I just realized that you want to evaluate the MOD formula in each subsequent cell. In that case you simply need to copy/paste it to subsequent cells instead of using 2nd and 3rd formulas I posted above.

Reading Raw Data in SAS Or R

For our analysis we need to read raw data from CSV (or XLS) files and convert it into a SAS dataset.
The problem is that this raw data generally has two issues:
1. The order of the columns sometimes changes. If an earlier period has the columns in the order A, B, C, a later one might have B, C, A.
2. There are foreign elements such as "#", ".", or stray letters.
So we first have to clean the raw data before reading it into SAS, which takes a considerable amount of time. Is there any way to clean the data within SAS itself as it is read? If we could rectify the data with SAS code, it would save quite a lot of time.
Here's the example:
Period 1: I got the data in Data1.csv in the format below. Column B, which is numeric, contains "#" and ".", and column C, which is also numeric, contains "g". If I import Data1.csv using either PROC IMPORT or an INFILE statement, these foreign elements in columns B and C remain. How can I remove them? I could use IF statements, but there are too many possible foreign elements (instead of "#", ".", and "g", I might get others such as "$" or "h"). Is there code that detects and removes foreign elements without my having to specify each one in an IF statement every time I import raw data into SAS?
A      B  C
Name1  1  5
Name2  2  6
Name3  3  4
Name4  #  g
Name5  5  3
Name6  .  6
Period 2: In this period I got DATA2.csv, shown below. In my INFILE code I specify that A is read first under a specific name, then B, then C. In the second period, B comes first in the file, so SAS reads B's values into A. I therefore have to check the variable order against the previous period's data every time and correct it before reading the file. Since the number of variables is large, verifying the column order this way is very time consuming (and at times frustrating). Is there SAS code that will read A, then B, then C automatically, even when the file is not in that order?
B  A      C
1  Name1  5
2  Name2  6
3  Name3  4
#  Name4  g
5  Name5  3
.  Name6  6
I mainly use SAS for analysis, but I could use R to clean the data and then read it into SAS for further analysis, so R code would also be helpful.
Thanks.
In R you increase the speed of file reading when you specify that a column is a particular class. With the example provided (3 columns, with the middle one being "character"), you might use this code:
dat <- read.csv( filename, colClasses=c("numeric", "character", "numeric"), comment.char="")
The "#" and "." would become NA values when encountered in the numeric columns. The above code removes the default specification of the comment character which is "#". If you wanted the "#" and "." entries in character columns to be coerced to NA_character_, you could use this code:
dat <- read.csv( filename,
                 colClasses=c("numeric", "character", "numeric"),
                 comment.char="",
                 na.strings=c("NA", ".", "#") )
By default the header=TRUE setting is assumed by read.csv(), but if you used read.table() you would need to assert header=TRUE with the two file structures you showed. There is further documentation and worked examples of reading Excel data here: However, my advice is to do as you are planning and use CSV transfer. You will see the screwy things Excel does with dates and missing values more quickly that way. You would be well advised to change the data formats to a custom "yyyy-mm-dd" in agreement with the POSIX standard, in which case you can also specify a "Date" classed column and skip the process of turning character classed columns in the default Excel formats (all of which are bad) into dates.
Yes, you can use SAS to do any kind of "data cleaning" you might imagine. The SAS DATA step language is full of features to do things like this, but there is no magic bullet; you need to write the code yourself.
A csv file is just a plain text file (very different from an xls file). Normally the first row in a csv file contains column names and the data begins with the second row. If you use PROC IMPORT, SAS will use the first row to construct variable names and try to determine data types by scanning the first several rows of the file. For example:
proc import datafile='c:\temp\somefile.csv'
    out=SASdata
    dbms=csv replace;
run;
Alternatively, you can read the file with a data step. This would require that you know the file layout in advance. For example:
data SASdata;
    infile 'c:\temp\somefile.csv' dsd firstobs=2 lrecl=32767 truncover;
    informat A $50.;            /* A character variable with max length 50 */
    informat B yymmdd10.;       /* A date presented like 2012-08-25 */
    informat C dollar12.;       /* A number containing dollar sign, commas, or decimals */
    input A B C;                /* The order of the variables in the file */
    if B = . then B = today();  /* A possible data cleaning statement */
run;
Note that the INPUT statement must list the variables in the order in which they appear in the file. The point is that the code you use must match the layout of each file you process.
These are just general comments. If you encounter problems, post back with a more specific question.
UPDATE FOR UPDATED QUESTION: The variables from the raw data file must be listed in the INPUT statement in the same order as they exist in each file. Also, you need to define the column types directly and establish whatever rules they need to follow. There is no way to do this automatically; each file must be treated separately.
In this case, let's assume your variables are A, B, and C, where A is character and B and C are numbers. This program might process both files and add them to a history dataset (let's say ALLDATA):
data temp;
    infile 'c:\temp\data1.csv' dsd firstobs=2 lrecl=32767 truncover;
    /* Define dataset variables */
    informat A $50.;
    informat B 12.;
    informat C 12.;
    /* Add a KEEP statement to keep only the variables you want */
    keep A B C;
    input A B C;
run;
proc append base=ALLDATA data=temp;
run;
data temp;
    infile 'c:\temp\data2.csv' dsd firstobs=2 lrecl=32767 truncover;
    informat A $50.;
    informat B 12.;
    informat C 12.;
    input B A C;
run;
proc append base=ALLDATA data=temp;
run;
Notice that the "data definition" part of each data step is the same; the only difference is the order of variables listed in the INPUT statement. Notice also that because the variables B and C are defined as numeric, when invalid characters (# and g) are read, the values are stored as missing values.
In your case, I'd create a template SAS program to define all the variables you want in the order you expect them to be. Then use that template to import each file using the order of the variables in that file. Setting up the template program might take a while, but to run it you would only need to modify the INPUT statement.
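Neither SAS nor R, but as a language-neutral illustration of the two ideas (look columns up by header name instead of position, and turn anything non-numeric into a missing value), here is a minimal Python sketch. It assumes the files are comma-separated with a header row and that B and C are the numeric columns; read_clean and to_number are made-up names.

```python
import csv
import io

# Assumption: these are the columns that should hold numbers.
NUMERIC_COLUMNS = ("B", "C")


def to_number(text):
    """Return a float, or None for foreign elements such as '#', '.' or 'g'."""
    try:
        return float(text)
    except ValueError:
        return None


def read_clean(handle):
    """Read rows by header name, so the column order in the file is irrelevant."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"A": record["A"]}
        for name in NUMERIC_COLUMNS:
            row[name] = to_number(record[name])
        rows.append(row)
    return rows


# In-memory stand-in for DATA2.csv, where B comes before A.
data2 = io.StringIO("B,A,C\n1,Name1,5\n#,Name4,g\n.,Name6,6\n")
rows = read_clean(data2)
for row in rows:
    print(row)
```

The cleaned rows could then be written back out as CSV with a fixed column order, so that a single INPUT statement works for every period.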
