Join multiline message from log file into a single row in R - r

How is it possible to join multiple lines of a log file into 1 dataframe row?
ADDED ONE LINE -- Example 4-line log file:
[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]
[INFO ][2019-03-15 12:34:55,886][DefaultListableBeanFactory] - [Overriding bean definition for bean 'cpnreq': replacing [Generic bean: class [com.ar.moves.domain.bom.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/404.jar!/com/ar/moves/moves-context.xml]] with [Generic bean: class [com.ar.bl.bom.domain.Cpnreq]; scope=prototype; abstract=false; lazyInit=false; autowireMode=0; dependencyCheck=0; autowireCandidate=true; primary=false; factoryBeanName=null; factoryMethodName=null; initMethodName=null; destroyMethodName=null; defined in URL [jar:file:/D:/Dev/Tools/Tomcatv8.5-appGit-master/404.jar!/com/ar/bl/bom/bl-bom-context.xml]]]
(See representative 8-line extract at https://pastebin.com/bsmWWCgw.)
The structure is clean:
[PRIOR][datetime][ClassName] - [Msg]
but the message is often multi-lined, there may be multiple brackets in the message itself (even trailing…), or ^M newlines, but not necessarily… That makes it difficult to parse. Dunno where to begin here…
So, in order to process such a file, and be able to read it with something like:
#!/usr/bin/env Rscript
df <- read.table('D:/logfile.log')
we really need to have that merge of lines happening first. How is that doable?
The goal is to load the whole log file for making graphics, analysis (grepping out stuff), and eventually writing it back into a file, so -- if possible -- newlines should be kept in order to respect the original formatting.
The expected dataframe would look like:
PRIOR Datetime ClassName Msg
----- ------------------- ------------------- ----------
WARN 2016-12-16 13:43:10 ConfigManagerLoader Low max...
DEBUG 2016-05-26 10:10:22 DataSourceImpl SELECT ...
And, ideally once again, this should be doable in R directly (?), so that we can "process" a live log file (opened in write mode by the server app), "à la tail -f".

This is a pretty wicked Regex bomb. I'd recommend using the stringr package, but you could do all this with grep style functions.
library(stringr)
str <- c(
'[WARN ][2016-12-16 13:43:10,138][ConfigManagerLoader] - [Low max memory=477102080. Java max memory=1000 MB is recommended for production use, as a minimum.]
[DEBUG][2016-05-26 10:10:22,185][DataSourceImpl] - [SELECT mr.lb_id,mr.lf_id,mr.mr_id FROM mr WHERE (( mr.cap_em >
0 AND mr.cap_em > 5
)) ORDER BY mr.lb_id, mr.lf_id, mr.mr_id]
[ERROR][2016-12-21 13:51:04,710][DWRWorkflowService] - [Update Wizard - : [DWR WFR request error:
workflow rule = BenCommonResources-getDataRecords
version = 2.0
filterValues = [{"fieldName": "wotable_hwohtable.status", "filterValue": "CLOSED"}, {"fieldName": "wotable_hwohtable.status_clearance", "filterValue": "Goods Delivered"}]
sortValues = [{"fieldName": "wotable_hwohtable.cost_actual", "sortOrder": -1}]
Result code = ruleFailed
Result message = Database error while processing request.
Result details = null
]]'
)
Using regex we can split each line by checking for the pattern you mentioned. This regex is checking for a [, followed by any non-line feed character or line feed character or carriage return character, followed by a [. But do this is a lazy (non-greedy) way by using *?. Repeat that 3 times, then check for a -. Finally, check for a [, followed by any characters or a group that includes information within square brackets, then a ]. That's a mouthful. Type it into a regex calculator. Just remember to remove the extra backlashes (in a regex calculator \ is used but in R \\ is used).
# Split the text into each line without using \n or \r.
# pattern for each line is a lazy (non-greedy) [][][] - []
linesplit <- str %>%
# str_remove_all("\n") %>%
# str_extract_all('\\[(.|\\n|\\r)+\\]')
str_extract_all('\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\]\\[(.|\\n|\\r)*?\\] - \\[(.|\\n|\\r|(\\[(.|\\n|\\r)*?\\]))*?\\]') %>%
unlist()
linesplit # Run this to view what happened
Now that we have each line separated break them into columns. But we don't want to keep the [ or ] so we use a positive lookbehind and a positive lookahead in the regex code to check to see if the are there without capturing them. Oh, and capture everything between them of course.
# Split each line into columns
colsplit <- linesplit %>%
str_extract_all("(?<=\\[)(.|\\n|\\r)*?(?=\\])")
colsplit # Run this to view what happened
Now we have a list with an object for each line. In each object are 4 items for each column. We need to convert those 4 items to a dataframe and then join those dataframes together.
# Convert each line to a dataframe, then join the dataframes together
df <- lapply(colsplit,
function(x){
data.frame(
PRIOR = x[1],
Datetime = x[2],
ClassName = x[3],
Msg = x[4],
stringsAsFactors = FALSE
)
}
) %>%
do.call(rbind,.)
df
# PRIOR Datetime ClassName Msg
# 1 WARN 2016-12-16 13:43:10,138 ConfigManagerLoader Low max memory=
# 2 DEBUG 2016-05-26 10:10:22,185 DataSourceImpl SELECT mr.lb_id
# 3 ERROR 2016-12-21 13:51:04,710 DWRWorkflowService Update Wizard -
# Note: there are extra spaces that probably should be trimmed,
# and the dates are slightly messed up. I'll leave those for the
# questioner to fix using a mutate and the string functions.
I will leave it to you to fix the extra spaces, and date field.

Related

Airflow SqlToS3Operator has unwanted an index in the beginning

Recent airflow-providers-amazon has deprecated MySQLToS3Operator and introduced SqlToS3Operator and now it is adding an index column in the beginning of the CSV dump.
For example, if I run the following
sql_to_s3_task = SqlToS3Operator(
task_id="sql_to_s3_task",
sql_conn_id=conn_id_name,
query="SELECT created_at, score FROM my_table",
s3_bucket=bucket_name,
s3_key=key,
replace=True,
)
The S3 file has something like this:
,created_at,score
1,2023-01-01,5
2,2023-01-02,6
The output seems to be a direct dump from Pandas. How can I remove this unwanted preceding index column?
The operator uses pandas DataFrame under the hood.
You should use pd_kwargs. It allows you to pass arguments to include in DataFrame .to_parquet(), .to_json() or .to_csv().
Since your output is csv the relevant pandas.DataFrame.to_csv parameters are:
header: bool or list of str, default True
Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
index: bool, default True
Write row names (index).
Thus you can do:
sql_to_s3_task = SqlToS3Operator(
task_id="sql_to_s3_task",
sql_conn_id=conn_id_name,
query="SELECT created_at, score FROM my_table",
s3_bucket=bucket_name,
s3_key=key,
replace=True,
file_format="csv",
pd_kwargs={"index": False, "header": False},
)

Passing array as input variable into a Ruby file from zsh CLI

I am writing a ruby file that is called from zsh and, among others, I am trying to pass an array as an input variable like that:
ruby cim_manager.rb test --target=WhiteLabel --devices=["iPhone 8", "iPhone 12 Pro"]
Inside my ruby file I have a function:
# Generates a hash value from an array of arguments
#
# #param [Array<String>] The array of values. Each value of the array needs to separate the key and the value with "=". All "--" substrings will be replaced for empty substrings
#
# #return [Hash]
#
def generate_hash_from_arguemnts(args)
hash = {}
args.each{ |item|
item = item.gsub("--", "")
item = item.split("=")
puts item.kind_of?(Array)
hash[item[0].to_s] = item[1].to_s
}
return hash
end
So I can have a value like:
{"target": "WhiteLabel", "devices": ["iPhone 8", "iPhone 12 Pro"]}
The error I am getting when executing my Ruby file is:
foo#Mac-mini fastlane % ruby cim_manager.rb test --target=WhiteLabel --devices=["iPhone 8", "iPhone 12 Pro"]
zsh: bad pattern: --devices=[iPhone 8,
Any ideas?
#ReimondHill : I don't see how the error is possibly related to Ruby. You have a zsh-line, in which you have --devices= [.... You could get the same error when doing a
echo --devices=["iPhone 8", "iPhone 12 Pro"]
An open square bracket is a zsh wildcard construct; for instance, [aeiou] is a wildcard which tries to match against a vocal in a file name. Hence, this parameter tries to match against files starting with the name --devices= in your working directory, so you would expect an error message like no matches found: --devices=.... However, there is one gotcha: The list of characters between [ ... ] must not have an (unescaped) space. Therefore, you don't see no matches found, but bad pattern.
After all, you don't want a filename expansion to occur, but pass the parameter to your program. Therefore you need to quote it:
ruby .... '--devices=["iPhone 8", "iPhone 12 Pro"]'
Ronald
Following the answer from #user1934428, I am extending my ruby file like that:
# Generates a hash value from an array of arguments
#
# #param [Array<String>] The array of values. Each value of the array needs to separate the key and the value with "=". All "--" substrings will be replaced for empty substrings
#
# #return [Hash]
#
def generate_hash_from_arguemnts(args)
hash = {}
args.each{ |item|
item = item.gsub("--", "")
item = item.gsub("\"", "")
item = item.split("=")
key = item[0].to_s
value = item[1]
if value.include?("[") && value.include?("]") #Or any other pattern you decide
value = value.gsub("[","")
value = value.gsub("]","")
value = value.split(",")
end
hash[key] = value
}
return hash
end
And then my zsh-line:
ruby cim_manager.rb test --target=WhiteLabel --devices='[iPhone 8,iPhone 12 Pro]'
The return value from generate_hash_from_arguemnts prints:
{"target"=>"WhiteLabel", "devices"=>["iPhone 8", "iPhone 12 Pro"]}

testing errors: comparing 2 files

I have 5 functions working relatively
1- singleline_diff(line1, line2)
comparing 2 line in one file
Inputs:
line1 - first single line string
line2 - second single line string
Output:
the index of the first difference between the two lines
identical if the two lines are the same.
2- singleline_diff_format(line1, line2, idx):
comparing 2 line in one file
Inputs:
line1 - first single line string
line2 - second single line string
idx - index at which to indicate difference (from 1st function)
Output:
abcd (first line)
==^ (= indicate identical character, ^ indicate the difference)
abef (second line)
If either input line contains a newline or carriage return,
then returns an empty string.
If idx is not a valid index, then returns an empty string.
3- multiline_diff(lines1, lines2):
deal with two lists of lines
Inputs:
lines1 - list of single line strings
lines2 - list of single line strings
Output:
a tuple containing the line number (starting from 0) and
the index in that line where the first difference between lines1
and lines2 occurs.
Returns (IDENTICAL, IDENTICAL) if the two lists are the same.
4-get_file_lines(filename)
Inputs:
filename - name of file to read
Output:
a list of lines from the file named filename.
If the file does not exist or is not readable, then the
behavior of this function is undefined.
5- file_diff_format(filename1, filename2) " the function with the problem"
deals with two input files
Inputs:
filename1 - name of first file
filename2 - name of second file
Output:
four line string showing the location of the first
difference between the two files named by the inputs.
If the files are identical, the function instead returns the
string "No differences\n".
If either file does not exist or is not readable, then the
behavior of this function is undefined.
testing the function:
everything goes will until it the test use one empty file
it gave me "list index out of range"
this is the code I use
def file_diff_format(filename1, filename2):
file_1 = get_file_lines(filename1)
file_2 = get_file_lines(filename2)
mli_dif = multiline_diff(file_1, file_2)
min_lens = min(len(file_1), len(file_2))
if mli_dif == (-1,-1) :
return "No differences" "\n"
else:
diff_line_indx = mli_dif[0]
diff_str_indx = int (mli_dif[1])
if len(file_1) >= 0:
line_file_1 = ""
else:
line_file_1 = file_1[diff_line_indx]
if len(file_2) >= 0:
line_file_2 = ""
else:
line_file_2 = file_2[diff_line_indx]
line_file_1 = file_1[diff_line_indx]
line_file_2 = file_2 [diff_line_indx]
out_print = singleline_diff_format(line_file_1, line_file_2, diff_str_indx)
return ( "Line {}{}{}".format ((diff_line_indx), (":\n"), (out_print)))
If one of the files is empty, either file1 or file2 should be an empty list, so that trying to access an element of either would cause the error you describe.
Your code checks for these files to be empty when assigning to line_file_`` andline_file_2`, but then goes ahead and tries to access elements of both.

IndexError: list index out of range, scores.append( (fields[0], fields[1]))

I'm trying to read a file and put contents in a list. I have done this mnay times before and it has worked but this time it throws back the error "list index out of range".
the code is:
with open("File.txt") as f:
scores = []
for line in f:
fields = line.split()
scores.append( (fields[0], fields[1]))
print(scores)
The text file is in the format;
Alpha:[0, 1]
Bravo:[0, 0]
Charlie:[60, 8, 901]
Foxtrot:[0]
I cant see why it is giving me this problem. Is it because I have more than one value for each item? Or is it the fact that I have a colon in my text file?
How can I get around this problem?
Thanks
If I understand you well this code will print you desired result:
import re
with open("File.txt") as f:
# Let's make dictionary for scores {name:scores}.
scores = {}
# Define regular expressin to parse team name and team scores from line.
patternScore = '\[([^\]]+)\]'
patternName = '(.*):'
for line in f:
# Find value for team name and its scores.
fields = re.search(patternScore, line).groups()[0].split(', ')
name = re.search(patternName, line).groups()[0]
# Update dictionary with new value.
scores[name] = fields
# Print output first goes first element of keyValue in dict then goes keyName
for key in scores:
print (scores[key][0] + ':' + key)
You will recieve following output:
60:Charlie
0:Alpha
0:Bravo
0:Foxtrot

Pyparsing - name not starting with a character

I am trying to use Pyparsing to identify a keyword which is not beginning with $ So for the following input:
$abc = 5 # is not a valid one
abc123 = 10 # is valid one
abc$ = 23 # is a valid one
I tried the following
var = Word(printables, excludeChars='$')
var.parseString('$abc')
But this doesn't allow any $ in var. How can I specify all printable characters other than $ in the first character position? Any help will be appreciated.
Thanks
Abhijit
You can use the method I used to define "all characters except X" before I added the excludeChars parameter to the Word class:
NOT_DOLLAR_SIGN = ''.join(c for c in printables if c != '$')
keyword_not_starting_with_dollar = Word(NOT_DOLLAR_SIGN, printables)
This should be a bit more efficient than building up with a Combine and a NotAny. But this will match almost anything, integers, words, valid identifiers, invalid identifiers, so I'm skeptical of the value of this kind of expression in your parser.

Resources