Importing unstructured software log file in R? - r

Below is a sample of our software's log file. I would like to analyze this data in R to gain some insight.
30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f
30-Mar-14 17:59:58.1254 (6628 6452) Module1.exe:Program1.cpp,v:880: ERROR: group 7 failed on its 3 retry
30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1
30-Mar-14 18:00:08.6213 ( -1 1376 13900) Module2.exe:Execute:603: Information - command 1 completed.
30-Mar-14 18:00:08.6273 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 2
Each log file contains about 20k lines, and we have many log files.
My requirement is to split each line into fields as follows.
| 30-Mar-14 | 17:59:58.1244 | (6628 6452) | Module1.exe:Program1.cpp,v | :854: | ERROR: group 7 failed with error = 0x8004000f |
I tried to import this dataset using "Import Dataset" --> "From File" in RStudio, trying the different options available there, but it was unable to recognize the fields. Is there an option to split based on patterns or regular expressions?
Software environment:
R language v3.0.3
RStudio
Windows 7
Note: I have edited the log file to remove real module names.

There is no such option in the GUI itself (unlike Excel or SPSS, for instance, which might have more powerful GUI import options). You need a script for that.
You can construct a regular expression with capture groups that matches every line, and call gsub to extract the captured values. For instance:
text <- readLines("log.log")
# group 1: date (e.g. 30-Mar-14), group 2: time (e.g. 17:59:58.1244)
rx <- "^([0-9]+-[^-]+-[0-9]+) +([0-9]+:[0-9]+:[0-9]+[.][0-9]+) +.*$"
stopifnot(grepl(rx, text))
And then:
date <- gsub(rx, "\\1", text)
time <- gsub(rx, "\\2", text)
date.time.df <- data.frame(date, time)
Or:
date.time <- gsub(rx, "\\1\n\\2", text)
date.time.l <- strsplit(date.time, "\n")
do.call(rbind, date.time.l)
Enhance rx to match the other fields.
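As a rough sketch of what such an enhanced rx could look like, the pattern below adds one capture group per field and collects them with regexec and regmatches instead of repeated gsub calls. The field layout is assumed from the sample lines only, and the column names (ids, source, line, message) are made up for illustration.

```r
# Two sample lines copied from the question
text <- c(
  "30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f",
  "30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1"
)

# One capture group per field (layout assumed from the sample lines)
rx <- paste0(
  "^([0-9]+-[A-Za-z]+-[0-9]+) +",        # date, e.g. 30-Mar-14
  "([0-9]+:[0-9]+:[0-9]+[.][0-9]+) +",   # time with fractional seconds
  "(\\([^)]*\\)) +",                     # ids in parentheses
  "([^ ]+):([0-9]+): +",                 # source (no spaces) and line number
  "(.*)$"                                # free-text message
)

m <- regmatches(text, regexec(rx, text))
# Each match holds the full line plus six groups; fail early otherwise
stopifnot(sapply(m, length) == 7)

# Drop the full-line match and stack the groups into a data frame
fields <- do.call(rbind, lapply(m, function(v) v[-1]))
log_df <- data.frame(fields, stringsAsFactors = FALSE)
names(log_df) <- c("date", "time", "ids", "source", "line", "message")
log_df$line <- as.integer(log_df$line)
print(log_df)
```

For a real file, text would come from readLines("log.log") as above. Lines that do not follow the assumed layout make the stopifnot check fail, which is usually preferable to silently misparsed rows.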

Here is a script that will do it:
x <- scan(text = "30-Mar-14 17:59:58.1244 (6628 6452) Module1.exe:Program1.cpp,v:854: ERROR: group 7 failed with error = 0x8004000f
30-Mar-14 17:59:58.1254 (6628 6452) Module1.exe:Program1.cpp,v:880: ERROR: group 7 failed on its 3 retry
30-Mar-14 18:00:04.8491 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 1
30-Mar-14 18:00:08.6213 ( -1 1376 13900) Module2.exe:Execute:603: Information - command 1 completed.
30-Mar-14 18:00:08.6273 ( -1 1376 13900) Module2.exe:Execute:803: Information - Executing command 2",
what = '', sep = '\n')
# pull off date/time
dateTime <- sapply(strsplit(x, ' '), '[', 1:2)
# piece together with "|"
dateTime <- apply(dateTime, 2, paste, collapse = "|")
newX <- sub("^[^ ]+ [^(]+", "", x)
# extract the data in parentheses
par1 <- sub("(\\([^)]+\\)).*", "\\1", newX)
newX <- sub("[^)]+\\)", "", newX) # remove data just matched
# parse the rest of the data
x <- strsplit(newX, ":")
y <- sapply(x, function(.line){
paste(c(paste(c(.line[1], .line[2]), collapse = ":")
, paste0(":", .line[3], ":")
, paste(.line[-(1:3)], collapse = ":")
), collapse = "|")
})
# put it all back together
paste0("|"
, dateTime
, "|"
, par1
, "|"
, y
, "|"
)
Here is the output of the script:
[1] "|30-Mar-14|17:59:58.1244|(6628 6452)| Module1.exe:Program1.cpp,v|:854:| ERROR: group 7 failed with error = 0x8004000f|"
[2] "|30-Mar-14|17:59:58.1254|(6628 6452)| Module1.exe:Program1.cpp,v|:880:| ERROR: group 7 failed on its 3 retry|"
[3] "|30-Mar-14|18:00:04.8491|( -1 1376 13900)| Module2.exe:Execute|:803:| Information - Executing command 1|"
[4] "|30-Mar-14|18:00:08.6213|( -1 1376 13900)| Module2.exe:Execute|:603:| Information - command 1 completed.|"
[5] "|30-Mar-14|18:00:08.6273|( -1 1376 13900)| Module2.exe:Execute|:803:| Information - Executing command 2|"

Related

Error in Rscript: "Error in system("tail -n1010 EpisodeIV_dialogues.txt | cut -f2", intern = TRUE) : 'tail' not found"

I'm trying to run the following script on R, but I get an error that I do not understand. The script is supposed to parse a movie script which is in txt format.
setwd("C:/Users/name/Desktop/star wars")
# read episode IV script in R (this is a character vector)
sw = readLines("StarWars_EpisodeIV_script.txt")
# inspect first 70 lines
# you'll see that the first dialogue is from THREEPIO in line 52
sw[1:70]
# command to extract character name (just for demo purposes)
substr(sw[52], 21, nchar(sw[52]))
# command to extract dialogue text (just for demo purposes)
substr(sw[53], 11, nchar(sw[53]))
# we need these auxiliary strings to help us
# extract character names and their dialogues
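b10 = paste(rep(" ", 10), collapse = "") # 10 spaces: indentation of dialogue lines
b20 = paste(rep(" ", 20), collapse = "") # 20 spaces: indentation of character names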
# how many lines in input file
nlines = length(sw)
# let's parse the entire script while extracting only the names of the
# characters and their dialogues. The output file is EpisodeIV_dialogues.txt
# write first line in output file
writeLines("STAR WARS - EPISODE 4: STAR WARS", "EpisodeIV_dialogues.txt")
# the first 50 lines don't contain dialogues
# start reading at line 50
i = 50
# while loop to extract character and dialogues
# you may get some errors, just ignore them and re-run
# the while loop as many times as needed
while (i <= nlines)
{
  # if empty line
  if (sw[i] == "") i = i + 1 # next line
  # if text line
  if (sw[i] != "")
  {
    # if script description
    if (substr(sw[i], 1, 1) != " ") i = i + 1 # next line
    if (nchar(sw[i]) < 10) i = i + 1 # next line
    # if character name
    if (substr(sw[i], 1, 20) == b20)
    {
      if (substr(sw[i], 21, 21) != " ")
      {
        tmp_name = substr(sw[i], 21, nchar(sw[i], "bytes"))
        cat("\n", file="EpisodeIV_dialogues.txt", append=TRUE)
        cat(tmp_name, "", file="EpisodeIV_dialogues.txt", sep="\t", append=TRUE)
        i = i + 1
      } else {
        i = i + 1
      }
    }
    # if dialogue
    if (substr(sw[i], 1, 10) == b10)
    {
      if (substr(sw[i], 11, 11) != " ")
      {
        tmp_diag = substr(sw[i], 11, nchar(sw[i], "bytes"))
        cat(tmp_diag, file="EpisodeIV_dialogues.txt", append=TRUE)
        i = i + 1
      } else {
        i = i + 1
      }
    }
  }
}
# =====================================================================
# Creating data table "SW_EpisodeIV.txt"
# =====================================================================
# how many lines in output file
system("wc -l EpisodeIV_dialogues.txt")
# get vector of character names
SW4_chars = system("tail -n1010 EpisodeIV_dialogues.txt | cut -f1", intern=TRUE)
# get vector of dialogue lines
SW4_diags = system("tail -n1010 EpisodeIV_dialogues.txt | cut -f2", intern=TRUE)
# check character names
table(SW4_chars)
# remove voices
SW4_chars = gsub("'S VOICE", "", SW4_chars)
# join characters and dialogues in one table
SW4 = cbind(character=SW4_chars, dialogue=SW4_diags)
# save SW4 in file 'SW_EpisodeIV.txt'
write.table(SW4, file="SW_EpisodeIV.txt")
# if you want to check the data table
A = read.table("SW_EpisodeIV.txt")
head(A)
tail(A)
The error comes up when I run the following lines
SW4_chars = system("tail -n1010 EpisodeIV_dialogues.txt | cut -f1", intern=TRUE)
# get vector of dialogue lines
SW4_diags = system("tail -n1010 EpisodeIV_dialogues.txt | cut -f2", intern=TRUE)
The error says
Error in system("tail -n1010 EpisodeIV_dialogues.txt | cut -f2", intern = TRUE) :
'tail' not found
I'm not sure what the error means.
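The path in setwd() suggests Windows, where the Unix utilities tail and cut are usually not available on the PATH, which would explain the message. A minimal sketch of the same step in plain R follows. It assumes the output file holds one "NAME<tab>dialogue" record per line, and it uses a small temporary file as a hypothetical stand-in for EpisodeIV_dialogues.txt.

```r
# Hypothetical stand-in for EpisodeIV_dialogues.txt
out_file <- tempfile(fileext = ".txt")
writeLines(c("STAR WARS - EPISODE 4: STAR WARS",
             "",
             "THREEPIO\tDid you hear that?",
             "LUKE\tWhat is it?"), out_file)

all_lines <- readLines(out_file)
n_keep <- 2                                   # the question uses 1010
last_lines <- utils::tail(all_lines, n_keep)  # stands in for `tail -n1010`

# Split each record on the tab character
parts <- strsplit(last_lines, "\t", fixed = TRUE)
SW4_chars <- sapply(parts, `[`, 1)            # stands in for `cut -f1`
SW4_diags <- sapply(parts, `[`, 2)            # stands in for `cut -f2`

print(cbind(character = SW4_chars, dialogue = SW4_diags))
```

Unlike cut, this sketch yields NA for the second field when a line contains no tab, so such lines may need separate handling.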

Sparklyr - Decimal precision 8 exceeds max precision 7

I'm trying to copy a big database into Spark using spark_read_csv, but I'm getting the following error as output:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 16.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 16.0 (TID 176, 10.1.2.235):
java.lang.IllegalArgumentException: requirement failed: Decimal
precision 8 exceeds max precision 7
data_tbl <- spark_read_csv(sc, "data", "D:/base_csv", delimiter = "|", overwrite = TRUE)
It's a big data set, about 5.8 million records, with columns of types int, num and chr.
I think you have a couple of options, depending on the Spark version that you're using.
Spark >=1.6.1
According to https://docs.databricks.com/spark/latest/sparkr/functions/read.df.html
it seems you can specify your schema explicitly to force it to use doubles:
csvSchema <- structType(structField("carat", "double"), structField("color", "string"))
diamondsLoadWithSchema<- read.df("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
source = "csv", header="true", schema = csvSchema)
Spark < 1.6.1
Consider test.csv:
1,a,4.1234567890
2,b,9.0987654321
You can easily make this more efficient, but I think you get the gist:
# split one CSV line into its comma-separated fields
linesplit <- function(x){
  tmp <- strsplit(x, ",")
  return(tmp)
}
# convert the fields of one line to integer, character and double
lineconvert <- function(x){
  arow <- x[[1]]
  converted <- list(as.integer(arow[1]), as.character(arow[2]), as.double(arow[3]))
  return(converted)
}
rdd <- SparkR:::textFile(sc,'/path/to/test.csv')
lnspl <- SparkR:::map(rdd, linesplit)
ll2 <- SparkR:::map(lnspl,lineconvert)
ddf <- createDataFrame(sqlContext,ll2)
head(ddf)
_1 _2 _3
1 1 a 4.1234567890
2 2 b 9.0987654321
NOTE: the SparkR::: methods are private for a reason, the docs say 'be careful when you use this'

how to capture the full previous command run within R?

i've looked at history and savehistory and sys.call(-1) but nothing appears to capture the full previously-executed command if that command moves onto multiple lines. there's some r-help discussion on the topic, but i couldn't figure out a direct answer to my question. i just want the entire previously-evaluated command captured in a character string. is there a smart way to do this?
edit: the purpose of this is for an R package convey that's dependent on another package survey and needs a few additional configuration commands run before usage. if it looks like the user ran the survey::svydesign function immediately before the convey::convey_prep function, then there is no need to print the warning. but if the user ran survey::svydesign and then edited the svydesign object prior to running convey::convey_prep then it could cause a mistaken calculation. so if the command prior to convey::convey_prep() did not include the svydesign() function, then i just want to print a warning. otherwise, it's safe to assume that the two functions were used appropriately (one immediately after the other).
i need this to work in both scripts and in interactive mode. thanks
# succeeds
c( 1 , 2 , 3 , 4 , 5 )
hist_tf <- tempfile()
savehistory( hist_tf )
hist_lines <- readLines( hist_tf )
# this output is what i want
hist_lines[ length( hist_lines ) - 2 ]
# fails
c( 1 , 2 , 3 ,
4 , 5 )
hist_tf <- tempfile()
savehistory( hist_tf )
hist_lines <- readLines( hist_tf )
# this output fails,
# because it does not have the `c( 1 , 2 , 3 ,`
hist_lines[ length( hist_lines ) - 2 ]

How to convert code to more readable form in R

I copy code from the terminal to post here. It is in the following form:
> ddf2 = ddf[ddf$stone_ny>'stone',] # this is first command
> ddf2[!duplicated(ddf2$deltnr),] # second command
deltnr us stone_ny stone_mobility
4 1536 63 stone mobile
10 1336 62 stone mobile
The first 2 lines are commands, while the next 3 lines are output. However, this cannot be copied from here back into the R terminal, since the commands start with '> '. How can I convert this to:
ddf2 = ddf[ddf$stone_ny>'stone',] # this is first command
ddf2[!duplicated(ddf2$deltnr),] # second command
# deltnr us stone_ny stone_mobility
#4 1536 63 stone mobile
#10 1336 62 stone mobile
so that it becomes suitable for copying from here?
I tried:
text
[1] "> ddf2 = ddf[ddf$stone_ny>'stone',] # this is first command\n> ddf2[!duplicated(ddf2$deltnr),] # second command\n deltnr us stone_ny stone_mobility \n4 1536 63 stone mobile \n10 1336 62 stone mobile "
text2 = gsub('\n','#',text)
text2 = gsub('#>','\n',text2)
text2 = gsub('#','\n#',text2)
text2
[1] "> ddf2 = ddf[ddf$stone_ny>'stone',] \n# this is first command\n
ddf2[!duplicated(ddf2$deltnr),] \n# second command\n# deltnr us stone_ny stone_mobility \n#4 1536 63 stone mobile \n#10 1336 62 stone mobile "
But it cannot be pasted into the terminal.
I've been waiting for an opportunity to share this function I keep in my .Rprofile file. While it may not answer exactly your question, I feel it is accomplishing something very close to what you are after. So you might get some ideas by looking at its code. And others might find it useful just as it is. The function:
SO <- function(script.file = '~/.active-rstudio-document') {
  # run the code and store the output in a character vector
  tmp <- tempfile()
  capture.output(
    source(script.file, echo = TRUE,
           prompt.echo = "> ",
           continue.echo = "+ "), file = tmp)
  out <- readLines(tmp)
  # identify lines that are comments, code, results
  idx.comments <- grep("^> [#]{2}", out)
  idx.code <- grep("^[>+] ", out)
  idx.blank <- grep("^[[:space:]]*$", out)
  idx.results <- setdiff(seq_along(out),
                         c(idx.comments, idx.code, idx.blank))
  # reformat
  out[idx.comments] <- sub("^> [#]{2} ", "", out[idx.comments])
  out[idx.code] <- sub("^[>+] ", " ", out[idx.code])
  out[idx.results] <- sub("^", " # ", out[idx.results])
  # output
  cat(out, sep = "\n", file = stdout())
}
This SO function is what allows me to quickly format my answers to questions on this very website, StackOverflow. My workflow is as follows:
1) In RStudio, write my answer in an untitled script (that's the top-left quadrant). For example:
## This is super easy, you can do
set.seed(123)
# initialize x
x <- 0
while(x < 0.5) {
print(x)
# update x
x <- runif(1)
}
## And voila.
2) Near the top, click the "Source" button. It will execute the code in the console which is not really what we are after: rather, it will have the side effect of saving the code to the default file '~/.active-rstudio-document'.
3) Run SO() from the console (bottom-left quadrant) which will source the code (again...) from the saved file, capture the output and print it in a SO-friendly format:
This is super easy, you can do
set.seed(123)
# initialize x
x <- 0
while(x < 0.5) {
print(x)
# update x
x <- runif(1)
}
# [1] 0
# [1] 0.2875775
And voila.
4) Copy-paste into stackoverflow and done.
Note: For code that takes a while to run, you can avoid running it twice by saving your script to a file (e.g. 'xyz.R') instead of clicking the "Source" button. Then run SO("xyz.R").
You could try cat with an ifelse condition.
cat(ifelse(substr(s <- strsplit(text, "\n")[[1]], 1, 1) %in% c("_", 0:9, " "),
paste0("# ", s),
gsub("[>] ", "", s)),
sep = "\n")
which results in
ddf2 = ddf[ddf$stone_ny>'stone',] # this is first command
ddf2[!duplicated(ddf2$deltnr),] # second command
# deltnr us stone_ny stone_mobility
# 4 1536 63 stone mobile
# 10 1336 62 stone mobile
The "_" and 0:9 are in there because one of the rules in R is that a name cannot begin with a _ or a digit, so a line starting with one of those (or with a space) is treated as output and commented out. You can adjust it to fit your needs.
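A variation on the same idea, sketched here as an assumption rather than as part of the answer above, is to classify lines by the console prompt instead of by their first character: anything starting with "> " or "+ " is code, everything else is output. The function name unprompt is made up for this example.

```r
# Strip console prompts from code lines and comment out output lines.
# Assumes the default prompts "> " and "+ ".
unprompt <- function(txt) {
  s <- strsplit(txt, "\n", fixed = TRUE)[[1]]
  is_code <- grepl("^[>+] ", s)
  ifelse(is_code, sub("^[>+] ", "", s), paste0("# ", s))
}

text <- "> x <- c(1, 2)\n> x\n[1] 1 2"
res <- unprompt(text)
cat(res, sep = "\n")
# x <- c(1, 2)
# x
# # [1] 1 2
```

This avoids guessing from the first character, but it would misclassify an output line that happens to begin with "> " or "+ ".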

R - Exact String Match - Revisited

I have the below test input in a file called Input
Exploratory objectives :
This is Exp objective 1
This is Exp objective 2
3.3 Exploratory objective(s)
This is Exp objective 1
This is Exp objective 2
From this text file, I'm trying to grep for "Exploratory objective(s)" using the code below. The line number I expect is 7.
However, when I run the command below, I get line number 1. Can anyone please point out what is wrong with my grep here and why it doesn't return 7? Also, how can I fix this?
key_str <-"Exploratory objective(s)"
key_str
key_pat <- paste0("(", key_str, ")", "(?![[:alpha:]])")
line_number<-grep(key_pat,Input,perl=TRUE)
line_number
Expected line_number: 7
Output line_number using above: 1 (Incorrect)
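You have to escape the parentheses. Unescaped, (s) is a capture group rather than literal text, so the pattern matches "Exploratory objectives" on line 1: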
key_str <- "Exploratory objective\\(s\\)"
If the string is dynamically generated or read from a file, use this:
key_str <- gsub("([\\(\\)])", "\\\\\\1", string)
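To see the effect end to end, here is a small sketch that uses a shortened, hypothetical Input vector of three lines in place of the file from the question. It also shows fixed = TRUE, an alternative that treats the whole pattern literally when no lookahead is needed.

```r
# Shortened stand-in for the Input vector from the question
Input <- c("Exploratory objectives :",
           "This is Exp objective 1",
           "3.3 Exploratory objective(s)")
key_str <- "Exploratory objective(s)"

# Unescaped: "(s)" is a group, so this matches "Exploratory objectives"
unescaped_hits <- grep(paste0("(", key_str, ")(?![[:alpha:]])"), Input, perl = TRUE)

# Escape the literal parentheses before building the pattern
escaped <- gsub("([()])", "\\\\\\1", key_str)
key_pat <- paste0("(", escaped, ")(?![[:alpha:]])")
escaped_hits <- grep(key_pat, Input, perl = TRUE)

# Alternative without the lookahead: match the string literally
fixed_hits <- grep(key_str, Input, fixed = TRUE)

print(unescaped_hits)  # 1
print(escaped_hits)    # 3
print(fixed_hits)      # 3
```

Note that fixed = TRUE cannot be combined with the (?![[:alpha:]]) lookahead, so it only fits when a plain substring match is enough.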
