Generate Splunk report with only extracted fields - report

First and foremost, maybe what I am looking for isn’t possible or I am going down the wrong path. Please suggest.
Consider, I’ve raw data which has n number of parameters each separated by ‘&’.
Id=1234&ACC=bc3gds5&X=TESTX&Y=456567&Z=4457656&M=TESTM&N=TESTN&P=5ec3a
Using SPL, I’ve filtered only a few fields(ACC, X, Y) which I’m interested in. Now, I would like to generate the report only with the filtered fields in a tabular format, not the whole raw data.

There may be more than one way to do that, but I like to use rex. The rex command extracts text that matches a regular expression into fields. Once you have the fields you use SPL on them to do whatever you need.
index=foo
| rex "ACC=(?<ACC>[^&]+)&X=(?<X>[^&]+)&Y=(?<Y>[^&]+)"
| table ACC X Y

Related

summary of row of numbers in R

I just hope to learn how to make a simple statistical summary of the random numbers fra row 1 to 5 in R. (as shown in picture).
And then assign these rows to a single variable.
enter image description here
Hope you can help!
When you type something like 3 on a single line and ask R to "run" it, it doesn't store that anywhere -- it just evaluates it, meaning that it tries to make sense out of whatever you've typed (such as 3, or 2+1, or sqrt(9), all of which would return the same value) and then it more or less evaporates. You can think of your lines 1 through 5 as behaving like you've used a handheld scientific calculator; once you type something like 300 / 100 into such a calculator, it just shows you a 3, and then after you have executed another computation, that 3 is more or less permanently gone.
To do something with your data, you need to do one of two things: either store it into your environment somehow, or to "pipe" your data directly into a useful function.
In your question, you used this script:
1
3
2
7
6
summary()
I don't think it's possible to repair this strategy in the way that you're hoping -- and if it is possible, it's not quite the "right" approach. By typing the numbers on individual lines, you've structured them so that they'll evaluate individually and then evaporate. In order to run the summary() function on those numbers, you will need to bind them together inside a single vector somehow, then feed that vector into summary(). The "store it" approach would be
my_vector <- c(1, 3, 7, 2, 6)
summary(my_vector)
The importance isn't actually the parentheses; it's the function c(), which stands for concatenate, and instructs R to treat those 5 numbers as a collective object (i.e. a vector). We then pass that single object into my_vector().
To use the "piping" approach and avoid having to store something in the environment, you can do this instead (requires R 4.1.0+):
c(1, 3, 7, 2, 6) |> summary()
Note again that the use of c() is required, because we need to bind the five numbers together first. If you have an older version of R, you can get a slightly different pipe operator from the magrittr library instead that will work the same way. The point is that this "binding" part of the process is an essential part that can't be skipped.
Now, the crux of your question: presumably, your data doesn't really look like the example you used. Most likely, it's in some separate .csv file or something like that; if not, hopefully it is easy to get it into that format. Assuming this is true, this means that R will actually be able to do the heavy lifting for you in terms of formatting your data.
As a very simple example, let's say I have a plain text file, my_example.txt, whose contents are
1
3
7
2
6
In this case, I can ask R to parse this file for me. Assuming you're using RStudio, the simplest way to do this is to use the File -> Import Dataset part of the GUI. There are various options dealing with things such as headers, separators, and so forth, but I can't say much meaningful about what you'd need to do there without seeing your actual dataset.
When I import that file, I notice that it does two things in my R console:
my_example <- read.table(...)
View(my_example)
The first line stores an object (called a "data frame" in this case) in my environment; the second shows a nice view of how it's rendered. To get the summary I wanted, I just need to extract the vector of numbers I want, which I see from the view is called V1, which I can do with summary(my_example$V1).
This example is probably not helpful for your actual data set, because there are so many variations on the theme here, but the theme itself is important: point R at a file, as it to render an object, then work with that object. That's the approach I'd recommend instead of typing data as lines within an R script, as it's much faster and less error-prone.
Hopefully this will get you pointed in the right direction in terms of getting your data into R and working with it.

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. Haven't really found a built in function that accomplishes what I need. The data is totally unstructured, and changes from row to row with no consistency even within variations of the same product. Does anyone have an idea how to do this?? Or how to write an OpenOffice Calc function to accomplish this? Also open to other better methods of doing this if anyone has any experience or ideas in how to approach this...
ok so I figured out how to do this on my own... I created many different columns, each with a keyword I was looking to extract as a header.
Spreadsheet solution for structured data extraction
Then I used this formula to extract the keywords into the correct row beneath the column header. =IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1) The Search function returns a number value for the position of a search string otherwise it produces an error. I use the iserror function to determine if there is an error condition, and the if statement in such a way that if there is an error, it leaves the cell blank, else it takes the value of the header. Had over 100 columns of specific information to extract, into one final column where I join all the previous cells in the row together for the final list. Worked like a charm. Recommend this approach to anyone who has to do a similar task.

Assigning observation name to a value when retrieving a variable

I want to create a dataframe that contains > 100 observations on ~20 variables. Now, this will be based on a list of html files which are saved to my local folder. I would like to make sure that are matches the correct values per variable to each observation. Assuming that R would use the same order of going through the files for constructing each variable AND not skipping variables in case of errors or there like, this should happen automatically.
But, is there a "save way" to this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the url for desired website to be scrapped
url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
title_data_html <- html_text(html_nodes(webpage,'.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage,'.text-primary'))
description_data_html <- html_text(html_nodes(webpage,'.ratings-bar+ .text-
muted'))
df <- data.frame(title_data_html, rank_data_html,description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too much, so 201 descriptions but there are only 200 movies. Without having a reference to which movie the description belongs, it is very though to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element that has an attribute that is also matching your xpath selector text. Look at both the website as it appears in a browser, as well as the source. Right click, CTL + U (PC), or OPT + CTL + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.

What is a good data structure to save a dictionary?

I am designing a word filter that can filter out bad words (200 words in list) in an article (about 2000 words). And there I have a problem that what data structure I need to save this bad word list, so that the program can use a little time to find the bad word in articles?
-- more details
If the size of bad word list is 2000, the article is 50000, and the program will procedure about 1000 articles one time. Which data structure I should choose, a less then O(n^2) solution in searching?
You can use HashTable because its average complexity is O(1) for insert and search and your data just 2000 words.
http://en.wikipedia.org/wiki/Hash_table
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time
def create_word(min_len, max_len):
return "".join([random.choice(string.ascii_lowercase) for _ in
range(random.randint(min_len, max_len+1))])
def create_article(length):
return [create_word(3, 10) for _ in range(length)]
wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]
print("using list")
print(time.time())
for word in article.split(" "):
if word in bad_words_list:
continue
good_words.append(word)
print(time.time())
good_words = []
bad_words_set = set(bad_words_list)
print("using set")
print(time.time())
for word in article.split(" "):
if word in bad_words_set:
continue
good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, they are put in a list and the "article" is scanned word by word if a word is in this list of bad words. In Python, the in operator tests for membership. For an unordered list, there's no better way than scanning the whole list.
The second approach uses the set datatype that is initialized with the list of bad words. A set has no ordering, but way faster lookup (again using the in keyword) if an element is contained. That seems to be all you need to know.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and 2 hundreths of a second with a set.
I think the best structure, you can use there is set. - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
I takes log_2(n) time to add element to structure (once-time operation) and the same answer every query. So if you will have 200 elements in data structure, your program will need to do only about 8 operations to check, does the word is existing in set.
You need a Bag data structure for this problem. In a Bag data structure elements have no order but is designed for fast lookup of an element in the Bag. It time complexity is O(1). So for N words in an article overall complexity turns out to be O(N). Which is the best you can achieve in this case. Java Set is an example of Bag implementation in Java.

Fortran90 created allocatable arrays but elements incorrect

Trying to create an array from an xyz data file. The data file is arranged so that x,y,z of each atom is on a new line and I want the array to reflect this.
Then to use this array to find find the distance from each atom in the list with all the others.
To do this the array has been copied such that atom1 & atom2 should be identical to the input file.
length is simply the number of atoms in the list.
The write statement: WRITE(20,'(3F12.9)') atom1 actually gives the matrix wanted but when I try to find individual elements they're all wrong!
Any help would be really appreciated!
Thanks guys.
DOUBLE PRECISION, DIMENSION(:,:), ALLOCATABLE ::atom1,atom2'
ALLOCATE(atom1(length,3),atom2(length,3))
READ(10,*) ((atom1(i,j), i=1,length), j=1,3)
atom2=atom1
distn=0
distc=0
DO n=1,length
x1=atom1(n,1)
y1=atom1(n,2) !1st atom
z1=atom1(n,3)
DO m=1,length
x2=atom2(m,1)
y2=atom2(m,2) !2nd atom
z2=atom2(m,3)`
Your READ statement reads all the x coordinates for all atoms from however many records, then all the y coordinates, then all the z coordinates. That's inconsistent with your description of the input file. You have the nesting of the io-implied-do's in the READ statement around the wrong way - it should be ((atom1(i,j),j=1,3),i=1,length).
Similarly, as per the comment, your diagnostic write mislead you - you were outputting all x ordinates, followed by all y ordinates, etc. Array element order of a whole array reference varies the first (leftmost) dimension fastest (colloquially known as column major order).
(There are various pitfalls associated with list directed formatting that mean I wouldn't recommend it for production code (or perhaps for input specifically written with the knowledge of and defence against those pitfalls). One of those pitfalls is that the READ under list directed formatting will pull in as many records as it requires to satisfy the input list. You may have detected the problem earlier if you were using an explicit format that nominated the number of fields per record.)

Resources