How to make Kimono Labs not skip blank table cells? - web-scraping

I've made a Kimono Labs scraper to grab data from this website which has a few tables with empty cells.
Rather than returning that empty cell, the scraper is returning the next value down the list, so the values in the rows don't correspond to what's in the html table. How do I adjust the settings so that the data stays ordered correctly?
To illustrate, here's what's happening (_ marks an empty cell):
Original        Kimono
  1 2 3           1 2 3
a x y z         a x y z
b n _ o         b n e o
c d e f         c d _ f

Go to the "Modify Results" tab in your API dashboard and create a simple JavaScript routine (IF...THEN) that checks whether the data is "nil" and, if so, places a "-" in the cell that was empty.
The great thing is that you can see the results in real time as you write your function, so you can check whether it's working.

Related

Gnuplot - a way to convert and plot text information?

I am trying to use gnuplot to display the information contained in a file as in the example below:
1 2 3 … 10 11
1 1.0000000e-06 1.0000000e-06 … 0
2 2.5000000e-06 1.5000000e-06 … 0 #dt_grow
3 4.7500000e-06 2.2500000e-06 … 0 #dt_grow
4 8.1250000e-06 3.3750000e-06 … 0 #dt_cfl
5 1.2450703e-05 4.3257029e-06 … 1 #dt_mach, max_iteration_turbulence
6 1.6811013e-05 0.3603104e-06 … 0 #dt_grow
My goal is to be able to represent, somehow, the information listed in column 11 which, as you can see, contains non-numeric characters.
It might be pointless, but before moving ahead it might be helpful to stress that:
1. row 1 has no value in column 11
2. each column 11 value starts with # and is not quoted
3. column 11 contains many other possible entries (e.g. "#dt_piso", "#dt_piso, 2*max_piso reached", "#dt_mach, temperature extrapolation error")
4. when a column 11 value carries additional information (e.g. ", max_iteration_turbulence"), the value in column 10 is non-zero
5. the number of rows is typically of the order of 10^6
My idea was to associate a numeric value with each element of column 11 using functions (e.g. if #dt_grow then 1, if #dt_cfl then 2, etc.) so that I can somehow represent this information.
What I have tried so far has produced nothing but errors (for brevity, each error is listed below the plot command that caused it):
p "file" u 1:11 w l
--> x range is invalid
p "file" u 1:(''.$11 eq "#dt_cfl" ? 1 : 0) w l
--> warning: Skipping data file with no valid points. x range is invalid
p "file" u 1:(column(11) eq "#dt_cfl" ? 1 : 0) w l
--> internal error : STRING operator applied to non-STRING type
p "file" u 1:(strcol(11) eq "#dt_cfl" ? 1 : 0) w l
--> internal error : STRING operator applied to non-STRING type
splot "time.out" u 1:(11 eq "#dt_cfl" ? 1 : 0) w l
--> Need 1 or 3 columns for cartesian data
#Usage of functions does not resolve the issue:
e.g. f(x)= ''.x eq "#dt_cfl" ? 1 : 0
As you can probably tell from the variety of my attempts, I am rather confused about the recommended way to proceed in such cases. I have never had to plot string data and I am not sure what is causing the issue. I have looked through the documentation, but nothing there really helped. I would very much appreciate any input on how to handle string data and associate it with numeric values.
To wrap it up: I want to display the evolution of the information in column 11.
Ideally, I would also like to be able to use the additional information (as explained in point 4 above) based on the value of column 10.
Given what I am asking for, I suspect a Python script might fit my needs better, but I am wondering whether gnuplot offers such possibilities, and I am eager to learn more.
Thanks in advance :)!
P.S.: I am adding a sketch of the result I am trying to obtain, hoping that it helps clarify my goals.
I am open to other solutions anyway, as this is just how I was planning to get around the problem of plotting text data.
The sketch refers to the few rows of data provided above and assumes the following associations:
#dt_grow is 1
#dt_cfl is 2
#dt_mach is 3
and so on for the other possible values (this could be hardcoded, as I would have no more than 10 possible values in column 11)
[Plot sketch]
Maybe something like this?
You can use the 11th column (here: the 5th column) as x2ticlabels (check help xticlabels). Before that, link the x2 axis to the x1 axis (check help link).
You could rotate the x2tic labels if there are too many of them and they overlap: set x2tics rotate by 90.
In principle, you could strip the leading # from each label, but I guess that will get a bit tricky because of the missing value in row 1.
Look at the example below as a starting point.
Script:
### adding text info from columns to some labels
reset session
$Data <<EOD
1 2 3 4 5
1 1.0000000e-06 1.0000000e-06 0
2 2.5000000e-06 1.5000000e-06 0 #dt_grow
3 4.7500000e-06 2.2500000e-06 0 #dt_grow
4 8.1250000e-06 3.3750000e-06 0 #dt_cfl
5 1.2450703e-05 4.3257029e-06 1 #dt_mach, max_iteration_turbulence
6 1.6811013e-05 0.3603104e-06 0 #dt_grow
EOD
set termoption noenhanced
set key top left
set link x2 via x inverse x
set x2tics
plot $Data u 1:2:x2tic(5) skip 1 axes x2y1 w lp pt 7 lc "red" title "column 2", \
'' u 1:3 skip 1 w lp pt 7 lc "web-green" title "column 3"
### end of script
Result:
Addition:
I guess I understand what you want to do but the background is still a bit unclear.
What you are asking for is a conversion or mapping of strings to numbers.
I assume you have a fixed and known set of keywords.
Apparently, for your desired plot the other columns besides 1 and 11 do not play a role.
Your missing value in column 11 in row 1 (excl. header) will create problems, hence add the option skip 2.
In the minimized example below, your column 11 is actually column 2.
The example below will create some random test data for better illustration.
create a string list of your keywords
you can address them via word(), check help word
you can (mis)use sum for a lookup to get the index, check help sum
furthermore, check help strcol, help xticlabels, help skip, help ternary.
Script:
### map strings to numbers
reset session
myKeys = '#dt_grow #dt_cfl #dt_piso #dt_foo #dt_bar #dt_xyz #dt_abc'
myKey(i) = word(myKeys,i)
# create some random test data
set table $Data
set samples 50
plot '+' u ("1 2") every ::0::0 w table
plot '+' u ("1") every ::0::0 w table
plot '+' u ($0+1):(word(myKeys,int(rand(0)*words(myKeys)+1))) w table
unset table
getIdx(s) = (n=0, sum[i=1:words(myKeys)] (s eq myKey(i) ? n=i : 0), n)
set ytics 1
set grid x,y
plot $Data u 1:(y0=getIdx(strcol(2))):ytic(myKey(y0)) skip 2 w lp pt 7 lc "red" notitle
### end of script
Result:
I will not attempt a full answer right now, but here are a few pieces that may be useful by themselves or in conjunction with the answer from @theozh.
Column 11 not always present: The presence or absence of column 11 on any given line can be tested using the "pseudo-column" $#, which evaluates to the total number of columns found on that line. See "help pseudo". This feature was introduced in gnuplot version 5.4.2 (June 2021). For example, to plot the values of column 10 but only if column 11 is also present:
plot FOO using 0:(($# > 10) ? column(10) : NaN)
Separate lines on the graph for each column 11 category: This could be done more cleanly using arrays in the development version of gnuplot, but sticking with features present in version 5.4, I suggest placing all the categories you want to track in one big string and then looping over the string.
Category = "#dt_grow #dt_cfl #dt_mach"
xcoord(x) = ... some function of the value in column 1? ...
ycoord(y) = ... some function of the value in column 10? ...
set datafile missing NaN #ignore any lines that evaluate to NaN
plot for [cat in Category] FOO using (xcoord($1)) : (strcol($#) eq cat ? ycoord($10) : NaN) with steps

How to select rows based on pre-determined gaps (R)?

I recently posted another question asking how I could create a new data.frame based on a column variable. I thought it would fix my problem, but I realize now that I was asking the wrong thing.
What I actually want to know is: how can I select rows at a constant interval and create a new data.frame from them?
Like, if I have:
1 A B C
2 D E F
3 G H I
4 J K L
5 M N O
6 P Q R
I want to select every second row, like this:
2 D E F
4 J K L
6 P Q R
In my actual case, though, I need to select every 40th row and create a new data.frame from those rows.
Sorry for the extra post, but I would be really glad if you could help me. I'm new to R.
It's very easy with the dplyr package.
library(dplyr)
test %>% dplyr::filter(row_number() %% 2 == 0)
Basically, you take the row number and keep only the even ones. For every 40th row, use row_number() %% 40 == 0. To start from a different row and still take every 40th row, change the 0 to another number, since the %% operator returns the remainder of the division (modulo).
You can use R base functions if you don't want to load any extra packages:
Option 1
df[seq_len(nrow(df))%%2==0,]
Option 2
subset(df, seq_len(nrow(df))%%2==0)
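Applied to the 40-row case from the question, the same base R idea might look like the sketch below. The data frame df and the names step and offset are made up for illustration; they are not from the original post.

```r
# Hypothetical data frame with 200 rows, standing in for the asker's data.
df <- data.frame(id = 1:200, value = (1:200)^2)

step   <- 40  # gap between selected rows
offset <- 0   # remainder to keep: 0 selects rows 40, 80, ...; 5 would select rows 5, 45, ...

# Keep the rows whose row index leaves the chosen remainder when divided by step.
every_40th <- df[seq_len(nrow(df)) %% step == offset, ]

# Equivalent for offset 0: build the row indices directly.
same_rows <- df[seq(step, nrow(df), by = step), ]

every_40th$id
```

The logical-index form generalises to any starting row through offset, while the seq() form is perhaps easier to read when the selection starts at the step itself.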

R: replacing values in a df according to a different df

I am new to R and I am having some trouble.
I am trying to replace specific values in a data frame according to values from another data frame.
For example, I have these two data frames:
a <- data.frame(c('a','b','c','d'), c('g','e','p','d'))
1 a g
2 b e
3 c p
4 d d
b <- data.frame(c('a','c'))
1 a
2 c
I want to find out which items in df a are also in df b and assign them the value from the next column, in this case 'g' and 'p'. I tried the match function, but it has a problem if there are many items with the same name that need to be changed. I would really like a way to do this without checking the items one by one in a loop.
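One possible reading of this request can be sketched with %in%, which is vectorised and therefore needs no loop. The column names key and value are hypothetical (the question's data frames have none), and the interpretation of "assign" is an assumption.

```r
# Data frames from the question, with explicit (hypothetical) column names.
a <- data.frame(key = c("a", "b", "c", "d"),
                value = c("g", "e", "p", "d"),
                stringsAsFactors = FALSE)
b <- data.frame(key = c("a", "c"), stringsAsFactors = FALSE)

# TRUE for every row of `a` whose key also appears in `b`;
# repeated keys in `a` would all be flagged.
hits <- a$key %in% b$key
matched_values <- a$value[hits]

# Assumed meaning of "assign": attach the looked-up value to `b`.
# match() returns only the first matching row of `a` for each key in `b`.
b$value <- a$value[match(b$key, a$key)]
```

If keys repeat in a and every occurrence matters, the %in% form is probably the safer starting point, because match() stops at the first hit.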

R - how to reference values in a table() nested in a list?

I'm generating a series of tables with the command table() that I'm storing together in a list, and I want to reference specific values of each table to use in calculation. I can correctly pull out the correct table from my list but I can't seem to find the correct way to reference the values within the table.
Here is my table (I don't think it matters, but I'm referencing this table within my list with code tables$'10'[1]):
[[1]]
label.test
test_pred Disorder Normal
Disorder 7 4
Normal 8 16
I'd like to be able to pull out one of those numbers, for example the 4 which seems like it would be referred to with [1,2]. I've tried nesting more brackets inside like this [1[1,2]], or chaining the square brackets one after another like this [1][1,2], or using more of the $ notation, but none of these have worked so far.
How can I reference the values in the table?
Not sure if you have a nested list which may need more attention. Without seeing your code, I guess you can try
tables$'10'[1][[1]][1,2]
This should be clear enough:
b <- factor(rep(c("A","B","C"), 10))
table(b)
c <- factor(rep(c("A","B","C"), 10))
table(c)
tables <- list(table(b),table(c))
> tables
[[1]]
b
A B C
10 10 10
[[2]]
c
A B C
10 10 10
To access the first, second or third element of the first table:
> tables[[1]][1]
A
10
> tables[[1]][2]
B
10
> tables[[1]][3]
C
10
It is the same for the second table or any other table. You need double square brackets [[ ]] first to access the element of the list, and only then single brackets to index into the table.
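For the two-dimensional table in the question, the same idea can be sketched as follows. The counts are copied from the question; the list layout (a table wrapped in a list stored under the name "10") is an assumption inferred from the printed [[1]].

```r
# Rebuild the confusion table from the question (matrix() fills column by column).
cm <- as.table(matrix(
  c(7, 8, 4, 16), nrow = 2,
  dimnames = list(test_pred  = c("Disorder", "Normal"),
                  label.test = c("Disorder", "Normal"))
))

# Assumed layout: each entry of `tables` is itself a list holding one table.
tables <- list("10" = list(cm))

# [[ ]] twice: first the list entry named "10", then the table inside it.
tbl <- tables[["10"]][[1]]

by_position <- tbl[1, 2]                  # row 1, column 2
by_name     <- tbl["Disorder", "Normal"]  # same cell, addressed by dimnames
```

Indexing by name is a little longer but does not depend on the order of the factor levels.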

Retrieve characters labels from a vector

I have created a numeric vector using tapply(characters,numbers,sum) which looks like this (just a sample below):
a c d or f e ar fu bar
1 5 9 1 1 1 1 1 1
Now I need to retrieve the character labels into another vector. Any ideas?
The original character vector contains multiple instances of the characters, so I'm not sure how much use it will be.
Desired output: a vector with the characters listed:
a c d or f e ar fu bar
I thought that objects such as these could be accessed using some simple command since they are embedded so to speak into the numeric vector, but alas haven't been able to find this function. as.character() just gives me the numbers in character format.
I think you want 'names':
names(tapply(characters,numbers,sum))
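A small self-contained sketch of this, with made-up input vectors (group_labels and amounts are hypothetical names chosen so the sums resemble the start of the sample in the question):

```r
# Hypothetical inputs: a grouping vector with repeated labels and the numbers to sum.
group_labels <- c("a", "c", "c", "d", "a")
amounts      <- c(1, 2, 3, 9, 0)

# tapply(X, INDEX, FUN): sum `amounts` within each distinct label.
totals <- tapply(amounts, group_labels, sum)

# The labels live in the names attribute of the result, one per group.
label_vector <- names(totals)
```

Here label_vector is an ordinary character vector, so it can be used independently of the sums.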
