Creating a boxplot from two tables in KNIME

I am trying to plot two data columns coming from different tables in KNIME. I want to show them in a single plot so that they are easier to compare. In R this can be achieved with the following:
boxplot(df$Delay, df2$Delay, names = c("Column from Table1", "Column from Table2"), outline = FALSE)
However, in KNIME I cannot think of a way to use data coming from two different tables in a single view. Has anyone faced this issue in KNIME?

If the two columns have the same number of rows, you can use the Column Appender node to get both columns into the same table. If not, you can use the Column Joiner node to create a superset of both columns.

According to the discussion in the comments, a working solution would be the following:
Installing the KNIME Interactive R Statistics Integration (you might already have it installed)
Using the Add Table To R node to add the second table to R
I guess the usual R code can then be used to create the figures, as in the sketch below
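A minimal sketch of what the scripts could look like, assuming both tables expose a Delay column (the exact node wiring, a Table to R node followed by Add Table To R and R View, is my assumption):

# In the "Table to R" node: keep the first table under its own name
df <- knime.in

# In the "Add Table To R" node: the second table also arrives as
# knime.in, so store it separately before it gets overwritten
df2 <- knime.in

# In a final "R View" node, the usual R code works unchanged
boxplot(df$Delay, df2$Delay,
        names = c("Column from Table1", "Column from Table2"),
        outline = FALSE)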

The above answer using the "Add Table to R" node is a very nice option.
You could also do it beforehand in KNIME. If the two tables have the same columns, you can concatenate them using the "Concatenate" node and, if needed, mark which table each row originally came from using the "Constant Value" node.
If the two tables have different columns but share some row identifiers, you can join them into one table using the "Joiner" node. Then pass the concatenated or joined table over to R, as in the sketch below.
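A minimal sketch of the R side for the concatenate route, assuming the marker column is called "source" and the single table arrives in an R View node as knime.in:

# knime.in is the concatenated table; "source" is the hypothetical
# column added by the Constant Value nodes to mark each row's origin
boxplot(Delay ~ source, data = knime.in, outline = FALSE)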

Related

Can sqlite-utils convert function select two columns?

I'm using sqlite-utils to load a CSV into sqlite, which will later be served via Datasette. I have two columns, likes and dislikes. I would like to have a third column, quality-score, computed by adding likes and dislikes together and then dividing likes by the total.
The sqlite-utils convert function should be my best bet, but all I see in the documentation is how to select a single column for conversion.
sqlite-utils convert content.db articles headline 'value.upper()'
From the example given, it looks like convert is followed by the db filename, the table name, then the column you want to operate on. Is it possible to simply add another column name, or is there a flag for selecting more than one column to operate on? I would be really surprised if this wasn't possible; I just can't find any documentation to support it.
This isn't a perfect answer as it doesn't resolve whether sqlite-utils supports multiple column selection for transforms, but this is how I solved this particular problem.
Since my quality_score column would just be basic math, I was able to make use of SQLite's generated columns. I created a file called quality_score.sql that contained:
ALTER TABLE testtable
-- multiplying by 1.0 forces floating-point division; with integer
-- columns, likes / (likes + dislikes) would truncate to 0
ADD COLUMN quality_score GENERATED ALWAYS AS (1.0 * likes / (likes + dislikes));
and then implemented it by:
$ sqlite3 mydb.db < quality_score.sql
You do need to make sure you are using a compatible version of SQLite, as generated columns only work with version 3.31 or later.
Another consideration is to make sure you are performing math on integers or floats and not text.
I also attempted to create the table with the virtual generated column first and then fill it with my data later, but that didn't work in my case: it threw an error saying the number of items provided didn't match the number of columns available. So I just stuck with the ALTER operation after the fact.

Maintain column order when uploading an R data frame to Big Query

I am uploading an R data frame to Big Query with this:
bq_table_upload("project.dataset.table_name", df_name)
It works, but the order of the columns is different in BQ than in R.
Is it possible to get the table to inherit the order of the columns from the data frame? I have looked through the documentation and have not found this option, though it is certainly possible I missed it.
As pointed out in the comments by Oluwafemi Sule, it's only a matter of providing the data frame in the "fields" argument, like below:
bq_table_upload("project.dataset.table_name", values = df_name, fields=df_name)

How can I make two key columns from different parts of the column names in R?

I am going to run a repeated measures ANOVA on my data, but at this point my data is wide: two independent (categorical) variables are spread across a single response variable.
See the image: https://imgur.com/1eTWSIM
I want to create two categorical variables that take their values from different parts of the column names (circled in the screenshot). Subject numbers should be kept as a category. So after using the gather() function, the data should look something like this:
https://imgur.com/SGM2N69
I've seen in a tutorial (that I can't find anymore) that you can create two columns from a single function, using different parts of the colnames (using "_" as a separator), but I can't exactly remember how it was done.
Any help would be appreciated; please ask if anything is not clear in my explanation.
I have solved the problem by using the gather() function first and then separate() to split the combined key into two new columns. So I guess, if you want to make two key columns, you first have to make a single column containing both values and later separate it into two.
At least that is how I did it, along the lines of the sketch below.
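A minimal sketch of that approach with tidyr; the names wide_df and subject, and the "condition_time" naming pattern of the measurement columns, are assumptions about the data:

library(tidyr)

# Collapse all measurement columns into one key column plus one value column
long <- gather(wide_df, key = "condition_time", value = "response", -subject)

# Split the key on "_" into the two categorical variables
long <- separate(long, condition_time,
                 into = c("condition", "time"), sep = "_")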

Load multiple tables from one worksheet and split them by detecting a fully empty row

So generally, what I would like to do is load a sheet containing 4 different tables and split this one big dataset into smaller tables, using str_detect() to find the fully blank row that divides the tables. After that I want to plug that information into startRow, startCol, endRow, and endCol.
I have tried using the function as follows:
str_detect(my_data, "")
but the my_data format is wrong, since str_detect() expects a character vector rather than a whole data frame. I'm not sure what step I should take to prevent this and make it work.
I'm using read_xlsx() to read my dataset.
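A minimal sketch of one way to do the split after reading the whole sheet; the file name "my_file.xlsx" is an assumption, and it sidesteps str_detect() by testing whole rows for NA instead:

library(readxl)

# Read the whole sheet without headers so every table keeps its raw rows
sheet <- read_xlsx("my_file.xlsx", col_names = FALSE)

# A row is a divider when every cell in it is NA
blank <- apply(is.na(sheet), 1, all)

# Number the blocks between blank rows, then split into separate tables
block <- cumsum(blank)
tables <- split(sheet[!blank, ], block[!blank])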

Nested gapply(), dapply(), or spark_apply() function?

I have two separate Hive tables across which I'd like to run a very complex string matching algorithm. I'd like to use SparkR or sparklyr, but I'm trying to determine the feasibility of nested dapply(), gapply(), or spark_apply() statements. I haven't seen a single example of a nested apply.
The problem statement: fuzzy matching on addresses within zip codes. Essentially, I've already done a cartesian join on addresses from both data sets where Zip = Zip. But now I have two columns of addresses that need to be matched, and a third column of Zips that needs to be retained as a "GroupBy" to constrain the superset of potential pairwise comparisons. Thus, the first "key" is the Zip, but then I want to use a second "key" to send a series of comparisons to a single address from column1, matching all possible addresses in column2 (within the same Zip). This seems to require one of the distributed apply functions within SparkR or sparklyr, but none of them looks like it will allow, for example, gapply(..., gapply()) or spark_apply(..., spark_apply()).
Has anyone ever tried this or gotten around a similar problem?
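For what it's worth, a single level of gapply() may already be enough, because the function it calls receives a plain R data.frame per group, inside which ordinary loops or vectorised comparisons can run without a second distributed apply. A sketch in SparkR; the names joined_df, Zip, addr1, and addr2, and the use of adist() as a stand-in for the real matching algorithm, are all assumptions:

library(SparkR)

result <- gapply(
  joined_df,   # hypothetical SparkDataFrame from the cartesian join
  "Zip",       # first "key": one group per zip code
  function(key, x) {
    # x is an ordinary R data.frame holding every address pair for one
    # zip, so the pairwise comparison is plain R, not a nested gapply()
    x$dist <- mapply(adist, x$addr1, x$addr2)
    x
  },
  # the schema must list the output columns in order; these names
  # mirror the hypothetical input columns plus the new dist column
  schema = structType(
    structField("Zip", "string"),
    structField("addr1", "string"),
    structField("addr2", "string"),
    structField("dist", "integer")
  )
)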
