I have several tables in Hive that I need to join together. Since I need to do some work on each of them first (normalize the key, remove outliers, and so on), the chaining process turns into a big mess as I add more and more tables.
It is easy to lose track of where you are, and the query quickly gets out of control.
However, I have a pretty clear idea of what the final table should look like, and each column is fairly independent of the other tables.
For example:
table_class1
name id score
Alex 1 90
Chad 3 50
...
table_class2
name id score
Alexandar 1 50
Benjamin 2 100
...
In the end, I want something that looks like:
name id class1 class2 ...
alex 1 90 50
ben 2 100 NA
chad 3 50 NA
I know this could be done with a left outer join, but I am having a hard time creating a separate table for each source after normalization and then left outer joining each of them against the union of the keys.
I am thinking about dumping the processed data into NoSQL (HBase) in a long key-value format, like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like melt and cast from the R reshape package to pivot that data back into a table.
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) Is this a legitimate approach?
(2) If so, is there any big data tool to pivot a long HBase table into a Hive table?
Honestly, I would love to help more, but I am not clear on what you're trying to achieve (maybe because I've never used R). Please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them, and you can even CREATE VIEW to simplify the query if it gets too large; maybe that's what you're looking for (see the Hive manual). Unless you have a good reason for using HBase, I would stick to Hive to avoid additional complexity. Don't get me wrong, there are plenty of valid reasons for using HBase.
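As a rough sketch of that view-based approach using your example tables (the lower() call and the view names are just placeholders for whatever normalization you actually need):

-- One view per source table so the normalization is written only once.
CREATE VIEW class1_norm AS
SELECT lower(name) AS name, id, score FROM table_class1;

CREATE VIEW class2_norm AS
SELECT lower(name) AS name, id, score FROM table_class2;

-- Union of the keys, then one left outer join per normalized source.
SELECT k.id,
       coalesce(c1.name, c2.name) AS name,
       c1.score AS class1,
       c2.score AS class2
FROM (SELECT DISTINCT id FROM
       (SELECT id FROM class1_norm
        UNION ALL
        SELECT id FROM class2_norm) u) k
LEFT OUTER JOIN class1_norm c1 ON k.id = c1.id
LEFT OUTER JOIN class2_norm c2 ON k.id = c2.id;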
About your second question: you can define and use HBase tables as Hive tables, and you can even CREATE them and INSERT ... SELECT into them from inside Hive. Is that what you're looking for? See the HBase/Hive integration doc.
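For reference, a minimal sketch of an HBase-backed Hive table (the table, column family, and column names here are made up):

CREATE TABLE hbase_scores (row_key STRING, variable STRING, score INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:variable,cf:score")
TBLPROPERTIES ("hbase.table.name" = "scores");

-- Once defined, it can be written to and queried like any other Hive table.
INSERT OVERWRITE TABLE hbase_scores
SELECT concat(name, '_', id), 'class1', score FROM class1_norm;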
One last thing, in case you don't know: you can create custom functions (UDFs) in Hive very easily to help with the tedious normalization process; take a look at this.
Related
I have been searching and searching and have finally resolved to post! I'm still pretty new to R.
I have 2 data frames. The large one is HEAT and the small one is EE.
I have managed to do a left join to get EE matched up with HEAT.
df(HEAT)
Date Time   EVENT   Person   PersonID
DTgroup1    X       Code     Code
DTgroup2    X       Code     Code
DTgroup3    Y       Code     Code
...
Then there is:
df(EE)
Person ID   Type   var 3   var 4   var 5
Here is the merge that I used:
merge <- left_join(HEAT, EE)
I have managed to merge the two data frames, but I lose all the data in df(EE) except for the PersonID that it shares with df(HEAT).
Does anyone have any advice about what I am doing wrong?
Thanks a bunch!
A left join keeps all rows from the left side, in your case HEAT, and includes data from the right-hand side only where there is a match.
An inner join would only return records where there is a valid join on both sides; in your case, one record would be returned.
See What is the difference between “INNER JOIN” and “OUTER JOIN”? for more info.
Obviously, you want a
merge <- full_join(HEAT, EE)
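A tiny illustration of the difference between the three joins, using made-up toy frames rather than your actual data:

library(dplyr)

HEAT <- data.frame(PersonID = c(1, 2, 3), EVENT = c("X", "X", "Y"))
EE   <- data.frame(PersonID = c(2, 4),    Type  = c("A", "B"))

left_join(HEAT, EE, by = "PersonID")   # 3 rows: all of HEAT, NA where EE has no match
inner_join(HEAT, EE, by = "PersonID")  # 1 row: only PersonID 2
full_join(HEAT, EE, by = "PersonID")   # 4 rows: everything from both sides, NAs filled in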
Here is a nice cheat sheet page: http://stat545.com/bit001_dplyr-cheatsheet.html
And here are some super nice graphics: http://r4ds.had.co.nz/relational-data.html
Oracle has a table function called SDO_JOIN that is used to join tables based on spatial relations. An example query to find which neighbourhood a house is in looks something like this:
select
house.address,
neighbourhood.name
from table(sdo_join('HOUSE', 'GEOMETRY', 'NEIGHBOURHOOD', 'GEOMETRY', 'mask=INSIDE')) a
inner join house
on a.rowid1 = house.rowid
inner join neighbourhood
on a.rowid2 = neighbourhood.rowid;
But I get the same result by just doing a regular join with a spatial relation in the on clause:
select
house.address,
neighbourhood.name
from house
inner join neighbourhood
on sdo_inside(house.geometry, neighbourhood.geometry) = 'TRUE';
I prefer the second method because I think it's easier to understand what exactly is happening, but I wasn't able to find any Oracle documentation on whether or not this is the proper way to do a spatial join.
Is there any difference between these two methods? If there is, what? If there isn't, which style is more common?
The difference is in performance.
The first approach (SDO_JOIN) isolates the candidates by matching the RTREE indexes on each table.
The second approach will search the HOUSE table for each geometry of the NEIGHBORHOOD table.
So much depends on how large your tables are, and in particular, how large the NEIGHBORHOOD table is - or more precisely, how many rows of the NEIGHBORHOOD table your query actually uses. If the NEIGHBORHOOD table is small (less than 1000 rows) then the second approach is good (and the size of the HOUSE table does not matter).
On the other hand, if you need to match millions of houses and millions of neighborhoods, then the SDO_JOIN approach will be more efficient.
Note that the SDO_INSIDE approach can be efficient too: just make sure you enable SPATIAL_VECTOR_ACCELERATION (only if you use Oracle 12.1 or 12.2 and have the proper license for Oracle Spatial and Graph) and use parallelism.
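As a rough sketch of what that could look like (check the parameter name and your licensing against your exact Oracle version before relying on this):

-- Enable the spatial operator fast path for the session (12.1/12.2, Spatial and Graph license).
ALTER SESSION SET SPATIAL_VECTOR_ACCELERATION = TRUE;

-- Same SDO_INSIDE join as in the question, with a parallel hint.
SELECT /*+ PARALLEL(4) */
       house.address,
       neighbourhood.name
FROM house
INNER JOIN neighbourhood
        ON sdo_inside(house.geometry, neighbourhood.geometry) = 'TRUE';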
I am working with dplyr and the dbplyr package to interface with my database. I have a table with millions of records. I also have a list of values that correspond to the key in that same table I wish to filter. Normally I would do something like this to filter the table.
library(ROracle)
# connect info omitted
con <- dbConnect(...)
# df with values - my_values
con %>% tbl('MY_TABLE') %>% filter(FIELD %in% my_values$FIELD)
However, that my_values object contains over 500K entries (which is why I don't provide actual data here). This is clearly not efficient when the values will basically be put into an IN statement (it essentially hangs). Normally, if I were writing SQL, I would create a temporary table and write a WHERE EXISTS clause. But in this instance, I don't have write privileges.
How can I make this query more efficient in R?
Not sure whether this will help, but a few suggestions:
Find other criteria for filtering. For example, if my_values$FIELD is consecutive, or the list of values can be inferred from some other columns, you can use the between filter: filter(between(FIELD, a, b)).
Divide and conquer. Split my_values into small batches, make a query for each batch, then combine the results (see the sketch below). This may take a while, but it should be stable and worth the wait.
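A minimal sketch of the batching idea, reusing con and my_values from the question (the batch size of 1000 is arbitrary):

library(dplyr)
library(purrr)

# Split the values into chunks and run one query per chunk.
batches <- split(my_values$FIELD, ceiling(seq_along(my_values$FIELD) / 1000))

result <- map_dfr(batches, function(vals) {
  tbl(con, 'MY_TABLE') %>%
    filter(FIELD %in% !!vals) %>%
    collect()
})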
Looking at your restrictions, I would approach it similarly to how Polor Beer suggested, but I would send one db query per value using purrr::map and then use dplyr::bind_rows() at the end. This way you'll have nicely piped code that adapts if your list changes. Not ideal, but unless you're willing to write a SQL table variable manually, I'm not sure of any other solutions.
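A rough sketch of that map-and-bind pattern, again reusing con and my_values from the question; with 500K values you would likely want to map over batches rather than individual values:

library(dplyr)
library(purrr)

result <- my_values$FIELD %>%
  map(~ tbl(con, 'MY_TABLE') %>%
        filter(FIELD == !!.x) %>%
        collect()) %>%
  bind_rows()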
I have been struggling to get aggregation tables to work. Here is what my fact table looks like:
employment_date_id
dimension1_id
dimension2_id
dimension3_id
dimension4
dimension5
measure1
measure2
measure3
I'm collapsing the employment_date_id from year, quarter, and month to include just the year, but every other column is included. This is what my aggregation table looks like:
yearquartermonth_year
dimension1_id
dimension2_id
dimension3_id
dimension4
dimension5
measure1
measure2
measure3
fact_count
I'm only collapsing the year portion of the date. The remaining fields are left as is. Here is my configuration:
<AggFactCount column="FACT_COUNT"/>
<AggForeignKey factColumn="dimension1_id" aggColumn="dimension1_id"/>
<AggForeignKey factColumn="dimension2_id" aggColumn="dimension2_id"/>
<AggForeignKey factColumn="dimension3_id" aggColumn="dimension3_id"/>
<AggMeasure name="[Measures].[measure1]" column="measure1"/>
<AggMeasure name="[Measures].[measure2]" column="measure2"/>
<AggMeasure name="[Measures].[measure3]" column="measure3"/>
<AggLevel name="[dimension4].[dimension4]" column="dimension4"/>
<AggLevel name="[dimension5].[dimension5]" column="dimension5"/>
<AggLevel name="[EmploymentDate.yearQuarterMonth].[Year]" column="yearquartermonth_year"/>
For the most part, I'm copying the second example of aggregation tables from the documentation. Most of my columns are not collapsed into the table and remain foreign keys to the dimension tables.
My query I'm trying to execute is something like:
select {[Measures].[measure1]} on COLUMNS, {[EmploymentDate.yearQuarterMonth].[Year]} on ROWS from Cube1
The problem is that when I debug it and turn on the logging I see bit keys that look like this:
AggStar:agg_year_employment
bk=0x00000000000000000000000000000000000000000000000111111111101111100000000000000000000000000000000000000000000000000000000000000000
fbk=0x00000000000000000000000000000000000000000000000000000001101111100000000000000000000000000000000000000000000000000000000000000000
mbk=0x00000000000000000000000000000000000000000000000111111110000000000000000000000000000000000000000000000000000000000000000000000000
And my query's bit pattern is:
Foreign columns bit key=0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
Measure bit key= 0x00000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000
And so my aggregation table is skipped. However, these are exactly the columns that are folded into the table; the bit positions are simply off between the query's keys and the aggregation table's. The other thing I find strange is that a portion of the columns is collapsed into the table, yet none of the AggForeignKey columns are included as bits, so if I make a query with those columns, will this aggregation table get skipped? That's counter to what I had planned. My plan was that as long as a query is on year boundaries, this aggregation table would be used.
I don't understand why this isn't working and why it fails to build the bit keys properly. I've tried debugging the Mondrian code, but figuring out which column maps to which position in the bit keys is not obvious. I feel like this shouldn't be this hard, but nothing out there really explains it very well. And this aggregation table setup seems really easy to break.
What am I doing wrong? And why doesn't my solution work?
Update: Here is my mondrian.properties file:
mondrian.jdbcDrivers=com.mysql.jdbc.Driver,oracle.jdbc.driver.OracleDriver
mondrian.rolap.generate.formatted.sql=true
mondrian.rolap.localePropFile=locale.properties
mondrian.rolap.aggregates.Use=true
mondrian.rolap.aggregates.Read=true
mondrian.trace.level=2
mondrian.drillthrough.enable=true
It might be the case that mondrian.rolap.aggregates.Read is set to true while mondrian.rolap.aggregates.Use is set to false.
Please set mondrian.rolap.aggregates.Use=true and check.
Reference: http://mondrian.pentaho.com/documentation/configuration.php
If this is not the case, please attach all the properties related to aggregate tables and the complete cube definition XML.
I have imported data relating to about 70 human subjects from three data tables and have merged them into one data frame in R. Some of the 100 fields are straightforward, such as date.birth, number.surgeries.lifetime, and number.surgeries.12months. Other fields, such as "comments", may contain no value, one sentence, or even several sentences.
Some human subjects have an anomaly, meaning that something is missing or not quite right, and for those people I have to manually investigate what's up. When I open the data frame directly, or even as a table in fix(), it is difficult to read: I have to scroll from left to right and then expand some columns by a ridiculous amount to read just one comment.
It would be much better if I could subset the 5 patients I need to explore and report the data as free-flowing text. I thought I could do that by exporting to a CSV, but it's difficult to see which fields are what. For example: 2001-01-05, 12, 4, had testing done while still living in Los Angeles. That was easy; imagine what happens when there are 100 fields, many of them numbers, many of them dates, and several different comment fields.
A better way would be to output a report such as this:
date.birth:2001-01-05, number.surgeries.lifetime:12, number.surgeries.12months:4, comments:will come talk to us on Monday
Each one of the 5 records would follow that format.
field name 1:field 1 value record 1, field name 2:field 2 value record 1...
skip a line (or something easy to see)
field name 1:field 1 value record 2, field name 2:field 2 value record 2
How can I do it?
How about this?
set.seed(1)
age <- abs(rnorm(10, 40, 20))
patient.key <- 101:110
date.birth <- as.Date("2011-02-02") - age * 365
number.surgeries.12months <- rnbinom(10, 1, .5)
number.surgeries.lifetime <- trunc(number.surgeries.12months * (1 + age/10))
comments <- "comments text here"
data <- data.frame(patient.key,
                   date.birth,
                   number.surgeries.12months,
                   number.surgeries.lifetime,
                   comments)
Subset the data by the patients and fields you are interested in:
selected.patients <- c(105, 109)
selected.fields <- c("patient.key", "number.surgeries.lifetime", "comments")
subdata <- subset(data[ , selected.fields], patient.key %in% selected.patients)
Format the result for printing.
# paste the column name next to each data field
taggeddata <- apply(subdata, 1,
                    function(row) paste(colnames(subdata), row, sep = ":"))
# paste all the data fields into one line of text
textdata <- apply(taggeddata, 2,
                  function(rec) do.call("paste", as.list(rec)))
# write to a file or to screen
writeLines(textdata)
Though I risk repeating myself, I'll make yet another case for the RMySQL package. You will be able to edit your database with your favorite SQL client (I recommend SequelPro), using SELECT statements to filter and then editing the result. For example,
SELECT patientid, patientname, inability FROM patients LIMIT 5
could display only the fields you need. With a nice SQL client you can edit the result directly and store it back to the database. Afterwards, you can just reload the data into R. I know a lot of folks would argue that your dataset is too small for such overhead, but I still prefer the editing features of most SQL editors over those of R. The same applies to joining tables if things get trickier. Plus, you might be interested in writing views ("tables" that are updated on access), which will be treated like tables in R.
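For completeness, reloading the edited data into R might look something like this (the connection details, table name, and patient IDs are placeholders):

library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")
patients <- dbGetQuery(con, "SELECT * FROM patients WHERE patientid IN (105, 109)")
dbDisconnect(con)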
Check out library(reshape). I think if you start by melt()-ing your data, your feet will be on the path to your desired outcome. Let us know if that looks like it will help and how it goes from there.
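A rough sketch of that idea, reusing the subdata frame built in the earlier answer (note that melt() will coerce mixed column types to character, and argument names differ slightly between the reshape and reshape2 packages):

library(reshape)

# Long format: one (patient, field, value) row per cell.
molten <- melt(subdata, id.vars = "patient.key")

# Print one "field:value" line per patient.
for (p in unique(molten$patient.key)) {
  rows <- molten[molten$patient.key == p, ]
  writeLines(paste(paste(rows$variable, rows$value, sep = ":"), collapse = ", "))
}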