How to join two frames in H2O Flow? - R

How can I do a join on two frames in H2O Flow? I want to join the first column of one frame with the first column of a second frame, the second column of one frame with the second column of a second frame, and so on.

You seem to be describing what h2o.rbind() does, e.g.:
i1 = as.h2o(iris)
nrow(i1)  # 150
i2 = h2o.rbind(i1, i1)
nrow(i2)  # 300
If you check over in Flow to see what has happened (run getFrames), you will see "iris" with 150 rows, and "RTMP_sid_abcd_2" (i.e. some random name) with 300 rows. In other words, h2o.rbind() creates a new H2O frame.
If by "join" you were thinking an SQL join, where the two frames have a common index column, but otherwise different columns, then you want h2o.merge(). (If that was what you wanted, but you cannot get h2o.merge() to work, then it would be helpful to see some of your data.)

Related

Why does nest() sometimes combine rows?

I have a tibble with 1755 rows.
The other question on my profile relates to setting this up.
The columns include a variable number of columns with names of the form "C1L", "C1H", "C2L", etc. (always starting with "C"; no other columns start with "C"), and a column named "DI". I would like to nest these columns.
I run this code:
fullfile <- fullfile %>%
  nest(alleles = c(starts_with("C", ignore.case = FALSE), "DI"))
and get an output tibble with 1742 rows.
Looking in more detail, a subset of rows have two sets of data in the "alleles" column.
The affected rows are spread through the dataset, not clustered.
This is data from 16 groups, and each group has a probability related to each row. This gives me an easy measure: summing the probability column before the nest gives 16, afterwards it gives 15.99826, so I'm definitely losing data, not just empty rows.
I'm looking for advice on what I can do to narrow down the cause of this issue.
I can't upload the example as I don't have permission to share the data I'm afraid.
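A minimal sketch of the most likely cause, using made-up toy data: when you tell nest() which columns to put inside the list-column, it implicitly groups by all of the remaining columns, so any rows that are identical across every non-nested column get collapsed into a single row whose list-column holds both sets of values.
library(dplyr)
library(tidyr)
# Toy data: the two "a" rows are identical in every non-nested column (id),
# so nest() collapses them into one row with a two-row tibble in "alleles"
df <- tibble(
  id  = c("a", "a", "b"),
  DI  = c(1, 2, 3),
  C1L = c(10, 20, 30)
)
df %>% nest(alleles = c(starts_with("C", ignore.case = FALSE), DI))
# 2 rows instead of 3
If that is what is happening here, the row count after nesting should equal nrow(distinct(select(fullfile, -starts_with("C", ignore.case = FALSE), -DI))), which would point at duplicated non-nested columns as the cause.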

Why am I getting more rows after applying the merge(x,y,all.x=T) function in R?

I have two data sets: data2 and data3.
The relevant information from data3 should be added to the respective rows of data2, and the common columns in both sets are "Inschrijfjaar" and "Leeftijd".
I am using the code:
data4=merge(x=data2,y=data3, by=c("Inschrijfjaar", "Leeftijd"),all.x=TRUE)
A check gives me:
dim(data2)  # 525380 5
dim(data3)  # 1707 7
dim(data4)  # 5307668 10
So the merge is not done correctly: the number of rows in data4 should also be 525380, because it is a left join, yet I am getting way more rows than in the left data set. What could be the cause?
I also tried the code:
data4=merge(x=data2,y=data3,all.x=TRUE)
Sorry, I cannot comment; this is not a full answer but is meant to be a comment:
There are many different forms of joins.
I find them well explained here.
You are doing a left join, which returns all rows from the left table plus any rows with matching keys from the right table; when a key matches several right-hand rows, the left row is repeated, so you can end up with more rows in your data4.
What you actually want seems to be a left semi-join: "A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y" (from this question, which may help you answer your question).
This behaviour occurs when there are multiple rows of data3 that have the same values in your columns c("Inschrijfjaar", "Leeftijd") (and those values also appear in data2). Where a row of data2 matches several rows of data3, it is repeated once per match, leading to more rows in data4 than in data2.
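Here is a minimal sketch of that effect with invented data (the column names come from the question, the values do not):
# data3 has two rows with the same Inschrijfjaar/Leeftijd key
data2 <- data.frame(Inschrijfjaar = c(2020, 2020, 2021),
                    Leeftijd      = c(30, 40, 30),
                    x             = 1:3)
data3 <- data.frame(Inschrijfjaar = c(2020, 2020),
                    Leeftijd      = c(30, 30),
                    y             = c("a", "b"))
data4 <- merge(x = data2, y = data3, by = c("Inschrijfjaar", "Leeftijd"), all.x = TRUE)
nrow(data4)  # 4, not 3: the 2020/30 row of data2 is repeated once per match
# A quick check for duplicate keys in data3 before merging
anyDuplicated(data3[, c("Inschrijfjaar", "Leeftijd")])  # > 0 means duplicates exist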

Update a data frame within a for loop

The point of this question is that I want to know how to update a data frame inside either a for loop or a function. I know there are other ways to do the specific task I am looking at, but I want to know how to do it the way I am trying to do it.
I have a data frame with 15 columns and 2k observations containing some 98s and 99s. For each row where there is a 98 or 99 in any variable/column, I want to remove the whole row. I created a function to filter by variable name not equal to 98/99 and used lapply. However, instead of continually updating the data frame, it just spits out a series of data frames, each overwriting the previous one, meaning that at the end I only get a data frame with the last column cleaned. How do I get it to update the data frame for each column sequentially?
nafunction = function(variable){
  kuwait5 = kuwait5 %>%
    filter(variable < 90)
}
lapply(kuwait5, nafunction)
The expected result is a new data frame with all rows that have a 98 or 99 removed. What I get is a sequence of data frames, each one having only ONE column in which the rows with NAs (the 98/99 codes) are removed.
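A minimal sketch of one way to keep the update inside a for loop, assuming kuwait5 is as described (the .data[[col]] pronoun is just one way to refer to a column by name): filter() returns a new data frame, so the result has to be reassigned to kuwait5 on every iteration, and that reassignment is what makes each column's filtering accumulate.
library(dplyr)
# Sketch only: drop rows with coded values of 90 or above, one column at a time,
# reassigning kuwait5 each pass so the filtering accumulates
for (col in names(kuwait5)) {
  kuwait5 <- kuwait5 %>% filter(.data[[col]] < 90)
}
The lapply() version produces a list of separately filtered copies because the assignment inside nafunction() never modifies the kuwait5 in the calling environment; a single vectorised call such as filter(kuwait5, if_all(everything(), ~ .x < 90)) is another option.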

Merge with multiple conditions and nearest numerical match

From looking through Stack Overflow and other sources, I believe that changing my data frames to data.tables and using setkey, or similar, will give what I want, but as of yet I have been unable to get a working syntax.
I have two data frames, one containing 26000 rows and the other containing 6410 rows.
The first dataframe contains the following columns:
Customer name, Base_Code, Idenity_Number, Financials
The second dataframe holds the following:
Customer name, Base_Code, Idenity_Number, Financials, Lapse
Both sets of data have identical formatting.
My goal is to join the Lapse column from the second dataframe to the first dataframe. The issue I have is that the numeric value in Financials does not match exactly between the two datasets, and I only want the closest match in DF1 to have the value from the Lapse column in DF2 against it.
There will be examples where there are multiple entries for the same customer ID and Base Code in each dataframe, so I need to merge the two based on Idenity_Number and Base_Code (which is an exact match) and then match against the nearest Financials value for each entry only.
There will never be more entries in DF2 than are held within DF1 for each Customer and Base_Code.
Here is an example of DF1:
Here is an example of DF2:
And finally, here is what I want to end up with:
If we use Jessica Rabbit as the example, we have a match between DF1 and DF2: the financial value of 1240 from DF1 was matched against 1058 in DF2, as that was the closest match.
I could not work out how to get a working solution using data.table, so I re-thought my approach and came up with a solution.
First of all I merged the two datasets and then removed any entries that had a status of "LAP"; this gave me all of the non-lapsed entries:
# Left join, then keep only the rows whose Status does not contain "LAP"
NON_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.x = TRUE)
NON_LAP <- NON_LAP[!grepl("LAP", NON_LAP$Status, ignore.case = FALSE), ]
Next I merged again, this time looking specifically for the lapsed cases. To work out which was the closest match I used the abs function, then I ordered by the lowest difference to get the closest matches in order. Finally I removed duplicates to keep the closest matches, and I also kept the duplicates and stripped out the "LAP" status so that the entries that were not the closest match remained in the data.
Finally I merged them all together, giving me the required outcome.
FIND_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.y = FALSE)
# Absolute difference between the two financial values
FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)
# Order by the Difference column (column 27 in this data set)
FIND_LAP <- FIND_LAP[order(FIND_LAP[, 27]), ]
# First occurrence per policy is the closest match; the rest are kept separately
FOUND_LAP <- FIND_LAP[!duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
NOT_LAP <- FIND_LAP[duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
Hopefully this will help someone else who might be new to R and encounters the same issue.
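For anyone who still wants the data.table route the question originally asked about, here is a rough sketch using a rolling join; the keying below is an assumption based on the column names used in the code above.
library(data.table)
dt1 <- as.data.table(Merged)       # the DF1-side data
dt2 <- as.data.table(LapsesMonth)  # the DF2-side data holding the Lapse status
# Exact match on POLICY_NO and LOB_BASE, nearest match on the numeric column
setkey(dt1, POLICY_NO, LOB_BASE, GWP)
setkey(dt2, POLICY_NO, LOB_BASE, ACTUAL_PRICE)
# For each row of dt1, take the dt2 row whose ACTUAL_PRICE is closest to its GWP
nearest <- dt2[dt1, roll = "nearest"]
Note that roll = "nearest" matches every DF1 row to its closest DF2 row, so a single DF2 row can be reused by several DF1 rows; if each DF2 row may only be used once, the order-then-deduplicate approach above is still needed.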

Conditional operation on two data frames (R)

I'm having some difficulty executing a conditional operation on two data frames. To illustrate the problem: I have three variables, Price, State, and Item, which are stored in a data frame (data1) with those column names. I use ddply to generate a data frame (data2) that includes the columns State and Item, and the average price (or some other function) for that State/Item combination.
What I then want to do is fill in a column in the originating data frame (i.e. a simple prediction vector), where the column's value is the mean value for a given observation's combination of State and Item in data1. (E.g., if an observation in data1 has State = "Arizona" and Item = "pen", I want to retrieve the average price stored in data2 that corresponds to that State/Item combination and insert it into the column.)
Thank you for any help.
The plyr package comes with a great little function called join. You can use this to complete your task.
join(data1, data2, by = c("State", "Item"))
Review ?join to see the different types of joins possible. I'm pretty sure you want a left join.
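Here is a minimal sketch of the whole workflow with toy data (the column names come from the question; the values and the MeanPrice name are made up):
library(plyr)
data1 <- data.frame(
  State = c("Arizona", "Arizona", "Texas"),
  Item  = c("pen", "pen", "pen"),
  Price = c(1.00, 1.50, 2.00)
)
# Average price per State/Item combination
data2 <- ddply(data1, .(State, Item), summarise, MeanPrice = mean(Price))
# Left join attaches the mean back onto every matching row of data1
data1 <- join(data1, data2, by = c("State", "Item"), type = "left")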
