R Left join using tidyverse/dplyr drops data from df 2

I have been searching and searching and have finally resorted to posting! I'm still pretty new to R.
I have 2 data frames. The large one is HEAT and the small one is EE.
I have managed to do a left join to get EE matched up with HEAT.
df(HEAT)
Date Time   EVENT   Person   PersonID
DTgroup1    X       Code     Code
DTgroup2    X       Code     Code
DTgroup3    Y       Code     Code
...
Then there is:
df(EE)
PersonID   Type   var3   var4   var5
Here is the merge that I used:
merge <- left_join(HEAT, EE)
I have managed to merge the two data frames, but I lose all the data in df(EE) except for the PersonID that it shares with df(HEAT).
Does anyone have any advice about what I am doing wrong?
Thanks a bunch!

A left join will keep all rows from the left side, in your case HEAT, and include data from the right-hand side where there is a match.
An inner join would only return records where there is a valid join on both sides; in your case, one record would be returned.
See What is the difference between “INNER JOIN” and “OUTER JOIN”? for more info.
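For instance, a minimal sketch with made-up data (only the PersonID column name comes from the question):
library(dplyr)

# made-up frames standing in for HEAT and EE
HEAT <- data.frame(PersonID = c(1, 2, 3), Event = c("X", "X", "Y"))
EE   <- data.frame(PersonID = c(1, 4), Type = c("A", "B"))

left_join(HEAT, EE, by = "PersonID")   # 3 rows: all of HEAT, with NA in Type where there is no match
inner_join(HEAT, EE, by = "PersonID")  # 1 row: only PersonID 1 appears in both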

It sounds like what you want is a full join:
merge <- full_join(HEAT, EE)
Here is a nice cheat sheet: http://stat545.com/bit001_dplyr-cheatsheet.html
And here are some really nice graphics: http://r4ds.had.co.nz/relational-data.html
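For comparison, continuing the made-up frames above, a full join keeps every row from both sides:
full_join(HEAT, EE, by = "PersonID")   # 4 rows: PersonIDs 1, 2, 3 from HEAT plus 4 from EE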

Related

R Merge 2 tables/dataframes by partial match

I have two tables/dataframes.
The first table (ID) looks like this:
The second table (Names) looks like this:
I want to match the "IDTag" variable to the first few letters of the "Name" variable. In other programming languages I would do a foreach and run through each of the IDTags for each of the rows of the second table (matching the IDTag to the first n characters of the "Name" variable, where n is the number of characters of the IDTag in question).
In R it seems like there should be a method for doing this, and I have looked at pmatch and a few others, but those either don't appear to make the match at all or, when I try to use them, come up with several NAs in places where I wouldn't have expected them. Sample code using the table data above:
NameMatches <- Names[pmatch(
  ID$IDTag,
  Names$Name,
  duplicates.ok = TRUE
), ]
I have the feeling I am going about this with the wrong theory or concept, so I am looking to see if someone can guide me on the simplest/clearest way to do this accurately.
Editing original question to reply to comments...
The expected output would look something like this (i.e., all of the columns of the Names table plus the Group column from the ID table; multiple matches are expected, since there is a one-to-many relationship between the ID and Names tables):
Thanks,
If you are open to using the sqldf package, then one option would be to just write a join using the logic you gave us:
library(sqldf)
sql <- "SELECT * FROM ID t1 INNER JOIN Names t2
ON t2.Name LIKE t1.IDTag || '%'"
output <- sqldf(sql)
Note: If you want to keep all rows from the ID data frame, regardless of whether or not they match to anything in Names, then use a left join instead.
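For example, a minimal sketch with made-up data (the column names IDTag and Name come from the question; the Group column is left out to keep it short):
library(sqldf)

ID    <- data.frame(IDTag = c("AB", "CD"), stringsAsFactors = FALSE)
Names <- data.frame(Name = c("ABC123", "ABX999", "CDE456", "ZZZ000"), stringsAsFactors = FALSE)

sql <- "SELECT * FROM ID t1 INNER JOIN Names t2
        ON t2.Name LIKE t1.IDTag || '%'"
sqldf(sql)
# 'AB' matches ABC123 and ABX999, 'CD' matches CDE456; ZZZ000 is dropped by the inner join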

HBase keyvalue (NOSQL) to Hive table (SQL)

I have some tables in Hive that I need to join together. Since I need to do some work on each of them (normalize the key, remove outliers, ...), and as I add more and more tables, this chaining process has turned into a big mess.
It is easy to lose track of where you are, and the query is getting out of control.
However, I have a pretty clear idea of how the final table should look, and each column is fairly independent of the other tables.
For example:
table_class1
name       id  score
Alex       1   90
Chad       3   50
...
table_class2
name       id  score
Alexandar  1   50
Benjamin   2   100
...
In the end I really want something that looks like:
name  id  class1  class2  ...
alex  1   90      50
ben   2   100     NA
chad  3   50      NA
I know it could be a left outer join, but I am really having a hard time creating a separate table for each of them after the normalization and then using a left outer join with the union of the keys to join each of them...
I am thinking about using NoSQL (HBase) to dump the processed data into a key-value format, like:
(source, key, variable, value)
(table_class1, (alex, 1), class1, 90)
(table_class1, (chad, 3), class1, 50)
(table_class2, (alex, 1), class2, 50)
(table_class2, (benjamin, 2), class2, 100)
...
In the end, I want to use something like melt and cast from the R reshape package to bring that data back into a table.
This is a big data project, and there will be hundreds of millions of key-value pairs in HBase.
(1) I don't know if this is a legit approach.
(2) If so, is there any big data tool to pivot a long HBase table into a Hive table?
Honestly, I would love to help more, but I am not clear about what you're trying to achieve (maybe because I've never used R); please elaborate and I'll try to improve my answer if necessary.
What do you need HBase for? You can store your processed data in new tables and work with them; you can even CREATE VIEW to simplify the query if it's too large, maybe that's what you're looking for (HIVE manual). Unless you have a good reason for using HBase, I'd stick to HIVE to avoid additional complexity. Don't get me wrong, there are a lot of valid reasons for using HBase.
About your second question: you can define and use HBase tables as HIVE tables, and you can even CREATE them and INSERT INTO them with SELECT, all inside HIVE. Is that what you're looking for? See the HBase/HIVE integration doc.
One last thing, in case you don't know: you can create custom functions in HIVE very easily to help with the tedious normalization process; take a look at this.
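For the melt/cast step the question mentions, here is a minimal R sketch using data.table's dcast (this assumes the long key-value data has already been pulled into R; the column names and values are made up for illustration):
library(data.table)

# long format: one row per (name, id, class) observation, as in the proposed key-value dump
long <- data.table(name  = c("alex", "chad", "alex", "benjamin"),
                   id    = c(1, 3, 1, 2),
                   class = c("class1", "class1", "class2", "class2"),
                   score = c(90, 50, 50, 100))

# pivot: one row per (name, id), one column per class; absent combinations become NA
wide <- dcast(long, name + id ~ class, value.var = "score")
At the scale described (hundreds of millions of key-value pairs) the pivot would of course have to happen in Hive/HBase rather than in R, but the shape of the operation is the same.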

How to efficiently merge these data.tables

I want to create a certain data.table to be able to check for missing data.
Missing data in this case does not mean there will be an NA; the entire row will just be left out. So I need to be able to see, for a certain time-dependent column, which values are missing for which level of another column. Also important is whether there are a lot of missing values together or whether they are spread across the dataset.
So I have this 6,000,000x5 data.table (call it TableA) containing the time-dependent variable, an ID for the level, and the value N which I would like to add to my final table.
I have another table (TableB) which is 207x2. This couples the IDs for the factor to the columns in TableC.
TableC is 1,500,000x207, where each of the 207 columns corresponds to an ID according to TableB and the rows correspond to the time-dependent variable in TableA.
These tables are large, and although I recently acquired extra RAM (totalling 8GB now), my computer keeps swapping out TableC; for each write it has to be paged back in, and it gets swapped out again afterwards. This swapping is what is consuming all my time: about 1.6 seconds per row of TableA, and as TableA has 6,000,000 rows this operation would take more than 100 days running non-stop.
Currently I am using a for-loop to loop over the rows of TableA. Doing no operation, this for-loop runs almost instantly. I made a one-line command looking up the correct column and row number for TableC in TableA and TableB and writing the value from TableA to TableC.
I broke up this one-liner to do a system.time analysis, and each step takes about 0 seconds except writing to the big TableC.
This showed that writing the value to the table was the most time-consuming step; looking at my memory use, I can see a huge chunk appearing whenever a write happens, and it disappears as soon as the write is finished.
library(data.table)

# dummy data: 200 observations with an Id, a TimeCounter and a value N
TableA <- data.table(Id = round(runif(200, 1, 100)),
                     TimeCounter = round(runif(200, 1, 50)),
                     N = round(rnorm(200, 1, 0.5)))
# TableB maps the Id used in TableA (realID) to the matching column number in TableC (Id)
TableB <- data.table(Id = 1:100, realID = 100:1)
# TableC: one row per TimeCounter value, one column per Id in TableB
TSM <- matrix(0, ncol = nrow(TableB), nrow = 50)
TableC <- as.data.table(TSM)
rm(TSM)
for (row in 1:nrow(TableA))
{
  # look up the target column and row in TableC, then write the value from TableA
  TableCcol <- TableB[realID == TableA[row, Id], Id]
  TableCrow <- TableA[row, TimeCounter]
  val <- TableA[row, N]
  TableC[TableCrow, TableCcol] <- val
}
Can anyone advise me on how to make this operation faster, by preventing the memory swap at the last step in the for-loop?
Edit: On the advice of @Arun I took some time to develop some dummy data to test on. It is now included in the code given above.
I did not include wanted results because the dummy data is random and the routine does work. It's the speed that is the problem.
Not entirely sure about the results, but give the dplyr/tidyr packages a shot, as they seem to be more memory-efficient than for loops.
install.packages("dplyr")
install.packages("tidyr")
library(dplyr)
library(tidyr)
TableC <- TableC %>% gather(tableC_id, value, 1:207)
This turns TableC from 1,500,000x207 into a long-format 310,500,000x2 table with 'tableC_id' and 'value' columns.
TableD <- TableA %>%
  left_join(TableB, by = c("LevelID" = "TableB_ID")) %>%
  left_join(TableC, by = c("TableB_value" = "TableC_id"))
These are a couple of packages I've been using of late, and they seem to be very efficient, but the data.table package is designed specifically for managing large tables, so there could be useful functions there. I'd also take a look at sqldf, which allows you to query your data.frames via SQL commands.
Rethinking my problem I came to a solution which works much faster.
The thing is that it does not follow from the question posed above, because I already did a couple of steps to come to the situation described in my question.
Enter TableX, from which I aggregated TableA. TableX contains Ids and TimeCounters and much more; that's why I thought it would be best to create a smaller table containing only the information I needed.
TableX also contains only the relevant times, while in my question I was using a complete time series from the beginning of time (01-01-1970 ;) ). It was way smarter to use the levels in my TimeCounter column to build my TableC.
Also, I was setting values individually, while merging is a lot faster in data.table. So my advice is: whenever you need to set a lot of values, try to find a way to merge instead of copying them in individually.
Solution:
# Create a table with time on the row dimension, using only the TimeCounters found in the original data.
TableC <- data.table(TimeCounter = as.numeric(levels(factor(TableX[, TimeCounter]))))
setkey(TableC, TimeCounter)  # important to set the correct key for the merge
# Loop over all unique Ids (maybe this can be reworked into something *apply()ish)
for (i in levels(factor(TableX[, Id])))
{
  # Count how many samples we have for this Id per TimeCounter
  TableD <- TableX[Id == i, .N, by = TimeCounter]
  setkey(TableD, TimeCounter)  # set key for the merge
  # Merge with the Id on the column dimension
  TableC[TableD, paste("somechars", i, sep = "") := N]
}
There could be steps missing in the TimeCounter, so now I have to check for gaps in TableC and insert rows that were missing for all Ids. Then I can finally check where my data gaps are and how big they are.
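As a side note, the counting and reshaping can also be done without the explicit loop; here is a minimal sketch with data.table's dcast, assuming TableX has at least the Id and TimeCounter columns (the resulting columns are then named after the Id values rather than "somechars" plus the Id):
library(data.table)

# count samples per (TimeCounter, Id), then pivot the Ids into columns; absent combinations become NA
counts  <- TableX[, .N, by = .(TimeCounter, Id)]
TableC2 <- dcast(counts, TimeCounter ~ Id, value.var = "N")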

How do I match single ID's in one data frame to multiples of the IDs in another data frame in R?

For a project at work, I need to generate a table from a list of proposal ids, and a table with more data about some of those proposals (called "awards"). I'm having trouble with the match() function; the data in the "awards" table often has several rows that use the same ID, while the proposals frame has only one copy of each ID. From what I've tried, R ignores multiple rows and only returns the first match, when I need all of them. I haven't been able to find anything in documentation or through searches that helps me, though I have been having difficulty phrasing the right question.
Here's what I have so far:
#R CODE to add awards data on proposals to new data spreadsheet
#read tab delimited files
Awards=read.delim("O:/testing.txt",as.is=T)
Proposals=read.delim("O:/test.txt",as.is=T)
#match IDs from both spreadsheets
Proposals$TotalAwarded <- Awards$TotalAwarded[match(Proposals$IDs, Awards$IDs)]
write.table(Proposals,"O:/tested.txt",quote=F,row.names=F,sep="\t")
This does exactly what I want, except that only the first match is captured.
What's the best way to go forward? How do I make R utilize all of the matches available?
Thanks
See help on merge: ?merge
merge(Proposals, Awards, by = "IDs", all.y = TRUE)
But I cannot believe this hasn't been asked on SO before.
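For instance, a minimal sketch with made-up data (IDs and TotalAwarded come from the question; Title is invented):
Proposals <- data.frame(IDs = c(1, 2, 3), Title = c("A", "B", "C"))
Awards    <- data.frame(IDs = c(1, 1, 3), TotalAwarded = c(100, 250, 50))

merge(Proposals, Awards, by = "IDs", all.y = TRUE)
# one row per award: ID 1 appears twice, once for each matching award row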

How can I create a table with two categories and then sort by one of them in R?

I have a full dataset of observations with over 40 columns of categories, but I only want two, NameID and Error, and I want to sort Error in descending order while still having NameID connected to each observation. Here is some code I've tried:
z <- 15
sort(data.frame(skill$Error, skill$NameID), decreasing = TRUE)[1:z]
data.frame(skill$NameID, sort(skill$Error, decreasing = TRUE)[1:z])
error2 <- skill[order(Error), ]
Hopefully from what I've tried you can understand what I'm trying to do. Again, I want to pull two columns from my skill data set, Error and NameID, but have Error sorted with NameID still attached to each value. I need this all done inside of R. Thanks!
df <- data.frame(Error=skill$Error,NameID=skill$NameID)
df <- df[order(df$Error, decreasing=TRUE), ]
best of luck with whatever you are doing. Hopefully you have someone else to learn some R from.
Assuming that skill is a data frame
Errors <- skill[,c("Error","NameID")]
Errors <- Errors[order(-Errors$Error),]
You don't ever want to use sort on a data frame like this, because it sorts whichever column you give it independently of the rest of the data frame. You want order instead: order keeps the links between the other columns intact.
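For instance, a minimal sketch with made-up data (the z <- 15 and the column names come from the question):
# stand-in for the skill data set
skill <- data.frame(NameID = paste0("P", 1:100), Error = rnorm(100))
z <- 15

Errors <- skill[order(-skill$Error), c("Error", "NameID")]
head(Errors, z)  # the z largest Error values, with NameID still attached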
