Dataframe join on multiple IDs - julia

I'm trying to do a database-style merge on at least two IDs using DataFrames:
merged_df = join(df1, df2, on = (:ID1 :ID2), kind = :outer)
This does not seem to be allowed by join.
I can make this work with a more verbose function, but I want to see if there's a cleaner way.

merged_df = join(df1, df2, on = [:ID1, :ID2], kind = :outer)
DataFrames is awesome, but there's a lot of useful stuff that's not documented... There are a few things I've been meaning to add to the documentation in the joins and split-apply-combine sections.

In response to ARM's answer: it looks like this is close, but the actual syntax is:
merged_df = join(df1, df2, on = [[:ID1, :ID2]], kind = :outer)
I was getting an error when using your version.


Issue with duplicate last names/cannot find object

Recent Excel graduate trying to transition to R, so I am very new to this.
I am trying to create a player-based sports model. However, when printing results from the code I have already written (using dplyr), R is conflating players with the same last name. Essentially it has created two columns (player_last_name.x and player_last_name.y) and merged those players' stats. My first thought was to merge the first- and last-name columns into one; however, I'm not sure how R handles merging categorical data.
Also, R seems unable to find my third variable, season_TOG.
Any help would be appreciated.
Thanks.
disp <- playerdata %>%
group_by(player_first_name, player_last_name)%>%
summarise(season_disposals = sum(disposals))%>%
games <- playerdata %>%
group_by(player_first_name, player_last_name) %>%
summarise(season_game_count = n_distinct(match_round))%>%
TOG <- playerdata %>%
group_by(player_first_name, player_last_name)%>%
summarise(season_TOG = sum(time_on_ground_percentage))%>%
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name")%>%
PropModel_df <- transform(PropModel_df, avg_disp = season_disposals/season_game_count)%>%
PropModel_df <- transform(PropModel_df, avg_TOG = season_TOG/season_game_count)%>%
print(PropModel_df)

Error in eval(substitute(list(...)), `_data`, parent.frame()) :
  object 'season_TOG' not found
There are at least three clear issues here.
Your code is not parseable: you have extra %>% pipes at several points. This may just be an artifact of the question, where you trimmed otherwise unnecessary portions of your code but didn't clean up the pipes; if so, thank you for reducing your code, but please run the reduced version before posting it.
merge accepts exactly two frames to join, so your
PropModel_df <- merge(disp, games, TOG, by="player_first_name", "player_last_name")
will fail for that reason. You'll need to merge the first two (merge(disp, games, by = ...)) and then merge that result with TOG.
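For example (a sketch only; the multi-column by = is explained in point 3 below):

by_cols <- c("player_first_name", "player_last_name")

# merge() joins two frames at a time, so chain the calls...
PropModel_df <- merge(merge(disp, games, by = by_cols), TOG, by = by_cols)

# ...or fold over a list, which scales to any number of frames
PropModel_df <- Reduce(function(x, y) merge(x, y, by = by_cols),
                       list(disp, games, TOG))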
When you join on multiple fields, you need to pass them as a single vector. Your code (adjusted for #2):
PropModel_df <- merge(disp, games, by="player_first_name", "player_last_name")
should be
PropModel_df <- merge(disp, games, by = c("player_first_name", "player_last_name"))
Further detail: when arguments are provided without names, they are matched by position. Because merge's formal arguments are
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
these are the apparent argument names for your call:
merge(x = disp, y = games, by = "player_first_name", by.x = "player_last_name")
which is (I believe) not what you intend.
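Putting the three fixes together, a plausible cleaned-up version of your script looks like this (a sketch only: playerdata and its column names are taken from your post and untested here):

library(dplyr)

by_cols <- c("player_first_name", "player_last_name")

disp <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_disposals = sum(disposals))

games <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_game_count = n_distinct(match_round))

TOG <- playerdata %>%
  group_by(player_first_name, player_last_name) %>%
  summarise(season_TOG = sum(time_on_ground_percentage))

# two-at-a-time merges on the full name key, then per-game averages
PropModel_df <- merge(disp, games, by = by_cols) %>%
  merge(TOG, by = by_cols) %>%
  transform(avg_disp = season_disposals / season_game_count,
            avg_TOG  = season_TOG / season_game_count)

print(PropModel_df)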

Apply functions in different csv files in R with loops

I've already done some research about loops in R and found that most articles focus only on how to change a file or variable name in a loop (e.g. "Change variable name in for loop using R"), but I couldn't find a good solution in other blog articles.
The following is what I want to do with my original data (s1P ... s9P) to get new data (s1Pm ... s9Pm): calculate means grouped by the Datum column in each of s1P ... s9P.
Running the following lines one by one works, but it seems like a loop could make this tidier.
Any suggestions would be appreciated. Have a nice weekend!
s1Pm = aggregate(s1P, list(s1P$Datum), mean)
s2Pm = aggregate(s2P, list(s2P$Datum), mean)
s3Pm = aggregate(s3P, list(s3P$Datum), mean)
s4Pm = aggregate(s4P, list(s4P$Datum), mean)
s5Pm = aggregate(s5P, list(s5P$Datum), mean)
s6Pm = aggregate(s6P, list(s6P$Datum), mean)
s7Pm = aggregate(s7P, list(s7P$Datum), mean)
s8Pm = aggregate(s8P, list(s8P$Datum), mean)
s9Pm = aggregate(s9P, list(s9P$Datum), mean)
We can load all the objects into a list with mget and then apply aggregate to each element by looping over the list:
outLst <- lapply(mget(paste0("s", 1:9, "P")),
                 function(x) aggregate(x, list(x$Datum), mean))
names(outLst) <- paste0(names(outLst), "m")
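Each aggregated frame can then be pulled from the list by name, for example:

outLst$s1Pm              # aggregated means of s1P
head(outLst[["s9Pm"]])   # equivalent [[ ]] indexing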
It is better to keep the output in a list rather than creating multiple objects, but it can be done as well:
list2env(outLst, envir = .GlobalEnv)
though this is not recommended.

Merge is duplicating rows in r

I have two data sets with country names in common.
(screenshots of the two data frames omitted)
As the screenshots show, both data sets have a two-letter country code formatted the same way.
After running this code:
merged<- merge(aggdata, Trade, by="Group.1" , all.y = TRUE, all.x=TRUE)
I get the following result (screenshot omitted).
Rather than having 2 rows with the same country code, I'd like them to be combined.
Thanks!
I strongly suspect that the Group.1 strings in one or other of your data frames has one or more trailing spaces, so they appear identical when viewed, but are not. An easy way of visually checking whether they are the same:
levels(as.factor(Trade$Group.1))
levels(as.factor(aggdata$Group.1))
If the problem does turn out to be trailing spaces, then if you are using R 3.2.0 or higher, try:
Trade$Group.1 <- trimws(Trade$Group.1)
aggdata$Group.1 <- trimws(aggdata$Group.1)
Even better, if you are using read.table etc. to input your data, then use the parameter strip.white=TRUE
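To illustrate with made-up values, a single trailing space is enough to produce the duplicated rows described above:

aggdata <- data.frame(Group.1 = c("AT ", "BE"), CASEID = c(1587.7, 506.5))  # note "AT " has a trailing space
Trade   <- data.frame(Group.1 = c("AT",  "BE"), trade  = c(99.77, 100.11))

merge(aggdata, Trade, by = "Group.1", all = TRUE)
# "AT" and "AT " no longer match, so Austria appears twice:
# once with a CASEID and NA trade, once with a trade and NA CASEID

aggdata$Group.1 <- trimws(aggdata$Group.1)
merge(aggdata, Trade, by = "Group.1", all = TRUE)  # now one row per country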
For future reference, it would be better to post at least a sample of your data rather than a screenshot.
The following works for me:
aggdata <- data.frame(Group.1 = c('AT', 'BE'), CASEID = c(1587.6551, 506.5),
                      ISOCNTRY = c(NA, NA), QC17_2 = c(2.0, 1.972332),
                      D70 = c(1.787440, 1.800395))
Trade <- data.frame(Group.1 = c('AT', 'BE'), trade = c(99.77201, 100.10685))
merged<- merge(aggdata, Trade, by="Group.1" , all.y = TRUE, all.x=TRUE)
I had to transcribe your data by hand from your screenshots, so I only did the first two rows. If you could paste in a full sample of your data, that would be helpful. See here for some guidelines on producing a reproducible example: https://stackoverflow.com/a/5963610/236541

How do I enforce a global column-type using excel_sheet?

I am importing several datasets that need to be bind_rows()-ed together afterwards. For this reason, I would like to set a global column type for every column of the tbl_df that results from running the read_excel() function.
The reason is that differing column types cause errors when I bind_rows() them.
I tried read_excel("myExcel.xlsx", sheet=1, col_types = 'text'), assuming that 'text' would be recycled across all columns, but I got an error.
My solution has been to mutate after import:
res.df <- import.df %>% mutate(col.name = as.character(col.name))
It's not the most elegant, as it requires a second operation after the import, but it has worked for me.
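For what it's worth, here is a sketch of forcing the type at import time instead (file name taken from the question; the behavior depends on your readxl version):

library(readxl)

path <- "myExcel.xlsx"

# recent readxl versions recycle a single col_types value across all columns
df <- read_excel(path, sheet = 1, col_types = "text")

# on older versions, repeat the type once per column instead
n  <- ncol(read_excel(path, sheet = 1))
df <- read_excel(path, sheet = 1, col_types = rep("text", n))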

Specific for loop too slow in R

I have two data frames, one with 2 million records and the other with another 2 million. I used a for loop to look values up from one in the other, but it is too slow. I've created an example to demonstrate what I need to do.
ratings = data.frame(id = c(1,2,2,3,3),
                     rating = c(1,2,3,4,5),
                     timestamp = c("2006-11-07 15:33:57","2007-04-22 09:09:16","2010-07-16 19:47:45","2010-07-16 19:47:45","2006-10-29 04:49:05"))
stats = data.frame(primeid = c(1,1,1,2),
                   period = c(1,2,3,4),
                   user = c(1,1,2,3),
                   id = c(1,2,3,2),
                   timestamp = c("2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00","2011-07-01 00:00:00"))
ratings$timestamp = strptime(ratings$timestamp, "%Y-%m-%d %H:%M:%S")
stats$timestamp = strptime(stats$timestamp, "%Y-%m-%d %H:%M:%S")
for (i in 1:nrow(stats)) {
  cat("Processing ", i, " ...\r\n")
  temp = ratings[ratings$id == stats$id[i], ]
  stats$idrating[i] = max(temp$rating[temp$timestamp < stats$timestamp[i]])
}
Can someone provide me with an alternative for this? I know apply may work, but I have no idea how to translate the for loop.
UPDATE: Thank you for the help. I am providing more information.
The stats table has unique combinations of primeid, period, user, and id.
The ratings table has multiple records per id, with different ratings and timestamps.
What I want to do is the following: for each id found in stats, find all the records in the ratings table (id column) and then take the max rating among those with a timestamp earlier than the stats timestamp.
I love plyr, and most of the tools created by Hadley Wickham, but I find that it can be painfully slow, especially if I'm trying to split on an ID field. When this happens, I turn to sqldf. I usually get a speed up of 20x.
First I need to use lubridate because sqldf chokes on POSIXlt types:
library(lubridate)
ratings$timestamp = ymd_hms(ratings$timestamp)
stats$timestamp = ymd_hms(stats$timestamp)
Merge the dataframes, as Vincent did, and remove those violating the date constraint:
tmp <- merge(stats, ratings, by="id")
tmp <- subset(tmp, timestamp.y < timestamp.x )
Lastly, grab the max rating for each ID:
library(sqldf)
sqldf("SELECT *, MAX(rating) AS rating FROM tmp GROUP BY id")
Depending on the ratio of ids to data points this may work better:
r = split(ratings, ratings$id)
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
  rd = r[[stats$id[i]]]
  if (length(rd))
    max(rd$rating[rd$timestamp < stats$timestamp[i]])
  else NA
})
If your IDs are not contiguous integers (you can check that with all(names(r) == seq_along(r))), you'll have to add as.character() when indexing r[[...]], or use match once to create the mapping; either will cost you some speed.
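That is, with non-contiguous IDs the lookup inside the function becomes:

rd = r[[as.character(stats$id[i])]]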
Obviously, you can do the same without the split, but that's typically slower yet will use less memory:
stats$idrating = sapply(seq.int(nrow(stats)), function(i) {
  rd = ratings[ratings$id == stats$id[i], ]
  if (nrow(rd))
    max(rd$rating[rd$timestamp < stats$timestamp[i]])
  else NA
})
You can also drop the if if you know there will be no mismatches.
I upvoted the answer provided, although I used another approach to get to the same result.
In the merged dataset I first removed the rows violating the date condition and then ran this:
aggregate(rating ~ id + primeid + period + user, data = new_stats, FUN = max)
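Spelled out end to end (assuming POSIXct timestamps, e.g. after the ymd_hms conversion above), that approach is roughly:

new_stats <- merge(stats, ratings, by = "id")
# keep only ratings recorded strictly before the stats timestamp
new_stats <- new_stats[new_stats$timestamp.y < new_stats$timestamp.x, ]
aggregate(rating ~ id + primeid + period + user, data = new_stats, FUN = max)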
From a data structure perspective it seems that you want to merge two tables and then perform a split-group-apply step.
Instead of using a for loop to check which row belongs with which, you can simply merge the two tables (much like a JOIN statement in SQL) and then perform a ddply-type operation. I recommend you download the 'plyr' library.
new_stats = merge(stats, ratings, by = 'id')
library(plyr)
# for each stats key, keep ratings dated before the stats timestamp and take the max
ddply(new_stats,
      c('primeid', 'period', 'user'),
      function(d) max(d$rating[as.Date(d$timestamp.x) > as.Date(d$timestamp.y)]))
If the use of plyr confuses you, please visit this tutorial: http://www.creatapreneur.com/2013/01/split-group-apply/.
