I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries themselves are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have the same belongID. For the research I need to do on the dataset, I have to get rid of the entries of a single sourceID that occur more than 5 times for one belongID. The at most 5 entries that are kept should be the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.
The dataset here is nicely ordered on belongID and Time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across anything similar yet.
If dat is your dataframe:
do.call(rbind,
by(dat, INDICES=list(dat$belongID, dat$sourceID),
FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
Say your data is in df. The ordered (by uniqID) output is obtained after this:
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for (i in bIDs)
{
    if (all(is.na(tab[bIDs == i, ]))) next
    ids <- na.omit(sIDs[tab[i, sIDs] > 5])
    for (j in ids)
    {
        cond <- df$belongID == i & df$sourceID == j
        old <- df[cond, ]
        id5 <- order(old$Time, decreasing = TRUE)[1:5]
        new <- old[id5, ]
        df <- df[!cond, ]
        df <- rbind(df, new)
    }
}
df[order(df$uniqID), ]
A solution in two lines using the plyr package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
The dataset on which this method is going to be used has 170,000+ entries and almost 30 columns.
Benchmarking each of the three provided solutions (by danas.zuokas, mplourde and Andrie) on my dataset gave the following outcomes:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Andrie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
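For reference, a minimal sketch of how such timings can be collected with system.time(); 'dat' is assumed to be the full dataset, and the call shown is mplourde's plyr solution from above:
library(plyr)
timing <- system.time(
    res <- ddply(dat, .(belongID, sourceID),
                 function(x) tail(x[order(x$Time), ], 5))
)
timing   # user, system and elapsed seconds, as in the tables above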
This should be faster, using data.table:
DT = as.data.table(dat)
DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]
Aside: it is worth counting the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?
I really need to speed some R code up. I have a large dataset from a particular sport. Each row in the data frame represents some type of action in the game. For each game (game_id) we have two teams (team_id) that take part in the game. time_ref orders the actions chronologically within each game. action_id is the type of action. player_off is linked to action_id=3: action_id=3 represents a player getting a card, and player_off is set to TRUE/FALSE depending on whether the player was sent off when they got that card. Example data.frame:
> df
game_id team_id action_id player_off time_ref
100 10 1 NA 1000
100 10 1 NA 1001
100 10 1 NA 1002
100 11 1 NA 1003
100 11 2 NA 1004
100 11 1 NA 1005
100 10 3 1 1006
100 11 1 NA 1007
100 10 1 NA 1008
100 10 1 NA 1009
101 12 3 0 1000
101 12 1 NA 1001
101 12 1 NA 1002
101 13 2 NA 1003
101 13 3 1 1004
101 12 1 NA 1005
101 13 1 NA 1006
101 13 1 NA 1007
101 12 1 NA 1008
101 12 1 NA 1009
What I need is another column in the data frame that gives me TRUE or FALSE on whether both teams had an equal/unequal number of players on the field while each action (row) took place.
So game_id=100 had an action_id=3 & player_off=1 for team_id=10 at time_ref=1006. So we know the teams were equal in the number of players on the field up to that point, but unequal for the rest of the game (time_ref>1006). The same thing occurred in game_id=101.
This is an example of the data frame with the extra column I would like to have for the dataset.
>df
game_id team_id action_id player_off time_ref is_even
100 10 1 NA 1000 1
100 10 1 NA 1001 1
100 10 1 NA 1002 1
100 11 1 NA 1003 1
100 11 2 NA 1004 1
100 11 1 NA 1005 1
100 10 3 1 1006 1
100 11 1 NA 1007 0
100 10 1 NA 1008 0
100 10 1 NA 1009 0
101 12 3 0 1000 1
101 12 1 NA 1001 1
101 12 1 NA 1002 1
101 13 2 NA 1003 1
101 13 3 1 1004 1
101 12 1 NA 1005 0
101 13 1 NA 1006 0
101 13 1 NA 1007 0
101 12 1 NA 1008 0
101 12 1 NA 1009 0
So you can see that in game_id=100 a player was sent off at time_ref=1006, so all rows up to and including that point are marked is_even=1 and subsequent rows are marked uneven, or 0. Similarly for game_id=101 at time_ref=1004.
What is the most efficient way of achieving this extra column? Preferably not using for loops.
For some vector
x = c(0, NA, NA, NA, 1, NA, NA, NA)
write a function to standardize the data (0 or 1 player lost), calculate the cumulative number of players lost, and compare this to zero,
fun0 = function(x) {
x[is.na(x)] = 0
cumsum(x) == 0
}
For several groups, use ave() with a grouping variable
x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
For the data in the question, try
df$is_even = ave(df$player_off, df$game_id, FUN = fun0)
Semantically, it seems likely that the required rule is more complicated than fun0() implements; specifically, if each team loses a player, they are again even, as @SunLisa says. If so, clean the data
df$player_off[is.na(df$player_off)] = 0
and change fun0(), e.g.,
fun1 <- function(x, team) {
is_team_1 <- team == head(team, 1) # is 'team' the first team?
x1 <- x & is_team_1 # lost player & team 1
x2 <- x & !is_team_1 # lost player & team 2
cumsum(x1) == cumsum(x2) # same total number of players?
}
(it doesn't seem like a good idea to coerce the logical return value to an integer). This could be applied by group with
df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
fun1(df$player_off[i], df$team_id[i])
})
or
split(df$is_even, df$game_id) <-
Map(fun1,
split(df$player_off, df$game_id),
split(df$team_id, df$game_id)
)
The implementation of ave() is useful to look at; the important line is
split(x, g) <- lapply(split(x, g), FUN)
The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation, using the group indexes to update the original vector x.
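A tiny illustration of split<-() on made-up data (not from the question):
x <- c(10, 20, 30, 40)
g <- c(1, 2, 1, 2)
split(x, g) <- lapply(split(x, g), cumsum)  # cumulative sum within each group
x
# [1] 10 20 40 60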
Comments
The original question asked for 'no for loops', but actually lapply() (in ave()) and Map() are exactly that; ave() is relatively efficient because of the split-apply-combine strategy it adopts, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, then update the data.frame for each game. The subsetting would have duplicated subsets of the entire data set, and the update in particular would have copied at least the entire result column on each assignment; this copying would have slowed the execution down a lot. It's also possible that the OP was struggling with fun0(); it would help to clarify the question, especially the title, to identify that as the problem.
There are faster ways, especially using the data.table package, but the principle is the same -- identify a function that operates on a vector the way you'd like, and apply it by group.
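For example, a hedged data.table sketch of the same grouped computation (assuming the question's df and the fun0() defined above):
library(data.table)
dt <- as.data.table(df)
dt[, is_even := as.integer(fun0(player_off)), by = game_id]  # apply fun0() within each game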
An alternative, fully-vectorized, solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to be the number of players leaving the game at a particular timepoint, without NAs
x[is.na(x)] = 0
For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group
cs = cumsum(x)
Correct this for the group that the cumulative sum applies to
in_game = cs - (grp - 1)
and set this to 'TRUE' when 0 players have left the game
is_even = (in_game == 0)
This relies on grp indexing from 1 to the number of groups; for the data here one might use grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
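Putting those steps together, a sketch of the fully vectorized version for the question's df (assuming rows are ordered by game and time_ref, as in the example; note the (grp - 1) offset assumes each earlier game had exactly one player off in total, which holds for the example data):
x <- df$player_off
x[is.na(x)] <- 0                               # standardize: 0 or 1 player lost per row
grp <- match(df$game_id, unique(df$game_id))   # group index 1, 2, ... per game
cs <- cumsum(x)                                # players lost, accumulated across all games
in_game <- cs - (grp - 1)                      # remove the one player lost in each earlier game
df$is_even <- as.integer(in_game == 0)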
Here's a dplyr + tidyr solution to the problem, with the summary of what was done:
1. Manipulate the data by converting all NAs in player_off to 0 for easier summing, and assigning the smaller team_id (assuming there are only 2) to team1 and the other to team2
2. "Tally" the player_offs using spread and fill the invalid combinations in the data with 0 -- for example, in game_id = 100, there's no team_id = 11 for time_ref = 1000
3. Take the cumulative sum of the lagged team1 and team2 vectors (and of course fill NAs with 0)
Code below:
require(dplyr)
require(tidyr)
df %>%
group_by(game_id) %>%
mutate(
player_off = player_off %>% replace(list = is.na(.), values = 0),
team_num = if_else(team_id == min(team_id), "team1", "team2")
) %>%
spread(key = team_num, value = player_off, fill = 0) %>%
arrange(game_id, time_ref) %>%
mutate(
team1_cum = cumsum(lag(team1, default = 0)),
team2_cum = cumsum(lag(team2, default = 0)),
is_even = as.integer(team1_cum == team2_cum)
) %>%
ungroup() %>%
select(-team1, -team2, -team1_cum, -team2_cum)
Output:
# A tibble: 20 x 5
game_id team_id action_id time_ref is_even
<int> <int> <int> <int> <int>
1 100 10 1 1000 1
2 100 10 1 1001 1
3 100 10 1 1002 1
4 100 11 1 1003 1
5 100 11 2 1004 1
6 100 11 1 1005 1
7 100 10 3 1006 1
8 100 11 1 1007 0
9 100 10 1 1008 0
10 100 10 1 1009 0
11 101 12 3 1000 1
12 101 12 1 1001 1
13 101 12 1 1002 1
14 101 13 2 1003 1
15 101 13 3 1004 1
16 101 12 1 1005 0
17 101 13 1 1006 0
18 101 13 1 1007 0
19 101 12 1 1008 0
20 101 12 1 1009 0
Here's my thinking:
data.table is going to work well, especially when you are working with large data sets. It's faster. We just need to group by game, take the cumulative sum of each team's players sent off, and see whether they are equal.
First I have to say:
(problem solved by Martin Morgan, his updated answer no longer has this error)
I don't think @Martin Morgan's answer is right. Let's imagine a certain case:
when team 1 has had one player off, after which team 2 has had another player off, the two teams should be even again, but @Martin Morgan's output would be FALSE.
I'll make an example with this dataset, where player_off of record 19 is modified to 1. This means that in game 101, after team 13 had 1 player off at 1004, team 12 had 1 player off at 1008, which should make the two teams even at 1008 and 1009.
> dt.1
game_id team_id action_id player_off time_ref
1 100 10 1 NA 1000
2 100 10 1 NA 1001
3 100 10 1 NA 1002
4 100 11 1 NA 1003
5 100 11 2 NA 1004
6 100 11 1 NA 1005
7 100 10 3 1 1006
8 100 11 1 NA 1007
9 100 10 1 NA 1008
10 100 10 1 NA 1009
11 101 12 3 0 1000
12 101 12 1 NA 1001
13 101 12 1 NA 1002
14 101 13 2 NA 1003
15 101 13 3 1 1004
16 101 12 1 NA 1005
17 101 13 1 NA 1006
18 101 13 1 NA 1007
19 101 12 1 1 1008
20 101 12 1 NA 1009
But @Martin Morgan's function would produce this output:
> dt.1$is_even = ave(dt.1$player_off, dt.1$game_id, FUN = fun0)
> dt.1
game_id team_id action_id player_off time_ref is_even
1 100 10 1 NA 1000 1
2 100 10 1 NA 1001 1
3 100 10 1 NA 1002 1
4 100 11 1 NA 1003 1
5 100 11 2 NA 1004 1
6 100 11 1 NA 1005 1
7 100 10 3 1 1006 1
8 100 11 1 NA 1007 0
9 100 10 1 NA 1008 0
10 100 10 1 NA 1009 0
11 101 12 3 0 1000 1
12 101 12 1 NA 1001 1
13 101 12 1 NA 1002 1
14 101 13 2 NA 1003 1
15 101 13 3 1 1004 1
16 101 12 1 NA 1005 0
17 101 13 1 NA 1006 0
18 101 13 1 NA 1007 0
19 101 12 1 1 1008 0
20 101 12 1 NA 1009 0
Notice how at rows 19 and 20, is_even = 0, which is not what the OP wants.
My code does not handle NAs, so I am going to convert the NAs to 0 first.
> dt.1<-as.data.table(dt.1)
> dt.1[is.na(dt.1)]<-0
My code produces the correct output: at times 1008 and 1009, where both team 12 and team 13 have had 1 player off, the two teams are even.
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 0 1
2: 100 10 1001 1 0 1
3: 100 10 1002 1 0 1
4: 100 11 1003 1 0 1
5: 100 11 1004 2 0 1
6: 100 11 1005 1 0 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 0 0
9: 100 10 1008 1 0 0
10: 100 10 1009 1 0 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 0 1
13: 101 12 1002 1 0 1
14: 101 13 1003 2 0 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 0 0
17: 101 13 1006 1 0 0
18: 101 13 1007 1 0 0
19: 101 12 1008 1 1 1
20: 101 12 1009 1 0 1
I understand it is a messy-looking chunk of data.table code; let me explain it step by step.
dt[, .(
action_id,
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off,
team_id,
time_ref,
player_off
), by = game_id][order(game_id, time_ref)][, .(team_id,
time_ref,
action_id,
player_off,
even = cumsum(team2_off) == cumsum(team1_off)), by = game_id]
First, we take the data.table dt, group by game_id, and do this calculation:
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off
data.table has some trouble taking two groupings at once (grouping by game_id and team_id), but it handles logical expressions inside each group well. In this way we effectively get team1_off and team2_off by multiplying the logical output of team_id == max/min(team_id) with player_off. When both are 1, the result is 1, which means 1 player was sent off on the selected team.
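A toy illustration of that logical-multiplication trick (made-up values, not from the dataset):
team_id    <- c(10, 11, 10, 11)
player_off <- c(0, 0, 1, 0)
(team_id == min(team_id)) * player_off   # 0 0 1 0 -> the off is attributed to team 10
(team_id == max(team_id)) * player_off   # 0 0 0 0 -> nothing attributed to team 11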
Now we have a data table of:
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
game_id action_id team2_off team1_off team_id time_ref player_off
1: 100 1 0 0 10 1000 0
2: 100 1 0 0 10 1001 0
3: 100 1 0 0 10 1002 0
4: 100 1 0 0 11 1003 0
5: 100 2 0 0 11 1004 0
6: 100 1 0 0 11 1005 0
7: 100 3 0 1 10 1006 1
8: 100 1 0 0 11 1007 0
9: 100 1 0 0 10 1008 0
10: 100 1 0 0 10 1009 0
11: 101 3 0 0 12 1000 0
12: 101 1 0 0 12 1001 0
13: 101 1 0 0 12 1002 0
14: 101 2 0 0 13 1003 0
15: 101 3 1 0 13 1004 1
16: 101 1 0 0 12 1005 0
17: 101 1 0 0 13 1006 0
18: 101 1 0 0 13 1007 0
19: 101 1 0 1 12 1008 1
20: 101 1 0 0 12 1009 0
Now we no longer need to group by two variables (team_id, game_id); we can just take cumulative sums by game_id and check whether cumsum(team1_off) == cumsum(team2_off). We also order by game_id and time_ref so the result comes out in the correct order.
I understand that NAs may have different meanings than 0 in this scenario. If you really care that much, just create a dummy column of player_off.
> dt$dummy<-dt$player_off
> dt$dummy[is.na(dt$dummy)]<-0
> dt<-as.data.table(dt)
> dt[, .(
+ action_id,
+ team2_off = (team_id == max(team_id)) * dummy,
+ team1_off = (team_id == min(team_id)) * dummy,
+ team_id,
+ time_ref,
+ player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+ time_ref,
+ action_id,
+ player_off,
+ even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 NA 1
2: 100 10 1001 1 NA 1
3: 100 10 1002 1 NA 1
4: 100 11 1003 1 NA 1
5: 100 11 1004 2 NA 1
6: 100 11 1005 1 NA 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 NA 0
9: 100 10 1008 1 NA 0
10: 100 10 1009 1 NA 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 NA 1
13: 101 12 1002 1 NA 1
14: 101 13 1003 2 NA 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 NA 0
17: 101 13 1006 1 NA 0
18: 101 13 1007 1 NA 0
19: 101 12 1008 1 NA 0
20: 101 12 1009 1 NA 0
I really think your question is very interesting, and I was determined to solve it with data.table. It took me a few hours and I almost gave up on data.table, thinking that it just can't handle two groupings at a time. I eventually solved it with logical multiplication.
Great fun I had
team1_off = (team_id == min(team_id)) * dummy
team2_off = (team_id == max(team_id)) * dummy
I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is for R to find every possible unique combination of two values (observations) in column "obs" within the same year, and to create a new matrix or data frame whose observations are the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations I expect 11,175 feasible combinations (each year).
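(As a quick arithmetic check, that figure is just the number of unordered pairs:)
choose(150, 2)   # 150 * 149 / 2 = 11175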
I sort of got what I want with basic coding but, as you will see, it is way too long (I have built 66 different new data sets this way, so it does not really make sense) and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result, considering the above sample, of what I would get for the 1st year. NAs appear because I only computed the cases where both values were valid, and only for variables 1 and 3. Also, I used the sum, but it could be any other function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the 2 first lines in the 3rd year, same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
p <- combn(nrow(x), 2)
data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
obs1=x$obs[p[1,]],
obs2=x$obs[p[2,]],
year=x$year[1],
var1=x$var1[p[1,]] + x$var1[p[2,]],
var2=x$var2[p[1,]] + x$var2[p[2,]],
var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine the data for pairs of observations within a year --- x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
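For instance, a hedged variant: to return within-pair means instead of sums for var1, the corresponding line inside the data.frame() call could read
var1 = (x$var1[p[1,]] + x$var1[p[2,]]) / 2,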
I am trying to rank multiple numeric variables (around 700 of them) in my data and am not sure exactly how to do this, as I am still pretty new to using R.
I do not want to overwrite the ranked values in the same variable and hence need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions along with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit and ysales, needs to be populated with the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
I also tried binning the values with cut:
input[paste(names(input)[2:4], "rank", sep = "_")] <-
    lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450], etc., whereas I would like the rank variables to be populated as 1, 2 and so on, up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like to see each record assigned to the numbered group it falls under when the interval values are ranked.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
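A hedged side note, not part of the original answer: in more recent dplyr versions mutate_each() has been superseded, and assuming current dplyr the same binning can be written with across():
library(dplyr)
input[nm1] <- mutate(input[2:4],
                     across(everything(), ~ cut(.x, breaks = 3, labels = FALSE)))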
I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain types of summary statistics (centered around each unique ID) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each ID for each month (split by id and month) from Dataframe B, or if I wanted the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to make combined variables from both dataframes. For instance, how would I get the average price for each id/month (Dataframe B) divided by sqft for each id/month?
The varying frequencies at which the dataframes are measured make combining them not easily doable. The only solution I've found so far is to ddply the first dataframe to extract the average sqft/id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient/less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
left_join(., dfa, by = c("id", "month")) %>%
group_by(id, month) %>%
dplyr::mutate(newcol = mean(price)/sqft) # I added dplyr:: to avoid confusion with plyr::mutate
Of course, you could also check out data.table, which is very efficient.
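A hedged data.table sketch of the same join plus grouped calculation (assuming dfa and dfb as in the question):
library(data.table)
setDT(dfa); setDT(dfb)
dfall <- dfa[dfb, on = c("id", "month")]                  # keeps all rows of dfb, like all.y = TRUE
dfall[, newcol := mean(price) / sqft, by = .(id, month)]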
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
left_join(dfb, by = c("id", "month")) %>%
group_by(id, month) %>%
summarize(
avgPrice = mean(price),
avgSqft = mean(sqft)) %>%
mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
I have an input dataframe like this (the real one is very large, so I want to do this quickly):
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
A B C
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 5 9 13
I have a dataframe with replacement codes like this (it may contain more entries than df1):
df2 <- data.frame(X=c(1:15), Y=c(101:115))
X Y
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105
6 6 106
7 7 107
8 8 108
9 9 109
10 10 110
11 11 111
12 12 112
13 13 113
14 14 114
15 15 115
By matching df2$X with the values in df1$A and df1$B, I want to get a new_df1 that replaces df1$A and df1$B with the corresponding values from df2$Y, i.e. resulting in this new_df1:
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
Could you give me some guidance on how to do this quickly in R, as my dataframe is very large? Many thanks.
As Thilo mentioned, Nico's answer assumes that df2 is ordered by X and that X contains every integer 1, 2, 3, ....
I would prefer to use match() as a more general case:
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
df2 <- data.frame(X=c(1:15), Y=c(101:115))
new_df1 <- df1
new_df1$A <- df2$Y[match(df1$A,df2$X)]
new_df1$B <- df2$Y[match(df1$B,df2$X)]
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
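If many columns needed the same recoding, a hedged generalization is to loop over them (the column names here are assumed for illustration):
cols <- c("A", "B")
new_df1[cols] <- lapply(df1[cols], function(v) df2$Y[match(v, df2$X)])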
It's super easy! You just need to use the proper offsets into the array.
So for instance, to get the Y column of df2 corresponding to the values in the A column of df1 you'll write df2$Y[df1$A]
Hence, your code will be:
df_new <- data.frame("A" = df2$Y[df1$A], "B" = df2$Y[df1$B], "C" = df1$C)
Here is another (one-liner) way of doing it.
> with(c(df2,df1),data.frame(A = Y[match(A,X)],B = Y[match(B,X)],C))
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
However, I am not sure whether it will be faster than the other suggestions.