Select the top N values by group in R

This is in response to a question asked on the r-help mailing list.
There are lots of examples of how to find top values by group using SQL, so I imagine it's easy to convert that knowledge over using the R sqldf package (a sketch follows the example below).
An example: when mtcars is grouped by cyl, here are the top three records for each distinct value of cyl. Note that ties are excluded in this case, but it'd be nice to show some different ways to treat ties.
mpg cyl disp hp drat wt qsec vs am gear carb ranks
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 2.0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 1.0
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 2.0
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.0
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1.0
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 1.5
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1.5
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 3.0
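For reference, here is a minimal sqldf sketch of the classic SQL top-N-per-group idiom (my addition, assuming the sqldf package is installed; note that, unlike the table above, it keeps boundary ties):
library(sqldf)
# lowest 3 mpg per cyl; the correlated subquery counts strictly smaller
# values, so ties at the boundary are all kept
sqldf("SELECT * FROM mtcars a
       WHERE (SELECT COUNT(*) FROM mtcars b
              WHERE b.cyl = a.cyl AND b.mpg < a.mpg) < 3
       ORDER BY a.cyl, a.mpg")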
How to find the top or bottom (maximum or minimum) N records per group?

This seems more straightforward with data.table, as it performs the sort while setting the key.
So, to get the top 3 records per group in ascending order:
require(data.table)
d <- data.table(mtcars, key="cyl")
d[, head(.SD, 3), by=cyl]
does it.
And if you want the descending order:
d[, tail(.SD, 3), by=cyl] # thanks @MatthewDowle
Edit: to handle ties using the mpg column:
d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]
# cyl mpg disp hp drat wt qsec vs am gear carb rank
# 1: 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1 11
# 2: 4 22.8 140.8 95 3.92 3.150 22.90 1 0 4 2 1
# 3: 4 21.5 120.1 97 3.70 2.465 20.01 1 0 3 1 8
# 4: 4 21.4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
# 5: 6 18.1 225.0 105 2.76 3.460 20.22 1 0 3 1 7
# 6: 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4 1
# 7: 6 17.8 167.6 123 3.92 3.440 18.90 1 0 4 4 2
# 8: 8 14.3 360.0 245 3.21 3.570 15.84 0 0 3 4 7
# 9: 8 10.4 472.0 205 2.93 5.250 17.98 0 0 3 4 14
# 10: 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4 5
# 11: 8 13.3 350.0 245 3.73 3.840 15.41 0 0 3 4 3
# and for last N elements, of course it is straightforward
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]

dplyr does the trick
mtcars %>%
arrange(desc(mpg)) %>%
group_by(cyl) %>% slice(1:2)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
6 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Just sort by whatever variable you like (mpg, for example; the question is not clear on this):
mt <- mtcars[order(mtcars$mpg), ]
then use the by function to get the top n rows in each group
d <- by(mt, mt["cyl"], head, n=4)
If you want the result to be a data.frame:
Reduce(rbind, d)
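(An aside, my addition: do.call collapses the list with a single rbind call, which is usually faster than Reduce when there are many groups.)
do.call(rbind, d)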
Edit:
Handling ties is more difficult, but if all ties are desired:
by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])
Another approach is to break ties based on some other information, e.g.,
mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)

There are at least 4 ways to do this; however, each behaves slightly differently.
Here we group by u_id and order/sort by the lift value.
1 dplyr traditional way
library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)
If you switch the order of arrange(desc(lift)) and group_by(u_id), the result is essentially the same. If there are ties in the lift value, slice still makes sure each group has no more than 10 rows; and if a group has only 5 lift values, it gives you only 5 results for that group.
2 dplyr topN way
library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)
With this one, if there are ties in the lift value, say 15 identical lift values for the same u_id, you get all 15 observations.
3 data.table tail way
library(data.table)
final_subset = data.table(final_subset,key = "lift")
top10_final_subset3 = final_subset[, tail(.SD, 10), by = c("u_id")]
It has the same number of rows as the first way; however, some of the rows differ. I guess the two approaches use different tie-breaking rules.
4 data.table .SD way
library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]
This way is the most "uniform": if a group has only 5 observations, it pads the group with NA rows to reach 10, and if there are ties it still slices down to exactly 10 observations (see the check below).
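A quick check of that padding behaviour (my addition; mtcars stands in for final_subset here):
library(data.table)
as.data.table(mtcars)[, .SD[order(mpg, decreasing = TRUE)][1:10], by = cyl]
# cyl == 6 has only 7 cars in mtcars, so that group comes back with 3 NA-padded rows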

If there were a tie at the fourth position for mtcars$mpg then this should return all the ties:
top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]
> top_mpg
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Since there is a tie at the 3-4 position, you can test it by changing the 4 to a 3, and it still returns 4 items. This is logical indexing, and you might need to add a clause that removes the NAs or wrap which() around the logical expression. It's not much more difficult to do this "by" cyl:
Reduce(rbind, by(mtcars, mtcars$cyl,
function(d) d[ d$mpg >= d$mpg[order(d$mpg, decreasing=TRUE)][4] , ]) )
#-------------
mpg cyl disp hp drat wt qsec vs am gear carb
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Incorporating my suggestion to @Ista:
Reduce(rbind, by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )

You can write a function that splits the database by a factor, orders by another desired variable, extract the number of rows you want in each factor (category) and combine these into a database.
top <- function(x, num, c1, c2) {
  sorted <- x[with(x, order(x[, c1], x[, c2], decreasing = T)), ]
  splits <- split(sorted, sorted[, c1])
  df <- lapply(splits, head, num)
  do.call(rbind.data.frame, df)
}
x is the dataframe;
num is the number of rows you would like to see;
c1 is the column number of the variable you would like to split by;
c2 is the column number of the variable you would like to rank by or handle ties.
Using the mtcars data, the function extracts the 3 heaviest cars (mtcars$wt is the 6th column) in each cylinder class (mtcars$cyl is the 2nd column):
top(mtcars,3,2,6)
mpg cyl disp hp drat wt qsec vs am gear carb
4.Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
4.Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4.Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
6.Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6.Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
6.Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
8.Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
8.Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
8.Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
You can also easily get the lightest in a class by changing head in the lapply function to tail OR by removing the decreasing=T argument in the order function which will return it to its default, decreasing=F.
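For example, a sketch of the second option (my addition; the same function body, just without decreasing=T):
top_asc <- function(x, num, c1, c2) {
  sorted <- x[with(x, order(x[, c1], x[, c2])), ]  # ascending, the default
  do.call(rbind.data.frame, lapply(split(sorted, sorted[, c1]), head, num))
}
top_asc(mtcars, 3, 2, 6) # the 3 lightest cars in each cylinder class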

Since dplyr 1.0.0, the slice_max()/slice_min() functions have been available:
mtcars %>%
group_by(cyl) %>%
slice_max(mpg, n = 2, with_ties = FALSE)
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
5 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
6 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
The documentation on the with_ties parameter:
Should ties be kept together? The default, TRUE, may return more rows
than you request. Use FALSE to ignore ties, and return the first n
rows.
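For example (my addition), with the default with_ties = TRUE the 4-cylinder group returns four rows for n = 3, because Honda Civic and Lotus Europa tie at mpg == 30.4:
mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 3)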

I prefer @Ista's solution, because it needs no extra package and is simple.
A modification of the data.table solution also solves my problem, and is more general.
My data.frame is
> str(df)
'data.frame': 579 obs. of 11 variables:
$ trees : num 2000 5000 1000 2000 1000 1000 2000 5000 5000 1000 ...
$ interDepth: num 2 3 5 2 3 4 4 2 3 5 ...
$ minObs : num 6 4 1 4 10 6 10 10 6 6 ...
$ shrinkage : num 0.01 0.001 0.01 0.005 0.01 0.01 0.001 0.005 0.005 0.001 ...
$ G1 : num 0 2 2 2 2 2 8 8 8 8 ...
$ G2 : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ qx : num 0.44 0.43 0.419 0.439 0.43 ...
$ efet : num 43.1 40.6 39.9 39.2 38.6 ...
$ prec : num 0.606 0.593 0.587 0.582 0.574 0.578 0.576 0.579 0.588 0.585 ...
$ sens : num 0.575 0.57 0.573 0.575 0.587 0.574 0.576 0.566 0.542 0.545 ...
$ acu : num 0.631 0.645 0.647 0.648 0.655 0.647 0.619 0.611 0.591 0.594 ...
The data.table solution needs an order() in i to do the job:
> require(data.table)
> dt1 <- data.table(df)
> dt2 = dt1[order(-efet, G1, G2), head(.SD, 3), by = .(G1, G2)]
> dt2
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
5: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
6: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
7: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
8: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
9: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
10: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
11: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
12: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
13: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635
14: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
For some reason, it does not order the output the way requested (probably because of the ordering by groups), so another ordering is applied:
> dt2[order(G1, G2)]
G1 G2 trees interDepth minObs shrinkage qx efet prec sens acu
1: 0 FALSE 2000 2 6 0.010 0.4395953 43.066 0.606 0.575 0.631
2: 0 FALSE 2000 5 1 0.005 0.4294718 37.554 0.583 0.548 0.607
3: 0 FALSE 5000 2 6 0.005 0.4395753 36.981 0.575 0.559 0.616
4: 0 TRUE 2000 6 4 0.010 0.3192862 9.779 0.460 0.433 0.621
5: 2 FALSE 5000 3 4 0.001 0.4296346 40.624 0.593 0.570 0.645
6: 2 FALSE 1000 5 1 0.010 0.4186802 39.915 0.587 0.573 0.647
7: 2 FALSE 2000 2 4 0.005 0.4390503 39.164 0.582 0.575 0.648
8: 2 TRUE 5000 3 4 0.001 0.3812878 21.057 0.510 0.479 0.615
9: 2 TRUE 2000 3 10 0.005 0.3790536 20.127 0.507 0.470 0.608
10: 2 TRUE 1000 5 4 0.001 0.3690911 18.981 0.500 0.475 0.611
11: 8 FALSE 2000 4 10 0.001 0.4511349 38.240 0.576 0.576 0.619
12: 8 FALSE 5000 2 10 0.005 0.4469665 38.064 0.579 0.566 0.611
13: 8 FALSE 5000 3 6 0.005 0.4426952 37.888 0.588 0.542 0.591
14: 8 TRUE 5000 6 10 0.010 0.2865042 16.870 0.497 0.435 0.635

data.table way for picking the lowest 3 mpg per group:
data("mtcars")
setDT(mtcars)[order(mpg), head(.SD, 3), by = "cyl"]
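And the highest 3 mpg per group, just negating the sort column (my addition, a sketch):
setDT(mtcars)[order(-mpg), head(.SD, 3), by = "cyl"]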

# start with the mtcars data frame (included with your installation of R)
mtcars
# pick your 'group by' variable
gbv <- 'cyl'
# IMPORTANT NOTE: you can only include one group by variable here
# ..if you need more, the `order` function below will need
# one per inputted parameter: order( x$cyl , x$am )
# choose whether you want to find the minimum or maximum
find.maximum <- FALSE
# work on a copy of the data frame
x <- mtcars
# order it based on the 'group by' variable
x <- x[ order( x[ , gbv ] , decreasing = find.maximum ) , ]
# figure out the ranks of each miles-per-gallon, within cyl columns
if ( find.maximum ){
  # note the negative sign (which changes the order of mpg)
  # *and* the `rev` function, which flips the order of the `tapply` result
  x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank ) ) )
} else {
  x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank ) )
}
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# done!
# but note only *two* values where cyl == 4 were kept,
# because there was a tie for third smallest, and the `rank` function gave both '3.5'
x[ x$ranks == 3.5 , ]
# ..if you instead wanted to keep all ties, you could change the
# tie-breaking behavior of the `rank` function.
# using the `min` *includes* all ties. using `max` would *exclude* all ties
if ( find.maximum ){
  # note the negative sign (which changes the order of mpg)
  # *and* the `rev` function, which flips the order of the `tapply` result
  x$ranks <- unlist( rev( tapply( -x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) ) )
} else {
  x$ranks <- unlist( tapply( x$mpg , x[ , gbv ] , rank , ties.method = 'min' ) )
}
# and there are even more options..
# see ?rank for more methods
# now just subset it based on the rank column
result <- x[ x$ranks <= 3 , ]
# look at your results
result
# and notice *both* cyl == 4 and ranks == 3 were included in your results
# because of the tie-breaking behavior chosen.

Related

apply function to dataframe's chosen column whilst grouping by another chosen column

I would like to apply the below function (cut.at.n.tile) to a data frame (some_data_frame) whilst grouping by a chosen column (e.g. SomeGroupingColumn) and choosing the target column (e.g. ChosenColumn). I tried using sapply() without success - see code below. Any input very much appreciated. Apologies for this not being fully replicable/self contained ...
cut.at.n.tile <- function(X, n = 7) {
  cut(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE),
      labels = 1:n, include.lowest = TRUE)
}
some_data_frame$SeasonTypeNumber = sapply(split(some_data_frame['ChosenColumn'], SomeGroupingColumn), cut.at.n.tile)
There are a few problems here.
some_data_frame['ChosenColumn'] always returns a single-column data.frame, not a vector which your function requires. I suggest switching to some_data_frame[['ChosenColumn']].
SomeGroupingColumn looks like it should be a column (hence the name) in the data, but it is not referenced within a frame. Perhaps some_data_frame[['SomeGroupingColumn']].
You need to ensure that the breaks= used are unique. For example,
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# Error in cut.default(X, breaks = quantile(X, probs = (0:n)/n, na.rm = TRUE), :
# 'breaks' are not unique
If we debug that function, we see
X
# [1] 360.0 360.0 275.8 275.8 275.8 472.0 460.0 440.0 318.0 304.0 350.0 400.0 351.0 301.0
quantile(X, probs = (0:n)/n, na.rm = TRUE)
# 0% 14.28571% 28.57143% 42.85714% 57.14286% 71.42857% 85.71429% 100%
# 275.8000 275.8000 303.1429 336.2857 354.8571 371.4286 442.8571 472.0000
where 275.8 is repeated. This can happen based on nuances in the raw data, and you can't really predict when it will occur.
Since we'll likely have multiple groups, all of the subvectors' levels= (since cut returns a factor) must be the same length, though admittedly 1 in one group is unlikely to be the same as 1 in another group.
Since in this case we can never be certain which n-tile a number strictly belongs to (is 275.8 in the first or second n-tile?), we can only adjust one of the dupes and accept the imperfection. I suggest a cumsum(duplicated(.)*1e-9): the premise is that it adds an iota to each value that is a dupe, rendering it no longer a dupe. It is possible that adding 1e-9 to one value will make it a dupe of the next ... so we can be a little OCD and repeat this until we have no duplicates.
sapply is unlikely to return a vector, much (almost "certainly") more likely to return a list (if the groups are not perfectly balanced) or a matrix (perfectly balanced). We cannot simply unlist, since the order of the unlisted vectors will likely not be the order of the source data.
We can use `split<-`, or we can use a few other techniques (dplyr and/or data.table)
Updated function, and demonstration with mtcars:
cut.at.n.tile <- function(X, n = 7) {
  brks <- quantile(X, probs = (0:n)/n, na.rm = TRUE)
  while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
  cut(X, breaks = brks, labels = 1:n, include.lowest = TRUE)
}
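A quick sanity check of the dedupe step in isolation (my addition, reusing the failing breaks from above):
brks <- quantile(subset(mtcars, cyl == 8)$disp, probs = (0:7)/7)
while (any(dupes <- duplicated(brks))) brks <- brks + cumsum(1e-9*dupes)
anyDuplicated(brks)
# [1] 0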
base R
ret <- lapply(split(mtcars[['disp']], mtcars[['cyl']]), cut.at.n.tile)
mtcars[["newcol"]] <- NA # create an empty column
split(mtcars[['newcol']], mtcars[['cyl']]) <- ret
mtcars
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 1
# Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
# Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 1
# Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7
# Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 7
# Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 6
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 5
# Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 3
# AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6
Validation:
cut.at.n.tile(subset(mtcars, cyl == 8)$disp)
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
# Levels: 1 2 3 4 5 6 7
subset(mtcars, cyl == 8)$newcol
# [1] 5 5 1 1 1 7 7 6 3 3 4 6 4 2
dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(newcol = cut.at.n.tile(disp)) %>%
ungroup()
# # A tibble: 32 × 12
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 7
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 7
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 4
# # … with 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
data.table
library(data.table)
as.data.table(mtcars)[, newcol := cut.at.n.tile(disp), by = .(cyl)][]
# mpg cyl disp hp drat wt qsec vs am gear carb newcol
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <fctr>
# 1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
# 2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2
# 3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
# 4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 7
# 5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 5
# 6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 6
# 7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 5
# 8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 7
# 9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 7
# 10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 4
# ---
# 23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 3
# 24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4
# 25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 6
# 26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 3
# 27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 5
# 28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 3
# 29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 4
# 30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
# 31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
# 32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 6

Change column names in nested df using map function

I am working on a project where I created a function to edit column names of a given df:
fix_names <- function(a, b, c) {
  if (is.data.frame(a) == TRUE & is.character(b) == TRUE & is.character(c) == TRUE) {
    str_replace_all(colnames(a), pattern = b, replacement = c)
  } else {
    return("invalid inputs")
  }
}
And then I have a column, data, that contains four data frames. I am trying to rename the columns of all the data frames in data using my function above inside of a map function. It's successful in fixing the names, but I cannot figure out how to apply it to the df since the output is a list and the data frames are nested. Here's what I have:
map(.x = df$data, ~fix_names(., "OldName", "NewName"))
Thank you!
Edit: adding example df using mtcars
data(mtcars)
mtcars %>%
group_by(cyl) %>%
nest() -> nestMtcars
map(.x = nestMtcars$data, ~fix_names(., "mpg", "MPG"))
You could transpose the nested list to run the map function, then transpose it back to its original form:
library(stringr)
library(purrr)
fix_names <- function(a, b, c) {
  if (is.data.frame(a) == TRUE & is.character(b) == TRUE & is.character(c) == TRUE) {
    colnames(a) <- str_replace_all(colnames(a), pattern = b, replacement = c)
    a
  } else {
    return("invalid inputs")
  }
}
nestMtcars %>% transpose %>%
map(~{.x$data <- fix_names(.x$data,"mpg","MPG"); .x}) %>%
transpose
$cyl
$cyl[[1]]
[1] 6
$cyl[[2]]
[1] 4
$cyl[[3]]
[1] 8
$data
$data[[1]]
# A tibble: 7 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 160 110 3.9 2.62 16.5 0 1 4 4
2 21 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
$data[[2]]
# A tibble: 11 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 108 93 3.85 2.32 18.6 1 1 4 1
2 24.4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1
7 21.5 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 79 66 4.08 1.94 18.9 1 1 4 1
9 26 120. 91 4.43 2.14 16.7 0 1 5 2
10 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2
11 21.4 121 109 4.11 2.78 18.6 1 1 4 2
$data[[3]]
# A tibble: 14 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 460 215 3 5.42 17.8 0 0 3 4
8 14.7 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 304 150 3.15 3.44 17.3 0 0 3 2
11 13.3 350 245 3.73 3.84 15.4 0 0 3 4
12 19.2 400 175 3.08 3.84 17.0 0 0 3 2
13 15.8 351 264 4.22 3.17 14.5 0 1 5 4
14 15 301 335 3.54 3.57 14.6 0 1 5 8
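An alternative sketch (my addition), assuming the corrected fix_names above: rather than transposing, mutate the list-column in place:
library(dplyr)
nestMtcars %>%
  mutate(data = map(data, ~ fix_names(.x, "mpg", "MPG")))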

tidyverse function to `mutate_sample`?

I'm looking to mutate a column for a random sample, e.g., mutate_sample. Does anyone know whether there is a dplyr/other tidyverse verb for this? Below is a reprex for the behavior I am looking for and an attempt to functionalize (which isn't running because I'm struggling with quasiquotation in if_else).
library(dplyr)
library(tibble)
library(rlang)
# Setup -------------------------------------------------------------------
group_size <- 10
group_n <- 1
my_cars <-
mtcars %>%
rownames_to_column(var = "model") %>%
mutate(group = NA_real_, .after = model)
# Code to create mutated sample -------------------------------------------
group_sample <-
my_cars %>%
filter(is.na(group)) %>%
slice_sample(n = group_size) %>%
pull(model)
my_cars %>%
mutate(group = if_else(model %in% group_sample, group_n, group)) %>%
head()
#> model group mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 NA 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 Datsun 710 1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 Hornet 4 Drive NA 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 Hornet Sportabout NA 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> 6 Valiant NA 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Function to create mutated sample ---------------------------------------
#
# Note: doesn't run because of var in if_else
# mutate_sample <- function(data, var, id, n, value) {
# # browser()
# sample <-
# data %>%
# filter(is.na({{var}})) %>%
# slice_sample(n = n) %>%
# pull({{id}})
#
# data %>%
# mutate(var = if_else({{id}} %in% sample, value, {{var}}))
# }
#
# mutate_sample(my_cars, group, model, group_size, group_n)
Created on 2020-10-21 by the reprex package (v0.3.0)
Looking through SO, I found this related post:
Mutate column as input to sample
I think you could achieve your goal with these two options.
With dplyr:
mtcars %>% mutate(group = sample(`length<-`(rep(group_n, group_size), n())))
or with base R:
mtcars[sample(nrow(mtcars), group_size), "group"] <- group_n
If you need an external function to handle it, you could go with:
mutate_sample <- function(.data, .var, .size, .value) {
  mutate(.data, {{.var}} := sample(`length<-`(rep(.value, .size), n())))
}
mtcars %>% mutate_sample(group, group_size, group_n)
or
mutate_sample_rbase <- function(.data, .var, .size, .value) {
  .data[sample(nrow(.data), size = min(.size, nrow(.data))),
        deparse(substitute(.var))] <- .value
  .data
}
mtcars %>% mutate_sample_rbase(group, group_size, group_n)
Note that if .size is bigger than the number of rows of .data, .var will be a constant equal to .value.
EDIT
If you're interested in keeping the old group, I suggest you another way to handle the problem:
library(dplyr)
# to understand this check out ?sample
resample <- function(x, ...) {
  x[sample.int(length(x), ...)]
}
# this is to avoid any error in case you choose a size bigger than the available rows to select in one group
resample_max <- function(x, size) {
  resample(x, size = min(size, length(x)))
}
mutate_sample <- function(.data, .var, .size, .value) {
  # create the column if it doesn't exist
  if (!deparse(substitute(.var)) %in% names(.data)) .data <- mutate(.data, {{.var}} := NA)
  # replace missing values randomly, keeping existing non-missing values
  mutate(.data, {{.var}} := replace({{.var}}, resample_max(which(is.na({{.var}})), .size), .value))
}
group_size <- 10
mtcars %>%
mutate_sample(group, group_size, 1) %>%
mutate_sample(group, group_size, 2)
#> mpg cyl disp hp drat wt qsec vs am gear carb group
#> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
#> 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2
#> 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 1
#> 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
#> 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 2
#> 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
#> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
#> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 1
#> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 2
#> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 1
#> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
#> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 2
#> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 1
#> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 1
#> 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2
#> 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
#> 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
#> 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
#> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 1
#> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 2
#> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 1
#> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 2
#> 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 2
#> 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
#> 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 2
#> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 1
#> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
#> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 2
#> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
Notice that this solution works even with the grouped_df class (what you get after a dplyr::group_by): from each group made by dplyr::group_by, a sample of .size units will be selected.
mtcars %>%
group_by(am) %>%
mutate_sample(group, 10, 1) %>%
ungroup() %>%
count(group)
#> # A tibble: 2 x 2
#> group n
#> <dbl> <int>
#> 1 1 20 # two groups, each with 10!
#> 2 NA 12

Referencing the previous value of the same variable, without a loop, in R

Thanks for reviewing my question.
I am trying to create a new variable that takes its value from another variable if certain conditions are met; otherwise it takes the value of the previous observation.
I can do it by running a loop like this:
data <- mtcars
data$test <- NA
data$test <- as.numeric(data$test)
a0 <- Sys.time()
for (i in 2:nrow(data)) {
  ifelse(data$carb[[i]] < 4,
         data$test[[i]] <- data[[i-1, 'test']],
         data$test[[i]] <- data[[i, 'mpg']]
  )
  a1 <- Sys.time()
  per_left <- i/nrow(data)
  print(paste("Time left is", round(as.numeric(as.difftime(a1-a0, units = "mins"))/per_left, 2), "mins"))
}
However, my data set has more than 8 million observations. I feel this is not the optimal way to save time.
For the lag function:
Somehow it seems that if I use lag, it uses the originally recorded value of the previous row, not the updated one.
For example.
df1 <- data.frame(ID = c(1, 1, 1, 1, 4, 5),
                  condition = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE),
                  var1 = c('a', 'b', 'c', 'd', 'f', 'e'))
df2 <- df1 %>%
  mutate(
    new_2 = '0',
    new_2 = case_when(
      ID == lag(ID) & condition == TRUE ~ lag(new_2),
      TRUE ~ var1
    ))
> df2
ID condition var1 new_2
1 1 FALSE a a
2 1 TRUE b 0
3 1 TRUE c 0
4 1 TRUE d 0
5 4 TRUE f f
6 5 FALSE e e
It should be
ID condition var1 new_2
1 1 FALSE a a
2 1 TRUE b a
3 1 TRUE c a
4 1 TRUE d a
5 4 TRUE f f
6 5 FALSE e e
When I run the above code, row 2 should take the previous value, "a", not the default value of 0. If I go with the for loop, it does pick up the "a".
Is there are function that does it? Or how should I update my function to make it faster?
Please advise.
Thank you!
We can use lag, and ifelse is already vectorized, so either ifelse or case_when can be used. case_when is more generalizable when there are multiple conditions.
library(dplyr)
out <- data %>%
mutate(new = case_when(carb < 4 ~ lag(test), TRUE ~ mpg))
To make this faster, another option is shift from data.table
library(data.table)
setDT(data)[, new := fifelse(carb < 4, shift(test), mpg)]
For the second dataset, perhaps
library(dplyr)
df1 %>%
mutate(new_2 = replace(var1, condition, lag(var1[condition])))
-output
# ID condition var1 new_2
#1 1 FALSE a a
#2 1 TRUE b <NA>
#3 1 TRUE c b
#4 1 TRUE d c
#5 4 TRUE f d
#6 5 FALSE e e
Or it could be
df1 %>%
group_by(ID) %>%
mutate(new_2 = case_when(condition ~ lag(var1), TRUE ~ '0'))
# A tibble: 6 x 4
# Groups: ID [3]
# ID condition var1 new_2
# <dbl> <lgl> <chr> <chr>
#1 1 FALSE a 0
#2 1 TRUE b a
#3 1 TRUE c b
#4 1 TRUE d c
#5 4 TRUE f <NA>
or using data.table
setDT(df1)[condition, new_2 := shift(var1)]
Update
Based on the updated expected output
df1 %>%
group_by(ID) %>%
mutate(new_2 = lag(var1)) %>%
group_by(grp = rleid(condition), .add = TRUE) %>%
mutate(new_2 = coalesce(first(new_2), var1)) %>%
ungroup %>%
dplyr::select(-grp)
# A tibble: 6 x 4
# ID condition var1 new_2
# <dbl> <lgl> <chr> <chr>
#1 1 FALSE a a
#2 1 TRUE b a
#3 1 TRUE c a
#4 1 TRUE d a
#5 4 TRUE f f
#6 5 FALSE e e
You can do this in base R by getting the positions where the condition (data$carb < 4) holds and subtracting 1 from those positions to get the indexes to copy from.
data <- mtcars
data$test <- mtcars$mpg
inds <- which(data$carb < 4)
data$test[inds] <- data$test[inds - 1]
Many of the R functions are vectorised so you would not need an explicit for loop for them.
One option in base R is to replace the ifelse with if() ... else. However, a much faster solution using base R is to use a combination of ifelse and Reduce. This is 3.46 / .044 ~ 78 times faster than the OP's solution. The solution is:
v2 <- mtcars
v2$test <- ifelse(v2$carb < 4, NA_real_, v2$mpg)
v2$test <- Reduce(
function(xprev, xnew)
if(is.na(xnew)) xprev else xnew,
v2$test, accumulate = TRUE, init = v2$mpg[1])[-1]
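As an aside (my addition): the Reduce() step is just a last-observation-carried-forward fill, so zoo::na.locf could replace it, assuming the zoo package is acceptable:
v3 <- mtcars
v3$test <- ifelse(v3$carb < 4, NA_real_, v3$mpg)
if (is.na(v3$test[1])) v3$test[1] <- v3$mpg[1]  # seed the fill, like init = mpg[1]
v3$test <- zoo::na.locf(v3$test)
all.equal(v2$test, v3$test)  # should be TRUE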
Here is a comparison with some alternatives:
# works even if the first entry does not comply with the condition
mtcars$carb[1] <- 1
# essentially the OPs solution
v0 <- mtcars
v0$test <- v0$mpg
for (i in 2:nrow(v0))
ifelse(v0$carb[[i]] < 4,
v0$test[[i]] <- v0[[i-1,'test']],
v0$test[[i]] <- v0[[i,'mpg']])
# using if ... else instead of ifelse
v1 <- mtcars
v1$test <- v1$mpg
for (i in 2:nrow(v1))
v1$test[i] <- if(v1$carb[i] < 4) v1$test[i - 1] else v1$test[i]
# we get the same
all.equal(v0, v1)
#R> [1] TRUE
# using ifelse + Reduce
v2 <- mtcars
v2$test <- ifelse(v2$carb < 4, NA_real_, v2$mpg)
v2$test <- Reduce(
function(xprev, xnew)
if(is.na(xnew)) xprev else xnew,
v2$test, accumulate = TRUE, init = v2$mpg[1])[-1]
# we get the same
all.equal(v0, v2)
#R> [1] TRUE
# compare the computation time
bench::mark(
`ifelse` = {
v0 <- mtcars
v0$test <- v0$mpg
for (i in 2:nrow(v0))
ifelse(v0$carb[[i]] < 4,
v0$test[[i]] <- v0[[i-1,'test']],
v0$test[[i]] <- v0[[i,'mpg']])
},
`if ... else` = {
v1 <- mtcars
v1$test <- v1$mpg
for (i in 2:nrow(v1))
v1$test[i] <- if(v1$carb[i] < 4) v1$test[i - 1] else v1$test[i]
},
`ifelse + reduce` = {
v2 <- mtcars
v2$test <- ifelse(v2$carb < 4, NA_real_, v2$mpg)
v2$test <- Reduce(
function(xprev, xnew)
if(is.na(xnew)) xprev else xnew,
v2$test, accumulate = TRUE, init = v2$mpg[1])[-1]
}, min_time = 1, check = FALSE)
#R> # A tibble: 3 x 13
#R> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#R> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#R> 1 ifelse 3.27ms 3.46ms 285. 50.05KB 16.8 254 15 891ms
#R> 2 if ... else 2.72ms 2.9ms 341. 48.7KB 17.9 304 16 892ms
#R> 3 ifelse + reduce 40.98µs 44.92µs 21803. 9.79KB 19.6 9991 9 458ms
However, my data set has more than 8 million observations. I feel this is not the optimal way to save time.
The ifelse and Reduce solution runs in ~9 seconds on my computer with 8 million rows which I guess is manageable if it is only done once:
# simulate a large data set
set.seed(1)
n <- 8e6
dum_dat <- data.frame(var_1 = runif(n, 0, 8), var_2 = rnorm(n))
system.time({
dum_dat$test <- ifelse(dum_dat$var_1 < 4, NA_real_, dum_dat$var_2)
func <- compiler::cmpfun(
function(xprev, xnew)
if(is.na(xnew)) xprev else xnew)
dum_dat$test <- Reduce(
func, dum_dat$test, accumulate = TRUE, init = dum_dat$var_2[1])[-1]
})
#R> user system elapsed
#R> 8.816 0.064 8.882
We can just add the lag() function to your ifelse condition:
data$test = ifelse(data$carb < 4, data$test <- lag(data$test), data$test <- data$mpg)
> data
mpg cyl disp hp drat wt qsec vs am gear carb test
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.0
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.0
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 21.0
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.0
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 22.8
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 21.4
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 14.3
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 14.3
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 14.3
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 19.2
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 17.8
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 17.8
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 17.8
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 16.4
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 10.4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 10.4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 14.7
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 14.7
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 14.7
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 32.4
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 30.4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 33.9
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 21.5
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 13.3
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 13.3
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 13.3
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 19.2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 27.3
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 15.8
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 19.7
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 15.0
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 15.0
>

Assigning same random value to all of one variable?

Using mtcars as an example, I am trying to create a new column that assigns the same random value to all rows with the same value of cyl.
I tried:
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars %>%
  group_by(cyl) %>%
  mutate(rand = sample(c("A", "B"), replace = T))
However, the length seems to be wrong, and I'm not sure whether it will just assign a random A or B to each row instead of the same random A or B to each level of cyl. Any insight? Should I be creating a for loop over each unique(cyl)?
You need to specify size = 1 in sample so that every row with the same value of cyl gets the same random value.
library(dplyr)
set.seed(567)
mtcars %>% group_by(cyl) %>% mutate(rand = sample(c("A", "B"), 1))
# mpg cyl disp hp drat wt qsec vs am gear carb rand
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 B
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 B
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 A
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 B
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 A
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 B
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 A
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 A
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 A
#10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 B
# … with 22 more rows
Given your clarification, I think an easy solution is to use a merge. First generate a data frame associating each cyl with a random value, then merge on cyl:
dfrand <- data.frame(
rand = sample(c("A","B"), size = length(unique(df$cyl)), replace = TRUE),
cyl = unique(df$cyl), stringsAsFactors = FALSE
)
dfrand
rand cyl
1 B 6
2 A 4
3 B 8
And then you merge. You can use base R
merge(df, dfrand, by = "cyl")
or dplyr:
dplyr::left_join(
df, dfrand, by = 'cyl'
)
The result should look like the following (taking 5 random rows of the merged table):
merge(df, dfrand, by = "cyl")[sample(1:nrow(df), size = 5), ]
cyl mpg disp hp drat wt qsec vs am gear carb rand
1: 8 13.3 350.0 245 3.73 3.84 15.41 0 0 3 4 B
2: 4 24.4 146.7 62 3.69 3.19 20.00 1 0 4 2 A
3: 8 17.3 275.8 180 3.07 3.73 17.60 0 0 3 3 B
4: 4 32.4 78.7 66 4.08 2.20 19.47 1 1 4 1 A
5: 4 22.8 108.0 93 3.85 2.32 18.61 1 1 4 1 A
We can use data.table:
library(data.table)
as.data.table(mtcars)[, rand := sample(c("A", "B"), 1), by = cyl]
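For a reproducible draw, set a seed first (my addition):
library(data.table)
set.seed(567)
dt <- as.data.table(mtcars)[, rand := sample(c("A", "B"), 1), by = cyl]
dt[, .N, by = .(cyl, rand)]  # one letter per cyl group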
