How to compute row sums using tidyverse in R

I did mtcars %>% by_row(sum) but got the message:
by_row() is deprecated; please use a combination of: tidyr::nest();
dplyr::mutate(); purrr::map()
My naive approach is this:
mtcars %>%
group_by(id = row_number()) %>%
nest(-id) %>%
mutate(hi = map_dbl(data, sum))
Is there a way to do it without creating an "id" column?

Is this what you are looking for?
mtcars %>% mutate(rowsum = rowSums(.))
Output:
mpg cyl disp hp drat wt qsec vs am gear carb rowsum
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 328.980
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 329.795
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 259.580
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 426.135
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 590.310
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 385.540
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 656.920
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 270.980
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 299.570
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 350.460
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 349.660
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 510.740
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 511.500
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 509.850
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 728.560
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 726.644

You can also sum over a subset of columns:
mtcars %>% mutate(rowsum = rowSums(.[2:4]))
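If you want something closer in spirit to the deprecated by_row(), a sketch with rowwise() and c_across() is below. It assumes dplyr 1.0.0 or later, where both functions are available. It needs no "id" column, but it is usually slower than rowSums() on large data. The names row_totals and subset_totals are only illustrative.

```r
library(dplyr)

# rowwise() treats each row as its own group, so sum() sees one row at a time.
# c_across() gathers the selected columns of that row into a single vector.
row_totals <- mtcars %>%
  rowwise() %>%
  mutate(rowsum = sum(c_across(everything()))) %>%
  ungroup()

# Same idea restricted to columns 2 to 4 of mtcars (cyl, disp, hp).
subset_totals <- mtcars %>%
  rowwise() %>%
  mutate(rowsum = sum(c_across(cyl:hp))) %>%
  ungroup()

head(row_totals$rowsum)
```

For plain sums, rowSums() as shown above remains the simpler choice; rowwise() is mainly useful when the per-row function is not vectorised.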

mtcars %>% mutate(rowsum = pmap_dbl(., sum))

Furthermore, you can apply a condition inside rowSums(). Note that this counts the columns that meet the criterion in each row; it does not sum their values:
mtcars %>%
select(all_of(c('gear', 'carb'))) %>%
mutate(
high_gear_carb = rowSums(. > 3)
)
gear carb high_gear_carb
1 4 4 2
2 4 4 2
3 4 1 1
4 3 1 0
5 3 2 0
6 3 1 0
7 3 4 1
...
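If you want the sum of the values that meet the criterion rather than the count, one possible sketch is to multiply the values by the logical mask before summing. The names vals and high_gear_carb_sum are made up for this example.

```r
# Work on a plain numeric matrix so the arithmetic below is unambiguous.
vals <- as.matrix(mtcars[, c("gear", "carb")])

# (vals > 3) is a logical matrix; multiplying by it zeroes out the cells
# that fail the condition, so rowSums() adds up only the qualifying values.
high_gear_carb_sum <- rowSums(vals * (vals > 3))

head(high_gear_carb_sum, 7)
```

This assumes there are no NA values in the selected columns; with missing data you would need to decide how they should be treated first.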

Related

Correct way of selecting the jth row while using summarise after group_by

My query is: May I use summarise after group_by like this:
mydataset %>%
arrange(grouping_variable,ordering_variable) %>%
group_by(grouping_variable) %>% summarise(answer = another_variable[j])
I expect to see the jth ranked row in each group when I do the above.
I think the above is correct, but it is not mentioned in the documentation.
I ran the following experiment to determine this.
Here is the whole data set:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% as.data.frame
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
2 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
7 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
8 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
9 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
10 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
11 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
12 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
13 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
15 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
16 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
17 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
18 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
19 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
20 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
21 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
22 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
23 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
24 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
25 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
26 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
27 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
30 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
31 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
32 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Here is the first row (j=1) in each group:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[1])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
Here is the second row in each group (j=2):
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = mpg[2])
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
The help page says that the way to do this is as follows:
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = first(mpg))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.4
2 6 17.8
3 8 10.4
>
> mtcars %>% arrange(cyl,mpg) %>% group_by(cyl) %>% summarise(answer = nth(mpg,2))
# A tibble: 3 × 2
cyl answer
<dbl> <dbl>
1 4 21.5
2 6 18.1
3 8 10.4
>
I don't remember which website I read this on. Since it is not mentioned on the help page, that is why I am asking here.
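A small check of the two approaches side by side is sketched below; j, sorted, by_index, by_nth and whole_row are names chosen only for this example. On this data both give the same values. When a group has fewer than j rows, plain indexing gives NA, and nth() also returns a missing value unless you set its default argument. If you want the whole jth row rather than one value, slice() is an option.

```r
library(dplyr)

j <- 2  # hypothetical position to extract within each group

sorted <- mtcars %>%
  arrange(cyl, mpg) %>%
  group_by(cyl)

# Plain indexing and nth() evaluated per group.
by_index <- sorted %>% summarise(answer = mpg[j])
by_nth   <- sorted %>% summarise(answer = nth(mpg, j))

# slice() keeps every column of the jth row instead of a single value.
whole_row <- sorted %>% slice(j) %>% ungroup()

all.equal(by_index$answer, by_nth$answer)
```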

How to apply a window function over a subset of rows inside dplyr::mutate?

Let's imagine mtcars is an ordered data.frame and that I want the maximum over the previous rows. Then, I can do:
> mtcars %>% mutate(test = cummax(wt))
mpg cyl disp hp drat wt qsec vs am gear carb test
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2.620
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2.875
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2.875
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.215
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 3.440
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3.460
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 3.570
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.570
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.570
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.570
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3.570
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 4.070
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 4.070
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 4.070
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 5.250
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 5.424
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 5.424
...
Now I want the maximum not over all the previous rows, but only over the previous rows where wt < 4. The result would differ from row 12 onward (shown in the test2 column below) if the following instruction worked:
> mtcars %>% mutate(test = cummax(wt), test2 = cummax(wt[wt < 4]))
mpg cyl disp hp drat wt qsec vs am gear carb test test2
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2.620 2.620
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 2.875 2.875
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 2.875 2.875
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.215 3.215
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 3.440 3.440
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3.460 3.460
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 3.570 3.570
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.570 3.570
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.570 3.570
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.570 3.570
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3.570 3.570
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 4.070 3.570
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 4.070 3.730
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 4.070 3.780
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 5.250 3.780
16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 5.424 3.780
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 5.424 3.780
...
But actually, that instruction produces an error:
> mtcars %>% mutate(test = cummax(wt[wt<4]))
Error: Column `test` must be length 32 (the number of rows) or one, not 28
Let me add that I would like to do this by group, so the instruction would start with mtcars %>% group_by(cyl) %>% ...
How can I do this?
Thank you in advance!
I would use ifelse and replace any value that violates your condition with -Inf:
mtcars %>%
mutate(test = cummax(ifelse(wt < 4, wt, -Inf)))
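For the per-group version asked about in the question, a sketch of the same idea with group_by() added is below. The names grouped and test2 are only for illustration. One caveat: if a group starts with rows where wt >= 4, test2 stays at -Inf until the first qualifying row appears.

```r
library(dplyr)

grouped <- mtcars %>%
  group_by(cyl) %>%
  # Rows with wt >= 4 contribute -Inf, so they can never raise the running maximum.
  mutate(test2 = cummax(ifelse(wt < 4, wt, -Inf))) %>%
  ungroup()

# Inspect the running maximum for one group.
grouped %>% filter(cyl == 8) %>% select(wt, test2) %>% head()
```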

Combine a list of data frames column wise and return a list of combined data frames using R

I would like to combine two lists of data frames element-wise and return a list of data frames. The following code works for the mtcars dataset:
list1=split(mtcars[c(1:16),-11],mtcars[c(1:16),2])
list2=split(data.frame(mtcars[c(1:16),]),mtcars[c(1:16),2])
newList=Map(cbind, list1, list2)
How do I modify the Map function to just bind a specific column(s) from list2? Thanks
Since @thelatemail doesn't want to add an answer, here is a purrr version of his answer.
library(purrr)
map2(list1, map(list2, `[`, 'carb'), cbind)
#Or
#map2(list1, map(list2, `[`, 'carb'), dplyr::bind_cols)
#$`4`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
#2 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
#3 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
#$`6`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#$`8`
# mpg cyl disp hp drat wt qsec vs am gear carb
#1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#2 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#3 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#4 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#5 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#6 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#7 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
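If you would rather stay with base R's Map, as in the question, the same idea can be sketched as follows: subset each element of list2 first, then bind. This is my own reading of how the purrr call translates, not a quote of anyone's answer.

```r
# Same inputs as in the question.
list1 <- split(mtcars[1:16, -11], mtcars[1:16, 2])
list2 <- split(mtcars[1:16, ], mtcars[1:16, 2])

# Keep only the wanted column(s) of each list2 element before binding.
# `[` with a column name returns a one-column data frame, so cbind keeps the name.
newList <- Map(cbind, list1, lapply(list2, `[`, "carb"))

names(newList[["4"]])
```

To bind several columns, pass a character vector such as c("gear", "carb") in place of "carb".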

Applying function using dplyr and setting output as columns in dataframe

I have a huge dataframe and I am applying a function that has multiple outputs on one column and would like to add these outputs as columns in the dataframe.
Example function:
measure <- function(x){ # useless function for illustrative purposes
one <- x+1
two <- x^2
three <- x/2
m <- c(one,two,three)
names(m) <- c('Plus1','Square','Half')
return(m)
}
My current method, which is very inefficient:
a <- mtcars %>% group_by(cyl) %>% mutate(Plus1 = measure(wt)[1], Square = measure(wt)[2],
Half = measure(wt)[3]) %>% as.data.frame()
Output:
head(a,15)
mpg cyl disp hp drat wt qsec vs am gear carb Plus1 Square Half
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 3.62 3.875 4.215
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3.62 3.875 4.215
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.32 4.190 4.150
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 3.62 3.875 4.215
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 4.44 4.570 5.070
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 3.62 3.875 4.215
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 4.44 4.570 5.070
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 3.32 4.190 4.150
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 3.32 4.190 4.150
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 3.62 3.875 4.215
11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 3.62 3.875 4.215
12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 4.44 4.570 5.070
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 4.44 4.570 5.070
14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 4.44 4.570 5.070
15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 4.44 4.570 5.070
Is there any more efficient way to do this? My actual function has 13 outputs and it is taking very long to apply to my large dataframe. Please help!
There are various ways to solve this. One option is to return a tibble from the function, split the dataframe by group, calculate the statistics for each group, and bind the results together.
library(tidyverse)
measure <- function(x){
tibble(Plus1 = x+1,Square = x^2,Half = x/2)
}
bind_cols(mtcars %>% arrange(cyl),
mtcars %>%
group_split(cyl) %>%
map_df(~measure(.$wt)))
# mpg cyl disp hp drat wt qsec vs am gear carb Plus1 Square Half
#1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 3.320 5.382400 1.1600
#2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4.190 10.176100 1.5950
#3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4.150 9.922500 1.5750
#4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 3.200 4.840000 1.1000
#5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 2.615 2.608225 0.8075
#6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 2.835 3.367225 0.9175
#....
This calls measure only once per group, regardless of how many values it returns, whereas the original attempt called it n times to extract n values.
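Another possibility, assuming dplyr 1.0.0 or later, is to call the tibble-returning function directly inside mutate() without naming the result. As far as I know, an unnamed data frame result is unpacked into separate columns, and the original row order is kept. A minimal sketch (result is just an illustrative name):

```r
library(dplyr)
library(tibble)

measure <- function(x) {
  tibble(Plus1 = x + 1, Square = x^2, Half = x / 2)
}

# An unnamed data-frame result inside mutate() is spliced in as separate columns;
# measure() is evaluated once per cyl group.
result <- mtcars %>%
  group_by(cyl) %>%
  mutate(measure(wt)) %>%
  ungroup()

head(result)
```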

Combine/merge multiple data frames by element names

I have data frames generated by lapply with distinct element names.
head(df1)
$Sample1
G1 G2 Group
1 1.016673 -1.04402692 1
2 1.019958 -0.86763046 1
3 1.033050 -1.09717438 1
4 1.036969 0.26971351 1
5 1.044059 1.73402959 1
$Sample2
G1 G2 Group
1 1.413218 0.22466456 1
2 1.413339 -0.91755436 1
3 1.415782 -0.23471118 1
4 1.434750 -0.77498973 1
5 1.436905 0.76642626 1
Another set has the same format, but with 2 in the Group column:
head(df2)
$Sample1
G1 G2 Group
1 1.053269 -1.04460950 2
2 1.059461 -0.86711232 2
3 1.072446 -1.09748431 2
4 1.078763 0.26785751 2
5 1.038325 1.73818175 2
$Sample2
G1 G2 Group
1 1.438067 0.22933986 2
2 1.856085 -0.91988726 2
3 1.415782 -0.23405677 2
4 1.434750 -0.77406530 2
5 1.436905 0.76078091 2
My goal is to combine/merge them together by element names, for example Sample1 and Sample2.
$Sample1
G1 G2 Group
1 1.016673 -1.04402692 1
2 1.019958 -0.86763046 1
3 1.033050 -1.09717438 1
4 1.036969 0.26971351 1
5 1.044059 1.73402959 1
1 1.053269 -1.04460950 2
2 1.059461 -0.86711232 2
3 1.072446 -1.09748431 2
4 1.078763 0.26785751 2
5 1.038325 1.73818175 2
$Sample2
G1 G2 Group
1 1.413218 0.22466456 1
2 1.413339 -0.91755436 1
3 1.415782 -0.23471118 1
4 1.434750 -0.77498973 1
5 1.436905 0.76642626 1
1 1.438067 0.22933986 2
2 1.856085 -0.91988726 2
3 1.415782 -0.23405677 2
4 1.434750 -0.77406530 2
5 1.436905 0.76078091 2
I could not figure out how to do this. Could someone help me? Thanks!
Maybe try mapply and rbind:
a1 <- list(mtcars[1:5,],mtcars[6:10,])
a2 <- list(mtcars[11:15,],mtcars[16:20,])
> mapply(FUN = rbind,a1,a2,SIMPLIFY = FALSE)
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
The purrr equivalent (I think) would be map2(a1, a2, rbind).
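Note that mapply and Map pair elements by position, not by name. If the two lists might be in a different order, one way to match on element names is sketched below. The small df1 and df2 here are made-up stand-ins for the lists in the question, and common and combined are illustrative names.

```r
# Small named lists standing in for df1 and df2 from the question
# (contents are invented for illustration; df2 is deliberately in a different order).
df1 <- list(Sample1 = data.frame(G1 = c(1.0, 1.1), Group = 1),
            Sample2 = data.frame(G1 = c(1.4, 1.5), Group = 1))
df2 <- list(Sample2 = data.frame(G1 = c(1.6, 1.7), Group = 2),
            Sample1 = data.frame(G1 = c(1.2, 1.3), Group = 2))

# Index both lists by the shared names so that positions line up before binding.
common <- intersect(names(df1), names(df2))
combined <- Map(rbind, df1[common], df2[common])

combined$Sample1
```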
