Creating new columns with mutate - r

I can solve my problem, but only in a very suboptimal way, so the solution I have doesn't scale to a large df. Let me explain.
I have a big data frame and I need to create new columns by subtracting pairs of existing ones. Let me show you with a simple df.
A <- rnorm(10)
B <- rnorm(10)
C <- rnorm(10)
D <- rnorm(10)
E <- rnorm(10)
F <- rnorm(10)
df1 <- data_frame(A, B, C, D, E, F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns that sit a fixed distance apart: D-A, E-B, F-C, while (ii) naming each new column after the original variables.
I did it this way and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like to find a way to make the code simpler. I tried many things (like mutate_all, mutate_at, or add_column) but nothing worked...

OK, here's a method that will work for the full width of your data set.
df1 <- tibble(A = rnorm(10),
              B = rnorm(10),
              C = rnorm(10),
              D = rnorm(10),
              E = rnorm(10),
              F = rnorm(10),
              G = rnorm(10),
              H = rnorm(10),
              I = rnorm(10))
ct <- 1:(ncol(df1) - 3)   # stop 3 columns short of the end so that i + 3 stays in range
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste('diff', colnames(df1)[i + 3], colnames(df1)[i], sep = '')
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}
diff_tbl <- diff_tbl %>%
  select(-testcol)
df1 <- bind_cols(df1, diff_tbl)
Basically, you create a second dummy tibble to hold the differences, iterate over the possible differences (i.e. pairs of columns three apart), assemble them into a single tibble, and then bind those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing worked like a charm.
It's probable that there's a more elegant way to do this, but this method definitely works. There's one slightly awkward thing in that I had to create the diff_tbl with a dummy column and then remove it before the final bind_cols() call, but it's not a major thing, I think.
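As one sketch of a more compact variant, the loop (and the dummy column) can be replaced with purrr::map_dfc, assuming, as above, that paired columns always sit three apart:
library(dplyr)
library(purrr)
gap <- 3                          # paired columns sit at a fixed offset of 3
idx <- seq_len(ncol(df1) - gap)   # indices of the left-hand column in each pair
diff_tbl <- map_dfc(idx, function(i) {
  # build a one-column tibble whose name is derived from the pair of columns
  tibble(!!paste0("diff", names(df1)[i + gap], names(df1)[i]) := df1[[i + gap]] - df1[[i]])
})
df1 <- bind_cols(df1, diff_tbl)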

You could divide the data frame into two halves and do
inds <- ncol(df1) / 2
df1[paste0("diff", names(df1[(inds + 1):ncol(df1)]), names(df1[1:inds]))] <-
  df1[(inds + 1):ncol(df1)] - df1[1:inds]
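The same idea can be wrapped in a small function so the split point isn't hard-coded (a sketch; it assumes the first half of the columns pairs up, in order, with the second half):
diff_halves <- function(df) {
  inds  <- ncol(df) / 2
  left  <- df[seq_len(inds)]
  right <- df[inds + seq_len(inds)]
  out   <- right - left   # data frame subtraction is column-wise
  names(out) <- paste0("diff", names(right), names(left))
  bind_cols(df, out)
}
df1 <- diff_halves(df1)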

Note that column names containing dashes are non-syntactic and not recommended.
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072
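If you do keep the dash names, remember they are non-syntactic, so you'll need backticks or [[ ]] to refer to them later, e.g.:
result$`D-A`      # backticks for non-syntactic names
result[["E-B"]]   # [[ ]] takes any name as a string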

Related

Join tables via multiple partial matching

I have a table describing a train track, with each line being a segment of the track with a from and to station as well as a trackID and segment ID. The station names are completely random, not as structured as they appear here.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", 1:8),
  to = paste0("station_2", 1:8)
)
tracks
trackID segment from to
1 A a station_11 station_21
2 A b station_12 station_22
3 A c station_13 station_23
4 A d station_14 station_24
5 B e station_15 station_25
6 B f station_16 station_26
7 B g station_17 station_27
8 B h station_18 station_28
I have another table with sightings made on this train, and I would like to know the corresponding trackID for each sighting. The table looks like this:
sightings <- data.frame(from = c("station_24", "station_28", "station_14"),
                        to = c("station_14", "station_16", "station_25"))
sightings
from to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25
I could gather the trackID from the to and from information provided in the sightings table. BUT, from and to in the sightings table do not correspond to the from and to in the tracks table: from and to can be in different segments and can be interchanged (to-from). In some problematic cases, from and to belong to different trackIDs, which should then return no match. The desired output for this example would be:
from to trackID
1 station_24 station_14 A
2 station_28 station_16 B
3 station_14 station_25 <NA> # no match since station_14 and 25 are from two different trackIDs
In my mind, the solution involves collapsing the tracks table by trackID and then doing a double partial matching of strings (using grepl()?). These next lines take care of the collapsing, but I have no clue where to go from here. Can someone point me in the right direction?
Solutions with R / dplyr very much preferred, but I would take anything!
library(dplyr)
tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
  )
tracks
trackID from_to
<fct> <chr>
1 A station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28
EDIT: It seems that I've oversimplified my problem in my minimal example. The main issue is that stations (from and to) are not unique in the table, and not even unique to a trackID. Only a combination of to and from is unique to a trackID. I've accepted the answer as it solves the problem as stated, but I will also provide my own solution that I've come up with in the meantime.
A double-join can work.
Notes: you don't appear to be using segment, so I'm discarding it here, but this might be adapted if needed. Also, I added stringsAsFactors=FALSE to your data, since otherwise combining vectors of factors can be problematic.
library(dplyr)
tracksmod <- bind_rows(
  select(tracks, trackID, sta = from),
  select(tracks, trackID, sta = to)
)
head(tracksmod)
# trackID sta
# 1 A station_11
# 2 A station_12
# 3 A station_13
# 4 A station_14
# 5 B station_15
# 6 B station_16
sightings %>%
  left_join(select(tracksmod, trackID, from = sta), by = "from") %>%
  left_join(select(tracksmod, trackID2 = trackID, to = sta), by = "to") %>%
  mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
  select(-trackID2)
# from to trackID
# 1 station_24 station_14 A
# 2 station_28 station_16 B
# 3 station_14 station_25 <NA>
I did not assume that directionality was important. That is, I'm not assuming that a station listed in from must always be in the from column. This is why I converted tracks to tracksmod, in order to identify a station with an id regardless of direction.
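One caveat (a sketch of a guard, not strictly required by the example data): if the same station appeared more than once under a trackID, the joins above would duplicate sighting rows; a distinct() on tracksmod prevents that.
# drop duplicate (trackID, station) pairs before joining
tracksmod <- distinct(tracksmod, trackID, sta)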
As I've stated in the EDIT to my question, I oversimplified my problem in the minimal example. Here is an updated version of the data that resembles my real data more accurately. I've also added stringsAsFactors = FALSE, as suggested by @r2evans.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  trackID = c("A", "B", NA),
  stringsAsFactors = FALSE
)
I've solved the problem by collapsing the tracks table by trackID and then using the purrr package to nest the loop functions.
library(dplyr)
library(purrr)
# Collapsing the tracks data frame
tracks_collapse <- tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
    # from = list(from),
    # to = list(to),
    # stas = list(c(from, to))
  )
# a helper function to remove NAs when looking for matches
remove_na <- function(x) { x[!is.na(x)] }
pmap_dfr(sightings, function(from, to, trackID) { # pmap_dfr runs over the rows of a data.frame and returns a data.frame
  data.frame(
    from = from, # recreates the sightings data.frame
    to = to,     # ditto
    trackID = paste(   # collapses the resulting vector
      remove_na(       # removes the NA values
        pmap_chr(      # matches every row of sightings against the collapsed tracks
          tracks_collapse,
          function(trackID, from_to) {
            # partial string matching; returns the trackID if both strings match
            ifelse(grepl(from, from_to) & grepl(to, from_to), trackID, NA)
          }
        )
      ), collapse = ","
    )
  )
})
Output:
from to trackID
1 station_24 station_14 A
2 station_28 station_11 B
3 station_14 station_25 <NA>
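One caveat with the grepl() calls: plain substring matching can give false positives when one station name is a prefix of another (e.g. "station_1" would match "station_14"). A safer sketch matches whole comma-separated tokens instead; has_station() here is a hypothetical helper you would swap in for the grepl() calls:
# hypothetical helper: match whole tokens rather than substrings
has_station <- function(station, from_to) {
  station %in% strsplit(from_to, ",", fixed = TRUE)[[1]]
}
# e.g. replace grepl(from, from_to) & grepl(to, from_to)
# with    has_station(from, from_to) & has_station(to, from_to)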

Calling specific cells in the same column (using dplyr?)

I have a dataframe with character and numeric data. I would like to use dplyr to create a summary grouped by time points and trials generating the following:
averages
standard deviations
variation
ratio between time points
(etc etc)
I feel like all of this could be done in a dplyr pipe, but I am struggling to compute the ratio of averages between time points within trials.
I fully admit that I may be carrying around a hammer looking for nails, so please feel free to recommend solutions that use other packages or functions, but ideally I'd like simple, straightforward code for ease of use by multiple collaborators.
library(dplyr)
# creating an example DF
num <- runif(100, 50, 3200)
smpl <- 1:100
df <- data.frame( num, smpl)
df$time <- "time1"
df$time[seq(2,100,2)] <- "time2"
df$trial <- "a"
df$trial[26:50] <- "b"
df$trial[51:75] <- "c"
df$trial[75:100] <- "d"
# using the magic of pipelines to calculate useful things
df1 <- df %>%
  group_by(time, trial) %>%
  summarise(avg = mean(num),
            var = var(num),
            stdev = sd(num))
I'd love to get [the ratio time2/time1 of the avg for each trial] included in this block above, but I don't know how to call "avg" specifically by "time1" vs "time2" within the pipe.
From here on, nothing does quite what I'm hoping for...
df1 <- df1[with(df1, order(trial, time)), ]
# this better resembles my actual DF structure,
# so reordering it will make some of my next attempts to solve this make more sense
I tried to use the fact that 'every other line' is different (this is not ideal because each df will have a different number of rows, so I will either introduce NAs or have to constantly change these numbers (or write a function to change them)).
tm2 <- data.frame(x=df1$avg[seq(2,4,2)])
tm1 <- data.frame(x=df1$avg[seq(1,3,2)])
so minimally, this is the ratio I'd like included in the df, but tied to the avg & trial columns:
tm2/tm1
It doesn't matter to me 'which' time row this ratio ends up in, so long as it is consistent across all the trials (so if a column of ratios has "blank" for every "time1" and "value" for every "time2", that's fine).
# I added in a separate column to allow 'match' later
tm1$time <- "time1"
tm2$time <- "time1" # to keep them all 'in row'
df1$avg_tm1 <- tm1$x[match(df1$time, tm1$time)]
df1$avg_tm2 <- tm2$x[match(df1$time, tm2$time)]
but this fails to match by 'trial' as well, since that info is lost in the new tm1 df; this really makes me think it should all be done in dplyr the first time...
Then I tried to create a new column in the tm1 df with the ratio
tm2$ratio <-tm2$x/tm1$x
and add in the ratio values only if the avg matches
df1$ratio <- tm2$ratio[match(tm2$x, df1$avg)]
This might work, but when I extract the avg values, it rounds, so the numbers do not match exactly. I'm also cautious about this because if I process ridiculous amounts of data, there's a higher and higher chance that two random averages will be similar enough to misplace these ratios.
I tried several other things that completely failed, so let's pretend that something worked and entered the ratio into the df1 as separate columns
Then any further calculations or annotations are straight forward:
df2 <- df1 %>%
  mutate(ratio = avg_tm2 / avg_tm1,
         lost = 1 - ratio,
         word = paste0(round(lost * 100), "%"))
But I am still stuck on 'how' to call specific cells inside the pipe or which other tools/packages to use to calculate deltas or ratios between cells in the same column.
Thanks in advance
We could group by 'trial' and mutate to create the 'ratio' column
df1 %>%
  group_by(trial) %>%
  mutate(ratio = last(avg) / first(avg))
# A tibble: 8 x 6
# Groups: trial [4]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 time1 a 1815. 715630. 846. 0.795
#2 time1 b 2012. 1299823. 1140. 0.686
#3 time1 c 1505. 878168. 937. 1.09
#4 time1 d 1387. 902364. 950. 1.17
#5 time2 a 1444. 998943. 999. 0.795
#6 time2 b 1380. 720135. 849. 0.686
#7 time2 c 1641. 1205778. 1098. 1.09
#8 time2 d 1619. 582418. 763. 1.17
NOTE: We used set.seed(2) for creating the dataset
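Note that last(avg)/first(avg) relies on time1 sorting before time2 within each trial. If that ordering isn't guaranteed, an order-independent sketch (assuming exactly one row per time point per trial) is:
df1 %>%
  group_by(trial) %>%
  mutate(ratio = avg[time == "time2"] / avg[time == "time1"])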
Work out a separate data.frame:
set.seed(2)
# your code above to generate df1
library(tidyr)   # needed for spread()
df2 <- select(df1, time, trial, avg) %>%
  spread(time, avg) %>%
  mutate(ratio = time2 / time1)
df2
df2
# # A tibble: 4 × 4
# trial time1 time2 ratio
# <chr> <dbl> <dbl> <dbl>
# 1 a 1815.203 1443.731 0.7953555
# 2 b 2012.436 1379.981 0.6857266
# 3 c 1505.474 1641.439 1.0903135
# 4 d 1386.876 1619.341 1.1676176
and now you can merge the relevant column onto the original frame:
left_join(df1, select(df2, trial, ratio), by="trial")
# Source: local data frame [8 x 6]
# Groups: time [?]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 time1 a 1815.203 715630.4 845.9494 0.7953555
# 2 time1 b 2012.436 1299823.3 1140.0979 0.6857266
# 3 time1 c 1505.474 878168.3 937.1063 1.0903135
# 4 time1 d 1386.876 902363.7 949.9282 1.1676176
# 5 time2 a 1443.731 998943.3 999.4715 0.7953555
# 6 time2 b 1379.981 720134.6 848.6074 0.6857266
# 7 time2 c 1641.439 1205778.0 1098.0792 1.0903135
# 8 time2 d 1619.341 582417.5 763.1629 1.1676176
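In current tidyr, spread() has been superseded by pivot_wider(); an equivalent sketch of the df2 step:
library(tidyr)
df2 <- df1 %>%
  select(time, trial, avg) %>%
  pivot_wider(names_from = time, values_from = avg) %>%  # one column per time point
  mutate(ratio = time2 / time1)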

Calling a data.frame from global.env and adding a column with the data.frame name

I have a dataset consisting of pairs of data.frames (which are almost exact pairs, but not close enough to merge directly) which I need to munge together. Luckily, each df has an identifier for the date it was created, which can be used to reference its pair. E.g.
df_0101 <- data.frame(a = rnorm(1:10),
                      b = runif(1:10))
df_0102 <- data.frame(a = rnorm(5:20),
                      b = runif(5:20))
df2_0101 <- data.frame(a2 = rnorm(1:10),
                       b2 = runif(1:10))
df2_0102 <- data.frame(a2 = rnorm(5:20),
                       b2 = runif(5:20))
Therefore, the first thing I need to do is mutate a new column on each data.frame consisting of this date (01_01/ 01_02 / etc.) i.e.
df_0101 <- df_0101 %>%
  mutate(df_name = "df_0101")
but obviously in a programmatic manner.
I can call every data.frame in the global environment using
l_df <- Filter(function(x) is(x, "data.frame"), mget(ls()))
head(l_df)
$df_0101
a b
1 0.7588803 0.17837296
2 -0.2592187 0.45445752
3 1.2221744 0.01553190
4 1.1534353 0.72097071
5 0.7279514 0.96770448
$df_0102
a b
1 -0.33415584 0.53597308
2 0.31730849 0.32995013
3 -0.18936533 0.41024220
4 0.49441962 0.22123885
5 -0.28985964 0.62388478
$df2_0101
a2 b2
1 -0.5600229 0.6283224
2 0.5944657 0.7384586
3 1.1284180 0.4656239
4 -0.4737340 0.1555984
5 -0.3838161 0.3373913
$df2_0102
a2 b2
1 -0.67987149 0.65352466
2 1.46878953 0.47135011
3 0.10902751 0.04460594
4 -1.82677732 0.38636357
5 1.06021443 0.92935144
but I have no idea how to then pull the names of each df down into a new column on each. Any ideas?
Thanks for reading,
We can use Map in base R
Map(cbind, names = names(l_df), l_df)
If we are going the tidyverse way, then
library(tidyverse)
map2(names(l_df), l_df, ~(cbind(names = .x, .y)))
Also, this can be combined into a single dataset with bind_rows
bind_rows(l_df, .id = "names")
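If you want the labelled frames back as separate objects in the global environment rather than as a list, base list2env() will do it:
l_df2 <- Map(cbind, names = names(l_df), l_df)  # add the name column to each df
list2env(l_df2, envir = .GlobalEnv)             # write each df back under its own name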

Deciles by Grouped Variable in R

I want to find deciles for each grouped variable. I am specifically looking for methods using dplyr and lapply. I'd appreciate it if you could help me out.
Here's what I tried. I don't know how to pull deciles directly other than calling dplyr::ntile() (which didn't work for me).
Attempt 1
Here's what I tried using describe() from Hmisc package:
set.seed(10)
IData <- data.frame(let = sample(x = LETTERS, size = 10000, replace = TRUE),
                    numbers = sample(x = c(1:20000), size = 10000))
Output <- IData %>%
  data.table::as.data.table(.) %>%
  split(., by = c("let"), drop = TRUE, sorted = TRUE) %>%
  purrr::map(~describe(.$numbers))
This certainly helps, but there are two problems with the above code:
a) The output (even in list format) is not what I am looking for.
b) I don't really know how to extract 5%, 10%, ... from the list above.
The bottom line is that I am stuck.
Attempt 2
I tried replacing describe with ntile, but the following code gave me output that didn't make sense to me because the number of columns isn't 10. Upon running Output[[1]], I see a vector of ~400 numbers instead of 10.
Output <- IData %>%
  data.table::as.data.table(.) %>%
  split(., by = c("let"), drop = TRUE, sorted = TRUE) %>%
  purrr::map(~dplyr::ntile(.$numbers, 10))
Attempt 3 = Expected Output
Finally, I tried going the old school (i.e. copy-paste) to get the expected output:
Output <- IData %>%
  dplyr::group_by(let) %>%
  dplyr::summarise(QQuantile1 = quantile(numbers, .10),
                   QQuantile2 = quantile(numbers, .20),
                   QQuantile3 = quantile(numbers, .30),
                   QQuantile4 = quantile(numbers, .40),
                   QQuantile5 = quantile(numbers, .50),
                   QQuantile6 = quantile(numbers, .60),
                   QQuantile7 = quantile(numbers, .70),
                   QQuantile8 = quantile(numbers, .80),
                   QQuantile9 = quantile(numbers, .90),
                   QQuantile10 = quantile(numbers, 1))  # note: 1, not .100 (which is 0.1)
Question: can someone please help me generate the above output using these three methods (not just one; preferably all three, for learning):
1) lapply
2) dplyr
3) data.table
I looked at several threads on SO, but they all deal with one specific quantile rather than all of them, e.g. the Find top deciles from dataframe by group thread.
To assemble my comments into an answer, base is shockingly simple:
aggregate(numbers ~ let, IData, quantile, seq(0.1, 1, 0.1))
## let numbers.10% numbers.20% numbers.30% numbers.40% numbers.50% numbers.60% numbers.70% numbers.80% ...
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0
## ...
with the caveat that numbers is actually a nested column that may need to be unpacked for further usage.
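To unpack that nested matrix column into ordinary columns, one option is:
res <- aggregate(numbers ~ let, IData, quantile, seq(0.1, 1, 0.1))
res_flat <- cbind(res["let"], as.data.frame(res$numbers))  # one column per decile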
dplyr works if you use list columns or do and reshape:
library(tidyverse)
IData %>%
  group_by(let) %>%
  summarise(quant_prob = list(paste0('quant', seq(.1, 1, .1))),
            quant_value = list(quantile(numbers, seq(.1, 1, .1)))) %>%
  unnest() %>%
  spread(quant_prob, quant_value)
## # A tibble: 26 × 11
## let quant0.1 quant0.2 quant0.3 quant0.4 quant0.5 quant0.6 quant0.7 quant0.8 quant0.9 quant1
## * <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8 17763.0 19958
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4 17877.4 19929
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0 17265.0 19876
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2 17961.0 19989
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2 18011.4 19887
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0 18345.0 19997
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8 18219.2 19922
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0 17331.0 19996
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4 17706.6 19965
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0 18091.2 19901
## # ... with 16 more rows
Another interesting option is purrrlyr::by_slice, which lets you collect the results to columns:
IData %>%
  group_by(let) %>%
  by_slice(~quantile(.x$numbers, seq(0.1, 1, 0.1)), .collate = "cols")
## # A tibble: 26 × 11
## let .out1 .out2 .out3 .out4 .out5 .out6 .out7 .out8 .out9 .out10
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 1749.8 3847.8 5562.6 7475.2 9926.0 11758.6 13230.6 15788.8 17763.0 19958
## 2 B 2393.5 4483.6 6359.1 7708.0 9773.0 11842.8 13468.9 16266.4 17877.4 19929
## 3 C 2041.5 3682.0 5677.5 7504.0 9226.0 11470.0 13628.5 15379.0 17265.0 19876
## 4 D 1890.7 4086.8 5661.9 7526.6 9714.0 11438.8 13969.2 15967.2 17961.0 19989
## 5 E 2083.6 4107.0 6179.8 7910.8 10095.0 11692.6 13668.0 15570.2 18011.4 19887
## 6 F 1936.6 4220.2 6197.0 8791.8 10382.0 12266.4 14589.2 16407.0 18345.0 19997
## 7 G 3059.4 4884.2 6519.6 8530.0 10481.0 12469.0 14401.6 16127.8 18219.2 19922
## 8 H 2186.5 4081.0 5801.5 7206.0 9256.5 11453.0 13692.0 15471.0 17331.0 19996
## 9 I 1534.1 3793.2 5822.2 7621.4 9417.5 11737.0 14191.2 15722.4 17706.6 19965
## 10 J 1967.2 4286.6 5829.6 7664.6 10606.0 12217.4 14422.2 16628.0 18091.2 19901
## # ... with 16 more rows
though the column names are a little lousy.
We can do this compactly with data.table. Convert the 'data.frame' to a 'data.table' (setDT(IData)), then, grouped by 'let', get the quantiles of 'numbers' and convert the result to a list (as.list).
library(data.table)
setDT(IData)[, as.list(quantile(numbers, seq(.1, 1, .1))), by = let]
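For completeness, since the question also asked for lapply, a base sketch with split():
# one named vector of deciles per level of 'let'
out <- lapply(split(IData$numbers, IData$let), quantile, probs = seq(0.1, 1, 0.1))
do.call(rbind, out)   # stack into a matrix, one row per group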

Dplyr: how to loop over specific columns whose names are in a list?

I have a dataframe that looks like this
set.seed(10)
sample <- data_frame(group = c('A', 'B', 'C', 'C', NA, 'D'),
                     var_hello = rnorm(6),
                     var_how = rnorm(6),
                     var_are = rnorm(6),
                     var_you = rnorm(6),
                     var_buddy = rnorm(6))
# A tibble: 6 × 6
group var_hello var_how var_are var_you var_buddy
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.01874617 -1.2080762 -0.23823356 0.9255213 -1.2651980
2 B -0.18425254 -0.3636760 0.98744470 0.4829785 -0.3736616
3 C -1.37133055 -1.6266727 0.74139013 -0.5963106 -0.6875554
4 C -0.59916772 -0.2564784 0.08934727 -2.1852868 -0.8721588
5 <NA> 0.29454513 1.1017795 -0.95494386 -0.6748659 -0.1017610
6 D 0.38979430 0.7557815 -0.19515038 -2.1190612 -0.2537805
In my original dataset, there are many, many var_something variables.
I would like to group_by('group') and compute the mean of a subset of these var_something variables, but even this subset can be large, so I don't want to resort to typing a mutate manually for every variable.
In the example, I am interested in the variables in the list ['var_hello', 'var_are'].
I don't know how to code that up efficiently in dplyr. In pandas, one could simply write
for var in ['var_hello', 'var_are']:
    sample['computation_' + var] = sample.groupby('group')[var].transform('mean')
Note how I can automatically create the new column names (of the form computation_var_hello). What is the best way to achieve that in dplyr?
Many thanks!
You can do this simply by using group_by and summarize_each. You then specify which variables you want to summarize, then replace the prefix in the names using setNames.
sample %>%
  group_by(group) %>%
  summarize_each(funs(mean), var_hello, var_are) %>%
  setNames(gsub("var_", "computation_var_", colnames(.)))
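Note that summarize_each() is deprecated in current dplyr; an equivalent sketch uses across(), which also takes the variable names as a character vector (matching the list in the question) and builds the new names via its .names argument:
vars <- c("var_hello", "var_are")
sample %>%
  group_by(group) %>%
  summarize(across(all_of(vars), mean, .names = "computation_{.col}"))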
