Applying multiple functions to each column in a data frame using aggregate - r

When I need to apply multiple functions to multiple columns sequentially and aggregate by multiple columns and want the results to be bound into a data frame I usually use aggregate() in the following manner:
# bogus functions
foo1 <- function(x){mean(x)*var(x)}
foo2 <- function(x){mean(x)/var(x)}
# for illustration purposes only
npk$block <- as.numeric(npk$block)
subdf <- aggregate(npk[,c("yield", "block")],
by = list(N = npk$N, P = npk$P),
FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})
Having the results in a nicely ordered data frame is achieved by using:
df <- do.call(data.frame, subdf)
Can I avoid the call to do.call() by somehow using aggregate() smarter in this scenario or shorten the whole process by using another base R solution from the start?

As #akrun suggested, dplyr's summarise_each is well-suited to the task.
library(dplyr)
npk %>%
group_by(N, P) %>%
summarise_each(funs(foo1, foo2), yield, block)
# Source: local data frame [4 x 6]
# Groups: N
#
# N P yield_foo2 block_foo2 yield_foo1 block_foo1
# 1 0 0 2.432390 1 1099.583 12.25
# 2 0 1 1.245831 1 2205.361 12.25
# 3 1 0 1.399998 1 2504.727 12.25
# 4 1 1 2.172399 1 1451.309 12.25

You can use
df=data.frame(as.list(aggregate(...

Related

Creating a table that shows a function evaluated for sequences

How would I create a table that takes two varaibles composed of incremental sequences and evaluates a function for the these two variables. An example of what I want to create is like a multiplication table. So the function would be x*y and it would produce a table where [row, column] [1,1]=1, [1,2]=2 [5,5]=25 etc
I think you can use for loops bit I'm not sure.
Thanks in advance
JOE this is pretty basic ... try to follow a basic data manipulation tutorial.
For this type of operations you do not need loops. Read up on vector operations.
What you want to do can be easily done in R with a data frame/tibble.
base R
# create your test vectors
x <- c(1,1,5)
y <- c(1,2,5)
# store them in a data frame
df <- data.frame(x = x, y = y)
df
x y
1 1 1
2 1 2
3 5 5
# in base R you code by refernce to the object and dollar notation
df$mult <- df$x * df$y
df
x y mult
1 1 1 1
2 1 2 2
3 5 5 25
tidyverse
The tidyverse might be a bit more intuitive for vectorised operations:
library(dplyr) # the main data crunching package of the tidyverse
df <- data.frame(x = x, y = y)
# with mutate you can create a new vector (or overwrite an existing one)
df <- df %>% mutate(MULT = x * y)
df
x y MULT
1 1 1 1
2 1 2 2
Good luck with your learning journey!
3 5 5 25

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop, that creates a tibble at the end of each iteration, tbl. Loop uses different date each time, date.
Assume:
tbl <- tibble(colA=1:5,colB=5:10)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(contents are changing every loop, but tbl, date and all columns (colA, colB) names remain the same)
The output that I want needs to start with output - outputdate1, outputdate2 etc.
With columns inside it as colAdate1, colBdate1, and colAdate2, colBdate2 and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))

sapply results with dplyr

In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a matter similar to below, but I am wondering if the sapply() part can be done with another function in dplyr.
I am really just interested in if the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() may work but am struggling to determine how.
library(tidyverse)
df <- data_frame(
id = rep(1:10, 10) %>%
sort,
visit = rep(1:10, 10),
value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
function(val_i) abs(df$value - val_i))
Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble which is a list of vectors with length 3. If you coerce the output that do() returns to be a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
The answer by #qdread does what you are looking for, but the tidyverse is starting to move away from the do() function (if that matters to you, idk). Here is an alternative method using map from the purrr package.
df %>%
mutate(closest = map(value, function(x){
abs(x - vals_int) %>%
t() %>%
as.tibble()
})) %>%
unnest()
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows

Loop rename variables in R with assign()

I am trying to rename a variable over several data frames, but assign wont work. Here is the code I am trying
assign(colnames(eval(as.name(DataFrameX)))[[3]], "<- NewName")
# The idea is, go through every dataset, and change the name of column 3 to
# "NewName" in all of them
This won't return any error (All other versions I could think of returned some kind of error), but it doesn't change the variable name either.
I am using a loop to create several data frames and different variables within each, now I need to rename some of those variables so that the data frames can be merged in one at a later stage. All that works, except for the renaming. If I input myself the names of the dataframe and variables in a regular call with colnames(DF)[[3]] <- "NewName", but somehow when I try to use assign so that it is done in a loop, it doesn't do anything.
Here is what you can do with a loop over all data frames in your environment. Since you are looking for just data frame in your environment, you are immune of the risk to touch any other variable. The point is that you should assign new changes to each data frame within the loop.
df1 <- data.frame(q=1,w=2,e=3)
df2 <- data.frame(q=1,w=2,e=3)
df3 <- data.frame(q=1,w=2,e=3)
# > df1
# q w e
# 1 1 2 3
# > df2
# q w e
# 1 1 2 3
# > df3
# q w e
# 1 1 2 3
DFs=names(which(sapply(.GlobalEnv, is.data.frame)))
for (i in 1:length(DFs)){
df=get(paste0(DFs[i]))
colnames(df)[3]="newName"
assign(DFs[i], df)
}
# > df1
# q w newName
# 1 1 2 3
# > df2
# q w newName
# 1 1 2 3
# > df3
# q w newName
# 1 1 2 3
We could try ?eapply() to apply setnames() from the data.table package to all data.frame's in your global enviromnent.
library(data.table)
eapply(.GlobalEnv, function(x) if (is.data.frame(x)) setnames(x, 3, "NewName"))

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with likert scales and I want to row sum over different group of columns which share the first strings in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there is a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr R rowwise function. https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))

Resources