sapply results with dplyr - r

In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a manner similar to below, but I am wondering if the sapply() part can be done with another function in dplyr.
I am really just interested in whether the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() might work but am struggling to determine how.
library(tidyverse)
df <- data_frame(
  id = rep(1:10, 10) %>% sort,
  visit = rep(1:10, 10),
  value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
              function(val_i) abs(df$value - val_i))
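For reference, tmp here is a 100 x 3 numeric matrix: one row per row of df and one column per entry of vals_int.
dim(tmp)
#> [1] 100   3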

Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble; it is a list column whose elements are numeric vectors of length 3. If you coerce the output that do() returns to a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
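If you also want to keep the original columns next to the differences, one option (a sketch; data.frame() names the unnamed matrix columns X1..X3 by default) is to bind the row back in inside do():
df %>%
  rowwise() %>%
  do(data.frame(., t(abs(.$value - vals_int))))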

The answer by @qdread does what you are looking for, but the tidyverse is moving away from the do() function (if that matters to you). Here is an alternative method using map from the purrr package.
df %>%
  mutate(closest = map(value, function(x){
    abs(x - vals_int) %>%
      t() %>%
      as.data.frame() %>%   # unnamed matrix columns become V1, V2, V3
      as_tibble()
  })) %>%
  unnest(closest)
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows
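The rowwise step can also be avoided entirely, since abs(value - k) is vectorised over value. A sketch using purrr (the V1..V3 names are chosen here to mirror the output above):
df %>%
  bind_cols(
    map(set_names(vals_int, paste0("V", seq_along(vals_int))),
        ~ abs(df$value - .x)) %>%   # named list: one difference vector per val
      as_tibble()
  )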

Related

tidyverse rename_with giving error when trying to provide new names based on existing column values

Assuming the following data set:
df <- data.frame(...1 = c(1, 2, 3),
                 ...2 = c(1, 2, 3),
                 n_column = c(1, 1, 2))
I now want to rename all vars that start with "...". My real data sets could have different numbers of "..." vars. The information about how many such vars I have is in the n_column column, more precisely, it is the maximum of that column.
So I tried:
df %>%
  rename_with(.cols = starts_with("..."),
              .fn = paste0("new_name", 1:max(n_column)))
which gives an error:
# Error in paste0("new_name", 1:max(n_column)) :
# object 'n_column' not found
So I guess the problem is that the paste0 function doesn't look for the column I provide within the current data set. Not sure, though, how I could make it do so. Any ideas?
I know I could bypass the whole thing by just creating an external scalar that contains the max. of n_column, but ideally I'd like to do everything in one pipeline.
You don't need the information from n_column; .cols will pass only those columns that satisfy the condition (starts_with("...")).
library(dplyr)
df %>% rename_with(~paste0("new_name", seq_along(.)), starts_with("..."))
# new_name1 new_name2 n_column
#1 1 1 1
#2 2 2 1
#3 3 3 2
This is also safer than using max(n_column): if the data in n_column gets corrupted or the number of "..." columns changes, this will still work.
A way to refer to column values in rename_with is to use an anonymous function so that you can use .$n_column.
df %>%
  rename_with(function(x) paste0("new_name", 1:max(.$n_column)),
              starts_with("..."))
I am assuming this is part of a longer chain, so you don't want to use max(df$n_column). Note that this relies on magrittr's %>% making the piped data available as .; it won't work with the native |> pipe.
We can use str_c
library(dplyr)
library(stringr)
df %>%
  rename_with(~str_c("new_name", seq_along(.)), starts_with("..."))
Or using base R
i1 <- startsWith(names(df), "...")
names(df)[i1] <- sub("...", "new_name", names(df)[i1], fixed = TRUE)
df
new_name1 new_name2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
A completely different approach would be
df %>% janitor::clean_names()
x1 x2 n_column
1 1 1 1
2 2 2 1
3 3 3 2

Going from dplyr to base: create a data frame of the first and last index for each level of a variable

Asking how to go from dplyr to base may be a weird ask, especially since I love the tidyverse, but because I learned the tidyverse first, my grasp of base is far from masterful, and I need a base solution because the package I'm helping to develop doesn't want any tidyverse dependencies.
Data (there are many more columns, abbreviated for the sake of the reprex):
sample.df <- tibble(batch = rep(c(1,2,3), c(4,5,6)))
Desire base equivalent of:
sample.df %>%
  mutate(rowid = row_number()) %>%
  group_by(batch) %>%
  summarize(idx_b = min(rowid),
            idx_e = max(rowid))
# A tibble: 3 x 3
# Groups: batch [3]
batch idx_b idx_e
<dbl> <int> <int>
1 1 1 4
2 2 5 9
3 3 10 15
We create a sequence column in the data, use aggregate to get the min/max, and convert the matrix column to regular data.frame columns with do.call:
out <- do.call(data.frame,
               aggregate(rowid ~ batch,
                         transform(sample.df, rowid = seq_len(nrow(sample.df))),
                         FUN = function(x) c(b = min(x), e = max(x))))
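For reference, the matrix column flattens into names built from the b/e tags given inside FUN:
out
#>   batch rowid.b rowid.e
#> 1     1       1       4
#> 2     2       5       9
#> 3     3      10      15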
Another base R option using unique + ave
unique(
  transform(
    sample.df,
    idx_b = ave(1:nrow(sample.df), batch, FUN = min),
    idx_e = ave(1:nrow(sample.df), batch, FUN = max)
  )
)
gives
batch idx_b idx_e
1 1 1 4
5 2 5 9
10 3 10 15
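Another dependency-free sketch: match() gives the first index of each batch, and the same trick on the reversed vector gives the last (the idx names mirror the dplyr output above):
b <- sample.df$batch
data.frame(
  batch = unique(b),
  idx_b = match(unique(b), b),                      # first occurrence of each batch
  idx_e = length(b) - match(unique(b), rev(b)) + 1  # last occurrence of each batch
)
#>   batch idx_b idx_e
#> 1     1     1     4
#> 2     2     5     9
#> 3     3    10    15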

How to define a variable to record the number of processed rows when using R, dplyr and rowwise?

I have a function which takes a long time to run, so I want to know how many rows of my data frame have been processed. Usually we can define a variable in a for loop to track this easily, but I do not know how to do it in dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id = c(1:3), x = c(3, 5, 1))
a1 <- a %>%
  rowwise() %>%
  mutate(y = myFUN(x))
I hope that somewhere in the code I can define a variable i whose value is incremented by 1 every time a row is processed, and have it printed to the console like:
1
2
3
Can you pass another variable to the function which would be the row number of the dataframe and print it in the function? Something like:
myFUN <- function(x, y) {
  message(y)
  x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
#1
#2
#3
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 1 3 4
#2 2 5 6
#3 3 1 2
If your function is vectorized, you can drop map_dbl and do
a %>% mutate(y= myFUN(x, seq_len(n())))
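If you want a true counter like the one you would keep in a for loop, a hedged sketch is to let the function increment a variable outside its own scope as a side effect; this assumes rowwise() calls the function once per row, in order:
counter <- 0
myFUN <- function(x) {
  counter <<- counter + 1   # bump the shared counter on each call
  message(counter)          # report how many rows have been processed so far
  x + 1
}
a1 <- a %>% rowwise() %>% mutate(y = myFUN(x))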

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2 calculated only over time points t with b=1 (the value of b for t-2 does not matter). I know that the function acf can easily calculate the acf but I don't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b==1)
cor(data$a[sel[sel > 2]], data$a[sel[sel>2] - 2])
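For reference, the manual calculation above generalises into a small helper (the function name is mine) that can be looped over groups and lags:
acf_by_group <- function(x, g, grp, lag) {
  sel <- which(g == grp)  # time points t in the requested group
  sel <- sel[sel > lag]   # keep only t where t - lag exists
  cor(x[sel], x[sel - lag])
}
acf_by_group(data$a, data$b, grp = 1, lag = 2)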
If the time series object is a tsibble, the following works for me (ACF() here comes from the feasts package), assuming the data frame is called df and the variable you are interested in is called var. You can additionally specify the maximum lag:
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
I'm not sure I understand exactly what information you are looking for, but if you just want the acf values for multiple groups, this should accomplish that. Some people have mentioned creating a tidy solution; this one uses dplyr, tidyr, and purrr to do grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
sample_data <- dplyr::data_frame(
  group = sample(c("a", "b", "c"), size = 100, replace = TRUE),
  value = sample.int(30, size = 100, replace = TRUE)
)
head(sample_data)
#> # A tibble: 6 × 2
#> group value
#> <chr> <int>
#> 1 c 28
#> 2 c 9
#> 3 c 13
#> 4 c 11
#> 5 a 9
#> 6 c 9
grouped_acf_values <- sample_data %>%
  tidyr::nest(data = -group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = FALSE)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#> group acf_values lag
#> <chr> <dbl> <int>
#> 1 c 1.00000000 0
#> 2 c -0.20192774 1
#> 3 c 0.07191805 2
#> 4 c -0.18440489 3
#> 5 c -0.31817935 4
#> 6 c 0.06368096 5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf, ...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
For grouped computations, I would also definitely go with the new ways "à la" Hadley Wickham with the %>% operator and group_by; studying that is on my todo list...

R sum of rows for different groups of columns that start with a similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset of Likert scales and I want to compute row sums over different groups of columns which share the first string in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like feedback on how to do this more efficiently.
df <- as.data.frame(rbind(rep(sample(1:5), 4), rep(sample(1:5), 4)))
var.names <- c("emp_1", "emp_2", "emp_3", "emp_4", "sat_1", "sat_2",
               "sat_3", "res_1", "res_2", "res_3", "res_4", "com_1",
               "com_2", "com_3", "com_4", "com_5", "cap_1", "cap_2",
               "cap_3", "cap_4")
names(df) <- var.names
So what I did was use the grep function to sum the rows of the variables that start with certain strings and store the result in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply rowSums.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using the base R rowsum function (using set.seed(123)):
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
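To see why this works, look at the grouping vector passed to rowsum(): sub() collapses every item name to its prefix tag, so rows of t(df) sharing a tag get summed together:
head(sub("_.*", "_t", names(df)))
#> [1] "emp_t" "emp_t" "emp_t" "emp_t" "sat_t" "sat_t"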
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
  df,
  sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into a tidy format. The problem is that the data is in a wide rather than a long format, and the variable names, e.g. emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
  gather(key, value) %>%
  extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
  group_by(class) %>%
  summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
Another potential solution is to use dplyr's rowwise() function: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
  rowwise() %>%
  mutate(emp_sum = sum(c_across(starts_with("emp"))),
         sat_sum = sum(c_across(starts_with("sat"))),
         res_sum = sum(c_across(starts_with("res"))),
         com_sum = sum(c_across(starts_with("com"))),
         cap_sum = sum(c_across(starts_with("cap"))))
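If you'd rather not write one c_across() line per prefix, a base sketch (the prefixes object is my own helper) derives the prefixes from the item-style names and loops; rowSums() is vectorised, so this also stays fast on large data:
prefixes <- unique(sub("_\\d+$", "", grep("_\\d+$", names(df), value = TRUE)))
for (p in prefixes) {
  # match only item columns like emp_1, never the derived *_t totals
  df[[paste0(p, "_t")]] <- rowSums(df[grepl(paste0("^", p, "_\\d+$"), names(df))])
}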
