R count() using dynamically generated list of variables/columns - r

If I have a tibble called observations with the following variables/columns:
category_1_red_length
category_1_red_width
category_1_red_depth
category_1_blue_length
category_1_blue_width
category_1_blue_depth
category_1_green_length
category_1_green_width
category_1_green_depth
category_2_red_length
category_2_red_width
category_2_red_depth
category_2_blue_length
category_2_blue_width
category_2_blue_depth
category_2_green_length
category_2_green_width
category_2_green_depth
Plus a load more. Is there a way to dynamically generate the following count()?
count(observations,
category_1_red_length,
category_1_red_width,
category_1_red_depth,
category_1_blue_length,
category_1_blue_width,
category_1_blue_depth,
category_1_green_length,
category_1_green_width,
category_1_green_depth,
category_2_red_length,
category_2_red_width,
category_2_red_depth,
category_2_blue_length,
category_2_blue_width,
category_2_blue_depth,
category_2_green_length,
category_2_green_width,
category_2_green_depth,
sort=TRUE)
I can create the list of columns I want to count with:
columns_to_count = list()
column_prefix = 'category'
aspects = c('red', 'blue', 'green')
dimensions = c('length', 'width', 'depth')
for (x in 1:2) {
for (aspect in aspects) {
for (dimension in dimensions) {
columns_to_count = append(columns_to_count, paste(column_prefix, x, aspect, dimension, sep='_'))
}
}
}
But then how do I pass my list of columns in columns_to_count to the count() function?
In my actual data set there are about 170 columns like this that I want to count so creating the list of columns without loops doesn't seem sensible.
Struggling to think of the name for what I'm trying to do so unable to find useful search results.
Thanks.

You can use non-standard evaluation with syms() and the !!! (unquote-splice) operator. For example, using the mtcars dataset:
library(dplyr)
library(rlang)
cols <- c('am', 'cyl')
mtcars %>% count(!!!syms(cols), sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
This is the same as doing
mtcars %>% count(am, cyl, sort = TRUE)
# am cyl n
#1 0 8 12
#2 1 4 8
#3 0 6 4
#4 0 4 3
#5 1 6 3
#6 1 8 2
You don't need to type the names in cols one by one by hand. You can use a regex if the column names follow a specific pattern, or use column positions to get the appropriate names.
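As a minimal sketch of the pattern-based approach: the toy observations data frame below is hypothetical (only two of the question's column names plus an unrelated other_column), and grep() on names() is just one way to collect the matching names.

```r
library(dplyr)
library(rlang)

# Hypothetical toy data standing in for the real `observations` tibble
observations <- data.frame(
  category_1_red_length = c(1, 1, 2),
  category_1_red_width  = c(5, 5, 6),
  other_column          = c("x", "y", "z")
)

# Collect every column name that starts with "category_"
cols <- grep("^category_", names(observations), value = TRUE)

# Turn the strings into symbols and splice them into count()
result <- observations %>% count(!!!syms(cols), sort = TRUE)
result
```

The same cols vector could equally be built with paste() or a loop, as in the question; only the !!!syms(cols) step is needed to hand it to count().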

You can use .dots to receive strings as variables:
count(observations, .dots=columns_to_count, sort=TRUE)
r$> d
V1 V2
1 1 4
2 2 5
3 3 6
r$> count(d, .dots=list('V1', 'V2'))
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1
r$> count(d, V1, V2)
# A tibble: 3 x 3
V1 V2 n
<int> <int> <int>
1 1 4 1
2 2 5 1
3 3 6 1

Related

How to use a for loop to change consecutive values in R?

How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. Your desired output is the difference between consecutive rows of the original data set, so first take a copy (DF) of the relevant columns before the loop starts. The loop then reads from the untouched copy DF and writes into df; otherwise each iteration would subtract a row that has already been overwritten, and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed the changed row is used in calculating row i+1 rather than the original row i. To fix that run the loop in reverse order. That is use nrow(df):2 in the for statement. Alternately try one of these which do not use any loops and also have the advantage of not overwriting the input -- something which makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
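The reverse-order fix described at the start of this answer can be sketched as follows. It is the loop from the question with only the iteration order changed, and it assumes df has at least two rows (with a single row, nrow(df):2 would count upwards as 1:2).

```r
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")

# Walk from the last row up to row 2, so that row - 1 still holds
# its original (not yet differenced) values when it is subtracted
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}
df
```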
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 6 1 1
#2 2 4 2 1
#3 4 8 3 1
#4 1 2 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1

What is the best way to apply a function to a range of values from another column in R data.frame so it remains vectorized?

I have several columns in R data.frame, and I want to create a new column based on ranges of values from some already existing column. Those ranges are not regular and are determined by start and end values written in first two columns. I want the calculation to remain vectorized. I don't want a for loop underneath.
required result, achieved with a for loop:
df = data.frame(start=c(2,1,4,4,1), end=c(3,3,5,4,2), values=c(1:5))
for (i in 1:nrow(df)) {
df[i, 'new'] <- sum(df[df[i, 'start']:df[i, 'end'], 'values'])
}
df
Here is a base R one-liner.
mapply(function(x1, x2, y){sum(y[x1:x2])}, df[['start']], df[['end']], MoreArgs = list(y = df[['values']]))
#[1] 5 6 9 4 3
And another one.
sapply(seq_len(nrow(df)), function(i) sum(df[['values']][df[i, 'start']:df[i, 'end']]))
#[1] 5 6 9 4 3
Here is an option with map2:
library(purrr)
library(dplyr)
df %>%
mutate(new = map2_dbl(start, end, ~ sum(values[.x:.y])))
-output
# start end values new
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or with rowwise
df %>%
rowwise %>%
mutate(new =sum(.$values[start:end])) %>%
ungroup
-output
# A tibble: 5 x 4
# start end values new
# <dbl> <dbl> <int> <int>
#1 2 3 1 5
#2 1 3 2 6
#3 4 5 3 9
#4 4 4 4 4
#5 1 2 5 3
Or using data.table
library(data.table)
setDT(df)[, new := sum(df$values[start:end]), seq_len(nrow(df))]
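The options above still iterate row by row under the hood. One fully vectorized alternative, which is not taken from the answers above, is a prefix-sum (cumsum) lookup. This sketch assumes start <= end in every row and that both are valid row indices; the name cs is introduced here for illustration.

```r
df <- data.frame(start = c(2, 1, 4, 4, 1), end = c(3, 3, 5, 4, 2), values = 1:5)

# cs[k + 1] holds the sum of the first k values, with cs[1] = 0
cs <- c(0, cumsum(df$values))

# Sum of values[start:end] is the difference of two prefix sums
df$new <- cs[df$end + 1] - cs[df$start]
df
```

Unlike start:end, this does not handle rows where start is greater than end, so it is only a drop-in replacement when the ranges are increasing.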

How should a function be applied by row on a dataframe to generate a new or expanded dataframe in r

I am trying to expand an existing dataset, which currently looks like this:
df <- tibble(
site = letters[1:3],
years = rep(4, 3),
tr = c(3, 6, 4)
)
tr is the total number of replicates for each site/year combination. I simply want to add in the replicates and later the response variable for each replicate. This was easy for a single site/year combination using the following function:
f <- function(site=NULL, years=NULL, t=NULL){
df <- tibble(
site = rep(site, each = t, times= years),
tr = rep(1:t, times = years),
year = rep(1:years, each = t)
)
df
}
# For one site:
f(site='a', years=4, t=3)
# Producing this:
# # A tibble: 12 x 3
# site tr year
# <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
# 10 a 1 4
# 11 a 2 4
# 12 a 3 4
How can the function be applied to each row of the input dataframe to produce the final dataframe? One of the apply functions in base r or the pmap_df() in the purrr package would seem ideal, but being unfamiliar with how these functions work, all my efforts have only produced errors.
If we want to apply the same function, use pmap
library(purrr)
pmap_dfr(df, ~ f(..1, ..2, ..3))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Another option is condense from the devel version of dplyr:
library(tidyr)
df %>%
group_by(rn = row_number()) %>%
condense(out = f(site, years, tr)) %>%
unnest(c(out))
Or in base R, we can also use do.call with Map
do.call(rbind, do.call(Map, c(f, unname(as.data.frame(df)))))
Well, in base R, you could do:
do.call(rbind,do.call(Vectorize(f,SIMPLIFY = FALSE),unname(df)))
# A tibble: 52 x 3
site tr year
* <chr> <int> <int>
1 a 1 1
2 a 2 1
3 a 3 1
4 a 1 2
5 a 2 2
6 a 3 2
7 a 1 3
8 a 2 3
9 a 3 3
10 a 1 4
# ... with 42 more rows
do.call(rbind, lapply(split(df, df$site), function(x){
with(x, data.frame(site,
years = rep(sequence(years), each = tr),
tr = rep(sequence(tr), years)))
}))
We can use Map to apply f to every value of site, years and tr.
do.call(rbind, Map(f, df$site, df$years, df$tr))
# A tibble: 52 x 3
# site tr year
# * <chr> <int> <int>
# 1 a 1 1
# 2 a 2 1
# 3 a 3 1
# 4 a 1 2
# 5 a 2 2
# 6 a 3 2
# 7 a 1 3
# 8 a 2 3
# 9 a 3 3
#10 a 1 4
# … with 42 more rows
Akrun's answer worked well for me, so I modified it to make the function to be applied to each row of the dataframe a little more explicit:
df1 <- pmap_df(df, function(site, years, tr){
site = rep(site, each = tr, times=years)
year = rep(1:years, each = tr)
tr = rep(1:tr, times=years)
return(tibble(site, year, tr))
})
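The ..1, ..2, ..3 placeholders are needed above because the column tr does not match the argument name t in f. As a hedged sketch, renaming the column first lets pmap() match columns to arguments by name. Here pmap() plus bind_rows() stands in for pmap_dfr(), and the names pieces and out are introduced only for illustration.

```r
library(purrr)
library(dplyr)

df <- tibble(site = letters[1:3], years = rep(4, 3), tr = c(3, 6, 4))

f <- function(site = NULL, years = NULL, t = NULL) {
  tibble(
    site = rep(site, each = t, times = years),
    tr   = rep(1:t, times = years),
    year = rep(1:years, each = t)
  )
}

# Rename tr to t so every column name matches an argument of f;
# pmap() then calls f(site = ..., years = ..., t = ...) once per row
pieces <- pmap(rename(df, t = tr), f)

# Stack the per-row tibbles into one result
out <- bind_rows(pieces)
out
```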

Dense ranking of column based on order of second column

I am beating my brains out on something that is probably straightforward. I want to get a "dense" ranking (as defined for the data.table::frank function) on a column in a data frame, but not based on the column's own order; the order should be given by another column (val in my example).
I managed to get the dense ranking with @Prasad Chalasani's solution, like this:
library(dplyr)
foo_df <- data.frame(id = c(4,1,1,3,3), val = letters[1:5])
foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id)))
#> id val id_fac
#> 1 4 a 3
#> 2 1 b 1
#> 3 1 c 1
#> 4 3 d 2
#> 5 3 e 2
But I would like the factor levels to be ordered based on val. Desired output:
foo_desired <- foo_df %>% arrange(val) %>% mutate(id_fac = as.integer(factor(id, levels = c(4,1,3))))
foo_desired
#> id val id_fac
#> 1 4 a 1
#> 2 1 b 2
#> 3 1 c 2
#> 4 3 d 3
#> 5 3 e 3
I tried data.table::frank
I tried both methods by @Prasad Chalasani.
I tried setting the order of id using id[rank(val)] (and sort(val), and order(val)).
Finally, I also tried to sort the levels using rank(val) etc, but this throws an error (Evaluation error: factor level [3] is duplicated.)
I know that one can specify the level order, I used this for creation of the desired output. This solution is however not great as my data has way more rows and levels
I need that for convenience, in order to produce a table with a specific order, not for computations.
Created on 2018-12-19 by the reprex package (v0.2.1)
You can check with first
foo_df %>% arrange(val) %>%
group_by(id)%>%mutate(id_fac = first(val))%>%
ungroup()%>%
mutate(id_fac=as.integer(factor(id_fac)))
# A tibble: 5 x 3
id val id_fac
<dbl> <fctr> <int>
1 4 a 1
2 1 b 2
3 1 c 2
4 3 d 3
5 3 e 3
Why do you even need factors? Not sure if I am missing something, but this gives your desired output.
You can use match to get id_fac based on the occurrence of ids.
library(dplyr)
foo_df %>%
mutate(id_fac = match(id, unique(id)))
# id val id_fac
#1 4 a 1
#2 1 b 2
#3 1 c 2
#4 3 d 3
#5 3 e 3
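Note that match(id, unique(id)) numbers the ids in order of first appearance, so it relies on the rows already being sorted by val. A small sketch with deliberately shuffled rows (the data frame shuffled_df is hypothetical) shows arrange() being applied first:

```r
library(dplyr)

# Same ids and vals as foo_df, but with the rows shuffled
shuffled_df <- data.frame(
  id  = c(3, 1, 4, 1, 3),
  val = c("e", "b", "a", "c", "d")
)

ranked <- shuffled_df %>%
  arrange(val) %>%                         # put rows in val order first
  mutate(id_fac = match(id, unique(id)))   # then rank ids by first appearance
ranked
```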

R add rows to grouped df using dplyr

I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as.tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 8 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if it's the best way, but it works.
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment
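For comparison, a dplyr-only sketch of the same idea, which is not from the answer above: find the distinct id/item_code pairs, join them to additional_rows, and stack the result on top of df. It assumes item_code is stored as character in both tables, and new_rows and result are illustrative names.

```r
library(dplyr)

df <- tibble(id = rep(1:3, each = 2),
             item_code = c("A", "A", "B", "B", "B", "Z"),
             score = rep(1, 6))
additional_rows <- tibble(item_code = c("A", "Z"), score = c(6, 6))

# One extra row per (id, item_code) pair that has a match in additional_rows
new_rows <- df %>%
  distinct(id, item_code) %>%
  inner_join(additional_rows, by = "item_code")

# Stack and sort so that each added row sits on top of its group
result <- bind_rows(new_rows, df) %>%
  arrange(id, item_code, desc(score))
result
```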
