This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
(11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as in this example, the distribution of val depends on gr. The breakpoints are the within-group quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
purrr::map_dfr(1:3, function(gr) {
c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
})) %>%
nest(bp = -gr) %>%
mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
~findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
Both do work. However, they are highly inefficient: with my actual data (1.2 million rows) they take several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a group_by + mutate step:
library(dplyr)
df %>%
group_by(gr) %>%
mutate(breakpoints = findInterval(val,
c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied for each gr separately.
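If the 1.2-million-row case is still slow, the same per-group findInterval can be expressed in data.table (a sketch, not part of the original answer; it assumes data.table is installed):
library(data.table)
dt <- as.data.table(df)
dt[, lvl := findInterval(val,
                         c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf)),
   by = gr]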
Related
I have information referring to groups of products stored in two matrices:
First matrix: is relatively small, contains one row per group, and then some numerical variables
Second matrix: it is large. Contains multiple rows per group and then some numerical variables.
As a minimal code example, see the following:
df_with_group_values = data.frame(
group= 1:6,
matrix(rnorm(mean=100, 30), nrow=6))
df_with_data = data.frame(
product_id = 1:1000,
group = sample(6, 1000, replace=T),
matrix(rnorm(mean=100, 5000), nrow=1000))
The matrix df_with_group_values has only 6 rows (one per group) and 6 columns: one identifying the group and 5 other numerical variables. The matrix df_with_data contains 1000 rows, with the product_id, the group, and the same 5 numerical variables.
I want, for each group index, to compute the distance (here I am considering a weighted euclidean distance) between the associated row in the dataset df_with_group_values, and all the rows referring to that group index in df_with_data.
I could do this by doing a for loop that iterates on the group index and subsetting at each iteration the large matrix, as I show here:
unique_group_idx = df_with_group_values$group
weights = rnorm(5)
for(idx in seq_along(unique_group_idx))
{
filtered_df_with_data = df_with_data %>%
filter(group == unique_group_idx[idx])
matrix_mean = matrix(rep(as.numeric(df_with_group_values[idx, -1]), nrow(filtered_df_with_data)), nrow=nrow(filtered_df_with_data), byrow=TRUE)
distance = sqrt(rowMeans(weights * (filtered_df_with_data[,3:7] - matrix_mean)^2))
if(idx == 1)
{
df_distancias = data.frame(product_id = filtered_df_with_data$product_id,
distance = distance)
} else{
df_distancias = df_distancias %>%
rbind(data.frame(product_id = filtered_df_with_data$product_id, distance = distance))
}
}
df_distancias = df_distancias %>% arrange(product_id)
But I was looking for something faster, as this can get slow when the number of groups increases, and maybe more elegant. I have been thinking about how I could use the group_by function, as this seems like a case where it could fit, but I have got nothing so far.
How about something like this with dplyr:
library(dplyr)
library(tidyr)
set.seed(123)
df_with_group_values = data.frame(
group= 1:6,
matrix(rnorm(mean=100, 30), nrow=6))
df_with_data = data.frame(
product_id = 1:1000,
group = sample(6, 1000, replace=T),
matrix(rnorm(mean=100, 5000), nrow=1000))
weights = rnorm(5)
weights <- exp(weights)/sum(exp(weights))
df_with_data <- setNames(df_with_data, c("product_id", "group", paste0("d_", 1:5)))
df_with_group_values <- setNames(df_with_group_values, c("group", paste0("g_", 1:5)))
weight_data <- do.call(data.frame, as.list(weights))
weight_data <- setNames(weight_data, paste0("w_", 1:5))
df_distancias <- df_with_data %>%
left_join(df_with_group_values, by="group") %>%
bind_cols(weight_data) %>%
pivot_longer(-c("product_id", "group"), names_pattern="(.*)_(.*)", names_to=c(".value", "dim")) %>%
group_by(product_id, group) %>%
summarise(distance = sqrt(mean(weights*(d-g)^2)))
#> `summarise()` has grouped output by 'product_id'. You can override using the
#> `.groups` argument.
df_distancias
#> # A tibble: 1,000 × 3
#> # Groups: product_id [1,000]
#> product_id group distance
#> <int> <int> <dbl>
#> 1 1 5 0.693
#> 2 2 4 0.363
#> 3 3 5 0.674
#> 4 4 2 0.393
#> 5 5 1 0.423
#> 6 6 1 0.301
#> 7 7 3 0.488
#> 8 8 1 0.774
#> 9 9 6 0.439
#> 10 10 5 0.653
#> # … with 990 more rows
Created on 2022-11-17 by the reprex package (v2.0.1)
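A vectorized base-R sketch of the same weighted distance (not from the answer above; it assumes the renamed columns d_1..d_5 / g_1..g_5 and the normalised weights defined there, and that the group values 1:6 directly index the rows of df_with_group_values). Skipping the long reshape may help when the number of groups grows:
grp_mat  <- as.matrix(df_with_group_values[, paste0("g_", 1:5)])
dat_mat  <- as.matrix(df_with_data[, paste0("d_", 1:5)])
# squared differences between each product row and its group's row
diff_sq  <- (dat_mat - grp_mat[df_with_data$group, , drop = FALSE])^2
# weight each dimension, average across dimensions, take the square root
dist_vec <- sqrt(rowMeans(sweep(diff_sq, 2, weights, `*`)))
df_distancias_base <- data.frame(product_id = df_with_data$product_id,
                                 group = df_with_data$group,
                                 distance = dist_vec)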
I'm looking for some kind of conditional rolling sum. I thought a while loop would do what I need, but I'm having trouble implementing it. The sum should look like PCAR[1]*time[1] + PCAR[2]*time[2] + PCAR[3]*time[3] and so on, where [] references the row of the column; it should accumulate until the cumulative time value reaches 100 years, add this value to a column, then start again for cumulative times between 100 and 200, and so on until the bottom of the data set. It will be applied to data sets of varying sizes spanning tens of thousands of years.
I hope that makes sense. In the example data below the PCAR_BIN column is what I'm aiming for as the outcome.
df <- tibble(cumulative.time = c(20,40,60,80,100, 120,140,160,180,200),
PCAR =1:10,
time = 1:10,
depth.along.core = 1:10,
Age.cal.BP = 1:10,
AFBD = 1:10,
assumed.C = rep(0.5, 10),
PCAR_BIN = c(55,330,NA,NA,NA,NA,NA,NA,NA,NA))
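To make the target explicit: with this example data the first century bin covers rows 1-5 (cumulative.time <= 100), so PCAR_BIN = 1*1 + 2*2 + 3*3 + 4*4 + 5*5 = 55; the second bin covers rows 6-10 and gives 6*6 + 7*7 + 8*8 + 9*9 + 10*10 = 330, matching the PCAR_BIN column above.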
The function looks like
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up,
PCAR_BIN = ifelse(cumulative.time <= 100, sum(PCAR*time+lead(PCAR)*lead(time),NA)
)}
Obviously I had no luck with the ifelse statement, as it only works for one iteration of time and the sum is wrong. I've tried something similar with while and for loops, but with no luck. Part of the problem is that I'm not sure how to express the sum that I need. I've also tried binning the data with case_when and working off that, again with no luck.
Thanks people :)
EDIT
Following Martin's method I now have the function working up to creating the ROLLSUM column. I now need to create a column that gives the maximum value for each century group. Running the code from slice_max onward gives me the error:
Error in eval(lhs, parent, parent) : object 'tmp' not found
I've added the real data too.
dput(head(EMC))
structure(list(depth.along.core = c(0.5, 1.5, 2.5, 3.5, 4.5,
5.5), Age.cal.BP = c(-56.016347625, -55.075825875, -54.201453125,
-53.365755375, -52.541258625, -51.700488875), time = c(0.94052175,
0.87437275, 0.83569775, 0.82449675, 0.84076975, 0.88451675),
cumulative.time = c(0.94052175, 1.8148945, 2.65059225, 3.475089,
4.31585875, 5.2003755), AFBD = c(0.0711, 0.057, 0.0568, 0.0512,
0.0559, 0.0353), assumed.C = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5
)), row.names = c(NA, 6L), class = "data.frame")
MBA <- function(data) {
require(dplyr)
data %>% mutate(PCAR=((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)%>%
slice(1:(n()-1))%>%
group_by(Century = cut(cumulative.time, breaks = seq(0, max(cumulative.time), 100)))%>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))%>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)%>%
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))}
You could try this:
# Get cumulative sums by group (assuming per century groups)
df <- df %>%
group_by(Century = cut(cumulative.time,
breaks = seq(0, max(cumulative.time), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR * time)))
# Get maximum of each group
groupMaxima <- df %>%
slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
# Fill column as desired
df$ROLLSUM <- c(groupMaxima, rep(NA, nrow(df) - length(groupMaxima)))
We simply create a factor column to group the cumulative time column by centuries and use that factor to sum up the values. Lastly we edit the rolling sum column to contain only the max values and fill the other rows with NA.
# A tibble: 10 x 10
# Groups: Group [2]
cumulative.time PCAR time depth.along.core Age.cal.BP AFBD assumed.C PCAR_BIN Group ROLLSUM
<dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <fct> <int>
1 20 1 1 1 1 1 0.5 55 (0,100] 55
2 40 2 2 2 2 2 0.5 330 (0,100] 330
3 60 3 3 3 3 3 0.5 NA (0,100] NA
4 80 4 4 4 4 4 0.5 NA (0,100] NA
5 100 5 5 5 5 5 0.5 NA (0,100] NA
6 120 6 6 6 6 6 0.5 NA (100,200] NA
7 140 7 7 7 7 7 0.5 NA (100,200] NA
8 160 8 8 8 8 8 0.5 NA (100,200] NA
9 180 9 9 9 9 9 0.5 NA (100,200] NA
10 200 10 10 10 10 10 0.5 NA (100,200] NA
Edit:
For this special case:
MBA <- function(data) {
require(dplyr)
data <- data %>% mutate(PCAR = ((lead(depth.along.core) - depth.along.core )/(lead(Age.cal.BP) - Age.cal.BP))*AFBD*assumed.C*10000,
PCA_NCP = PCAR*(lead(Age.cal.BP)-Age.cal.BP),
PCA_NCP[is.na(PCA_NCP)] <- 0,
CCP_Bottom_Up = rev(cumsum(rev(PCA_NCP))),
CCP_Top_Down = CCP_Bottom_Up[1]- CCP_Bottom_Up)
data <- data %>%
group_by(CTIME = cut(cumsum(cumulative.time),
breaks = seq(0, max(cumsum(cumulative.time), na.rm = T), 100))) %>%
mutate(ROLLSUM = rev(cumsum(PCAR*time)))
groupMaxima <- data %>% slice_max(order_by = ROLLSUM, n = 1) %>%
pull(ROLLSUM)
data$ROLLSUM <- c(groupMaxima, rep(NA, nrow(data) - length(groupMaxima)))
data
}
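A minimal usage sketch (assuming EMC holds the full core data rather than just the six-row head shown above, since the cut() breaks need cumulative times beyond 100 years):
EMC_result <- MBA(EMC)  # ROLLSUM holds each century group's maximum, then NAs
head(EMC_result)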
There are a number of ways. If your steps are really steps of 100 years and the values go 0, 20, 40 in constant intervals, you can do this natively:
steps = 100
intervals = 20
ratio = steps / intervals
columns = df[,c("PCAR","time")]
indices = rep(ratio, nrow(df) / ratio) %>% cumsum  # last row of each century block
PCAR_BIN = lapply(indices, function(x){
  localRange = (x - ratio + 1):x                   # rows belonging to this block
  sum(columns[["PCAR"]][localRange] * columns[["time"]][localRange])
}) %>% unlist
we can now bind PCAR_BIN, padding with NA so its length matches the number of rows:
df = cbind(df, PCAR_BIN = c(PCAR_BIN, rep(NA, nrow(df) - length(PCAR_BIN))))
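With the example tibble above this reproduces the target column: PCAR_BIN is 55 for the first century block (rows 1-5) and 330 for the second (rows 6-10), with NA for the remaining rows.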
This question already has answers here:
Performing dplyr mutate on subset of columns
(5 answers)
Closed 3 years ago.
I'm trying to calculate row means on a dataset. I found a helpful function someone made here (dplyr - using mutate() like rowmeans()), and it works when I try out every column but not when I try to use a dplyr helper function.
Why does this work:
#The rowmeans function that works
my_rowmeans = function(..., na.rm=TRUE){
x =
if (na.rm) lapply(list(...), function(x) replace(x, is.na(x), as(0, class(x))))
else list(...)
d = Reduce(function(x,y) x+!is.na(y), list(...), init=0)
Reduce(`+`, x)/d
}
#The data
library(tidyverse)
data <- tibble(id = c(1:4),
turn_intent_1 = c(5, 1, 1, 4),
turn_intent_2 = c(5, 1, 1, 3),
turn_intent_3R = c(5, 5, 1, 3))
#The code that is cumbersome but works
data %>%
mutate(turn_intent_agg = my_rowmeans(turn_intent_1, turn_intent_2, turn_intent_3R))
#The output
# A tibble: 4 x 5
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
But this does not work:
#The code
data %>%
mutate(turn_intent_agg = select(., contains("turn")) %>%
my_rowmeans())
#The output
Error in class1Def#contains[[class2]] : no such index at level 1
Of course, I can type each column, but this dataset has many columns. It'd be much easier to use these wrappers.
I need the output to look like the correct one shown that contains all columns (such as id).
Thank you!
I think that you can simplify it to:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn"))))
id turn_intent_1 turn_intent_2 turn_intent_3R turn_intent_agg
<int> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 5 5
2 2 1 1 5 2.33
3 3 1 1 1 1
4 4 4 3 3 3.33
And you can indeed add also the na.rm = TRUE parameter:
data %>%
mutate(turn_intent_agg = rowMeans(select(., contains("turn")), na.rm = TRUE))
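If you are on dplyr 1.1.0 or later (an assumption about your installed version), pick() does the same selection without the . pronoun:
data %>%
  mutate(turn_intent_agg = rowMeans(pick(contains("turn")), na.rm = TRUE))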
Say I have 900 dataframes at hand, and I want to get something similar to a frequency distribution, based on another column, for each "type".
Sample code:
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
This is similar to what I currently have just working with one dataframe:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
In the case that for any row, Sepal.Length = 0, then I'm not going to count it (count column will be = 0 for that row value).
But I'm going to have 900 dataframes that I'll be running this code on, so I was thinking about running it through a loop.
Ideally, if two dataframes are inputted, and both have Sepal.Length values >0 for row 1, then I want the count to be 2 for row 1 / type 1. Is there a better way to approach this? And if I do go for the looping option then is there a way to combine all the dataframes to tell R that row 1 / type 1 has multiple > 0 values?
For your iris example, what it sounds like you want is:
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
group_by(df2, type) %>%
transmute(has_sepal = sum(Sepal.Length > 0))
# A tibble: 7 x 2
# Groups: type [7]
# type has_sepal
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 1
# 7 7 1
To do this over 900 data frames: if you only need it to work on iris-like data, you can hard-code the column names. Someone familiar with writing functions using tidyverse evaluation could write a more general version for you, but that's still on my to-do list.
f_fill_in_blank_first <- function(tib){
  # hard code <var1> (the grouping column) and <var2> (the value column)
  group_by(tib, <var1>) %>%
    transmute(var1_not_zero = sum(<var2> != 0))
}
f_iris <- function(tib){
  group_by(tib, type) %>%
    transmute(var1_not_zero = sum(Sepal.Length != 0))
}
Depending on the structure of your 900 data frames, you could vapply with this function (edit: no, not this function; refactor it so that it produces a named atomic vector if you want to vapply it) to put the whole thing into an array, then collapse one of the dimensions with apply and sum.
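As a sketch of the "combine first, then count" idea the question raises (df_list is a hypothetical name for a list holding the 900 data frames; this assumes they all share the type and Sepal.Length columns and have one row per type, as in the df2 example):
library(dplyr)
combined_counts <- bind_rows(df_list, .id = "source") %>%  # stack all data frames, tagging each one
  group_by(type) %>%
  summarise(Count = sum(Sepal.Length > 0),                 # data frames with a nonzero value for this type
            Percentage = Count / n_distinct(source))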
If you want to keep your code:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
You can wrap it into a function (add_a_count):
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- df1 %>%
mutate(type = 1:nrow(df1))
add_a_count = function(df)
{
counted_df = df %>%
select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0),
Percentage = Count/7)
return(counted_df)
}
I generate 100 duplicates of the test df2 with the following function:
duplicate_df = function(df, no_duplicates)
{
tmp_df_list = list()
for(i in c(1:no_duplicates))
{
print(paste0("Duplicate ", i, " generated."))
tmp_df_list[[i]] = df
}
return(tmp_df_list)
}
data_frames_list = duplicate_df(df = df2, no_duplicates = 100)
And use it with lapply: counted_data_frames = lapply(data_frames_list, add_a_count)
The list counted_data_frames can relatively easily be manipulated (You can use another apply function if you want a non-list output). This might not be the fastest way to do it, but it's straightforward.
EDIT
You can get your Counts columns via looping over the list of data frames. A new data frame counts_data_frame contains all counts with every column being counts of one original data frame:
counts_data_frame = data.frame(type = seq(from = 1, to = nrow(df2)))
for(i in c(1:length(counted_data_frames)))
{
counts_data_frame = cbind(counts_data_frame, as.vector(counted_data_frames[[i]]["Count"]))
}
When looping over the rows of this new data frame, you can sum up your counts and get a vector of counts for plotting:
counts_summarised = vector(length = nrow(counts_data_frame))
for(i in c(1:nrow(counts_data_frame)))
{
counts_summarised[i] = sum(counts_data_frame[i, 2:ncol(counts_data_frame)])
}
plot(counts_summarised, ylab = "Counts", xlab = "Type")
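The two explicit loops in this edit can likely be collapsed into vectorized calls; a sketch, assuming counted_data_frames is the list built above:
# one column per data frame, one row per type
count_matrix <- sapply(counted_data_frames, function(d) d$Count)
counts_summarised <- rowSums(count_matrix)
plot(counts_summarised, ylab = "Counts", xlab = "Type")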
In this solution, I will show you how to:
import all CSV files into separate data frames in a list, assuming that they all have the same column name for the variable you are interested in and that the files are in one folder (your working directory, preferably);
count the number of 0 and nonzero measurements and their proportions;
convert the list into a dataframe
Specifically, I used lapply() for looping through the data.frames, converting the list to a data.frame using enframe(), unnesting the value column with unnest(), and spreading the pct by type using spread().
Let's first create a data to work with.
library(tidyverse)
# create a list
datlist <- list()
# this list will contain ten data frames with
# a sample with up to 8 0's and 20 random uniforms as observations
for (i in seq_len(10)){
datlist[[i]] = data.frame(x = sample(c(sample(c(0,1,2,3,4), 8, replace = T), runif(20,0,10))))
}
# name each element of the list datlist
name_element <- LETTERS[1:10]
datlist <- set_names(datlist, name_element)
# save each file separately
mapply(write.csv, datlist, file=paste0(names(datlist), '.csv'), row.names = FALSE)
The following will import your data into R and store them as data.frames in a list.
# import all csv files in the folder into separate data frames in the temp list
temp <- list.files(pattern = "*.csv")
myfiles <- lapply(temp, read.csv)
The following will calculate the percentages by type if we assume that each file contains the same variables.
# Calculate the frequency and relative distributions
lapply(myfiles,
function(varname) mutate(varname, type = if_else(x == 0, 0, 1)) %>%
group_by(type) %>% summarise(n = n()) %>%
mutate(pct = n / sum(n))
) %>%
enframe() %>% # convert the list into a data.frame
unnest(value) %>% # unnest the values
spread(type, pct) # spread the values by type
# A tibble: 17 x 4
name n `0` `1`
<int> <int> <dbl> <dbl>
1 1 3 0.107 NA
2 1 25 NA 0.893
3 2 28 NA 1.00
4 3 1 0.0357 NA
5 3 27 NA 0.964
6 4 2 0.0714 NA
7 4 26 NA 0.929
8 5 28 NA 1.00
9 6 28 NA 1.00
10 7 2 0.0714 NA
11 7 26 NA 0.929
12 8 3 0.107 NA
13 8 25 NA 0.893
14 9 1 0.0357 NA
15 9 27 NA 0.964
16 10 1 0.0357 NA
17 10 27 NA 0.964
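Note that spread() has since been superseded in tidyr; on a recent version (an assumption about your setup) the last step of the pipe can be written as pivot_wider(names_from = type, values_from = pct) with the same result.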
I am pretty new to R, so this question may be a bit naive.
I have got a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins, inside a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = RT), because I want this to be flexible.
I've tried several ways of referring to a column name using this variable, but I cannot get it to work. Amongst others, I have tried get(), eval(), and [[]].
Simplified example code:
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with `lazyeval`:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
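mutate_() and the lazyeval interface are deprecated in current dplyr; on a recent dplyr (an assumption about the installed version) the same function can be written with the .data pronoun:
library(dplyr)
library(ggplot2)  # for cut_number()
f2 <- function(colName, bin){
  data_st %>%
    group_by(Subject) %>%
    mutate(Bin = cut_number(.data[[colName]], bin, labels = FALSE))
}
f2(bin2use, nbin)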