I need to operate columns based on their name condition. In the following reproducible example, per each column that ends with 'x', I create a column that multiplies by 2 the respective variable:
library(dplyr)
set.seed(8)
id <- seq(1,700, by = 1)
a1_x <- runif(700, 0, 10)
a1_y <- runif(700, 0, 10)
a2_x <- runif(700, 0, 10)
df <- data.frame(id, a1_x, a1_y, a2_x)
#Create variables manually: For every column that ends with X, I need to create one column that multiplies the respective column by 2
df <- df %>%
mutate(a1_x_new = a1_x*2,
a2_x_new = a2_x*2)
Since I'm working with several columns, I need to automate this process. Does anybody know how to achieve this? Thanks in advance!
Try this:
df %>% mutate(
across(ends_with("x"), ~ .x*2, .names = "{.col}_new")
)
Thanks #RicardoVillalba for correction.
You could use transmute and across to generate the new columns for those column names ending in "x". Then, use rename_with to add the "_new" suffix and bind_cols back to the original data frame.
library(dplyr)
df <- df %>%
transmute(across(ends_with("x"), ~ . * 2)) %>%
rename_with(., ~ paste0(.x, "_new")) %>%
bind_cols(df, .)
Result:
head(df)
id a1_x a1_y a2_x a1_x_new a2_x_new
1 1 4.662952 0.4152313 8.706219 9.325905 17.412438
2 2 2.078233 1.4834044 3.317145 4.156466 6.634290
3 3 7.996580 1.4035441 4.834126 15.993159 9.668252
4 4 6.518713 7.0844794 8.457379 13.037426 16.914759
5 5 3.215092 3.5578827 8.196574 6.430184 16.393149
6 6 7.189275 5.2277208 3.712805 14.378550 7.425611
Related
I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select or make an operation only with those with suffixes lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
If we need to be more flexible, extract the digit part from the column names that starts with 'tot', subset based on the condition and use that new names
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is the rare case for the often forgotten num_range selection helper from dplyr, which extracts the numbers from the names in a single step, then selects a range:
determine the threshold
suffix_threshold <- 3
Select( )
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(.fn = sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I intend to create multiple data frame from a data like below:
ID Time Ethnicity LDL HDL ....
1 1 black
2 2 white
3 1 black
4 2 White
each data frame is mean values of the column LDL, HDL, ... in 4 rows displayed in the data. I used the following code but the problem is all the data frames are identical. I mean DF[[1]] is the same as DF[[2]], ...DF[[15]]. I would appreciate if you could help me find the solution.
dv=c(names(data[,4:15]))
library(ggplot2)
require(plyr)
for (i in 1:12) {
DF[[i]] = ddply(data, c("Time", "Ethnicity"), summarize,
Mean = mean(data[[paste(dv[i])]], na.rm = T))
}
plyr is retired, you could use dplyr. When you do mean(data[[paste(dv[i])]], you are subsetting the entire column and not respecting groups. Hence, you get the same mean for all the values in DF[[1]], DF[[2]] etc.
library(dplyr)
output_df <- data %>%
group_by(Time, Ethnicity) %>%
summarise_at(4:15, mean, na.rm = TRUE) %>%
ungroup
If you want list of dataframes you could use group_split :
DF <- output_df %>% group_split(Time, Ethnicity)
I am new to R . I have a large dataset with 1-minute resolution for one year . It makes total of 55940 observation all 1 minute apart with dates and times . I want to change it to six minute resolution data. It necessarily means adding first 6 rows then next 6 and so on so forth . Any good solutions ?
You could try something like this:
library(dplyr)
# original df
df <- data.frame(min = 1:60, val = rnorm(60))
# create a grouping variable and add to df
grp <- floor(df$min / 6)
df <- data.frame(grp, df)
# create new df at 6 min level
new.df <- df %>%
group_by(grp) %>%
summarise(new.val = sum(val))
Another option with a similar approach
library(dplyr)
# original dataframe
n <- 55940
df <- data.frame(id = 1:n , val = rnorm(n))
# new dataframe
df_new <- df %>%
group_by(cut(df$id, n/6)) %>%
summarise(new.val = sum(val))
I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets<-split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
Produces a df[10] that looks like:
$`10`
predicted_duration actual_duration
177188 23.7402944 6
466561 23.7402663 12
479556 23.7401721 5
147585 23.7401666 48
Here's the crude code I am using to try to sum the groups.
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(as.data.frame(df[i],row.names=NULL)$X1.actual_duration) # X1., X2.,
print(paste(i,"=",p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df<-data.frame(actual_duration=sample(100))
num_dfs <- 10
df_grouped<-as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(df[i]$actual_duration) # does not work because postfix .1, .2.. was added by R
print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group couldn't you use
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
arrange(actual_duration) %>%
mutate(group = rep(1:10, each = 10)) %>%
group_by(group) %>%
summarise(sums = sum(actual_duration))
alternatively if you want to keep the list format
df %>%
arrange(actual_duration) %>%
mutate(group = factor(rep(1:10, each = 10))) %>%
split(., .$group) %>%
map(., function(x) sum(x$actual_duration))