Merging multiple connected columns - r

I have two different columns for several samples, which are connected. I want to merge all columns of type 1 to one column and all of type 2 to one column, but the rows should stay connected.
Example:
a1 <- c(1, 2, 3, 4, 5)
b1 <- c(1, 4, 9, 16, 25)
a2 <- c(2, 4, 6, 8, 10)
b2 <- c(4, 8, 12, 16, 20)
df1 <- data.frame(a1, b1, a2, b2)
a1 b1 a2 b2
1 1 1 2 4
2 2 4 4 8
3 3 9 6 12
4 4 16 8 16
5 5 25 10 20
I want to have it like this:
a b
1 1 1
2 2 4
3 2 4
4 3 9
5 4 8
6 4 16
7 5 25
8 6 12
9 8 16
10 10 20
My case
This is the example in my case. I have a lot of columns with different names and I want to extract abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new data frame, with all abs_dist in one column and all mean_vel in one column, but still connected.
I tried with unlist, but then of course the connection gets lost.
Thanks in advance.

A base R option using reshape
subset(
reshape(
setNames(df1, gsub("(\\d)", ".\\1", names(df1))),
direction = "long",
varying = 1:ncol(df1)
),
select = -c(time, id)
)
gives
a b
1.1 1 1
2.1 2 4
3.1 3 9
4.1 4 16
5.1 5 25
1.2 2 4
2.2 4 8
3.2 6 12
4.2 8 16
5.2 10 20

An option with pivot_longer from tidyr by specifying the names_sep as a regex lookaround to match between a lower case letter ([a-z]) and a digit in the column names
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])") %>%
select(-grp)
-output
# A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 1 1
# 2 2 4
# 3 2 4
# 4 4 8
# 5 3 9
# 6 6 12
# 7 4 16
# 8 8 16
# 9 5 25
#10 10 20
With the edited post, we need to change the names_sep i.e. the delimiter is now _ between a lower case letter and a digit
df1 %>%
pivot_longer(cols = everything(), names_to = c( '.value', 'grp'),
names_sep = "(?<=[a-z])_(?=[0-9])") %>%
select(-grp)
or with base R, use split.default on the substring of column names into a list of data.frame, then unlist each list element by looping over the list and convert to data.frame
data.frame(lapply(split.default(df1, sub("\\d+", "", names(df1))),
unlist, use.names = FALSE))
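As a minimal sketch of how the same split.default idea might carry over to production-style names, assume the columns are called abs_dist_1, mean_vel_1, ... and sit next to unrelated columns. The toy values, the `other` column and the name `prod_df` are made up for illustration. The rows stay connected only if both stems appear in the same suffix order in the data, which is assumed here.

```r
# Hypothetical stand-in for the production data (values are invented)
prod_df <- data.frame(
  abs_dist_1 = c(1, 2), mean_vel_1 = c(10, 20),
  abs_dist_2 = c(3, 4), mean_vel_2 = c(30, 40),
  other = c("x", "y")
)

# Keep only the columns of interest and strip the numeric suffix
keep <- grep("^(abs_dist|mean_vel)_\\d+$", names(prod_df), value = TRUE)
stem <- sub("_\\d+$", "", keep)

# Split the columns by stem, then stack each group into one vector
long <- data.frame(
  lapply(split.default(prod_df[keep], stem), unlist, use.names = FALSE)
)
long
```

Each stem is unlisted in the original column order, so row i of abs_dist still belongs to row i of mean_vel.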

For the sake of completeness, here is a solution which uses data.table::melt() and the patterns() function to specify columns which belong together:
library(data.table)
melt(setDT(df1), measure.vars = patterns(a = "a", b = "b"))[
order(a,b), !"variable"]
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
This reproduces the expected result for OP's sample dataset.
A more realistic example: reshape only selected columns
With the edit of the question, the OP has clarified that the production data contains many more columns than those which need to be reshaped:
I have a lot of columns with different names and I want to extract
abs_dist_1, ... abs_dist_5 and mean_vel_1, ... mean_vel_5 in a new
data frame, with all abs_dist in one column and all mean_vel in one
column, but still connected.
So, the OP wants to extract and reshape the columns of interest in one go while ignoring all other data in the dataset.
To simulate this situation, we need a more elaborate dataset which includes other columns as well:
df2 <- cbind(df1, c1 = 11:15, c2 = 21:25)
df2
a1 b1 a2 b2 c1 c2
1 1 1 2 4 11 21
2 2 4 4 8 12 22
3 3 9 6 12 13 23
4 4 16 8 16 14 24
5 5 25 10 20 15 25
With a modified version of the code above
library(data.table)
cols <- c("a", "b")
result <- melt(setDT(df2), measure.vars = patterns(cols), value.name = cols)[, ..cols]
setorderv(result, cols)
result
we get
a b
1: 1 1
2: 2 4
3: 2 4
4: 3 9
5: 4 8
6: 4 16
7: 5 25
8: 6 12
9: 8 16
10: 10 20
For the production dataset as pictured in the edit, the OP needs to set
cols <- c("abs_dist", "mean_vel")

Related

Keep row as soon as cumulative value reaches a certain threshold R

I have a dataframe where I would like to keep a row as soon as the cumulative value of a column reaches a certain level. The dataset could look like this:
set.seed(0)
n <- 10
dat <- data.frame(id=1:n,
group=rep(LETTERS[1:2], n/2),
age=sample(18:30, n, replace=TRUE),
type=factor(paste("type", 1:n)),
x=abs(rnorm(n)))
dat
id group age type x
1 1 A 26 type 1 0.928567035
2 2 B 21 type 2 0.294720447
3 3 A 24 type 3 0.005767173
4 4 B 18 type 4 2.404653389
5 5 A 19 type 5 0.763593461
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
8 8 B 28 type 8 0.289461574
9 9 A 19 type 9 0.299215118
10 10 B 28 type 10 0.411510833
I want to keep a row as soon as the cumulative value of x reaches a threshold (e.g. 1), and start counting again once a row has been retained. This would result in the following output:
id group age type x
2 2 B 21 type 2 0.294720447
4 4 B 18 type 4 2.404653389
6 6 B 30 type 6 0.799009249
7 7 A 24 type 7 1.147657009
10 10 B 28 type 10 0.411510833
I am trying to get a dplyr based solution but can't seem to figure it out. Any tips?
You can use purrr::accumulate to compute a cumulative sum that resets once the threshold is passed, then use dplyr::slice_tail to keep the last row of each group, i.e. the row at which the running sum passes the threshold:
library(dplyr)
library(purrr)
dat %>%
group_by(a = cumsum(x == accumulate(x, ~ ifelse(.x <= 1, .x + .y, .y)))) %>%
slice_tail(n = 1)
# id group age type x a
# 1 2 B 21 type 2 0.295 1
# 2 4 B 18 type 4 2.40 2
# 3 6 B 30 type 6 0.799 3
# 4 7 A 24 type 7 1.15 4
# 5 10 B 28 type 10 0.412 5
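To make the reset logic explicit, here is a minimal base R sketch with a plain loop. The helper name `keep_at_threshold` is hypothetical, and the vector `x` is just the question's values rounded to four decimals. Unlike the pipeline above, it treats a running sum equal to the threshold as reached (`>=`) and it does not keep a trailing group that never reaches the threshold; both are assumptions about the desired behaviour.

```r
# Hypothetical helper: flag the rows at which the running sum reaches
# the threshold, restarting the sum after every flagged row
keep_at_threshold <- function(x, threshold = 1) {
  keep <- logical(length(x))
  running <- 0
  for (i in seq_along(x)) {
    running <- running + x[i]
    if (running >= threshold) {
      keep[i] <- TRUE
      running <- 0  # start counting again after a retained row
    }
  }
  keep
}

# The question's x values, rounded to four decimals
x <- c(0.9286, 0.2947, 0.0058, 2.4047, 0.7636,
       0.7990, 1.1477, 0.2895, 0.2992, 0.4115)
which(keep_at_threshold(x))
```

With a data frame, the logical vector can be used directly for row subsetting, e.g. `dat[keep_at_threshold(dat$x), ]`.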
Another option is to use MESS::cumsumbinning, which might be more friendly to use:
library(MESS)
library(dplyr)
dat %>%
group_by(a = cumsumbinning(x, 1, cutwhenpassed = TRUE)) %>%
slice_tail(n = 1)
Mael beat me with the cumsumbinning() from the MESS-package...
Here is a data.table option using that function:
library(MESS)
library(data.table)
setDT(dat)[, .SD[.N], by = MESS::cumsumbinning(x, 1, cutwhenpassed = TRUE)]
# MESS id group age type
# 1: 1 2 B 21 type 2
# 2: 2 4 B 18 type 4
# 3: 3 6 B 30 type 6
# 4: 4 7 A 24 type 7
# 5: 5 10 B 28 type 10

apply function or loop within mutate

Let's say I have a data frame. I would like to create new columns with mutate by subtracting pairs of existing columns. The pairs follow a rule. In the code below, the first component of each subtraction always has the prefix base_g00 and the second always has the prefix allow_m00. The first component's id runs from 27 to 43 and the second's from 20 to 36, i.e. the first id minus 7. Can I rewrite the following code with an apply function or a loop inside mutate to make it simpler? Thanks in advance for any suggestions!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach:
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util_", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
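A small runnable sketch of the same idea on toy data, assuming only three column pairs and invented values (the data frame `toy` is hypothetical). It derives the second id from the first with the offset of 7 described in the question and uses sprintf to build the zero-padded names.

```r
# Toy data with three column pairs (values are invented)
toy <- data.frame(
  base_g0027 = c(10, 20), base_g0028 = c(11, 21), base_g0029 = c(12, 22),
  allow_m0020 = c(1, 2), allow_m0021 = c(3, 4), allow_m0022 = c(5, 6)
)

ids <- 27:29
base_cols  <- sprintf("base_g%04d", ids)        # base_g0027, ...
allow_cols <- sprintf("allow_m%04d", ids - 7L)  # allow_m0020, ...
new_cols   <- paste0("annual_util_", seq_along(ids))

# Data frames are subtracted element-wise by position
toy[new_cols] <- toy[base_cols] - toy[allow_cols]
toy[new_cols]
```

Because the subtraction is positional, the two name vectors must list the pairs in the same order.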

How to replace repeating entries in a data frame with n-(number of times it's repeated) in R?

In my data I have repeating entries in a column. What I'm trying to do is if an entry n is repeated more than 2 times within a column, then I want to replace that entry with n-(number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
we can see that in df$A, 7 is repeated 4 times. If an entry is repeated more than 2 times, I want to replace it. So in my example, the 1st and 2nd entries of the number 7 would remain unchanged. The 3rd instance of the number 7 would be replaced by 7 - (3-2). The 4th instance of the number 7 would be replaced by 7 - (4-2).
We can also see that in df$A, the number 2 is repeated 3 times. Using the same method, the 3rd instance of the number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column -
library(dplyr)
df %>%
# group on a copy, since a grouping column itself cannot be modified
group_by(grp = A) %>%
transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
ungroup %>%
select(A)
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~{
df %>%
# group on a copy, since a grouping column itself cannot be modified
group_by(grp = .data[[.x]]) %>%
transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()]) %>%
ungroup %>%
select(all_of(.x))
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that, within each group of equal values, we subtract 0 from the first 2 occurrences and then subtract 1, 2 and so on from the later ones.
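The offsets can also be written out step by step in base R. This is only a sketch for the single vector A from the question; the intermediate names `occurrence` and `offset` are made up for illustration.

```r
A <- c(1, 2, 2, 4, 5, 7, 7, 7, 7, 2, 8, 8)

# Running count of each value: 1 for its first appearance, 2 for the second, ...
occurrence <- ave(A, A, FUN = seq_along)

# Nothing is subtracted for the first two appearances, then 1, 2, ...
offset <- pmax(occurrence - 2, 0)

A - offset
```

For the four 7s the offsets are 0, 0, 1, 2, and for the three 2s they are 0, 0, 1.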
Here is my approach; you can skip the ordering steps if you don't need them. If your data still contains duplicates after these changes, I can extend the answer and wrap it in a function.
library(dplyr)
my_df <- data.frame(A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13),
stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B),]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_temp <- my_df %>% group_by(A) %>% filter(n() > 2) %>% mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>% filter(Count > 2) %>% mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
my_df <- my_df[-my_var,]
my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B),]
A base R option using ave + pmax + seq_along
list2DF(
lapply(
df,
function(x) {
x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
}
)
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13

Grouped pivot_longer dplyr

This is an example dataframe. My real dataframe is larger. I highly prefer a tidyverse solution.
#my data
age <- c(18,18,19)
A1 <- c(3,5,3)
A2 <- c(4,4,3)
B1 <- c(1,5,2)
B2 <- c(2,2,5)
df <- data.frame(age, A1, A2, B1, B2)
I want my data to look like this:
#what i want
new_age <- c(18,18,18,18,19,19)
A <- c(3,5,4,4,3,3)
B <- c(1,5,2,2,2,5)
new_df <- data.frame(new_age, A, B)
I want to pivot longer and stack columns A1:A2 into column A, and B1:B2 into B. I also want to have the responses to match the correct age. For example, the 19 year old person in this example has only responded with 3's in columns A1:A2.
tidyr::pivot_longer(df, cols = -age, names_to = c(".value",'groupid'),
#1+ non digits followed by 1+ digits
names_pattern = "(\\D+)(\\d+)")
# A tibble: 6 x 4
age groupid A B
<dbl> <chr> <dbl> <dbl>
1 18 1 3 1
2 18 2 4 2
3 18 1 5 5
4 18 2 4 2
5 19 1 3 2
6 19 2 3 5
In base R, you can use reshape and then select the columns you want. You can also change the row names.
reshape(df,2:ncol(df),dir = "long",sep="")[,-c(2,5)]
age A B
1.1 18 3 1
2.1 18 5 5
3.1 19 3 2
1.2 18 4 2
2.2 18 4 2
3.2 19 3 5
As you have a larger dataframe, a solution with data.table may be faster. Here, you can use the melt function from the data.table package as follows:
library(data.table)
colA = grep("A",colnames(df),value = TRUE)
colB = grep("B",colnames(df),value = TRUE)
setDT(df)
df <- melt(df, measure = list(colA,colB), value.name = c("A","B"))
df[,variable := NULL]
df <- df[order(age)]
df
age A B
1: 18 3 1
2: 18 5 5
3: 18 4 2
4: 18 4 2
5: 19 3 2
6: 19 3 5
Does it answer your question?
EDIT: Using patterns - suggestion from @Wimpel
As @Wimpel suggested in the comments, you can get the same result using patterns:
melt( setDT(df), measure.vars = patterns( A="^A[0-9]", B="^B[0-9]") )[, variable:=NULL][]
age A B
1: 18 3 1
2: 18 5 5
3: 19 3 2
4: 18 4 2
5: 18 4 2
6: 19 3 5

replace at once multiple columns names which end with different patterns in R

I have a table with hundreds of columns. Their names end either with .a or .b
What I need is to rename every column ending in .a so that it ends in .a_new, and every column ending in .b so that it ends in .b_new, all at once.
I can do it only one pattern at a time but I don't know how to do it at once for all columns.
rename_at_example <- my_table %>% rename_at(vars(ends_with(".a")),
funs(str_replace(., ".a", ".a_new")))
Any idea how to write it in a compact way for all columns?
Thank you
One dplyr option could be:
df %>%
rename_at(vars(matches("[ab]$")), ~ paste0(., "_new"))
col1a_new col2a_new col1b_new col2b_new col1c col2c
1 1 11 1 11 1 11
2 2 12 2 12 2 12
3 3 13 3 13 3 13
4 4 14 4 14 4 14
5 5 15 5 15 5 15
6 6 16 6 16 6 16
7 7 17 7 17 7 17
8 8 18 8 18 8 18
9 9 19 9 19 9 19
10 10 20 10 20 10 20
Sample data:
df <- data.frame(col1a = 1:10,
col2a = 11:20,
col1b = 1:10,
col2b = 11:20,
col1c = 1:10,
col2c = 11:20,
stringsAsFactors = FALSE)
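For comparison, a base R sketch of the same renaming, assuming the names really end in a literal .a or .b as described in the question (the toy table `tbl` is made up). The dot is escaped because an unescaped . in a regular expression matches any character.

```r
# Toy table with names ending in .a, .b and .c (invented for illustration)
tbl <- data.frame(x.a = 1, x.b = 2, y.a = 3, y.c = 4)

# Append "_new" only to names that end in a literal ".a" or ".b"
names(tbl) <- sub("\\.(a|b)$", ".\\1_new", names(tbl))
names(tbl)
```

Names that do not match the pattern, such as y.c, are left untouched.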
If the '.a' and '.b' names need different replacements (rather than the same action, such as appending '_new'), you could use reduce2:
library(tidyverse) # dplyr + purrr for reduce2
df <- data.frame(one.a = 1, one.d = 2, twoa = 3, two.b = 4, three.a = 5)
df
# one.a one.d twoa two.b three.a
# 1 1 2 3 4 5
df %>%
rename_all(~ reduce2(c('\\.a$', '\\.b$'), c('.a_new1', '.b_new2'),
str_replace, .init = .x))
# one.a_new1 one.d twoa two.b_new2 three.a_new1
# 1 1 2 3 4 5
