Dataframe Column is an Offset of Multiple Column Values - Elegant solution desired

I am looking for an elegant way to build a column in which each value is taken from the column whose position is given, for that row, by 'relative_column_position'. The desired result is shown below (radio).
My actual data has thousands of rows and ~300 different column positions in 'relative_column_position', so a hand-built solution like this one is not an option.
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
gaga
radio <- tibble( c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
radio

Base R answer using matrix subsetting -
gaga <- data.frame(gaga)
result <- data.frame(value = gaga[cbind(seq_len(nrow(gaga)),
gaga$relative_column_position + 1)])
result
# value
#1 1
#2 2
#3 3
#4 16
#5 17
#6 18
#7 19
#8 20
#9 21
#10 34
#11 35
#12 36
The + 1 in gaga$relative_column_position + 1 is needed because the value columns start at the 2nd column of the dataset. So when gaga$relative_column_position is 1, we actually want to take the value from the 2nd column of gaga.
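If the value columns are not guaranteed to sit right after relative_column_position, the same matrix subsetting can look the target column up by name instead of relying on + 1. A minimal sketch, assuming the columns follow the col_<n> naming used in the question (col_idx and value are names made up for this illustration):

```r
gaga <- data.frame(
  relative_column_position = c(rep(1, 3), rep(2, 6), rep(3, 3)),
  col_1 = 1:12,
  col_2 = 13:24,
  col_3 = 25:36
)

# Find each row's target column by name, so column order does not matter
col_idx <- match(paste0("col_", gaga$relative_column_position), names(gaga))

# Two-column index matrix: (row number, column number) for every row
value <- gaga[cbind(seq_len(nrow(gaga)), col_idx)]
value
```

Matrix indexing on a data.frame goes through as.matrix(), so this is only safe when all columns share one type, as they do here.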

Here is a base R solution in two steps.
library(tibble)
gaga <- tibble(relative_column_position = c(rep(1,3), rep(2,6), rep(3,3) ),
col_1 = 1:12,
col_2 = 13:24,
col_3 = 25:36
)
radio <- tibble(c(gaga$col_1[1:3],
gaga$col_2[4:9],
gaga$col_3[10:12])
)
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
unlist(mapply(\(x, i) x[i], gaga[-1], rcp))
#> col_11 col_12 col_13 col_21 col_22 col_23 col_24 col_25 col_26 col_31 col_32
#> 1 2 3 16 17 18 19 20 21 34 35
#> col_33
#> 36
Created on 2022-05-21 by the reprex package (v2.0.1)
As a tibble:
rcp <- split(seq_along(gaga$relative_column_position), gaga$relative_column_position)
radio <- tibble(rcp = unlist(mapply(\(x, i) x[i], gaga[-1], rcp)))
rm(rcp)
radio
#> # A tibble: 12 × 1
#> rcp
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 16
#> 5 17
#> 6 18
#> 7 19
#> 8 20
#> 9 21
#> 10 34
#> 11 35
#> 12 36
Created on 2022-05-21 by the reprex package (v2.0.1)
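Note that unlist() returns the values grouped by position, which matches the row order here only because the positions in the example are already sorted. A hedged sketch of how the result could be put back into row order when positions are interleaved (pos, cols, picked and out are hypothetical names, and the toy data is not from the question):

```r
# Toy data with interleaved positions
pos  <- c(2, 1, 3, 1, 2)
cols <- data.frame(col_1 = 1:5, col_2 = 11:15, col_3 = 21:25)

# Row indices belonging to each position
rcp <- split(seq_along(pos), pos)

# Pick the rows of each column; select columns by the positions actually present
picked <- unlist(
  mapply(function(x, i) x[i], cols[as.integer(names(rcp))], rcp, SIMPLIFY = FALSE),
  use.names = FALSE
)

# Write the picked values back at their original row numbers
out <- integer(length(pos))
out[unlist(rcp, use.names = FALSE)] <- picked
out
```

Selecting cols by names(rcp) also keeps mapply() from pairing the wrong column with a group when some position never occurs.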

Assuming you have a relative column to map over, you can use apply and mutate:
library(dplyr)
df |>
mutate(rel = apply(df, 1, \(x) x[colnames(df)[x["relative_col"]]] ))
Applied to your example (the + 1 skips the relative_column_position column itself):
gaga |>
mutate(rel = apply(gaga, 1, \(x) x[colnames(gaga)[x["relative_column_position"] + 1]] ))

Related

How to expand rows and fill in the numbers between given start and end

I have this data frame:
df <- tibble(x = c(1, 10))
x
<dbl>
1 1
2 10
I want this:
x
<int>
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Unfortunately I can't remember how to approach this. I tried expand.grid, uncount, and runner::fill_run.
Update: The real-world data looks like this, with groups and a given start and end number. Here are only two groups:
df <- tibble(group = c("A", "A", "B", "B"),
x = c(10,30, 1, 10))
group x
<chr> <dbl>
1 A 10
2 A 30
3 B 1
4 B 10

We may need full_seq (from tidyr) with either summarise or reframe, or tidyr::complete
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
reframe(x = full_seq(x, period = 1))
# or with
#tidyr::complete(x = full_seq(x, period = 1))
-output
# A tibble: 31 × 2
group x
<chr> <dbl>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
7 A 16
8 A 17
9 A 18
10 A 19
# … with 21 more rows

A simple base R variation:
group <- c(rep("A", 21), rep("B", 10))
x <- c(10:30, 1:10)
df <- tibble(group, x)
df
# A tibble: 31 × 2
group x
<chr> <int>
1 A 10
2 A 11
3 A 12
4 A 13
5 A 14
6 A 15
And here's an expand.grid solution:
g1 <- expand.grid(group = "A", x = 20:30)
g2 <- expand.grid(group = "B", x = 1:10)
df <- rbind(g1, g2)
df
group x
1 A 20
2 A 21
3 A 22
4 A 23
5 A 24
6 A 25
7 A 26

Using base (lapply rather than sapply, so the result stays a list even when all groups have the same length):
stack(lapply(split(df$x, df$group), function(i) seq(i[1], i[2])))
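stack() names its output columns values and ind. If the original column names should be kept, one possible base R variant is to build a small data frame per group and bind them. This is only a sketch; expanded is a name introduced here, and it assumes each group's smallest and largest x are the start and end:

```r
df <- data.frame(group = c("A", "A", "B", "B"),
                 x = c(10, 30, 1, 10),
                 stringsAsFactors = FALSE)

# One data frame per group, running from the group's min to its max
expanded <- do.call(rbind, lapply(split(df, df$group), function(d) {
  data.frame(group = d$group[1], x = seq(min(d$x), max(d$x)),
             stringsAsFactors = FALSE)
}))
rownames(expanded) <- NULL
head(expanded)
```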

How to delete entire rows from a dataframe based on the date the data was collected?

Let's say I have this example dataframe (but a lot bigger)
df = data.frame(ID_number = c(111,111,111,22,22,33,33),
date = c('2021-06-14','2021-06-12','2021-03-11',
'2021-05-20','2021-05-14',
'2018-04-20','2017-03-14'),
answers = 1:7,
sex = c('F','M','F','M','M','M','F') )
The output
ID_number date answers sex
1 111 2021-06-14 1 F
2 111 2021-06-12 2 M
3 111 2021-03-11 3 F
4 22 2021-05-20 4 M
5 22 2021-05-14 5 M
6 33 2018-04-20 6 M
7 33 2017-03-14 7 F
We can see that there are 7 different members, but whoever created the dataframe made a mistake and assigned the same ID_number to members 1, 2 and 3, the same ID_number to members 4 and 5, and so on.
The dataframe contains the date on which each member's data was collected, and I wish to keep only the member with the earliest date for each ID_number. The resulting dataframe would look like this
ID_number date answers sex
1 111 2021-03-11 3 F
2 22 2021-05-14 5 M
3 33 2017-03-14 7 F
Appreciate the help.

You could filter on the min date per group_by like this:
library(dplyr)
df %>%
group_by(ID_number) %>%
filter(date == min(date))
#> # A tibble: 3 × 4
#> # Groups: ID_number [3]
#> ID_number date answers sex
#> <dbl> <chr> <int> <chr>
#> 1 111 2021-03-11 3 F
#> 2 22 2021-05-14 5 M
#> 3 33 2017-03-14 7 F
Created on 2023-01-04 with reprex v2.0.2

With slice_min:
library(dplyr)
df %>%
group_by(ID_number) %>%
slice_min(date)
In dplyr 1.1.0 and later, you can use inline grouping instead of group_by. Note that the slice_*() helpers spell the argument by, not .by:
df %>%
slice_min(date, by = ID_number)

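Using base R (date is converted with as.Date first, since as.numeric on the character dates would give NA)
subset(df, as.numeric(as.Date(date)) == ave(as.numeric(as.Date(date)), ID_number, FUN = min))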
ID_number date answers sex
3 111 2021-03-11 3 F
5 22 2021-05-14 5 M
7 33 2017-03-14 7 F
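The comparisons above keep every row that ties for the earliest date. If exactly one row per ID_number is wanted, a possible base R sketch is to sort and then drop repeated IDs (sorted and earliest are names introduced for this example):

```r
df <- data.frame(ID_number = c(111, 111, 111, 22, 22, 33, 33),
                 date = c('2021-06-14', '2021-06-12', '2021-03-11',
                          '2021-05-20', '2021-05-14',
                          '2018-04-20', '2017-03-14'),
                 answers = 1:7,
                 sex = c('F', 'M', 'F', 'M', 'M', 'M', 'F'),
                 stringsAsFactors = FALSE)

# Work with real dates rather than strings
df$date <- as.Date(df$date)

# Sort by ID, then by date, and keep the first row seen for each ID
sorted   <- df[order(df$ID_number, df$date), ]
earliest <- sorted[!duplicated(sorted$ID_number), ]
earliest
```

The rows come back ordered by ID_number (22, 33, 111) rather than in the original order.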

adding rows in datasets for missing values with R

I am working with R.
I have a list of datasets where each dataset should have 5 rows, one for each month (Jan-May). It should look like this:
data.frame(name = rep("B", 5),
doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"),
i_name = rep("Aa",5),
aggregation = rep("34", 5))
But some of my datasets don't have data for certain months, or are completely empty, and therefore have fewer rows or no rows at all. Like this:
data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = "34")
I would like to extend each dataset, even the empty ones, with the missing months, copy all the other information into the new rows, and put a 0 for aggregation.
I tried to use extend and complete from tidyr but couldn't make it work.

Using tidyr's complete, with purrr's reduce to bind the dataframes together first.
I also changed aggregation to numeric: rep(34, 5).
library(tidyverse)
df1 <- data.frame(name = rep("B", 5),
doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"),
i_name = rep("Aa",5),
aggregation = rep(34, 5))
df2 <- data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = 34)
reduce(list(df1, df2, df1), bind_rows) |>
complete(doc_month, nesting(name, i_name), fill = list(aggregation = 0))
#> # A tibble: 15 × 4
#> doc_month name i_name aggregation
#> <chr> <chr> <chr> <dbl>
#> 1 2022.01 A Aa 34
#> 2 2022.01 B Aa 34
#> 3 2022.01 B Aa 34
#> 4 2022.02 A Aa 0
#> 5 2022.02 B Aa 34
#> 6 2022.02 B Aa 34
#> 7 2022.03 A Aa 0
#> 8 2022.03 B Aa 34
#> 9 2022.03 B Aa 34
#> 10 2022.04 A Aa 0
#> 11 2022.04 B Aa 34
#> 12 2022.04 B Aa 34
#> 13 2022.05 A Aa 0
#> 14 2022.05 B Aa 34
#> 15 2022.05 B Aa 34
Created on 2022-06-10 by the reprex package (v2.0.1)

You could create a skeleton dataset with the five months and then join it to each of your partial datasets.
library(dplyr)
library(tidyr)
data_A <- data.frame(name = "A",
doc_month = "2022.01",
i_name = "Aa",
aggregation = "34")
reference <- data.frame(doc_month = c("2022.01", "2022.02", "2022.03", "2022.04", "2022.05"))
data_A |>
full_join(reference, by = "doc_month") |>
mutate(aggregation = replace_na(aggregation, "0")) |>
fill(name, i_name)
Output:
#> name doc_month i_name aggregation
#> 1 A 2022.01 Aa 34
#> 2 A 2022.02 Aa 0
#> 3 A 2022.03 Aa 0
#> 4 A 2022.04 Aa 0
#> 5 A 2022.05 Aa 0
Created on 2022-06-10 by the reprex package (v2.0.1)
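Since the question mentions a list of datasets, some of them empty, the skeleton idea could be wrapped in a small helper and applied with lapply. This is only a sketch in base R: pad_months is a hypothetical helper, aggregation is assumed to be numeric, and name and i_name are assumed to be constant within a dataset. For a completely empty dataset there is nothing to copy, so those two columns stay NA.

```r
months <- sprintf("2022.%02d", 1:5)

# Hypothetical helper: pad one dataset to the full set of months
pad_months <- function(d, months) {
  idx <- match(months, d$doc_month)   # NA where a month is missing
  out <- data.frame(
    name        = if (nrow(d) > 0) d$name[1] else NA_character_,
    doc_month   = months,
    i_name      = if (nrow(d) > 0) d$i_name[1] else NA_character_,
    aggregation = d$aggregation[idx],
    stringsAsFactors = FALSE
  )
  out$aggregation[is.na(out$aggregation)] <- 0   # missing months get 0
  out
}

datasets <- list(
  partial = data.frame(name = "A", doc_month = "2022.01", i_name = "Aa",
                       aggregation = 34, stringsAsFactors = FALSE),
  empty   = data.frame(name = character(), doc_month = character(),
                       i_name = character(), aggregation = numeric(),
                       stringsAsFactors = FALSE)
)

padded <- lapply(datasets, pad_months, months = months)
padded$partial
```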

Break a small sentence in multiple rows with a single string each in R dplyr

I have a data frame that looks like this
library(tidyverse)
data=data.frame(POS=c(172367,10), SNP=c("ATCG","AG"), QUAL=c(30,20))
data
#> POS SNP QUAL
#> 1 172367 ATCG 30
#> 2 10 AG 20
Created on 2022-02-02 by the reprex package (v2.0.1)
and I want to make it look like this
POS SNP QUAL
172367 A 30
172368 T 30
172369 C 30
172370 G 30
10 A 20
11 G 20
I want to break the multi-character string into rows with a single character each, and then adjust the position as well.
Any help is highly appreciated.

You can do:
library(dplyr)
library(tidyr)
data %>%
separate_rows(SNP, sep = "(?<=[ACGT])") %>%
mutate(POS = ave(POS, POS, FUN = \(x) x + seq_along(x) - 1))
# A tibble: 6 x 3
POS SNP QUAL
<dbl> <chr> <dbl>
1 172367 A 30
2 172368 T 30
3 172369 C 30
4 172370 G 30
5 10 A 20
6 11 G 20

Dynamic Columns in Dplyr using NSE on the RHS

I am attempting to reference existing columns in dplyr through a loop. Effectively, I would like to evaluate the operations from one table (evaluation in below example) to be performed to another table (dt in below example). I do not want to hardcode the column names on the RHS within mutate(). I would like to control the evaluations being performed from the evaluation table below. So I am trying to make the process dynamic.
Here is a sample dataframe:
dt = data.frame(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
Here is a table of sample operations to be performed:
evaluation = data.frame(
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
What I am trying to do is the following:
for (i in 1:nrow(evaluation)) {
var = evaluation$New_Var[i]
dt = dt %>%
rowwise() %>%
mutate(!!var := ifelse(eval(parse(text = evaluation$Operation[i])),
evaluation$Result[i],
!!var))
}
My desired result would be something like this, except that the "AA" values in the AA column would be the original numeric values of the AA column (1, 1, 1, 1, 1).
UPDATED:
I believe my syntax in the "False" part of the ifelse statement is incorrect. What is the correct syntax to specify "!!var" in the false portion of the ifelse statement?
I know there are other ways to do it using base R, but I would rather do it through dplyr as it is cleaner code to look at. I am using "rowwise()" to do it element by element.

Modified data to (a) enforce type consistency for columns AA and BB and (b) ensure that at least one row satisfies the second condition.
dt = tibble(
A = c(1:20),
B = c(10:29), ## Note the change
C = c(21:40),
AA = rep("a", 20), ## Note initialization with strings
BB = rep("b", 20) ## Ditto
)
To make your loop work, you need to convert your code strings into actual expressions. You can use rlang::sym() for variable names and rlang::parse_expr() for everything else.
for( i in 1:nrow(evaluation) )
{
var <- rlang::sym(evaluation$New_Var[i])
op <- rlang::parse_expr(evaluation$Operation[i])
dt = dt %>% rowwise() %>%
mutate(!!var := ifelse(!!op, evaluation$Result[i],!!var))
}
# # A tibble: 20 x 5
# A B C AA BB
# <int> <int> <int> <chr> <chr>
# 1 1 10 21 a False
# 2 2 11 22 a False
# 3 3 12 23 a b
# 4 4 13 24 a b
# 5 5 14 25 a b
# 6 6 15 26 a b
# 7 7 16 27 a b
# 8 8 17 28 a b
# 9 9 18 29 a b
# 10 10 19 30 True b
# 11 11 20 31 True b
# 12 12 21 32 True b
# 13 13 22 33 True b
# 14 14 23 34 True b
# 15 15 24 35 True b
# 16 16 25 36 True b
# 17 17 26 37 True b
# 18 18 27 38 True b
# 19 19 28 39 True b
# 20 20 29 40 True b
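To see why the two rlang helpers are not interchangeable, here is a small sketch on toy data (toy, flags and bad are names introduced here, not part of the answer). sym() always produces a single name, even when handed a whole expression as text, while parse_expr() produces a call that can be evaluated against the data:

```r
library(rlang)

# sym() builds a symbol (a bare name); parse_expr() builds a full expression
var <- sym("AA")
op  <- parse_expr("(A*2) > B")

is_symbol(var)
is_call(op)

# Passing expression text to sym() still yields one (oddly named) symbol
bad <- sym("(A*2) > B")
is_symbol(bad)

# Toy data (not from the question) to evaluate the parsed test against
toy   <- data.frame(A = 1:3, B = c(1, 5, 5))
flags <- eval_tidy(op, data = toy)
flags
```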

Assuming that Felipe's answer was the functionality you desired, here's a more "tidyverse"/pipe-oriented/functional approach.
Data
library(rlang)
library(dplyr)
library(purrr)
operations <- tibble(
old_var = exprs(A, B),
new_var = exprs(AA, BB),
test = exprs(2*A > B, 2*B <= C),
result = exprs("True", "False")
)
original <- tibble(
A = sample.int(30, 10),
B = sample.int(30, 10),
C = sample.int(30, 10)
)
original
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 4 20 5
2 30 29 11
3 1 27 14
4 2 21 4
5 17 19 24
6 14 25 9
7 5 22 22
8 6 13 7
9 25 4 21
10 12 11 12
Functions
# Here's your reusable functions
generic_mutate <- function(dat, new_var, test, result, old_var) {
dat %>% mutate(!!new_var := ifelse(!!test, !!result, !!old_var))
}
generic_ops <- function(dat, ops) {
pmap(ops, generic_mutate, dat = dat) %>%
reduce(full_join)
}
generic_mutate takes a single original dataframe, a single new_var, etc. It performs the test, adds the new column with the appropriate name and values.
generic_ops is the "vectorized" version. It takes the original dataframe as the first argument, and a dataframe of operations as the second. It then parallel maps over each column of new variable names, tests, etc, and calls generic_mutate on each one. That results in a list of dataframes, each with one added column. The reduce then combines them back all together with a sequential full_join.
Results
original %>%
generic_ops(operations)
Joining, by = c("A", "B", "C")
# A tibble: 10 x 5
A B C AA BB
<int> <int> <int> <chr> <chr>
1 4 20 5 4 20
2 30 29 11 True 29
3 1 27 14 1 27
4 2 21 4 2 21
5 17 19 24 True 19
6 14 25 9 True 25
7 5 22 22 5 22
8 6 13 7 6 13
9 25 4 21 True False
10 12 11 12 True 11
The magic here is using exprs(...) so you can store NSE names and operations in a tibble without forcing their evaluation. I think this is a lot cleaner than storing names and operations in strings with quotation marks.

How's this:
evaluation = data.frame(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c("(A*2) > B", "(B*2) <= C"),
Result = c("True", "False")
) %>% mutate_all(as.character)
for (i in 1:nrow(evaluation)) {
old <- sym(evaluation$Old_Var[i])
new <- sym(evaluation$New_Var[i])
op <- sym(evaluation$Operation[i])
res <- sym(evaluation$Result[i])
dt <- dt %>%
mutate(!!new := ifelse(!!op, !!res, !!old))
}
EDIT: My last answer doesn't work because sym() turns the whole string into a single name, so rlang looks for a variable literally named (A*2) > B instead of evaluating the expression. I got this to work using a mix of tidy evaluation and base R. You can of course follow @Brian's advice and use this solution with pmap. I honestly don't know how well this will perform though: I think it may evaluate the ifelse once per row, and I am not sure it's a vectorized operation...
dt <- tibble(
A = c(1:20),
B = c(11:30),
C = c(21:40),
AA = rep(1, 20),
BB = rep(2, 20)
)
evaluation = tibble(
Old_Var = c('A', 'B'),
New_Var = c("AA", "BB"),
Operation = c('(A*2) > B', '(B*2) <= C'),
Result = c("True", "False")
)
for (i in 1:nrow(evaluation)) {
old <- evaluation$Old_Var[i]
new <- evaluation$New_Var[i]
op <- evaluation$Operation[i]
res <- evaluation$Result[i]
dt <- dt %>%
mutate(!!sym(new) := eval(parse(text = sprintf('ifelse(%s, "%s", %s)', op, res, old))))
}
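On the vectorization question: ifelse() is vectorized, so without rowwise() each rule is evaluated once over the whole column. A base R sketch of the same loop, using the original numeric dt from the question (str2lang() parses one expression from a string; test is a name introduced here for illustration):

```r
dt <- data.frame(
  A = 1:20,
  B = 11:30,
  C = 21:40,
  AA = rep(1, 20),
  BB = rep(2, 20)
)
evaluation <- data.frame(
  New_Var   = c("AA", "BB"),
  Operation = c("(A*2) > B", "(B*2) <= C"),
  Result    = c("True", "False"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(evaluation))) {
  new  <- evaluation$New_Var[i]
  # Evaluate the rule once, with dt's columns in scope: one logical per row
  test <- eval(str2lang(evaluation$Operation[i]), dt)
  # Replace where the rule holds, keep the existing value elsewhere
  dt[[new]] <- ifelse(test, evaluation$Result[i], dt[[new]])
}
head(dt, 12)
```

Because a column cannot mix numbers and strings, AA ends up as character ("1" and "True"), while BB stays numeric here since its rule never holds for this data.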

One way is to rework the conditions first, then pass them to mutate :
conds <- parse(text=evaluation$Operation) %>%
as.list() %>%
setNames(evaluation$New_Var) %>%
imap(~expr(ifelse(!!.,"True", !!sym(.y))))
conds
#> $AA
#> ifelse((A * 2) > B, "True", AA)
#>
#> $BB
#> ifelse((B * 2) <= C, "True", BB)
dt %>% mutate(!!!conds)
#> A B C AA BB
#> 1 1 11 21 1 2
#> 2 2 12 22 1 2
#> 3 3 13 23 1 2
#> 4 4 14 24 1 2
#> 5 5 15 25 1 2
#> 6 6 16 26 1 2
#> 7 7 17 27 1 2
#> 8 8 18 28 1 2
#> 9 9 19 29 1 2
#> 10 10 20 30 1 2
#> 11 11 21 31 True 2
#> 12 12 22 32 True 2
#> 13 13 23 33 True 2
#> 14 14 24 34 True 2
#> 15 15 25 35 True 2
#> 16 16 26 36 True 2
#> 17 17 27 37 True 2
#> 18 18 28 38 True 2
#> 19 19 29 39 True 2
#> 20 20 30 40 True 2
