Use two columns to generate a presence/absence matrix - r

I have a data.frame:
df[19:30,]
gene sample
19 rtxA sample-320
20 rtxA sample-359
21 rtxA sample-594
22 vibF sample-350
23 vibF sample-346
24 vibF sample-349
25 vibF sample-368
26 vibF sample-362
27 vibF sample-363
28 vibF sample-369
29 vibF sample-345
30 vibF sample-326
The excerpt above shows only two genes (rtxA and vibF), but the full data contains 150 genes in the gene column and 21 samples (sample-320 ... sample-x) in the sample column. I want to convert it to a presence/absence data.frame (or matrix) with the samples as columns and the genes as rows. Something like:
sample-320 sample-359 sample-594 sample-350 .....
rtxA 1 1 0 0
vibF 0 0 1 1
1 if present, 0 if absent.

We can use table
+(table(df) > 0)
Or do
table(unique(df))
-output
sample
gene sample-320 sample-326 sample-345 sample-346 sample-349 sample-350 sample-359 sample-362 sample-363 sample-368 sample-369
rtxA 1 0 0 0 0 0 1 0 0 0 0
vibF 0 1 1 1 1 1 0 1 1 1 1
sample
gene sample-594
rtxA 1
vibF 0
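The difference between a plain table() and the two variants above only shows up when a gene/sample pair is repeated. A minimal base R sketch, using a small made-up data frame (not the asker's data) with one deliberate duplicate:

```r
# Toy data with a deliberate duplicate row (hypothetical values)
df <- data.frame(
  gene   = c("rtxA", "rtxA", "rtxA", "vibF"),
  sample = c("sample-320", "sample-320", "sample-359", "sample-350")
)

counts   <- table(df)       # raw counts: the duplicated pair is counted twice
presence <- +(counts > 0)   # compare to 0, then unary plus turns TRUE/FALSE into 1/0

# Plain data.frame with genes as row names and samples as columns
presence_df <- as.data.frame.matrix(presence)
```

Here counts holds 2 for rtxA in sample-320, while presence holds 1, which is what a presence/absence matrix needs.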

A tidyverse solution:
library(dplyr)
library(tidyr)
df %>%
mutate(val = 1) %>%
pivot_wider(names_from = "sample", values_from = val, values_fill = 0)
#> # A tibble: 2 x 13
#> gene `sample-320` `sample-359` `sample-594` `sample-350` `sample-346`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 rtxA 1 1 1 0 0
#> 2 vibF 0 0 0 1 1
#> # ... with 7 more variables: sample-349 <dbl>, sample-368 <dbl>,
#> # sample-362 <dbl>, sample-363 <dbl>, sample-369 <dbl>, sample-345 <dbl>,
#> # sample-326 <dbl>
Data:
df <- read.table(text = " gene sample
rtxA sample-320
rtxA sample-359
rtxA sample-594
vibF sample-350
vibF sample-346
vibF sample-349
vibF sample-368
vibF sample-362
vibF sample-363
vibF sample-369
vibF sample-345
vibF sample-326", header = T, stringsAsFactors = F)

Related

Using for loops with mutate function?

I have a task that's becoming quite difficult for me.
I have to create a variable (pr_test_1) to test whether a variable for a procedure (I10_PR1) is in a list of procedures, and this code is working great:
df <- df %>%
mutate(pr_test_1=ifelse(I10_PR1 %in% abl_pr, 1,0))
However, I have 25 variables for procedures (I10_PR1 to I10_PR25) and I have to create one for each (pr_test_1 to pr_test_25).
I don't seem to find the right syntax to get a for loop to work.
Any help will be greatly appreciated!
dplyr::across allows you to apply a function to multiple columns as specified with a selector (the below uses the starts_with selector).
library(dplyr)
library(purrr)
# sample data
df <- tibble::tibble(
I10_PR1 = sample(100),
I10_PR2 = sample(100),
I10_PR3 = sample(100),
I10_PR4 = sample(100)
)
# a sample list of values to compare against
match_list <- sample(10)
df %>%
mutate(
across(
starts_with("I10_PR"),
~ if_else(.x %in% match_list, 1, 0),
.names = "pr_test_{.col}"
)
)
#> # A tibble: 100 × 8
#> I10_PR1 I10_PR2 I10_PR3 I10_PR4 pr_test_I10_PR1 pr_test_I10…¹ pr_te…² pr_te…³
#> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 93 45 47 46 0 0 0 0
#> 2 91 89 90 76 0 0 0 0
#> 3 16 32 30 24 0 0 0 0
#> 4 66 26 46 41 0 0 0 0
#> 5 53 51 79 9 0 0 0 1
#> 6 36 64 61 32 0 0 0 0
#> 7 45 75 23 25 0 0 0 0
#> 8 86 61 77 52 0 0 0 0
#> 9 17 87 64 53 0 0 0 0
#> 10 6 42 57 33 1 0 0 0
#> # … with 90 more rows, and abbreviated variable names ¹​pr_test_I10_PR2,
#> # ²​pr_test_I10_PR3, ³​pr_test_I10_PR4
Created on 2022-10-26 with reprex v2.0.2
This for() loop works perfectly with your one (slightly modified) line of code and dynamic variable names
for(i in 1:3){
df <- df %>%
mutate(!!paste0("pr_test_",i) := ifelse(!!as.name(paste0("I10_PR",i)) %in% abl_pr, 1,0))
}
Data used:
abl_pr <- sample(LETTERS)[1:10]
I10_PR1 <- sample(LETTERS)
I10_PR2 <- sample(LETTERS)
I10_PR3 <- sample(LETTERS)
df <- data.frame(I10_PR1,I10_PR2,I10_PR3)
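For comparison, the same result can be sketched in base R without a loop or tidy evaluation. The data below is made up for illustration, and the helper names pr_cols and new_names are not from the question. Unlike the across() version above, which names the columns pr_test_I10_PR1 and so on, this produces pr_test_1, pr_test_2, ... as originally requested.

```r
# Hypothetical procedure list and data, for illustration only
abl_pr <- c("A", "B", "C")
df <- data.frame(
  I10_PR1 = c("A", "X", "C"),
  I10_PR2 = c("Y", "B", "Z"),
  I10_PR3 = c("Q", "Q", "A")
)

# Find the procedure columns and derive the matching flag names
pr_cols   <- grep("^I10_PR[0-9]+$", names(df), value = TRUE)
new_names <- sub("^I10_PR", "pr_test_", pr_cols)

# One 0/1 flag column per procedure column
df[new_names] <- lapply(df[pr_cols], function(x) as.integer(x %in% abl_pr))
```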

Slicing using dplyr based on condition in a column

I have a dataframe containing repeated measurements for some individuals, the data.frame looks like this
# A tibble: 853 x 5
ID Test N_ind Pheno Week
<chr> <chr> <int> <dbl> <int>
1 02A01 Int 96 0 12
2 02A01 Int 96 0 24
3 02A01 Int 94 0 36
4 02A01 Int 90 0 48
7 02A01 Int 78 1 84
9 02A03 Int 96 0 12
10 02A03 Int 96 0 24
11 02A03 Int 94 0 36
19 02C03 Int 96 1 12
20 0202C03 Int 96 0 24
21 0202C03 Int 94 0 36
22 0202C03 Int 90 0 48
23 02E02 Int 96 0 12
24 02E02 Int 96 0 24
25 02E02 Int 94 1 36
26 02E02 Int 90 1 48
I want to subset the data.frame by ID: for individuals that have a 1 in the Pheno column, keep the row with the lowest Week among the Pheno == 1 rows; for individuals that only have 0 in the Pheno column, keep the row with the highest Week.
The optimal result should look like this:
ID Test N_ind Pheno Week
<chr> <chr> <int> <dbl> <int>
7 02A01 Int 78 1 84
11 02A03 Int 94 0 36
19 02C03 Int 96 1 12
22 0202C03 Int 90 0 48
25 02E02 Int 94 1 36
I have managed to do that for the 1 values but I am stuck in how to do it for the 0 values.
Here is my code:
df_sub <- data %>%
group_by(ID) %>%
arrange(Week, .by_group = TRUE) %>%
ungroup()
Any help would be much appreciated
You can try -
library(dplyr)
data %>%
group_by(ID) %>%
filter(Week == if(all(Pheno == 0)) max(Week) else min(Week[Pheno == 1])) %>%
ungroup
# ID Test N_ind Pheno Week
# <chr> <chr> <int> <int> <int>
#1 02A01 Int 78 1 84
#2 02A03 Int 94 0 36
#3 02C03 Int 96 1 12
#4 0202C03 Int 90 0 48
#5 02E02 Int 94 1 36
If all the Pheno values in a group are 0 return the max value or else from the Pheno = 1 values return the minimum one.
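The same rule can be written out per group in base R, which may make the two branches easier to follow. This is only a sketch on made-up data; pick_row is a hypothetical helper name, not part of dplyr.

```r
# Hypothetical data: "a" has Pheno == 1 rows, "b" has only zeros
data <- data.frame(
  ID    = c("a", "a", "a", "b", "b"),
  Pheno = c(0, 1, 1, 0, 0),
  Week  = c(12, 24, 36, 12, 24)
)

pick_row <- function(g) {
  if (all(g$Pheno == 0)) {
    # No event in this group: keep the latest week
    g[g$Week == max(g$Week), , drop = FALSE]
  } else {
    # At least one event: keep the earliest week among the Pheno == 1 rows
    hits <- g[g$Pheno == 1, , drop = FALSE]
    hits[hits$Week == min(hits$Week), , drop = FALSE]
  }
}

result <- do.call(rbind, lapply(split(data, data$ID), pick_row))
```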
Using dplyr with if/else inside filter (the if condition must be a single TRUE/FALSE per group, hence all()):
df %>%
group_by(ID) %>%
filter(if (all(Pheno == 0)) Week == max(Week)
else Week == min(Week[Pheno == 1]))
ID Test N_ind Pheno Week
<chr> <chr> <dbl> <dbl> <dbl>
1 02A01 Int 78 1 84
2 02A03 Int 94 0 36
3 02C03 Int 96 1 12
4 0202C03 Int 90 0 48
5 02E02 Int 94 1 36
We can first group_by ID, then keep only the rows where Pheno equals its maximum within the group.
After that, we can filter with the conditions you want, combined with an OR (|): the highest Week when Pheno is 0, the lowest Week when Pheno is 1.
library(dplyr)
df %>% group_by(ID) %>%
filter(Pheno==max(Pheno))%>%
filter(Pheno==0 & Week==max(Week)|Pheno==1 & Week==min(Week))
# A tibble: 5 x 5
# Groups: ID [5]
ID Test N_ind Pheno Week
<chr> <chr> <int> <int> <int>
1 02A01 Int 78 1 84
2 02A03 Int 94 0 36
3 02C03 Int 96 1 12
4 0202C03 Int 90 0 48
5 02E02 Int 94 1 36
This is what I came up with:
data %>%
group_by(ID, Pheno) %>%
filter(case_when(
Pheno == 1 ~ Week == min(Week),
Pheno == 0 ~ Week == max(Week)
)) %>%
ungroup(Pheno) %>%
filter(case_when(
any(Pheno == 1) ~ Pheno == 1,
all(Pheno == 0) ~ Pheno == 0
)) %>%
ungroup()
Which results in:
# A tibble: 5 x 5
ID Test N_ind Pheno Week
<chr> <chr> <int> <int> <int>
1 02A01 Int 78 1 84
2 02A03 Int 94 0 36
3 02C03 Int 96 1 12
4 0202C03 Int 90 0 48
5 02E02 Int 94 1 36

R: Calculate linear regression and get slope for "a subset of data"

My goal is to find out half life (from terminal phase if anyone is familiar with Pharmacokinetics)
I have some data containing the following;
1500 rows, with ID being the main "key". There are 15 rows per ID. Then I have other columns, TIME and CONCENTRATION. What I want to do is, for each ID, remove the first TIME (which equals "000" (numeric)), run lm() on the remaining 14 rows per ID, use abs() to get the absolute value of the slope, and then save this to a new column named THALF. (If anyone is familiar with pharmacokinetics, maybe there is a better way to do this?)
But I have not been able to do this with my limited knowledge of R.
Here is what I've come up with so far:
data_new <- data %>% dplyr::group_by(data $ID) %>% dplyr::filter(data $TIME != 10) %>% dplyr::mutate(THAFL = abs(lm$coefficients[2](data $CONC ~ data $TIME)))
From what I've understood from other Stackoverflow answers, lm$coefficients[2] will extract the slope.
However, I have not been able to make this work. I get this error when I run the code:
Error: Problem with `mutate()` input `..1`.
x Input `..1` can't be recycled to size 15.
i Input `..1` is `data$ID`.
i Input `..1` must be size 15 or 1, not 1500.
i The error occurred in group 1: data$ID = "pat1".
Any suggestions on how to solve this? IF you need more info, let me know please.
(Also, if anyone is familiar with Pharmacokinetics, when they ask for half life from terminal phase, do I do lm() from the concentration max ? I Have a column with value for the highest observed concentration at what time. )
If after the model fitting you still need the observations with TIME == 10, you can try summarising after you group by ID and then using a right join
data %>%
filter(TIME != 10) %>%
group_by(ID) %>%
summarise(THAFL = abs(lm(CONC ~ TIME)$coefficients[2])) %>%
right_join(data, by = "ID")
# A tibble: 30 x 16
ID THAFL Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX
<chr> <dbl> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl>
1 pat1 0.00975 F 50 135 47 0 2 1 2 0 10 0 Under pat1 60
2 pat1 0.00975 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60
3 pat1 0.00975 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60
4 pat1 0.00975 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60
5 pat1 0.00975 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60
6 pat1 0.00975 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60
7 pat1 0.00975 F 50 135 47 0 2 1 2 0 120 9.00 Under pat1 60
8 pat1 0.00975 F 50 135 47 0 2 1 2 0 150 6.22 Under pat1 60
9 pat1 0.00975 F 50 135 47 0 2 1 2 0 180 4.18 Under pat1 60
10 pat1 0.00975 F 50 135 47 0 2 1 2 0 240 1.82 Under pat1 60
# ... with 20 more rows
If after the model fitting you don't want the rows with TIME == 10 to appear on your dataset, you can use mutate
data %>%
filter(TIME != 10) %>%
group_by(ID) %>%
mutate(THAFL = abs(lm(CONC ~ TIME)$coefficients[2]))
# A tibble: 28 x 16
# Groups: ID [2]
ID Sex Weight..kg. Height..cm. Age..yrs. T134A A443G G769C G955C A990C TIME CONC LBM `data_combine$ID` CMAX THAFL
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 pat1 F 50 135 47 0 2 1 2 0 20 6.93 Under pat1 60 0.00975
2 pat2 M 75 175 29 0 2 0 0 0 20 6.78 Under pat2 60 0.00835
3 pat1 F 50 135 47 0 2 1 2 0 30 12.2 Under pat1 60 0.00975
4 pat2 M 75 175 29 0 2 0 0 0 30 11.6 Above pat2 60 0.00835
5 pat1 F 50 135 47 0 2 1 2 0 45 14.8 Under pat1 60 0.00975
6 pat2 M 75 175 29 0 2 0 0 0 45 13.5 Under pat2 60 0.00835
7 pat1 F 50 135 47 0 2 1 2 0 60 15.0 Under pat1 60 0.00975
8 pat2 M 75 175 29 0 2 0 0 0 60 13.1 Above pat2 60 0.00835
9 pat1 F 50 135 47 0 2 1 2 0 90 12.4 Under pat1 60 0.00975
10 pat2 M 75 175 29 0 2 0 0 0 90 9.77 Under pat2 60 0.00835
# ... with 18 more rows
You can use broom:
library(broom)
library(dplyr)
library(tidyr) # provides unnest()
#Code
data %>% group_by(ID) %>%
filter(TIME!=10) %>%
do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
unnest(fit) %>%
filter(term=='TIME') %>%
mutate(estimate=abs(estimate))
Output:
# A tibble: 2 x 6
ID term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 pat1 TIME 0.00975 0.00334 -2.92 0.0128
2 pat2 TIME 0.00835 0.00313 -2.67 0.0204
If joining with original data is needed, try:
#Code 2
data <- data %>% left_join(data %>% group_by(ID) %>%
filter(TIME!=10) %>%
do(fit = tidy(lm(CONC ~ TIME, data = .))) %>%
unnest(fit) %>%
filter(term=='TIME') %>%
mutate(estimate=abs(estimate)) %>%
select(c(ID,estimate)))
Similar to @RicS.
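On the pharmacokinetics side of the question: the terminal half-life is conventionally computed as ln(2) / lambda_z, where lambda_z is the negative slope of log-concentration against time over the terminal phase, rather than from the raw slope of CONC ~ TIME. A minimal sketch on simulated data follows. The concentrations and the choice of time points are assumptions for illustration, and deciding which observations belong to the terminal phase depends on the actual profile.

```r
# Simulated mono-exponential decline (hypothetical): C(t) = 20 * exp(-0.01 * t)
terminal <- data.frame(TIME = c(90, 120, 150, 180, 240))
terminal$CONC <- 20 * exp(-0.01 * terminal$TIME)

# Log-linear regression over the (assumed) terminal phase
fit      <- lm(log(CONC) ~ TIME, data = terminal)
lambda_z <- -unname(coef(fit)["TIME"])  # elimination rate constant
t_half   <- log(2) / lambda_z           # half-life, in the units of TIME
```

The same expression can replace abs(lm(CONC ~ TIME)$coefficients[2]) inside the grouped summarise() or mutate() calls above once the terminal-phase rows have been selected.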

How to sum values of matching columns while merging two dataframes in r

I have two dataframes in r
ship_no bay_1 bay_2 bay_3 bay_5 bay_6
ABC 0 10 15 20 30
DEF 10 20 0 25 10
ERT 0 10 0 10 0
ship_no bay_1 bay_2 bay_7 bay_5 bay_6
ABC 10 10 10 0 0
DEF 10 10 0 15 10
ERT 0 0 0 10 0
I want to add columns values while merging above two dataframes on column key ship_no
My desired dataframe would be
ship_no bay_1 bay_2 bay_3 bay_5 bay_6 bay_7
ABC 10 20 15 20 30 10
DEF 20 30 0 40 20 0
ERT 0 10 0 20 0 0
How can I do it in r?
We can place the datasets in a list, use rbindlist to rbind the datasets, grouped by 'ship_no', get the sum of other columns
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[,lapply(.SD, sum, na.rm = TRUE) , ship_no]
# ship_no bay_1 bay_2 bay_3 bay_5 bay_6 bay_7
#1: ABC 10 20 15 20 30 10
#2: DEF 20 30 0 40 20 0
#3: ERT 0 10 0 20 0 0
Another option would be dplyr
library(dplyr)
bind_rows(df1, df2) %>%
group_by(ship_no) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) # funs() is deprecated in current dplyr
# A tibble: 3 x 7
# ship_no bay_1 bay_2 bay_3 bay_5 bay_6 bay_7
# <chr> <int> <int> <int> <int> <int> <int>
#1 ABC 10 20 15 20 30 10
#2 DEF 20 30 0 40 20 0
#3 ERT 0 10 0 20 0 0
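If neither package is available, the same idea can be sketched in base R: pad each data frame with the columns it lacks, stack them, and sum by ship_no. The helper pad and the variable all_cols are illustrative names; the data is the question's two tables.

```r
df1 <- data.frame(ship_no = c("ABC", "DEF", "ERT"),
                  bay_1 = c(0, 10, 0), bay_2 = c(10, 20, 10), bay_3 = c(15, 0, 0),
                  bay_5 = c(20, 25, 10), bay_6 = c(30, 10, 0))
df2 <- data.frame(ship_no = c("ABC", "DEF", "ERT"),
                  bay_1 = c(10, 10, 0), bay_2 = c(10, 10, 0), bay_7 = c(10, 0, 0),
                  bay_5 = c(0, 15, 10), bay_6 = c(0, 10, 0))

all_cols <- union(names(df1), names(df2))

# Add any missing bay columns as 0 and put the columns in a common order
pad <- function(d) {
  d[setdiff(all_cols, names(d))] <- 0
  d[all_cols]
}

merged <- aggregate(. ~ ship_no, data = rbind(pad(df1), pad(df2)), FUN = sum)
```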

How to calculate the cumulative data difference with preceding data by group?

The reduced raw data is as follows:
Data group
2016/1/10 1
2016/2/4 1
2016/3/25 1
2016/4/13 1
2016/5/5 1
2016/7/1 2
2016/8/1 2
2016/10/1 2
2016/12/1 2
2016/12/31 2
The final data I want looks like this:
Data group cum_diff_preceding
2016/1/10 1 0
2016/2/4 1 25
2016/3/25 1 125
2016/4/13 1 182
2016/5/5 1 270
2016/7/1 2 0
2016/8/1 2 31
2016/10/1 2 153
2016/12/1 2 336
2016/12/31 2 380
The calculation method is as follows:
for row 2016/1/10, cum_diff_preceding is 0
for row 2016/2/4, cum_diff_preceding is (2016/2/4-2016/1/10)
for row 2016/3/25, cum_diff_preceding is (2016/3/25-2016/1/10)+(2016/3/25-2016/2/4)
for row 2016/4/13, cum_diff_preceding is (2016/4/13-2016/1/10)+(2016/4/13- 2016/2/4)+(2016/4/13-2016/3/25)
for row 2016/5/5, cum_diff_preceding is (2016/5/5-2016/1/10)+(2016/5/5-2016/2/4)+(2016/5/5-2016/3/25)+(2016/5/5-2016/4/13)
for row 2016/7/1, cum_diff_preceding is 0
for row 2016/8/1, cum_diff_preceding is (2016/8/1-2016/7/1)
for row 2016/10/1, cum_diff_preceding is (2016/10/1-2016/7/1)+(2016/10/1- 2016/8/1)
for row 2016/12/1, cum_diff_preceding is (2016/12/1-2016/7/1)+(2016/12/1-2016/8/1)+(2016/12/1-2016/10/1)
for row 2016/12/31, cum_diff_preceding is (2016/12/31-2016/7/1)+(2016/12/31-2016/8/1)+(2016/12/31-2016/10/1)+(2016/12/31-2016/12/1)
My main code is as follows:
>as.Date(df$Data,"%Y-%m-%d")
>fun_forcast<-function(df){for(i in 2:nrow(df)){df$cum_diff_preceeding[i]<-sum(df$data[i]-df$data[1:(i-1)])}}
>ddply(df,.(group),transform,cum_diff_preceding<-fun_forcast)
But it does not work.
Or when I change my code to
>fun_forcast<-function(df)(df$cum_diff_preceding<-sapply(1:NROW(df), function(i) sum(df$data[i] - df$data[1:(i-1)])))
ddply(df,.(group),fun_forcast)
it works, but the result format is
> ddply(df,.(group),fun_forcast)
group V1 V2 V3 V4 V5
1 1 0 25 125 182 270
2 2 0 31 153 336 380
I don't know how to put the results back into cum_diff_preceding in the original data.frame.
Please help.
We can do this with ave from base R
df$Data <- as.Date(df$Data, "%Y/%m/%d")
fun_forcast <- function(v1) sapply(seq_along(v1), function(i) sum(v1[i] - v1[1:(i-1)]))
df$cum_diff_preceding <- with(df, ave(as.numeric(Data), group, FUN = fun_forcast))
df$cum_diff_preceding
#[1] 0 25 125 182 270 0 31 153 336 456
Or use dplyr
library(dplyr)
df %>%
group_by(group) %>%
mutate(cum_diff_preceding = fun_forcast(Data))
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
#10 2016-12-31 2 456
By converting the dates to numeric, and generalizing the formula:
df %>%
group_by(group) %>%
mutate(numdata = as.numeric(Data),
cum_diff_preceding = (1:n())*numdata-cumsum(numdata)) %>%
select(-numdata)
# A tibble: 10 x 3
# Groups: group [2]
# Data group cum_diff_preceding
# <date> <int> <dbl>
# 1 2016-01-10 1 0
# 2 2016-02-04 1 25
# 3 2016-03-25 1 125
# 4 2016-04-13 1 182
# 5 2016-05-05 1 270
# 6 2016-07-01 2 0
# 7 2016-08-01 2 31
# 8 2016-10-01 2 153
# 9 2016-12-01 2 336
# 10 2016-12-31 2 456
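The closed form works because, for row i with numeric date x_i, the sum of (x_i - x_j) over all preceding rows j < i equals i * x_i minus the cumulative sum of x up to and including row i. A quick base R check on the second group's dates from the question:

```r
# Dates of group 2, converted to day counts
days <- as.numeric(as.Date(
  c("2016/7/1", "2016/8/1", "2016/10/1", "2016/12/1", "2016/12/31"),
  "%Y/%m/%d"
))

# Explicit definition: sum of differences to every preceding row
explicit <- sapply(seq_along(days), function(i) sum(days[i] - days[seq_len(i - 1)]))

# Vectorised closed form used in the answer above
closed <- seq_along(days) * days - cumsum(days)
```

Both vectors come out as 0, 31, 153, 336, 456, which matches the output of the answers above.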

Resources