Divide data by the preceding row and create new dataframe - r

I have a data set and I'm trying to calculate the rate of change between the rows.
My input looks like this:
foo = structure(list(date = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("10/03/2020",
"11/03/2020", "12/03/2020", "13/03/2020", "9/03/2020"), class = "factor"),
A = c(0.60256322, 0.634543306, 0.022976661, 0.009839044,
0.319456765), B = c(45.42320826, 57.32689951, 32.49487759,
29.40804164, 54.33691346), C = c(5.114123914, 3.674167652,
2.330610757, 5.510280192, 5.717950467), D = c(4.187409484,
4.835943165, 4.340614439, 4.607468576, 3.14338155)), row.names = c(NA,
5L), class = "data.frame")
I'm trying to divide each cell by the one above it in the same column,
e.g. [5,2] / [4,2]; [4,2] / [3,2]; etc.,
and create a new output like this:
bar = structure(list(date = structure(c(5L, 1L, 2L, 3L, 4L), .Label = c("10/03/2020",
"11/03/2020", "12/03/2020", "13/03/2020", "9/03/2020"), class = "factor"),
A = c(0, 1.053073412, 0.03620976, 0.428219052, 32.46827283
), B = c(0, 1.262061878, 0.56683473, 0.90500546, 1.847688946
), C = c(0, 0.718435398, 0.634323465, 2.364307371, 1.037687789
), D = c(0, 1.154877063, 0.897573501, 1.061478424, 0.682236134
)), row.names = c(NA, 5L), class = "data.frame")
I'm sure there's a better way than finding the length of the column and looping through. Can anyone point me in the right direction?

You can use mutate_if or mutate_at from the dplyr package.
library(dplyr)
foo %>%
mutate_if(!grepl("date", names(.)), function(x) x/lag(x))
OR
foo %>%
mutate_at(vars(-date), function(x) x/lag(x))
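In dplyr 1.0.0 and later, mutate_if() and mutate_at() are superseded by across(). Here is a minimal sketch of the same idea, assuming a small stand-in data frame foo_small (a hypothetical name, with values copied from the first three rows of columns A and B in the question). The first row comes back as NA because it has no preceding row.

```r
library(dplyr)

# Hypothetical stand-in for `foo`: first three rows of columns A and B.
foo_small <- data.frame(
  date = c("9/03/2020", "10/03/2020", "11/03/2020"),
  A = c(0.60256322, 0.634543306, 0.022976661),
  B = c(45.42320826, 57.32689951, 32.49487759)
)

# Divide every numeric column by its previous row; lag() here is dplyr::lag().
ratios <- foo_small %>%
  mutate(across(where(is.numeric), ~ .x / lag(.x)))

ratios
```

If a 0 is wanted in the first row instead of NA, the result can be post-processed, for example with tidyr::replace_na().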

In base R, we can use head and tail to divide data.
foo[-1] <- lapply(foo[-1], function(x) c(0, tail(x, -1)/head(x, -1)))
foo
# date A B C D
#1 9/03/2020 0.00000000 0.0000000 0.0000000 0.0000000
#2 10/03/2020 1.05307341 1.2620619 0.7184354 1.1548771
#3 11/03/2020 0.03620976 0.5668347 0.6343235 0.8975735
#4 12/03/2020 0.42821905 0.9050055 2.3643074 1.0614784
#5 13/03/2020 32.46827283 1.8476889 1.0376878 0.6822361
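The same head/tail idea can also be applied to all numeric columns at once by treating them as a matrix. This is only a sketch, assuming a hypothetical stand-in foo_small with values copied from the question; it writes the result to a new data frame rather than overwriting the input.

```r
# Hypothetical stand-in for `foo`: first three rows of columns A and B.
foo_small <- data.frame(
  date = c("9/03/2020", "10/03/2020", "11/03/2020"),
  A = c(0.60256322, 0.634543306, 0.022976661),
  B = c(45.42320826, 57.32689951, 32.49487759)
)

# Numeric part as a matrix: rows 2..n divided element-wise by rows 1..(n-1).
m <- as.matrix(foo_small[-1])
ratios <- rbind(0, m[-1, , drop = FALSE] / m[-nrow(m), , drop = FALSE])

# Re-attach the date column; the first row is filled with 0 as in the question.
bar_small <- cbind(foo_small["date"], ratios)
bar_small
```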

Another tidyverse approach.
library(tidyverse)
bar <- foo %>%
mutate_if(is.double, ~ replace_na(./lag(.), replace = 0))
bar
#> date A B C D
#> 1 9/03/2020 0.00000000 0.0000000 0.0000000 0.0000000
#> 2 10/03/2020 1.05307341 1.2620619 0.7184354 1.1548771
#> 3 11/03/2020 0.03620976 0.5668347 0.6343235 0.8975735
#> 4 12/03/2020 0.42821905 0.9050055 2.3643074 1.0614784
#> 5 13/03/2020 32.46827283 1.8476889 1.0376878 0.6822361


Aggregate columns based on categories given by another dataframe

I have a dataframe where each column has some vector of data. I want to apply the mean columnwise, but filtered by groups which are given by a second dataframe. That is, each column belongs to a group and this information is in the second dataframe.
Here is some example dataset: df is the dataframe with the data vectors, df_category contains the category for each column.
df=structure(list(x1 = c(0.461302090047301, -1.19974381763812, -0.888258056235799,
0.300889698419314, 0.836911163114131, 0.0540388337324712), x2 = c(1.33736696170763,
-0.687026295689823, 1.12205295626651, -0.848925266014684, 1.16092168555067,
0.591202293337843), x3 = c(-0.279052669225263, -0.780435476613128,
-0.852870619718068, -0.708611614262357, -0.761659405740852, 0.487033696695474
), x4 = c(-0.222767493777229, 1.50328295132467, 0.934670132217215,
1.37678188537077, 0.343280062984192, 1.23279081824003), x5 = c(-1.08074586121729,
0.208120194894818, -0.52245832008453, 0.944618465137011, 0.749834485631317,
-0.81118414509141)), class = "data.frame", row.names = c(NA,
-6L))
df_category=structure(list(Col_name = structure(1:5, .Label = c("x1", "x2",
"x3", "x4", "x5"), class = "factor"), Category = structure(c(1L,
1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
The result I want is this one:
df_result=structure(list(mean_A = c(0.899334525877468, -0.943385056663974,
0.116897450015357, -0.274017783797685, 0.998916424332403, 0.322620563535157
), mean_B = c(-0.527522008073261, 0.310322556535454, -0.146886269195128,
0.537596245415141, 0.110485047624885, 0.302880123281364)), class = "data.frame", row.names = c(NA,
-6L))
In base R:
a <- with(df_category, setNames(Category, Col_name))[names(df)[col(df)]]
tapply(unlist(df), list(row(df), a), mean)
A B
1 0.8993345 -0.5275220
2 -0.9433851 0.3103226
3 0.1168975 -0.1468863
4 -0.2740178 0.5375962
5 0.9989164 0.1104850
6 0.3226206 0.3028801
Another option:
sapply(with(df_category, split.default(df[Col_name], Category)), rowMeans)
A B
[1,] 0.8993345 -0.5275220
[2,] -0.9433851 0.3103226
[3,] 0.1168975 -0.1468863
[4,] -0.2740178 0.5375962
[5,] 0.9989164 0.1104850
[6,] 0.3226206 0.3028801
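One thing to watch with the split-based approach: Col_name is a factor, so df[Col_name] indexes by the factor's integer codes, which only lines up here because the levels are in column order. A sketch that converts the names to character first, using hypothetical toy data so the means are easy to check by hand:

```r
# Hypothetical toy data: x1 and x2 belong to group A, x3 to group B.
df <- data.frame(x1 = c(1, 2), x2 = c(3, 4), x3 = c(10, 20))
df_category <- data.frame(
  Col_name = factor(c("x1", "x2", "x3")),
  Category = factor(c("A", "A", "B"))
)

# Column names per category, as character so they index df by name.
cols_by_cat <- split(as.character(df_category$Col_name), df_category$Category)

# Row-wise mean within each category; one result column per category.
res <- sapply(cols_by_cat, function(cols) rowMeans(df[cols]))
colnames(res) <- paste0("mean_", colnames(res))
df_result <- as.data.frame(res)
df_result
```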
We can use tidyverse to reshape the data values, merge the category data, and compute means for groups "A" and "B":
library(tidyverse)
df_result <- df %>%
mutate(idx = row_number()) %>%
pivot_longer(-idx) %>%
inner_join(df_category, c(name = 'Col_name')) %>%
group_by(Category, idx) %>%
summarize(mean = mean(value)) %>%
pivot_wider(names_from = Category, values_from = mean, names_prefix = 'mean_') %>%
select(-idx)
mean_A mean_B
<dbl> <dbl>
1 0.899 -0.528
2 -0.943 0.310
3 0.117 -0.147
4 -0.274 0.538
5 0.999 0.110
6 0.323 0.303

Dplyr function to match column value to row values and replace it

I have two data frames, Data Frame A and Data Frame B.
I want to take the values of the active ident column in Data Frame B and insert them as a row at the top of Data Frame A, so that each value lines up with the matching column name.
I have tried using dplyr, but I can't seem to figure out how to do this in R. I would appreciate any help.
Attaching dput(head) for both my files
Cell Labels
structure(list(`hnscc.vp.fibroblasts#active.ident` = structure(c(3L,
2L, 3L, 3L, 3L, 3L), .Label = c("Cluster_0", "Cluster_4", "Cluster_3",
"Cluster_2", "Cluster_1"), class = "factor")), row.names = c("pat01.pre_AAACCTGAGGAGCGAG",
"pat01.pre_AAACCTGCACTACAGT", "pat01.pre_AAACCTGTCACCGTAA", "pat01.pre_AAATGCCCACTATCTT",
"pat01.pre_AACCATGAGCATCATC", "pat01.pre_AACCGCGCAGATGGCA"), class = "data.frame")
Gene Count Per Cell :
dput(head(Gene_Counts_per_Cell[, c(1:5)]))
structure(list(pat01.pre_AAACCTGAGGAGCGAG = c(1.99399322071276,
1.5433201979508, 2.4725719042226, -2.59159111384049, 1.56977481481343,
0.192853860719877), pat01.pre_AAACCTGCACTACAGT = c(2.90248911455912,
2.27707326162242, 2.12992680712843, -1.44512552229319, 0.541062218328074,
1.8626908687607), pat01.pre_AAACCTGTCACCGTAA = c(3.99090573935858,
3.00560247848693, 2.9656947677965, -3.23693215603618, 4.72557633990864,
0.0247894431208639), pat01.pre_AAATGCCCACTATCTT = c(1.08405270702075,
-0.884466121620786, 0.500175551980942, -2.28142505510742, 3.97105313918843,
-1.01130712883293), pat01.pre_AACCATGAGCATCATC = c(4.55944063063621,
2.43937477176712, 3.93016796802459, -1.92695887361317, 3.16070890309665,
1.65917938530014)), row.names = c("ACTB", "ACTG1", "ACTN1", "ADAP2",
"ADM", "ADRB2"), class = "data.frame")
This may be what you are looking for. Note that I had to convert the clusters to numeric to keep the column types consistent. Use the row names to distinguish the cluster row from the other numeric data in the columns.
library(dplyr)
library(tidyr)
library(tibble)
library(stringr)
dfb %>%
rownames_to_column("rowname") %>%
pivot_wider(names_from = rowname, values_from = `hnscc.vp.fibroblasts#active.ident`) %>%
mutate(across(everything(), ~as.numeric(str_extract(.x, "\\d$")))) %>%
mutate(cluster = "cluster") %>%
column_to_rownames(var = "cluster") %>%
bind_rows(dfa)
#> pat01.pre_AAACCTGAGGAGCGAG pat01.pre_AAACCTGCACTACAGT
#> cluster 3.0000000 4.0000000
#> ACTB 1.9939932 2.9024891
#> ACTG1 1.5433202 2.2770733
#> ACTN1 2.4725719 2.1299268
#> ADAP2 -2.5915911 -1.4451255
#> ADM 1.5697748 0.5410622
#> ADRB2 0.1928539 1.8626909
#> pat01.pre_AAACCTGTCACCGTAA pat01.pre_AAATGCCCACTATCTT
#> cluster 3.00000000 3.0000000
#> ACTB 3.99090574 1.0840527
#> ACTG1 3.00560248 -0.8844661
#> ACTN1 2.96569477 0.5001756
#> ADAP2 -3.23693216 -2.2814251
#> ADM 4.72557634 3.9710531
#> ADRB2 0.02478944 -1.0113071
#> pat01.pre_AACCATGAGCATCATC pat01.pre_AACCGCGCAGATGGCA
#> cluster 3.000000 3
#> ACTB 4.559441 NA
#> ACTG1 2.439375 NA
#> ACTN1 3.930168 NA
#> ADAP2 -1.926959 NA
#> ADM 3.160709 NA
#> ADRB2 1.659179 NA
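The NA column in the output above appears because that cell has a label in dfb but no matching column in dfa. A base R sketch that looks labels up by column name, so only cells present in the counts table are used. The data and the column name ident are hypothetical stand-ins for the real objects.

```r
# Hypothetical stand-ins: dfa holds counts (genes x cells), dfb holds one label per cell.
dfa <- data.frame(cell1 = c(1.5, 2.5), cell2 = c(3.5, 4.5),
                  row.names = c("ACTB", "ACTG1"))
dfb <- data.frame(ident = factor(c("Cluster_3", "Cluster_4", "Cluster_3")),
                  row.names = c("cell1", "cell2", "cell3"))

# Named lookup: cell barcode -> cluster label (as character, not factor codes).
ident <- setNames(as.character(dfb$ident), rownames(dfb))

# Keep only the cells that are columns of dfa, and take the numeric cluster id.
cluster_num <- as.numeric(sub("^Cluster_", "", ident[colnames(dfa)]))

# Build a one-row data frame with the same columns and put it on top.
cluster_row <- as.data.frame(t(cluster_num))
colnames(cluster_row) <- colnames(dfa)
rownames(cluster_row) <- "cluster"
combined <- rbind(cluster_row, dfa)
combined
```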
data
dfb <- structure(list(`hnscc.vp.fibroblasts#active.ident` = structure(c(3L,
2L, 3L, 3L, 3L, 3L), .Label = c("Cluster_0", "Cluster_4", "Cluster_3",
"Cluster_2", "Cluster_1"), class = "factor")), row.names = c("pat01.pre_AAACCTGAGGAGCGAG",
"pat01.pre_AAACCTGCACTACAGT", "pat01.pre_AAACCTGTCACCGTAA", "pat01.pre_AAATGCCCACTATCTT",
"pat01.pre_AACCATGAGCATCATC", "pat01.pre_AACCGCGCAGATGGCA"), class = "data.frame")
dfa <- structure(list(pat01.pre_AAACCTGAGGAGCGAG = c(1.99399322071276,
1.5433201979508, 2.4725719042226, -2.59159111384049, 1.56977481481343,
0.192853860719877), pat01.pre_AAACCTGCACTACAGT = c(2.90248911455912,
2.27707326162242, 2.12992680712843, -1.44512552229319, 0.541062218328074,
1.8626908687607), pat01.pre_AAACCTGTCACCGTAA = c(3.99090573935858,
3.00560247848693, 2.9656947677965, -3.23693215603618, 4.72557633990864,
0.0247894431208639), pat01.pre_AAATGCCCACTATCTT = c(1.08405270702075,
-0.884466121620786, 0.500175551980942, -2.28142505510742, 3.97105313918843,
-1.01130712883293), pat01.pre_AACCATGAGCATCATC = c(4.55944063063621,
2.43937477176712, 3.93016796802459, -1.92695887361317, 3.16070890309665,
1.65917938530014)), row.names = c("ACTB", "ACTG1", "ACTN1", "ADAP2",
"ADM", "ADRB2"), class = "data.frame")
Created on 2022-03-21 by the reprex package (v2.0.1)

In R, make a conditional indicator variable based on (a) the first instance of a record type and (b) a date difference

Background
Here's a df with some data in it from a Costco-like members-only big-box store:
d <- data.frame(ID = c("a","a","b","c","c","d"),
purchase_type = c("grocery","grocery",NA,"auto","grocery",NA),
date_joined = as.Date(c("2014-01-01","2014-01-01","2013-04-30","2009-03-08","2009-03-08","2015-03-04")),
date_purchase = as.Date(c("2014-04-30","2016-07-08","2013-06-29","2015-04-07","2017-09-10","2017-03-10")),
stringsAsFactors=T)
d <- d %>%
mutate(date_diff = d$date_purchase - d$date_joined)
This yields the following table:
As you can see, it's got a member ID, purchase types based on the broad category of what people bought, and two dates: the date the member originally became a member, and the date of a given purchase. I've also made a variable date_diff to tally the time between a given purchase and the beginning of membership.
The Problem
I'd like to make a new variable early_shopper that's marked 1 on all of a member's purchases if:
(a) That member's first purchase was made within a year of joining (so date_diff <= 365 days).
(b) This first purchase doesn't have an NA in purchase_type.
If these criteria aren't met, give a 0.
What I'm looking for is a table that looks like this:
Note that Member a is the only "true" early_shopper: their first purchase is non-NA in purchase_type, and only 119 days passed between their joining the store and making a purchase there. Member b looks like they could be based on my date_diff criterion, but since they don't have a non-NA value in purchase_type, they don't count as an early_shopper.
What I've Tried
So far, I've tried using mutate and first functions like this:
d <- d %>%
mutate(early_shopper = if_else(!is.na(first(purchase_type,order_by = date_joined)) & date_diff < 365, 1, 0))
Which gives me this:
Something's kinda working here, but not fully. As you can see, I get the correct early_shopper = 1 in Member a's first purchase, but not their second. I also get a false positive with member b, who's marked as an early_shopper when I don't want them to be (because their purchase_type is NA).
Any ideas? I can further clarify if need be. Thanks!
You could use
library(dplyr)
d %>%
mutate(date_diff = date_purchase - date_joined) %>%
group_by(ID, purchase_type) %>%
arrange(ID, date_joined) %>%
mutate(
early_shopper = +(!is.na(first(purchase_type)) & date_diff <= 365)
) %>%
group_by(ID) %>%
mutate(early_shopper = max(early_shopper)) %>%
ungroup()
which returns
# A tibble: 6 x 6
ID purchase_type date_joined date_purchase date_diff early_shopper
<fct> <fct> <date> <date> <drtn> <int>
1 a grocery 2014-01-01 2014-04-30 119 days 1
2 a grocery 2014-01-01 2016-07-08 919 days 1
3 b NA 2013-04-30 2013-06-29 60 days 0
4 c auto 2009-03-08 2015-04-07 2221 days 0
5 c grocery 2009-03-08 2017-09-10 3108 days 0
6 d NA 2015-03-04 2017-03-10 737 days 0
If you want the early_shopper column to be boolean/logical, just remove the +.
Data
I used this data. Here the date_joined for b is 2013-04-30, as shown in your images, rather than the value in the data you originally posted.
structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L), .Label = c("a",
"b", "c", "d"), class = "factor"), purchase_type = structure(c(2L,
2L, NA, 1L, 2L, NA), .Label = c("auto", "grocery"), class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311,
16498), class = "Date"), date_purchase = structure(c(16190,
16990, 15885, 16532, 17419, 17235), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
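For comparison, here is a base R sketch that tests the two criteria on each member's earliest purchase only, and then copies the flag to all of that member's rows. It assumes "first purchase" means the row with the smallest date_purchase per ID; the data is the example from the question.

```r
d <- data.frame(
  ID = c("a", "a", "b", "c", "c", "d"),
  purchase_type = c("grocery", "grocery", NA, "auto", "grocery", NA),
  date_joined = as.Date(c("2014-01-01", "2014-01-01", "2013-04-30",
                          "2009-03-08", "2009-03-08", "2015-03-04")),
  date_purchase = as.Date(c("2014-04-30", "2016-07-08", "2013-06-29",
                            "2015-04-07", "2017-09-10", "2017-03-10"))
)

# Sort so that the first row of each ID is that member's earliest purchase.
d <- d[order(d$ID, d$date_purchase), ]
is_first <- !duplicated(d$ID)

# Days between joining and each purchase.
days <- as.numeric(difftime(d$date_purchase, d$date_joined, units = "days"))

# A member qualifies if their first purchase has a type and is within 365 days.
qualifies <- is_first & !is.na(d$purchase_type) & days <= 365

# Spread the flag to every row of a qualifying member.
d$early_shopper <- as.integer(d$ID %in% d$ID[qualifies])
d
```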
Here is my approach using a join to get the early_shopper value to be the same for all rows of the same ID.
library(dplyr)
d <- structure(list(ID = structure(c(1L, 1L, 2L, 3L, 3L, 4L),
.Label = c("a","b", "c", "d"),
class = "factor"),
purchase_type = structure(c(2L, 2L, NA, 1L, 2L, NA),
.Label = c("auto", "grocery"),
class = "factor"),
date_joined = structure(c(16071, 16071, 15825, 14311, 14311, 16498),
class = "Date"),
date_purchase = structure(c(16190, 16990, 15885, 16532, 17419, 17235),
class = "Date")),
class = "data.frame", row.names = c(NA, -6L))
d %>%
inner_join(d %>%
mutate(date_diff = d$date_purchase - d$date_joined) %>%
group_by(ID) %>%
slice_min(date_diff) %>%
transmute(early_shopper = if_else(!is.na(first(purchase_type,
order_by = date_joined)) &
date_diff < 365, 1, 0)) %>%
ungroup()
)
ID purchase_type date_joined date_purchase early_shopper
1 a grocery 2014-01-01 2014-04-30 1
2 a grocery 2014-01-01 2016-07-08 1
3 b <NA> 2013-04-30 2013-06-29 0
4 c auto 2009-03-08 2015-04-07 0
5 c grocery 2009-03-08 2017-09-10 0
6 d <NA> 2015-03-04 2017-03-10 0

Wilcoxon Test in a loop for a number of datasets at the same time

I have a question about whether I can do a Wilcoxon test in a loop for all the tables generated.
Basically, I want to do a paired Wilcoxon test between 2 variables for each dataset; the 2 variables are in the same position (e.g. the xth and yth columns) in every dataset. (For people familiar with biology, these are the RPKM values for control and treated samples for some repetitive elements.) I hope to generate a table of the p-values from the Wilcoxon test for each dataset.
I already generated all the tables/datasets/data frames using the code below. I think I need to continue the loop to run a Wilcoxon test for each dataset, but I don't know how to do it:
data=sample_vs_norm
filter=unique(data$family)
for(i in 1:length(filter)){
table_name=paste('table_', filter[i], sep="")
print(table_name)
assign(table_name, data[data$Subfamily == filter[i]])
}
Basically, I would like to do a Wilcoxon test between the variables "R009_initial_filter_rpkm" and "normal_filter_rpkm".
Here is the structure of a single dataset:
Chr Start End Mappability Strand R009_initial_filter_NormalizedCounts
1: chr11 113086868 113087173 1 - 2
2: chr2 24290845 24291132 1 - 11
3: chr4 15854425 15854650 1 - 0
4: chr6 43489623 43489676 1 + 11
normal_filter_NormalizedCounts R009_initial_filter_rpkm
1: 14.569000 0.169752
2: 1.000000 0.992191
3: 14.815900 0.000000
4: 0.864262 5.372810
normal_filter_rpkm FoldChange p.value FDR FoldChangeFPKM
1: 1.236560 0.137278 0.999862671 1.000000000 0.1372776
2: 0.000000 11.000000 0.003173828 0.008149271 Inf
3: 1.704630 0.000000 1.000000000 1.000000000 0.0000000
4: 0.422137 12.727600 0.003173828 0.008149271 12.7276453
structure(list(Chr = structure(1:4, .Label = c("chr11", "chr2",
"chr4", "chr6"), class = "factor"), Start = c(113086868L, 24290845L,
15854425L, 43489623L), End = c(113087173L, 24291132L, 15854650L,
43489676L), Mappability = c(1L, 1L, 1L, 1L), Strand = structure(c(1L,
1L, 1L, 2L), .Label = c("-", "+"), class = "factor"), R009_initial_filter_NormalizedCounts = c(2L,
11L, 0L, 11L), normal_filter_NormalizedCounts = c(14.569,
1, 14.8159, 0.864262), R009_initial_filter_rpkm = c(0.169752,
0.992191, 0, 5.37281), normal_filter_rpkm = c(1.23656,
0, 1.70463, 0.422137), FoldChange = c(0.137278, 11, 0, 12.7276
), p.value = c(0.999862671, 0.003173828, 1, 0.003173828), FDR = c(1,
0.008149271, 1, 0.008149271), FoldChangeFPKM = c(0.1372776, Inf,
0, 12.7276453)), class = "data.frame", row.names = c(NA,
-4L))
I'm sorry if I use incorrect terminology, as I am a newbie in R. Thank you so much for the help.
One approach is to use grouping with by = in data.table.
library(data.table)
setDT(data)
data[,wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily]
# TE_Subfamily statistic p.value
#1: AluYf4 7.5 1
You can group by any number of variables, for example TE_Subfamily and Chr:
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = .(TE_Subfamily,Chr)]
# TE_Subfamily Chr statistic p.value
#1: AluYf4 chr11 0 1
#2: AluYf4 chr2 1 1
#3: AluYf4 chr4 0 1
#4: AluYf4 chr6 1 1
If you need to only perform comparisons for certain TE_Subfamily, you could try something like this:
filter <- c("AluYf4")
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily]
# TE_Subfamily statistic p.value
#1: AluYf4 7.5 1
For bonus points, you can correct for multiple testing:
data[TE_Subfamily %in% filter,
wilcox.test(R009_initial_filter_rpkm,
normal_filter_rpkm)[c("statistic","p.value")],
by = TE_Subfamily][,adjusted.p.value := p.adjust(p.value,method = "bonferroni")][]
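The question asks for a paired test, which wilcox.test() runs only when paired = TRUE is passed. A base R sketch of the per-group paired version follows, using simulated data. The family names fam1 and fam2 and the random values are hypothetical; the column names follow the answer above.

```r
set.seed(1)

# Hypothetical simulated data: two families, eight paired measurements each.
data <- data.frame(
  TE_Subfamily = rep(c("fam1", "fam2"), each = 8),
  R009_initial_filter_rpkm = runif(16),
  normal_filter_rpkm = runif(16)
)

# One paired Wilcoxon signed-rank test per family; keep only the p-value.
pvals <- sapply(split(data, data$TE_Subfamily), function(g) {
  wilcox.test(g$R009_initial_filter_rpkm, g$normal_filter_rpkm,
              paired = TRUE)$p.value
})

# Collect the p-values in a table and correct for multiple testing.
result <- data.frame(
  TE_Subfamily = names(pvals),
  p.value = unname(pvals),
  adjusted.p.value = unname(p.adjust(pvals, method = "bonferroni"))
)
result
```

A paired test needs the two columns to have the same length within each group and the rows to correspond to the same element in both samples.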

merging quarterly time series in R

How do you merge a and b xts series that are presented like this:
a:
1948-01-01 1
1948-04-01 1
1948-07-01 1
1948-10-01 1
b:
1948-03-01 2
1948-06-01 2
1948-09-01 2
1948-12-01 2
Result should look like this
a b
1948Q1 1 2
1948Q2 1 2
1948Q3 1 2
1948Q4 1 2
The date format doesn't matter, as long as a and b are aligned by quarter. This is much easier to do with monthly data using indexFormat() with %b%Y and so on, but there isn't an equivalent format available for quarterly data.
aggregate(a, as.yearqtr) doesn't work well because, for some reason, it takes Q2 as the first quarter of the year. If you then want a yearly average, it takes Q2, Q3, Q4, Q1 for each year instead of Q1-Q4. So I am looking for another method. Let me know if one exists.
Try this 2-step solution:
# Step 1: set rownames as quarters
library(zoo)
rownames(a) <- format.yearqtr(as.Date(a$V1))
rownames(b) <- format.yearqtr(as.Date(b$V1))
# Step 2: merge by rownames (quarters)
merge(a, b, by = "row.names", all = TRUE, suffixes = c(".a",".b"))[, -c(2, 4)]
# output (you can now store the result in a dataframe and change colnames as you want)
Row.names V2.a V2.b
1 1948 Q1 1 2
2 1948 Q2 1 2
3 1948 Q3 1 2
4 1948 Q4 1 2
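The quarter-keyed merge above can also be sketched without zoo, using base R's quarters() to build the key as an ordinary column. This assumes plain data frames with the same V1/V2 layout as the Data section; the helper name qtr_key is hypothetical.

```r
a <- data.frame(V1 = c("1948-01-01", "1948-04-01", "1948-07-01", "1948-10-01"),
                V2 = 1L)
b <- data.frame(V1 = c("1948-03-01", "1948-06-01", "1948-09-01", "1948-12-01"),
                V2 = 2L)

# Hypothetical helper: turn a date string into a "YYYY Qn" key.
qtr_key <- function(x) {
  d <- as.Date(x)
  paste(format(d, "%Y"), quarters(d))
}

a$qtr <- qtr_key(a$V1)
b$qtr <- qtr_key(b$V1)

# Merge on the quarter key; the original dates are dropped before merging.
merged <- merge(a[c("qtr", "V2")], b[c("qtr", "V2")],
                by = "qtr", all = TRUE, suffixes = c(".a", ".b"))
merged
```

Keeping the key as a column rather than as row names makes it easier to group by year afterwards.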
Data
a <- structure(list(V1 = c("1948-01-01", "1948-04-01", "1948-07-01",
"1948-10-01"), V2 = c(1L, 1L, 1L, 1L)), .Names = c("V1", "V2"
), class = "data.frame", row.names = c("1948 Q1", "1948 Q2",
"1948 Q3", "1948 Q4"))
b <- structure(list(V1 = c("1948-03-01", "1948-06-01", "1948-09-01",
"1948-12-01"), V2 = c(2L, 2L, 2L, 2L)), .Names = c("V1", "V2"
), class = "data.frame", row.names = c("1948 Q1", "1948 Q2",
"1948 Q3", "1948 Q4"))
