I am trying to do a large data check for a database. Some fields in the database are hidden, so when I do the data check I need to ignore all hidden fields. Fields are hidden based on conditional logic stored in the database. I have exported this conditional logic and stored it in a data frame in R. Now I need to automate the data check by somehow using the text string of a conditional argument to drive the script itself, which I do not think is possible, or by finding a way around this problem.
Below is example code illustrating what I need to solve:
library(dplyr)

id <- c(1001, 1002, 1003, 1004, 1005, 1001, 1002, 1003, 1004, 1005)
target_var <- c("race", "race", "race", "race", "race",
                "race_other", "race_other", "race_other", "race_other", "race_other")
value <- c(1, NA, 1, 1, 6, NA, NA, NA, NA, "Asian")
branching_logic <- c(NA, NA, NA, NA, NA,
                     "race == 6", "race == 6", "race == 6",
                     "race == 6", "race == 6")
race <- c(NA, NA, NA, NA, NA, 1, 1, 1, 6, 6)

data <- data.frame(id, target_var, value, branching_logic, race) %>%
  mutate(data_check_result = case_when(
    !is.na(value) ~ "No Missing Data",
    is.na(value) & is.na(branching_logic) ~ "Missing Data 1",
    is.na(value) & race == 6 ~ "Missing Data 2",
    is.na(value) & race != 6 ~ "Hidden field"
  ))
It would be great if I could replace race == 6 with a variable, or somehow direct the script to the conditional expression already saved as a string, but as far as I know R can't do that.
The above problem has four categories into which the data could fall:
No Missing Data: only if value is non-NA
Missing Data 1: if the value is NA and there is no branching logic that hid the variable
Missing Data 2: if the value is NA and the branching logic is met to show the field
Hidden Field: if the value is NA and the branching logic is NOT met to show the field
I have thousands of fields to check with accompanying branching logic, so I need a way to use the branching logic saved in the "branching_logic" column within the script.
IMPORTANT NOTE: The case here is the simplest one. Many target_var and value variables have branching logic that looks at multiple other variables to determine whether to hide the field (e.g. race == 6 & race == 1).
This is only my second time posting, and I usually do not see such in-depth problems here, but it would be great if someone has an idea!
You can store the expression you want to evaluate as a string if you pass it into parse() first as explained in this answer.
Here's a simple example of how you can store the expression in a column and then feed it to dplyr::case_when().
library(tidyverse)
set.seed(1)
d <- tibble(
  a = sample(10),
  b = sample(10),
  c = "a > b"
)

d %>%
  mutate(a_bigger = case_when(
    eval(parse(text = c)) ~ "Y",
    TRUE ~ "N"
  ))
#> # A tibble: 10 x 4
#> a b c a_bigger
#> <int> <int> <chr> <chr>
#> 1 9 3 a > b Y
#> 2 4 1 a > b Y
#> 3 7 5 a > b Y
#> 4 1 8 a > b N
#> 5 2 2 a > b N
#> 6 5 6 a > b N
#> 7 3 10 a > b N
#> 8 10 9 a > b Y
#> 9 6 4 a > b Y
#> 10 8 7 a > b Y
Created on 2022-03-07 by the reprex package (v2.0.1)
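Applied to your data, a minimal sketch (assuming every branching_logic string references only columns that exist in the data frame; rowwise() makes eval() see the current row's values, so compound expressions with several variables work the same way):
library(dplyr)

data %>%
  rowwise() %>%
  # evaluate each row's branching logic string against that row's values
  mutate(logic_met = if (is.na(branching_logic)) NA
         else eval(parse(text = branching_logic))) %>%
  ungroup() %>%
  mutate(data_check_result = case_when(
    !is.na(value) ~ "No Missing Data",
    is.na(value) & is.na(branching_logic) ~ "Missing Data 1",
    is.na(value) & logic_met ~ "Missing Data 2",
    is.na(value) & !logic_met ~ "Hidden field"
  ))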
I have the following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If Quantity = 0, then remove all rows with the same Fund name as that row, but remove only the rows that have Date <= the Date of the Quantity = 0 row.
What I did here is:
I grouped the data by Fund
Arranged each group by Data
Created a column zero_point that assigns 1 to the row where Quantity == 0 and NA otherwise
Filled the fields in zero_point that come before the actual "zero point" with the same value.
filtered those rows out.
output <- df %>%
group_by(Fund) %>%
arrange(Data) %>%
mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
fill(zero_point, .direction = "up") %>%
filter(is.na(zero_point))
(On the condition that there is only one instance where Quantity is 0 per Fund group)
You can try:
library(dplyr)

df %>%
  filter({
    # Row index where Quantity = 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is less than or equal to the Data value
    # at Quantity = 0 and the Fund is the same as the Fund at Quantity = 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
Here's a thought:
df %>%
group_by(Fund) %>%
filter(!any(Quantity == 0) | Data <= Data[which.min(Quantity)])
# # A tibble: 3 x 12
# # Groups: Fund [2]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-06-17 00:00:00 Clien~ Fund1 Appli~ 433. 69800 162. 161. 0.00764 533. 70333. 0.0210
# 2 2021-08-17 00:00:00 Clien~ Fund1 Rescue 0 -70306. 162. 162. 0.000391 -27.5 -70333. 0.0210
# 3 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", therefore using Data instead of Date (not found) or NAV in Application Date.
This filters nothing in this sample data; I'm hoping the logic is correct.
Testing for equality with floating-point (numeric) values can be problematic at times (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, this will silently produce counter-intuitive results, without warning or error. You might be more defensive and use something like:
df %>%
group_by(Fund) %>%
filter(all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           row_number() == which.min(Quantity) |
           Data < Data[which.min(Quantity)])
While the latter is a bit paranoid (and computes which.min(.) twice), it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)); that is, which.min(c(NA, NA)) returns integer(0), which will cause an error in dplyr::filter. One might choose to add a safeguard with something like filter(any(!is.na(Quantity)) & (...)).
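For completeness, a minimal sketch of such a safeguard (assuming groups whose Quantity is entirely NA should be dropped; the if keeps the empty which.min() index from ever being subscripted):
df %>%
  group_by(Fund) %>%
  # drop all-NA groups outright, otherwise apply the filter from above
  filter(if (all(is.na(Quantity))) FALSE
         else all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])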
I have a proteomics dataset currently with ~60 columns (patients and information such as protein names) and ~1800 rows (the specific proteins).
I need to convert from long to wide format so that each row corresponds to a patient while the columns represent the proteins. I can do (very) simple conversions, but there are many columns in this example and, by extension, some data management is required, as new covariates need to be created/extracted from the raw proteomics output below. I simply do not know how to proceed and have not found any solutions in the many available walk-throughs on converting large datasets like this.
I would prefer dplyr-based hints or solutions.
The raw output from the proteomics software looks something like this:
> head(Heat_BT)
# A tibble: 11 x 6
protein gene Intensity_10 Intensity_11 Intensity_MB_1 Intensity_Ref1
<chr> <chr> <chr> <chr> <chr> <chr>
1 NA NA Bruschi Bruschi Reichl Reichl
2 NA NA Ctrl Ctrl Tumor Ctrl
3 NA NA Hydro Hydro Malignant Hydro
4 NA NA Ctrl Ctrl MB Ctrl
5 von Willebrand factor VWF 0.674627721 0.255166769 0.970489979 0.215972215
6 Sex hormone-binding globulin SHBG 0.516914487 0.476843655 0.88173753 0.306484252
7 Glyceraldehyde-3-phosphate dehydrogenase GAPDH 0.622163594 0.231107563 0.71856463 0.204625234
8 Nestin NES 0.868476391 0.547319174 0.832109928 0.440162212
9 Heat shock 70 kDa protein 13 HSPA13 0.484973907 0.435322136 0.539334834 0.28678757
10 Isocitrate dehydrogenase [NADP], mitochondrial IDH2 1.017596364 0.107395157 0.710225344 0.251976997
11 Mannan-binding lectin serine protease 1 MASP1 0.491321206 0.434995681 0.812500775 0.403583705
Expected output:
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro 0.6746277 0.5169145 0.6221636 0.8684764 0.4849739 1.0175964 0.4913212
2 Intensity_11 Bruschi Ctrl Hydro 0.2551668 0.4768437 0.2311076 0.5473192 0.4353221 0.1073952 0.4349957
3 Intensity_MB_1 Reichl Tumor Malignant 0.9704900 0.8817375 0.7185646 0.8321099 0.5393348 0.7102253 0.8125008
4 Intensity_Ref1 Reichl Ctrl Hydro 0.2159722 0.3064843 0.2046252 0.4401622 0.2867876 0.2519770 0.4035837
The proteomics software automatically prints the first four rows as categories to which each patient belongs.
Based on these first four rows:
Four new covariates must be added to the wide format: (1) Heat_BT$id corresponds to the study name of each patient, (2) Heat_BT$lab corresponds to the lab that produced the data, (3) Heat_BT$malig corresponds to whether the patient is a control case or a tumor case, and finally (4) Heat_BT$diag corresponds to the underlying diagnosis.
Data
Heat_BT <- structure(list(protein = c(NA, NA, NA, NA, "von Willebrand factor",
"Sex hormone-binding globulin", "Glyceraldehyde-3-phosphate dehydrogenase",
"Nestin", "Heat shock 70 kDa protein 13", "Isocitrate dehydrogenase [NADP], mitochondrial",
"Mannan-binding lectin serine protease 1"), gene = c(NA, NA,
NA, NA, "VWF", "SHBG", "GAPDH", "NES", "HSPA13", "IDH2", "MASP1"
), Intensity_10 = c("Bruschi", "Ctrl", "Hydro", "Ctrl", "0.674627721",
"0.516914487", "0.622163594", "0.868476391", "0.484973907", "1.017596364",
"0.491321206"), Intensity_11 = c("Bruschi", "Ctrl", "Hydro",
"Ctrl", "0.255166769", "0.476843655", "0.231107563", "0.547319174",
"0.435322136", "0.107395157", "0.434995681"), Intensity_MB_1 = c("Reichl",
"Tumor", "Malignant", "MB", "0.970489979", "0.88173753", "0.71856463",
"0.832109928", "0.539334834", "0.710225344", "0.812500775"),
Intensity_Ref1 = c("Reichl", "Ctrl", "Hydro", "Ctrl", "0.215972215",
"0.306484252", "0.204625234", "0.440162212", "0.28678757",
"0.251976997", "0.403583705")), row.names = c(NA, -11L), class = c("tbl_df",
"tbl", "data.frame"))
Here is a dplyr solution for you. It is two steps, as you need to collect the intensity variables first.
library(tidyverse)

Heat_BT <- Heat_BT %>% na.exclude()
Heat_BT[, -1] %>%
  pivot_longer(
    cols = Intensity_10:Intensity_Ref1,
    names_to = "id"
  ) %>%
  pivot_wider(names_from = gene) %>%
  mutate(across(.cols = -"id", as.numeric))
Which gives the following output
# A tibble: 4 x 8
id VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
I had trouble seeing the connection between the variables you wanted to add and the data, so I assumed that once you were able to pivot your data correctly, you would be able to fill in the rest.
I'll happily revise my answer if you can explain more plainly how these variables are related.
Best
EDIT: Notice that I removed the first four rows from the data, as I didn't immediately see the connection between the variables that you wanted added.
EDIT 2: I assumed that the first three rows are the covariates that you want to add, such that rows one to three are lab, malig, and diag respectively.
# Extract the relevant information
# from the data.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3, -c(1, 2)]
) %>%
  group_by(var) %>%
  pivot_longer(cols = Intensity_10:Intensity_Ref1, names_to = "id") %>%
  pivot_wider(names_from = var)
# Remove these identifiers;
Heat_BT <- Heat_BT %>% na.exclude()

# Pivot the table;
pivoted_table <- Heat_BT[, -1] %>%
  pivot_longer(cols = Intensity_10:Intensity_Ref1, names_to = "id") %>%
  pivot_wider(names_from = gene) %>%
  mutate(across(.cols = -"id", as.numeric))
# Join with the ID columns
left_join(id_cols, pivoted_table)
Which gives the output,
# A tibble: 4 x 11
id lab malig diag VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Intensity_10 Bruschi Ctrl Hydro 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
This will work with the data that you have, regardless of size. Clearly, you can make the approach more bullet-proof by replacing, for example, cols = Intensity_10:Intensity_Ref1 with contains("Intensity").
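For example, a sketch of the pivot step with pattern-based column selection (starting from the original Heat_BT, before the identifier rows are removed):
pivoted_table <- Heat_BT[-(1:4), -1] %>%
  # pick up every intensity column by name pattern instead of an explicit range
  pivot_longer(cols = contains("Intensity"), names_to = "id") %>%
  pivot_wider(names_from = gene) %>%
  mutate(across(.cols = -"id", as.numeric))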
EDIT 3: You have many more variables than provided here, so when you pivot, those are not modified during the pivot process.
We can take a more robust approach, assuming that all the variables not provided here are similar to the ones provided, by changing the cols argument accordingly.
# Extract the relevant information
# from the data.
id_cols <- bind_cols(
  var = c("lab", "malig", "diag"),
  Heat_BT[1:3, -c(1, 2)]
) %>%
  group_by(var) %>%
  pivot_longer(cols = -"var", names_to = "id") %>%
  pivot_wider(names_from = var)
# Remove these identifiers;
Heat_BT <- Heat_BT[-(1:4), ]

# Pivot the table;
pivoted_table <- Heat_BT[, -1] %>%
  pivot_longer(cols = -"gene", names_to = "id") %>%
  pivot_wider(names_from = gene) %>%
  mutate(across(.cols = -"id", as.numeric))

# Join with the ID columns
left_join(id_cols, pivoted_table)
Which gives the same output as above.
You could do:
Heat_BT[,2][1:3] <- c('lab', 'malig', 'diag')
data.table::transpose(Heat_BT[,-1],keep.names = 'gene',make.names = TRUE)
gene lab malig diag NA VWF SHBG GAPDH NES HSPA13 IDH2 MASP1
1 Intensity_10 Bruschi Ctrl Hydro Ctrl 0.674627721 0.516914487 0.622163594 0.868476391 0.484973907 1.017596364 0.491321206
2 Intensity_11 Bruschi Ctrl Hydro Ctrl 0.255166769 0.476843655 0.231107563 0.547319174 0.435322136 0.107395157 0.434995681
3 Intensity_MB_1 Reichl Tumor Malignant MB 0.970489979 0.88173753 0.71856463 0.832109928 0.539334834 0.710225344 0.812500775
4 Intensity_Ref1 Reichl Ctrl Hydro Ctrl 0.215972215 0.306484252 0.204625234 0.440162212 0.28678757 0.251976997 0.403583705
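Note the values come back as character, and the unused fourth category row produces the NA-named column above. A possible cleanup sketch, continuing from the code above (the column positions assume this sample's layout):
res <- data.table::transpose(Heat_BT[, -1], keep.names = 'gene', make.names = TRUE)
res <- res[, -5]                                    # drop the NA-named column (fourth category row)
res[, -(1:4)] <- lapply(res[, -(1:4)], as.numeric)  # protein columns back to numeric
res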
I have daily panel data with four variables: date, cusip (id identifier), PD (probability of default), and price. PD is only available on a quarterly basis for the first day of January, April, July, and October. I want to generate daily data for PD using Chow-Lin frequency conversion from the tempdisagg package. I know how to apply the td() function to time series, but I did not find examples with panel data frames. Here are my code and sample data, built using reproduce() from the devtools package, so only a few sample days are included instead of the full quarter. Running td() reports an error:
Error in td(PD ~ price, conversion = "first", method = "chow-lin-fixed", fixed.rho
= 0.5) : In numeric mode, 'to' must be an integer number.
I know that both price and PD are high-frequency daily indicators in mydata, so I guess I need to use the to.quarterly() function on PD or something similar.
library(dplyr)
library(zoo)
library(tempdisagg)
library(tsbox)
mydata <- structure(list(date = structure(c(13516, 13516, 13517, 13517,13518, 13518, 13521, 13605, 13605, 13606), class = "Date"), cusip = c("31677310","66585910", "31677310", "66585910", "31677310", "66585910", "31677310","66585910", "31677310", "66585910"), PD = c(0.076891, 0.096,NA, NA, NA, NA, NA, 0.094341, 0.08867, NA), price = c(40.98, 61.31,40.99, 60.77, 40.18, 59.97, 39.92, 59.96, 38.6, 60.69)), row.names = c(6L,13L, 36L, 43L, 66L, 73L, 96L, 1843L, 1866L, 1873L), class = "data.frame")
mydata <- mydata %>%
  group_by(cusip) %>%
  arrange(cusip, date) %>%
  mutate(PDdaily = td(PD ~ price, conversion = "first",
                      method = "chow-lin-fixed", fixed.rho = 0.5))
Your example is not sufficient. For each disaggregation, we need at least 3 low frequency values to be able to perform a regression.
Here is an alternative example, with 3 pairs of low and high frequency series:
library(tidyverse)
library(tempdisagg)
library(tsbox)
mydata <- ts_c(
  low_freq = ts_frequency(fdeaths, "year"),
  high_freq = mdeaths
) %>%
  ts_tbl() %>%
  ts_wide() %>%
  crossing(id = 1:3) %>%
  arrange(id)
Applying td() multiple times to data stored in a data frame is cumbersome.
It is easier to extract the data into two lists, one with the low-frequency and one with the high-frequency series:
list_lf <- group_split(ts_na_omit(select(mydata, time, value = low_freq, id)), id, keep = FALSE)
list_hf <- group_split(select(mydata, time, value = high_freq, id), id, keep = FALSE)
Now you can use Map() or map2() to apply the function to each pair of elements:
ans <- map2(list_lf, list_hf, ~ predict(td(.x ~ .y)))
Transforming the disaggregated data back to a data frame:
bind_rows(ans, .id = "id")
#> # A tibble: 216 x 3
#> id time value
#> <chr> <date> <dbl>
#> 1 1 1974-01-01 59.2
#> 2 1 1974-02-01 54.2
#> 3 1 1974-03-01 54.4
#> 4 1 1974-04-01 54.4
#> 5 1 1974-05-01 47.3
#> 6 1 1974-06-01 42.8
#> 7 1 1974-07-01 43.3
#> 8 1 1974-08-01 40.6
#> 9 1 1974-09-01 42.0
#> 10 1 1974-10-01 47.3
#> # … with 206 more rows
Created on 2020-06-03 by the reprex package (v0.3.0)
I have a dataframe of 4032 rows and 48 columns.
The first 24 columns are fluxes calculated for several compounds, and each row is the flux calculated at a 30-minute time resolution.
The following 24 columns are the calculated limits of detection (LOD) of the fluxes, for the same compounds as in the previous 24 columns, in the same order.
I want to check, row by row, whether each compound in each column is > LOD or < -LOD for the respective LOD calculated for that compound in the corresponding column.
In the end I want to create a new dataframe where, if the flux passes this condition, the value is written; otherwise an NA is written.
I share a reduced version of my dataset:
structure(list(mz31_fluxmax = c(0.0075314, 0.039237, -0.0091778,
-0.0074935, -0.0062872, -0.012777), mz33_fluxmax = c(-0.10383,
0.26369, -0.073705, -0.052205, -0.055995, -0.30571), mz39_fluxmax = c(0.13112,
-0.24524, 0.099267, 0.14686, 0.23026, 0.2555), mz42_fluxmax = c(-0.0064381,
0.0068372, 0.010509, 0.013523, -0.0039596, 0.018889), mz45_fluxmax = c(0.024457,
0.10681, 0.033549, 0.034579, -0.052483, 0.057419), mz47_fluxmax = c(-0.030953,
-0.060969, -0.027106, 0.04804, 0.048647, -0.050288), mz59_fluxmax = c(0.030912,
0.063897, 0.03306, 0.042901, -0.032359, -0.052992), mz61_fluxmax = c(-0.013039,
-0.018731, -0.017816, 0.035933, 0.025714, 0.023489), mz69_fluxmax = c(0.02081,
0.021299, -0.0077438, 0.011213, 0.019074, -0.02709), mz71_fluxmax = c(0.008063,
-0.0069763, 0.0023735, 0.0043432, 0.003758, 0.010974), mz75_fluxmax = c(-1.8245e-17,
7.0344e-18, -0.0006465, 0.00086653, -0.00052278, 0.00056043),
mz79_fluxmax = c(-0.0099819, 0.029971, 0.011572, 0.009469,
0.02177, -0.032429), mz85_fluxmax = c(0.0068045, -0.021908,
0.0050362, -0.0090931, -0.0058598, -0.019743), mz87_fluxmax = c(0.0090713,
0.011222, 0.0051697, 0.0097271, 0.0021328, 0.0090713), mz93_fluxmax = c(-0.029838,
0.05316, 0.044835, 0.021252, -0.040539, 0.039774), mz99_fluxmax = c(-0.0072673,
0.0077081, -0.0037859, -0.0046982, -0.0010743, 0.0071997),
mz101_fluxmax = c(0.0048883, 0.011394, 0.0029878, -0.006759,
0.0065672, 0.010028), mz107_fluxmax = c(-0.027853, -0.054236,
0.023384, 0.022094, 0.022981, 0.036405), mz111_fluxmax = c(-0.0016328,
0.0066329, -0.0018345, 0.004555, 0.0015514, 0.0032013), mz113_fluxmax = c(-0.0013015,
0.0055934, 0.00089352, 0.0015395, -0.0011601, 0.0038798),
mz135_fluxmax = c(-0.0061842, -0.0098238, 0.0036505, 0.0052973,
0.0029078, 0.012724), mz137_fluxmax = c(0.026894, 0.034569,
0.016971, -0.00055361, 0.03223, 0.0020253), mz149_fluxmax = c(-0.0017587,
-0.0033536, 0.00090186, -0.00060427, -0.00083038, 0.0017915
), mz155_fluxmax = c(0.0011551, 0.00011869, 0.00052767, 0.00054035,
-5.7848e-05, -1.2613e-05), mz31_LOD = c(0.0056881436858662,
0.014850612037564, 0.00263459553228289, 0.00479935397746244,
0.0152440068257583, 0.0178542775892762), mz33_LOD = c(0.0125308028387973,
0.00911763719872646, 0.0151284350477026, 0.0372508988086331,
0.0402229125266234, 0.0936355242726306), mz39_LOD = c(0.0301850520395113,
0.0296992069156593, 0.0201949605533048, 0.217490160513958,
0.00223029803079041, 0.124007419481375), mz42_LOD = c(0.00320496355324591,
0.000990716035552583, 0.00114254522034714, 0.00153880263591558,
0.00948843346611039, 0.00829842969627028), mz45_LOD = c(0.0330936038635234,
0.0167556608587841, 0.0122716423260542, 0.000398211936512332,
0.00540592950218144, 0.0183693318587938), mz47_LOD = c(0.0111770867410492,
0.00282666705854054, 0.00172080651807461, 0.0115511710261517,
0.0156374551396285, 0.0544621906247567), mz59_LOD = c(0.0159506436971311,
0.0467280850597503, 0.00896526672250792, 0.00208209542259193,
0.0196628887796654, 0.00302598893847008), mz61_LOD = c(0.0016309734207739,
0.000905825894770442, 0.00793279030609907, 0.0131829166139475,
0.0108149832147901, 0.0153864222552258), mz69_LOD = c(0.00638344838052493,
0.00465756945316134, 0.000733281330641999, 0.00235604303405109,
0.00314352406984064, 0.00504395302927101), mz71_LOD = c(0.000455024687674437,
0.00326558077604542, 0.000174790097425541, 0.00121549851806748,
0.00163842732208755, 0.000892298876604984), mz75_LOD = c(NA,
NA, 0.00145895087681435, 4.90107803327739e-05, 0.000251573571031492,
0.00363292289535981), mz79_LOD = c(0.00521113925555237, 0.0103629801610154,
0.0118890958199121, 0.0122255131032432, 0.00536736523974168,
0.00568381024749507), mz85_LOD = c(0.0132788357415617, 0.00102839338218391,
0.00940732247246199, 0.000348774983294675, 0.00298067320381836,
0.00205641452275468), mz87_LOD = c(0.00201091935375826, 0.000776210717592691,
0.00279198390479745, 0.00141482880373932, 0.000748541000610013,
0.00281145814206216), mz93_LOD = c(0.00697408929207704, 0.0260339773064747,
0.00810401572478017, 0.00100041305177681, 0.00665795713420106,
0.00396693358778718), mz99_LOD = c(0.00402957819499522, 0.000566331511400743,
0.00155300896677703, 0.00232847303855026, 0.00464435693739678,
0.00171045854038109), mz101_LOD = c(0.00178420487408269,
0.00115586456923503, 0.00254601356943224, 0.00310985936245129,
0.00432584813531501, 0.00243251979505525), mz107_LOD = c(0.00407638866821389,
0.0229674850748965, 0.00701861441818298, 0.0116410684433383,
0.00485523640022218, 0.0155737255675545), mz111_LOD = c(0.000843805958946711,
0.00287785932050435, 0.00134575880747311, 0.000532630272225315,
0.00201047010477024, 0.00283236237275034), mz113_LOD = c(0.000636492422450974,
0.000453940678672287, 0.00108923919956853, 0.000493113580579477,
0.000200586155571694, 0.000500537860017757), mz135_LOD = c(0.00203273369486478,
0.00908905787659258, 0.000826768270592192, 0.00179533094202209,
0.00202657955605344, 0.00809631808214351), mz137_LOD = c(0.010197651904802,
0.00809757134440575, 0.00307654713824166, 0.00113203086563082,
0.00217444118117416, 0.00803526410617303), mz149_LOD = c(4.94861889361863e-05,
0.00217371652333924, 0.000952885071549479, 0.000215375843276559,
0.000171446563764392, 9.19079668394535e-05), mz155_LOD = c(0.000246712993094256,
0.00185548030033775, 9.85004369721625e-05, NA, 0.000121478907895942,
NA)), row.names = c(NA, 6L), class = "data.frame")
So, to be concrete with an example:
I want to see if mz31_fluxmax in the first row is > mz31_LOD or < -mz31_LOD.
If the condition is respected, the value of mz31_fluxmax is written to the new dataframe; otherwise an NA is written.
Then so on for the next row.
Obviously, I want to iterate this process for each column.
I didn't try any code.
I really don't know how to do this.
With tidyverse you could try the following:
library(tidyverse)
df %>%
  pivot_longer(cols = everything(),
               names_to = c("variable", ".value"),
               names_sep = "_") %>%
  mutate(result = if_else(abs(fluxmax) > LOD, fluxmax, NA_real_))
Output
# A tibble: 144 x 4
variable fluxmax LOD result
<chr> <dbl> <dbl> <dbl>
1 mz31 0.00753 0.00569 0.00753
2 mz33 -0.104 0.0125 -0.104
3 mz39 0.131 0.0302 0.131
4 mz42 -0.00644 0.00320 -0.00644
5 mz45 0.0245 0.0331 NA
6 mz47 -0.0310 0.0112 -0.0310
7 mz59 0.0309 0.0160 0.0309
8 mz61 -0.0130 0.00163 -0.0130
9 mz69 0.0208 0.00638 0.0208
10 mz71 0.00806 0.000455 0.00806
# ... with 134 more rows
Something like this? Assuming df is your dataframe:
mat1 <- df[, 1:24]
mat2 <- df[, 25:48]
# keep a flux only when |flux| > LOD; otherwise (or when the LOD is NA) set NA
mat1[is.na(mat2) | abs(mat1) <= mat2] <- NA
mat1
Note that you have NAs in the LOD columns; those entries end up NA here. Since all your values in the LOD columns are positive, the problem simplifies to: if the absolute flux is not greater than the LOD, set it to NA.