Delete/overwrite rows by partial matching in R

I need to check whether rows are partially duplicated and delete/overwrite those where two columns match a different row that has values in all three columns. One problem is that the "real" dataframe contains a couple of list columns, which makes some operations unfeasible. Ideally, any row with a match would be checked independently of the number of columns, meaning only the row with the most non-NA values (out of all rows with matching column values) is kept.
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA # this row has only 2 values, which match values from row 11, but the last value is NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14 # this row has values in all 3 columns
12 14 NA NA
13 8 11 15 # so does this row
14 16 NA NA
15 17 NA NA
16 18 NA NA
17 19 NA NA
18 20 NA NA
The result should be the same data frame, just without row 7 (or with row 7 overwritten by row 11).
This should be easy to do, but for some reason I didn't manage it (except with a convoluted for loop that is hard to generalize should more columns be added at a later time). Is there a straightforward way to do this?
dput of above df:
structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L), o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA), o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)), row.names = c(NA, -18L), class = "data.frame")
If there is already an answer for something like this, please let me know.

I thought of using dplyr:
library(dplyr)
df %>%
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>%
  group_by(o1, o2) %>%
  slice_min(count_na) %>%
  arrange(rn) %>%
  ungroup() %>%
  select(o1:o3)
This returns
# A tibble: 17 x 3
o1 o2 o3
<int> <int> <int>
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 10 NA NA
8 12 NA NA
9 13 NA NA
10 5 9 14
11 14 NA NA
12 8 11 15
13 16 NA NA
14 17 NA NA
15 18 NA NA
16 19 NA NA
17 20 NA NA
This solution is based on the following ideas:
For every row we count the number of NAs in this row.
We group for o1 and o2 to create groups of data that belong together. Here is a possible flaw: perhaps it is a better approach to group by o1 only or do some other grouping. This depends on the structure of your data: should 1, <NA>, <NA> be overwritten by 1, 2, <NA>?
After grouping, we select the row with the smallest number of NAs.
Finally we do some clean up: removing the auxiliary columns, arranging the data and ungrouping.
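If the answer to that question is yes, a variant of the same pipeline that groups by o1 only would do it. This is a sketch, with the assumption that keeping the first row on ties (with_ties = FALSE) is acceptable:

```r
library(dplyr)

df %>%
  mutate(rn = row_number(),
         count_na = rowSums(across(o1:o3, is.na))) %>%
  # group by o1 only, so 1, NA, NA is overwritten by 1, 2, NA
  group_by(o1) %>%
  # keep the row with the fewest NAs; the first row wins on ties
  slice_min(count_na, with_ties = FALSE) %>%
  arrange(rn) %>%
  ungroup() %>%
  select(o1:o3)
```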

A partial solution to detect the duplicates; it remains to specify which rows to delete (I ran out of time). I've gone ahead and "duplicated" a couple more rows.
df=read.table(text="
o1 o2 o3
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 NA NA
5 6 NA NA
6 7 NA NA
7 5 9 NA
8 10 NA NA
9 12 NA NA
10 13 NA NA
11 5 9 14
12 14 NA NA
13 8 11 15
14 16 NA NA
15 7 1 2
16 18 NA NA
17 7 1 3
18 20 NA NA",h=T)
The main trick is to calculate a distance matrix and check which rows have a distance of zero, since dist automatically computes pairwise distances, excluding missing values.
tmp <- as.matrix(dist(df))
diag(tmp) <- NA
tmp[lower.tri(tmp)] <- NA
tod <- data.frame(which(tmp == 0, arr.ind = TRUE))
resulting in
row col
X7 7 11
X6 6 15
X6.1 6 17
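To finish this off, one possible rule (an assumption, not part of the answer above) is: for each zero-distance pair, drop whichever row has more NAs, with ties dropping the earlier row. Using df and tod from above:

```r
# count NAs per row, then pick the "worse" row of each zero-distance pair
na_count <- rowSums(is.na(df))
drop <- apply(tod, 1, function(p) {
  if (na_count[p[1]] >= na_count[p[2]]) p[1] else p[2]
})
df_clean <- df[-unique(drop), ]
```

On the example data this removes rows 6 and 7. If tod can be empty, guard the subsetting with a length(drop) > 0 check first.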

Here's another way which considers all columns; it should work with any number of columns, regardless of their names or positions.
library(dplyr)
mydf <- structure(list(o1 = c(1L, 2L, 3L, 4L, 6L, 7L, 5L, 10L, 12L, 13L,
5L, 14L, 8L, 16L, 17L, 18L, 19L, 20L),
o2 = c(NA, NA, NA, NA,
NA, NA, 9L, NA, NA, NA, 9L, NA, 11L, NA, NA, NA, NA, NA),
o3 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 14L, NA, 15L, NA, NA, NA,
NA, NA)),
row.names = c(NA, -18L),
class = "data.frame")
columns <- names(mydf)
dummy_cols <- paste0(columns, "_dummy")
mydf %>%
  # duplicate the dataframe
  cbind(mydf %>% `names<-`(dummy_cols)) %>%
  # arrange across all columns
  arrange(across(all_of(columns))) %>%
  # fill NAs downwards
  tidyr::fill(all_of(dummy_cols), .direction = "down") %>%
  # create a dummy ID
  tidyr::unite(id_dummy, all_of(dummy_cols), sep = "") %>%
  # group by the id
  group_by(id_dummy) %>%
  # get the first row of each
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(all_of(columns))
P.S. also replaces 1 - NA - NA with 1 - 2 - NA and replaces 1 - NA - NA with 1 - NA - 3


How do I conduct a row sums for loop across specific columns using [row,col] distance indexing

Re-purposing a previous question but hopefully with more clarity + dput().
I'm working with the data below, which is similar to a key:value pairing in that every "type" variable has a corresponding variable containing a "total" value for each row.
structure(list(type3a1 = c(2L, 6L, 5L, NA, 1L, 3L, NA), type3b1 = c(NA,
3L, 1L, 5L, 6L, 3L, NA), type3a1_arc = c(1L, 2L, 5L, 4L, 5L,
4L, NA), type3b1_arc = c(2L, 2L, 3L, 4L, 1L, 1L, NA), testing = c("Yes",
NA, "No", "No", NA, "Yes", NA), cars = c(5L, 12L, 1L, 6L, NA,
2L, NA), house = c(5L, 4L, 0L, 5L, 0L, 10L, NA), type3a2 = c(50L,
NA, 20L, 4L, 5L, NA, NA), type3b2 = c(10L, 10L, 15L, 1L, 3L,
1L, NA), type3a2_arc = c(50L, 25L, 30L, 10L, NA, 10L, NA), type3b2_arc = c(NA,
20L, 10L, 50L, 5L, 1L, NA), X = c(NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-7L))
I am trying to do a summation loop that goes through every row, and scans each "type" variable (i.e. type3a1, type3b1, type3c1, etc.). Each "type" has a matching variable that contains its "total" value (i.e. type3a2, type3b2, type3c2, etc.)
Process:
Check if the "type" variable contains values in (1,2,3,4 or 5).
If that type column's [row,col] value is in (1:5), then move 7 columns right from the current [row,col] index to grab its "total" value for summation.
After checking every "type" variable, sum all the gathered "total" values and plop into a new overall totals column.
Essentially, I want to end up with a total value like the one below:
The first row shows a total of 100 since type3b1 has a value of NA, which is not in (1:5). Hence its total pairing (i.e. +7 columns away, cell value 10) is not counted in the row summation.
My approach this time, compared to a previous attempt, is a for loop that relies on indexing by how far one column is from another. I had a lot of trouble approaching this with dplyr / mutate methods, mostly because of the variability in the type:total name pairings (i.e. no pattern in the naming conventions; very messy data)...
# Matching pairing variables (i.e. type_vars:"type3a1" with total_vars:"type3a2")
type_vars <- c("type3a1", "type3b1", "type3a1_arc", "type3b1_arc")
total_vars <- c("type3a2", "type3b2", "type3a2_arc", "type3b2_arc")
valid_list <- c(1,2,3,4,5)
totals = list()
for(row in 1:nrow(df)) {
  sum = 0
  for(col in type_vars) {
    if (df[row, col] %in% valid_list) {
      sum <- sum + (df[row, col+7])
    }
  }
  totals <- sum
}
I'm hoping this is the right approach but in either case, the code gives me an error at the sum <- sum + (df[row,col+7]) line where: Error in col + 7 : non-numeric argument to binary operator.
It's weird since if I were to do this manually and just indicate df[1,1+2], it gives me a value of "1" which is the value of the intersect [row1, type3a1_arc] in the df above.
Any help or assistance would be appreciated.
The error you received is because col in your original for loop iterates through type_vars, which is a character vector. One way around this is to reference the column indices of type_vars using the which() function.
Here is a solution with just a couple of modifications to your for loop:
totals <- c()
for(row in 1:nrow(df)) {
  sum = 0
  for(col in which(names(df) %in% type_vars)) {
    if (df[row, col] %in% valid_list) {
      sum <- sum(c(sum, df[row, col + 7]), na.rm = TRUE)
    }
  }
  totals[row] <- sum
}
df$totals <- totals
df$totals
[1] 100 55 75 61 10 12 0
Here is one way with tidyverse. Loop across the columns whose names match 'type' followed by one or more digits (\\d+), a letter ([a-z]), and the number 2. For each of these, get the corresponding "type" column name by replacing the digit 2 in the current column name (cur_column()) with 1, and look its values up with cur_data(). Create a logical vector with %in%, negate it (!), and replace the values whose "type" is not in 1:5 with NA. Then wrap with rowSums and na.rm = TRUE to get the total.
library(dplyr)
library(stringr)
df1 %>%
  mutate(total = rowSums(across(matches('^type\\d+[a-z]2'), ~
      replace(.x, !cur_data()[[str_replace(cur_column(),
        "(\\d+[a-z])\\d+", "\\11")]] %in% 1:5, NA)), na.rm = TRUE))
-output
type3a1 type3b1 type3a1_arc type3b1_arc testing cars house type3a2 type3b2 type3a2_arc type3b2_arc X total
1 2 NA 1 2 Yes 5 5 50 10 50 NA NA 100
2 6 3 2 2 <NA> 12 4 NA 10 25 20 NA 55
3 5 1 5 3 No 1 0 20 15 30 10 NA 75
4 NA 5 4 4 No 6 5 4 1 10 50 NA 61
5 1 6 5 1 <NA> NA 0 5 3 NA 5 NA 10
6 3 3 4 1 Yes 2 10 NA 1 10 1 NA 12
7 NA NA NA NA <NA> NA NA NA NA NA NA NA 0
Or we may also use two across calls (assuming the columns are in order)
df1 %>%
  mutate(total = rowSums(replace(across(8:11),
    !across(1:4, ~ .x %in% 1:5), NA), na.rm = TRUE))
-output
type3a1 type3b1 type3a1_arc type3b1_arc testing cars house type3a2 type3b2 type3a2_arc type3b2_arc X total
1 2 NA 1 2 Yes 5 5 50 10 50 NA NA 100
2 6 3 2 2 <NA> 12 4 NA 10 25 20 NA 55
3 5 1 5 3 No 1 0 20 15 30 10 NA 75
4 NA 5 4 4 No 6 5 4 1 10 50 NA 61
5 1 6 5 1 <NA> NA 0 5 3 NA 5 NA 10
6 3 3 4 1 Yes 2 10 NA 1 10 1 NA 12
7 NA NA NA NA <NA> NA NA NA NA NA NA NA 0
Or using base R
df1$total <- rowSums(mapply(\(x, y) replace(y, !x %in% 1:5, NA),
                            df1[1:4], df1[8:11]), na.rm = TRUE)
df1$total
[1] 100 55 75 61 10 12 0
Here’s a base R solution:
valid_vals <- sapply(type_vars, \(col) df[, col] %in% valid_list)
temp <- df[, total_vars]
temp[!valid_vals] <- NA
df$total <- rowSums(temp, na.rm = TRUE)
df$total
# [1] 100 55 75 61 10 12 0

Rowsums on two vectors of paired columns but conditional on specific values

I have a dataset that looks like the one below where there are three "pairs" of columns pertaining to the type (datA, datB, datC), and the total for each type (datA_total, datB_total, datC_total):
structure(list(datA = c(1L, NA, 5L, 3L, 8L, NA), datA_total = c(20L,
30L, 40L, 15L, 10L, NA), datB = c(5L, 5L, NA, 6L, 1L, NA), datB_total = c(80L,
10L, 10L, 5L, 4L, NA), datC = c(NA, 4L, 1L, NA, 3L, NA), datC_total = c(NA,
10L, 15L, NA, 20L, NA)), class = "data.frame", row.names = c(NA,
-6L))
# datA datA_total datB datB_total datC datC_total
#1 1 20 5 80 NA NA
#2 NA 30 5 10 4 10
#3 5 40 NA 10 1 15
#4 3 15 6 5 NA NA
#5 8 10 1 4 3 20
#6 NA NA NA NA NA NA
I'm trying to create a rowSums across each row to determine the total visits across the data types, conditional on whether each type meets the criterion of having ANY score in the range (1-5).
Here is my thought process:
Select only the variables that are the data types (i.e. datA, datB, datC)
Across each row based on EACH data type, determine if that data type meets a criteria (i.e. datA -> does it contain (1,2,3,4,5))
If that data type column does contain one of the 5 values above ^, then look to its paired total variable and ready that value to be rowSummed (i.e. datA -> does it contain (1,2,3,4,5)? -> if yes, then grab datA_total value = 20).
The goal is to end up with a total column like below:
# datA datA_total datB datB_total datC datC_total overall_total
#1 1 20 5 80 NA NA 100
#2 NA 30 5 10 4 10 20
#3 5 40 NA 10 1 15 55
#4 3 15 6 5 NA NA 15
#5 8 10 1 4 3 20 24
#6 NA NA NA NA NA NA 0
You'll notice that row #2 only contains a total of 20 even though datA_total is 30. This is a result of the conditional selection: datA for row #2 contains NA rather than one of the five scores (1,2,3,4,5), so the datA_total of 30 was not included in the rowSums calculation.
My code below shows the vectors I created and my attempt at a conditional rowSums, but I end up getting an error from mutate. I'm not sure how to integrate the "conditional pairing" portion of this problem:
type_vars <- c("datA", "datB", "datC")
type_scores <- c("1", "2", "3", "4", "5")
type_visits <- c("datA_total", "datB_total", "datC_total")
df <- df %>%
mutate(overall_total = rowSums(all_of(type_visits[type_vars %in% type_scores])))
Any help/tips would be appreciated
dplyr's across should do the job.
library(dplyr)
# copying your tibble
data <-
  tibble(
    datA = c(1, NA, 5, 3, 8, NA),
    datA_total = c(20, 30, 40, 15, 10, NA),
    datB = c(5, 5, NA, 6, 1, NA),
    datB_total = c(80, 10, 10, 5, 4, NA),
    datC = c(NA, 4, 1, NA, 3, NA),
    datC_total = c(NA, 10, 15, NA, 20, NA)
  )
data %>%
  mutate(across(paste0('dat', c('A', 'B', 'C')),
                \(x) (x %in% 1:5) * get(paste0(cur_column(), '_total')),
                .names = "{col}_aux")) %>%
  rowwise() %>%
  mutate(overall_total = sum(across(ends_with('aux')), na.rm = TRUE)) %>%
  select(any_of(c(names(data), 'overall_total')))
# A tibble: 6 × 7
datA datA_total datB datB_total datC datC_total overall_total
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 5 80 NA NA 100
2 NA 30 5 10 4 10 20
3 5 40 NA 10 1 15 55
4 3 15 6 5 NA NA 15
5 8 10 1 4 3 20 24
6 NA NA NA NA NA NA 0
First, we create an 'aux' column for each dat: it is 0 if dat is not within 1:5, and dat_total otherwise. Then we sum, ignoring NAs.
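The same conditional masking can also be written in base R without rowwise, by building a logical validity matrix and blanking the totals it rejects. A sketch against the data tibble above:

```r
type_vars <- c("datA", "datB", "datC")
# TRUE where the type column holds a score in 1:5 (NA counts as FALSE)
valid <- sapply(type_vars, function(nm) data[[nm]] %in% 1:5)
totals <- as.matrix(data[paste0(type_vars, "_total")])
totals[!valid] <- NA          # mask totals whose paired type fails the check
data$overall_total <- rowSums(totals, na.rm = TRUE)
```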

Impute missing values with a value from previous month (if exists)

I have a dataframe with more than 100 000 rows and 30 000 unique ids.
My aim is to fill all the NAs among the different columns if there is a value from the previous month and the same id. However, most of the times the previous recorded value is from more than a month ago. Those NAs I would like to leave untouched.
The id column and the date column do not have NAs.
Here is an example of the data I have:
df3
id oxygen gluco dias bp date
1 0,25897842 0,20201604 0,17955655 0,14100962 31.7.2019
2 NA NA 0,38582622 0,12918231 31.12.2014
2 0,35817147 0,32943499 NA 0,43667462 30.11.2018
2 0,68557053 0,42898807 0,93897514 NA 31.10.2018
2 NA NA 0,99899076 0,44168223 31.7.2018
2 0,43848054 0,38604586 NA NA 30.4.2013
2 0,15823254 0,06216771 0,07829624 0,69755251 31.1.2016
2 NA NA 0,61645303 NA 29.2.2016
2 0,94671363 0,50682091 0,96770222 0,97403356 31.5.2018
3 NA 0,77352235 0,660479 0,11554399 30.4.2019
3 0,15567703 NA 0,4553325 NA 31.3.2017
3 NA NA 0,22181609 0,08527658 30.9.2017
3 0,93660763 NA NA NA 31.3.2018
3 0,73416759 NA NA 0,78501791 30.11.2018
3 NA NA NA NA 28.2.2019
3 0,84525106 0,54360374 NA 0,40595426 31.8.2014
3 0,76221263 0,62983336 0,84592719 0,10640734 31.8.2013
4 NA 0,29108942 0,3863479 NA 31.1.2018
4 0,74075742 NA 0,38117415 0,58849266 30.11.2018
4 0,09400641 0,68860814 NA 0,88895224 31.8.2014
4 0,72202944 0,49901387 0,19967415 NA 31.8.2018
4 0,98205262 0,85213969 0,34450998 0,98962306 30.11.2013
This is the last code implementation that I have tried:
df3 %>%
  group_by(id) %>%
  mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))
But apparently "mutate_all() ignored the following grouping variables:
Column id"
You can use the tidyverse for that. Here's an approach:
Change the date column to class Date, then order by date
Create Ym from date by dropping the day, keeping year and month only
Get the time difference in months, mo
Flag the rows that are at most one month apart
Build groups with cumsum over the inverted flag logic
Fill the rows within the same id and group
library(dplyr)
library(tidyr)
library(lubridate)
df$date <- as.Date(df$date, format="%d.%m.%Y")
df %>%
  arrange(date) %>%
  mutate(
    Ym = ym(strftime(date, "%Y-%m")),
    mo = interval(Ym, lag(Ym, default = as.Date("1970-01-01"))) / months(1),
    flag = cumsum(!(mo > -2 & mo < 1))) %>%
  group_by(id, flag) %>%
  fill(names(.), .direction = "down") %>%
  ungroup() %>%
  select(-c("Ym", "mo", "flag")) %>%
  print(n = nrow(.))
Output
# A tibble: 22 × 6
id oxygen gluco dias bp date
<int> <chr> <chr> <chr> <chr> <date>
1 2 0,43848054 0,38604586 NA NA 2013-04-30
2 3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
3 4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
4 3 0,84525106 0,54360374 NA 0,40595426 2014-08-31
5 4 0,09400641 0,68860814 NA 0,88895224 2014-08-31
6 2 NA NA 0,38582622 0,12918231 2014-12-31
7 2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
8 2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
9 3 0,15567703 NA 0,4553325 NA 2017-03-31
10 3 NA NA 0,22181609 0,08527658 2017-09-30
11 4 NA 0,29108942 0,3863479 NA 2018-01-31
12 3 0,93660763 NA NA NA 2018-03-31
13 2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
14 2 NA NA 0,99899076 0,44168223 2018-07-31
15 4 0,72202944 0,49901387 0,19967415 NA 2018-08-31
16 2 0,68557053 0,42898807 0,93897514 NA 2018-10-31
17 2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
18 3 0,73416759 NA NA 0,78501791 2018-11-30
19 4 0,74075742 NA 0,38117415 0,58849266 2018-11-30
20 3 NA NA NA NA 2019-02-28
21 3 NA 0,77352235 0,660479 0,11554399 2019-04-30
22 1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31
Data
df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842",
NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254",
NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759",
NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641",
"0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499",
"0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091",
"0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336",
"0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"),
dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076",
NA, "0,07829624", "0,61645303", "0,96770222", "0,660479",
"0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719",
"0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223",
NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658",
NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266",
"0,88895224", NA, "0,98962306"), date = structure(c(18108,
16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016,
17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865,
16313, 17774, 16039), class = "Date")), row.names = c(NA,
-22L), class = "data.frame")

Select values in data frame 1 to a new data frame if certain criteria is met

I am trying to do the following:
take the values from the ScanNo and Intensity columns of df1, if the m/z value meets (df1['m/z'] >= 126.126226) & (df1['m/z'] <= 126.129226), into the corresponding ScanNo and TMT126 columns in df2;
take the values from the ScanNo and Intensity columns of df1, if the m/z value meets (df1['m/z'] >= 127.123261) & (df1['m/z'] <= 127.126261), and put the ScanNo and Intensity values from df1 into the corresponding ScanNo and TMT127 columns in df2;
etc.
df1
ScanNo m/z Intensity
6 3 126.9017 499.1501
7 3 127.2447 592.0988
8 3 131.0728 576.3497
9 3 131.1089 632.2596
227 5 126.8965 658.6285
228 5 126.9355 650.5634
229 5 128.7293 606.1353
404 7 127.6725 651.5209
405 7 128.9860 615.9063
556 9 128.2417 612.7980
557 9 129.5913 615.2646
749 12 129.7950 579.4946
820 13 128.6606 699.6893
821 13 130.1904 632.3969
822 13 130.3656 561.7806
881 14 131.1699 617.8976
969 16 128.9069 765.4885
970 16 131.0128 628.3944
1200 18 129.1965 579.4517
1324 19 127.9362 588.1160
1407 20 131.5393 605.0532
df2
ScanNo TMT126 TMT127 etc
Does anyone know how to do that using R? Thanks!
Using within with ifelse inside. You probably want NAs where the values are not inside the ranges. I create a simplified m.z column to demonstrate.
df2 <- within(df1, {
  TMT126 <- ifelse(m.z >= 1 & m.z <= 2, m.z, NA)
  TMT127 <- ifelse(m.z >= 3 & m.z <= 4, m.z, NA)
  TMT128 <- ifelse(m.z >= 5 & m.z <= 6, m.z, NA)
  rm(m.z, Intensity)
})
df2
# ScanNo TMT128 TMT127 TMT126
# 1 3 NA NA 2
# 2 3 NA 3 NA
# 3 3 NA 3 NA
# 4 3 NA NA 2
# 5 3 NA 4 NA
# 6 5 NA 4 NA
# 7 5 6 NA NA
# 8 5 6 NA NA
# 9 5 5 NA NA
# 10 7 5 NA NA
# 11 9 NA 4 NA
# 12 13 NA NA 2
# 13 13 NA NA 2
# 14 13 6 NA NA
# 15 13 NA 4 NA
# 16 16 5 NA NA
# 17 16 NA NA 2
# 18 16 NA NA 1
# 19 16 NA 4 NA
# 20 19 NA 4 NA
Data:
df1 <- structure(list(ScanNo = c(3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L,
7L, 9L, 13L, 13L, 13L, 13L, 16L, 16L, 16L, 16L, 19L), m.z = c(2L,
3L, 3L, 2L, 4L, 4L, 6L, 6L, 5L, 5L, 4L, 2L, 2L, 6L, 4L, 5L, 2L,
1L, 4L, 4L), Intensity = c(499.050819190312, 502.115755613237,
498.921830630967, 500.373553890647, 498.659124958938, 500.670703826751,
499.295448634045, 499.948336887528, 499.49054987242, 500.160221846888,
500.036135738485, 500.946913174943, 500.580928969496, 498.996895445679,
496.507093594431, 500.788140622824, 500.167440904356, 499.120163471469,
497.046420199033, 499.682652479155)), row.names = c(NA, -20L), class = "data.frame")
Here is a function to help you work with your mass spec data. It uses dplyr functions (and assumes the m/z column has been given the syntactic name m_z):
library(dplyr)
select_scans <- function(data, mz_min, mz_max) {
  data %>%
    # convert all columns to numeric if needed
    mutate(across(everything(), as.numeric)) %>%
    # keep only the m/z values you want
    filter(between(m_z, mz_min, mz_max)) %>%
    # keep only the columns you want
    select(ScanNo, Intensity) %>%
    # rename the intensity column like you want, e.g. TMT126
    rename(!!paste0("TMT", round(mean(c(mz_min, mz_max)))) := Intensity)
}
So, you run
df126 <- select_scans(df1, 126.126226, 126.129226)
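To build the full df2 with one column per channel, one option (a sketch, not from the answer above; the window bounds for further channels are placeholders you would fill in from your instrument) is to filter each window in base R and outer-join the results on ScanNo. Note the m/z column may be read in as m.z depending on check.names:

```r
channels <- list(TMT126 = c(126.126226, 126.129226),
                 TMT127 = c(127.123261, 127.126261))  # extend as needed
per_channel <- lapply(names(channels), function(ch) {
  w <- channels[[ch]]
  hits <- df1[df1[["m/z"]] >= w[1] & df1[["m/z"]] <= w[2],
              c("ScanNo", "Intensity")]
  names(hits)[2] <- ch  # the Intensity column takes the channel name
  hits
})
# full outer join on ScanNo, so scans missing a channel get NA
df2 <- Reduce(function(x, y) merge(x, y, by = "ScanNo", all = TRUE), per_channel)
```

This assumes each scan has at most one peak per window; otherwise aggregate (e.g. sum the intensities per scan) before merging.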

Tackling missing data in R

I have encountered this problem in a project that I'm currently doing.
I have a sparse dataframe and I need to calculate the difference between the first and the last observation per row under some conditions:
Conditions:
If the row only contains NA's then the difference is 0.
If the row contains only 1 observation then the difference is 0.
If row elements (>= 2) are non-NA's then their difference is the difference between the first and the last (tail - head).
The dataframe that I have:
S1 S2 S3 S4 S5
1 NA NA NA NA NA
2 NA 3 NA 5 NA
3 1 NA NA NA 5
4 1 NA 2 NA 7
5 2 NA NA NA NA
6 NA NA 3 4 NA
7 NA NA 3 NA NA
The dataframe that I need:
S1 S2 S3 S4 S5 diff
1 NA NA NA NA NA 0
2 NA 3 NA 5 NA 2
3 1 NA NA NA 5 4
4 1 NA 2 NA 7 6
5 2 NA NA NA NA 0
6 NA NA 3 4 NA 1
7 NA NA 3 NA NA 0
What I've written up till now:
last_minus_first <- function(x, y = na.omit(x)) tail(y, 1) - y[1]
But it doesn't handle the case where the row contains only NAs.
Any help would be much appreciated.
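For what it's worth, a minimal guard on that helper covers both special cases (a sketch along the lines of the function above):

```r
last_minus_first <- function(x) {
  y <- na.omit(x)
  # all-NA or single-observation rows get a difference of 0
  if (length(y) < 2) 0 else tail(y, 1) - y[1]
}
df$diff <- apply(df, 1, last_minus_first)
```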
I would suggest using a defined function with apply(). Here the code:
#Data
df <- structure(list(S1 = c(NA, NA, 1L, 1L, 2L, NA, NA), S2 = c(NA,
3L, NA, NA, NA, NA, NA), S3 = c(NA, NA, NA, 2L, NA, 3L, 3L),
S4 = c(NA, 5L, NA, NA, NA, 4L, NA), S5 = c(NA, NA, 5L, 7L,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
Code:
#Function
myown <- function(x) {
  #Check NA
  i <- sum(!is.na(x))
  #Compute
  if (i <= 1) {
    y <- 0
  } else {
    #Detect positions
    j1 <- max(which(!is.na(x)))
    j2 <- min(which(!is.na(x)))
    #Diff
    y <- x[j1] - x[j2]
  }
  return(y)
}
#Apply function by row
df$NewVar <- apply(df, 1, myown)
Output:
S1 S2 S3 S4 S5 NewVar
1 NA NA NA NA NA 0
2 NA 3 NA 5 NA 2
3 1 NA NA NA 5 4
4 1 NA 2 NA 7 6
5 2 NA NA NA NA 0
6 NA NA 3 4 NA 1
7 NA NA 3 NA NA 0
Here's an easier (in my mind) way to handle this, using rowwise from the dplyr package to do calculations along rows.
df %>%
  dplyr::rowwise() %>%
  dplyr::mutate(max_pop = max(which(!is.na(dplyr::c_across(S1:S5)))),
                min_pop = min(which(!is.na(dplyr::c_across(S1:S5)))),
                diff = tidyr::replace_na(dplyr::c_across()[max_pop] - dplyr::c_across()[min_pop], 0))
I've broken that mutate call down into the various parts to show what we're doing, but essentially, it goes across all columns in a row to find the last populated column (max_pop), the first populated column (min_pop) and then uses those values to retrieve the values therein.
You have to specify columns for max_pop and min_pop above because creating new interim columns affects the column indexing. c_across() defaults to using all columns, though, so you can actually do this all in one mutate call without specifying any columns.
df %>%
  rowwise() %>%
  mutate(diff = replace_na(c_across()[max(which(!is.na(c_across())))] - c_across()[min(which(!is.na(c_across())))], 0))
A vectorized option in base R would be to extract the values based on row/column index and then subtract:
df1$NewVar <- df1[cbind(seq_len(nrow(df1)), max.col(!is.na(df1), 'last'))] -
  df1[cbind(seq_len(nrow(df1)), max.col(!is.na(df1), 'first'))]
df1$NewVar[is.na(df1$NewVar)] <- 0
df1
# S1 S2 S3 S4 S5 NewVar
#1 NA NA NA NA NA 0
#2 NA 3 NA 5 NA 2
#3 1 NA NA NA 5 4
#4 1 NA 2 NA 7 6
#5 2 NA NA NA NA 0
#6 NA NA 3 4 NA 1
#7 NA NA 3 NA NA 0
data
df1 <- structure(list(S1 = c(NA, NA, 1L, 1L, 2L, NA, NA), S2 = c(NA,
3L, NA, NA, NA, NA, NA), S3 = c(NA, NA, NA, 2L, NA, 3L, 3L),
S4 = c(NA, 5L, NA, NA, NA, 4L, NA), S5 = c(NA, NA, 5L, 7L,
NA, NA, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7"))
