How to include function arguments in apply family? [duplicate] - r

This question already has answers here:
passing several arguments to FUN of lapply (and others *apply)
I keep running into the problem of wanting to pass a function's arguments through an apply function. I have looked around but could not find a suitable answer.
For example
I have a simple matrix like
tab <- structure(c(108.13, 108.13, 107.7, 107.66, 107.65, NA, NA, 115.56,
115.5, 115.45, NA, NA, NA, 122.72, 122.66, 124.81, 124.82, 124.87,
124.91, 124.94, NA, NA, NA, NA, 130.18), .Dim = c(5L, 5L), .Dimnames = list(
NULL, NULL))
And I want to get the minimum of each column.
I would do something like:
apply(tab, 2, min)
# 107.65 NA NA 124.81 NA
But now let's say I want to skip the NAs.
For the first column, I would do
min(tab[, 1], na.rm = TRUE)
# 107.65
But I cannot use
apply(tab, 2, min(na.rm = TRUE))
So, how am I supposed to pass arguments to a function inside apply?

We can use the vectorized colMins from matrixStats:
library(matrixStats)
colMins(tab, na.rm = TRUE)
When we are not sure how to pass the arguments, an anonymous function can be used:
apply(tab, 2, function(x) min(x, na.rm = TRUE))
Or otherwise as @ahly suggested
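The suggestion referred to above is not preserved in this copy. One common way to do it, which may or may not be what was meant, is to hand the extra arguments to apply itself: anything after FUN is forwarded to FUN through the `...` argument. A minimal sketch using the matrix from the question (`col_min` and `list_min` are illustrative names):

```r
# Same matrix as in the question
tab <- structure(c(108.13, 108.13, 107.7, 107.66, 107.65, NA, NA, 115.56,
                   115.5, 115.45, NA, NA, NA, 122.72, 122.66, 124.81, 124.82,
                   124.87, 124.91, 124.94, NA, NA, NA, NA, 130.18),
                 .Dim = c(5L, 5L))

# Arguments placed after FUN are passed on to FUN via `...`
col_min <- apply(tab, 2, min, na.rm = TRUE)
col_min
# [1] 107.65 115.45 122.66 124.81 130.18

# The same mechanism works for lapply() and sapply()
list_min <- lapply(list(a = c(1, NA), b = c(3, 2)), min, na.rm = TRUE)
```

The original attempt fails because `min(na.rm = TRUE)` is evaluated immediately, so apply receives its result rather than a function.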

Related

ifelse conditional statement based on the value before and after a particular element

How can I create a conditional statement using ifelse() based on the value of a previous or following element in a vector in R?
data <- c(NA, NA, 3, NA, 4, 6, 7, 9, NA)
ifelse(conditional statement, NA, data)
Where the conditional statement is: if the elements on either side are both NA, then it should also be NA. If it is a continuous series of numbers, they stay the same - unchanged. The output should be
NA, NA, NA, NA, 4, 6, 7, 9, NA
I have tried data[-1]==NA & data[+1]==NA as the conditional statement, but it converts everything to NA regardless.
I like to use the lead and lag functions in dplyr for this. First put your vector in a data frame.
library(dplyr)
df <- data.frame(data = data)
df %>%
transmute(result = ifelse(is.na(lag(data)) & is.na(lead(data)), NA, data))
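For reference, the attempt in the question fails for two reasons: any comparison with `== NA` returns NA, so `is.na()` is needed, and `data[-1]` drops the first element instead of shifting the vector. A rough base R sketch of the same idea without dplyr, where `prev_val` and `next_val` are illustrative names for the shifted copies:

```r
data <- c(NA, NA, 3, NA, 4, 6, 7, 9, NA)
n <- length(data)

# Shift the vector by one position in each direction, padding with NA
prev_val <- c(NA, data[-n])   # value before each element
next_val <- c(data[-1], NA)   # value after each element

# Set an element to NA when both neighbours are missing
result <- ifelse(is.na(prev_val) & is.na(next_val), NA, data)
result
# [1] NA NA NA NA  4  6  7  9 NA
```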

Extract the first x observations in each column while keeping the indexing by row in R

The following code is designed to extract the first x observations of each column. The columns are time series spanning different periods. (Put differently, it erases everything except the first x values in each column.)
The first values can be numbers followed by NAs, as long as they are at the beginning of the time series.
It is crucial that each value stays linked to its own place in the index (the first column, 'Year').
# data example
df <- data.frame("Year" = 1791:1800,
"F1" = c(NA, NA, NA, 1.2,1.3, NA, NA, NA, NA, NA),
"F2" = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
"F3" = c(NA, NA, NA, NA, NA, 0.1,0.2,0.3,0.4,0.5),
"F4" = c(NA, 3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9))
# Convert the dataframe to a list by column
long <- setNames(lapply(names(df)[-1], function(x) cbind(df[1], df[x])), names(df)[-1])
# and select only the first 3 elements after NAs in each column
mylist <- lapply(long, function(x){
head(na.omit(x), 3)
})
# or in a more concise writing ??
mylist2 <- lapply(df, function(x){
head(na.omit(cbind(df[[1]],x)), 3)
})
# Now 'mylist' (or 'mylist2') contains several vectors of different lengths,
# not very appropriate for a data frame, so let's switch to a long-format data frame
mydata <- do.call(rbind, lapply(mylist, function(x){
require(reshape2)
melt(x, id.vars="Year")
})
)
# and switch back to regular spreadsheet format
library(tidyverse)
mydataCOL <- spread(mydata, key = "variable", value = "value")
write.table(mydataCOL, "sheet1.txt")
This is complicated to apply to a list of data frames (multiple Excel files). Is there an easier way to do such operations on each column of each data frame in the list?
I'm currently trying with nested lapply():
mylist <- lapply(d, function(x){
lapply(x, function(y){
head(na.omit(cbind(x[[1]],y)), 50)
})
})
but I guess this is not the easiest way... Thanks!
If you are using the tidyverse anyway, why not go all in with Hadley's stuff?
GetTop <- function(indf){
indf %>%
pivot_longer(-Year,names_to="F") %>%
na.omit() %>%
group_by(F) %>%
top_n(3,wt=-Year) %>%
pivot_wider(names_from="F")
}
Now we can call it for one data frame:
mytops <- GetTop(df)
If you have a list of these dataframes you can use lapply to do this to each one.
allmytop <- lapply(biglist,FUN=GetTop)
That will give you a list of dataframes. Seems like you also want to join them into one fat dataframe.
fatdf <- lapply(biglist,FUN=GetTop) %>% reduce(full_join,by="Year")
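If reshaping is not needed, a base R alternative is to blank out everything except the first n non-NA values of each column in place, which keeps every value on its original Year row. This is only a sketch: `keep_first` is a hypothetical helper, and `biglist` here is a stand-in list built from two copies of the example data.

```r
df <- data.frame(Year = 1791:1800,
                 F1 = c(NA, NA, NA, 1.2, 1.3, NA, NA, NA, NA, NA),
                 F2 = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
                 F3 = c(NA, NA, NA, NA, NA, 0.1, 0.2, 0.3, 0.4, 0.5),
                 F4 = c(NA, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9))

# Hypothetical helper: keep the first n non-NA values, set the rest to NA
keep_first <- function(x, n = 3) {
  keep <- head(which(!is.na(x)), n)
  out <- rep(NA_real_, length(x))
  out[keep] <- x[keep]
  out
}

# Apply to every column except Year; n is passed through lapply's `...`
first_obs <- function(d, n = 3) {
  d[-1] <- lapply(d[-1], keep_first, n = n)
  d
}

out <- first_obs(df, n = 3)

# For a list of data frames (stand-in for the list read from several files)
biglist <- list(df, df)
all_out <- lapply(biglist, first_obs, n = 3)
```

Rows where every series ends up NA are kept, so the Year index stays complete; they can be dropped afterwards if a compact table is preferred.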

Comparing exact values with many significant digits across columns

I have a table of numbers with many, varying significant digits. I need to find the exact matches for these digits across columns - e.g.
find_mz_matches <- data.frame("mz1" = c(3.14222, 314.12003, 214.220,
254.111223, NA, NA, NA, NA, NA), "mz2" = c(3.14222, 456.2200001, NA,
NA, NA, NA, NA, NA, NA), "mz3" = c(300.112223, 456.2200001, 3.14222,
254.111223, 900.232, 476.0012503, 459.00201, 500.60402, 300.4053102))
I want to know what values are shared between mz1 and mz2, mz2 and mz3, and finally between all three columns together.
So, comparing mz1 and mz2 should yield:
mz1_v_mz2
3.14222
456.2200001
And comparing all three:
mz_all
3.14222
I have cobbled together something that almost works, but the problem is that it's rounding somewhere and my output includes numbers that are similar but not the same, e.g. 3.14222 should not match with 3.14223. It also includes NAs in the output, which is not desired.
duplicates_across1 <- find_mz_matches[find_mz_matches$mz1
%in% find_mz_matches$mz2, ]
That should work to compare the first two columns, so I figured I would just take the output and do it again for the next comparison - compare the output of duplicates_across1 to find_mz_matches$mz3. For some reason, it doesn't catch the presence of 3.14222 between all three columns and I have no idea why.
duplicates_all <- duplicates_across1[duplicates_across1$mz1
%in% find_mz_matches$mz3, ]
Here is a very lapply-y answer that compares each combination and stores the results in a list. It should be flexible if you have more than 3 columns as well. Good luck!
# Creating all combinations of columns in a list (and flattening it with unlist)
combos <- unlist(lapply(2:ncol(find_mz_matches), combn, x = find_mz_matches, simplify = FALSE), recursive = FALSE)
# Checking for common elements
common_elements <- lapply(combos, function(x) Reduce(base::intersect, x))
# Renaming the elements
names(common_elements) <- sapply(lapply(combos, names), paste, collapse = "_")
common_elements
# $mz1_mz2
# [1] 3.14222 NA
# $mz1_mz3
# [1] 3.14222 254.11122
# $mz2_mz3
# [1] 3.14222 456.22000
# $mz1_mz2_mz3
# [1] 3.14222
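Two details of that output are worth noting. NA shows up because intersect treats NA as an ordinary value, and 456.2200001 is displayed as 456.22000 only because R prints about seven significant digits by default; the stored value is unchanged. A small sketch, assuming it is acceptable to drop NAs before intersecting (`cols`, `mz1_v_mz2` and `mz_all` are illustrative names):

```r
find_mz_matches <- data.frame(
  mz1 = c(3.14222, 314.12003, 214.220, 254.111223, NA, NA, NA, NA, NA),
  mz2 = c(3.14222, 456.2200001, NA, NA, NA, NA, NA, NA, NA),
  mz3 = c(300.112223, 456.2200001, 3.14222, 254.111223, 900.232,
          476.0012503, 459.00201, 500.60402, 300.4053102))

# Drop NAs from each column first so they cannot appear as "matches"
cols <- lapply(find_mz_matches, function(x) x[!is.na(x)])

mz1_v_mz2 <- intersect(cols$mz1, cols$mz2)
mz_all <- Reduce(intersect, cols)

# Show more digits to confirm the values are stored exactly as entered
format(intersect(cols$mz2, cols$mz3), digits = 10)
```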
See Reduce():
Reduce(intersect, find_mz_matches, accumulate = TRUE)
Reduce(intersect, find_mz_matches, accumulate = TRUE, right = TRUE)
The argument accumulate is optional - it's only there to show you what's happening. For your use, you can take it out and it would result in 3.14222.
Reduce(intersect, find_mz_matches)
[1] 3.14222
I would do something like:
find_mz_matches <- data.frame("mz1" = c(3.14222, 314.12003, 214.220,
254.111223, NA, NA, NA, NA, NA), "mz2" = c(3.14222, 456.2200001, NA,
NA, NA, NA, NA, NA, NA), "mz3" = c(300.112223, 456.2200001, 3.14222,
254.111223, 900.232, 476.0012503, 459.00201, 500.60402, 300.4053102))
find_mz_matches$mz_allmz1mz2 <- ifelse(find_mz_matches$mz1 == find_mz_matches$mz2 ,find_mz_matches$mz1 , NA)
find_mz_matches$mz_allmz2mz3 <- ifelse(find_mz_matches$mz2 == find_mz_matches$mz3 ,find_mz_matches$mz2 , NA)
After that, I would combine the resulting columns if no conflict present.

Replacing NA values using a rolling window

How can I replace a NA value by the average of the previous non-NA and next non-NA values?
For example, I want to replace the first NA value by -0.873, and the 4th/5th by the average of -0.497 and 53.200.
Thanks!
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
=================== ADD ON ===================
Thank you all for answering the question! Sorry for the late response. This is only part of a data frame (10000 * 91); I took just the first 10 rows of the first column to simplify the question. I think David and MKR have the result I expect.
Here's a possible vectorized approach using base R (some steps could probably be improved, but I have no time to look into it right now)
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
# Store a boolean vector of NA locations for further use
na_vals <- is.na(x)
# Find the location of each NA relative to the non-NAs
start_ind <- findInterval(which(na_vals), which(!na_vals))
# Create a right limit
end_ind <- start_ind + 1L
# Replace zero locations with NAs
start_ind[start_ind == 0L] <- NA_integer_
# Calculate the means and replace the NAs
x[na_vals] <- rowMeans(cbind(x[!na_vals][start_ind], x[!na_vals][end_ind]), na.rm = TRUE)
x
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
This should work properly for NAs on both sides of the vector.
This function imputes values for NA in a vector based on the average of the non-NA values in a rolling window from the first element to the next element.
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
roll_impute <- function(x){
n <- length(x)
res <- x
for (i in seq_along(x)){
if (is.na(x[i])){
res[i] <- mean(rep_len(x, i+1), na.rm = TRUE )
}
}
if (is.na(x[n])) res[n] <- mean(x, na.rm = TRUE)
res
}
roll_impute(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000
roll_impute() includes code that corrects the rolling window in the case that the final element is NA, so that the vector isn't recycled. This isn't the case in your example, but is needed in order to generalize the function. Any improvements on this function would be welcome :) It does use a for loop, but doesn't grow any vectors. No simple way to avoid the for loop and rely on the structure of the objects jumps to my mind right now.
One dplyr and tidyr based solution could be:
library(dplyr)
library(tidyr)
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
data.frame(t) %>%
mutate(last_nonNA = ifelse(!is.na(t), t, NA)) %>%
mutate(next_nonNA = ifelse(!is.na(t), t, NA)) %>%
fill(last_nonNA) %>%
fill(next_nonNA, .direction = "up") %>%
mutate(t = case_when(
!is.na(t) ~ t,
!is.na(last_nonNA) & !is.na(next_nonNA) ~ (last_nonNA + next_nonNA)/2,
is.na(last_nonNA) ~ next_nonNA,
is.na(next_nonNA) ~ last_nonNA
)
) %>%
select(t)
# t
# 1 -0.8730
# 2 -0.8730
# 3 -0.4970
# 4 26.3515
# 5 26.3515
# 6 53.2000
# 7 39.6500
# 8 39.6500
# 9 39.6500
# 10 26.1000
Note: It looks a bit complicated, but it does the trick. One can achieve the same thing with a for loop.
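Since the question mentions that the real data is a large data frame, the previous/next averaging can also be wrapped in a function and applied to every column. The following is a base R sketch only: `fill_neighbor_mean` and the small data frame `dat` are hypothetical, and all columns are assumed to be numeric.

```r
# Hypothetical helper: replace each NA by the mean of the previous and next
# non-NA values; at the edges, only the one available neighbour is used
fill_neighbor_mean <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)            # nothing to impute from
  vals <- x[ok]
  prev_idx <- cumsum(ok)             # index into vals of the last non-NA so far
  next_idx <- prev_idx + !ok         # for an NA, the next non-NA is one further
  prev_idx[prev_idx == 0L] <- NA     # leading NAs have no previous value
  next_idx[next_idx > length(vals)] <- NA  # trailing NAs have no next value
  rowMeans(cbind(vals[prev_idx], vals[next_idx]), na.rm = TRUE)
}

x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
filled <- fill_neighbor_mean(x)
filled
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000

# Applying it to every column of a (hypothetical) data frame
dat <- data.frame(a = x, b = rev(x))
dat[] <- lapply(dat, fill_neighbor_mean)
```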

Replace values in a dataframe based on another factor which contains NA's in R

I have a dataframe which contains (among other things) a numeric column with a concentration, and a factor column with a status flag. This status flag contains NA's.
Here's an example
df<-structure(list(conc = c(101.769, 1.734, 62.944, 92.697, 25.091, 27.377, 24.343, 55.084, 0.335, 23.280), status = structure(c(NA, NA, NA, NA, NA, NA, 2L, NA, 1L, NA), .Label = c("<LLOQ", "NR"), class = "factor")), .Names = c("conc", "status"), row.names = c(NA, -10L), class = "data.frame")
I want to replace the concentration column with a string for some values of the flag column, or with the concentration value formatted to a certain number of significant digits.
When I try this
ifelse(df$status=="NR","NR",df$conc)
The NA's in the status flag don't trigger either the true or false condition (and return NA) - as the documentation suggests it will. I could loop over the rows and use IF then else on each one but this seems inefficient.
Am I missing something ? I've tried as.character(df$status) as well which doesn't work. My mojo must be getting low....
Use %in% instead of == :
ifelse(df$status %in% "NR","NR", df$conc)
Side-by-side comparison of the two methods:
data.frame(df, ph = ifelse(df$status=="NR","NR",df$conc), mp = ifelse(df$status %in% "NR","NR",df$conc))
Check out ?match for more information - I'm not sure I could explain it well.
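To expand on that pointer: `x %in% table` is defined in terms of match() with `nomatch = 0`, so it always returns TRUE or FALSE, whereas `==` propagates NA. A tiny illustration with a made-up three-element factor (`status` here is not the column from the question):

```r
status <- factor(c(NA, "NR", "<LLOQ"))

eq_res <- status == "NR"      # NA is propagated
in_res <- status %in% "NR"    # NA is simply "not a match"

eq_res
# [1]    NA  TRUE FALSE
in_res
# [1] FALSE  TRUE FALSE

# ifelse() returns NA wherever its test is NA, which is why == gave NA rows
ifelse(eq_res, "NR", "value")
# [1] NA      "NR"    "value"
```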
You must explicitly test for NA, so you can use:
ifelse(df$status=="NR" | is.na(df$status),"NR",df$conc) # gives you NR for NA
or
ifelse(df$status=="NR" & !is.na(df$status),"NR",df$conc) # gives you df$conc for NA
How about testing for missingness:
ifelse(is.na(df$status), df$conc, as.character(df$status))
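The question also asks for the concentration to be formatted to a number of significant digits when no flag is set. One possible way to combine that with the missingness test is sketched below, assuming three significant digits and using a shortened made-up sample rather than the full data frame; note that signif() followed by as.character() does not pad trailing zeros.

```r
conc <- c(101.769, 1.734, 62.944, 0.335)
status <- factor(c(NA, NA, "NR", "<LLOQ"), levels = c("<LLOQ", "NR"))

# No flag: rounded concentration as text; otherwise: the flag itself
out <- ifelse(is.na(status),
              as.character(signif(conc, 3)),
              as.character(status))
out
# [1] "102"   "1.73"  "NR"    "<LLOQ"
```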
