I have a table of numbers with many, varying significant digits. I need to find exact matches for these values across columns, e.g.
find_mz_matches <- data.frame(
  mz1 = c(3.14222, 314.12003, 214.220, 254.111223, NA, NA, NA, NA, NA),
  mz2 = c(3.14222, 456.2200001, NA, NA, NA, NA, NA, NA, NA),
  mz3 = c(300.112223, 456.2200001, 3.14222, 254.111223, 900.232,
          476.0012503, 459.00201, 500.60402, 300.4053102))
I want to know what values are shared between mz1 and mz2, mz2 and mz3, and finally between all three columns together.
So, comparing mz1 and mz2 should yield:
mz1_v_mz2
3.14222
456.2200001
And comparing all three:
mz_all
3.14222
I have cobbled together something that almost works, but the problem is that it's rounding somewhere and my output includes numbers that are similar but not the same, e.g. 3.14222 should not match with 3.14223. It also includes NAs in the output, which is not desired.
duplicates_across1 <- find_mz_matches[
  find_mz_matches$mz1 %in% find_mz_matches$mz2, ]
That should work to compare the first two columns, so I figured I would just take the output and do it again for the next comparison, comparing the output of duplicates_across1 to find_mz_matches$mz3. For some reason, it doesn't catch the presence of 3.14222 across all three columns, and I have no idea why.
duplicates_all <- duplicates_across1[
  duplicates_across1$mz1 %in% find_mz_matches$mz3, ]
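One likely source of the stray NAs, for what it's worth: NA %in% x is TRUE whenever x itself contains an NA, so the all-NA rows of mz1 survive the first filter. A sketch of an explicit guard:
duplicates_across1 <- find_mz_matches[!is.na(find_mz_matches$mz1) &
                                        find_mz_matches$mz1 %in% find_mz_matches$mz2, ]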
Here is a very lapply-y answer that compares each combination and stores the results in a list. It should be flexible if you have more than 3 columns as well. Good luck!
# Creating all combinations of columns in a list (and flattening it with unlist)
combos <- unlist(lapply(2:ncol(find_mz_matches), combn, x = find_mz_matches,
                        simplify = FALSE), recursive = FALSE)
# Checking for common elements
common_elements <- lapply(combos, function(x) Reduce(base::intersect, x))
# Renaming the elements
names(common_elements) <- sapply(lapply(combos, names), paste, collapse = "_")
common_elements
$mz1_mz2
[1] 3.14222 NA
$mz1_mz3
[1] 3.14222 254.11122
$mz2_mz3
[1] 3.14222 456.22000
$mz1_mz2_mz3
[1] 3.14222
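If the NAs in that output are unwanted, a small variant (same combos list as above) strips them from each column before intersecting:
common_no_na <- lapply(combos, function(x) Reduce(intersect, lapply(x, na.omit)))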
See Reduce():
Reduce(intersect, find_mz_matches, accumulate = T)
Reduce(intersect, find_mz_matches, accumulate = T, right = T)
The argument accumulate is optional; it's only there to show you what's happening. For your use, you can take it out, and the result would be 3.14222.
Reduce(intersect, find_mz_matches)
[1] 3.14222
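For a single pair of columns, intersect() alone is enough; note that it reports NA as a shared value when both columns contain NA:
intersect(find_mz_matches$mz1, find_mz_matches$mz2)
# [1] 3.14222      NA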
I would do something like:
find_mz_matches <- data.frame(
  mz1 = c(3.14222, 314.12003, 214.220, 254.111223, NA, NA, NA, NA, NA),
  mz2 = c(3.14222, 456.2200001, NA, NA, NA, NA, NA, NA, NA),
  mz3 = c(300.112223, 456.2200001, 3.14222, 254.111223, 900.232,
          476.0012503, 459.00201, 500.60402, 300.4053102))
find_mz_matches$mz_allmz1mz2 <- ifelse(find_mz_matches$mz1 == find_mz_matches$mz2,
                                       find_mz_matches$mz1, NA)
find_mz_matches$mz_allmz2mz3 <- ifelse(find_mz_matches$mz2 == find_mz_matches$mz3,
                                       find_mz_matches$mz2, NA)
After that, I would combine the resulting columns if no conflict is present.
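A minimal sketch of that combining step, assuming dplyr is available and that "no conflict" means at most one of the helper columns is non-NA per row (note the ifelse() comparisons above only flag values that sit on the same row in both columns):
find_mz_matches$mz_all <- dplyr::coalesce(find_mz_matches$mz_allmz1mz2,
                                          find_mz_matches$mz_allmz2mz3)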
I am combining two dataframes using left_join as I need to keep all the rows from X and only matched ones from Y. However, there is no one column which provides a perfect match. Instead there is 1 column in X which partially matches 2 columns in Y, and a second column in X which partially matches a third column in Y. Only by matching on all 3 can I adequately merge the data since they all match on different combinations of rows (with some overlap).
My strategy so far has been to do three separate left_join commands, each with a different match. Doing this enables me to match about 95% of the rows (the first match alone only gets about 70%). However, this leaves me with three versions of each column from Y in the merged data. I have tried using paste and a number of other ways to combine them, but either it hasn't worked or the new columns aren't helpful because they are concatenated (e.g. showing "12345 NA 12345" or "NA NA NA"). I need it to show only the first non-NA result, as all non-NA results will be identical. So for the two examples above I would want to return only "12345" and "NA".
So I think I either need to figure out how to do a left_join on multiple alternative columns (this doesn't work, but something like):
left_join(X, Y, by = c("Column1" = "Column1") OR c("Column1" = "Column2") OR c("Column2" = "Column3"))
Or, less elegantly, I just need to figure out how to reformat the merged/pasted columns to keep only the first non-NA result.
In the example below, X has 2 columns and 5 rows, and Y has 4 columns and 8 rows (5 of which match rows in X). The only way to merge X and Y fully is to match X$Column1 with Y$Column1 OR Y$Column2, OR X$Column2 with Y$Column3. (In the real data there are around 50,000 rows in X and also lots of other columns in X and Y.) The desired output from the below should be the five rows from X (which also have matches in Y), along with the corresponding values in Y$Column4.
X$Column1 = c(10, 150, 3550, 9421, 22000)
X$Column2 = c("Dog", "Cat", "Bird", "Rat", "Fox")
Y$Column1 = c(NA, 453, NA, NA, 3550, 9421, NA, 4200)
Y$Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200)
Y$Column3 = c(NA, "Badger", "Dog", NA, NA, NA, "Fox", "Mouse")
Y$Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65)
From the above I would want to return 5 rows (one for each of the rows in X), along with 3 columns (the 2 original columns of X plus the 4th column of Y). The other three columns in Y are only useful for matching. As in the above example, in my data there is no way of completing a full (or close to full) match without joining on all three matches. I have been searching for ways to do this for ages with no luck, but I'm quite new to R, so sorry if I'm being stupid.
My code at the moment:
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_print"))
merged_pvga <- left_join(merged_pvga, sherpa, by = c("issn1" = "issn_electronic"))
merged_pvga <- left_join(merged_pvga, sherpa, by = (c("journal_title" = "title")))
merged_pvga$id_all <- paste(merged_pvga$id.x, merged_pvga$id.y, merged_pvga$id)
merged_pvga$subject_all <- paste(merged_pvga$subject.x, merged_pvga$subject.y, merged_pvga$subject)
etc. for other columns
Here is an example that left-joins Y with X twice: first on variable ColumnA, then on variable ColumnB.
Note:
Y and X share two variables, ColumnA and ColumnB, so after each left-join you want to combine the columns that were not used as the join key (for example, after joining on ColumnA, combine the ColumnB's of the two datasets).
Be sure to decide what to do about the potential overlap that may emerge from joining twice on two different variables. The example below prioritizes the first left-join, in that rows of X already joined are excluded from the second left-join.
library(dplyr)
X = tibble(id_x=1:5)
Y = tibble(id_y=1:8)
X$ColumnA = c(10, 150, 3550, 9421, 22000)
X$ColumnB = c('Dog', 'Cat', 'Bird', 'Rat', 'Fox')
Y$ColumnA = c(NA, 453, NA, NA, 3550, 9421, NA, 4200)
Y$Column2 = c(22, NA, 10, 150, 3550, NA, NA, 4200)
Y$ColumnB = c(NA, 'Badger', 'Dog', NA, NA, NA, 'Fox', 'Mouse')
Y$Column4 = c(NA, 4500, 12345, 54, NA, 5555, 321, 65)
# Replace NA with "" in the given (character) columns
replace_na_with_blank <- function(df, varnames) {
  for (varname in varnames) {
    df[is.na(df[[varname]]), varname] <- ""
  }
  return(df)
}

# Fill blanks in v1 with the values of v2, then drop v2 (character columns)
concat_columns <- function(df, v1, v2) {
  idx_na <- df[[v1]] == ""
  df[[v1]][idx_na] <- paste(df[[v1]][idx_na], df[[v2]][idx_na], sep = "")
  df[[v2]] <- NULL
  return(df)
}

# Fill NAs in v1 with the values of v2, then drop v2 (numeric columns)
concat_columns_num <- function(df, v1, v2) {
  idx_na <- is.na(df[[v1]])
  df[[v1]][idx_na] <- df[[v2]][idx_na]
  df[[v2]] <- NULL
  return(df)
}
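As an aside, concat_columns_num() keeps the first non-NA value element-wise, which is exactly what dplyr::coalesce() does, so that would be a one-line alternative for the numeric case:
library(dplyr)
coalesce(c(1, NA, 3), c(9, 2, 9))  # returns 1 2 3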
merged_1 <- left_join(Y, X, by = c("ColumnA" = "ColumnA"), suffix=c("",".x"))
merged_1 = replace_na_with_blank(merged_1, c("ColumnB","ColumnB.x"))
merged_1 <- concat_columns(merged_1, "ColumnB", "ColumnB.x")
merged_1 # first merge indicator is "id_x"
merged_2 <- left_join(merged_1, X %>% filter(!(id_x %in% merged_1$id_x)),
by = c("ColumnB" = "ColumnB"), suffix=c("",".x"))
merged_2 <- concat_columns_num(merged_2, "ColumnA", "ColumnA.x")
merged_2 # second merge indicator is "id_x.x"
The following code is designed to extract the first x observations of each column, which are time series spanning different periods (or, equivalently, to erase everything except the first x values in each column).
The first values can be numbers followed by NAs, as long as they are at the beginning of the time series.
It is crucial that each value stays linked to its own place in the indexing (the first column, 'Year').
# data example
df <- data.frame("Year" = 1791:1800,
"F1" = c(NA, NA, NA, 1.2,1.3, NA, NA, NA, NA, NA),
"F2" = c(NA, NA, 2.1, 2.2, 2.3, 2.4, 2.5, NA, NA, NA),
"F3" = c(NA, NA, NA, NA, NA, 0.1,0.2,0.3,0.4,0.5),
"F4" = c(NA, 3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8,3.9))
# Convert the dataframe to a list by column
long <- setNames(lapply(names(df)[-1], function(x) cbind(df[1], df[x])), names(df)[-1])
# and select only the first 3 elements after NAs in each column
mylist <- lapply(long, function(x){
head(na.omit(x), 3)
})
# or, more concisely?
mylist2 <- lapply(df, function(x){
head(na.omit(cbind(df[[1]],x)), 3)
})
# Now 'mylist' (or 'mylist2') contains several vectors of different lengths,
# which is not very convenient for a data frame, so let's switch to a long-format data frame
library(reshape2)
mydata <- do.call(rbind, lapply(mylist, function(x){
  melt(x, id.vars = "Year")
}))
# and switch back to regular spreadsheet format
library(tidyverse)
mydataCOL <- spread(mydata, key = "variable", value = "value")
write.table(mydataCOL, "sheet1.txt")
This gets complicated to apply to a list of data frames (multiple Excel files). Is there an easier way to achieve this, i.e. to do such operations on each column of each data frame in the list? :)
I'm currently trying with 'nested' lapply() :
mylist <- lapply(d, function(x){
lapply(x, function(y){
head(na.omit(cbind(x[[1]],y)), 50)
})
})
but this is not the easiest way, I guess... Thanks!
If you are using the tidyverse anyway, why not go all in with Hadley's stuff?
library(dplyr)
library(tidyr)

GetTop <- function(indf){
  indf %>%
    pivot_longer(-Year, names_to = "F") %>%
    na.omit() %>%
    group_by(F) %>%
    top_n(3, wt = -Year) %>%
    pivot_wider(names_from = "F")
}
Now we can call it for one data frame:
> mytops <- GetTop(df)
If you have a list of these dataframes you can use lapply to do this to each one.
allmytop <- lapply(biglist,FUN=GetTop)
That will give you a list of dataframes. Seems like you also want to join them into one fat dataframe.
fatdf <- lapply(biglist,FUN=GetTop) %>% reduce(full_join,by="Year")
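One caveat on that last line: reduce() comes from purrr, so it needs library(purrr) (or library(tidyverse)) loaded; base R's Reduce() does the same job:
fatdf <- Reduce(function(a, b) full_join(a, b, by = "Year"),
                lapply(biglist, GetTop))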
How can I replace a NA value by the average of the previous non-NA and next non-NA values?
For example, I want to replace the first NA value with -0.873, and the 4th/5th with the average of -0.497 and 53.200.
Thanks!
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
=================== ADD ON ===================
Thank you all for answering the question! Sorry for the late response. This is only part of a data frame (10000 * 91); I only took the first 10 rows of the first column to simplify the question. I think David and MKR have the result that I expected to have.
Here's a possible vectorized approach using base R (some steps could probably be improved, but I have no time to look into it right now):
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
# Store a boolean vector of NA locations for further use
na_vals <- is.na(x)
# Find the NA locations relative to the non-NAs
start_ind <- findInterval(which(na_vals), which(!na_vals))
# Create a right limit
end_ind <- start_ind + 1L
# Replace zero locations with NAs
start_ind[start_ind == 0L] <- NA_integer_
# Calculate the means and replace the NAs
x[na_vals] <- rowMeans(cbind(x[!na_vals][start_ind], x[!na_vals][end_ind]), na.rm = TRUE)
x
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000
This should work properly for NAs on both sides of the vector.
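If pulling in a package is acceptable, the same previous/next averaging can be written with zoo's na.locf() run in both directions (a compact alternative; it reproduces the output above, including the one-sided edges):
library(zoo)
x <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
rowMeans(cbind(na.locf(x, na.rm = FALSE),                   # previous non-NA
               na.locf(x, fromLast = TRUE, na.rm = FALSE)), # next non-NA
         na.rm = TRUE)
# [1] -0.8730 -0.8730 -0.4970 26.3515 26.3515 53.2000 39.6500 39.6500 39.6500 26.1000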
This function imputes values for NAs in a vector based on the average of the non-NA values in a window running from the first element to the element just after the NA.
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
roll_impute <- function(x){
  n <- length(x)
  res <- x
  for (i in seq_along(x)){
    if (is.na(x[i])){
      res[i] <- mean(rep_len(x, i + 1), na.rm = TRUE)
    }
  }
  # if the final element is NA, use the plain mean so the vector isn't recycled
  if (is.na(x[n])) res[n] <- mean(x, na.rm = TRUE)
  res
}
roll_impute(t)
# [1] -0.87300 -0.87300 -0.49700 -0.68500 17.27667 53.20000 17.27667 17.27667 19.48250
# [10] 26.10000
roll_impute() includes code that corrects the rolling window in the case that the final element is NA, so that the vector isn't recycled. This isn't the case in your example, but is needed in order to generalize the function. Any improvements on this function would be welcome :) It does use a for loop, but doesn't grow any vectors. No simple way to avoid the for loop and rely on the structure of the objects jumps to my mind right now.
One dplyr and tidyr based solution could be:
library(dplyr)
library(tidyr)
t <- c(NA, -0.873, -0.497, NA, NA, 53.200, NA, NA, NA, 26.100)
data.frame(t) %>%
mutate(last_nonNA = ifelse(!is.na(t), t, NA)) %>%
mutate(next_nonNA = ifelse(!is.na(t), t, NA)) %>%
fill(last_nonNA) %>%
fill(next_nonNA, .direction = "up") %>%
mutate(t = case_when(
!is.na(t) ~ t,
!is.na(last_nonNA) & !is.na(next_nonNA) ~ (last_nonNA + next_nonNA)/2,
is.na(last_nonNA) ~ next_nonNA,
is.na(next_nonNA) ~ last_nonNA
)
) %>%
select(t)
# t
# 1 -0.8730
# 2 -0.8730
# 3 -0.4970
# 4 26.3515
# 5 26.3515
# 6 53.2000
# 7 39.6500
# 8 39.6500
# 9 39.6500
# 10 26.1000
Note: it looks a bit complicated, but it does the trick. One can achieve the same thing via a for loop.
This question already has answers here:
passing several arguments to FUN of lapply (and others *apply) (4 answers)
Closed 5 years ago.
I'm always bumping into the problem of wanting to pass extra arguments to the function used inside an apply call. I have looked around but could not find any suitable answer...
For example
I have a simple matrix like
dput(tab)
> structure(c(108.13, 108.13, 107.7, 107.66, 107.65, NA, NA, 115.56,
115.5, 115.45, NA, NA, NA, 122.72, 122.66, 124.81, 124.82, 124.87,
124.91, 124.94, NA, NA, NA, NA, 130.18), .Dim = c(5L, 5L), .Dimnames = list(
NULL, NULL))
And I want to get the minimum of each column.
I would do something like:
apply(tab, 2, min)
> 107.65 NA NA 124.81 NA
But now let's say I want to skip the NAs.
For the first column, I would do
min(tab[, 1], na.rm = TRUE)
> 107.65
But I cannot use
apply(tab, 2, min(na.rm = TRUE))
So, how am I supposed to pass arguments to a function inside apply?
We can use a vectorized colMins from matrixStats
library(matrixStats)
colMins(tab, na.rm = TRUE)
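For the matrix above this should return:
# [1] 107.65 115.45 122.66 124.81 130.18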
If we are not sure how to pass the arguments, an anonymous function call can be used:
apply(tab, 2, function(x) min(x, na.rm = TRUE))
Or otherwise, as @ahly suggested
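(presumably the ... pass-through that apply() provides: any arguments given after FUN are forwarded to it)
apply(tab, 2, min, na.rm = TRUE)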
I have encountered some unexpected behaviour when using the within() function in R. I (eventually!) tracked the cause to a situation where the last element(s) of particular columns in question in a data frame contain NA.
I have simplified the code to create a reproducible example. Obviously the real-world application in which I encountered this is substantially more complex (a data frame with >500k rows and 400 columns, >100 lines inside within(), etc.), and it is rather inconvenient to avoid using within().
This works as expected:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest$Bearing <- NA
fooTest$Bearing[which(fooTest$Group=="Cup")] <-
as.character(fooTest$CupComposition[which(fooTest$Group=="Cup")])
fooTest$Bearing[which(fooTest$Group=="Shell")] <-
as.character(fooTest$LinerComposition[which(fooTest$Group=="Shell")])
fooTest$Bearing
Whereas this (which should be equivalent) throws an error:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
Bearing <- NA
Bearing[which(Group=="Cup")] <-
as.character(CupComposition[which(Group=="Cup")])
Bearing[which(Group=="Shell")] <-
as.character(LinerComposition[which(Group=="Shell")])
})
The error message is
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(Bearing = c("Polyethylene", :
  replacement element 1 has 3 rows, need 5
The last two rows, in which Group is NA, are evidently not being included. NA rows in the middle of the data are OK.
A couple of questions:
The behaviour of within() is a bit unexpected; is this a bug? I am not very experienced, so am slightly reticent about filing bugs where it is likely to be my understanding that is deficient!
In this particular case, I expect there is a neater way to populate the "Bearing" column than the method I have employed. Suggestions welcome!
I tend to use "%in%" in this case; it handles NAs more gracefully:
fooTest <- data.frame(Group = c("Shell", NA, "Cup", NA, NA),
CupComposition = c("Metal", NA, "Polyethylene", NA, "Test"),
LinerComposition = c("Polyethylene", NA, NA, NA, "Test"))
fooTest <- within(fooTest, {
Bearing <- NA
Bearing[Group %in% "Cup"] <-
as.character(CupComposition[Group %in% "Cup"])
Bearing[Group %in% "Shell"] <-
as.character(LinerComposition[Group %in% "Shell"])
})
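The reason this sidesteps the NA rows: == propagates NA (which is why which() was needed in the first version), while %in% simply returns FALSE when an NA is looked up in a non-NA table:
NA == "Cup"    # NA
NA %in% "Cup"  # FALSE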
Regarding the error message using within, you can try:
within(fooTest, {Bearing <- NA
Bearing[Group=='Cup' & !is.na(Group)] <-
as.character(CupComposition)[Group=='Cup' & !is.na(Group)]
Bearing[Group=='Shell' & !is.na(Group)] <-
as.character(LinerComposition)[Group=='Shell' & !is.na(Group)]
})
It is not clear whether the Group column and the other columns follow some order. From the column names, I couldn't find a common pattern that helps in matching the elements of Group. Based on the example provided, you could also do (for the bigger dataset):
fooTest1 <- fooTest
fooTest1[] <- lapply(fooTest1, as.character)  # convert the columns to character class
Un1 <- sort(unique(na.omit(fooTest1$Group)))
m1 <- do.call(cbind, Map(function(v, x, y)
  ifelse(v == y & !is.na(v), x, NA), list(fooTest1[, 1]),
  fooTest1[, -1], Un1))
indx1 <- which(!is.na(m1), arr.ind=TRUE)[,1]
fooTest1$Bearing <- NA
fooTest1$Bearing[indx1] <- m1[!is.na(m1)]
fooTest1
# Group CupComposition LinerComposition Bearing
#1 Shell Metal Polyethylene Polyethylene
#2 <NA> <NA> <NA> <NA>
#3 Cup Polyethylene <NA> Polyethylene
#4 <NA> <NA> <NA> <NA>
#5 <NA> Test Test <NA>