I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE, but that just added FALSE at the end of the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One out of many options is to use ifelse as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances, rowSums might be suitable as well, as in:
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite -
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
If, as in the example, each row has only one non-NA value, you can also use pmax or coalesce.
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
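As for why na.rm = FALSE did not help: paste() has no na.rm argument, so anything passed through ... is treated as one more value to paste, which is why FALSE showed up in the result. A minimal base R sketch of an NA-dropping paste follows; paste_na_rm is a hypothetical helper name used only for illustration, not a function from any package.

```r
# paste() has no na.rm argument; extra arguments are pasted as values.
paste(10, NA, na.rm = FALSE)
#> [1] "10 NA FALSE"

# Hypothetical helper: paste the non-NA values of each row together.
paste_na_rm <- function(..., sep = "") {
  # Bind the inputs column-wise as character so NA stays NA_character_.
  m <- do.call(cbind, lapply(list(...), as.character))
  apply(m, 1, function(row) paste(row[!is.na(row)], collapse = sep))
}

df <- data.frame(id = 1:4,
                 var1 = c(10, NA, 11, NA),
                 var2 = c(NA, 8, NA, 1))
df$conc_var <- paste_na_rm(df$var1, df$var2)
df$conc_var
#> [1] "10" "8"  "11" "1"
```

Note that the result is character, and a row in which every value is NA would give an empty string rather than NA.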
Not sure what I'm doing wrong, but I'm struggling to get, for each row, the index of the last column (among several columns) that is not NA.
Using tidyverse and across, I'm getting as many output columns as input columns where I'd expect one single output column with the index of the respective column.
dat <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
I tried the following (among others, inspired by this one: Return last data frame column which is not NA):
dat %>%
mutate(last = across(-id, ~max.col(!is.na(.x), ties.method="last")))
Expected outcome would be:
id x y z last
1 1 1 NA 3 3
2 2 NA NA 1 3
3 3 NA NA NA NA
The problems with your current flow:
across is going to pass one column at a time to the function/expression; your code needs a row or a matrix/frame. For this, across is not appropriate.
Your desired output of NA for the last row is inconsistent with the logic: !is.na(.x) should return c(F,F,F), which still has a max. Your logic then requires a custom function, since you need to handle it differently.
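Both points can be checked directly on the matrix of non-NA flags. The sketch below uses only base R (max.col() and rowSums()) on the dat object from the question, which is repeated so the snippet runs on its own.

```r
dat <- data.frame(id = c(1, 2, 3),
                  x = c(1, NA, NA),
                  y = c(NA, NA, NA),
                  z = c(3, 1, NA))

# max.col() needs the whole matrix at once, not one column at a time.
notna <- !is.na(as.matrix(dat[-1]))
last <- max.col(notna, ties.method = "last")
last
#> [1] 3 3 3

# Row 3 is all FALSE, yet it still has a "max" (a three-way tie), so
# rows without any non-NA value have to be blanked out separately.
last[rowSums(notna) == 0] <- NA
last
#> [1]  3  3 NA
```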
Try this adaptation of max.col into a custom function:
max.col.notna <- function (m, ties.method = c("random", "first", "last")) {
ties.method <- match.arg(ties.method)
tieM <- which(ties.method == eval(formals()[["ties.method"]]))
out <- .Internal(max.col(as.matrix(m), tieM))
m[] <- !m %in% c(0,NA) # 'm[] <-' is required to maintain the matrix shape
replace(out, rowSums(m) == 0, NA_integer_)
}
dat %>%
mutate(last = max.col.notna(!is.na(select(., -id)), ties.method = "last"))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Note: I've edited/changed the function several times, trying to ensure a consistent API to the intent of this custom function. As it stands now, the notna in the function name to me reflects a sense of "emptiness" (either 0 or NA). With this logic, the function is usable with logical (as here) and numeric data. Perhaps it's overkill, but I prefer APIs that operate consistently/predictably across input classes.
tidyverse isn't really suitable for row-wise operations. Most of the time, reshaping the data into long format (as shown in @Rui Barradas's answer) is a good approach.
Here is one way using rowwise keeping the data wide.
library(dplyr)
dat %>%
rowwise() %>%
mutate(last = {ind = which(!is.na(c_across(x:z)));
if(length(ind)) tail(ind, 1) else NA})
# id x y z last
# <dbl> <dbl> <lgl> <dbl> <int>
#1 1 1 NA 3 3
#2 2 NA NA 1 3
#3 3 NA NA NA NA
An R base solution:
# take the largest position of a non-NA value in each row, or NA if the row has none
dat$last = apply(dat[,2:4], 1,
FUN = function(x) { i <- which(!is.na(x)); if (length(i)) max(i) else NA_integer_ })
dat
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
You want to use c_across() and rowwise() to do this. rowwise() works similarly to group_by_all(), except that it is more explicit. c_across() creates flat vectors out of columns (whereas across() creates tibbles).
First, we define a separate function that returns the index of the last non-NA value, or NA if there is none:
get_last <- function(x){
y <- c(NA,which(!is.na(x)))
y[length(y)]
}
We can then apply that function to the variables we need with c_across(), but only after converting the data into a rowwise_df using rowwise():
dat %>%
rowwise() %>%
mutate(last = get_last(c_across(x:z)))
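To see why prepending NA works, get_last() can be tried on plain vectors outside of dplyr. This is only an illustration of the function from this answer, repeated here so the snippet runs on its own.

```r
get_last <- function(x){
  # which() gives the positions of the non-NA values; the leading NA is
  # the fallback that remains when there are none.
  y <- c(NA, which(!is.na(x)))
  y[length(y)]
}

get_last(c(1, NA, 3))    # positions 1 and 3 are non-NA
#> [1] 3
get_last(c(NA, NA, 1))
#> [1] 3
get_last(c(NA, NA, NA))  # only the fallback is left
#> [1] NA
```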
base R
df <- data.frame(id = c(1, 2, 3),
x = c(1, NA, NA),
y = c(NA, NA, NA),
z = c(3, 1, NA))
df$last <- apply(df[-1], 1, function(x) max(as.vector(!is.na(x)) * seq_len(length(x))))
df$last[df$last == 0] <- NA
df
#> id x y z last
#> 1 1 1 NA 3 3
#> 2 2 NA NA 1 3
#> 3 3 NA NA NA NA
Created on 2020-12-29 by the reprex package (v0.3.0)
Starting with a vector of NAs, you could step through each col and if the given element passes your check_fun returning TRUE, assign the index of that col to that element. The difference from the other answers here is that this does not check the condition row-wise or create a matrix from the data. Not sure whether creating two new temp vectors for each column is better/worse than just converting the entire data to a matrix first though.
library(tidyverse) # purrr and dplyr
last_matching_ind <- function(dat, check_fun){
check_fun <- as_mapper(check_fun)
reduce2(dat, seq_along(dat), .init = NA_integer_,
function(prev, dat, ind) if_else(check_fun(dat), ind, prev) )
}
dat %>%
mutate(last = last_matching_ind(dat[-1], ~ !is.na(.x)))
# id x y z last
# 1 1 1 NA 3 3
# 2 2 NA NA 1 3
# 3 3 NA NA NA NA
Let's create an example first:
scale1 <- c(5,NA,2,1)
scale2 <- c(NA,4,NA,3)
scale3 <- c(3,NA,5,NA)
scale4 <- c(2,1,NA,5)
df<- data.frame(scale1,scale2,scale3,scale4)
df
Here is the output
## scale1 scale2 scale3 scale4
#1 5 NA 3 2
#2 NA 4 NA 1
#3 2 NA 5 NA
#4 1 3 NA 5
Here is where I'm stuck.
I am doing a survey where the participants have to rate on multiple scales. The scale values are supposed to follow this order:
scale 1 >= scale 2 >= scale 3 >= scale 4
so I want to remove the rows that violate this order while keeping NAs (as the scales are randomly assigned).
The output should look like this (cases 3 and 4 removed):
## scale1 scale2 scale3 scale4
#1 5 NA 3 2
#2 NA 4 NA 1
Is there an efficient way to achieve this (since I have lots of sets of scales in my actual data)?
Thank you!
You can do this with a row-wise apply:
cols <- grep('scale', names(df))
# keep rows whose non-NA values never increase from left to right (ties allowed, matching >=)
df[apply(df[cols], 1, function(x) all(diff(na.omit(x)) <= 0)), ]
# scale1 scale2 scale3 scale4
#1 5 NA 3 2
#2 NA 4 NA 1
and the same using dplyr:
library(dplyr)
df %>%
rowwise() %>%
filter(all(diff(na.omit(c_across(starts_with('scale')))) <= 0))
This selects the rows where every non-NA value is less than or equal to the previous non-NA value in the row.
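The row-wise check can be tried on single vectors to see how NAs and ties are handled. In this small sketch, is_non_increasing is a hypothetical helper name; it uses <= 0 so that equal neighbouring values pass, as the >= in the question implies (use < 0 instead to require strictly decreasing values).

```r
is_non_increasing <- function(x) {
  # Drop NAs first, then require every step to the next value to be <= 0.
  all(diff(as.numeric(na.omit(x))) <= 0)
}

is_non_increasing(c(5, NA, 3, 2))    # 5 >= 3 >= 2
#> [1] TRUE
is_non_increasing(c(2, NA, 5, NA))   # 2 < 5 violates the order
#> [1] FALSE
is_non_increasing(c(4, 4, NA, 1))    # ties are allowed
#> [1] TRUE
is_non_increasing(c(NA, NA, 7, NA))  # nothing to compare: all(logical(0)) is TRUE
#> [1] TRUE
```

The last case is worth knowing about: a row with fewer than two non-NA values always passes, which is what "keeping NA" asks for here.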
data
df <- structure(list(scale1 = c(5, NA, 2, 1), scale2 = c(NA, 4, NA,
3), scale3 = c(3, NA, 5, NA), scale4 = c(2, 1, NA, 5)),
class = "data.frame", row.names = c(NA, -4L))
I am a beginner in R and this is a very simple question, but I can't find the answer.
I would like to select cells in a table that match a particular pattern and exclude everything else.
Example data:
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE = c("Y1546_STAS1^Q:6", NA, NA))
which generates a table like this:
              ColA ColB             ColC           ColD            ColE
1 NARG_ECOLI^Q:103   NA      KLEP7^Q:103 RPOC_ENTFA^Q:2 Y1546_STAS1^Q:6
2  NARG_ECOLI^NARG   NA NARG_ECOLI^KLEP7           <NA>            <NA>
3 SPEB_KLEP7^Q:103   NA             <NA>           <NA>            <NA>
I would like to select only cells containing ECOLI. Thus, the desired output would look like this one:
ColA ColC
1 NARG_ECOLI^Q:103 NARG_ECOLI^KLEP7
2 NARG_ECOLI^NARG <NA>
One possible solution is to visually inspect and make the selections in my data, but the actual table has dozens of columns and hundreds of rows. Any help would be greatly appreciated. Thank you in advance!
If you want to return ONLY the items in the data frame that have "ECOLI" in them, then here is a tidyverse approach
library(tidyverse)
filter_all(data.t, any_vars(grepl("ECOLI", .))) %>%
.[map_lgl(., ~any(grepl("ECOLI", .x)))] %>%
map_df(~replace(.x, !grepl("ECOLI", .x), NA_character_))
# A tibble: 2 x 2
ColA ColC
<fctr> <fctr>
1 NARG_ECOLI^Q:103 <NA>
2 NARG_ECOLI^NARG NARG_ECOLI^KLEP7
data.t <- data.t[grepl('ECOLI', data.t$ColA), ]
To obtain a staggered list of all instances of ECOLI in each column of data.t:
out <- lapply(data.t, grep, pattern='ECOLI', value=T)
If you want to drop zero-length entries:
nout <- sapply(out, length)
out <- out[nout > 0]
nout <- nout[nout > 0]
To merge that staggered list into a rectangular object like a data frame is unwise, but:
mapply(c, out, mapply(rep, NA, max(nout)-nout))
I tried solving this using base functions.
# Data
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG",
"SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103",
"NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE =
c("Y1546_STAS1^Q:6", NA, NA), stringsAsFactors = FALSE)
# First, write a function to check each cell value. If the value contains
# "ECOLI", the cell value is retained; otherwise it is replaced with NA
findECOLI <- function(x){
ifelse(grepl("ECOLI", x, fixed = TRUE), x, NA)
}
d1 <- sapply(data.t, findECOLI)
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
#[3,] NA NA NA NA NA
# Now, remove the rows containing only NA
d1 <- d1[rowSums(is.na(d1)) != ncol(d1), ]
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
# Remove the columns containing only NA
d1 <- d1[, colSums(is.na(d1)) != nrow(d1)]
#Result:
#>d1
# ColA ColC
#[1,] "NARG_ECOLI^Q:103" NA
#[2,] "NARG_ECOLI^NARG" "NARG_ECOLI^KLEP7"
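The three steps (blank out non-matching cells, drop all-NA rows, drop all-NA columns) can also be wrapped into one function that returns a data frame instead of a matrix. This is only a sketch: keep_matching_cells is a hypothetical name, and it assumes the pattern should be matched as a fixed string, as in findECOLI.

```r
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"),
                     ColB = c(NA, NA, NA),
                     ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA),
                     ColD = c("RPOC_ENTFA^Q:2", NA, NA),
                     ColE = c("Y1546_STAS1^Q:6", NA, NA),
                     stringsAsFactors = FALSE)

keep_matching_cells <- function(df, pattern) {
  # Logical matrix: TRUE where a cell contains the pattern (NA cells give FALSE).
  hit <- do.call(cbind, lapply(df, function(col) grepl(pattern, col, fixed = TRUE)))
  # Blank out every non-matching cell.
  df[!hit] <- NA
  # Keep only rows and columns with at least one match.
  df[rowSums(hit) > 0, colSums(hit) > 0, drop = FALSE]
}

res <- keep_matching_cells(data.t, "ECOLI")
res
#>               ColA             ColC
#> 1 NARG_ECOLI^Q:103             <NA>
#> 2  NARG_ECOLI^NARG NARG_ECOLI^KLEP7
```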
I would like to replace values of one dataframe with NA of another dataframe that have the same identifier. That is, for all values of df1 that have the same id, assign the "NA" values of df2 at the corresponding id and indices.
I have df1 and df2:
df1 =data.frame(id = c(1,1,2,2,6,6),a = c(2,4,1,7,5,3), b = c(5,3,0,3,2,5),c = c(9,3,10,33,2,5))
df2 =data.frame(id = c(1,2,6),a = c("NA",0,"NA"), b= c("NA", 9, 9),c=c(0,"NA","NA"))
What I would like is df3:
df3 = data.frame(id = c(1,1,2,2,6,6),a = c("NA","NA",1,7,"NA","NA"), b = c("NA","NA",0,3,2,5),c = c(9,3,"NA","NA","NA","NA"))
I have tried the lookup function and the library "data.table", but I could not get the correct df3. Could anyone please help me with this?
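We can do a join on 'id' and then propagate the NA values by multiplying each df1 column by NA^is.na(y), which is 1 where the matching df2 value is present and NA where it is missing.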
library(data.table)
nm1 <- names(df1)[-1]
setDT(df1)[df2, (nm1) := Map(function(x, y) x*(NA^is.na(y)), .SD,
mget(paste0('i.', nm1))), on = .(id), .SDcols = nm1]
df1
# id a b c
#1: 1 NA NA 9
#2: 1 NA NA 3
#3: 2 1 0 NA
#4: 2 7 3 NA
#5: 6 NA 2 NA
#6: 6 NA 5 NA
data
df2 =data.frame(id = c(1,2,6),a = c(NA,0,NA), b= c(NA, 9, 9),c=c(0,NA,NA))
NOTE: In the OP's post NA were "NA"
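The multiplication trick relies on NA^0 being 1 and NA^1 being NA. The same masking idea can be sketched in base R with match(); this is an alternative illustration rather than the data.table code above, and it assumes that df2 holds real NA values and that every id in df1 also occurs in df2.

```r
# NA^FALSE is 1 and NA^TRUE is NA, so NA^is.na(y) acts as a multiplicative mask.
NA^c(FALSE, TRUE)
#> [1]  1 NA

df1 <- data.frame(id = c(1, 1, 2, 2, 6, 6), a = c(2, 4, 1, 7, 5, 3),
                  b = c(5, 3, 0, 3, 2, 5), c = c(9, 3, 10, 33, 2, 5))
df2 <- data.frame(id = c(1, 2, 6), a = c(NA, 0, NA),
                  b = c(NA, 9, 9), c = c(0, NA, NA))

cols <- setdiff(names(df1), "id")
idx <- match(df1$id, df2$id)               # row of df2 belonging to each row of df1
mask <- is.na(as.matrix(df2[idx, cols]))   # TRUE where df2 has NA for that id
df3 <- df1
df3[cols][mask] <- NA
df3$a
#> [1] NA NA  1  7 NA NA
```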
Since your NA values are actually the text "NA", you will have to turn all your variables into text (with as.character). You can then join both dataframes by the id column. Since both dataframes have columns a, b, and c, the join will rename them a.x, b.x, and c.x (from df1) and a.y, b.y, and c.y (from df2).
After that you can create new columns a, b, and c. These then hold "NA" whenever a.y == "NA" and a.x otherwise (and so on). If your NA values were real NA, you would need to test differently, with is.na(value) (see the example at the end of the code).
library(dplyr)
df1 %>%
mutate_all(as.character) %>% # allvariables as text
left_join(df2 %>%
mutate_all(as.character) ## all variables as text
, by = "id") %>% ## join tables by 'id'; a.x from df1 and a.y from df2 and so on
mutate(a = case_when(a.y == "NA" ~ "NA", TRUE ~ a.x), ## if a.y == "NA" take this,else a.x
b = case_when(b.y == "NA" ~ "NA", TRUE ~ b.x),
c = case_when(c.y == "NA" ~ "NA", TRUE ~ c.x)) %>%
select(id, a, b, c) ## keep only these initial columns
id a b c
1 1 NA NA 9
2 1 NA NA 3
3 2 1 0 NA
4 2 7 3 NA
5 6 NA 2 NA
6 6 NA 5 NA
## if your dataframe had real NA, this is how you can test:
missing_value <- NA
is.na(missing_value) ## TRUE
missing_value == NA ## returns NA, not TRUE, so == cannot be used to test for NA
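If you would rather work with real NA values, the text "NA" can be converted first. A minimal sketch, assuming every column of df2 is meant to be numeric once the "NA" strings are gone:

```r
df2 <- data.frame(id = c(1, 2, 6), a = c("NA", 0, "NA"),
                  b = c("NA", 9, 9), c = c(0, "NA", "NA"),
                  stringsAsFactors = FALSE)

# Turn the text "NA" into a real missing value, then restore numeric columns.
df2[df2 == "NA"] <- NA
df2[] <- lapply(df2, as.numeric)

is.na(df2$a)
#> [1]  TRUE FALSE  TRUE
```

After this step, is.na() works as expected on df2.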