Subset dataframe to contain only cells that match a pattern in R

I am a beginner in R and this is a very simple question, but I can't find the answer.
I would like to select cells in a table that match a particular pattern and exclude everything else.
Example data:
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE = c("Y1546_STAS1^Q:6", NA, NA))
which generates a table like this:
              ColA ColB             ColC           ColD            ColE
1 NARG_ECOLI^Q:103   NA      KLEP7^Q:103 RPOC_ENTFA^Q:2 Y1546_STAS1^Q:6
2  NARG_ECOLI^NARG   NA NARG_ECOLI^KLEP7           <NA>            <NA>
3 SPEB_KLEP7^Q:103   NA             <NA>           <NA>            <NA>
I would like to select only cells containing ECOLI. Thus, the desired output would look like this one:
ColA ColC
1 NARG_ECOLI^Q:103 NARG_ECOLI^KLEP7
2 NARG_ECOLI^NARG <NA>
One possible solution is to visually inspect and make the selections in my data, but the actual table has dozens of columns and hundreds of rows. Any help would be greatly appreciated. Thank you in advance!

If you want to return ONLY the items in the data frame that have "ECOLI" in them, then here is a tidyverse approach:
library(tidyverse)
filter_all(data.t, any_vars(grepl("ECOLI", .))) %>%
.[map_lgl(., ~any(grepl("ECOLI", .x)))] %>%
map_df(~replace(.x, !grepl("ECOLI", .x), NA_character_))
# A tibble: 2 x 2
ColA ColC
<fctr> <fctr>
1 NARG_ECOLI^Q:103 <NA>
2 NARG_ECOLI^NARG NARG_ECOLI^KLEP7

To keep only the rows where ColA contains ECOLI:
data.t <- data.t[grepl('ECOLI', data.t$ColA), ]
To obtain a staggered list of all instances of ECOLI in each column of data.t:
out <- lapply(data.t, grep, pattern = 'ECOLI', value = TRUE)
To drop zero-length entries:
nout <- sapply(out, length)
out <- out[nout > 0]
nout <- nout[nout > 0]
Merging that staggered list into a rectangular object like a data frame is unwise, because the padded values no longer line up with their original rows, but it can be done:
mapply(c, out, mapply(rep, NA, max(nout)-nout))
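As a sketch of the padding step above, each list element can also be extended to a common length with `length<-`, which fills the new positions with NA. The `out` list below is re-created by hand from the example so the snippet runs on its own. Note that padding pushes values to the top of each column, so rows no longer correspond to rows of the original table.

```r
# Staggered list as produced by the lapply/grep step (re-created by hand here)
out <- list(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG"),
            ColC = "NARG_ECOLI^KLEP7")

# The longest element decides the number of rows
n_max <- max(lengths(out))

# Extending a vector with `length<-` pads it with NA
padded <- as.data.frame(
  lapply(out, function(x) { length(x) <- n_max; x }),
  stringsAsFactors = FALSE
)
padded
```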

I tried solving this using base functions.
# Data
data.t <- data.frame(ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG",
"SPEB_KLEP7^Q:103"), ColB = c(NA, NA, NA), ColC = c("KLEP7^Q:103",
"NARG_ECOLI^KLEP7", NA), ColD = c("RPOC_ENTFA^Q:2", NA, NA), ColE =
c("Y1546_STAS1^Q:6", NA, NA), stringsAsFactors = FALSE)
# First, write a function to check each cell value. If the value contains
# "ECOLI", it is retained; otherwise it is replaced with NA.
findECOLI <- function(x){
ifelse(grepl("ECOLI", x, fixed = TRUE), x, NA)
}
d1 <- sapply(data.t, findECOLI)
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
#[3,] NA NA NA NA NA
# Now, remove the rows containing only NA
d1 <- d1[rowSums(is.na(d1)) != ncol(d1), ]
#> d1
# ColA ColB ColC ColD ColE
#[1,] "NARG_ECOLI^Q:103" NA NA NA NA
#[2,] "NARG_ECOLI^NARG" NA "NARG_ECOLI^KLEP7" NA NA
# Remove the columns containing only NA
d1 <- d1[, colSums(is.na(d1)) != nrow(d1)]
#Result:
#>d1
# ColA ColC
#[1,] "NARG_ECOLI^Q:103" NA
#[2,] "NARG_ECOLI^NARG" "NARG_ECOLI^KLEP7"
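The three steps above (blank out non-matching cells, drop empty rows, drop empty columns) could be wrapped into one function. This is only a sketch: `keep_matching` is a hypothetical helper name, it assumes a fixed (non-regex) pattern, and it returns a character matrix rather than a data frame.

```r
# Example data from the question
data.t <- data.frame(
  ColA = c("NARG_ECOLI^Q:103", "NARG_ECOLI^NARG", "SPEB_KLEP7^Q:103"),
  ColB = c(NA, NA, NA),
  ColC = c("KLEP7^Q:103", "NARG_ECOLI^KLEP7", NA),
  ColD = c("RPOC_ENTFA^Q:2", NA, NA),
  ColE = c("Y1546_STAS1^Q:6", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only cells containing `pattern`
keep_matching <- function(df, pattern) {
  m <- as.matrix(df)                                  # character matrix
  hit <- matrix(grepl(pattern, m, fixed = TRUE),      # TRUE where a cell matches
                nrow = nrow(m))
  m[!hit] <- NA                                       # blank out non-matching cells
  # keep rows and columns that contain at least one match
  m[rowSums(hit) > 0, colSums(hit) > 0, drop = FALSE]
}

res <- keep_matching(data.t, "ECOLI")
res
```

`drop = FALSE` keeps the result a matrix even when only one row or column survives.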

Related

Paste two columns in R but NAs are included in new column [duplicate]

I am trying to concatenate two columns in R using:
df_new$conc_variable <- paste(df$var1, df$var2)
My dataset looks as follows:
id var1 var2
1 10 NA
2 NA 8
3 11 NA
4 NA 1
I am trying to get it such that there is a third column:
id var1 var2 conc_var
1 10 NA 10
2 NA 8 8
3 11 NA 11
4 NA 1 1
but instead I get:
id var1 var2 conc_var
1 10 NA 10NA
2 NA 8 8NA
3 11 NA 11NA
4 NA 1 1NA
Is there a way to exclude NAs in the paste process? I tried including na.rm = FALSE, but that just added FALSE at the end of the NA in the conc_var column. Here is the dataset:
id <- c(1,2,3,4)
var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)
df <- data.frame(id, var1, var2)
One of many options is to use ifelse, as in:
df <- data.frame(var1 = c(10, NA, 11, NA),
var2 = c(NA, 8, NA, 1))
df$new <- ifelse(is.na(df$var1), yes = df$var2, no = df$var1)
print(df)
Depending on the circumstances, rowSums might be suitable as well, as in:
df$new2 <- rowSums(df[, c("var1", "var2")], na.rm = TRUE)
print(df)
You can use tidyr::unite:
df <- tidyr::unite(df, conc_var, var1, var2, na.rm = TRUE, remove = FALSE)
df
# id conc_var var1 var2
#1 1 10 10 NA
#2 2 8 NA 8
#3 3 11 11 NA
#4 4 1 NA 1
If, as in the example, each row has only one non-NA value, you can also use pmax or coalesce:
pmax(df$var1, df$var2, na.rm = TRUE)
dplyr::coalesce(df$var1, df$var2)
You could use glue from the glue package instead.
glue::glue(10, NA, .na = '')
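The underlying issue is that `paste()` converts NA to the string "NA" before joining. For the more general case where several columns in a row may hold values, a base R sketch is to drop the NAs row by row before pasting. `paste_na_rm` is a hypothetical helper name, not an existing function.

```r
# Hypothetical helper: paste values row-wise, skipping NA
paste_na_rm <- function(..., sep = "") {
  cols <- cbind(...)                                   # one column per input vector
  apply(cols, 1, function(r) paste(r[!is.na(r)], collapse = sep))
}

var1 <- c(10, NA, 11, NA)
var2 <- c(NA, 8, NA, 1)

paste(var1, var2)                    # NA is turned into the string "NA" here
conc_var <- paste_na_rm(var1, var2)  # NA entries are dropped instead
conc_var
```

A row in which every value is NA yields an empty string with this helper, which may or may not be what you want.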

Merge 2 R dataframes keeping matched rows from 2nd dataframe and unmatched from 1st

I would like to merge 2 dataframes in R by matching the id column in the following way:
dfmain =
   id name val res
1   1    a
2   2    b
3   3    c
4   4    d
5   5    e
and
dfsub =
   id name   val    res
1   2  two  true thanks
2   4 four false  Sorry
to get
dfmain =
    id name   val    res
1:   1    a
2:   2  two  true thanks
3:   3    c
4:   4 four false  Sorry
5:   5    e
Please note that -
the columns in both the dataframes will remain the same in number and names
the id values in the second dataframe will always be a subset of those in the first dataframe
Currently I am using the anti_join function to get the unmatched rows in the first dataframe and then binding the second dataframe to these rows.
Is there a more efficient method to do this in place?
I tried using setDT from the data.table library, but I was only able to update the values of one column at a time.
Sorry if I am missing an obvious solution, as I am new to R. Any help would be appreciated.
You can try the following (thanks to @Anoushiravan R for the data):
library(data.table)
library(dplyr)
setDT(dfsub)[setDT(dfmain), on = "id"][
  , names(dfmain), with = FALSE
][
  , Map(coalesce, .SD, dfmain)
]
which gives
id name val res
1: 1 a NA <NA>
2: 2 two TRUE thanks
3: 3 c NA <NA>
4: 4 four FALSE Sorry
5: 5 e NA <NA>
I hope this is what you have in mind; otherwise please let me know. As I understand it, you want to replace the rows in dfmain with the rows of dfsub that have the same id, keeping the dfsub values, so here is one way to get there:
library(dplyr)
dfmain <- tribble(
~id, ~name, ~ val, ~ res,
1, "a", NA, NA,
2, "b", NA, NA,
3, "c", NA, NA,
4, "d", NA, NA,
5, "e" , NA, NA
)
dfsub <- tribble(
~id, ~name, ~val, ~res,
2, "two", TRUE, "thanks",
4 ,"four", FALSE, "Sorry"
)
dfmain %>%
filter(! id %in% dfsub$id) %>%
bind_rows(dfsub) %>%
arrange(id)
# A tibble: 5 x 4
id name val res
<dbl> <chr> <lgl> <chr>
1 1 a NA NA
2 2 two TRUE thanks
3 3 c NA NA
4 4 four FALSE Sorry
5 5 e NA NA
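Since the question asks about updating in place, another option is to overwrite the matching rows directly with base R's `match()`. This is a sketch under the question's stated assumptions (same columns in both data frames, every id in dfsub present in dfmain); the data frames are re-created here so the snippet runs on its own.

```r
dfmain <- data.frame(id = 1:5,
                     name = c("a", "b", "c", "d", "e"),
                     val = NA,
                     res = NA_character_,
                     stringsAsFactors = FALSE)
dfsub <- data.frame(id = c(2, 4),
                    name = c("two", "four"),
                    val = c(TRUE, FALSE),
                    res = c("thanks", "Sorry"),
                    stringsAsFactors = FALSE)

# Positions in dfmain of the ids that appear in dfsub
idx <- match(dfsub$id, dfmain$id)

# Overwrite those rows, matching columns by name
dfmain[idx, names(dfsub)] <- dfsub
dfmain
```

If an id in dfsub were missing from dfmain, `match()` would return NA for it and the assignment would fail, so it is worth checking `anyNA(idx)` first on real data. dplyr also offers `rows_update()` for this kind of keyed overwrite.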

R data.table merge two columns from the same table

I have:
inputDT <- data.table(COL1 = c(1, NA, NA), COL1 = c(NA, 2, NA), COL1 = c(NA, NA, 3))
inputDT
COL1 COL1 COL1
1: 1 NA NA
2: NA 2 NA
3: NA NA 3
I want
outputDT <- data.table(COL1 = c(1,2,3))
outputDT
COL1
1: 1
2: 2
3: 3
Essentially, I have a data.table with multiple columns whose names are the same (values are mutually exclusive), and I need to generate just one column to combine those.
How to achieve it?
The OP is asking for a data.table solution. As of version v1.12.4 (03 Oct 2019), the fcoalesce() function is available:
library(data.table)
inputDT[, .(COL1 = fcoalesce(.SD))]
COL1
1: 1
2: 2
3: 3
Alternatively (less elegant than @Uwe's answer), if you have only numbers and NA, you can calculate the max of each row while removing NA:
library(data.table)
inputDT[, .(COL2 = do.call(pmax, c(na.rm=TRUE, .SD)))]
COL2
1: 1
2: 2
3: 3
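If `fcoalesce()` is not available (data.table older than 1.12.4), the same "first non-NA value per row" logic can be sketched in base R with `Reduce()`. Unlike the pmax approach, it does not rely on the values being numeric. `coalesce2` is a hypothetical helper name, and the columns are given as a plain list rather than a data.table to keep the snippet self-contained.

```r
# The three mutually exclusive columns from the question, as a plain list
cols <- list(c(1, NA, NA), c(NA, 2, NA), c(NA, NA, 3))

# Hypothetical two-argument coalesce: take y wherever x is NA
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over all columns, left to right
COL1 <- Reduce(coalesce2, cols)
COL1
```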

Replace certain variables with NA, one variable is NA

What is the best function to use if I want to replace certain variables with NA based on a conditional?
If status is NA, then score_1 to score_3 should be NA.
I tried:
if(df2$status == NA){
df2$score_2 <- NA
}else{
df2$score_2 <- df$score_2
}
Thanks in advance
One option in base R is to find the NAs in 'Status' and assign NA to the columns that have 'Score' in their names:
i1 <- is.na(df2$Status)
df2[i1, grep("^Score_\\d+$", names(df2))] <- NA
Or an option in dplyr
library(dplyr)
df2 %>%
mutate_at(vars(starts_with('Score')), ~ replace(., is.na(Status), NA))
You can do this by finding which rows of the data frame have NA in Status and then setting the score columns in those rows to NA.
df <- data.frame(client_id = 1:4,
Date = 1:4,
Status = c(1, NA, 1, NA),
Score1 = runif(4)*100,
Score2 = runif(4)*100,
Score3 = runif(4)*100)
idx <- is.na(df$Status)
df[idx, 4:6] <- NA
df
#> client_id Date Status Score1 Score2 Score3
#> 1 1 1 1 48.08677 16.62185 91.80062
#> 2 2 2 NA NA NA NA
#> 3 3 3 1 14.04552 64.55724 56.45998
#> 4 4 4 NA NA NA NA
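It may help to see why the attempt in the question cannot work: comparing anything to NA with `==` yields NA, not TRUE or FALSE, so `is.na()` is needed. The sketch below shows this and selects the score columns by name instead of by position; the fixed score values are made up for illustration.

```r
status <- c(1, NA, 1, NA)

status == NA      # NA NA NA NA: `==` never detects missing values
is.na(status)     # FALSE TRUE FALSE TRUE

# Illustrative data with made-up scores
df <- data.frame(Status = status,
                 Score1 = c(10, 20, 30, 40),
                 Score2 = c(50, 60, 70, 80))

# Pick the score columns by name so the code survives column reordering
score_cols <- grep("^Score", names(df))
df[is.na(df$Status), score_cols] <- NA
df
```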

Exclusive Full Join in r

I am trying to implement an exclusive full join in R.
I implemented the code below, which works correctly, but is it the right approach, given that the filter is filled with lots of conditions? This sample code has only a few columns, but in the real scenario we have many, so adding every column to the filter would make things difficult.
Is there a better approach available?
library(tidyverse)
persons = data.frame(
name = c("Ponting", "Clarke", "Dave", "Bevan"),
age = c(24, 32, 26, 29),
col1 = c(1,2,3,4),
col2 = c("a", "z", "h", "p")
)
person_sports = data.frame(
name = c("Ponting", "Dave", "Roshan"),
sports = c("soccer", "tennis", "boxing"),
rank = c(8, 4, 1),
col3 = c("usa", "australia", "england"),
col4 = c("a", "f1", "z2")
)
persons %>% full_join(person_sports, by = c("name")) %>%
filter((is.na(age) & is.na(col1) & is.na(col2)) | (is.na(sports) & is.na(rank) & is.na(col3) & is.na(col4)))
Output:
Try using complete.cases. This will return a vector of TRUE/FALSE where FALSE indicates an NA is found on a given row in at least one column.
persons %>% full_join(person_sports, by = c("name")) %>% .[!complete.cases(.), ]
# name age col1 col2 sports rank col3 col4
# 2 Clarke 32 2 z <NA> NA <NA> <NA>
# 4 Bevan 29 4 p <NA> NA <NA> <NA>
# 5 Roshan NA NA <NA> boxing 1 england z2
As an alternative, which works similarly to the above, use filter_all and any_vars from the dplyr package.
persons %>% full_join(person_sports, by = c("name")) %>% filter_all(any_vars(is.na(.)))
# name age col1 col2 sports rank col3 col4
# 1 Clarke 32 2 z <NA> NA <NA> <NA>
# 2 Bevan 29 4 p <NA> NA <NA> <NA>
# 3 Roshan NA NA <NA> boxing 1 england z2
Finally, since you mentioned your actual dataset is much bigger, you might want to compare to a data.table solution and see what works best in your real world data.
library(data.table)
setDT(persons)
setDT(person_sports)
merge(persons, person_sports, by = "name", all = TRUE) %>% .[!complete.cases(.)]
# name age col1 col2 sports rank col3 col4
# 1: Bevan 29 4 p NA NA NA NA
# 2: Clarke 32 2 z NA NA NA NA
# 3: Roshan NA NA NA boxing 1 england z2
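One caveat with the complete.cases approaches: if the original tables already contain NA values, matched rows would be kept as well. A sketch that avoids this is to filter on the join key itself, keeping only the rows whose name is absent from the other table, and then combine the two remainders. This uses base R only; the same idea can be expressed with dplyr's anti_join.

```r
persons <- data.frame(
  name = c("Ponting", "Clarke", "Dave", "Bevan"),
  age = c(24, 32, 26, 29),
  col1 = c(1, 2, 3, 4),
  col2 = c("a", "z", "h", "p"),
  stringsAsFactors = FALSE
)
person_sports <- data.frame(
  name = c("Ponting", "Dave", "Roshan"),
  sports = c("soccer", "tennis", "boxing"),
  rank = c(8, 4, 1),
  col3 = c("usa", "australia", "england"),
  col4 = c("a", "f1", "z2"),
  stringsAsFactors = FALSE
)

# Rows whose key appears in only one of the two tables
only_persons <- persons[!persons$name %in% person_sports$name, ]
only_sports  <- person_sports[!person_sports$name %in% persons$name, ]

# Full outer merge of the two disjoint remainders fills the other side with NA
result <- merge(only_persons, only_sports, by = "name", all = TRUE)
result
```

Because the filter only looks at the key column, the number of other columns does not matter.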

Resources