How to remove specific rows in R?

A data set I'm using is the following:
C1 C2 C3 R1
R1 NA NA NA 5
R2 NA NA 0.4 7
R3 0.1 NA NA 6
R4 NA NA NA 2
From the data frame, I want to remove rows that contain a number larger than zero in any of the columns C1 to C3.
The final outcome must be:
C1 C2 C3 R1
R1 NA NA NA 5
R4 NA NA NA 2
I tried with:
df<- df %>% filter_at(vars('C1' : 'C2`), all_vars(. > 0))
but I got an error with this. How can I fix it?

You can use rowSums in base R :
cols <- paste0('C', 1:3)
df[rowSums(df[cols] > 0, na.rm = TRUE) == 0, ]
Or using filter_at :
library(dplyr)
df %>% filter_at(vars(C1:C3), all_vars(. <= 0 | is.na(.)))
# C1 C2 C3 R1
#R1 NA NA NA 5
#R4 NA NA NA 2
filter_at has been superseded, so with dplyr 1.0.4 or later you can write this with if_all:
df %>% filter(if_all(C1:C3, ~ . <= 0 | is.na(.)))
data
df <- structure(list(C1 = c(NA, NA, 0.1, NA), C2 = c(NA, NA, NA, NA
), C3 = c(NA, 0.4, NA, NA), R1 = c(5L, 7L, 6L, 2L)),
class = "data.frame", row.names = c("R1", "R2", "R3", "R4"))
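To see why the rowSums filter works, it may help to look at the intermediate values. The sketch below rebuilds the sample data and splits the one-liner into steps; the names `positive`, `n_positive` and `result` are introduced here for illustration only.

```r
# Sample data from the question
df <- data.frame(
  C1 = c(NA, NA, 0.1, NA),
  C2 = c(NA, NA, NA, NA),
  C3 = c(NA, 0.4, NA, NA),
  R1 = c(5L, 7L, 6L, 2L),
  row.names = c("R1", "R2", "R3", "R4")
)

cols <- paste0("C", 1:3)

# Logical matrix: TRUE where a value exceeds zero, NA where the value is missing
positive <- df[cols] > 0

# Count the TRUE cells per row; na.rm = TRUE makes missing cells count as zero
n_positive <- rowSums(positive, na.rm = TRUE)

# Keep only the rows with no positive value in C1..C3
result <- df[n_positive == 0, ]
print(result)
```

Because the comparison is done on the whole block of columns at once, the same pattern should extend to more columns by changing `cols` only.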

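A more manual approach is as follows. It relies on the data.table package, and converting with as.data.table drops the row names:
library(data.table)
df <- as.data.table(df)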
if(length(which(df$C1 > 0)) > 0){df <- df[-(which(df$C1 > 0)),]}
if(length(which(df$C2 > 0)) > 0){df <- df[-(which(df$C2 > 0)),]}
if(length(which(df$C3 > 0)) > 0){df <- df[-(which(df$C3 > 0)),]}
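The three near-identical statements above could also be folded into a loop. Here is a minimal base R sketch, assuming the same sample data; it stays with a plain data frame rather than a data.table so that the row names survive, and the variable name `hit` is made up for this example.

```r
df <- data.frame(
  C1 = c(NA, NA, 0.1, NA),
  C2 = c(NA, NA, NA, NA),
  C3 = c(NA, 0.4, NA, NA),
  R1 = c(5L, 7L, 6L, 2L),
  row.names = c("R1", "R2", "R3", "R4")
)

for (col in c("C1", "C2", "C3")) {
  # which() drops NA, so only rows with a known positive value are returned
  hit <- which(df[[col]] > 0)
  # Guard against an empty index: df[-integer(0), ] would drop every row
  if (length(hit) > 0) {
    df <- df[-hit, , drop = FALSE]
  }
}
print(df)
```

The length check matters for the same reason as in the original code: negative indexing with an empty vector selects nothing instead of everything.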

Related

Reduce number of columns with priority for certain values

I would like to collapse a data frame with < 100 columns fourfold,
whereby the code would iterate over groups of 4 adjacent columns and collapse them into one.
However, the resulting values based on each set of 4, depend on the priority of the value.
For example, the highest priority is '1', so whenever any of the 4 columns has the value '1' for that row, it should be the resulting value. The second priority is '0': if the set has one '0' and three NAs, the result should be '0' (as long as there are no '1's). The lowest priority is NA: only sets consisting entirely of NA remain NA. An example is below, with reproducible code underneath.
ID c1 c2 c3 c4 c5 c6 c7 c8
row1 1 0 0 0 1 0 0 NA
row2 NA NA NA 0 NA NA NA NA
becomes
ID c1 c2
row1 1 1
row2 0 NA
structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0,
NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA,
-2L))
How about this:
dat <- structure(list(ID = c("row1", "row2"), c1 = c(1, NA), c2 = c(0,
NA), c3 = c(0, NA), c4 = c(0, 0), c5 = c(1, NA), c6 = c(0, NA
), c7 = c(0, NA), c8 = c(NA, NA)), class = "data.frame", row.names = c(NA,
-2L))
out <- data.frame(ID = dat$ID)
k <- 2 # first column to start
i <- 1 # first variable name
while(k < ncol(dat)){
out[[paste0("c", i)]] <- apply(dat[,k:(k+3)], 1, max, na.rm=TRUE)
out[[paste0("c", i)]] <- ifelse(is.finite(out[[paste0("c", i)]]), out[[paste0("c", i)]], NA)
k <- k+4
i <- i+1
}
#> Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
out
#> ID c1 c2
#> 1 row1 1 1
#> 2 row2 0 NA
Created on 2022-11-21 by the reprex package (v2.0.1)
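The `-Inf` warning in the loop above comes from calling `max()` on a group that is entirely `NA`. One way to sidestep it is to test for the all-`NA` case before taking the maximum. This is a sketch under that assumption; the helper `collapse_group()` and the grouping vector `grp` are hypothetical names, not part of the original answer.

```r
dat <- data.frame(
  ID = c("row1", "row2"),
  c1 = c(1, NA), c2 = c(0, NA), c3 = c(0, NA), c4 = c(0, 0),
  c5 = c(1, NA), c6 = c(0, NA), c7 = c(0, NA), c8 = c(NA, NA)
)

# Hypothetical helper: 1 beats 0, 0 beats NA, and all-NA stays NA
collapse_group <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

vals <- dat[-1]                           # drop the ID column
grp  <- (seq_along(vals) - 1) %/% 4 + 1   # 1,1,1,1,2,2,2,2

out <- data.frame(ID = dat$ID)
for (g in unique(grp)) {
  block <- as.matrix(vals[grp == g])      # the 4 columns of this group
  out[[paste0("c", g)]] <- apply(block, 1, collapse_group)
}
print(out)
```

Deriving `grp` from the column positions means the number of groups does not have to be written out by hand.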
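Here is an alternative approach with data.table (the group index is derived from each column's position, so it does not depend on the number of rows):
library(data.table)
# 1 if any value is 1, 0 if only 0s and NAs are present, NA if all are NA
f <- function(x) fifelse(all(is.na(x)), NA_real_, 1 * (sum(x, na.rm = TRUE) > 0))
long <- melt(as.data.table(df), id.vars = "ID", variable.name = "v")
# v is a factor in column order, so its integer code is the column position
long[, r := paste0("c", (as.integer(v) - 1L) %/% 4L + 1L)]
dcast(long[, .(V1 = f(value)), by = .(ID, r)], ID ~ r, value.var = "V1")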
Output:
ID c1 c2
1: row1 1 1
2: row2 0 NA
Using split.default to split the data frame into groups of 4 columns, then pmax within each group (df1 is the example data from the question):
x <- df1
x[ is.na(x) ] <- -1
res <- cbind(df1[ "ID" ],
lapply(split.default(x[, -1], rep(1:2, each = 4)),
function(i) do.call(pmax, i)))
res[ res == -1 ] <- NA
res
# ID 1 2
# 1 row1 1 1
# 2 row2 0 NA

Editing data in rows

I'm trying to convert my data in R, but I can't manage to get the column I want.
My dataset is as below, and the column I want to get is "total", it is the sum of D1 + D2 + D3 + D4 + D5, and ignores "NA".
NR D1 D2 D3 D4 D5 total
A  1  NA NA 1  NA 2
B  NA NA NA NA NA NA
C  NA 1  NA NA NA 1
It is probably quite a dumb question, but I can't get it.
I already tried:
total <- NA
total <- ifelse(D1==1, 1, total)
total <- ifelse(D2==1, total + 1, total)
total <- ifelse(D3==1, total + 1, total)
total <- ifelse(D4==1, total + 1, total)
total <- ifelse(D5==1, total + 1, total)
But it sets all my rows to "NA".
I also tried:
total <- mutate(dataset, total=D1+D2+D3+D4+D5)
but then I don't get an aggregation of the values of D1 to D5.
We could use rowSums
df1$total <- rowSums(df1[startsWith(names(df1), "D")], na.rm = TRUE)
df1$total[df1$total == 0] <- NA
Or the same logic in dplyr
library(dplyr)
df1 %>%
mutate(total = na_if(rowSums(select(., starts_with('D')), na.rm = TRUE), 0))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
data
df1 <- structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA), total = c(2L, NA, 1L)), class = "data.frame", row.names = c(NA,
-3L))
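The two attempts in the question fail for the same reason: arithmetic and comparisons involving `NA` return `NA`. The sketch below, using the sample data without the pre-computed `total` column, shows that behaviour next to `rowSums()`. The names `plus_total`, `d_cols` and `all_missing` are chosen for this illustration only.

```r
df1 <- data.frame(
  NR = c("A", "B", "C"),
  D1 = c(1L, NA, NA), D2 = c(NA, NA, 1L), D3 = c(NA, NA, NA),
  D4 = c(1L, NA, NA), D5 = c(NA, NA, NA)
)

# Adding with `+` propagates NA: one missing cell makes the whole sum NA
plus_total <- with(df1, D1 + D2 + D3 + D4 + D5)
print(plus_total)

d_cols <- df1[startsWith(names(df1), "D")]

# rowSums(na.rm = TRUE) skips missing cells, but an all-NA row sums to 0
total <- rowSums(d_cols, na.rm = TRUE)

# Restore NA for rows in which every D column is missing
all_missing <- rowSums(!is.na(d_cols)) == 0
total[all_missing] <- NA
df1$total <- total
print(df1)
```

Flagging the all-missing rows explicitly, instead of replacing every 0 with `NA`, keeps a genuine sum of zero intact if the data ever contains 0 values.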
Here is a solution with c_across and rowwise (note that an all-NA row, such as B, gets 0 rather than NA):
library(dplyr)
df %>%
rowwise() %>%
mutate(Total = sum(c_across(D1:D5 & where(is.numeric)), na.rm = TRUE))
Output:
NR D1 D2 D3 D4 D5 Total
<chr> <int> <int> <lgl> <int> <lgl> <int>
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA 0
3 C NA 1 NA NA NA 1
data:
structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA)), row.names = c(NA, -3L), class = "data.frame")
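You can try the code below, which counts the non-NA cells in each row (minus one for the NR column). This equals the sum here only because every non-missing value is 1: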
df$total <- replace(u <- rowSums(!is.na(df)) - 1, u == 0, NA)
which gives
> df
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
And also this one:
library(dplyr)
library(purrr)
df1 <- df1[, !names(df1) %in% "total"]
df1 %>%
mutate(total = pmap_dbl(select(cur_data(), starts_with("D")), ~ ifelse(all(is.na(c(...))),
NA, sum(c(...), na.rm = TRUE))))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1

merge multiple columns in one table?

I have a table with several columns. I would like to create a new column by combining the columns R1, R2 and R3.
DF:
ID R1 T1 R2 T2 R3 T3
rs1 A 1 NA . NA 0
rs21 NA 0 C 1 C 1
rs32 A 1 A 1 A 0
rs25 NA 2 NA 0 A 0
Desired output:
ID R1 T1 R2 T2 R3 T3 New_R
rs1 A 1 NA . NA 0 A
rs21 NA 0 C 1 C 1 C
rs32 A 1 A 1 A 0 A
rs25 NA 2 NA 0 A 0 A
We can use tidyverse
library(tidyverse)
DF %>%
mutate(New_R = pmap_chr(select(., starts_with("R")), ~c(...) %>%
na.omit %>%
unique %>%
str_c(collapse = "")))
#    ID   R1 T1   R2 T2   R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A
If there is only one distinct non-NA element per row, we can use coalesce, which returns the first non-NA value
DF %>%
mutate(New_R = coalesce(!!! select(., starts_with("R"))))
Or in base R
DF$New_R <- do.call(pmin, c(DF[grep("^R\\d+", names(DF))], na.rm = TRUE))
data
DF <- structure(list(ID = c("rs1", "rs21", "rs32", "rs25"), R1 = c("A",
NA, "A", NA), T1 = c(1L, 0L, 1L, 2L), R2 = c(NA, "C", "A", NA
), T2 = c(".", "1", "1", "0"), R3 = c(NA, "C", "A", "A"), T3 = c(0L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -4L))
You can use the ifelse function in a nested way:
DF$New_R <- ifelse(!is.na(DF$R1), DF$R1,
ifelse(!is.na(DF$R2), DF$R2,
ifelse(!is.na(DF$R3), DF$R3, NA)))
ifelse takes three arguments: a condition, the value to return where the condition is TRUE, and the value to return where it is FALSE. It is vectorised, so when applied to data frame columns it treats each row separately. In my example it picks the first non-NA value found among R1, R2 and R3.
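The nested ifelse grows by one level per column. For a longer list of columns, the same "first non-NA value wins" rule could be expressed with `Reduce()`. This is a sketch rather than part of the original answer: `first_non_na` is a hypothetical helper name, and the T columns are left out of the sample data for brevity.

```r
DF <- data.frame(
  ID = c("rs1", "rs21", "rs32", "rs25"),
  R1 = c("A", NA, "A", NA),
  R2 = c(NA, "C", "A", NA),
  R3 = c(NA, "C", "A", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep `a` where it is present, otherwise fall back to `b`
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

r_cols <- grep("^R\\d+$", names(DF), value = TRUE)   # "R1" "R2" "R3"

# Fold the helper over the columns from left to right
DF$New_R <- Reduce(first_non_na, DF[r_cols])
print(DF)
```

Adding an R4 column later would then need no change to the code, since the columns are picked up by name pattern.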
We can use apply row-wise, removing NA values and keeping only the unique values.
cols <- paste0("R", 1:3)
df$New_R <- apply(df[cols], 1, function(x)
paste0(unique(na.omit(x)), collapse = ""))
df
# ID R1 T1 R2 T2 R3 T3 New_R
#1 rs1 A 1 <NA> . <NA> 0 A
#2 rs21 <NA> 0 C 1 C 1 C
#3 rs32 A 1 A 1 A 0 A
#4 rs25 <NA> 2 <NA> 0 A 0 A

Determine if sub string appears in a string by row of dataframe

I have a dataframe that is revised every day. When an error occurs, it's checked, and if it can be solved, the keyword "REVISED" is added to the beginning of the error message, like so:
ID M1 M2 M3
1 NA "REVISED-error" "error"
2 "REVISED-error" "REVISED-error" NA
3 "REVISED-error" "REVISED-error" "error"
4 NA "error" NA
5 NA NA NA
I want to find a way to add two columns that tell me whether there are any errors and how many of them have been revised, like this:
ID M1 M2 M3 i1 ix
1 NA "REVISED-error" "error" 2 1 <- 2 errors, 1 revised
2 "REVISED-error" "REVISED-error" NA 2 2
3 "REVISED-error" "REVISED-error" "error" 3 2
4 NA "error" NA 1 0
5 NA NA NA 0 0
I found this code:
df <- df%>%mutate(i1 = rowSums(!is.na(.[2:4])))
That helps me to know how many errors are in those specific columns. How can I know if any of said errors contains the keyword REVISED? I've tried a few things but none have worked so far:
df <- df%>%
mutate(i1 = rowSums(!is.na(.[2:4])))%>%
mutate(ie = rowSums(.[2:4) %in% "REVISED")
This returns the error "x must be an array of at least two dimensions".
You could use apply to find number of times "error" and "REVISED" appears in each row.
df[c("i1", "ix")] <- t(apply(df[-1], 1, function(x)
c(sum(grepl("error", x)), sum(grepl("REVISED", x)))))
df
# ID M1 M2 M3 i1 ix
#1 1 <NA> REVISED-error error 2 1
#2 2 REVISED-error REVISED-error <NA> 2 2
#3 3 REVISED-error REVISED-error error 3 2
#4 4 <NA> error <NA> 1 0
#5 5 <NA> <NA> <NA> 0 0
An alternative approach uses is.na and rowSums to calculate i1.
df$i1 <- rowSums(!is.na(df[-1]))
df$ix <- apply(df[-1], 1, function(x) sum(grepl("REVISED", x)))
data
df <- structure(list(ID = 1:5, M1 = structure(c(NA, 1L, 1L, NA, NA),
.Label = "REVISED-error", class = "factor"),
M2 = structure(c(2L, 2L, 2L, 1L, NA), .Label = c("error",
"REVISED-error"), class = "factor"), M3 = structure(c(1L,
NA, 1L, NA, NA), .Label = "error", class = "factor")), row.names = c(NA,
-5L), class = "data.frame")
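Because `apply()` walks over the rows one at a time, a column-wise variant may be worth considering on larger data. The following sketch is an alternative, not the answer's own code: it builds logical matrices with `sapply()` and counts with `rowSums()`, converts factors to character first, and treats `NA` as "no error". The names `msgs`, `is_error` and `is_revised` are illustrative.

```r
df <- data.frame(
  ID = 1:5,
  M1 = c(NA, "REVISED-error", "REVISED-error", NA, NA),
  M2 = c("REVISED-error", "REVISED-error", "REVISED-error", "error", NA),
  M3 = c("error", NA, "error", NA, NA),
  stringsAsFactors = TRUE   # mimic the factor columns in the answer's data
)

# Work on the message columns only, as plain character vectors
msgs <- lapply(df[-1], as.character)

# One logical column per message column; NA cells become FALSE
is_error   <- sapply(msgs, function(x) !is.na(x))
is_revised <- sapply(msgs, function(x) !is.na(x) & startsWith(x, "REVISED"))

df$i1 <- rowSums(is_error)     # number of errors per row
df$ix <- rowSums(is_revised)   # number of revised errors per row
print(df)
```

Using `startsWith()` also matches the question's description that the keyword is added to the beginning of the message, whereas `grepl("REVISED", x)` would match it anywhere in the text.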
You can use str_count() from the stringr library to count the number of times REVISED appears, like so
df <- data.frame(M1=as.character(c(NA, "REVISED-x", "REVISED-x")),
M2=as.character(c("REVISED-x", "REVISED-x", "REVISED-x")),
stringsAsFactors = FALSE)
library(stringr)
df$ix <- str_count(paste0(df$M1, df$M2), "REVISED")
df
# M1 M2 ix
# 1 <NA> REVISED-x 1
# 2 REVISED-x REVISED-x 2
# 3 REVISED-x REVISED-x 2
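One caveat with pasting the columns together: `paste0()` turns `NA` into the literal text "NA", which is harmless here only because "NA" does not contain the keyword. Below is a base R sketch of the same count that makes the intermediate strings visible, assuming the same toy data (`joined` and `matches` are illustrative names).

```r
df <- data.frame(
  M1 = c(NA, "REVISED-x", "REVISED-x"),
  M2 = c("REVISED-x", "REVISED-x", "REVISED-x"),
  stringsAsFactors = FALSE
)

# NA is pasted as the text "NA", e.g. "NAREVISED-x" for the first row
joined <- paste0(df$M1, df$M2)
print(joined)

# Count literal occurrences of the keyword in each joined string
matches <- regmatches(joined, gregexpr("REVISED", joined, fixed = TRUE))
df$ix <- lengths(matches)
print(df)
```

If the keyword could ever overlap with the letters "NA", replacing the missing values with empty strings before pasting would be the safer route.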

remove NA values and combine non NA values into a single column

I have a data set which has numeric and NA values in all columns. I would like to create a new column with all non NA values and preserve the row names
v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 NA NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5
I have tried using the coalesce function from dplyr
digital_metrics_FB <- fb_all_data %>%
mutate(fb_metrics = coalesce("v1",
"v2",
"v3",
"v4",
"v5"))
and also tried an apply function
df2 <- sapply(fb_all_data,function(x) x[!is.na(x)])
still cannot get it to work.
I am looking for the final result to be where all non NA values come together in the final column and the row names are preserved
final
a 1
b 2
c 3
d 4
e 5
any help would be much appreciated
We can use pmax
do.call(pmax, c(fb_all_data , na.rm = TRUE))
If there is more than one non-NA element per row and you want to combine them into a string, a simple base R option would be
data.frame(final = apply(fb_all_data, 1, function(x) toString(x[!is.na(x)])))
Or using coalesce
library(dplyr)
library(tibble)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = coalesce(v1, v2, v3, v4, v5)) %>%
column_to_rownames('rn')
# final
#a 1
#b 2
#c 3
#d 4
#e 5
Or using tidyverse, for multiple non-NA elements (pmap_chr comes from purrr)
library(purrr)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = pmap_chr(.[-1], ~ c(...) %>%
na.omit %>%
toString)) %>%
column_to_rownames('rn')
NOTE: Here we are showing data that the OP showed as example and not some other dataset
data
fb_all_data <- structure(list(v1 = c(1L, NA, NA, NA, NA), v2 = c(NA, 2L, NA,
NA, NA), v3 = c(NA, NA, 3L, NA, NA), v4 = c(NA, NA, NA, 4L, NA
), v5 = c(NA, NA, NA, NA, 5L)), class = "data.frame",
row.names = c("a",
"b", "c", "d", "e"))
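The attempt in the question passes quoted names to coalesce, so it coalesces the literal strings "v1", "v2", and so on instead of the columns. For completeness, here is a base R sketch of the same "first non-NA value per row" idea using `max.col()` and matrix indexing. It assumes only the first non-NA value in a row is wanted, and `m` and `first_col` are illustrative names.

```r
fb_all_data <- data.frame(
  v1 = c(1L, NA, NA, NA, NA),
  v2 = c(NA, 2L, NA, NA, NA),
  v3 = c(NA, NA, 3L, NA, NA),
  v4 = c(NA, NA, NA, 4L, NA),
  v5 = c(NA, NA, NA, NA, 5L),
  row.names = c("a", "b", "c", "d", "e")
)

m <- as.matrix(fb_all_data)

# Column index of the first non-NA cell in every row
first_col <- max.col(!is.na(m), ties.method = "first")

# Pick one cell per row via (row, column) index pairs
final <- data.frame(
  final = m[cbind(seq_len(nrow(m)), first_col)],
  row.names = rownames(fb_all_data)
)
print(final)
```

A row that is entirely `NA` falls back to column 1 and therefore yields `NA`, which seems a reasonable result for this use case.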
With tidyverse, you can do:
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE) %>%
group_by(rowname) %>%
summarise(val = paste(val, collapse = ", "))
rowname val
<chr> <chr>
1 a 1
2 b 2, 3
3 c 3
4 d 4
5 e 5
Sample data to have a row with more than one non-NA value:
df <- read.table(text = " v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 3 NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5", header = TRUE)
