Determine if sub string appears in a string by row of dataframe


I have a dataframe that is revised every day. When an error occurs, it is checked, and if it can be solved, the keyword "REVISED" is added to the beginning of the error message, like so:
ID M1 M2 M3
1 NA "REVISED-error" "error"
2 "REVISED-error" "REVISED-error" NA
3 "REVISED-error" "REVISED-error" "error"
4 NA "error" NA
5 NA NA NA
I want to add two columns that tell me whether there are any errors and how many of them have been revised, like this:
ID M1 M2 M3 i1 ix
1 NA "REVISED-error" "error" 2 1 <- 2 errors, 1 revised
2 "REVISED-error" "REVISED-error" NA 2 2
3 "REVISED-error" "REVISED-error" "error" 3 2
4 NA "error" NA 1 0
5 NA NA NA 0 0
I found this code:
df <- df%>%mutate(i1 = rowSums(!is.na(.[2:4])))
That helps me to know how many errors are in those specific columns. How can I know if any of said errors contains the keyword REVISED? I've tried a few things but none have worked so far:
df <- df%>%
mutate(i1 = rowSums(!is.na(.[2:4])))%>%
mutate(ie = rowSums(.[2:4] %in% "REVISED"))
This returns the error: 'x' must be an array of at least two dimensions

You could use apply to count how many times "error" and "REVISED" appear in each row.
df[c("i1", "ix")] <- t(apply(df[-1], 1, function(x)
c(sum(grepl("error", x)), sum(grepl("REVISED", x)))))
df
# ID M1 M2 M3 i1 ix
#1 1 <NA> REVISED-error error 2 1
#2 2 REVISED-error REVISED-error <NA> 2 2
#3 3 REVISED-error REVISED-error error 3 2
#4 4 <NA> error <NA> 1 0
#5 5 <NA> <NA> <NA> 0 0
An alternative approach uses is.na and rowSums to calculate i1.
df$i1 <- rowSums(!is.na(df[-1]))
df$ix <- apply(df[-1], 1, function(x) sum(grepl("REVISED", x)))
data
df <- structure(list(ID = 1:5, M1 = structure(c(NA, 1L, 1L, NA, NA),
.Label = "REVISED-error", class = "factor"),
M2 = structure(c(2L, 2L, 2L, 1L, NA), .Label = c("error",
"REVISED-error"), class = "factor"), M3 = structure(c(1L,
NA, 1L, NA, NA), .Label = "error", class = "factor")), row.names = c(NA,
-5L), class = "data.frame")
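As a variant of the same idea, the per-row apply can be replaced with a column-wise grepl followed by rowSums. This is only a sketch: it assumes the message columns are M1 to M3 and rebuilds the question's data as character columns rather than factors. The names msg_cols and is_revised are made up for illustration.

```r
# Question's data, rebuilt as plain character columns (an assumption;
# the answer above stores them as factors).
df <- data.frame(
  ID = 1:5,
  M1 = c(NA, "REVISED-error", "REVISED-error", NA, NA),
  M2 = c("REVISED-error", "REVISED-error", "REVISED-error", "error", NA),
  M3 = c("error", NA, "error", NA, NA),
  stringsAsFactors = FALSE
)

msg_cols <- c("M1", "M2", "M3")  # hypothetical name for the message columns

# One logical column per message column; grepl() returns FALSE for NA,
# so missing messages are never counted as revised.
is_revised <- sapply(df[msg_cols], function(col) grepl("REVISED", col, fixed = TRUE))

df$i1 <- rowSums(!is.na(df[msg_cols]))  # errors per row
df$ix <- rowSums(is_revised)            # revised errors per row
df
```

Because grepl is applied once per column rather than once per row, this tends to scale better when the data frame has many rows and few message columns.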

You can use str_count() from the stringr library to count the number of times REVISED appears. Pasting the columns together first turns NA into the text "NA", so missing values do not break the count:
df <- data.frame(M1=as.character(c(NA, "REVISED-x", "REVISED-x")),
M2=as.character(c("REVISED-x", "REVISED-x", "REVISED-x")),
stringsAsFactors = FALSE)
library(stringr)
df$ix <- str_count(paste0(df$M1, df$M2), "REVISED")
df
# M1 M2 ix
# 1 <NA> REVISED-x 1
# 2 REVISED-x REVISED-x 2
# 3 REVISED-x REVISED-x 2

Related

Sum many rows with some of them have NA in all needed columns

I am trying to use rowSums, but I get zero for the last row and I need it to be NA.
My df is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
I used this code, based on the question "Sum of two Columns of Data Frame with NA Values":
df$sum<-rowSums(df[,c("a", "b", "c")], na.rm=T)
Any advice will be greatly appreciated

For each row check if it is all NA and if so return NA; otherwise, apply sum. We have selected columns a, b and c even though that is all the columns because the poster indicated that there might be additional ones.
sum_or_na <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
transform(df, sum = apply(df[c("a", "b", "c")], 1, sum_or_na))
giving:
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 NA 8
4 NA NA NA NA
Note
df in reproducible form is assumed to be:
df <- structure(list(a = c(1L, 2L, 3L, NA), b = c(4L, NA, 5L, NA),
c = c(7L, 8L, NA, NA)),
row.names = c("1", "2", "3", "4"), class = "data.frame")
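A vectorized sketch of the same rule, without a per-row function call: compute rowSums with na.rm = TRUE, then overwrite the rows that have no non-missing value. The names cols and n_present are illustrative and not part of the answer above.

```r
# Same data as in the Note above.
df <- data.frame(a = c(1L, 2L, 3L, NA),
                 b = c(4L, NA, 5L, NA),
                 c = c(7L, 8L, NA, NA))

cols <- c("a", "b", "c")                 # columns to sum (hypothetical name)
n_present <- rowSums(!is.na(df[cols]))   # non-missing values per row

df$sum <- rowSums(df[cols], na.rm = TRUE)
df$sum[n_present == 0] <- NA             # all-NA rows become NA instead of 0
df
```

The threshold is easy to adjust: for example, `n_present < 2` would require at least two observed values before reporting a sum.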

R compare two columns return third column if any column match conditions

I have a dataset, data1, which has the columns ME and PDR.
I want to create a third column, case, which would look like this:
ME PDR case
1 2 2
NA 1 1
NA 1 1
1 2 2
NA NA NA
I tried this command, but it doesn't return 1 when there is a 1 in either column and no 2 in either of them.
data1$case=ifelse(data1$ME==2 | data1$PDR==2 ,2,ifelse(data1$ME==NA & data1$PDR==NA,NA,1))

We can use pmax
data1$case <- do.call(pmax, c(data1, na.rm = TRUE))
data1$case
#[1] 2 1 1 2 NA
Regarding the OP's attempt: == returns NA for any element that is NA, so ifelse returns NA for those rows. We need to handle the NA explicitly by adding a condition (& !is.na(ME), and likewise for PDR):
with(data1, ifelse((ME == 2 & !is.na(ME)) | (PDR == 2 & !is.na(PDR)),
2, ifelse(is.na(ME) &is.na(PDR), NA, 1)))
#[1] 2 1 1 2 NA
NOTE: Using == to check for NA is not recommended; use the functions that return a logical vector for missing values (is.na, complete.cases).
data
data1 <- structure(list(ME = c(1L, NA, NA, 1L, NA), PDR = c(2L, 1L, 1L,
2L, NA)), class = "data.frame", row.names = c(NA, -5L))
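The behaviour of == with missing values described above can be seen on a small vector. This is a standalone illustration; the vector x is made up and is not part of data1.

```r
x <- c(2L, NA, 1L)   # illustrative values only

x == 2               # TRUE NA FALSE : comparing with NA yields NA
x == NA              # NA NA NA      : never TRUE, so it cannot detect NA
is.na(x)             # FALSE TRUE FALSE

# Guarding the comparison turns the NA into FALSE, because NA & FALSE is FALSE.
guarded <- x == 2 & !is.na(x)
guarded              # TRUE FALSE FALSE
```

This is why the nested ifelse in the question never reaches its final branch for rows that contain an NA: the first test is already NA, and ifelse passes NA through.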

Remove rows with specific NA column

I have the following dataset, where some entries (unique values of A) never have data in B and others have data only sometimes.
A B
1 NA
2 NA
3 77
1 NA
2 81
I want to delete the entries that always have NA and keep the rest:
A B
2 NA
3 77
2 81

We can use ave grouped by A and remove the groups in which every B is NA
df[!with(df, ave(is.na(B), A, FUN = all)), ]
# A B
#2 2 NA
#3 3 77
#5 2 81
Using the same logic with dplyr
library(dplyr)
df %>%
group_by(A) %>%
filter(!all(is.na(B)))
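To make the ave step less opaque, the sketch below splits it out and shows the intermediate per-row flag. It rebuilds the question's data inline; the names all_na and kept are illustrative.

```r
df <- data.frame(A = c(1L, 2L, 3L, 1L, 2L),
                 B = c(NA, NA, 77L, NA, 81L))

# ave() applies FUN within each group of A and returns one value per row,
# so every row learns whether its whole group is missing in B.
all_na <- as.logical(ave(is.na(df$B), df$A, FUN = all))
all_na   # TRUE FALSE FALSE TRUE FALSE

# Keep the rows whose group has at least one non-missing B.
kept <- df[!all_na, ]
kept
```

Row 2 is kept even though its own B is NA, because another row with A equal to 2 has a value.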

Assuming the input shown reproducibly in the Note at the end, for each group defined by A we return TRUE if any of its elements in B are not NA.
subset(DF, ave(!is.na(B), A, FUN = any))
Note
Lines <- "
A B
1 NA
2 NA
3 77
1 NA
2 81"
DF <- read.table(text = Lines, header = TRUE)

We can use data.table
library(data.table)
setDT(df1)[, .SD[any(!is.na(B))], A]
# A B
#1: 2 NA
#2: 2 81
#3: 3 77
data
df1 <- structure(list(A = c(1L, 2L, 3L, 1L, 2L), B = c(NA, NA, 77L,
NA, 81L)), class = "data.frame", row.names = c(NA, -5L))

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise, it should be the result of var1+var2+var3.
If boss==2, newvar should automatically be 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA)
However, I'm struggling to combine the two. Any help is much appreciated.

Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done by creating an index, without using ifelse at all
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
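The base R version above relies on the fact that NA^0 is 1 while NA^1 is NA, so raising NA to a logical flag gives a multiplier that either keeps a value or blanks it out. A minimal illustration with made-up values (flag and sums are hypothetical names):

```r
NA^FALSE   # 1  : anything to the power 0 is 1, even NA
NA^TRUE    # NA

flag <- c(TRUE, FALSE, FALSE)   # TRUE marks rows that should become NA
sums <- c(3, 8, 0)              # illustrative row sums

masked <- NA^flag * sums
masked     # NA 8 0
```

The trick is compact but easy to misread; an explicit ifelse or replacement by index does the same job more transparently.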

A base R solution using apply:
df$newvar <- apply(df,1, function(x){
#retVal = NA
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
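Since the question asks specifically about combining rowSums and ifelse, here is a vectorized base R sketch that does so directly, with no per-row function. It assumes the value columns are var1 to var3; the names vars and n_missing are illustrative.

```r
df <- data.frame(boss = c(1L, 1L, 2L, 2L, 2L, 1L),
                 var1 = c(NA, 2L, NA, NA, NA, 1L),
                 var2 = c(NA, 3L, NA, NA, NA, NA),
                 var3 = c(3L, 3L, NA, NA, NA, 2L))

vars <- c("var1", "var2", "var3")      # hypothetical name for the value columns
n_missing <- rowSums(is.na(df[vars]))  # missing values per row

# boss == 2 wins first; otherwise more than one missing value gives NA,
# and everything else is the row sum ignoring at most one NA.
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_missing > 1, NA,
                           rowSums(df[vars], na.rm = TRUE)))
df$newvar   # NA 8 0 0 0 3
```

The order of the conditions matters: testing boss == 2 first ensures those rows get 0 even though all of their values are missing.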

R - Creating conditional variables with loop

I have a dataset like this (but this is just a subset; the real dataset has hundreds of ID_Desc variables), where each data point has a person's gender, and whether they checked off a number of descriptors (1) or not (NA):
Gender ID1_Desc_1 ID1_Desc_2 ID1_Desc_3 ID2_Desc_1 ID2_Desc_2 ID2_Desc_3 ID3_Desc_1 ID3_Desc_2 ID3_Desc_3
1 NA NA 1 NA NA 1 NA NA NA
2 NA 1 1 NA NA NA 1 1 NA
1 1 1 1 NA 1 NA NA NA NA
I'm trying to write a loop that will (1) check their gender, (2) based on their gender, check whether they checked off the same descriptor in the first list they saw (lists ID1 and ID2 for Gender=1 and lists ID1 and ID3 for Gender=2), and (3) create a new variable (Same#) that indicates whether they checked off the same descriptor in both lists (by writing a 1) or not (by writing a 0).
I've been working with this code, which seems to be checking their gender ok and creating the new variables (Same#), but it's writing 0's for everything, which is not correct:
for (i in 1:3){
assign(paste("Same",i,sep=""),
ifelse(Gender=="1",
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID2_Desc_",i,sep=""),1,0),
ifelse(paste("ID1_Desc_",i,sep="")==paste("ID3_Desc_",i,sep=""),1,0)
)
)
}
Based on the data I provided, Same1 should be 0 0 1 (since Gender=1 and they chose Desc_3 in both the ID1 and ID2 lists), Same2 should be 0 1 0 (since Gender=2 and they chose Desc_2 in both the ID1 and ID3 lists), and Same3 should be 0 1 0 (since Gender=1 and they chose Desc_2 in both the ID1 and ID2 lists) but right now, all 3 come out as 0 0 0.
I know using loops may not be the best way to do this, but I'd really like to know how to do it with loop if it's possible. If not, anything that works would be incredibly appreciated. Thanks.

You may try this
ind1 <- grep("^ID1", colnames(df))
ind2 <- grep("^ID2", colnames(df))
ind3 <- grep("^ID3", colnames(df))
cond1 <- do.call(cbind,Map(`==` , df[ind1], df[ind2]))
cond2 <- do.call(cbind,Map(`==` , df[ind1], df[ind3]))
Finalind <- do.call(cbind, Map(`|`, as.data.frame(t(cond1)),
as.data.frame(t(cond2))))
res <- (!is.na(Finalind))+0
rownames(res) <- paste0("Same", 1:3)
t(res)
# Same1 Same2 Same3
#V1 0 0 1
#V2 0 1 0
#V3 0 1 0
cbind(df, t(res))
data
df <- structure(list(Gender = c(1L, 2L, 1L), ID1_Desc_1 = c(NA, NA,
1L), ID1_Desc_2 = c(NA, 1L, 1L), ID1_Desc_3 = c(1L, 1L, 1L),
ID2_Desc_1 = c(NA, NA, NA), ID2_Desc_2 = c(NA, NA, 1L), ID2_Desc_3 = c(1L,
NA, NA), ID3_Desc_1 = c(NA, 1L, NA), ID3_Desc_2 = c(NA, 1L,
NA), ID3_Desc_3 = c(NA, NA, NA)), .Names = c("Gender", "ID1_Desc_1",
"ID1_Desc_2", "ID1_Desc_3", "ID2_Desc_1", "ID2_Desc_2", "ID2_Desc_3",
"ID3_Desc_1", "ID3_Desc_2", "ID3_Desc_3"), class = "data.frame",
row.names = c(NA, -3L))
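Since the question asks for a loop, here is a loop-based sketch. The original attempt compared the strings produced by paste (column names) rather than the columns themselves; indexing with df[[...]] retrieves the actual values. The code below assumes the column naming pattern from the question and picks the second list by Gender; the names first, second, and same are illustrative.

```r
df <- data.frame(
  Gender     = c(1L, 2L, 1L),
  ID1_Desc_1 = c(NA, NA, 1L), ID1_Desc_2 = c(NA, 1L, 1L), ID1_Desc_3 = c(1L, 1L, 1L),
  ID2_Desc_1 = c(NA, NA, NA), ID2_Desc_2 = c(NA, NA, 1L), ID2_Desc_3 = c(1L, NA, NA),
  ID3_Desc_1 = c(NA, 1L, NA), ID3_Desc_2 = c(NA, 1L, NA), ID3_Desc_3 = c(NA, NA, NA)
)

for (i in 1:3) {
  # Look the columns up by name; paste0() alone only builds the name.
  first  <- df[[paste0("ID1_Desc_", i)]]
  # Gender 1 is compared against list ID2, everyone else against ID3.
  second <- ifelse(df$Gender == 1,
                   df[[paste0("ID2_Desc_", i)]],
                   df[[paste0("ID3_Desc_", i)]])
  same <- first == 1 & second == 1
  # An NA comparison means the descriptor was not checked in both lists.
  df[[paste0("Same", i)]] <- as.integer(!is.na(same) & same)
}

df[c("Gender", "Same1", "Same2", "Same3")]
```

On this data the loop gives the same Same1 to Same3 values as the answer above, while also taking Gender into account when choosing which list to compare.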
