Compare vector to subset of row in a data.frame - r

I have a data.frame "dat" and a numeric vector "test":
code <- c("A22", "B15", "C03")
v.1 <- 1:3
v.2 <- 3:1
v.3 <- c(2, NA, 2)
bob <- c("yes", "no", "no")
dat <- data.frame(code, v.1, v.2, v.3, bob, stringsAsFactors = FALSE)
test <- c(3, 1, 2)
I want to find the row in the data.frame where the second to fourth columns ("v.1", "v.2", "v.3") contain the same values as the vector, in the same order, and return the value from the "code"-column (in this case "C03").
I tried
dat[dat[, 2:4] == test]$code
and
which(apply(dat, 1, function(x) all.equal(dat[, 2:4], test)) == FALSE)
both of which do not work.
I would prefer a solution with base R.

Your second attempt (with which) fails for several reasons: calling apply on the whole of dat converts it to a character matrix; the function never uses its argument x; and you should use all rather than all.equal, and probably TRUE rather than FALSE (although the comparison is not actually needed).
You can modify it a bit to make it work:
which(apply(dat[, 2:4], 1, function(x) all(x==test)))
[1] 3
Or
dat[apply(dat[, 2:4], 1, function(x) all(x==test)), "code"]
[1] "C03"
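One caveat: because v.3 contains an NA, all(x == test) can return NA rather than FALSE for a row that matches everywhere except the missing value, and an NA in a logical index yields an NA result. A minimal sketch of guarding against this with isTRUE(); the vector test_na is a made-up value chosen only to trigger the case:

```r
code <- c("A22", "B15", "C03")
dat <- data.frame(code, v.1 = 1:3, v.2 = 3:1, v.3 = c(2, NA, 2),
                  stringsAsFactors = FALSE)

# Hypothetical test vector: row 2 matches on v.1 and v.2, but its v.3 is NA
test_na <- c(2, 2, 2)
raw <- apply(dat[, 2:4], 1, function(x) all(x == test_na))
# raw is FALSE, NA, FALSE, so dat[raw, "code"] would return NA

# isTRUE() turns the NA into FALSE, so only complete matches are kept
safe <- apply(dat[, 2:4], 1, function(x) isTRUE(all(x == test_na)))
dat[safe, "code"]   # character(0): no row matches completely

# The example from the question still works
test <- c(3, 1, 2)
hit <- apply(dat[, 2:4], 1, function(x) isTRUE(all(x == test)))
dat[hit, "code"]    # "C03"
```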

With apply we can paste the values of columns 2 to 4 together row by row, compare each result with test pasted together the same way, and use the matching row to select the code column.
dat[apply(dat[2:4], 1, paste0, collapse = "|") ==
paste0(test, collapse = "|"), "code"]
#[1] "C03"

We just need to replicate 'test' so that it lines up with the columns before doing the comparison. Indexing with col() repeats each element of 'test' down its own column:
dat[2:4] == test[col(dat[2:4])]
If we need the 'code'
dat$code[rowSums(dat[2:4] == test[col(dat[2:4])], na.rm = TRUE) == 3]
#[1] "C03"
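To see how indexing test by column position expands it to the shape of dat[2:4], it can help to look at the expanded vector as a matrix. The sketch below assumes the dat and test objects from the question; col() gives each cell its column number, so indexing test with it repeats test[1] down the first column, test[2] down the second, and so on:

```r
code <- c("A22", "B15", "C03")
dat <- data.frame(code, v.1 = 1:3, v.2 = 3:1, v.3 = c(2, NA, 2),
                  stringsAsFactors = FALSE)
test <- c(3, 1, 2)

# Reshape the expanded vector to the dimensions of dat[2:4] for inspection
expanded <- matrix(test[col(dat[2:4])], nrow = nrow(dat))
expanded            # every row is 3 1 2

# Cell-by-cell comparison; NA cells are ignored because of na.rm = TRUE
matches <- rowSums(as.matrix(dat[2:4]) == expanded, na.rm = TRUE)
dat$code[matches == ncol(expanded)]   # "C03"
```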

Related

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i], paste0 returns the whole column with a zero prepended, and assigning that vector to the single element data$A[i] triggers the warning.
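The length mismatch behind the warning can be checked directly. A small sketch using the A column from the question:

```r
A <- c("2058600192", "4087600101", "30138182591")

# Without [i], paste0() is applied to the whole column
length(paste0(0, A))      # 3 values, too many for the single slot A[i]

# With [i], exactly one value comes back
i <- 1
length(paste0(0, A[i]))   # 1
paste0(0, A[i])           # "02058600192"
```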
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you are interested in using dplyr, here is another solution using transmute.
library(dplyr)
df %>%
# Transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Restore the original column order
select(A, B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = "0"), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length() returns the length of each string.
str_pad() adds the leading zeros; see the stringr documentation for str_pad() for details.
grepl() detects the strings in column A that now start with a zero, so that a leading zero is added to column B for the same rows.
You may use the vectorized ifelse function here. Compute the condition on column A once, before A is modified, and apply it to both columns:
pad <- nchar(data$A) == 10
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
A B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

R: Compare a column of a dataframe with each elements of a list

Who can help me, please?
I have a dataframe (df) containing various data and I have a list (lst) containing unique values from the df.
For example:
I need to compare each element in df$Col1 with the first element of lst$Col1, then with the second element, the third etc.
Then the same procedure for the second and third columns.
If the comparison is True return 1, if not return 0.
You can do it with apply, which applies a function to each element of a matrix or data.frame when you set MARGIN = c(1,2).
df <- data.frame(col1 = c(3,2,9,7,7,1,9),
col2 = c(3,6,12,7,10,4,6),
col3 = c(7,3,10,2,2,2,10))
List <- list(col1 = c(1,2,3,7,9),
col2 = c(3,4,6,7,10,12),
col3 = c(2,3,7,10))
apply(df, MARGIN = c(1,2), function(x) x %in% List[[1]])
apply(df, MARGIN = c(1,2), function(x) x %in% List[[2]])
apply(df, MARGIN = c(1,2), function(x) x %in% List[[3]])
If you want to combine this, you can put it in an lapply call, which applies a function to each element of a list:
lapply(List, function(x) apply(df, MARGIN = c(1,2), function(y) y %in% x))
Instead of TRUE/FALSE you wanted 1/0 as the return value. Just apply as.numeric() to the logical result:
lapply(List, function(x) apply(df, MARGIN = c(1,2), function(y) as.numeric(y %in% x)))
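If the goal is instead one 0/1 indicator per list element, computed only for the column with the same name (one possible reading of the question, not something it states outright), outer() can build such a matrix for each column. A sketch under that assumption, reusing the df and List objects above; the name indicators is introduced here for illustration:

```r
df <- data.frame(col1 = c(3, 2, 9, 7, 7, 1, 9),
                 col2 = c(3, 6, 12, 7, 10, 4, 6),
                 col3 = c(7, 3, 10, 2, 2, 2, 10))
List <- list(col1 = c(1, 2, 3, 7, 9),
             col2 = c(3, 4, 6, 7, 10, 12),
             col3 = c(2, 3, 7, 10))

# Pair each column with its own set of values; outer() compares every
# element of the column with every value, giving a rows x values matrix
indicators <- Map(function(column, values) {
  m <- outer(column, values, "==") * 1   # multiplying by 1 turns TRUE/FALSE into 1/0
  colnames(m) <- values
  m
}, df, List)

indicators$col1   # 7 x 5 matrix; row 1 has its 1 in the column named "3"
```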

Create Value in final column of dataframe based on multiple columns

I have a dataframe that looks like this (but with a lot more variables/columns)
set.seed(5)
id<-seq(5)*floor(runif(5,min=1000, max=10000))
vals1<-c("Y","N","N","N","N")
vals2<-c("N","N","N","N","N")
vals3<-c("N","N","N","Y","N")
df<-data.frame(id,vals1,vals2,vals3)
I'd like to create a final column in the data frame that holds a flag with the following logic: if any of the value columns is 'Y' for an id, the final flag is 'Y'; otherwise it is 'N'. So, for this data frame the 1st and 4th ids (2801, 14236) have a 'Y' in the final column and the rest have an 'N'. I tried a few approaches like apply and if...else, to no avail.
Initialize by assigning "N" to every row. In the next step, for the rows containing "Y" (checked using apply), assign "Y".
df$final = "N"
df$final[apply(df, 1, function(a) "Y" %in% a)] = "Y"
A solution for your letter encoding below.
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# If you really want to use the letter encoding, my solution works as below
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x == 'Y')})
However, I think you should use a boolean (TRUE/FALSE) for this; it works well in combination with apply and any:
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# Convert your labels into booleans:
df[,2:4] <- df[,2:4] == 'Y'
# Then summarise across rows
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x)})
Somewhat similar to @d.b's answer:
df$final <- apply(df, 1, function(x) c("N","Y")[any(x == "Y")+1])
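The row-wise apply can also be avoided. As a sketch, comparing the value columns with "Y" gives a logical matrix, and rowSums() counts the matches per row. The ids below are placeholders and the name flag_cols is introduced only for this example:

```r
vals1 <- c("Y", "N", "N", "N", "N")
vals2 <- c("N", "N", "N", "N", "N")
vals3 <- c("N", "N", "N", "Y", "N")
df <- data.frame(id = 1:5, vals1, vals2, vals3)   # placeholder ids

flag_cols <- c("vals1", "vals2", "vals3")
# TRUE wherever a cell equals "Y"; a row sum above zero means at least one "Y"
has_y <- rowSums(df[, flag_cols] == "Y") > 0
df$final <- ifelse(has_y, "Y", "N")
df$final   # Y for rows 1 and 4, N otherwise
```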

Find a vector of strings in R

I have a vector of strings like:
vector=c("a","hb","cd")
I also have a matrix with one column, where each element is a list of strings separated by "|", like:
1 "ab|hb"
2 "ab|hbc|cd"
For each string in vector, I want to find the row of the matrix in which it appears as a complete element.
For the above vector, the result is:
NA, 1, 2
You can use strsplit for splitting strings:
x <- strsplit("ab|hbc|cd", split="|", fixed=T)
and then check if values of vector appear in the data, e.g.
sapply(vector, function(x) x %in% strsplit("a|ab|cd|efg|bh",
split="|", fixed=T)[[1]])
Note: strsplit returns a list, so in the example above I extract only its first element with [[1]]; you can handle the list differently if you prefer.
EDIT: in answer to your question about data stored as a vector:
data <- c("ab|cd|ef", "aaa|b", "ab", "wf", "fg|hb|a", "cd|cd|df")
sapply(sapply(data, function(x) strsplit(x, split="|", fixed=T)[[1]]),
function(y) sapply(vector, function(z) z %in% y))
Here's an approach using regular expressions:
# Example data
vector <- c("a","hb","cd")
mat <- matrix(c("ab|hb", "ab|hbc|cd"), nrow = 2)
sapply(paste0("\\b", vector, "\\b"), function(x)
if(length(tmp <- grep(x, mat[ , 1]))) tmp else NA,
USE.NAMES = FALSE)
# [1] NA 1 2
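The strsplit idea can also be applied to the matrix column directly to obtain the row numbers. A minimal sketch, assuming that only the first matching row is wanted when a string occurs in several rows; the names tokens and rows are introduced here for illustration:

```r
vector <- c("a", "hb", "cd")
mat <- matrix(c("ab|hb", "ab|hbc|cd"), nrow = 2)

# Split every row into its complete elements once
tokens <- strsplit(mat[, 1], split = "|", fixed = TRUE)

# For each string, find the rows whose elements contain it exactly
rows <- sapply(vector, function(v) {
  hit <- which(vapply(tokens, function(tok) v %in% tok, logical(1)))
  if (length(hit)) hit[1] else NA_integer_
}, USE.NAMES = FALSE)
rows   # NA 1 2
```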

Combining conditions

my recode attempts
df$test[(df$1st==(1:3) & df$2nd <= 4)] <- 1
df$test[(df$1st==(1:3) & df$2nd <= 5)] <- 2
df$test[(df$1st==(1:3) & df$2nd <= 6)] <- 3
result in a "longer object length is not a multiple of shorter object length" warning and a lot of NAs in df$test, even though some recodes work correctly.
What am I missing? Any help appreciated.
dw
Problem is in this line:
df$1st==(1:3)
You could use %in%
df$1st %in% (1:3)
The warning arises because you compare vectors of different lengths (1:3 has length 3, while the length of df$1st is known only to you).
Besides, I think you missed that your values are overwritten: every row with df$2nd <= 4 also satisfies df$2nd <= 6, so all the 1s and 2s are overwritten by 3.
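One way to avoid the overwriting, sketched here with made-up vectors first and second standing in for the two columns, is to assign the widest condition first and the narrowest last:

```r
# Hypothetical stand-ins for the two columns in the question
first  <- c(1, 2, 3, 4, 1)
second <- c(4, 5, 6, 4, 7)

test <- rep(NA_real_, length(first))
in_range <- first %in% 1:3

# Widest condition first, so narrower ones can refine it afterwards
test[in_range & second <= 6] <- 3
test[in_range & second <= 5] <- 2
test[in_range & second <= 4] <- 1
test   # 1 2 3 NA NA
```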
I am not sure what you're trying to achieve with df$1st==(1:3), but it probably doesn't do what you think it does. It recycles c(1,2,3) as many times as it needs to make it as long as df.
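A small illustration of the recycling, using a made-up vector whose length happens to be a multiple of 3 so that no warning is raised:

```r
x <- c(3, 1, 2, 1, 2, 3)   # every value lies between 1 and 3

# == compares position by position against 1, 2, 3, 1, 2, 3
x == 1:3      # FALSE FALSE FALSE  TRUE  TRUE  TRUE

# %in% asks, for each element, whether it occurs anywhere in 1:3
x %in% 1:3    # TRUE TRUE TRUE TRUE TRUE TRUE
```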
If you are trying to check if df$1st is between 1 and 3, you might want to spell it out:
df$1st>=1 & df$1st<=3
You may also want to consider using transform() to deal with recoding issues such as this. transform() will be slower than the logical indexing method, but it makes the intent of the code easier to digest. A good discussion of the pros and cons of the different methods can be found here. Consider:
set.seed(42)
df <- data.frame("first" = sample(1:5, 10e5, TRUE), "second" = sample(4:8, 10e5, TRUE))
df <- transform(df
, test = ifelse(first %in% 1:3 & second == 4, 1
, ifelse(first %in% 1:3 & second == 5, 2
, ifelse(first %in% 1:3 & second == 6, 3, NA)))
)
Secondly, the column names 1st and 2nd are not syntactically valid column names. Take a look at make.names() for more details on what constitutes valid column names. When working with a data.frame, you can use/abuse the check.names argument. For example:
> df <- data.frame("1st" = sample(1:5, 10e5, TRUE), "2nd" = sample(4:8, 10e5, TRUE), check.names = FALSE)
> colnames(df)
[1] "1st" "2nd"
> df <- data.frame("1st" = sample(1:5, 10e5, TRUE), "2nd" = sample(4:8, 10e5, TRUE), check.names = TRUE)
> colnames(df)
[1] "X1st" "X2nd"
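As a brief sketch of both points: make.names() shows how such names are repaired, and a non-syntactic name kept with check.names = FALSE has to be quoted with backticks when used with $. The small data frame here is made up for illustration:

```r
make.names(c("1st", "2nd"))   # "X1st" "X2nd"

df <- data.frame("1st" = c(1, 2, 5), "2nd" = c(4, 5, 6), check.names = FALSE)

# Backticks are required because 1st is not a syntactically valid name
df$`1st` %in% 1:3             # TRUE TRUE FALSE
df[["2nd"]] <= 4              # TRUE FALSE FALSE
```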

Resources