Combining conditions - r

my recode attempts
df$test[(df$1st==(1:3) & df$2nd <= 4)] <- 1
df$test[(df$1st==(1:3) & df$2nd <= 5)] <- 2
df$test[(df$1st==(1:3) & df$2nd <= 6)] <- 3
result in a "longer object length is not a multiple of shorter object length" warning and a lot of NAs in df$test, even though some recodes work correctly.
What am I missing? Any help appreciated.
dw

Problem is in this line:
df$1st==(1:3)
You could use %in%
df$1st %in% (1:3)
The warning arises because you are comparing vectors of different lengths (1:3 has length 3 and df$1st has length "only you know what").
Besides, I think you missed that your values are overwritten: every row with df$2nd <= 4 also satisfies df$2nd <= 6, so all the 1s and 2s are overwritten by 3.
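To see the difference between == and %in% on a small vector, here is a minimal sketch. The vector x is made up for illustration and only stands in for the asker's column:

```r
x <- c(1, 2, 3, 3, 2, 1, 5)   # hypothetical stand-in for the column

# == recycles 1:3 along x, so x[4] is compared with 1, x[5] with 2, and so on.
# Length 7 is not a multiple of 3, which is what triggers the warning.
eq <- suppressWarnings(x == 1:3)

# %in% asks, for each element of x, whether it occurs anywhere in 1:3.
inn <- x %in% 1:3

print(eq)
print(inn)
```

With %in% only the last element (5) is rejected, whereas == also rejects positions 4 and 6 even though their values lie in 1:3.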

I am not sure what you're trying to achieve with df$1st==(1:3), but it probably doesn't do what you think it does. It recycles c(1,2,3) as many times as needed to match the length of df$1st, then compares element by element.
If you are trying to check if df$1st is between 1 and 3, you might want to spell it out:
df$1st>=1 & df$1st<=3
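If the intent is to keep the overlapping <= conditions with logical indexing, one possible sketch is to assign the broadest condition first so that narrower ones overwrite it. The column names first and second and the data are assumptions chosen for illustration:

```r
# Hypothetical data; 'first' and 'second' stand in for the asker's columns
df <- data.frame(first = c(1, 2, 3, 4, 2), second = c(4, 5, 6, 4, 7))

in_range <- df$first >= 1 & df$first <= 3

df$test <- NA
# Broadest condition first, narrowest last, so later assignments refine earlier ones
df$test[in_range & df$second <= 6] <- 3
df$test[in_range & df$second <= 5] <- 2
df$test[in_range & df$second <= 4] <- 1

print(df)
```

Rows outside the 1 to 3 range, or with second above 6, stay NA.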

You may also want to consider using transform() for recoding problems such as this. transform() will be slower than the logical indexing method, but it makes the intent of the code easier to digest. A good discussion of the pros and cons of the different methods can be found here. Consider:
set.seed(42)
df <- data.frame("first" = sample(1:5, 10e5, TRUE), "second" = sample(4:8, 10e5, TRUE))
df <- transform(df
, test = ifelse(first %in% 1:3 & second == 4, 1
, ifelse(first %in% 1:3 & second == 5, 2
, ifelse(first %in% 1:3 & second == 6, 3, NA)))
)
Secondly, the column names 1st and 2nd are not syntactically valid column names. Take a look at make.names() for more details on what constitutes valid column names. When working with a data.frame, you can use/abuse the check.names argument. For example:
> df <- data.frame("1st" = sample(1:5, 10e5, TRUE), "2nd" = sample(4:8, 10e5, TRUE), check.names = FALSE)
> colnames(df)
[1] "1st" "2nd"
> df <- data.frame("1st" = sample(1:5, 10e5, TRUE), "2nd" = sample(4:8, 10e5, TRUE), check.names = TRUE)
> colnames(df)
[1] "X1st" "X2nd"
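If you do keep non-syntactic names, the columns can still be reached with backticks or [[ ]]. A minimal sketch on a tiny made-up data frame:

```r
# check.names = FALSE keeps the non-syntactic names as they are
df <- data.frame("1st" = 1:3, "2nd" = 4:6, check.names = FALSE)

first_col  <- df$`1st`      # backticks make the name usable after $
second_col <- df[["2nd"]]   # [[ ]] takes the name as an ordinary string

# make.names() shows what R would rename the columns to
fixed <- make.names(names(df))
print(fixed)
```

Writing df$1st without backticks, as in the question, is a syntax error rather than a lookup of the column.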

Related

Add zero padding to numbers in a column by using str_pad in string package in r [duplicate]

I have a data.frame X, with a column A filled with chr, most of them are of nchar = 5, but some are of nchar=4. I want to put a 0 in front of those.
I would do it with the following kind-of-pseudo-code :
foreach( element_of_X$A as a ){ # this line is pseudo-code for Idk how to do it in R
if(nchar(a) < 5){ # I think these lines are valid
paste0(0,a) # I think these lines are valid
}
}
(Obviously I come from PHP). How can I do that in clean R code ? (That, or a more efficient solution)
Thanks
Actually sprintf didn't work for me, so if you don't mind a common dependency:
#reproducible example -- this happens with zip codes sometimes
X <- data.frame(A = c('10002','8540','BIRD'), stringsAsFactors=FALSE)
# X$A <- sprintf('%05s',X$A) didn't work for me
# Note in ?sprintf: 0: For numbers, pad to the field width with leading zeros.
# For characters, this zero-pads on some platforms and is ignored on others.
library('stringr')
X$A <- str_pad(X$A, width=5, side='left', pad='0')
X
# A
#1 10002
#2 08540
#3 0BIRD
or, if you prefer a base solution, the following is equivalent:
X$A <- paste0(strrep("0", pmax(0, 5 - nchar(X$A))), X$A)
(note this works on strings of length 4 or less, not just 4)
If you use dplyr and stringr you could do the following
library(dplyr)
library(stringr)
## Assuming "element_of_X" has element 'A'
element_of_X <-
element_of_X %>%
mutate(A = str_pad(A, 5, side = 'left', pad = '0'))
Edit
Or perhaps more simply, as suggested in the comments:
element_of_X$A <- str_pad(element_of_X$A, 5, side = 'left', pad = '0')
This should do the trick:
X$A <- ifelse(nchar(X$A) < 5, paste("0", X$A, sep=""), X$A)
Try something like this (assuming data frame name and column name are right):
element_of_X$a <- with(element_of_X, ifelse(nchar(a) == 4, paste('0', a, sep = ''), a))
library(stringr)
X$A <- str_pad(X$A, 5, pad = "0")
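One thing to watch with the ifelse() plus paste() answers is that they add exactly one zero. A rough base R sketch of the difference, using a made-up vector that includes a 3-character value:

```r
# Hypothetical values; "123" is shorter than the 4 characters the question assumes
X <- data.frame(A = c("10002", "8540", "123"), stringsAsFactors = FALSE)

# Single-zero approach: only right for strings of exactly 4 characters
one_zero <- ifelse(nchar(X$A) < 5, paste0("0", X$A), X$A)

# General approach: repeat "0" once per missing character (strrep needs R >= 3.3.0)
n_missing <- pmax(0, 5 - nchar(X$A))
padded <- paste0(strrep("0", n_missing), X$A)

print(one_zero)
print(padded)
```

For "123" the single-zero version gives "0123", while the strrep() version gives "00123". If every short value really has 4 characters, both behave the same.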

Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"), stringsAsFactors = FALSE)
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i], paste0() returns the whole column, and R then tries to squeeze that vector into a single element, which triggers the warning.
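The warning can be reproduced outside the data frame. This is only an illustrative sketch with a made-up vector x:

```r
x <- c("a", "b", "c")

# paste0(0, x) has length 3, but x[1] has room for only one value
msg <- tryCatch({
  x[1] <- paste0(0, x)
  "no warning"
}, warning = function(w) conditionMessage(w))
print(msg)

# Indexing the right-hand side as well makes both sides length 1
x[1] <- paste0(0, x[1])
print(x)
```

The first assignment raises the same "not a multiple of replacement length" warning as in the question; the second one is the corrected form.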
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
data
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you were interested in using dplyr here's another solution using transmute.
library(dplyr)
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = "0"), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length() detects the length of each string.
str_pad() adds the leading zeros. More information about str_pad() is available in the stringr documentation.
grepl() detects the strings in column A that now start with a zero, so that a leading zero can be added to column B as well.
You may use the ifelse vectorized function here:
pad <- nchar(data$A) == 10  # compute the condition once, before A is modified
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
A B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and put them in a data frame in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans about a day, I was wondering if anyone could help me write better and more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("^r([5-9]|1[0-2])([^0-9]|$)"))
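A quick way to check both patterns without dplyr is to run them through grepl() on a vector of names. The extra name r51a is made up to show the failure case, and the leading ^ is an added assumption that the names must start with r:

```r
# The names from the question plus a hypothetical "r51a"
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51a")

short_pattern <- "r([5-9]|1[0-2])"
long_pattern  <- "^r([5-9]|1[0-2])([^0-9]|$)"

short_hits <- x[grepl(short_pattern, x)]
long_hits  <- x[grepl(long_pattern, x)]

print(short_hits)
print(long_hits)
```

The short pattern also picks up r51a because it only looks at the first digit; the longer one requires the number to end right after 5-9 or 10-12.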
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments: the first is the vector of variable names, the second and the third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
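The call above sets only the lower limit, so r13g and r15i are still returned. To get r5 through r12 as asked, both limits can be passed. A self-contained sketch, assuming every name contains exactly one run of digits:

```r
# Same idea as the function above, repeated here so the example runs on its own
var_names <- function(x, from = 1, to = Inf) {
  y <- unlist(strsplit(x, "[[:alpha:]]"))  # split on letters, leaving the digits
  y <- as.integer(y[y != ""])              # drop the empty pieces
  x[from <= y & y <= to]
}

x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i")
kept <- var_names(x, from = 5, to = 12)
print(kept)
```

The result can then be used to subset the data frame, for example data[kept].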
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6

Compare vector to subset of row in a data.frame

I have a data.frame "dat" and a numeric vector "test":
code <- c("A22", "B15", "C03")
v.1 <- 1:3
v.2 <- 3:1
v.3 <- c(2, NA, 2)
bob <- c("yes", "no", "no")
dat <- data.frame(code, v.1, v.2, v.3, bob, stringsAsFactors = FALSE)
test <- c(3, 1, 2)
I want to find the row in the data.frame where the second to fourth columns ("v.1", "v.2", "v.3") contain the same values as the vector, in the same order, and return the value from the "code"-column (in this case "C03").
I tried
dat[dat[, 2:4] == test]$code
and
which(apply(dat, 1, function(x) all.equal(dat[, 2:4], test)) == FALSE)
both of which do not work.
I would prefer a solution with base R.
Your second option (with which) does not work for several reasons: calling apply on the whole dat converts it to a character matrix; you never actually use x, the function argument; you should use all instead of all.equal; and you probably want TRUE instead of FALSE (the comparison is not actually needed).
You can modify it a bit to make it work:
which(apply(dat[, 2:4], 1, function(x) all(x==test)))
[1] 3
Or
dat[apply(dat[, 2:4], 1, function(x) all(x==test)), "code"]
[1] C03
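One caveat with all(x==test) is the NA in v.3: if the non-missing values of a row all match, all() returns NA rather than FALSE, and indexing with NA yields an NA row. A sketch of a guard using isTRUE(), with a made-up test2 vector chosen to trigger that case:

```r
dat <- data.frame(code = c("A22", "B15", "C03"),
                  v.1 = 1:3, v.2 = 3:1, v.3 = c(2, NA, 2),
                  stringsAsFactors = FALSE)
test2 <- c(2, 2, 2)   # hypothetical vector matching row 2 except for the NA

# Row 2 gives all(c(TRUE, TRUE, NA)), which is NA
hits <- apply(dat[, 2:4], 1, function(x) all(x == test2))
print(dat$code[hits])

# isTRUE() turns NA into FALSE, so only complete matches are kept
safe <- apply(dat[, 2:4], 1, function(x) isTRUE(all(x == test2)))
print(dat$code[safe])
```

Wrapping the logical vector in which() has the same effect, since which() ignores NA.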
With apply we can paste the values of columns 2 to 4 together row by row, compare each resulting string with test pasted together the same way, and select the code column for the matching row.
dat[apply(dat[2:4], 1, paste0, collapse = "|") ==
paste0(test, collapse = "|"), "code"]
#[1] C03
We just need to replicate 'test' so that each element lines up with its column before doing the comparison
dat[2:4] == test[col(dat[2:4])]
If we need the 'code'
dat$code[rowSums(dat[2:4] == test[col(dat[2:4])], na.rm = TRUE)==3]
#[1] C03
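Another way to line the lengths up, sketched here as an alternative and not taken from the answers above, is to transpose the columns so that ordinary recycling runs along each original row:

```r
dat <- data.frame(code = c("A22", "B15", "C03"),
                  v.1 = 1:3, v.2 = 3:1, v.3 = c(2, NA, 2),
                  stringsAsFactors = FALSE)
test <- c(3, 1, 2)

m <- as.matrix(dat[2:4])

# After t(), each column holds one original row, so test is recycled row by row
matches <- colSums(t(m) == test, na.rm = TRUE) == length(test)
found <- dat$code[matches]
print(found)
```

Comparing against length(test) instead of a hard-coded 3 keeps the check valid if more columns are compared.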

How to replace existing values by new values from look-up list without causing NA?

I have a data frame. One column contains the following values:
df$current_column=(A,B,C,D,E)
A vector contains a look up value:
v <- c(A=X, B=Y)
I want to replace this column to come up with a list of (X, Y, C,D,E)
I am thinking to create a new column like
df$new_column <- v[df$current_column]
It does the replacement of A and B but it also makes C,D,E as NA (X,Y, NA, NA, NA).
How to keep C,D and E or is there any other way?
Looks like ifelse() could help:
df$current_column <- ifelse(df$current_column == "A", "X",
ifelse(df$current_column == "B", "Y", df$current_column))
We can create a logical index with %in% and then do the conversion
i1 <- df$current_column %in% names(v)
df$new_column <- df$current_column
df$new_column[i1] <- v[df$new_column[i1]]
df$new_column
#[1] "X" "Y" "C" "D" "E"
Or use a single ifelse
with(df, ifelse(current_column %in% names(v),
v[current_column], current_column))
Update
If the 'current_column' is factor class, convert to character class and it should work.
df$new_column <- as.character(df$current_column)
df$new_column[i1] <- v[df$new_column[i1]]
data
df <- data.frame(current_column = LETTERS[1:5],
stringsAsFactors=FALSE)
v <- setNames(c('X', 'Y'), LETTERS[1:2])
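The NAs in the question come from indexing v by names it does not contain. A compact variant of the same idea, sketched with match() so that the missing names are detected explicitly:

```r
df <- data.frame(current_column = LETTERS[1:5], stringsAsFactors = FALSE)
v <- c(A = "X", B = "Y")

# match() returns the position in names(v), or NA when there is no entry
idx <- match(df$current_column, names(v))

# Keep the original value wherever the look-up has no entry
df$new_column <- ifelse(is.na(idx), df$current_column, v[idx])
print(df$new_column)
```

This assumes current_column is character; for a factor, convert with as.character() first, as noted in the update above.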
user2029709,
-- was working off of your little example; for a more generic approach it would be nice to see a snippet of the real data or close simulation. In any case, here is something that may work for you better, without coding manually all ifelse() options, and is still a relatively straightforward solution:
vd <- data.frame(current_column = names(v), new_column = v, stringsAsFactors = FALSE)
df <- merge(df, vd, by = 'current_column', all.x = TRUE)
df$new_column <- ifelse(is.na(df$new_column), df$current_column, df$new_column)
You may have to modify data types when creating vd data.frame to assure proper merge.
Best,
oleg
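One thing to keep in mind with the merge() approach is that merge() may reorder the rows (by default it sorts on the key column). A sketch of one way to restore the original order; the helper column row_id and the unsorted sample data are assumptions added for illustration:

```r
df <- data.frame(current_column = c("C", "A", "E", "B"), stringsAsFactors = FALSE)
v <- c(A = "X", B = "Y")

vd <- data.frame(current_column = names(v), new_column = unname(v),
                 stringsAsFactors = FALSE)

df$row_id <- seq_len(nrow(df))                      # remember the original order
merged <- merge(df, vd, by = "current_column", all.x = TRUE)

# Fall back to the original value where the look-up had no match
merged$new_column <- ifelse(is.na(merged$new_column),
                            merged$current_column, merged$new_column)

merged <- merged[order(merged$row_id), ]            # restore the original order
print(merged)
```

If row order does not matter for your data, the row_id step can be dropped.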
