Issues with replacing all words in a string except one - r

I have a simple problem:
I have a column with thousands of values and I'm trying to convert it into a dichotomous variable (Yes|No). Replacing strings with 'No' was easy enough as the value I was converting was a single asterisk
Data$Complete <- gsub("\\*", "No", Data$Complete)
But when I attempt to replace everything apart from 'No', the following code replaces everything with 'Yes' in my string. I don't understand why, as I'm specifying to replace everything apart from "No":
Data$Complete <- Data[!Data$Complete %in% c("No"), "Complete"] <- "Yes"
Any pointers would be appreciated.

You can use a combination of the ifelse function and grepl to create the Yes/No variable, as below:
library(stringi)
# data simulation
set.seed(123)
n <- 1000
data <- data.frame(
complete = stri_rand_strings(n = n, length = 20, pattern = "[A-Za-z0-9\\*]")
)
# string matching
data$yes_no <- ifelse(grepl("\\*", data$complete), "No", "Yes")
head(data)
Output:
complete yes_no
1 HmOsw1WtXRxRfZ5tE1Jx Yes
2 tgdzehXaH8xtgn0TkCJD Yes
3 7PPM87DSFr1Qn6YC7ktM Yes
4 e4NGoRoonQkch*SCMbL6 No
5 EfPm5QztsA7eKeJAm4SV Yes
6 aJTxTtubO8vH2wi7XxZO Yes
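As for why the original one-liner turned everything into "Yes": assignments evaluate right to left, and the value of the inner assignment is just "Yes", so the outer Data$Complete <- ... ends up recycling "Yes" over the entire column. A minimal base-R sketch of the two-step version, using the Data$Complete column from the question:
# convert the asterisks first, then flag every remaining value as "Yes"
Data$Complete <- gsub("\\*", "No", Data$Complete)
Data$Complete[Data$Complete != "No"] <- "Yes"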

Related

Standardize group names using a vector of possible matches

I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c(depression_male, depression_female, depression_hsgrad, depression_collgrad))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.
Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_something", this will work well. Or maybe the standard name always comes after an underscore, so you could use pattern = ".*_" to strip everything up to and including the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
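To illustrate the two points above on the same df: anchoring the pattern avoids "male" also matching "female", and the greedy ".*_" strips everything up to the last underscore:
grepl("_male$", df$b)
# [1]  TRUE FALSE FALSE FALSE
sub(".*_", "", df$b)
# [1] "male"     "female"   "hsgrad"   "collgrad"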

mutate_at() to rename multiple variables without altering original columns

I would like to mutate several variables at once using mutate_at(). This is how I've been doing it up until now, but since I'm dealing with a long list of variables to recode/rename, I want to know how I can do this using mutate_at(). I want to maintain the original columns, which is why I'm not using rename() but mutate() instead. This is what I normally do:
df <- df %>%
mutate(q_50_a = as.numeric(`question_50_part_a: very long very long very long very long` == "yes"),
q_50_b = as.numeric(`question_50_part_b: very long very long very long very long` == "yes"),
q_50_c = as.numeric(`question_50_part_c: very long very long very long very long` == "yes"))
This is what I have so far:
df <- df %>% mutate_at(vars(starts_with("question_50")), funs(q_50 = as.numeric(. == "yes")))
It works and creates a new numeric variable, but I'm not sure how to get it to rename the new variables like this: q_50_a, q_50_b, q_50_c, etc.
Thank you.
edit: this is what the data looks like (except there are many many more columns which all look alike)
question_50_part_a: a very long title question_50_part_b: a very long title
yes yes
yes no
yes no
yes yes
no no
yes yes
but would like this:
q_50_a q_50_b
1 1
1 0
1 0
1 1
0 0
1 1
but I want to keep the original columns as they are and simply mutate these new columns with the shorter name and numeric binary coding.
We can use rename_at to rename the new columns.
library(dplyr)
df %>%
mutate_at(vars(starts_with('question_50')),
list(new = ~as.numeric(. == 'yes'))) %>%
rename_at(vars(ends_with('new')),
~sub('\\w+(_\\d+)_part(\\w+):.*', 'q\\1\\2', .))
# question_50_part_a: a very long title question_50_part_b: a very long title
#1 yes yes
#2 yes no
#3 yes no
#4 yes yes
#5 no no
#6 yes yes
# q_50_a q_50_b
#1 1 1
#2 1 0
#3 1 0
#4 1 1
#5 0 0
#6 1 1
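To see what that rename pattern does, you can apply it to one of the intermediate "_new" names produced by the mutate_at step:
sub('\\w+(_\\d+)_part(\\w+):.*', 'q\\1\\2', 'question_50_part_a: a very long title_new')
# [1] "q_50_a"
Here \\1 captures "_50" and \\2 captures "_a", so the result is "q_50_a".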
Here is an approach that loops over each column:
column_names = colnames(df)
# optional filter out column names you don't want to change here
for(col in column_names){
# construct replacement name
col_replace = paste0("q_", substr(col, 10, 11), "_", substr(col, 18, 18))
# assign and drop old column
df = df %>%
mutate(!!sym(col_replace) := ifelse(!!sym(col) == "yes", 1, 0)) %>%
select(-!!sym(col))
}
Points to note:
If you have other columns you don't want changed, be sure to exclude them
The !!sym(col) construction takes the text string stored in col and turns it into a column name.
We use := rather than = because the LHS requires some evaluation before assignment can happen.
I have used ifelse instead of as.numeric but you can code the RHS of the equals sign as you please.
Creating col_replace makes some assumptions about the format of your input names. If everything is the same length this should work. If the number of characters differs (e.g. Q_9_a and Q_10_a) then you may want to use a method based on strsplit instead (see the sketch after these notes).
The - sign in select makes it exclude the specified column
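A rough sketch of the strsplit idea from the notes, assuming the names follow the "question_<number>_part_<letter>: ..." pattern shown in the question:
col <- "question_50_part_a: a very long title"
# drop everything from the colon onwards, then split on underscores
parts <- strsplit(sub(":.*", "", col), "_")[[1]]   # "question" "50" "part" "a"
col_replace <- paste0("q_", parts[2], "_", parts[4])
col_replace
# [1] "q_50_a"
This keeps working when the question number has a different number of digits.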

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  # filter_all() needs the predicate wrapped in all_vars() or any_vars()
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length; your inequality doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns inside a loop. Say you remove the 5th column: on the next loop iteration (when i = 6) you will be handling what was originally the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
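If you do want to stay in base R, one way around the shifting indices is to work out which rows and columns to keep first and then subset in a single step, instead of modifying x inside the loop. A rough sketch, assuming x is the data frame passed to your function:
# flag rows where any cell contains ".DERIVED" (apply coerces each row to character)
row_has_derived <- apply(x, 1, function(r) any(grepl("\\.DERIVED", r)))
x <- x[!row_has_derived, , drop = FALSE]
# then keep only the numeric columns
x <- x[, sapply(x, is.numeric), drop = FALSE]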
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Note that you should use the escaped regex "\\.DERIVED" rather than ".DERIVED", which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect columns whose values are numbers stored as character strings.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)){
WaFramesCosts[[i]][ , -which(names(WaFramesCosts[[i]]) %in% c("Cost_Center","Domestic_Anytime_Min_Used","Department",
"Domestic_Anytime_Min_Used"))]
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
t <- WaFramesCosts[[i]][ , grepl( "Domestic" , names( WaFramesCosts[[i]] ) )]
q <- subset(WaFramesCosts[[i]], select = c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used"))
WaFramesCosts[[i]] <- merge(q,t)
}
while attempting the same goal with a different approach and seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,c("Cost_Center","Domestic_Anytime_Min_Used","Department","Domestic_Anytime_Min_Used")]
})
The undefined columns selected error tells me that your assumptions of the datasets are not correct: at least one is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want columns that match, not assuming that it is in everything. From that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
WaFramesCosts[[i]][,intersect(c("Cost_Center","Domestic_Anytime_Min_Used","Department"),
colnames(WaFramesCosts[[i]])),drop=FALSE]
})
Note, in that last example, the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with one column). As a matter of defensive programming, if you think it at all possible that your filtering will return a single column, append ,drop=FALSE to your bracket subsetting so that you do not inadvertently convert to a vector.
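For example, with the built-in mtcars data:
str(mtcars[, 2])                # num [1:32] 6 6 4 6 8 ...  (a plain vector)
str(mtcars[, 2, drop = FALSE])  # 'data.frame': 32 obs. of 1 variable  (still a data.frame)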
Based on your description, here is an example of using the dplyr library to combine a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns. (Providing your data in a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202

Using grep inside a which statement (nested grep) in R

I'm trying to use grep nested inside a which statement. For instance, if I am looking for all lung data in a data frame, it can be done simply with a which statement.
site <- c("lung", "breast", "colon","lung", "brain")
vals <- c(1:5)
df <- data.frame(site,vals)
> df[which(df$site=="lung"),]
site vals
1 lung 1
4 lung 4
but if I want to get the same results with a nested grep statement for "lung" , I'm not getting the second result. Any ideas?
> df[which (grep("lung",df$site)==TRUE),]
site vals
1 lung 1
And if I wanted to extend this a bit and assign a column, say 'lung_flag' where it would put something like a 'Y' next to the matches in rows 1 and 4, how would this best be done?
Like this:
df[grep("lung", df$site), ]
or
df[grepl("lung", df$site), ]
grep returns a vector of indices with a match: c(1, 4),
grepl returns a vector of logical: c(TRUE, FALSE, FALSE, TRUE, FALSE).
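The follow-up about a 'lung_flag' column isn't covered above; a minimal sketch using grepl on the df from the question (ifelse maps the TRUE/FALSE matches to "Y"/"N"):
df$lung_flag <- ifelse(grepl("lung", df$site), "Y", "N")
df
#     site vals lung_flag
# 1   lung    1         Y
# 2 breast    2         N
# 3  colon    3         N
# 4   lung    4         Y
# 5  brain    5         N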
