Best option to search for characters in a column using R - r

I have a data frame with a column of ICD-10 codes. Each code consists of four alphanumeric characters. I want to search for specific codes based on just the first two characters. For example, if this is the column:
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
and I want to flag every row whose code starts with S0, S1, T0, T9, or T1, assigning 1 if it does and 0 if not. I have previously used %like% with case_when, but I would like to know whether there is a more efficient way to do this in R. Thanks.

Use grepl() to test each string against a regular expression; it returns TRUE for any string that starts with s0, s1, T0, T1, or T9 and FALSE otherwise. Then use ifelse() to turn that logical vector into 1 for TRUE and 0 for FALSE.
codes <- c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
ifelse(grepl("^s[01]|^T[019]", codes), 1, 0)
Output:
[1] 1 1 0 1 1 0 0 0
Can also do:
as.numeric(grepl("^s[01]|^T[019]", codes))
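
The pattern above is case-sensitive, which is why "t985" and "t17.5" get 0. If lowercase and uppercase prefixes should be treated alike (an assumption; the question does not say so), a minimal sketch with grepl's ignore.case argument could look like this. The name flag_ci is only illustrative.

```r
codes <- c("s001", "s1234", "s4g6", "T002", "T191", "t985", "s761", "t17.5")

# ignore.case = TRUE makes "s"/"S" and "t"/"T" equivalent in the pattern
flag_ci <- as.integer(grepl("^(s[01]|t[019])", codes, ignore.case = TRUE))
flag_ci
#> [1] 1 1 0 1 1 1 0 1
```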

We can also coerce the logical vector to integer with unary plus:
+(grepl("^s[01]|^T[019]", codes))
[1] 1 1 0 1 1 0 0 0

We could define the pattern to detect, then use str_detect and map TRUE to 1 and FALSE to 0:
library(dplyr)
library(stringr)
# your dataframe with codes column
df <- data.frame(codes = c("s001", "s1234", "s4g6",
                           "T002", "T191", "t985",
                           "s761", "t17.5"))
# define what you want to search for
search_pattern <- "S0|S1|T0|T9|T1"
# check with `str_detect`; inside mutate() the column is referenced directly
df %>%
  mutate(check = ifelse(str_detect(codes, search_pattern), 1, 0))
Output:
codes check
1 s001 0
2 s1234 0
3 s4g6 0
4 T002 1
5 T191 1
6 t985 0
7 s761 0
8 t17.5 0
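
Only the two uppercase codes are flagged here because the pattern is case-sensitive. It is also unanchored, so it would match a prefix that appears in the middle of a string. The sketch below illustrates both points with base R's grepl (for patterns this simple, str_detect should behave the same way); the test strings are made up for illustration.

```r
# Hypothetical strings: the second contains "S0" but does not start with it
x <- c("S001", "XS01", "s001")

unanchored <- grepl("S0|S1|T0|T9|T1", x)
anchored   <- grepl("^(S0|S1|T0|T9|T1)", x)

unanchored
#> [1]  TRUE  TRUE FALSE
anchored
#> [1]  TRUE FALSE FALSE
```

Anchoring with ^ restricts the match to the start of the string, which is what "the first two characters" in the question asks for.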

Another option with grepl:
+grepl("^([sT][01]|T9)", codes)
[1] 1 1 0 1 1 0 0 0

You can also use the substring approach. Extract only first 2 characters from the codes using substr and compare it against the correct_values.
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
correct_values <- c("s0", "s1", "T0", "T9", "T1")
as.integer(substr(codes, 1, 2) %in% correct_values)
#[1] 1 1 0 1 1 0 0 0
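
The substr comparison assumes every prefix has exactly two characters. If the prefixes had mixed lengths, one possible sketch uses base R's startsWith once per prefix and combines the results. The prefix list below, including the three-character "T19", is a hypothetical example and not part of the question.

```r
codes <- c("s001", "s1234", "s4g6", "T002", "T191", "t985", "s761", "t17.5")
# Hypothetical prefix list with mixed lengths ("T19" has three characters)
prefixes <- c("s0", "s1", "T0", "T19")

# One logical vector per prefix, combined with element-wise OR
hits <- lapply(prefixes, function(p) startsWith(codes, p))
flag <- as.integer(Reduce(`|`, hits))
flag
#> [1] 1 1 0 1 1 0 0 0
```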

Related

If any value is present in a column of Dataframe, change the value to 1 else insert 0

I have a data frame with about 1000 rows and 1000 columns. I want to replace every nonzero value with 1 and leave every zero as 0. I am programming in R, so R code would be appreciated. The values in the T column should not change; only the rest of the columns should.
For example
I have a dataframe like this :
T A B C D
1 29 90 0 100
2 30 12 76 0
3 0 12 0 32
convert it to :
T A B C D
1 1 1 0 1
2 1 1 1 0
3 0 1 0 1
To ignore the first column, you could combine it with a simple modification of akrun's first solution. For example,
data.frame(df[, 1, drop=FALSE], +(df[,-1] != 0))
We can convert to a logical matrix and coerce it to integer
df1 <- +(df != 0)
Or with replace
replace(df, df != 0, 1)
If we need to do this without taking the first column
df[-1] <- +(df[-1] != 0)
Or with sapply
+(sapply(df, `!=`, 0))
In the tidyverse, we can use mutate_all. Note that it converts every column, including T:
library(dplyr)
df <- df %>%
  mutate_all(~ as.integer(. != 0))
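
As an end-to-end check, the sketch below rebuilds the sample data from the question (assuming the first number in each row is the T column) and converts every column except T. The variable names keep and cols are illustrative only.

```r
df <- data.frame(T = 1:3,
                 A = c(29, 30, 0),
                 B = c(90, 12, 12),
                 C = c(0, 76, 0),
                 D = c(100, 0, 32))

# Convert every column except T to a 0/1 indicator of "nonzero"
keep <- "T"
cols <- setdiff(names(df), keep)
df[cols] <- lapply(df[cols], function(x) as.integer(x != 0))
df
#>   T A B C D
#> 1 1 1 1 0 1
#> 2 2 1 1 1 0
#> 3 3 0 1 0 1
```

Selecting the columns by name rather than by position keeps the code working if T is not the first column.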

Placeholders in R

I have a dataset where different values can only be classified by the occurrence of the digit 1. All values consist of 5 digits. Now I need to create a new variable that groups the values. My question is whether there is a way, similar to Excel, to use placeholders in order to identify those values that start with 1.
What I have done so far is:
w$r <- ifelse(w$f == 1****, 1, 0)
Here, I wanted to filter out all values where 1 is the first digit.
Note that some values have a recurring 1, i.e. a 1 in two or more positions.
All values either contain a 1 or consist entirely of zeros.
Examples of the data are 00000, 00001, 11100, etc. The goal is to create one variable for each position at which a 1 can occur. For example, a 1 as the first digit should set variable 1, and a value where 1 occurs as both the first and third digit should be counted in both variable 1 and variable 3.
EDIT:
Not quite sure whether that's what you want, but here's a try:
Data:
Since you also seem to have data with leading zeros you need to convert them to character:
df <- data.frame(w = c("00000", "00001", "11100", "10010", "11000", "10000", "10100", "00100", "10001"))
Solution:
# variable for "1" in first position:
df$r1 <- ifelse(grepl("^1", df$w), 1, 0)
# variable for "1" in second position:
df$r2 <- ifelse(grepl("^\\d1", df$w), 1, 0)
# variable for "1" in third position:
df$r3 <- ifelse(grepl("^\\d{2}1", df$w), 1, 0)
# variable for "1" in fourth position:
df$r4 <- ifelse(grepl("^\\d{3}1", df$w), 1, 0)
# variable for "1" in fifth position:
df$r5 <- ifelse(grepl("^\\d{4}1", df$w), 1, 0)
Result:
df
w r1 r2 r3 r4 r5
1 00000 0 0 0 0 0
2 00001 0 0 0 0 1
3 11100 1 1 1 0 0
4 10010 1 0 0 1 0
5 11000 1 1 0 0 0
6 10000 1 0 0 0 0
7 10100 1 0 1 0 0
8 00100 0 0 1 0 0
9 10001 1 0 0 0 1
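
The five near-identical grepl lines can also be generated in a loop. This is a minimal sketch that assumes every value has exactly five characters and uses substr to test one position at a time; it is an alternative to the regular expressions above, not the answer's own method.

```r
df <- data.frame(w = c("00000", "00001", "11100", "10010", "11000",
                       "10000", "10100", "00100", "10001"),
                 stringsAsFactors = FALSE)

# One indicator column per position: r1 ... r5
for (i in 1:5) {
  df[[paste0("r", i)]] <- as.integer(substr(df$w, i, i) == "1")
}
df[df$w == "10010", ]
#>       w r1 r2 r3 r4 r5
#> 4 10010  1  0  0  1  0
```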

Find unique values in a character vector separated by commas and then one-hot encoding

Basically I have a vector of strings separated by commas. I'm looking to one-hot encode using the unique values of the strings. I believe I have to first find the unique values (separated by commas) to use as the columns before one-hot encoding, but I'm not sure. For instance, say I have the following character vector:
people_names
Bob,Megan,Mike,Sarah
Mike,Sarah
Megan,Sarah
Bob
I'm looking to create a resulting one-hot encoded data frame that corresponds to this vector like this:
Bob Megan Mike Sarah
1 1 1 1
0 0 1 1
0 1 0 1
1 0 0 0
Thank you for any help. I really appreciate it.
people_names = c("Bob,Megan,Mike,Sarah",
"Mike,Sarah",
"Megan,Sarah",
"Bob")
library(tidyverse)
data.frame(people_names) %>%                   # create a dataframe
  mutate(id = row_number(),                    # add row id (useful for reshaping)
         value = 1) %>%                        # add a column of 1s to denote existence
  separate_rows(people_names) %>%              # create one row per name keeping relevant info
  spread(people_names, value, fill = 0) %>%    # reshape
  select(-id)                                  # remove row id
# Bob Megan Mike Sarah
# 1 1 1 1 1
# 2 0 0 1 1
# 3 0 1 0 1
# 4 1 0 0 0
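
The same idea (find the unique names, then mark which ones occur in each string) can be sketched in base R without any packages. The variable names parts, all_names and onehot are illustrative, and the result is an integer matrix rather than a data frame.

```r
people_names <- c("Bob,Megan,Mike,Sarah", "Mike,Sarah", "Megan,Sarah", "Bob")

parts     <- strsplit(people_names, ",", fixed = TRUE)  # list of name vectors
all_names <- sort(unique(unlist(parts)))                # future column names

# One 0/1 row per input string, stacked into a matrix
onehot <- do.call(rbind, lapply(parts, function(p) as.integer(all_names %in% p)))
colnames(onehot) <- all_names
onehot
#>      Bob Megan Mike Sarah
#> [1,]   1     1    1     1
#> [2,]   0     0    1     1
#> [3,]   0     1    0     1
#> [4,]   1     0    0     0
```

Wrapping the result in as.data.frame would give a data frame if one is preferred.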
As an alternative, there is a helper function in the splitstackshape package that you might find useful. The output is a matrix
splitstackshape:::charMat(strsplit(people_names, ","), fill = 0L)
# Bob Megan Mike Sarah
#[1,] 1 1 1 1
#[2,] 0 0 1 1
#[3,] 0 1 0 1
#[4,] 1 0 0 0
From the same package you might also try cSplit_e
library(splitstackshape)
out <- cSplit_e(
data.frame(people_names),
split.col = "people_names",
sep = ",",
mode = "binary",
type = "character",
fill = 0L,
drop = TRUE
)
# remove prefix of column names
(out <- setNames(out, sub("people_names_", "", names(out), fixed = TRUE)))
data
people_names = c("Bob,Megan,Mike,Sarah",
"Mike,Sarah",
"Megan,Sarah",
"Bob")

How to add an aggregated variable to an existing dataset in R

How do you add a variable to a dataset using the aggregate and by commands? For example, I have:
num x1
1 1
1 0
2 0
2 0
And I'm looking to create a variable that flags every num group in which any x1 is 1, for example:
num x1 x2
1 1 1
1 0 1
2 0 0
2 0 0
or
num x1 x2
1 1 TRUE
1 0 TRUE
2 0 FALSE
2 0 FALSE
I've tried to use
df$x2 <- aggregate(df$x1, by = list(df$num), FUN = sum)
But I'm getting an error that says the replacement has a different number of rows than the data. Can anyone help?
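This can be done by grouping by 'num' and checking whether any element of 'x1' equals 1. The ave function from base R is more convenient for this than aggregate, because it returns one value per row rather than one per group. Wrapping the result in as.integer converts the logical vector to 0/1:
df1$x2 <- as.integer(with(df1, ave(x1 == 1, num, FUN = any)))
df1$x2
#[1] 1 1 0 0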
Or using dplyr, we group by 'num' and create 'x2' by checking whether any 'x1' equals 1. Without the as.integer wrapper, the result would be a logical vector rather than 0/1:
library(dplyr)
df1 %>%
  group_by(num) %>%
  mutate(x2 = as.integer(any(x1 == 1)))
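
The error in the question arises because aggregate returns one row per group, so its result cannot be assigned as a column of the original data. If aggregate is preferred anyway, one possible sketch merges the per-group result back onto the rows. The sample data is rebuilt from the question; agg and merged are illustrative names.

```r
df1 <- data.frame(num = c(1, 1, 2, 2), x1 = c(1, 0, 0, 0))

# aggregate() returns one row per group (2 rows here, not 4)
agg <- aggregate(x1 ~ num, data = df1,
                 FUN = function(v) as.integer(any(v == 1)))
names(agg)[names(agg) == "x1"] <- "x2"
nrow(agg)
#> [1] 2

# Join the per-group result back onto the original rows
merged <- merge(df1, agg, by = "num")
merged$x2
#> [1] 1 1 0 0
```

Note that merge sorts the result by the key column, so the row order may differ from the input when num is not already sorted; ave avoids that.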

R Sort one column ascending, all others descending (based on column order)

I have an ordered table, similar to as follows:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by SIZE (Ascending), then by each other column in priority sequence (Descending) - i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except SIZE".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
With the names use order like this. No packages are used.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column use k <- ncol(df) instead or if it is not necessarily last but always called Size then use k <- match("Size", names(df)) instead.
Note: Negation only works on numeric columns. It is not needed for the example in the question, but a more general solution replaces the first line above with the following, where xtfrm is an R function that converts an object to a numeric vector that sorts in the same order as the original.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For a larger number of columns, arrange_ can be used (note that the underscore verbs such as arrange_ are deprecated in current dplyr)
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))
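
For the case where the other column names are not known in advance, the base R pieces above can be combined into a small helper. This is a sketch under the assumption that the ascending column is identified by name; sort_size_then_desc is a hypothetical function name, not part of any package.

```r
df <- read.table(text = "A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1", header = TRUE)

# Hypothetical helper: ascending by size_col, descending by all other columns
sort_size_then_desc <- function(d, size_col = "Size") {
  others <- setdiff(names(d), size_col)
  keys <- c(list(d[[size_col]]),
            lapply(d[others], function(col) -xtfrm(col)))
  # unname() so column names are not mistaken for arguments of order()
  d[do.call(order, unname(keys)), , drop = FALSE]
}

sorted <- sort_size_then_desc(df)
rownames(sorted)
#> [1] "1" "5" "3" "4" "2"
```

The row order matches the result shown for the explicit order(Size, -A, -B, -C) call.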
