I have a grouped set of values in a column that I am trying to replace with a single value
col1
a
a;a;b;c
c;b;a
NA
b;b;b
I want to replace each entry with either Mixed or the single value present; for example, a;a;a;a becomes a
Expected Output
col1
a
Mixed
Mixed
NA
b
Code
grouping = function(x){
y = as.list(strsplit(x, ";")[[1]])
#select first element, and test if each is the same element.
z = ""
for (i in 1:length(y)){
if (as.character(y[1]) != as.character(y[i])) {
z = 'mixed'
break
} else {
z = as.character(y[1])
}
}
return(z)
}
db %>%
select(col1) %>%
mutate(
test = grouping(col1)
)
I have tried it a few different ways and either end up with it not working at all or giving the value a for everything
A base R option via defining a user function f
f <- function(x) ifelse(length(u <- unique(unlist((strsplit(x, ";"))))) > 1, "Mixed", u)
such that
> transform(df, col1 = Vectorize(f)(col1))
col1
1 a
2 Mixed
3 Mixed
4 <NA>
5 b
You can also consider this for your function and use base R:
#Function
myfun <- function(x)
{
# split on ";" and keep the distinct values
y <- unique(unlist(strsplit(x, ";")))
# a single distinct value is returned as is; anything else becomes 'Mixed'
if(length(y)==1) y else 'Mixed'
}
#Apply
df$New <- apply(df,1,myfun)
Output:
df
col1 New
1 a a
2 a;a;b;c Mixed
3 c;b;a Mixed
4 <NA> <NA>
5 b;b;b b
Some data used:
#Data
df <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")), class = "data.frame", row.names = c(NA,
-5L))
We can extract the letters from 'col1' as substrings, count the distinct elements with n_distinct, and use case_when to change the entries that have more than one unique element to 'Mixed'
library(dplyr)
library(stringr)
library(purrr)
df1 %>%
mutate(col1 = case_when(map_dbl(str_extract_all(col1,
"[a-z]"), n_distinct) >1 ~ "Mixed",
is.na(col1) ~ NA_character_,
TRUE ~ substr(col1, 1, 1)))
-output
# col1
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or another option is to split the column by the delimiter with separate_rows, and do a group by row_number to summarise elements having more than one row (after the distinct) to be 'Mixed'
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(col1) %>%
distinct() %>%
group_by(rn) %>%
summarise(col1 = case_when(n() > 1 ~ 'Mixed', TRUE ~ first(col1)),
.groups = 'drop') %>%
select(-rn)
-output
# A tibble: 5 x 1
# col1
# <chr>
#1 a
#2 Mixed
#3 Mixed
#4 <NA>
#5 b
Or using base R with a compact option
v1 <- gsub("([a-z])\\1+", "\\1", gsub(";", "", df1$col1))
replace(v1, nchar(v1) > 1, "Mixed")
#[1] "a" "Mixed" "Mixed" NA "b"
The issue in the OP's function is that it is extracting only the first [[1]] list element
as.list(strsplit(x, ";")[[1]])
as strsplit returns a list whose length equals the number of rows of the input. By selecting only the first element, the function computes a single value from the first row, and mutate recycles it across all rows
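To make that concrete, here is a minimal sketch. It assumes a plain character vector x that mirrors col1 from the question, and the names parts and per_row are only illustrative.

```r
# Example vector mirroring col1 from the question
x <- c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")

# strsplit() is vectorised: it returns one list element per input string
parts <- strsplit(x, ";")
length(parts)   # 5, one element per row
parts[[1]]      # only the pieces of the first string

# Inside mutate(), the function receives the whole column at once,
# so [[1]] inspects row 1 only and that single result is recycled.

# Looping over every list element handles each row separately
per_row <- vapply(parts, function(p) {
  u <- unique(p)
  if (length(u) == 1) u else "Mixed"
}, character(1))
per_row
```

Using vapply with character(1) is a design choice here: it fails loudly if any row does not produce exactly one string.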
data
df1 <- structure(list(col1 = c("a", "a;a;b;c", "c;b;a", NA, "b;b;b")),
class = "data.frame", row.names = c(NA,
-5L))
You can write the grouping function as :
grouping <- function(x) {
sapply(strsplit(x, ';'), function(x)
if(length(unique(x)) == 1) unique(x) else 'Mixed')
}
db$test <- grouping(db$col1)
db
# col1 test
#1 a a
#2 a;a;b;c Mixed
#3 c;b;a Mixed
#4 <NA> <NA>
#5 b;b;b b
Related
I have a dataframe with hundreds of variables, grouped into different factors ("Happy_", "Sad_", etc.), and I want to create a set of new variables indicating whether a participant gave a rating of 4 on any of the variables in a factor. However, if any of the variables in that factor is NA, the new variable should also be NA.
I have tried the following, but it didn't work:
library(tidyverse)
df <- data.frame(Subj = c("A", "B", "C", "D"),
Happy_1_Num = c(4,2,2,NA),
Happy_2_Num = c(4,2,2,1),
Happy_3_Num = c(1,NA,2,4),
Sad_1_Num = c(2,1,4,3),
Sad_2_Num = c(NA,1,2,4),
Sad_3_Num = c(4,2,2,1))
# Doesn't work
df <- df %>% mutate(Happy_Any4 = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
I tried a workaround that first generates a set of variables indicating whether the factor has any NA, and then checks whether the participant gave any rating of 4. It works, but since I have many factors, I was wondering if there is a more elegant way of doing it.
# workaround
df <- df %>% mutate(
NA_Happy = ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ is.na(.)), 1,0),
NA_Sad = ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ is.na(.)), 1,0))
df <- df %>% mutate(
Happy_Any4 = ifelse(NA_Happy == 1, NA,
ifelse(if_any(matches("^Happy_") & matches("_Num$"), ~ . == 4),1,0)),
Sad_Any4 = ifelse(NA_Sad == 1, NA,
ifelse(if_any(matches("^Sad_") & matches("_Num$"), ~ . == 4),1,0)))
Here is a base R option using split.default -
tmp <- df[-1]
cbind(df, sapply(split.default(tmp, sub('_.*', '', names(tmp))),
function(x) as.integer(rowSums(x== 4) > 0)))
# Subj Happy_1_Num Happy_2_Num Happy_3_Num Sad_1_Num Sad_2_Num Sad_3_Num Happy Sad
#1 A 4 4 1 2 NA 4 1 NA
#2 B 2 2 NA 1 1 2 NA 0
#3 C 2 2 2 4 2 2 0 1
#4 D NA 1 4 3 4 1 NA 1
sub keeps only the "Happy" or "Sad" part of the names, split.default splits the columns into groups based on that, and sapply computes, for each group, whether any value of 4 is present in a row.
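The intermediate steps can be inspected one at a time. This is only a sketch on a small made-up data frame (two subjects, three rating columns), and the names prefix, groups and flags are illustrative.

```r
# Small made-up data, same naming scheme as in the question
df <- data.frame(Subj = c("A", "B"),
                 Happy_1_Num = c(4, 2),
                 Happy_2_Num = c(1, NA),
                 Sad_1_Num = c(2, 4))

tmp <- df[-1]                           # drop the Subj column
prefix <- sub("_.*", "", names(tmp))    # "Happy" "Happy" "Sad"

# split.default() splits the columns (not the rows) by prefix,
# giving a list with one data frame per factor
groups <- split.default(tmp, prefix)
names(groups)

# rowSums() returns NA for a row containing an NA, so the flag is NA there
flags <- sapply(groups, function(g) as.integer(rowSums(g == 4) > 0))
flags
```

Because the prefix is derived from the column names, new factors are picked up without writing them out by hand.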
If you can afford to write each and every factor manually you can do -
library(dplyr)
df %>%
mutate(Happy = as.integer(rowSums(select(., starts_with('Happy')) == 4) > 0),
Sad = as.integer(rowSums(select(., starts_with('Sad')) == 4) > 0))
Here is another workaround that transposes the data.frame and uses apply on the columns. I'm not sure it's more elegant, but here it is ^^
tmp <- cbind(sub("^((Happy)|(Sad))(_.*_Num)$", "\\1", colnames(df)), t(df))
Happy_Any4 <- apply(tmp[tmp[,1]== "Happy", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
Sad_Any4 <- apply(tmp[tmp[,1]== "Sad", -1], 2,
function(x) ifelse(any(is.na(x)), NA, length(grep("4", x))) )
df <- cbind(df, Happy_Any4 = Happy_Any4, Sad_Any4 = Sad_Any4)
EDIT: The above was a rather odd first attempt; the version below is cleaner.
It works because the sum of a vector that contains an NA is NA.
df <- df %>% mutate(Happy_Any4 = apply(df[,grep("^Happy_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)),
Sad_Any4 = apply(df[, grep("^Sad_.*_Num$", colnames(df))],
1, function(x) 1*(sum(x == 4) > 0)))
The apply looks at every row, but only at the columns whose names match the pattern (found with grep). It then compares every value with 4, which gives a logical vector whose sum is the number of occurrences. The presence of an NA makes the sum NA. I then just check whether the sum is above 0, and the 1* turns the resulting logical into a numeric 0/1.
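The NA behaviour is easy to check on a single row. A small sketch with made-up values (the vector names are illustrative) also shows why sum is used here rather than any:

```r
with_na <- c(2, NA, 4)   # a row that contains a 4 and an NA
no_na   <- c(2, 1, 4)    # a row that contains a 4 and no NA

with_na == 4             # FALSE NA TRUE

# sum() propagates the NA, so the flag becomes NA
flag_with_na <- 1 * (sum(with_na == 4) > 0)

# without an NA the flag is a plain 0/1
flag_no_na <- 1 * (sum(no_na == 4) > 0)

# any() stops at the first TRUE and would hide the NA
any_with_na <- any(with_na == 4)
```

So any(x == 4) would report 1 for a row with both a 4 and an NA, which is not what the question asks for.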
Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
"\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
b = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
I'm trying to replace all the NAs in columns of integer type with 0 and the NAs in columns of factor type with an empty string "". The code below is what I'm using, but it doesn't seem to work
for(i in 1:ncol(credits)){
if(sapply(credits[i], class) == 'integer'){
credits[is.na(credits[,i]), i] <- 0
}
else if(sapply(credits[i], class) == 'factor'){
credits[is.na(credits[,i]), i] <- ''
}
You can use across in dplyr to replace column values by class :
library(dplyr)
df %>%
mutate(across(where(is.factor), ~replace(as.character(.), is.na(.), '')),
across(where(is.numeric), ~replace(., is.na(.), 0)))
# a b
#1 1 a
#2 2 b
#3 0 c
#4 4 d
#5 5
The b column is now of class "character". If you need it as a factor, you can wrap the replace call in factor, like:
across(where(is.factor), ~factor(replace(as.character(.), is.na(.), ''))),
data
df <- data.frame(a = c(1, 2, NA, 4:5), b = c(letters[1:4], NA),
stringsAsFactors = TRUE)
Another way of achieving the same:
library(dplyr)
# Dataframe
df <- data.frame(x = c(1, 2, NA, 4:5), y = c('a',NA, 'd','e','f'),
stringsAsFactors = TRUE)
# Creating new columns
df_final<- df %>%
mutate(new_x = ifelse(is.numeric(x) & is.na(x), 0, x)) %>%
# as.character() keeps the labels; ifelse() on a factor returns its integer codes
mutate(new_y = ifelse(is.factor(y) & is.na(y), "", as.character(y)))
# Printing the output
df_final
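One thing to watch with ifelse on a factor column: it drops the factor attributes, so the result contains the underlying integer codes rather than the labels. A minimal sketch on a made-up factor y (the names codes and labels are illustrative):

```r
y <- factor(c("a", NA, "d"))

# ifelse() on the factor itself yields the integer codes as strings
codes <- ifelse(is.na(y), "", y)

# converting to character first keeps the labels
labels <- ifelse(is.na(y), "", as.character(y))

codes
labels
```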
I have a data set as follows:
a <- data.frame(X1="A", X2="B", X3="C", X4="D", X5="0",
X6="0", X7="0", X8="0", X9="0", X10="0")
Basically it is a 1 row X 10 column data.frame.
The resulting data.frame should have the column elements of a as rows rather than columns, and any columns in a that are equal to "0" should not be present in the new data.frame. For example:
# b
# [1] A
# [2] B
# [3] C
# [4] D
Use a transpose and subset with a logical condition
data.frame("b" = t(a)[t(a) != 0])
On a second look, the transpose is not needed
data.frame("b" = a[a != 0])
You could unlist and then subset
subset(data.frame(b = unlist(a), row.names = NULL), b != 0)
# b
#1 A
#2 B
#3 C
#4 D
Using the pivot_longer function, you can reshape your dataframe into a longer format and then filter out values that are "0". With the function column_to_rownames from the tibble package, you can turn the first column into row names.
Altogether, you can do something like this:
library(tidyr)
library(dplyr)
library(tibble)
a %>% pivot_longer(everything(), names_to = "Row", values_to = "b") %>%
filter(b != "0") %>%
column_to_rownames("Row")
b
X1 A
X2 B
X3 C
X4 D
I want to replace a part of the value in df1 with value from df2. If df1$col1 starts with the numbers in df2$col1, replace those four numbers (keep rest) with df2$col2. Same for df1$col2. Example: For 16122567 replace with 5059 resulting in 50592567. Have tried different kinds of starts_with, loops, for(i in ..), mutate etc.. Anyone? (I'm new to R).
df1 col1 col2
1 16122567 89992567
2 17236945 16126548
3 95781657 19995670
4 16126972 56972541
df2 col1 col2
1 1612 5059
2 1723 5044
3 8999 5094
4 1999 9053
Here is one way with dplyr. We can create a new column with first 4 characters of col1, left_join with df2, replace NA's with four characters of col2.x. Finally, we use substr to replace values at specific position.
library(dplyr)
df3 <- df1 %>%
mutate(col1 = substr(col1, 1, 4)) %>%
left_join(df2 %>% mutate(col1 = as.character(col1)), by = 'col1') %>%
mutate(col2.y = ifelse(is.na(col2.y), substr(col2.x, 1, 4), col2.y),
col2.x = as.character(col2.x))
substr(df3$col2.x, 1, 4) <- df3$col2.y
df3
# col1 col2.x col2.y
#1 1612 50592567 5059
#2 1723 50446548 5044
#3 9578 19995670 1999
#4 1612 50592541 5059
Here is another approach using base R. We can create a function to do the check and manipulate the text and then apply that function to any column we want to modify.
# the data
df1 <- data.frame(col1 = c(16122567, 17236945, 95781657, 16126972),
col2 = c(89992567, 16126548, 19995670, 56972541))
df2 <- data.frame(col1 = c(1612, 1723, 8999, 1999),
col2 = c(5059, 5044, 5094, 9053))
# a function to do the check and create the chimera strings
check_and_paste <- function(check1, check2, replacement) {
res <- c()
for (i in seq_along(check1)) {
four_digits <- substr(check1[i], 1, 4)
if (four_digits %in% check2) {
res[i] <- paste(replacement[which(four_digits == check2)],
substr(check1[i], 5, 8),
sep = "")
} else {
res[i] <- check1[i]
}
}
return(as.numeric(as.character(res))) # to return numbers
}
# apply to the first column
new_col1 <- check_and_paste(
check1 = df1$col1,
check2 = df2$col1,
replacement = df2$col2
)
# and the second
new_col2 <- check_and_paste(
check1 = df1$col2,
check2 = df2$col1,
replacement = df2$col2
)
# the new data frame
data.frame(new_col1, new_col2)
# new_col1 new_col2
#1 50592567 50942567
#2 50446945 50596548
#3 95781657 90535670
#4 50596972 56972541
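The same lookup can also be done without an explicit loop by using match. This is only a sketch under the same assumptions as above (numeric columns, four-digit prefixes); replace_prefix and new_df are hypothetical names, not part of any package.

```r
df1 <- data.frame(col1 = c(16122567, 17236945, 95781657, 16126972),
                  col2 = c(89992567, 16126548, 19995670, 56972541))
df2 <- data.frame(col1 = c(1612, 1723, 8999, 1999),
                  col2 = c(5059, 5044, 5094, 9053))

# hypothetical helper: swap a matching four-digit prefix for its replacement
replace_prefix <- function(x, from, to) {
  x    <- sprintf("%.0f", x)      # numbers as plain digit strings
  from <- sprintf("%.0f", from)
  to   <- sprintf("%.0f", to)
  idx  <- match(substr(x, 1, 4), from)   # NA where no prefix matches
  hit  <- !is.na(idx)
  # keep everything after the fourth character, put the new prefix in front
  x[hit] <- paste0(to[idx[hit]], substring(x[hit], 5))
  as.numeric(x)
}

# apply the helper to every column of df1
new_df <- data.frame(lapply(df1, replace_prefix,
                            from = df2$col1, to = df2$col2))
new_df
```

Values whose prefix is not found in df2$col1 are left unchanged, as in the loop version.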