How to convert all columns where entries have length ≤1 to numeric? - r

I have a data frame with ~80 columns, and ~20-40 of those columns have single-digit integers that were stored as characters. Other character columns are complete sentences (so, length >>> 1), and so get coerced to NA if I try mutate_if(is.character, as.numeric).
I would like to transform those efficiently, and based on this question, I was hoping for something like this:
df %>% map_if(is.character & length(.) <= 1, as.numeric)
However, that doesn't work. I'm hoping for a tidy solution, maybe using purrr?

The best function for these situations is type_convert(), from readr:
"[type_convert() re-converts character columns in a data frame], which is useful if you need to do some manual munging - you can read the columns in as character, clean it up with (e.g.) regular expressions and other transformations, and then let readr take another stab at parsing it."
So, all you need to do is add it at the end of your pipe:
df %>% ... %>% type_convert()
Alternatively, we can use type.convert from base R, which detects each column's type from its values and converts the column accordingly:
df[] <- type.convert(df, as.is = TRUE)
If the constraint is to convert only the columns whose entries are at most one character long:
i1 <- !colSums(nchar(as.matrix(df)) > 1)
df[i1] <- type.convert(df[i1], as.is = TRUE)
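A small worked example may make the column selection easier to follow. This is only a sketch in base R: the data frame and its column names (note, score) are made up for illustration and are not the asker's data.

```r
# Hypothetical example data: one sentence column, one single-digit column
df <- data.frame(
  note  = c("first sentence", "second sentence", "third one"),
  score = c("1", "2", "3"),
  stringsAsFactors = FALSE
)

# TRUE for columns in which no entry is longer than one character
i1 <- !colSums(nchar(as.matrix(df)) > 1)

# Convert only those columns; as.is = TRUE keeps strings as character
df[i1] <- type.convert(df[i1], as.is = TRUE)

str(df)
```

Here only score is selected by i1, so note stays character while score becomes integer.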
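If we want to use tidyverse, there is parse_guess from readr. The predicate has to be a function or a formula (note the ~), and it should check is.character first:
library(dplyr)
library(readr)
df %>%
  mutate_if(~ is.character(.) && all(nchar(.) == 1), parse_guess)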

You could check for nchar of the column in mutate_if
library(dplyr)
df %>% mutate_if(~all(nchar(.) == 1) & is.character(.), as.numeric)
With some example data:
df <- data.frame(a = c("ab", "bc", "de", "de", "ef"),
b = as.character(1:5), stringsAsFactors = FALSE)
df1 <- df %>% mutate_if(~all(nchar(.) == 1) & is.character(.), as.numeric)
str(df1)
#'data.frame': 5 obs. of 2 variables:
# $ a: chr "ab" "bc" "de" "de" ...
# $ b: num 1 2 3 4 5
You could do the same with map_if as well; however, it returns a list, so you need to convert the result back to a data frame:
library(purrr)
df %>%
map_if(~all(nchar(.) == 1) & is.character(.), as.numeric) %>%
as.data.frame(., stringsAsFactors = FALSE)
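For readers on dplyr 1.0 or later, the same idea can probably be written with across() and where(), which the dplyr documentation recommends over the mutate_if family. The sketch below is an assumption-laden variant, not the answer's own code: the helper name is_short_chr and the extra column c (with a missing value) are invented here, and na.rm = TRUE is added so that a missing entry does not turn the predicate into NA.

```r
library(dplyr)

# Example data from the answer, plus a hypothetical column with an NA
df <- data.frame(
  a = c("ab", "bc", "de", "de", "ef"),
  b = as.character(1:5),
  c = c("1", NA, "3", "4", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for character columns whose non-missing
# entries are all at most one character long
is_short_chr <- function(x) {
  is.character(x) && all(nchar(x) <= 1, na.rm = TRUE)
}

df1 <- df %>%
  mutate(across(where(is_short_chr), as.numeric))

str(df1)
```

Because where() takes a plain predicate function, the result stays a data frame and no conversion back from a list is needed.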

Related

subset dataframe by specific string entries in R

I have data in data frame format here (data). I want to subset the data using the specific string "Spatially clustered", so that the result is a data frame containing all columns with entries that are "Spatially clustered". How can I do that? I have tried this
moran_deviation_data_multiple_correction_1january_raw_pval_conclusion = data
moran_deviation_data_multiple_correction_1january_raw_pval_conclusion_spatially_clustered = select(moran_deviation_data_multiple_correction_1january_raw_pval_conclusion, matches("clustered"))
moran_deviation_data_multiple_correction_1january_raw_pval_conclusion_spatially_clustered
also this one
moran_deviation_data_multiple_correction_1january_raw_pval_conclusion_spatially_clustered = moran_deviation_data_multiple_correction_1january_raw_pval_conclusion[apply(moran_deviation_data_multiple_correction_1january_raw_pval_conclusion,1, function(x) any(grepl("dispersed", x))), ]
moran_deviation_data_multiple_correction_1january_raw_pval_conclusion_spatially_clustered
However, the result is not what I expected.
Perhaps this helps
library(dplyr)
library(stringr)
df2 <- df1 %>%
select(where(~ any(str_detect(.x, "Spatially clustered"))))
-output
> dim(df2)
[1] 5 17989
> dim(df1)
[1] 5 23474
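To see what this column filter does on something small, here is a base R sketch of the same idea. The toy data frame and its column names (gene1 to gene3) are hypothetical stand-ins for the asker's table, and vapply with grepl is swapped in for select(where(...)) and str_detect.

```r
# Hypothetical toy data standing in for the asker's table
df1 <- data.frame(
  gene1 = c("Spatially clustered", "Random", "Random"),
  gene2 = c("Random", "Spatially dispersed", "Random"),
  gene3 = c("Random", "Random", "Spatially clustered"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column: does any entry contain the string?
has_match <- vapply(
  df1,
  function(col) any(grepl("Spatially clustered", col, fixed = TRUE)),
  logical(1)
)

# Keep only the matching columns (drop = FALSE keeps a data frame)
df2 <- df1[, has_match, drop = FALSE]

names(df2)
```

Only gene1 and gene3 contain the string, so those are the two columns kept.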

Compare two character vectors in R based on vector of strings

I have two lists A and B. The dates in A are 2000 - 2022 while those in B are 2023-2030.
names(A) and names(B) give the following character vectors:
a <- c("ACC_a_his", "BCC_b_his", "Can_c_his", "CES_d_his")
b <- c("ACC_a_fu", "BCC_b_fu", "Can_c_fu", "CES_d_fu","FGO_c_fu")
Also, I have a string vector, c which is common across the names in a and b:
c=c("ACC","BCC", "Can", "CES", "FGO")
Note that the strings in c do not always appear in the same position in filenames. The string can be at the beginning, middle or end of filenames.
Challenge
Using the strings in c I would like to get the difference (i.e., which name exists in b but not in a or vice versa) between the names in a and b
Expected output = "FGO_c_fu"
rbind (or whatever is best) matching dataframes in lists A and B if the names are similar based on string in c
Update: See OP's comment:
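Try this, where df is the padded data frame built in the first answer below and c is the vector of common strings: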
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
# or just library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(x = str_extract(value, paste(c, collapse = "|"))
) %>%
group_by(x) %>%
filter(!any(row_number() > 1)) %>%
na.omit() %>%
pull(value)
[1] "FGO_c_fu"
First answer:
Here is an alternative approach:
We create a list from the two vectors, which are of unequal length.
With data.frame(lapply(my_list, `length<-`, max(lengths(my_list)))) we pad them to the same length and create a data frame.
We pivot longer and group by everything before the first underscore.
Finally, we remove NA and filter:
library(dplyr)
library(tidyr)
library(tibble)
my_list <- tibble::lst(a, b)
df <- data.frame(lapply(my_list, `length<-`, max(lengths(my_list))))
df %>%
pivot_longer(everything()) %>%
group_by(x = sub("\\_.*", "", value)) %>%
filter(!any(row_number() > 1)) %>%
na.omit() %>%
pull(value)
[1] "FGO_c_fu"
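The same comparison can also be sketched in base R without reshaping. This is an alternative technique, not the answer's method: it extracts the key from each name with regexpr and regmatches and then compares the keys. The vector is called keys here (the question calls it c), and the sketch assumes every name contains exactly one of the keys, because regmatches silently drops names without a match.

```r
a <- c("ACC_a_his", "BCC_b_his", "Can_c_his", "CES_d_his")
b <- c("ACC_a_fu", "BCC_b_fu", "Can_c_fu", "CES_d_fu", "FGO_c_fu")
keys <- c("ACC", "BCC", "Can", "CES", "FGO")

# One alternation pattern that matches any of the keys
pattern <- paste(keys, collapse = "|")

# Key found in each name, wherever it sits in the string
key_a <- regmatches(a, regexpr(pattern, a))
key_b <- regmatches(b, regexpr(pattern, b))

# Names whose key has no counterpart in the other vector
only_a <- a[!key_a %in% key_b]
only_b <- b[!key_b %in% key_a]

c(only_a, only_b)
```

With the vectors from the question this leaves "FGO_c_fu", the expected output.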

Add multiple columns with mutate using column-based conditions, without using explicit column name + POSIX

I have a dataframe of data: 1 column is POSIX, the rest is data.
I need to remove selectively some data from a group of columns and add these "new" columns to the original dataframe.
I can "easily" do it in base R (I am an old-style user). I'd like to do it more compactly with mutate_at or with other function... although I am having several issues.
A solution homemade with base R could be
df <- data.frame("date" = seq.POSIXt(as.POSIXct(format(Sys.time(),"%F %T"),tz="UTC"),length.out=20,by="min"), "a.1" = rnorm(20,0,3), "a.2" = rnorm(20,1,2), "b.1"= rnorm(20,1,4), "b.2"= rnorm(20,3,4))
df1 <- lapply(df[,grep("^a",names(df))], function(x) replace(x, which(x > 0 & x < 0.2), NA))
df1 <- data.frame(matrix(unlist(df1), nrow = nrow(df), byrow = F)) ## convert to data.frame
names(df1) <- grep("^a",names(df),value=T) ## rename columns
df1 <- cbind.data.frame("date"=df$date, df1) ## add date
Can anyone help me in setting up something working with dplyr + transmute?
So far I come up with something like:
df %>%
select(starts_with("a.")) %>%
transmute(
case_when(
.>0.2 ~ NA,
)
) %>%
cbind.data.frame(df)
But I am quite stuck, since I can't combine transmute with case_when: all examples that I found use explicitly the column names in case_when, but I can't, since I won't know the names of the column in advance. I will only know the initial of the columns that I need to transmute.
Thanks,
Alex
We can use transmute_at if the intention is to return only those columns specified in the vars
library(dplyr)
df %>%
transmute_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
bind_cols(df %>% select(date), .)
If we need all the columns to return, but only change the columns of interest in vars, then we need mutate_at instead of transmute_at
df %>%
mutate_at(vars(starts_with('a')), ~ case_when(. > 0.2~ NA_real_, TRUE~ .)) %>%
select(date, starts_with('a')) # only needed if we are selecting a subset of columns
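If the cleaned values should be added as new columns next to the originals, as the question title suggests, across() with its .names argument (dplyr 1.0 or later) is one possible route. The sketch below makes several assumptions: it uses the condition from the asker's base R code (values between 0 and 0.2 become NA) rather than the > 0.2 condition above, a fixed start time and seed replace Sys.time() for reproducibility, and the ".clean" suffix is an invented naming choice.

```r
library(dplyr)

set.seed(1)  # fixed seed so the sketch is reproducible
df <- data.frame(
  date = seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
             by = "min", length.out = 20),
  a.1 = rnorm(20, 0, 3),
  a.2 = rnorm(20, 1, 2),
  b.1 = rnorm(20, 1, 4),
  b.2 = rnorm(20, 3, 4)
)

# Add cleaned copies of every column starting with "a."; the originals
# are kept because .names gives the results new column names
df2 <- df %>%
  mutate(across(
    starts_with("a."),
    ~ replace(.x, .x > 0 & .x < 0.2, NA),
    .names = "{.col}.clean"
  ))

names(df2)
```

The column names never appear explicitly in the condition, so the same call works for any set of columns sharing the prefix.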

How to use mutate_at or mutate_if at the same time to do multiple action on data

I would like to apply 3 functions using one code on the same variables in my data.
I have a data set, and I want to apply these functions to certain columns in it:
1- make them all factors
2- replace spaces in the columns with missing (convert space values to missing)
3- give missing values an explicit factor level using fct_explicit_na
I have done this in separate lines of code, but I want to merge them using the dplyr mutate functions. I tried the following, but it didn't work:
cols <- c("id12", "id13", "id14", "id15")
data_new <- data_old %>%
mutate_if(cols=="", NA) %>% # replace space with NA for cols
mutate_at(cols, factor) %>% # then turn them into factors
mutate_at(cols, fct_explicit_na) # give NAs explicit factor level
)
I get the error:
Error in tbl_if_vars(.tbl, .p, .env, ..., .include_group_vars = .include_group_vars) :
length(.p) == length(tibble_vars) is not TRUE
The mutate_if step is not doing what the OP intends. Instead, we can do this in a single step with
library(dplyr)
library(forcats) # provides fct_explicit_na
data_old %>%
  mutate_at(cols, ~ na_if(., "") %>%
    factor %>%
    fct_explicit_na)
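The three steps (blank to NA, convert to factor, give NA its own level) can also be spelled out in base R, which may help when checking what the pipeline produces. This is a sketch under assumptions: the toy data, the helper name blank_to_missing_level and the level label "(Missing)" are chosen here for illustration, and addNA is swapped in for fct_explicit_na.

```r
# Hypothetical toy data; only id12 and id13 should be converted
data_old <- data.frame(
  id12  = c("a", "", "b"),
  id13  = c("", "x", "x"),
  other = c("keep", "", "me"),
  stringsAsFactors = FALSE
)
cols <- c("id12", "id13")

# Hypothetical helper combining the three steps for one column
blank_to_missing_level <- function(x, na_level = "(Missing)") {
  x[!is.na(x) & x == ""] <- NA               # step 2: blank -> NA
  f <- factor(x)                             # step 1: to factor
  f <- addNA(f, ifany = TRUE)                # step 3: NA becomes a level
  levels(f)[is.na(levels(f))] <- na_level    # ...with an explicit label
  f
}

data_new <- data_old
data_new[cols] <- lapply(data_old[cols], blank_to_missing_level)

str(data_new)
```

The column other is left untouched, including its empty string, because it is not listed in cols.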
Why didn't the OP's code work?
Using a reproducible example, the code below converts the factor columns to character and replaces a few Species values with empty strings
iris1 <- iris %>%
mutate_if(is.factor, as.character) %>%
mutate(Species = replace(Species, c(1, 3, 5), ""))
Now, if we do
iris1 %>%
mutate_if("Species" == "", NA)
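it compares two string literals ("Species" == "" is simply FALSE) instead of checking the column values. Also, the predicate given to mutate_if should be a function (or formula) that returns a single TRUE or FALSE for each column, and the second argument should be a function rather than NA.
Instead, we can use a predicate together with na_if: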
iris1 %>%
mutate_if(~ any(. == ""), ~ na_if(., "")) %>%
head

dplyr mutate stringr str_detect with multiple conditional arguments and corresponding output

I would like to mutate a string differently, depending on the format. This example has 2 formats based on inclusion of certain punctuation. Each element of the vector contains specific words uniquely associated with the format.
I have tried multiple approaches with ifelse and case_when but am not getting the desired result, which is to "keep" the last part of the string.
I am trying to use easy verbs and am not proficient in regex. Open to any suggestions for an efficient general method.
library(dplyr)
library(stringr)
df <- data.frame(KPI = c("xxxxx.x...Alpha...Keep.1",
"xxxxx.x...Alpha..Keep.2",
"Bravo...Keep3",
"Bravo...Keep4",
"xxxxx...Charlie...Keep.5",
"xxxxx...Charlie...Keep.6"))
dot3dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][3]
dot3dot3split("xxxxx.x...Alpha...Keep.1") # returns as expected
"Keep.1"
dot3split <- function(x) strsplit(x, "..." , fixed = TRUE)[[1]][2]
dot3split("Bravo...Keep3") # returns as expected
"Keep3"
df1 <- df %>% mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = ifelse(str_detect(KPI, paste(c("Alpha", "Charlie"), collapse = '|')), dot3dot3split(KPI),
ifelse(str_detect(KPI, "Bravo"), dot3split(KPI), KPI))) # not working as expected
df1$KPI.v2
"Keep.1" "Keep.1" "Alpha" "Alpha" "Keep.1" "Keep.1"
The functions you designed (dot3dot3split and dot3split) are not vectorized. Because of the [[1]], if the input has more than one element, only the result for the first element is returned, and ifelse then recycles that single value across all rows.
dot3dot3split(c("xxxxx.x...Alpha...Keep.1", "xxxxx.x...Alpha..Keep.2"))
# [1] "Keep.1"
Since you are using stringr, I suggest using str_extract, which is vectorized, to extract the string you want, without ifelse or helper functions that only handle a single element.
df <- data.frame(KPI = c("xxxxx.x...Alpha...apples",
"xxxxx.x...Alpha..bananas",
"Bravo...oranges",
"Bravo...grapes",
"xxxxx...Charlie...cherries",
"xxxxx...Charlie...guavas"))
library(dplyr)
library(stringr)
df1 <- df %>%
mutate_if(is.factor, as.character) %>%
mutate(KPI.v2 = str_extract(KPI, "[A-Za-z]*$"))
df1
# KPI KPI.v2
# 1 xxxxx.x...Alpha...apples apples
# 2 xxxxx.x...Alpha..bananas bananas
# 3 Bravo...oranges oranges
# 4 Bravo...grapes grapes
# 5 xxxxx...Charlie...cherries cherries
# 6 xxxxx...Charlie...guavas guavas
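If the split-based approach from the question is preferred, the helper itself can be made vectorized. The sketch below is one possible way, not the answer's code: last_piece is a hypothetical name, it assumes the part to keep is always the last piece after a "..." separator, and it uses three of the question's strings that actually follow that format. It also assumes no input is an empty string.

```r
# Hypothetical vectorized helper: split every element on "..." and
# return the last piece of each one
last_piece <- function(x, sep = "...") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) p[length(p)], character(1))
}

kpi <- c("xxxxx.x...Alpha...Keep.1",
         "Bravo...Keep3",
         "xxxxx...Charlie...Keep.5")

last_piece(kpi)
```

Because the function returns one value per input element, it can be used directly inside mutate() without ifelse.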
