Change multiple vector classes in R at once

I am trying to change the classes of multiple vectors at once, using %>% and mutate_if,
in an empty dataset of logical vectors. I can change them one by one with as.factor().
My dataset looks like the following:
ID code
pc01 cat
pc02 dog
pc03 cat
pc04 horse
pc01 dog
pc02 horse
Here is my whole code, in case it helps:
library(dplyr)

G <- as.factor(levels(as.factor(id)))
dat <- as.data.frame(G)

datprep <- data.frame(matrix(vector(), length(G),
                             length(levels(as.factor(code)))))
colnames(datprep) <- levels(as.factor(code))

datD <- cbind(dat, datprep)

# columns are logical, shall be factors:
datD %>% mutate_if(is.logical, as.factor)
Any suggestions?

In newer versions of dplyr (1.0.0 and later), we can also use across():
library(dplyr)
df <- df %>%
  mutate(across(where(is.character), factor))

In base R you could do (in recent R versions, as.is = FALSE is needed so that characters become factors rather than staying character):
df <- type.convert(df, as.is = FALSE)
or even
df <- rapply(df, factor, "character", how = "replace")
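Applied to the question's actual setup, where the pre-allocated columns are logical rather than character, a minimal sketch might look like this (the small data frame below is hypothetical and only stands in for the pre-allocated datD):
library(dplyr)

# three all-NA logical columns, mimicking the empty prepared data frame
datD <- data.frame(cat = rep(NA, 3), dog = rep(NA, 3), horse = rep(NA, 3))
sapply(datD, class)   # all "logical"

datD <- datD %>% mutate(across(where(is.logical), as.factor))
sapply(datD, class)   # all "factor"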

If I understood your problem correctly (I'm not sure, since there is no logical column in your example, just character columns), one way is to include the ~ and the . with mutate_if (and many other dplyr verbs):
library(dplyr)

df <- dplyr::tibble(ID = c("pc01", "pc02", "pc03", "pc04", "pc01", "pc02"),
                    code = c("cat", "dog", "cat", "horse", "dog", "horse"))
df %>%
  dplyr::mutate_if(is.character, ~ as.factor(.))
ID code
<fct> <fct>
1 pc01 cat
2 pc02 dog
3 pc03 cat
4 pc04 horse
5 pc01 dog
6 pc02 horse
The tilde ~ signals a one-sided formula (the expression to its right is the function that gets applied) and the . stands for the current column. I used is.character() because that is what your example columns are, but you can change it to any other type check.
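For what it's worth, the formula is not strictly required here; the first two forms below behave the same, and the lambda only becomes useful when you want to pass extra arguments or build a more complex expression (the levels argument in the last line is just an illustration):
df %>% dplyr::mutate_if(is.character, as.factor)
df %>% dplyr::mutate_if(is.character, ~ as.factor(.))
df %>% dplyr::mutate_if(is.character, ~ factor(., levels = sort(unique(.))))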

Related

General way of uniting column with subsequent column inside case_when inside mutate(across()) in R

I have a df in R, this is a simplified version:
ID <- c("AA1", "AA2","AA3","AA4","AA5")
Pop <- c("AA","AA","AA","AA","AA")
abc08 <- c(2,1,2,0,2)
...4 <- c(3,4,4,0,3)
abc11 <- c(2,2,2,2,1)
...5 <- c(3,4,4,4,3)
df <- data.frame(ID, Pop, abc08, ...4, abc11, ...5)
df
I want to unite columns that start with "abc" with their subsequent column, making the df look like this:
ID <- c("AA1", "AA2","AA3","AA4","AA5")
Pop <- c("AA","AA","AA","AA","AA")
abc08 <- c("2-3","1-4","2-4",NA,"2-3")
abc11 <- c("2-3","2-4","2-4","2-4","1-3")
df <- data.frame(ID, Pop, abc08, abc11)
df
My original df has more columns, so I'm searching for a general way of doing it.
The code I have so far looks like this:
df %>%
  mutate(across(starts_with("abc"),
                ~ case_when(. > 0 ~ paste(., "content of subsequent column", sep = "-"),
                            . == 0 ~ NA_character_))) %>%
  select(!starts_with("..."))
where "content of subsequent column" obviously needs to be something that identifies the column following ´.´. I can't be the first who has had this problem, but I've searched for hours now without getting anywhere...
Try the following:
library(tidyverse)

df %>%
  mutate(across(starts_with("abc"),
                ~ paste0(., "-", get(colnames(df)[which(colnames(df) == cur_column()) + 1])))) %>%
  select(!starts_with("..."))
Output:
ID Pop abc08 abc11
1 AA1 AA 2-3 2-3
2 AA2 AA 1-4 2-4
3 AA3 AA 2-4 2-4
4 AA4 AA 0-0 2-4
5 AA5 AA 2-3 1-3
This works without case_when because paste0() is vectorised. The long expression starting with get() dynamically retrieves the values of whatever column comes after the target "abc" column. Note that zeros come through as "0-0" rather than NA; a variant that also produces the NA from the desired output is sketched below.
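If you also want zeros to turn into NA, as in the desired output shown in the question, one way is to combine the asker's case_when with the same dynamic lookup; a sketch, only checked against the toy data above:
library(dplyr)

df %>%
  mutate(across(starts_with("abc"),
                ~ case_when(
                    . == 0 ~ NA_character_,
                    TRUE   ~ paste0(., "-", get(colnames(df)[which(colnames(df) == cur_column()) + 1]))
                  ))) %>%
  select(!starts_with("..."))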

Extract words from text using dplyr and stringr

I'm trying to find an effective way to extract words from a text column in a dataset. The approach I'm using is:
library(dplyr)
library(stringr)
Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
This is just an example, but I have more than 2000 possible words to extract from each row. I don't know of another approach to use; will such a big regex make things slow, or does the size of the regex not matter? I think no more than one of these words will appear in each row, but is there a way to create multiple columns automatically if more than one word appears in a row?
We can use str_extract_all to return a list, convert the list elements to a named list or tibble and use unnest_wider
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
  mutate(Words = str_extract_all(Text, keywords),
         Words = map(Words, ~ as.list(unique(.x)) %>%
                       set_names(str_c('col', seq_along(.))))) %>%
  unnest_wider(Words)
# A tibble: 3 x 3
# Text col1 col2
# <fct> <chr> <chr>
#1 A little bird told me about the dog bird dog
#2 A pig in a poke pig <NA>
#3 As busy as a bee bee <NA>
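As a possible simplification of the above (assuming a reasonably recent tidyr, where unnest_wider() can auto-name unnamed elements via names_sep), the manual set_names() step can be skipped; the new columns are then called Words_1, Words_2, and so on:
data %>%
  mutate(Words = str_extract_all(Text, keywords)) %>%
  unnest_wider(Words, names_sep = "_")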
Try intersect() with the keywords as a character vector (i.e. the un-collapsed vector of words, not the single regex):
data <- data.frame(Text = Text,
                   Word = sapply(Text,
                                 function(v) intersect(unlist(strsplit(v, split = " ")), keywords),
                                 USE.NAMES = FALSE))
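One refinement worth considering (my addition, not part of the answers above): wrapping the keywords in word boundaries avoids partial matches such as "cat" inside "catalog", and a single combined regex of a couple of thousand alternatives is usually still acceptably fast in stringr:
library(dplyr)
library(stringr)

words <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
keywords <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

data %>% mutate(Word = str_extract(Text, keywords))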

Using multiple conditions in select (dplyr)

I want to choose certain columns of a data frame with dplyr::select(), using contains() more than once. I know there are other ways to solve it, but I wonder whether this is possible inside select(). An example:
df <- data.frame(column1= 1:10, col2= 1:10, c3= 1:10)
library(dplyr)
names(select(df, contains("col") & contains("1")))
This gives an error but I would like the function to give "column1".
I expected that select() would allow a similar approach to filter(), where we can set multiple conditions with operators, i.e. something like filter(df, column1 %in% 1:5 & col2 != 2).
EDIT
I notice that my question is more general, and I wonder whether it is possible to pass any combination to select(), like select(df, contains("1") | !starts_with("c")), and so on, but I can't figure out how to do it.
You can use select_if and grepl
library(dplyr)
df %>%
  select_if(grepl("col", names(.)) & grepl("1", names(.)))
# column1
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
#10 10
If you want to use select with contains you could do something like this:
df %>%
  select(intersect(contains("col"), contains("1")))
This can be combined in other ways, as mentioned in the comments:
df %>%
  select(intersect(contains("1"), starts_with("c")))
You can also chain two select calls:
library(dplyr)
df <- data.frame(column1 = 1:10, col2 = 1:10, c3 = 1:10)
df %>%
  select(contains("col")) %>%
  select(contains("1"))
Not too elegant for one-line lovers
You could use the dplyr::intersect function
select(df, intersect(contains("col"), contains("1")))
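Worth noting: in more recent dplyr/tidyselect versions (dplyr >= 1.0.0, if I remember the version correctly), the boolean form from the question works directly inside select():
library(dplyr)

df <- data.frame(column1 = 1:10, col2 = 1:10, c3 = 1:10)
names(select(df, contains("col") & contains("1")))
# "column1"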

Pass column names as strings to group_by and summarize

With dplyr starting at version 0.7, the methods ending with an underscore, such as summarize_() and group_by_(), are deprecated, since we are supposed to use quosures.
See:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I am trying to implement the following example using quo and !!
Working example:
df <- data.frame(x = c("a","a","a","b","b","b"), y = c(1,1,2,2,3,3), z = 1:6)
lFG <- df %>%
  group_by(x, y)
lFG %>% summarize(min(z))
However, in the case I need to implement, the columns to group by and summarize are specified as strings:
cols2group <- c("x","y")
col2summarize <- "z"
How can I get the same example as above working?
For this you can now use the _at versions of the verbs:
df %>%
  group_by_at(cols2group) %>%
  summarize_at(.vars = col2summarize, .funs = min)
Edit (2021-06-09): please see Ronak Shah's answer below, using summarise(across(all_of(col2summarize), min)), which is now the preferred option.
From dplyr 1.0.0 you can use across():
library(dplyr)
cols2group <- c("x","y")
col2summarize <- "z"

df %>%
  group_by(across(all_of(cols2group))) %>%
  summarise(across(all_of(col2summarize), min)) %>%
  ungroup()
# x y z
# <chr> <dbl> <int>
#1 a 1 1
#2 a 2 3
#3 b 2 4
#4 b 3 5
Another option is to use non-standard evaluation (NSE) and have R interpret the strings as names of columns:
cols2group <- c("x","y")
col2summarize <- "z"

df %>%
  group_by(!!!rlang::syms(cols2group)) %>%
  summarize(min(!!rlang::sym(col2summarize)))
The rlang::syms() function takes the strings and turns them into symbols, which are in turn spliced in with !!! (use rlang::sym() and !! for a single string) and used as names in the context of df, where they refer to the relevant columns. There are different ways of doing the same thing, as always; this is the shorthand I tend to use!
See ?dplyr::across for the updated way to do this, since group_by_at() and summarize_at() are now superseded.
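And from dplyr 1.1.0 onwards, the per-operation .by argument (which also accepts tidyselect helpers such as all_of()) can replace the explicit group_by()/ungroup() pair; a sketch reusing the objects defined above:
library(dplyr)

df %>%
  summarise(across(all_of(col2summarize), min), .by = all_of(cols2group))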

Pop out observation/row from a data frame

My data looks like this:
library(tidyverse)
set.seed(1)
df <- tibble(
  id = c("cat", "cat", "mouse", "dog", "fish", "fish", "fish"),
  value = rnorm(7, 100, sd = 50)
)
How might I "pop out" the top value of fish, as in move fish to a new data frame and simultaneously remove it from the current data frame?
This works (but it doesn't seem all that elegant):
df_store <- df %>%
  filter(id == "fish") %>%
  top_n(1)
df <- anti_join(df, df_store)
Is there a better way?
You can do both actions in one single line by using the package pipeR.
library(pipeR); library(dplyr)
df <- df %>>% filter(id == "fish") %>>% top_n(1) %>>% (~ df2) %>% anti_join(df, .)
print(df2)
#### 1 fish 124.3715
print(df)
#### 1 mouse 58.21857
#### 2 dog 179.76404
#### 3 fish 58.97658
#### 4 cat 68.67731
#### 5 cat 109.18217
#### 6 fish 116.47539
I'm no expert on pipeR, so check out its documentation for how this kind of assignment within a pipe actually works.
Just one remark: when using top_n(), I recommend specifying the value column explicitly; by default it is the last column, but that is easy to forget: top_n(1, value).
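If you would rather stay within dplyr alone (my sketch, not taken from the answer above), slice_max() from dplyr 1.0.0+ plus anti_join() does the same job in two short steps:
library(dplyr)

# keep the top "fish" row by value...
df_store <- df %>%
  filter(id == "fish") %>%
  slice_max(value, n = 1)

# ...and drop exactly that row from the original data frame
df <- anti_join(df, df_store, by = c("id", "value"))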
