dplyr unnest() not working for large comma-separated data

Trying to use the unnest() function (from tidyr) to split apart a large character column whose values are separated by commas. The data set has the form:
id keywords
835a24fe-c276-9824-0f4d-35fc81319cca Analytics,Artificial Intelligence,Big Data,Health Care
I want to create a table that has the "id" in column one and each of the "keywords" as a separate entry paired with the same "id".
I'm using the code:
CB_keyword <- tibble(
  id = organizations$uuid[organizations$uuid %in% org_uuid],
  keyword = organizations$category_list[organizations$uuid %in% org_uuid]
) %>%
  unnest(keyword, names_sep = ",")
The %in% subsetting selects the "id" and "keyword" values from another table, and it does this correctly. The pipe into unnest(), however, seems to do nothing: the tibble is unchanged except that the column name is now "keyword,keyword" instead of "keyword"; the data look the same as if unnest() had not been called.

If keywords is a plain character column (not a list column, which is what unnest() expects), use separate_rows() instead of unnest():
library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(keywords, sep = ",\\s*")
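For example, with the sample row from the question (a minimal sketch; df1 and its columns simply mirror the data shown above):
library(dplyr)
library(tidyr)
# toy data matching the question's sample row
df1 <- tibble(
  id = "835a24fe-c276-9824-0f4d-35fc81319cca",
  keywords = "Analytics,Artificial Intelligence,Big Data,Health Care"
)
df1 %>%
  separate_rows(keywords, sep = ",\\s*")
# returns four rows, one per keyword, each paired with the same id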

Related

How to call variables when the variable names are stored as a string

I have a df that I would like to arrange by three variables: ID, AGE, and SEX. How can I refer to them when their names are stored together in a single string, order_var?
order_var <- "ID, AGE, SEX"
df %>% arrange(paste0(order_var))
This does not work. How can I call those three variables?
Here is one option: split order_var at each comma followed by any spaces (\\s*), extract the first list element ([[1]], since strsplit() returns a list), and pass the result to across() with all_of():
library(dplyr)
df %>%
  arrange(across(all_of(strsplit(order_var, ",\\s*")[[1]])))
Another option is to build the full expression as a string and eval() it:
eval(rlang::parse_expr(sprintf('df %%>%% arrange(%s)', order_var)))
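Both options can be checked on a small hypothetical data frame (the values below are made up for illustration):
library(dplyr)
df <- data.frame(ID = c(2, 1, 1), AGE = c(30, 40, 25), SEX = c("M", "F", "F"))
order_var <- "ID, AGE, SEX"
df %>%
  arrange(across(all_of(strsplit(order_var, ",\\s*")[[1]])))
# rows are sorted by ID, then AGE, then SEX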

Filter one dataframe based on the last two digits of an ID column in another dataframe in R

The table data_frame has an ID column with information for over 1000 participants, with values such as "Sample_LI.01".
my_ColData also has an ID column, but for only 40 participants, with values such as "Sample_LI-01".
I want to use the ID column in my_ColData to filter the data_frame table. However, as you may have noticed, the ID formats differ slightly. I wonder if the best way might be to filter on the last two digits?
The code I have so far looks like:
data_frame %>% filter(ID %in% my_ColData$ID, if______)
I have no idea what to write in this if condition. Or is there a better way to achieve my goal? Any suggestions would be appreciated.
We could use str_replace() to replace the - with . so that the 'ID' values from 'my_ColData' match the format of 'ID' in 'data_frame':
library(dplyr)
library(stringr)
data_frame %>%
  filter(ID %in% str_replace(my_ColData$ID, '-', '.'))
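A minimal sketch of how this behaves, with toy IDs in the two formats described in the question:
library(dplyr)
library(stringr)
data_frame <- data.frame(ID = c("Sample_LI.01", "Sample_LI.02", "Sample_LI.99"))
my_ColData <- data.frame(ID = c("Sample_LI-01", "Sample_LI-02"))
data_frame %>%
  filter(ID %in% str_replace(my_ColData$ID, '-', '.'))
# keeps Sample_LI.01 and Sample_LI.02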
Alternatively, we could use str_sub() to compare only the last two digits:
library(dplyr)
library(stringr)
data_frame %>%
  filter(str_sub(ID, -2) %in% str_sub(my_ColData$ID, -2))
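One caveat: with over 1000 participants in data_frame but only 40 in my_ColData, the last two digits alone are unlikely to be unique, so this comparison can keep unintended rows (for example, "Sample_LI.01" and a hypothetical "Sample_XY.01" both end in "01"). The str_replace() approach above, which compares full IDs, is safer when the separator is the only difference.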

Selecting columns from dataframe programmatically when column names have spaces

I have a dataframe that I would like to query. Note that the dataframe's columns can change, and the column names contain spaces. I have a function that I want to apply to the dataframe's columns, so I figured I could programmatically find out which columns exist and then use that list to apply the function to the columns that are present.
I was able to figure out how to do this when the column names don't have spaces; see the code below:
library(tidyverse)
library(rlang)
col_names <- c("cyl", "mpg", "New_Var")
cc <- rlang::quos(col_names)
mtcars %>% mutate(New_Var = 1) %>% select(!!!cc)
But when the column names have spaces, this method does not work. Below is the code I used:
col_names <- c("cyl", "mpg", "`New Var`")
cc <- rlang::quos(col_names)
mtcars %>% mutate(`New Var` = 1) %>% select(!!!cc)
Is there a way to select columns that have spaces in their names without changing the names?
You don't have to do anything differently for names with spaces; the key is not to include the backticks in the string itself. For example:
library(dplyr)
library(rlang)
col_names <- c("cyl", "mpg", "New Var")
cc <- quos(col_names)
mtcars %>% mutate(`New Var` = 1) %>% select(!!!cc)
Also note that select() accepts character names directly, so this works too:
mtcars %>% mutate(`New Var` = 1) %>% select(col_names)
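One modernization worth noting: recent tidyselect versions warn when a bare external character vector is passed to select(), and recommend wrapping it in all_of() to make the intent explicit:
mtcars %>% mutate(`New Var` = 1) %>% select(all_of(col_names))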

R dplyr: select all columns after transforming selected columns

I have a tibble and want to perform some mutation on selected columns only. In the case below, every column whose name contains the word 'date' should be transformed by a function (as.Date()).
After I have performed some transformations on the selected columns, I want to get back all the columns from my tibble df.
Is there a way to do so?
df %>% select(contains('date')) %>% mutate_all(as.Date) %>% select(all)
Thanks
We can use mutate_at() instead of select() followed by mutate_all(). This selects only the columns of interest and modifies them while keeping the other columns as they are:
library(dplyr)
df %>%
  mutate_at(vars(contains('date')), as.Date)
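Since dplyr 1.0.0, mutate_at() has been superseded by across(); the equivalent modern spelling is:
library(dplyr)
df %>%
  mutate(across(contains('date'), as.Date))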

Filtering dataframe rows on a regex match in any column?

I've read in a table df that contains both numbers and strings.
I have keywords stored in a vector arr_words. For every row in the table, if the row contains any word from the vector, ignoring case, I want to keep that row.
For instance, if one of the cells has "i like magIcalot" and one of my keywords is "magic", I want to keep all the attributes from that row.
I've been trying this, but I'm pretty sure it's wrong, since it's returning zero rows:
df %>%
  rowwise() %>%
  filter(any(names(df) %in% arr_words))
If you want to search in one specific field, say field1, you can collapse the keywords into a single pattern and use grepl() (with ignore.case = TRUE to match regardless of case):
library(dplyr)
df %>%
  filter(grepl(paste(arr_words, collapse = "|"), field1, ignore.case = TRUE))
If you want to search across all fields, then:
library(stringr)
library(dplyr)
df %>%
  filter_all(any_vars(str_detect(., regex(paste(arr_words, collapse = "|"), ignore_case = TRUE))))
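filter_all() and any_vars() are likewise superseded in current dplyr. A modern equivalent uses if_any(), here restricted to character columns so that str_detect() only ever sees strings (a sketch assuming arr_words is a character vector of keywords):
library(dplyr)
library(stringr)
pattern <- regex(paste(arr_words, collapse = "|"), ignore_case = TRUE)
df %>%
  filter(if_any(where(is.character), ~ str_detect(.x, pattern)))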
