How do I set list names? The code is below.
Currently, split_data includes two sub-lists, [[1]] and [[2]]. How do I set names for them separately?
I want to set the name 'A' for [[1]] and 'B' for [[2]], so that I can retrieve data using split_data['A'].
Can anyone help with this? Thanks.
For instance, with ma <- list(a=c('a1','a2'),b=c('b1','b2')) I can use ma["a"] to get a sub-list.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
Others have shown you in the comments how to get what you want using split() instead of group_split(). That seems like the easiest solution.
However, if you're stuck with the existing code, here's an alternative that keeps your current code, and adds the names.
library(tidyverse)
test_data <- data.frame(category=c('A','B','A','B','A','B','A','B'),
sales=c(1,2,4,5,8,1,4,6))
split_data <- test_data %>% group_split(category)
names(split_data) <- test_data %>% group_by(category) %>% group_keys() %>% apply(1, paste, collapse = ".")
The idea is to use group_by to split in the same way group_split does, then extract the keys as a tibble. This will have one row per group, but will have the different variables in separate columns, so I put them together by pasting the columns with a dot as separator. The last expression in the pipe is equivalent to apply(keys, 1, f)
where f is function(row) paste(row, collapse = "."). It applies f to each row of the tibble, producing a single name.
This should work even if the split happens on multiple variables, and produces names similar to those produced by split().
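For comparison, the split() route mentioned at the start of this answer might look like the sketch below. It reuses test_data from the question and relies only on base R, where split() names each element after its group value.

```r
# Same data as in the question
test_data <- data.frame(category = c('A','B','A','B','A','B','A','B'),
                        sales = c(1,2,4,5,8,1,4,6))

# split() returns a named list, one data frame per category
split_data <- split(test_data, test_data$category)

names(split_data)    # the group values become the element names
split_data[["A"]]    # the data frame for category 'A'
```

Note that split_data["A"] returns a one-element list, while split_data[["A"]] returns the data frame itself.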
I have a data frame with bacteria families, along with all their OTUs (phylum, order, family...).
The data frame is large, and I would like the name of each column to be only the last part of each string, the one that starts with "f___".
For example
I tried some methods in R (like dplyr::filter or filter(str_detect)) and also tried separating columns in Excel, but could not get what I wanted. I can't do it manually because there are too many columns.
df being your dataframe, you could use rename_with from package dplyr:
df %>%
rename_with(
## your renaming function (see ?gsub for help on
## replacing with search patterns (regular expressions):
~ gsub('.*;f___(.*)$', '\\1', .x),
## column selection (see ?dplyr::select for handy shortcuts)
.cols = everything()
)
The .x in the replacement formula ~ etc. represents the variable argument to the replacement function, in this case the 'old' column name. You'll encounter this 'dot-something' pattern frequently in tidyverse packages.
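Here is a small worked sketch of that renaming. The two column names are made up for illustration (the question's real names are not shown), and assume the family part is preceded by ";f___".

```r
library(dplyr)

# Hypothetical column names, invented for this example
df <- tibble(
  `k__Bacteria;p__Firmicutes;f___Lachnospiraceae` = c(1, 2),
  `k__Bacteria;p__Bacteroidetes;f___Bacteroidaceae` = c(3, 4)
)

# Keep only what follows ';f___' in each column name
renamed <- df %>%
  rename_with(~ gsub('.*;f___(.*)$', '\\1', .x), .cols = everything())

names(renamed)
```

Column names that do not match the pattern are left unchanged by gsub(), so it is worth checking names(renamed) afterwards.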
microbiota <- read_csv("Tablas/nivel5-familia_clean.csv")
colnames(microbiota) <- gsub(colnames(microbiota),pattern = '.*f__', replacement = "")
I solved it like this.
As a follow-up to this question, I'm using dplyr's group_split() to make dataframes / tibbles based on the levels of a column. Continuing from that question, I want to split on two columns instead of one. When I try to split and name the resulting datasets, it attributes the wrong names to some of them.
Here's a simple example:
library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
mutate(type_factor = factor(type, levels = unique(type)))
#Does group split to make 3 datasets
the_test <- example %>%
group_split(even_or_odd, prime_or_not) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?
In my actual dataset, I have 50+ combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd not prime combination here), so relying on index isn't an option.
Instead of splitting by the two columns, use the factor column that was created, which ensures that it splits by the order of the levels created in the type_factor. In addition, using the unique on type_factor can have some issues if the order of the values in 'type_factor' is different i.e. unique gets the first non-duplicated value based on its occurrence. Instead, levels is better. In fact, it may be more appropriate to droplevels as well in case of unused levels.
the_test <- example %>%
group_split(type_factor) %>%
setNames(levels(example$type_factor))
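The droplevels point can be sketched as follows. The evens subset is a hypothetical addition to the question's example: after filtering, type_factor still carries the unused level "odd_prime", so levels() alone would return three names for two groups.

```r
library(dplyr)

# Rebuild the example data from the question
example <- tibble(number = 1:6,
                  even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
                  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
  mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
  mutate(type_factor = factor(type, levels = unique(type)))

# Hypothetical subset: "odd_prime" is now an unused level
evens <- example %>% filter(even_or_odd == "even")

# droplevels() removes the unused level so the names match the groups
the_test <- evens %>%
  group_split(type_factor) %>%
  setNames(levels(droplevels(evens$type_factor)))

names(the_test)
```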
group_split returns an unnamed list. To avoid the pain of renaming incorrectly, use split from base R, which does return a named list. The groups can then come back in any order, as long as the key/value pairs are correct.
# 1 - return in a different order based on alphabetic order
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)
# 2 - return order based on the levels of the factor column
split(example, example$type_factor)
# 3 - With dplyr pipe
example %>%
split(.$type_factor)
# 4 - or using magrittr exposition operator
library(magrittr)
example %$%
split(x = ., f = type_factor)
Oh, of course the moment I post it, I realize that an easy solution existed:
Just change the group split to the new variable and it works!
library(dplyr)
#Does group split to make 3 datasets
the_test <- example %>%
group_split(type_factor) %>%
setNames(unique(example$type_factor))
#The data sets are now all labelled correctly
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column
as_tibble(iris) %>% group_by(Petal.Length) %>% summarise(n=sum(n())) %>% filter(n>1)
I was wondering if I could write a line that loops this over all the columns and returns 20 different tibbles (or as many as I need in the future), in the same way the pipe above returns one tibble. I have tried writing my own loops but have had no success; I am quite new to this.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
col_names,
function(col) {
iris %>%
group_by_at(col) %>%
summarise(n = n()) %>%
filter(n > 1)
}
)
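group_by_at() is superseded in current dplyr. A sketch of the same loop, assuming dplyr 1.0 or later, uses across(all_of()) instead and names each result after its column. The name dup_counts is only for illustration.

```r
library(dplyr)

# setNames(nm = ...) makes lapply() return a list named by column
dup_counts <- lapply(
  setNames(nm = colnames(iris)),
  function(col) {
    iris %>%
      group_by(across(all_of(col))) %>%  # group by the column named in `col`
      summarise(n = n()) %>%             # count rows per distinct value
      filter(n > 1)                      # keep values that appear more than once
  }
)

dup_counts$Species
```

Because the list is named, each tibble can be retrieved by column name, for example dup_counts[["Petal.Length"]].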
In base R 4.1+ we have this one-liner. For each column it applies table and then keeps only those elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is ok to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to keep only duplicated items and then add 1 giving slightly fewer keystrokes. Again we can omit stack if returning a list of table objects is ok.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))
I have a dataframe which I would like to query. Note that the columns of that dataframe could change, and the column names have spaces. I have a function that I want to apply to the dataframe columns. I figured I could programmatically find out which columns exist and then use that list to apply the function to those columns.
I was able to figure out how to do that when the column names don't have spaces: See the code below
library(tidyverse)
library(rlang)
col_names <- c("cyl","mpg","New_Var")
cc <- rlang::quos(col_names)
mtcars%>%mutate(New_Var=1)%>%select(!!!cc)
But when the column names have spaces, this method does not work. Below is the code I used:
col_names <- c("cyl","mpg","`New Var`")
cc <- rlang::quos(col_names)
mtcars%>%mutate(`New Var`=1)%>%select(!!!cc)
Is there a way to select columns that have spaces in their names without changing the names?
You don't have to do anything differently for names with spaces. For example,
library(dplyr)
library(rlang)
col_names <- c("cyl","mpg","New Var")
cc <- quos(col_names)
mtcars %>% mutate(`New Var`=1) %>% select(!!!cc)
Also note that select accepts string names, so this works too:
mtcars%>% mutate(`New Var`=1) %>% select(col_names)
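Since the question mentions that the set of columns can change, a hedged sketch with the tidyselect helpers all_of() and any_of() may be useful. all_of() errors if a name is missing, while any_of() silently skips it. The name "Not There" is made up to stand in for a column that does not exist.

```r
library(dplyr)

col_names <- c("cyl", "mpg", "New Var")
df <- mtcars %>% mutate(`New Var` = 1)

# Strict: every name in col_names must exist in df
strict <- df %>% select(all_of(col_names))

# Lenient: names that are absent ("Not There" is hypothetical) are ignored
wanted <- c(col_names, "Not There")
lenient <- df %>% select(any_of(wanted))

names(lenient)
```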
I need a dataframe containing the names of some files matching a pattern, mapped to each line in those files. My problem is that I am unable to generate multiple rows for each input row: the dataframe should grow in both columns and rows, expanded per row. What I need is basically a left outer join, but I am struggling with the syntax.
library(dplyr)
app.lsts <- data.frame(
file=list.files(path='.', pattern='app.lst', recursive=TRUE)
) %>%
mutate(command=paste0('cat ', file)) %>%
mutate(packages=system(command, intern=TRUE))
The last mutate does not work because packages is a list of lines. How do I "unwrap" these?
First, some working (but not very good) code:
require(tidyverse)
# list the files first so the same vector is used for reading and for naming
fnames <- list.files(path='.', pattern='*.foo', recursive=TRUE)
out_df <-
fnames %>%
map(~readLines(.x)) %>%
setNames(fnames) %>%
t %>%
as.data.frame %>%
gather(fname, lines) %>%
unnest(lines)
out_df
This is a tidyverse-style command to generate the data that I think you want. Since I don't have your input files, I made up these sample files:
contents of f1.foo
line_1_f1
line_2_f1
contents of f2.foo
line_1_f2
line_2_f2
line_3_f2
Changes relative to your approach:
1. Avoid the use of the built-in function file() as a column name. I used fname instead.
2. Don't use system to read the files; there are built-in R functions to do that. Use of system() needlessly makes porting your code to other operating systems far less likely to succeed.
3. Build the data frame after all the data is read into R, not before. Because of the way non-standard evaluation with dplyr works, it's hard to use readLines(...) inside of a mutate() where the file connection to be read varies.
4. Use purrr::map() to generate a list of lists of file content lines from a list of filenames. This is a tidyverse way of writing a for-loop.
5. Set the names of the list elements with setNames().
6. Munge this list into a data.frame using t() and as.data.frame().
7. Tidy the data with gather() to collapse the data frame that has one column per file into a data frame with one file per row.
8. Expand the list using unnest().
I don't think this approach is very pretty, but it works. Another approach that avoids the ugly steps 5 and 6 is a for loop.
fnames <- list.files(path='.', pattern='*.foo', recursive=TRUE)
out_df <- data.frame(fname = c(), lines=c())
for(fname in fnames){
fcontents <- readLines(fname) # pass the path directly so no connection is left open
this_df <- data.frame(fname = fname, lines = fcontents)
out_df <- bind_rows(out_df, this_df)
}
The output in either case is
fname lines
1 f1.foo line_1_f1
2 f1.foo line_2_f1
3 f2.foo line_1_f2
4 f2.foo line_2_f2
5 f2.foo line_3_f2
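The reshaping steps could also be sketched with tibble::enframe() and tidyr::unnest(), assuming reasonably recent tidyverse packages. The scratch directory demo_dir and the recreated sample files are only there to make the example self-contained; they are not part of the original setup.

```r
library(tidyverse)

# Recreate the two sample files in a scratch directory (illustration only)
demo_dir <- file.path(tempdir(), "foo_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("line_1_f1", "line_2_f1"), file.path(demo_dir, "f1.foo"))
writeLines(c("line_1_f2", "line_2_f2", "line_3_f2"), file.path(demo_dir, "f2.foo"))

fnames <- list.files(path = demo_dir, pattern = "\\.foo$", recursive = TRUE)

out_df <- file.path(demo_dir, fnames) %>%
  set_names(fnames) %>%                           # name each path by its file name
  map(readLines) %>%                              # named list, one character vector per file
  enframe(name = "fname", value = "lines") %>%    # one row per file, lines as a list column
  unnest(lines)                                   # one row per line

out_df
```

This yields the same fname/lines layout as the output shown above, without the t() and as.data.frame() steps.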