Merging many columns in R

I have an issue with merging many columns by the same ID. I know this is possible for two lists, but I need to combine all of the species columns into one, so that the first column is the combined species column, followed by w, w.1, w.2, w.3, w.4, and so on. The species columns all contain the same species, but not in the same order, so I can't just drop every other column; that would leave the w values associated with the wrong species. This is an extremely large dataset of 10,000 rows and 2,000 columns, so the process needs to be automated, and the w values must stay associated with their corresponding species. Dataset attached.
Thank you for any help

If your data is in a data frame called dt, you can use lapply() along with bind_rows() like this:
library(dplyr)
library(tidyr)
bind_rows(
  lapply(seq(1, ncol(dt), 2), function(x) {
    dt[, c(x, x + 1)] %>%
      rename_with(~ c("Species", "value")) %>%
      mutate(w = colnames(dt)[x + 1])
  })
) %>%
  pivot_wider(id_cols = Species, names_from = w)
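For example, with a small made-up dataset laid out the same way (alternating species and w columns), the pipeline stacks each pair and then spreads the w columns back out by species. The data below is purely illustrative:
# Hypothetical toy data with the same interleaved layout
dt <- data.frame(
  Species   = c("a", "b", "c"),
  w         = c(1, 2, 3),
  Species.1 = c("b", "c", "a"),
  w.1       = c(4, 5, 6)
)
# Running the pipeline above then gives one row per species,
# with its values gathered from every w column:
#   Species     w   w.1
#   a           1     6
#   b           2     4
#   c           3     5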

Related

Find combinations of rows grouped by a column, picking n rows from each group in data.frame in R

Essentially this problem, but picking 2+ rows from each group.
In the scenario here, each group has 2+ rows, and I need to generate all combinations where I select two of them. Columns contain group, unique IDs, and numerical values.
Example data:
library(dplyr)

dat <- data.frame(Group = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")) %>%
  group_by(Group) %>%
  mutate(ID = paste(Group, seq_along(Group), sep = "_")) %>%
  ungroup() %>%
  mutate(value = sample(1:length(ID), replace = TRUE))
How do I find all possible data.frame combinations where n = 2 rows of each group are chosen?
Desired output could be a single list/data.frame of those combinations with unique IDs as in the answer to the linked question above, or preferably, a list of the unique data.frames themselves - each containing the 2 rows per group and their Group, ID, and value columns.
Thank you for your help!
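A rough, untested sketch of one approach (using the dat object built above): enumerate the 2-row combinations within each group with combn(), then take the Cartesian product of those per-group choices and build one data frame per combination:
# Row indices of dat, split by Group
idx_by_group <- split(seq_len(nrow(dat)), dat$Group)

# All 2-row combinations of indices within each group
choices <- lapply(idx_by_group, function(i) combn(i, 2, simplify = FALSE))

# Cartesian product over groups: each row of grid picks one combination per group
grid <- expand.grid(lapply(choices, seq_along))

# One data frame per overall combination, keeping Group, ID and value
result <- lapply(seq_len(nrow(grid)), function(i) {
  picked <- Map(function(ch, j) ch[[j]], choices, grid[i, ])
  dat[sort(unlist(picked)), ]
})
With the example data this yields 3 * 1 * 3 * 6 = 54 data frames of 8 rows each (2 rows per group).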

How to divide a data frame into groups of a predefined size while keeping each category of a variable represented in each group

I am having trouble doing cross-validation for a hierarchical dataset. There is a level 2 factor ("ID") that needs to be equally represented in each subset. For this dataset, there are 157 rows and 28 IDs. I want to divide my data up into five subsets, each containing 31 rows, where each of the 28 IDs is represented (a stand can be repeated within a subset).
I have gotten as far as:
library(dplyr)
df %>%
  group_by(ID) %>%
and have no clue where to take it from there. Any help is appreciated!
Here's what I'd do: assign one row from each ID randomly to each of the 5 subsets, and then distribute the leftovers fully randomly. Without sample data this is untested, but it should at least get you on the right track.
df %>%
  group_by(ID) %>%
  mutate(
    random_rank = rank(runif(n())),
    strata = ifelse(random_rank <= 5, random_rank, sample(1:5, size = n(), replace = TRUE))
  ) %>%
  select(-random_rank) %>%
  ungroup()
This should create a strata column as described above. If you want to split the data into a list of data frames, one per stratum, add ... %>% group_by(strata) %>% group_split() to the pipeline.
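A quick sanity check on the result (df_strata below is a hypothetical name for the output of the pipeline above) is to count rows and distinct IDs per subset:
# Check how many rows and how many distinct IDs fall in each of the 5 subsets
df_strata %>%
  group_by(strata) %>%
  summarise(n_rows = n(), n_ids = n_distinct(ID))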

manipulating all the tables in an R group_split list

I have a large table, which starts like this
Essentially, it's a table with multiple samples ("samp_id") showing the number ("least") of "taxon" present in each.
I want to transpose/pivot the table to look like this;
i.e. with "taxon" as the top row and each of the 90 samples in "data" following as a row of its "least" values, renamed with its "samp_id". That way you can see what each sample is, as well as the value of "least" for each sample in the different "taxon" (which may not be identical across the 90 samples).
Previously, I separated the data into multiple tibbles based on "samp_id", selected "taxon" and "least", renamed "least" to the "samp_id", and then combined the individual tibbles on "taxon" with full_join, using something like the code below, before transposing the combined table:
ACLOD_11 = data %>%
  filter(samp_id == "ACLOD_11") %>%
  select(taxon, least) %>%
  rename("ACLOD_11" = least)

ACLOD_12 = data ... # as above, but different samp_id

data_final = list(ACLOD_11, ACLOD_12, ...) %>%
  reduce(full_join, by = "taxon")
As I have more data tables to follow after this one with 90 samples, I want to be able to do this without individually separating the data into hundreds of tibbles and manually typing each "samp_id" before joining.
I have currently split the data into 90 separate tibbles based on "samp_id" (there are 90 samples in "data")
data_split = data %>%
  group_split(samp_id)
but am unsure whether this is the best way to do this, or what I should do next.
We can use
library(dplyr)
library(purrr)

data %>%
  split(.$samp_id) %>%
  imap(~ .x %>%
         select(taxon, least) %>%
         rename(!!.y := least)) %>%
  reduce(full_join, by = 'taxon')
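If the goal is just that wide taxon-by-sample table, an equivalent route (a sketch, untested on this data) is tidyr::pivot_wider, which avoids the split/join entirely and produces the same joined table to transpose afterwards:
library(tidyr)

# One row per taxon, one column per samp_id, filled with the least values
data %>%
  select(taxon, samp_id, least) %>%
  pivot_wider(names_from = samp_id, values_from = least)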

R check for outliers in multiple variables

I need to check my data for outliers and I have 67 different variables, so I don't want to do it by hand. This is my code for checking a single variable by hand (I have three factors to group by - voiceID, gender and VP), but I don't know how to change it into a loop that iterates over the columns.
features %>%
  group_by(voiceID, gender, VP) %>%
  identify_outliers(meanF0)
The values are all numbers. The output should tell me which rows for what factors are outliers.
Thanks for help
The output of identify_outliers is a tibble with multiple columns, and it can take only a single variable at a time. The variable name can be either quoted or unquoted. In that case, we can group_split the data by the grouping variables, then loop over the columns of interest and apply identify_outliers:
library(dplyr)
library(purrr)
library(rstatix)

nm1 <- c("score", "score2")

demo.data %>%
  group_split(gender) %>%
  map(~ map(nm1, function(x) .x %>%
              identify_outliers(x)))
If we want to count the outliers,
features %>%
  group_by(voiceID, gender, VP) %>%
  summarise(across(everything(), ~ length(boxplot(., plot = FALSE)$out)))
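A small, untested variation on the loop above: wrapping nm1 in set_names() labels the inner list with the variable names, so each result is easy to match to its column:
demo.data %>%
  group_split(gender) %>%
  map(~ map(set_names(nm1), function(x) .x %>%
              identify_outliers(x)))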

extracting columns from a dataframe with at least three values above a cutoff

I am new to R programming and need help filtering my data. For example, my data set is mtcars. I want to extract the columns which have at least three values above 18. How do I do that? Thanks.
I have used the sort function, but that only works on one column at a time, not on the whole data frame.
You can get the names of the qualifying columns with the following code:
library(dplyr)
library(tidyr)
columns = mtcars %>%
  gather() %>%
  filter(value > 18) %>%
  count(key) %>%
  filter(n >= 3) %>%
  select(key)
And then filter the dataframe with:
mtcars[, c(t(columns))]
gather transforms the dataframe to one that has two columns:
key is the name of the column
value is the value taken by the observation for the column
The values above 18 are kept, and we count the number of such observations for each key (i.e. for each column name); only the keys with at least three are then selected.
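As a side note, a base R equivalent (a sketch of the same idea) counts the qualifying values per column directly:
# Logical matrix of values above 18, summed per column;
# keep the columns with at least three such values
mtcars[, colSums(mtcars > 18) >= 3]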
