R; How to select() columns that contains() strings where the string is any element of a list - r

I want to subset a dataframe whereby I select columns based on the fact that the colname contains a certain string or not. These strings that it must contain are stored in a separate list.
This is what I have now:
colstrings <- c('A', 'B', 'C')
for (i in colstrings){
df <- df %>% select(-contains(i))
}
However, it feels like this shouldn't be done with a for loop. Any suggestions on how to make this code shorter?

Here's an answer adapted from a previous SO post:
library(dplyr)
df <-
tibble(
ash = c(1, 2),
bet = c(2, 3),
can = c(3, 4)
)
df
substr_list <- c("sh", "an")
df %>%
select(matches(paste(substr_list, collapse="|")))
See more here: select columns based on multiple strings with dplyr contains()

Related

R change column names, retain part of colnames

I have a df with colnames thus:
Resampled.Band.1..raster.bsq...404.502014.Nanometers.
...
Resampled.Band.74..raster.bsq...950.851990.Nanometers.
I want them like this:
950.851990_nm
With:
orig_names <- names(df)
new_name <- gsub("Resampled.Band.", "", orig_names)
and
new_name <- gsub(".Nanometers.", "_nm", new_name)
names(all_roi_rfl) <- new_name
I achieve part of what I want: to change the first and last parts of the colnames:
1..raster.bsq...404.502014_nm
I could repeat this to clean the colnames up most of the way.
But how do I deal with the part of the colnames that varies itself, the band number?
Extract the values that you want using regex and replace the column names.
x <- c('Resampled.Band.1..raster.bsq...404.502014.Nanometers.',
'Resampled.Band.74..raster.bsq...950.851990.Nanometers.')
sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', x)
#[1] "404.502014_nm" "950.851990_nm"
This extracts number that occur between "raster.bsq" and "Nanometers" and appends "_nm" to extracted value.
In your case to replace column names it would be :
names(all_roi_rfl) <- sub('.*raster.bsq\\.+(\\d+\\.\\d+)\\.Nanometers\\.', '\\1_nm', names(all_roi_rfl))
A similar answer to Ronak's but using gsub instead.
First generate a dataframe...
df <-
data.frame(
Resampled.Band.1..raster.bsq...404.502014.Nanometers. = c(1, 2, 1, 2),
Resampled.Band.74..raster.bsq...950.851990.Nanometers. = c('a', 'b', 'c', NA))
using gsub identify the string before and after the piece you want to extract
colnames(df) <- gsub(".*raster.bsq...(.+).Nanometers.", "\\1_nm", colnames(df))

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,])
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify to ensure it matches the exact column names and your comparison condition.
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair

How to use the R pipe operator (%>%) in the following cases

1) I have a data frame named df, how can I include an if statement within the mutate function used within the pipe operator? The following does not work:
df %>%
mutate_if(myvar == "A", newColumn = oldColumn*3, newColumn = oldColumn)
The variable myvar is not included in the data frame and is a "flag" variable with values either "A" or "B". When "A", would like to create a new column named "newColumn" in the data frame that is three times the old column (named "oldColumn"), otherwise it is identical to the old column.
2) Would like to divide the column named "numbers" with the entry of numbers which has the minimum value in another column named "seconds", as follows:
df$newCol <- df$numbers / df[df$seconds== min(df$seconds),]$numbers
How can I do that with mutate command and "%>%", so that it looks more handy? Nothing that I tried works unfortunately.
Thanks for any answers,
J.
If myvar is just a variable floating around in the environmnet, you can use an if else statement within mutate (similar question here)
library(dplyr)
# Generate dataset
df <- tibble(oldColumn = rnorm(100))
# Mutate with if-else conditions
df <- df %>% mutate(newColumn = if(myvar == "A") oldColumn else if(myvar=="B") oldColumn * 3)
If myvar is included as a column in the dataframe then you could can use case_when.
# Generate dataset
df <- tibble(myvar = sample(c("A", "B"), 100, replace = TRUE),
oldColumn = rnorm(100))
# Create a new column which depends on the value of myvar
df <- df %>%
mutate(newColumn = case_when(myvar == "A" ~ oldColumn*3,
myvar == "B" ~ oldColumn))
As for question 2, you can use mutate with "." operater which calls the left hand side (i.e. "df") in the right hand side of the function. Then you can filter down to the row with the minimum value of seconds (top_n statement using -1 as argument), and pull out the value for the numbers variable
# Generate data
df <- tibble(numbers = sample(1:60),
seconds = sample(1:60))
# Do computation
df <- df %>% mutate(newCol = numbers / top_n(.,-1,seconds) %>% pull(numbers))

Assert that a combination of columns is unique (using `assertr`)

I am looking for a "tidy" and concise way to make sure that a combination of columns is unique in a tibble using assertr.
So far, this is the best I could come up with:
PasteRows <- function(df) {
apply(df, 1, paste, collapse = '')
}
tib <- tibble(a = c(1, 1, 3), b = c('a', 'b', 'b'))
tib %>%
assert_rows(PasteRows, is_uniq, a, b)
... but I first have to define PasteRows. Also, I am not sure if apply has a performance penalty, because it converts the tibble into a matrix.
How can I improve and shorten this?

R dplyr: mutate a column for specific row range

I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think OP is trying to just replace certain values in a data frame using indexing. As the comment above noted the example dataframe's column is comprised of a factor variable, which makes replacing the value behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor. Then using indexing to select the value you'd like to change and replace it:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = TRUE and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the row index as i and assign (:=) 'elisa' to the 'names'. As the OP mentioned about large dataset, using the := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elisa']
df
# names
#1: george
#2: elisa
#3: elisa
#4: tony

Resources