I'm trying to create a new column based on another, using case_when to give different outputs based on the value of each row.
I start with df <- data.frame(a=c("abc", "123", "abc", "123"))
And want to generate a new column b like so
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
I've tried df %>% mutate(b = case_when(startsWith(a, "a") ~ "letter", startsWith(a, "1") ~ "number")) but it only gives an error. Can someone show me how to get different values for column b based on the first letter of the row in column a?
According to ?startsWith
x: vector of character string whose “starts” are considered.
So, startsWith expects the class to be character and here it is factor class. Converting it to character class would solve the issue
library(dplyr)
df %>%
mutate(b = case_when(startsWith(as.character(a), "a") ~ "letter",
TRUE ~ "number"))
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
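Before R 4.0.0, data.frame defaulted to stringsAsFactors = TRUE, so the 'a' column was created as a factor. From R 4.0.0 onward the default is FALSE. If we specify stringsAsFactors = FALSE explicitly, the 'a' column will be character class in any version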
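To see the difference described above, here is a minimal base R sketch. The names df_chr, df_fct, fct_result and chr_result are only illustrative; the data is the same four-row example from the question.

```r
# Build the same data both ways to compare the column classes.
df_chr <- data.frame(a = c("abc", "123", "abc", "123"), stringsAsFactors = FALSE)
df_fct <- data.frame(a = c("abc", "123", "abc", "123"), stringsAsFactors = TRUE)

class(df_chr$a)  # "character"
class(df_fct$a)  # "factor"

# startsWith() rejects a factor, so capture the error instead of stopping.
fct_result <- tryCatch(startsWith(df_fct$a, "a"), error = function(e) "error")

# Converting to character first makes the test work on either column.
chr_result <- startsWith(as.character(df_fct$a), "a")
chr_result
```

Wrapping the column in as.character() is harmless when it is already character, so it is a reasonable defensive choice if you do not control how the data frame was built.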
Another option is str_detect to create a logical expression by checking if the character from the start (^) of the string is a digit ([0-9])
library(stringr)
library(dplyr)
df %>%
mutate(b = c("letter", "number")[1+str_detect(a, "^[0-9]")])
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
You can just use if_else() since there are only two cases here. A regex seems more appropriate, given the test you're trying to run; the key is that ^ anchors the match to the start of the string, and [:alpha:] matches alphabetical letters in either case.
library(tidyverse)
df <- data.frame(a=c("abc", "123", "abc", "123"))
df %>% mutate(
b = a %>% str_detect("^[:alpha:]") %>% if_else("letter", "number")
)
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
Created on 2019-09-29 by the reprex package (v0.3.0)
As @akrun pointed out, there is an issue here with factors vs. characters. Are you sure this is an appropriate example for your use case, i.e. is your real data stored as factors? Luckily, though, str_detect() works just as well either way.
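If you would rather not load stringr, the same test can be sketched in base R. Note one difference: the answer above passes the bare [:alpha:] to stringr, whereas base R regex functions need the class wrapped in a bracket expression, [[:alpha:]]. The variable names below are only illustrative.

```r
# A factor column, to show that grepl() coerces it to character by itself.
a <- factor(c("abc", "123", "abc", "123"))

# TRUE where the first character is a letter.
is_letter <- grepl("^[[:alpha:]]", a)

# Two outcomes only, so a vectorised ifelse() is enough.
b <- ifelse(is_letter, "letter", "number")
b
```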
Related
I am trying to create new columns using the information in an existing column:
eg. the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (first 10 digits), a column 'timepoint' containing 1 (from -1) and a column 'paired_end' containing 2 (from R2)
What would the correct code for this be?
tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To tailor the solution to your needs, you should provide more examples, so that rare cases and exceptions can be handled.
Several regexes could work for you. This one, for example, extracts the first three numbers it finds, separated by non-numeric characters:
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
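For comparison, here is a rough base R sketch of what the capture groups are doing, using regexec and regmatches instead of tidyr. The second file name is made up for illustration, and the code assumes every name matches the pattern.

```r
# Two file names; the second one is a hypothetical extra example.
names_vec <- c("0112200015-1_R2_001.fastq.gz",
               "0112200016-3_R1_001.fastq.gz")

# Same pattern as the first extract() call: three capture groups.
pattern <- "^(\\d{10})-(\\d)_R(\\d)"

# regexec() locates the full match and each group; regmatches() pulls them out.
matches <- regmatches(names_vec, regexec(pattern, names_vec, perl = TRUE))

# Drop the full match (first element) and stack the groups into a matrix.
parts <- do.call(rbind, lapply(matches, function(m) m[-1]))

result <- data.frame(sample_id  = parts[, 1],
                     timepoint  = as.integer(parts[, 2]),
                     paired_end = as.integer(parts[, 3]),
                     stringsAsFactors = FALSE)
result
```

sample_id is kept as character on purpose, since converting it to a number would drop the leading zero.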
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should be using the colnames output
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
timepoint = str_extract(vector, "(?<=-)."),
paired_end = str_extract(vector, "(?<=R)."))
All the str_* functions are from the stringr package.
This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <-
tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
# Extract the 10 digit sample id
mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
# Extract the 1 digit timepoint which comes after "-" and before the first "_"
timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
# Extract the 1 digit paired_end which comes after "_R"
paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
name sample_id timepoint paired_end
<chr> <chr> <chr> <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1 2
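As a quick check of the "\\d+" adjustment mentioned above, here is a base R sketch with a hypothetical file name whose timepoint has two digits. It uses regexpr with perl = TRUE, since look-behind and look-ahead need the PCRE engine in base R.

```r
# Hypothetical name with a two-digit timepoint (12).
x <- "0112200015-12_R2_001.fastq.gz"

# One or more digits preceded by "-" and followed by "_".
timepoint <- regmatches(x, regexpr("(?<=-)\\d+(?=_)", x, perl = TRUE))

# One or more digits preceded by "_R".
paired_end <- regmatches(x, regexpr("(?<=_R)\\d+", x, perl = TRUE))

timepoint
paired_end
```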
An example dataframe:
example_df = data.frame(Gene.names = c("A", "B"),
Score = c("3.69,2.97,2.57,3.09,2.94",
"3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83"),
ResidueAA = c("S", "Y"),
ResidueNo = c(3, 3),
Sequence = c("MSSYT", "MSSYTRAP") )
I want to check if the character at ResidueAA column at the position at ResidueNo column matches with the corresponding position in the ‘Sequence’ column. The output should be another column, say, ‘Check’ with a Yes or No.
This is working code:
example_df$Check=sapply(1:nrow(example_df),FUN=function(i){d=example_df[i,]; substr(d$Sequence,d$ResidueNo,d$ResidueNo)==d$ResidueAA})
Is there an easier/elegant way to do this? Ideally, I want something that works within a dplyr pipe.
Also, related to this, how can I extract the corresponding value from the 'Score' column into a new column, say, 'Score_1'?
Thanks
We can use substr directly
library(dplyr)
example_df %>%
mutate(Check = substr(Sequence, ResidueNo, ResidueNo) == ResidueAA)
-output
# Gene.names Score ResidueAA ResidueNo Sequence Check
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT TRUE
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP FALSE
To create a new column with matching 'Score', use match to get the corresponding index instead of == (which does an elementwise comparison) and use the index for extracting the 'Score' element
example_df %>%
mutate(Score2 = Score[match(ResidueAA,
substr(Sequence, ResidueNo, ResidueNo))])
-output
#Gene.names Score ResidueAA ResidueNo Sequence
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP
# Score2
#1 3.69,2.97,2.57,3.09,2.94
#2 <NA>
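To make the difference between == and match concrete, here is a small base R sketch on the two example rows. The vector names are only illustrative. One thing to keep in mind: match looks each value up in the whole second vector and returns the position of the first hit, so it is not a row-by-row comparison.

```r
residue  <- c("S", "Y")
# Third character of each sequence: "S" and "S".
observed <- substr(c("MSSYT", "MSSYTRAP"), 3, 3)

# Elementwise comparison: one logical per row.
elementwise <- residue == observed

# Lookup: position of the first match in 'observed', or NA when there is none.
positions <- match(residue, observed)

elementwise
positions
```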
Update
Based on the comments, we need to extract the element of 'Score' at position 'ResidueNo' whenever the substring of 'Sequence' at that position equals 'ResidueAA'. This can be done by splitting 'Score' into a list with strsplit, extracting the first list element ([[1]], after a rowwise operation), and then indexing it with 'ResidueNo' to get the score at that position
example_df %>%
rowwise %>%
mutate(Score2 = if(substr(Sequence, ResidueNo, ResidueNo) ==
ResidueAA) strsplit(Score, ",")[[1]][ResidueNo] else NA_character_) %>%
ungroup
-output
# A tibble: 2 x 6
# Gene.names Score ResidueAA ResidueNo Sequence Score2
# <chr> <chr> <chr> <dbl> <chr> <chr>
#1 A 3.69,2.97,2.57,3.09,2.94 S 3 MSSYT 2.57
#2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3 MSSYTRAP <NA>
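The splitting step is easier to follow on a single string. A minimal sketch, using the first row's 'Score' value and position 3:

```r
score <- "3.69,2.97,2.57,3.09,2.94"

# strsplit() always returns a list, here with one character vector inside.
pieces <- strsplit(score, ",")

# [[1]] takes that vector out of the list; [3] picks the third score.
third <- pieces[[1]][3]

# The pieces are still strings, so convert if a number is needed.
third_num <- as.numeric(third)
third_num
```

This is why the pipeline above needs rowwise: without it, strsplit(Score, ",")[[1]] would always refer to the first row's scores.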
Another option is to use separate_rows to split 'Score' into one row per value, then group by 'Gene.names', use summarise to get the corresponding 'Score2' element (similar to the previous solution), and join with the original dataset
library(tidyr)
example_df %>%
separate_rows(Score, sep= ",") %>%
group_by(Gene.names) %>%
summarise(Score2 = if(substr(first(Sequence), first(ResidueNo), first(ResidueNo)) ==
first(ResidueAA)) Score[first(ResidueNo)] else
NA_character_, .groups = 'drop') %>%
right_join(example_df)
To get an individual score, you would need to split the string and return the index corresponding to the position. You could vectorize this, e.g.:
getScore <- Vectorize(function(x, pos) unlist(strsplit(x, ",", TRUE), use.names = FALSE)[pos])
example_df %>% mutate(check=substr(Sequence, ResidueNo, ResidueNo) == ResidueAA,
MyScore=ifelse(check, as.numeric(getScore(Score, ResidueNo)), NA))
#> Gene.names Score ResidueAA ResidueNo
#> 1 A 3.69,2.97,2.57,3.09,2.94 S 3
#> 2 B 3.99,2.27,2.89,2.89,2.00,2.52,2.09,2.83 Y 3
#> Sequence check MyScore
#> 1 MSSYT TRUE 2.57
#> 2 MSSYTRAP FALSE NA
readr::type_convert guesses the class of each column in a data frame. I would like to apply type_convert to only some columns in a data frame (to preserve other columns as character). MWE:
# A data frame with multiple character columns containing numbers.
df <- data.frame(A = letters[1:10],
B = as.character(1:10),
C = as.character(1:10))
# This works
df %>% type_convert()
Parsed with column specification:
cols(
A = col_character(),
B = col_double(),
C = col_double()
)
A B C
1 a 1 1
2 b 2 2
...
However, I would like to only apply the function to column B (this is a stylised example; there may be multiple columns to try and convert). I tried using purrr::map_at as well as sapply, as follows:
# This does not work
map_at(df, "B", type_convert)
Error in .f(.x[[i]], ...) : is.data.frame(df) is not TRUE
# This does not work
sapply(df["B"], type_convert)
Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
Is there a way to apply type_convert selectively to only some columns of a data frame?
Edit: @ekoam provides an answer for type_convert. However, applying this answer to many columns would be tedious. It might be better to use the base R function type.convert (from utils), which can be mapped:
purrr::map_at(df, "B", type.convert) %>%
bind_cols()
# A tibble: 10 x 3
A B C
<chr> <int> <chr>
1 a 1 1
2 b 2 2
Try this:
df %>% type_convert(cols(B = "?", C = "?", .default = "c"))
This guesses the types of B and C; any other character column stays as is. List only the columns you want converted, so drop C = "?" to convert just B. The tricky part is that if any column is not of a character type, then type_convert will also leave it as is. So if you really have to use type_convert, you may have to convert all columns to character first.
type_convert does not seem to support this directly. One trick I have used a few times is a combination of select and bind_cols, as shown below. Note that the converted column ends up first in the result.
df %>%
select(B) %>%
type_convert() %>%
bind_cols(df %>% select(-B))
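If column order matters, or you want to avoid readr altogether, a base R sketch along the same lines is to convert the chosen columns in place with type.convert. The vector to_convert is a name I made up for the columns to process; as.is = TRUE keeps non-numeric results as character rather than factor.

```r
df <- data.frame(A = letters[1:10],
                 B = as.character(1:10),
                 C = as.character(1:10),
                 stringsAsFactors = FALSE)

# Hypothetical selection: only these columns get their type guessed.
to_convert <- c("B")

# Replace just those columns; everything else is left untouched.
df[to_convert] <- lapply(df[to_convert], type.convert, as.is = TRUE)

sapply(df, class)
```

Adding more column names to to_convert extends this to several columns without repeating any code.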
Regular expression to find words and digits that are repeated back to back
Suppose I have a data frame
df<-data.frame(name=c("mike","mike","mike","bob","mike"),age=c(23,23,23,25,23))
How can I write a regular expression to check whether "mike" or any other word in the "name" column is repeated back to back (here mike is repeated 3 times), and whether a number in the "age" column is repeated back to back (here 23 is repeated 3 times)?
You can try this :
library(dplyr)
df %>%
mutate(across(everything(), data.table::rleid, .names = '{col}_grp')) %>%
group_by(across(ends_with('grp'))) %>%
filter(n() >= 3) %>%
ungroup %>%
select(names(df))
# name age
# <chr> <dbl>
#1 mike 23
#2 mike 23
#3 mike 23
For every column in df we use rleid to give a unique number to consecutive values and select those groups that have >= 3 rows in them.
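The run-numbering idea can be sketched in base R with rle, which may help to see what rleid is producing. The names runs, run_id and keep are only illustrative.

```r
name <- c("mike", "mike", "mike", "bob", "mike")

# rle() compresses the vector into runs of identical consecutive values.
runs <- rle(name)
runs$lengths  # 3 1 1
runs$values   # "mike" "bob" "mike"

# Expanding the run index back out gives the same ids that rleid would.
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only the positions that belong to a run of length 3 or more.
keep <- run_id %in% which(runs$lengths >= 3)
name[keep]
```

The last "mike" is dropped because it forms its own run of length one, which is the point of checking consecutive repeats rather than total counts.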
Base R one liner (no regex):
df[which(c(0, cumsum(abs(diff(as.integer(as.factor(df$name)))))) == 0),]
I am running into some problems doing text processing using dplyr and stringr functions (specifically str_split()). I think I am misunderstanding something very fundamental about how to use dplyr correctly when dealing with elements that are vectors/lists.
Here's a tibble, df...
library(tidyverse)
df <- tribble(
~item, ~phrase,
"one", "romeo and juliet",
"two", "laurel and hardy",
"three", "apples and oranges and pears and peaches"
)
Now I create a new column, splitPhrase, by doing str_split() on one of the columns using "and" as the delimiter.
df <- df %>%
mutate(splitPhrase = str_split(phrase,"and"))
That seems to work, sort-of, in RStudio I see this...
In the console I see that my new column, splitPhrase, is actually a list... but it looks correct in the RStudio display, right?
df
#> # A tibble: 3 x 3
#> item phrase splitPhrase
#> <chr> <chr> <list>
#> 1 one romeo and juliet <chr [2]>
#> 2 two laurel and hardy <chr [2]>
#> 3 three apples and oranges and pears and peaches <chr [4]>
What I ultimately want to do is to extract the last item of each splitPhrase. In other words, I'd like to get to this...
The problem is I can't see how to just grab the last element in each splitPhrase. If it were just a vector, I could do something like this...
#> last( c("a","b","c") )
#[1] "c"
#>
But that doesn't work within the tibble, and neither do the other things that come to mind:
df <- df %>%
mutate(lastThing = last(splitPhrase))
# Error in mutate_impl(.data, dots) :
# Column `lastThing` must be length 3 (the number of rows) or one, not 4
df <- df %>% group_by(splitPhrase) %>%
mutate(lastThing = last(splitPhrase))
# Error in grouped_df_impl(data, unname(vars), drop) :
# Column `splitPhrase` can't be used as a grouping variable because it's a list
So, I think I am "not getting" how to work with vectors that are inside an element in table/tibble column. It seems to have something to do with the fact that in my example it's actually a list of vectors.
Is there a particular function that will help me out here, or a better way of getting to this?
Created on 2018-09-27 by the reprex package (v0.2.1)
The 'splitPhrase' column is a list, so we loop through the list to get the elements
library(tidyverse)
df %>%
mutate(splitPhrase = str_split(phrase,"\\s*and\\s*"),
Last = map_chr(splitPhrase, last)) %>%
select(item, Last)
But, it can be done in many ways. Using separate_rows, expand the column, then get last element grouped by 'item'
df %>%
separate_rows(phrase,sep = " and ") %>%
group_by(item) %>%
summarise(Last = last(phrase))
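The same "loop over the list" idea can be sketched in base R, which may make it clearer why last(splitPhrase) failed in the question: applied to the whole list column, it returns the last list element (the four-piece vector from row three), not the last piece of each row. The variable names here are only illustrative.

```r
phrase <- c("romeo and juliet",
            "laurel and hardy",
            "apples and oranges and pears and peaches")

# One character vector per phrase, so the result is a list of length 3.
split_phrase <- strsplit(phrase, "\\s*and\\s*")

# Visit each list element and take its final piece.
# vapply() guarantees exactly one string per element.
last_thing <- vapply(split_phrase, function(p) p[length(p)], character(1))
last_thing
```

Be aware that the pattern "\\s*and\\s*" would also split inside a word containing "and" (for example "sandy"); splitting on " and " with the surrounding spaces avoids that.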
Haven't tested for efficiency, but we can also use regex to extract the string segment after the last "and":
With sub:
library(dplyr)
df %>%
mutate(lastThing = sub("^.*and\\s", "", phrase)) %>%
select(-phrase)
With str_extract:
library(stringr)
df %>%
mutate(lastThing = str_extract(phrase, "(?<=and\\s)\\w+$")) %>%
select(-phrase)
With extract:
library(tidyr)
df %>%
extract(phrase, "lastThing", "^.*and\\s(\\w+)")
Output:
# A tibble: 3 x 2
item lastThing
<chr> <chr>
1 one juliet
2 two hardy
3 three peaches
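The sub version relies on .* being greedy: it consumes as much as it can, so the match runs up to the last "and " rather than the first. A small sketch comparing it with a lazy quantifier (.*? with perl = TRUE) shows the difference; the variable names are only illustrative.

```r
phrase <- "apples and oranges and pears and peaches"

# Greedy: .* extends to the final "and ", leaving only the last item.
greedy <- sub("^.*and\\s", "", phrase)

# Lazy: .*? stops at the first "and ", leaving everything after it.
lazy <- sub("^.*?and\\s", "", phrase, perl = TRUE)

greedy
lazy
```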