I previously asked the question "How to mutate a new column by modifying another column?"
Now I have another problem: I have to work with more 'untidy' IDs, like
df1 <- data.frame(id=c("A-1","A-10","A-100","b-1","b-10","b-100"),n=c(1,2,3,4,5,6))
From these IDs, I want to assign new 'tidy' IDs, like
df2 <- data.frame(id=c("A0001","A0010","A0100","B0001","B0010","B0100"),n=c(1,2,3,4,5,6))
(Now I need a capital 'B' instead of 'b'.)
I tried to use the str_pad function, but I couldn't get it to work.
We can separate the id into two columns at "-", convert the letters to uppercase, pad the numbers with zeros using sprintf, and combine the two columns again with unite.
library(dplyr)
library(tidyr)
df1 %>%
separate(id, c("id1", "id2"), sep = "-") %>%
mutate(id1 = toupper(id1),
# pad as an integer: zero padding with '%04s' is platform-dependent
id2 = sprintf('%04d', as.integer(id2))) %>%
unite(id, id1, id2, sep = "")
# id n
#1 A0001 1
#2 A0010 2
#3 A0100 3
#4 B0001 4
#5 B0010 5
#6 B0100 6
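The same four steps (split, uppercase, pad, combine) can also be sketched in base R without any packages. This is only an illustration of the idea, not a replacement for the pipeline above; the names parts, prefix, number and tidy_id are made up for the example, and the padding uses '%04d' on an integer, whose zero padding is well defined.

```r
df1 <- data.frame(id = c("A-1", "A-10", "A-100", "b-1", "b-10", "b-100"),
                  n = c(1, 2, 3, 4, 5, 6))

# split each id at the literal "-" into a letter part and a number part
parts <- strsplit(as.character(df1$id), "-", fixed = TRUE)
prefix <- toupper(vapply(parts, `[`, character(1), 1))
number <- as.integer(vapply(parts, `[`, character(1), 2))

# zero-pad the number to width 4 and glue the pieces back together
tidy_id <- paste0(prefix, sprintf("%04d", number))
tidy_id
```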
Based on the comment, if there are cases where the separator is missing and we want to change certain id1 values, we can use the following.
df1 %>%
extract(id, c("id1", "id2"), regex = "([[:alpha:]])-?(\\d+)") %>%
mutate(id1 = case_when(id1 == 'c' ~ 'B',
TRUE ~ id1),
id1 = toupper(id1), id2 = sprintf('%04d', as.integer(id2))) %>%
unite(id, id1, id2, sep = "")
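To see what the optional separator in the pattern does, here is a small base R sketch on a made-up input vector (ids is hypothetical and not from the question). It assumes every id is exactly one letter, an optional "-", and then digits.

```r
# hypothetical mixed input: with and without the "-" separator
ids <- c("A-1", "b10", "c-100")

pattern <- "^([[:alpha:]])-?([0-9]+)$"
letter <- sub(pattern, "\\1", ids)   # first capture group: the letter
digits <- sub(pattern, "\\2", ids)   # second capture group: the digits

# recode 'c' to 'B', mirroring the case_when step
letter[letter == "c"] <- "B"

new_ids <- paste0(toupper(letter), sprintf("%04d", as.integer(digits)))
new_ids
```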
The str_pad function is handy for this purpose, as you said, but you have to extract the digits first and then paste everything back together.
library(stringr)
paste0(toupper(str_extract(df1$id, "[A-Za-z]")),
str_pad(str_extract(df1$id, "\\d+"), width=4, pad="0"))
[1] "A0001" "A0010" "A0100" "B0001" "B0010" "B0100"
Base R solution
df1$id <- sub("^(.)0+?(.{4})$","\\1\\2", sub("-", "0000", toupper(df1$id)))
tidyverse solution
library(tidyverse)
df1$id <- str_to_upper(df1$id) %>%
str_replace("-","0000") %>%
str_replace("^(.)0+?(.{4})$","\\1\\2")
Output
df1
# id n
# 1 A0001 1
# 2 A0010 2
# 3 A0100 3
# 4 B0001 4
# 5 B0010 5
# 6 B0100 6
Data
df1 <- data.frame(id=c("A-1","A-10","A-100","b-1","b-10","b-100"),n=c(1,2,3,4,5,6))
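It may help to look at the intermediate strings of the two-step replacement. The sketch below runs the same two sub() calls on three of the ids; step1 and step2 are names introduced only for this illustration.

```r
ids <- c("A-1", "A-10", "A-100")

# step 1: uppercase and replace "-" by four zeros, so there are always
# at least four characters after the letter
step1 <- sub("-", "0000", toupper(ids))
step1
# "A00001" "A000010" "A0000100"

# step 2: keep the letter and the last four characters, dropping the
# surplus zeros in between
step2 <- sub("^(.)0+?(.{4})$", "\\1\\2", step1)
step2
```

Because the pattern is anchored at both ends and the last group is exactly four characters wide, the number of zeros that gets dropped is fully determined by the length of the string.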
Related
Let's say I have a dataframe of scores
library(dplyr)
id <- c(1 , 2)
name <- c('John', 'Ninaa')
score1 <- c(8, 6)
score2 <- c(NA, 7)
df <- data.frame(id, name, score1, score2)
Some mistakes have been made so I want to correct them. My corrections are in a different dataframe.
id <- c(2,1)
column <- c('name', 'score2')
new_value <- c('Nina', 9)
corrections <- data.frame(id, column, new_value)
I want to search the dataframe for the correct id and column and change the value.
I have tried something with match, but I don't know how to mutate the correct column.
df %>% mutate(corrections$column = replace(corrections$column, match(corrections$id, id), corrections$new_value))
We could join by 'id', then mutate across the columns named in 'column' and replace the element at the matching position: for each of those columns, cur_column() is matched against 'column' to find both the row to change and the corresponding new_value.
library(dplyr)
df %>%
left_join(corrections) %>%
mutate(across(all_of(column), ~ replace(.x, match(cur_column(),
column), new_value[match(cur_column(), column)]))) %>%
select(names(df))
-output
id name score1 score2
1 1 John 8 9
2 2 Nina 6 7
This is an implementation of the same idea with dplyr::rows_update, although it involves functions from several packages. In practice I would prefer a somewhat more parsimonious approach.
library(tidyverse)
corrections %>%
group_by(id) %>%
group_map(
~ pivot_wider(.x, names_from = column, values_from = new_value) %>% type_convert,
.keep = TRUE) %>%
reduce(rows_update, by = 'id', .init = df)
# id name score1 score2
# 1 1 John 8 9
# 2 2 Nina 6 7
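For comparison, the lookup-and-replace logic can be written as a plain base R loop that applies one correction at a time. This is a minimal sketch, assuming each correction targets exactly one existing id and one existing column; new_value is stored as character (as in the question), so it is converted back when the target column is numeric.

```r
df <- data.frame(id = c(1, 2), name = c("John", "Ninaa"),
                 score1 = c(8, 6), score2 = c(NA, 7),
                 stringsAsFactors = FALSE)
corrections <- data.frame(id = c(2, 1), column = c("name", "score2"),
                          new_value = c("Nina", "9"),
                          stringsAsFactors = FALSE)

for (i in seq_len(nrow(corrections))) {
  row <- match(corrections$id[i], df$id)   # row holding the id to fix
  col <- corrections$column[i]             # name of the column to fix
  value <- corrections$new_value[i]
  # convert the character correction if the target column is numeric
  if (is.numeric(df[[col]])) value <- as.numeric(value)
  df[row, col] <- value
}
df
```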
ID score
a 1
a 2
b 2
b 4
c 4
c 5
I want to reorder the rows so that ID cycles through a, b, c, like this:
ID score
a 1
b 2
c 4
a 2
b 4
c 5
What I tried
> data <- read_csv(data)
> data <- factor(data$id, levels = c('a', 'b', 'c'))
This works for tables, so I tried it, but it didn't work here. Does anybody know if there is a way?
Assigning the 'id' column to data with data <- replaces the whole dataset with the 'id' values; instead, the column should only be used for ordering. In base R, this can be done with
data1 <- data[order(duplicated(data$ID)),]
row.names(data1) <- NULL
Or with dplyr (rowid comes from data.table)
library(dplyr)
library(data.table)
data %>%
arrange(rowid(ID))
library(dplyr)
d %>%
group_by(ID) %>%
mutate(r = row_number()) %>%
ungroup() %>%
arrange(r, ID, score) %>%
select(-r)
OR in base R
with(d, d[order(ave(seq(NROW(d)), d$ID, FUN = seq_along), ID, score),])
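All of the answers above rely on the same ordering key: the occurrence number of each row within its ID. A small sketch that builds the key explicitly (occ and res are names chosen for this illustration):

```r
d <- data.frame(ID = c("a", "a", "b", "b", "c", "c"),
                score = c(1, 2, 2, 4, 4, 5),
                stringsAsFactors = FALSE)

# occurrence number of each row within its ID: 1 2 1 2 1 2
occ <- ave(seq_along(d$ID), d$ID, FUN = seq_along)

# sort by occurrence first, then by ID, so the IDs cycle a, b, c
res <- d[order(occ, d$ID), ]
row.names(res) <- NULL
res
```

Unlike the duplicated() version, this key also works when an ID occurs more than twice.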
I have a dataframe (data1) with gene IDs in one column. In another dataframe (data2) I have the corresponding gene names. Some cells in data1 contain multiple gene IDs, separated by ':', and there are also a lot of NAs. Preferably, I want to add a column to data1 with the corresponding gene names, also separated by ':' if there are multiple. An alternative would be to replace all the gene IDs in data1 with the corresponding gene names. Any idea how to go about this? Thanks!
a <- c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA")
data1 <- data.frame(a)
b <- c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294")
c <- c("GeneA", "GeneB", "GeneC")
data2 <- data.frame(b,c)
One option involving stringr could be:
library(stringr)
data1$res <- str_replace_all(data1$a, setNames(data2$c, data2$b))
a res
1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
2 ENSG00000185294 GeneC
3 NA NA
We can get data1 in long format, left_join data2 and paste values together.
library(dplyr)
data1 %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(a, sep = ":") %>%
left_join(data2, by = c('a' = 'b')) %>%
group_by(row) %>%
summarise(a = paste0(a, collapse = ":"),
c = paste0(c, collapse = ":")) %>%
select(-row)
# a c
# <chr> <chr>
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
Here is another option with gsubfn
library(gsubfn)
data1$res <- gsubfn("\\w+", setNames(as.list(as.character(data2$c)),
data2$b), as.character(data1$a))
data1
# a res
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
In base R, this can also be done by splitting the 'a' column with strsplit and then matching against a named vector created from the 'b' and 'c' columns of the second dataset
is.na(data1$a) <- data1$a == "NA" # converting to real NA instead of character
i1 <- !is.na(data1$a)
# create named vector
v1 <- setNames(as.character(data2$c), data2$b)
data1$res[i1] <- sapply(strsplit(as.character(data1$a[i1]), ":"),
function(x) paste(v1[x], collapse=":"))
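The core of the base R approach is indexing a named vector by name. Here is a self-contained sketch with the lookup written out by hand; lookup, ids and translate are illustrative names, and the missing value is assumed to be a real NA rather than the string "NA".

```r
# named vector: names are gene IDs, values are gene names
lookup <- c(ENSG00000150401 = "GeneA",
            ENSG00000150403 = "GeneB",
            ENSG00000185294 = "GeneC")

ids <- c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", NA)

translate <- function(x) {
  if (is.na(x)) return(NA_character_)          # keep missing cells missing
  pieces <- strsplit(x, ":", fixed = TRUE)[[1]]  # one entry per gene ID
  paste(unname(lookup[pieces]), collapse = ":")  # look up and re-join
}

res <- vapply(ids, translate, character(1), USE.NAMES = FALSE)
res
```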
I have a data frame with two list columns. Each observation in a list contains a different number of elements. For example, the 4th element of the variable "accession" contains one element, but the 7th contains two.
current dataframe
I want to make a new data frame that combines the two lists, which looks like:
final dataframe I want
Thanks for helping me!
This is the data frame I currently have.
library(rentrez)
search <- entrez_search(db="gds", term=paste0("disease", " AND gse[ETYP]") , retMax = 15)
id <- unlist(search$ids)
UID <- c(sapply(id, paste0, collapse=""))
pub.summary <- entrez_summary(db = "gds", id = UID ,
always_return_list = TRUE)
summary <- extract_from_esummary(esummaries = pub.summary ,
elements = c("samples"),
simplify = T)
df <- data.frame(summary)
df <-data.frame(t(df))
df <- df %>% mutate()
df
This is the data frame result I wish to have
# accession title
#1 GSM3955152 Cancer3
GSM3955155 Adjacent3
GSM3955757 SW480 cells, HES1-binding RNAs/LncRNAs
GSM3955153 Adjacent1
GSM3955150 Cancer1
GSM3955151 Cancer2
#2 GSM33026213 his4wk_sensitized_uti_1
GSM3302681 3his4wk_resolved_pbs_2
GSM3302624 c57bl6j_pbs_9
.
.
.
.
#4 GSM3955757 SW480 cells, HES1-binding RNAs/LncRNAs
.
.
.
.
#15 GSM3934992 control rep4 [N_0039]
GSM3935006 control rep15 [W_010]
GSM3935012 control rep17 [W_023]
GSM3934989 control rep1 [N_0026]
END
Update
Based on the OP's updates, an option is to specify simplify = FALSE in extract_from_esummary so that it returns a list, then extract the first element from each list element and rbind them to create a single dataframe
summary <- extract_from_esummary(esummaries = pub.summary ,
elements = "samples",
simplify = FALSE)
out <- do.call(rbind, lapply(summary, `[[`, 1))
row.names(out) <- NULL
head(out)
# accession title
#1 GSM3955152 Cancer3
#2 GSM3955155 Adjacent3
#3 GSM3955757 SW480 cells, HES1-binding RNAs/LncRNAs
#4 GSM3955153 Adjacent1
#5 GSM3955150 Cancer1
#6 GSM3955151 Cancer2
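Since the rentrez calls need network access, the do.call(rbind, ...) step can be tried out on a small mock list instead. The list samples below is invented for illustration and only mimics the shape assumed above (one accession/title data frame per record); it is not real GEO output.

```r
# hypothetical stand-in for the per-record 'samples' data frames
samples <- list(
  rec1 = data.frame(accession = c("GSM1", "GSM2"),
                    title = c("t1", "t2"),
                    stringsAsFactors = FALSE),
  rec2 = data.frame(accession = "GSM3",
                    title = "t3",
                    stringsAsFactors = FALSE)
)

# stack all per-record data frames into one
out <- do.call(rbind, samples)
row.names(out) <- NULL
out
```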
An option would be to pad the list elements with NA so that both columns have the same length in each row (where they differ) and then unnest
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
mutate(n = pmax(lengths(accession), lengths(title))) %>%
mutate_at(vars(accession, title), ~
map2(., n, ~ `length<-`(.x, .y))) %>%
select(-n) %>%
unnest(cols = c(accession, title))
# A tibble: 12 x 2
# accession title
# <chr> <chr>
# 1 A a
# 2 B b
# 3 C c
# 4 <NA> d
# 5 <NA> e
# 6 A a
# 7 B b
# 8 C c
# 9 D <NA>
#10 E <NA>
#11 A d
#12 B <NA>
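The padding trick relies on the fact that assigning a larger length to a vector fills the new positions with NA. A base R sketch of the same idea, using the example data given below (pad is a helper written only for this illustration):

```r
accession <- list(LETTERS[1:3], LETTERS[1:5], LETTERS[1:2])
title <- list(letters[1:5], letters[1:3], letters[4])

# target length per row: the longer of the two list elements
n <- pmax(lengths(accession), lengths(title))

# extend each element to its target length; new slots become NA
pad <- function(lst, n) {
  Map(function(v, k) { length(v) <- k; v }, lst, n)
}

out <- data.frame(accession = unlist(pad(accession, n)),
                  title = unlist(pad(title, n)),
                  stringsAsFactors = FALSE)
out
```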
Or an option is to gather into 'long' format, then unnest the 'val' column and spread it back to 'wide' format
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
gather(key, val, -rn) %>%
unnest(val) %>%
group_by(rn, key) %>%
mutate(i1 = row_number()) %>%
spread(key, val) %>%
ungroup %>%
select(-rn, -i1)
data
df1 <- tibble(accession = list(LETTERS[1:3], LETTERS[1:5], LETTERS[1:2]),
title = list(letters[1:5], letters[1:3], letters[4]))
I have a dataset with 18 columns, from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
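The question's edit mentions that some columns contain NAs, in which case max(x) returns NA and the comparison breaks down. A possible variation, sketched under the assumption that NAs should simply be ignored and that no row is entirely NA (the NA in the data below is made up for the example):

```r
Df <- data.frame(a = c(4L, 3L, NA), b = 4:6, c = 3:5)

maxcol <- apply(Df, 1, function(x) {
  # ignore NAs when taking the maximum and when comparing
  is_max <- !is.na(x) & x == max(x, na.rm = TRUE)
  paste0(names(x)[is_max], collapse = "")
})
maxcol
```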
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert the dataframe from wide to long using gather; then, grouped by row, we paste together the column names that hold the maximum value, and finally spread the long dataframe back to wide.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)