Create new column if first letter is a specific letter in R

Consider the following data in R:
d <- data.frame(a = c("E5","E5","E5","E5"),
b = c("011","012","013","111"))
I want to add a new column that is equal to "A5" if the first character in column b is 0, except for "013". That is, I want the following table:
a b c
1 E5 011 A5
2 E5 012 A5
3 E5 013
4 E5 111
How do I do that in R?

Does this work? Note that str_detect() comes from the stringr package:
> library(stringr)
> d
a b
1 E5 011
2 E5 012
3 E5 013
4 E5 111
> transform(d, c = ifelse(str_detect(b, '^0') & b != '013', 'A5', ''))
a b c
1 E5 011 A5
2 E5 012 A5
3 E5 013
4 E5 111
>

An option with data.table
library(data.table)
setDT(d)[substr(b, 1, 1) == "0" & b != "013", c := "A5"]
Output:
d
# a b c
#1: E5 011 A5
#2: E5 012 A5
#3: E5 013 <NA>
#4: E5 111 <NA>
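For comparison, the same rule can be sketched in base R without extra packages. This is only an illustration: it assumes column b is stored as character (the default for data.frame() since R 4.0) and fills non-matching rows with an empty string, as in the desired table.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps b as character
d <- data.frame(a = c("E5", "E5", "E5", "E5"),
                b = c("011", "012", "013", "111"),
                stringsAsFactors = FALSE)

# "A5" where b starts with "0", except for the exact value "013"
d$c <- ifelse(startsWith(d$b, "0") & d$b != "013", "A5", "")
d
```

Replacing "" with NA_character_ would give missing values instead of empty strings, similar to the data.table result.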


Calculate medians of multiple columns in a data frame

I would like to calculate medians for the data frame df below.
Specifically, I would like the median of each of the columns A1 through A10, returned separately for each column.
Thanks!
#database
df <- structure(
list(D1 = c("a","a","b","b","b"),
D2 = c("c","d","c","d","c"), D3 = c("X","X","Y","Z","Z"), A1=c(1,2,3,4,5),A2=c(4,2,3,4,4), A3=c(1,2,3,4,6),
A4=c(1,9,4,4,6),A5=c(1,4,3,9,6),A6=c(1,2,4,4,8),A7=c(1,1,3,4,7),A8=c(1,6,4,4,2),A9=c(1,2,3,4,6),A10=c(1,5,3,2,7)),
class = "data.frame", row.names = c(NA, -5L))
If you want to keep it simple:
apply(df[, 4:13], 2, median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3
We can loop over the numeric columns and get the median
library(dplyr)
df %>%
summarise(across(where(is.numeric), median))
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 3 4 3 4 4 4 3 4 3 3
Or use colMedians from matrixStats
library(matrixStats)
colMedians(as.matrix(df[startsWith(names(df), "A")]))
[1] 3 4 3 4 4 4 3 4 3 3
Or in base R
sapply(df[startsWith(names(df), "A")], median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3

How to find common rows (considering vice versa format) of 2 dataframe in R

I want to find the common rows between two data frames. To find common rows, I can use inner_join(), semi_join(), and merge(). I have gone through different posts, including this one, but these operations do not do what I need because my data is a little different.
Sometimes a pair appears in reverse order in the other data frame, as in the 3rd and 5th rows of dataframe-1 and dataframe-2. Dataframe-1 contains A3 A1 0.75 but Dataframe-2 contains A1 A3. I would like to treat these two rows as the same.
My first dataframe looks like
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 (+)-1(10),4-Cadinadiene Falcarinone-10 0.09
7 Leucodelphinidin-100 (+)-1(10),4-Cadinadiene 0.876
8 Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
9 (2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
10 Falcarinone Lignin 1
11 A1 (+)-1(10),4-Cadinadiene 1
12 A2 Lignin-10 1
13 A3 (2E,7R,11R)-2-Phyten-1-ol 1
14 Falcarinone A6 1
15 A4 Leucodelphinidin 1
16 A4 Leucodelphinidin 1
17 Falcarinone A100 1
18 A4 Falcarinone 1
the second dataframe looks like
query target
1 A1 A2
2 A2 A5
3 A1 A3 // Missing in the output
4 A4 A5
5 A3 A5 // Missing in the output
6 A3 (2E,7R,11R)-2-Phyten-1-ol
7 (+)-1(10),4-Cadinadiene Falcarinone
8 Leucodelphinidin (+)-1(10),4-Cadinadiene-100
9 Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
10 A11 (+)-1(10),4-Cadinadiene
11 A2 Lignin
12 A3 (2E,7R,11R)-2-Phyten-1-0l
13 Falcarinone A60
14 A4 Leucodelphinidin // Missing in the output
The code I am using
library(dplyr)
output <- semi_join(df_1, df_2)   # or
output <- inner_join(df_1, df_2)
The output I am getting
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
But, my expected output is like this
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 A4 Leucodelphinidin 1
Reproducible code is given below
df_1 <- read.table(text="query target weight
A1 A2 0.6
A2 A5 0.5
A3 A1 0.75
A4 A5 0.88
A5 A3 0.99
(+)-1(10),4-Cadinadiene Falcarinone 0.09
Leucodelphinidin (+)-1(10),4-Cadinadiene 0.876
Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
(2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
Falcarinone Lignin 1
A1 (+)-1(10),4-Cadinadiene 1
A2 Lignin 1
A3 (2E,7R,11R)-2-Phyten-1-ol 1
Falcarinone A6 1
A4 Leucodelphinidin 1
A4 Leucodelphinidin 1
Falcarinone A100 1
A5 Falcarinone 1", header=TRUE)
df_2 <- read.table(text="query target
A1 A2
A2 A5
A1 A3
A4 A5
A3 A5
(+)-1(10),4-Cadinadiene Falcarinone
Leucodelphinidin (+)-1(10),4-Cadinadiene-100
Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
A11 (+)-1(10),4-Cadinadiene
A2 Lignin
A3 (2E,7R,11R)-2-Phyten-1-0l
Falcarinone A6
A4 Leucodelphinidin ", header=TRUE)
Any kind of suggestion is appreciated.
You could write a small function that sorts the first two columns within each row of both data frames, and then merge them.
sc <- function(x, i) setNames(cbind(data.frame(t(apply(x[i], 1, sort))), x[-i]), names(x))
res <- merge(sc(df_1, 1:2), sc(df_2, 1:2))
res[!duplicated(res), ] ## remove duplicates
# query target weight
# 1 (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2 A1 A2 0.60
# 3 A1 A3 0.75
# 4 A2 A5 0.50
# 5 A2 Lignin 1.00
# 6 A3 A5 0.99
# 7 A4 A5 0.88
# 8 A4 Leucodelphinidin 1.00
# 10 A6 Falcarinone 1.00
Edit
Solution with data.table which should be more memory efficient.
library(data.table)
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
res
# query target weight
# 1: (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2: A1 A2 0.60
# 3: A1 A3 0.75
# 4: A2 A5 0.50
# 5: A2 Lignin 1.00
# 6: A3 A5 0.99
# 7: A4 A5 0.88
# 8: A4 Leucodelphinidin 1.00
# 9: A6 Falcarinone 1.00
To get back "data.frame"s, just do e.g. setDF(res).
You could try:
output <- merge(df_1, df_2, all = TRUE)
and then check for duplicated rows regardless of ordering, something like:
same.rows <- duplicated(t(apply(output, 1, sort)))
which returns a vector of flags
FALSE FALSE FALSE TRUE FALSE FALSE TRUE
You can then keep the rows that are FALSE:
output[!same.rows, ]
query target weight
1 A1 A2 0.60
2 A1 A3 0.75
3 A2 A5 0.50
5 A3 A5 0.99
6 A4 A5 0.88
does it make sense?
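If the rows of the first data frame should keep their original orientation (the expected output lists A3 A1, not A1 A3), one possible base R sketch is to build an order-independent key per row and filter on it. The helper name pair_key and the shortened data below are assumptions made for illustration, not part of the original question.

```r
# Shortened versions of the two data frames from the question
df_1 <- data.frame(
  query  = c("A1", "A2", "A3", "A4", "A5", "Falcarinone", "A4", "A4"),
  target = c("A2", "A5", "A1", "A5", "A3", "Lignin",
             "Leucodelphinidin", "Leucodelphinidin"),
  weight = c(0.60, 0.50, 0.75, 0.88, 0.99, 1, 1, 1),
  stringsAsFactors = FALSE
)
df_2 <- data.frame(
  query  = c("A1", "A2", "A1", "A4", "A3", "A4", "A2"),
  target = c("A2", "A5", "A3", "A5", "A5", "Leucodelphinidin", "Lignin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one key per row that ignores the query/target order.
# A tab is used as separator on the assumption that no name contains one.
pair_key <- function(d) {
  paste(pmin(d$query, d$target), pmax(d$query, d$target), sep = "\t")
}

# Keep rows of df_1 whose unordered pair also occurs in df_2, then drop
# exact duplicate rows; the columns of df_1 themselves are left untouched
common <- unique(df_1[pair_key(df_1) %in% pair_key(df_2), ])
common
```

Because only the key is sorted, the result keeps the query/target order and the weight column exactly as they appear in df_1.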

How to create an incrementing variable with 2 variables in R?

I would like to create an incrementing variable (Id1 or Id2) from two other variables (Var1 and Var2).
Thank you.
Elodie
EDIT (reproducible example for Aaron Montgomery)
I want to create an incrementing variable, "Id". The value of "Id" changes only if VarA takes a new value and VarB also takes a new value. See in particular the rows where Id = 4 in the expected table.
data_example <- data.table::fread("
VarA VarB
A1 B1
A1 B2
A1 B3
A1 B4
A2 B5
A3 B6
A4 B7
A5 B7
A5 B8
A6 B9
A7 B10
A8 B10
A9 B10")
Expected table
VarA VarB Id
A1 B1 1
A1 B2 1
A1 B3 1
A1 B4 1
A2 B5 2
A3 B6 3
A4 B7 4
A5 B7 4
A5 B8 4
A6 B9 5
A7 B10 6
A8 B10 6
A9 B10 6
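One way to reproduce the expected table above in base R is sketched below. It rests on an assumption that the question does not state explicitly: rows that belong together are already adjacent, so each row only needs to be compared with the previous one. For unordered data a different approach (for example, grouping by connected components) would be needed.

```r
data_example <- data.frame(
  VarA = c("A1", "A1", "A1", "A1", "A2", "A3", "A4",
           "A5", "A5", "A6", "A7", "A8", "A9"),
  VarB = c("B1", "B2", "B3", "B4", "B5", "B6", "B7",
           "B7", "B8", "B9", "B10", "B10", "B10"),
  stringsAsFactors = FALSE
)

n <- nrow(data_example)

# TRUE where a row differs from the previous row in BOTH columns
both_new <- data_example$VarA[-1] != data_example$VarA[-n] &
  data_example$VarB[-1] != data_example$VarB[-n]

# The first row always starts group 1; each "both new" row starts the next one
data_example$Id <- cumsum(c(TRUE, both_new))
data_example
```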
Here is one solution using the tidyverse
library(tidyverse)
data_example <- data.table::fread("
Var1 Var2 Id1 Id2
604211 1001 3 1
604211 1093 3 1
604211 1146 3 1
604211 1319 3 1
635348 1002 5 2
634849 1005 5 2
620861 1004 4 3
622281 1004 4 3
622281 1041 4 3
600044 1100 1 4
600049 1033 2 5
607692 1033 2 5
612595 1033 2 5")
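data_example %>%
  arrange(Var1, Var2) %>%
  group_by(Var1) %>%
  # cur_group_id() replaces group_indices() without arguments (deprecated in dplyr >= 1.0)
  mutate(id1 = cur_group_id()) %>%
  group_by(Var2) %>%
  mutate(id2 = cur_group_id()) %>%
  ungroup()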

Ranking data that have the same values [duplicate]

I have a large data set including a column of counts for different genetic markers. I want to generate an overall ranking based on the count, regardless of the genetic marker. For instance, if two or more genetic markers all have a count of 5, they should all have the same rank number, and I want the rank numbers to be displayed in a separate column. I have this data frame:
SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8
I want the output to be:
SNP count rank
a1 26 1
a2 18 2
a3 16 3
a4 15 4
a8 15 4
a5 14 5
a6 14 5
a7 14 5
a9 13 7
a10 12 8
a11 12 8
a12 11 9
a13 10 10
a14 9 11
a15 8 12
Note that SNPs a4 and a8 have the same count, as do a5, a6 and a7, and also a10 and a11. I've tried
transform(df, x = ave(count, FUN = function(x) order(x, decreasing = TRUE)))
but it's not what I want.
What you are looking for is the rleid function from the data.table package.
data.table::rleid(df$count)
[1] 1 2 3 4 5 5 5 6 7 8 8 9 10 11 12
df is obtained like so:
df <- read.table(text ="SNP count
a1 26
a2 18
a3 16
a4 15
a5 14
a6 14
a7 14
a8 15
a9 13
a10 12
a11 12
a12 11
a13 10
a14 9
a15 8",
stringsAsFactors =FALSE,
header = TRUE)
And for thoroughness:
df$rank <- data.table::rleid(df$count)
df
SNP count rank
1 a1 26 1
2 a2 18 2
3 a3 16 3
4 a4 15 4
5 a5 14 5
6 a6 14 5
7 a7 14 5
8 a8 15 6
9 a9 13 7
10 a10 12 8
11 a11 12 8
12 a12 11 9
13 a13 10 10
14 a14 9 11
15 a15 8 12
Edit:
Thanks to @Frank, a better solution is to order by decreasing count when applying rleid:
library(data.table)
setDT(df)[order(-count), rank := rleid(count)]
Which gives:
df
SNP count rank
1: a1 26 1
2: a2 18 2
3: a3 16 3
4: a4 15 4
5: a5 14 5
6: a6 14 5
7: a7 14 5
8: a8 15 4
9: a9 13 6
10: a10 12 7
11: a11 12 7
12: a12 11 8
13: a13 10 9
14: a14 9 10
15: a15 8 11
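The same dense ranking can also be sketched in base R, without data.table, by looking up each count in the sorted vector of distinct counts. This is only an illustrative alternative and assumes the count column has no missing values.

```r
df <- data.frame(
  SNP   = paste0("a", 1:15),
  count = c(26, 18, 16, 15, 14, 14, 14, 15, 13, 12, 12, 11, 10, 9, 8),
  stringsAsFactors = FALSE
)

# Position of each count among the distinct counts, largest first:
# equal counts share a rank and no rank numbers are skipped
df$rank <- match(df$count, sort(unique(df$count), decreasing = TRUE))
df
```

Unlike rleid, this does not depend on the row order, so the data frame does not need to be sorted first.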

Match in R while disregarding order [duplicate]

I am trying to do a match in R regardless of the order of the columns.
The problem I am trying to solve is this: if all of the values in a row of df2, from column 2 to the end, are found in a row of df1 (after Partner), then match that df1 row.
Here's the catch: disregard the last non-NA value in each row when doing the match, but include it in the final output.
After the match, determine whether that last non-NA value exists in any of the columns of its respective row.
df1
Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA
df2
lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1
How do I match df1 with df2 so that the following happens:
1) The order of the columns in both data frames is disregarded.
2) It is then determined whether the last non-NA value exists in the matched row.
Final output:
df3
Partner Col1 Col2 Col3 Col4 lift rule1 rule2 rule3 EXIST?
A A1 A2 NA NA 11 A2 A1 A9 YES
A A1 A2 NA NA 10 A1 A3 NA NOPE
B A2 B9 NA NA 11 B9 A2 D7 YES
B A2 B9 NA NA 11 A2 B9 B1 YES
D Q1 Q3 Q4 NA 10 Q4 Q1 NA YES
I get one more B match than you, but this solution is very close to what you want. You first have to add an id column as we use it to reconstruct the data. Then to perform the match, you first need to melt it with gather from tidyr and use inner_join from dplyr. We then cbind using the ids and the original data.frames.
library(tidyr);library(dplyr)
df1 <- read.table(text="Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA",header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1",header=TRUE, stringsAsFactors=FALSE)
df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
# reshape to long format with gather
d11 <- df1 %>% gather(Col, Value, starts_with("C"))
# drop NAs, then drop the last non-NA value of each df1 row
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n())
d22 <- df2 %>% gather(Rule, Value, starts_with("r"))
# join on the shared Value column
res <- inner_join(d11, d22, by = "Value")
cbind(df1[res$df1_id, ], df2[res$df2_id, ])
df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1 1 A A1 A2 <NA> <NA> 2 10 A1 A3 <NA>
1.1 1 A A1 A2 <NA> <NA> 1 11 A2 A1 A9
2 2 B A2 B9 <NA> <NA> 1 11 A2 A1 A9
2.1 2 B A2 B9 <NA> <NA> 5 11 A2 B9 B1
2.2 2 B A2 B9 <NA> <NA> 3 11 B9 A2 D7
4 4 D Q1 Q3 Q4 <NA> 4 10 Q4 Q1 <NA>
