How to create an incrementing variable from 2 variables in R?

I would like to create an incrementing variable (Id1 or Id2) from 2 other variables (Var1 and Var2).
Thank you.
Elodie
EDIT (reproducible example for Aaron Montgomery)
I want to create an incrementing variable: "Id". "Id" increments only when both VarA and VarB are new values; see in particular the rows where Id = 4 in the expected table.
data_example <- data.table::fread("
VarA VarB
A1 B1
A1 B2
A1 B3
A1 B4
A2 B5
A3 B6
A4 B7
A5 B7
A5 B8
A6 B9
A7 B10
A8 B10
A9 B10")
Expected table
VarA VarB Id
A1 B1 1
A1 B2 1
A1 B3 1
A1 B4 1
A2 B5 2
A3 B6 3
A4 B7 4
A5 B7 4
A5 B8 4
A6 B9 5
A7 B10 6
A8 B10 6
A9 B10 6

Here is one solution using the tidyverse:
library(tidyverse)
data_example <- data.table::fread("
Var1 Var2 Id1 Id2
604211 1001 3 1
604211 1093 3 1
604211 1146 3 1
604211 1319 3 1
635348 1002 5 2
634849 1005 5 2
620861 1004 4 3
622281 1004 4 3
622281 1041 4 3
600044 1100 1 4
600049 1033 2 5
607692 1033 2 5
612595 1033 2 5")
data_example %>%
  arrange(Var1, Var2) %>%
  group_by(Var1) %>%
  mutate(id1 = cur_group_id()) %>%  # cur_group_id() replaces the deprecated group_indices()
  group_by(Var2) %>%
  mutate(id2 = cur_group_id()) %>%
  ungroup()
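The grouped mutate above indexes each variable independently, but the expected Id in the edit is really a connected-components labelling: rows chain into one group whenever they share a VarA or a VarB value (see Id = 4, which spans A4/A5 through B7). A sketch of that reading with igraph (an added approach, not from the original answers; component numbering follows first appearance here, which lines up with the expected 1..6):

```r
library(igraph)
library(data.table)

data_example <- data.table::fread("
VarA VarB
A1 B1
A1 B2
A1 B3
A1 B4
A2 B5
A3 B6
A4 B7
A5 B7
A5 B8
A6 B9
A7 B10
A8 B10
A9 B10")

# Prefix the two columns so an A value and a B value can never collide
# as vertex names, then link each row's VarA to its VarB.
edges <- data_example[, .(from = paste0("a_", VarA), to = paste0("b_", VarB))]
g <- graph_from_data_frame(edges, directed = FALSE)

# Each connected component is one Id; membership is a named vector,
# so it can be looked up by vertex name.
memb <- components(g)$membership
data_example[, Id := memb[paste0("a_", VarA)]]
data_example
```

The same trick applies unchanged to the original Var1/Var2 framing, where 620861 and 622281 share Id 4 through Var2 = 1004.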

Related

Calculate medians of multiple columns in a data frame

I would like to calculate medians in my df data frame below.
In this case, I would like the median of each of the columns A1 through A10, returned separately per column.
Thanks!
#database
df <- structure(
  list(D1 = c("a","a","b","b","b"),
       D2 = c("c","d","c","d","c"),
       D3 = c("X","X","Y","Z","Z"),
       A1 = c(1,2,3,4,5), A2 = c(4,2,3,4,4), A3 = c(1,2,3,4,6),
       A4 = c(1,9,4,4,6), A5 = c(1,4,3,9,6), A6 = c(1,2,4,4,8),
       A7 = c(1,1,3,4,7), A8 = c(1,6,4,4,2), A9 = c(1,2,3,4,6), A10 = c(1,5,3,2,7)),
  class = "data.frame", row.names = c(NA, -5L))
If you want to keep it simple:
apply(df[, 4:13], 2, median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3
We can loop over the numeric columns and get the median of each:
library(dplyr)
df %>%
summarise(across(where(is.numeric), median))
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 3 4 3 4 4 4 3 4 3 3
Or use colMedians from matrixStats
library(matrixStats)
colMedians(as.matrix(df[startsWith(names(df), "A")]))
[1] 3 4 3 4 4 4 3 4 3 3
Or in base R
sapply(df[startsWith(names(df), "A")], median)
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
3 4 3 4 4 4 3 4 3 3

How to find common rows (considering vice versa format) of 2 dataframe in R

I want to find the common rows between 2 dataframes. To find the common rows, I can use inner_join(), semi_join(), and merge(), and I have gone through different posts including this one. But these operations do not fulfill my purpose, because my data is a little different!
Sometimes a row can appear with its two columns swapped (vice versa), like the 3rd and 5th rows of dataframe-1 and dataframe-2: dataframe-1 contains A3 A1 0.75 but dataframe-2 contains A1 A3 0.75. I would like to treat these 2 rows as the same.
My first dataframe looks like
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 (+)-1(10),4-Cadinadiene Falcarinone-10 0.09
7 Leucodelphinidin-100 (+)-1(10),4-Cadinadiene 0.876
8 Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
9 (2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
10 Falcarinone Lignin 1
11 A1 (+)-1(10),4-Cadinadiene 1
12 A2 Lignin-10 1
13 A3 (2E,7R,11R)-2-Phyten-1-ol 1
14 Falcarinone A6 1
15 A4 Leucodelphinidin 1
16 A4 Leucodelphinidin 1
17 Falcarinone A100 1
18 A4 Falcarinone 1
the second dataframe looks like
query target
1 A1 A2
2 A2 A5
3 A1 A3 // Missing in the output
4 A4 A5
5 A3 A5 // Missing in the output
6 A3 (2E,7R,11R)-2-Phyten-1-ol
7 (+)-1(10),4-Cadinadiene Falcarinone
8 Leucodelphinidin (+)-1(10),4-Cadinadiene-100
9 Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
10 A11 (+)-1(10),4-Cadinadiene
11 A2 Lignin
12 A3 (2E,7R,11R)-2-Phyten-1-0l
13 Falcarinone A60
14 A4 Leucodelphinidin // Missing in the output
The code I am using:
output <- semi_join(df_1, df_2)
# or
output <- inner_join(df_1, df_2)
The output I am getting
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
But, my expected output is like this
query target weight
1 A1 A2 0.60
2 A2 A5 0.50
3 A3 A1 0.75
4 A4 A5 0.88
5 A5 A3 0.99
6 A4 Leucodelphinidin 1
Reproducible code is given below
df_1 <- read.table(text="query target weight
A1 A2 0.6
A2 A5 0.5
A3 A1 0.75
A4 A5 0.88
A5 A3 0.99
(+)-1(10),4-Cadinadiene Falcarinone 0.09
Leucodelphinidin (+)-1(10),4-Cadinadiene 0.876
Lignin (2E,7R,11R)-2-Phyten-1-ol 0.778
(2E,7R,11R)-2-Phyten-1-ol Leucodelphinidin 0.55
Falcarinone Lignin 1
A1 (+)-1(10),4-Cadinadiene 1
A2 Lignin 1
A3 (2E,7R,11R)-2-Phyten-1-ol 1
Falcarinone A6 1
A4 Leucodelphinidin 1
A4 Leucodelphinidin 1
Falcarinone A100 1
A5 Falcarinone 1", header=TRUE)
df_2 <- read.table(text="query target
A1 A2
A2 A5
A1 A3
A4 A5
A3 A5
(+)-1(10),4-Cadinadiene Falcarinone
Leucodelphinidin (+)-1(10),4-Cadinadiene-100
Lignin-2 (2E,7R,11R)-2-Phyten-1-ol
A11 (+)-1(10),4-Cadinadiene
A2 Lignin
A3 (2E,7R,11R)-2-Phyten-1-0l
Falcarinone A6
A4 Leucodelphinidin ", header=TRUE)
Any kind of suggestion is appreciated.
You could write a small function that sorts the first two columns within each row of both data frames, then merge them:
sc <- function(x, i) setNames(cbind(data.frame(t(apply(x[i], 1, sort))), x[-i]), names(x))
res <- merge(sc(df_1, 1:2), sc(df_2, 1:2))
res[!duplicated(res), ] ## remove duplicates
# query target weight
# 1 (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2 A1 A2 0.60
# 3 A1 A3 0.75
# 4 A2 A5 0.50
# 5 A2 Lignin 1.00
# 6 A3 A5 0.99
# 7 A4 A5 0.88
# 8 A4 Leucodelphinidin 1.00
# 10 A6 Falcarinone 1.00
Edit
A solution with data.table, which should be more memory-efficient:
library(data.table)
setDT(df_1)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
setDT(df_2)[,c("query", "target") := list(pmin(query,target), pmax(query,target))]
res <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
res
# query target weight
# 1: (+)-1(10),4-Cadinadiene Falcarinone 0.09
# 2: A1 A2 0.60
# 3: A1 A3 0.75
# 4: A2 A5 0.50
# 5: A2 Lignin 1.00
# 6: A3 A5 0.99
# 7: A4 A5 0.88
# 8: A4 Leucodelphinidin 1.00
# 9: A6 Falcarinone 1.00
To get back "data.frame"s, just do e.g. setDF(res).
Maybe you can try:
output <- merge(df_1, df_2, all = TRUE)
and then check for duplicated rows regardless of ordering, something like:
same.rows <- duplicated(t(apply(output, 1, sort)))
which returns a vector of flags
FALSE FALSE FALSE TRUE FALSE FALSE TRUE
You can then keep the rows which are FALSE:
output[!same.rows, ]
query target weight
1 A1 A2 0.60
2 A1 A3 0.75
3 A2 A5 0.50
5 A3 A5 0.99
6 A4 A5 0.88
does it make sense?
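For completeness, the same pmin/pmax canonicalisation used in the accepted answer's data.table edit also reads naturally as plain dplyr; a minimal sketch on a subset of the question's data (the helper name canon is my own):

```r
library(dplyr)

# Subset of the question's data, enough to show the idea.
df_1 <- read.table(text = "query target weight
A1 A2 0.6
A3 A1 0.75
A4 A5 0.88", header = TRUE, stringsAsFactors = FALSE)

df_2 <- read.table(text = "query target
A1 A2
A1 A3
A3 A5", header = TRUE, stringsAsFactors = FALSE)

# Rewrite every pair in a canonical (alphabetical) order so that
# (A3, A1) and (A1, A3) become the same row before joining.
canon <- function(d) {
  q <- pmin(d$query, d$target)
  t <- pmax(d$query, d$target)
  d$query  <- q
  d$target <- t
  d
}

res <- inner_join(canon(df_1), canon(df_2), by = c("query", "target")) %>%
  distinct()
res
# matches A1-A2 (0.60) and A1-A3 (0.75); A4-A5 has no partner in df_2
```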

Create new column if first letter is a specific letter in R

Consider the following data in R:
d <- data.frame(a = c("E5","E5","E5","E5"),
b = c("011","012","013","111"))
I want to add a new column that is equal to "A5" if the first character in column b is 0, except for "013". That is, I want the following table:
a b c
1 E5 011 A5
2 E5 012 A5
3 E5 013
4 E5 111
How do I do that in R?
Does this work:
> d
a b
1 E5 011
2 E5 012
3 E5 013
4 E5 111
> library(stringr)
> transform(d, c = ifelse(str_detect(d$b, '^01[^3]'), 'A5', ''))
a b c
1 E5 011 A5
2 E5 012 A5
3 E5 013
4 E5 111
>
An option with data.table
library(data.table)
setDT(d)[substr(b, 1, 1) == 0 & b != '013', c := 'A5']
-output
d
# a b c
#1: E5 011 A5
#2: E5 012 A5
#3: E5 013 <NA>
#4: E5 111 <NA>
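The same rule also fits in one dependency-free base R line; a sketch assuming the `d` data frame from the question, using an empty string (rather than NA) for the non-matching rows as in the expected table:

```r
d <- data.frame(a = c("E5", "E5", "E5", "E5"),
                b = c("011", "012", "013", "111"))

# "A5" when b starts with "0", except for the excluded code "013".
d$c <- ifelse(substr(d$b, 1, 1) == "0" & d$b != "013", "A5", "")
d
```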

For each row, sort the top 5 values in descending order and get their column names

I have a data frame, and for each row I want to extract the names of the 5 columns with the largest values.
DF <- data.frame(a1=c(10,45,100,5000,23,45,2,23,56),
a2=c(60,20,5,2,1,2,3,4,5),
a3=c(90,2,0,0,0,4,-5,-3,-2),
a4=c(900,122,30,40,50,64,-75,-83,-92),
a5=c(190,32,30,50,80,49,-50,-7,-2),
a6=c(30,27,80,54,84,49,-50,-37,-23),
a7=c(0,32,39,50,80,9,-5,-7,-23))
I tried using the below approach
k <- 5
mx <- t(apply(DF,1,function(x)names(DF)[sort(head(order(x,decreasing=TRUE),k))]))
mx<-as.data.frame(mx)
I am able to get results, but the order is not correct for all rows.
For example, the expected output for row 1 should be
a4 a5 a3 a2 a6
or
a4 a5 a3 a6 a2
but my output lists them in a different order. I would appreciate a dplyr-based solution if possible.
Try this approach; the issue was that you had an additional sort() that reordered the values again:
#Code
mx <- t(apply(DF,1,function(x)names(DF)[head(order(x,decreasing=TRUE),k)]))
mx<-as.data.frame(mx)
Output:
V1 V2 V3 V4 V5
1 a4 a5 a3 a2 a6
2 a4 a1 a5 a7 a6
3 a1 a6 a7 a4 a5
4 a1 a6 a5 a7 a4
5 a6 a5 a7 a4 a1
6 a4 a5 a6 a1 a7
7 a2 a1 a3 a7 a5
8 a1 a2 a3 a5 a7
9 a1 a2 a3 a5 a6
A tidyverse approach would imply reshaping data like this:
library(tidyverse)
#Code
DF %>%
  #Create an id by row
  mutate(id = 1:n()) %>%
  #Reshape
  pivot_longer(cols = -id) %>%
  #Arrange
  arrange(id, -value) %>%
  #Filter top 5
  group_by(id) %>%
  mutate(Var = 1:n()) %>%
  filter(Var <= 5) %>%
  select(-c(value, Var)) %>%
  #Format
  mutate(Var = paste0('V', 1:n())) %>%
  pivot_wider(names_from = Var, values_from = name) %>%
  ungroup() %>%
  select(-id)
Output:
# A tibble: 9 x 5
V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr>
1 a4 a5 a3 a2 a6
2 a4 a1 a5 a7 a6
3 a1 a6 a7 a4 a5
4 a1 a6 a5 a7 a4
5 a6 a5 a7 a4 a1
6 a4 a5 a6 a1 a7
7 a2 a1 a3 a7 a5
8 a1 a2 a3 a5 a7
9 a1 a2 a3 a5 a6
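If dplyr >= 1.0 is available, slice_max() can replace the arrange/mutate/filter trio, since it already returns each group's rows ordered from largest to smallest; a compact sketch on the question's DF (the V1..V5 naming mirrors the answers above):

```r
library(dplyr)
library(tidyr)

DF <- data.frame(a1 = c(10, 45, 100, 5000, 23, 45, 2, 23, 56),
                 a2 = c(60, 20, 5, 2, 1, 2, 3, 4, 5),
                 a3 = c(90, 2, 0, 0, 0, 4, -5, -3, -2),
                 a4 = c(900, 122, 30, 40, 50, 64, -75, -83, -92),
                 a5 = c(190, 32, 30, 50, 80, 49, -50, -7, -2),
                 a6 = c(30, 27, 80, 54, 84, 49, -50, -37, -23),
                 a7 = c(0, 32, 39, 50, 80, 9, -5, -7, -23))

top5 <- DF %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  # top 5 values per row, already ordered from largest to smallest
  slice_max(value, n = 5, with_ties = FALSE) %>%
  mutate(Var = paste0("V", row_number())) %>%
  pivot_wider(id_cols = id, names_from = Var, values_from = name) %>%
  ungroup() %>%
  select(-id)

top5
```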

Match in R while disregarding order [duplicate]

This question already has an answer here:
Match Dataframes Excluding Last Non-NA Value and disregarding order
(1 answer)
Closed 5 years ago.
I am trying to do a match in R regardless of the order of the columns.
Basically, the problem I am trying to solve is: if all of the values in the columns of df2 (from column 2 to the end) are found in a row of df1 (after Partner), then match that df1 row.
Here's the catch: disregard the last non-NA value in each row when doing this match, but include it in the final output. So the last non-NA value is not taken into account when matching, but it still appears in the result.
After the match, determine if that last non-NA value exists in any of the columns of its matched row.
df1
Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA
df2
lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1
How do I match df1 with df2 so that the following happens:
1) Disregards the order of the columns found in both dataframes.
2) Then determine if the last non-na value exists in the row currently.
Final output:
df3
Partner Col1 Col2 Col3 Col4 lift rule1 rule2 rule3 EXIST?
A A1 A2 NA NA 11 A2 A1 A9 YES
A A1 A2 NA NA 10 A1 A3 NA NOPE
B A2 B9 NA NA 11 B9 A2 D7 YES
B A2 B9 NA NA 11 A2 B9 B1 YES
D Q1 Q3 Q4 NA 10 Q4 Q1 NA YES
I get one more B match than you, but this solution is very close to what you want. You first have to add an id column to each data frame, as we use it to reconstruct the data. Then, to perform the match, melt both data frames with gather from tidyr and use inner_join from dplyr. We then cbind the original data.frames using the ids.
library(tidyr);library(dplyr)
df1 <- read.table(text="Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA",header=TRUE, stringsAsFactors=FALSE)
df2 <- read.table(text="lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1",header=TRUE, stringsAsFactors=FALSE)
df1 <- cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
#melt with gather
d11 <- df1 %>% gather(Col, Value, starts_with("C"))   # long format
d11 <- d11 %>% na.omit() %>% group_by(df1_id) %>% slice(-n())   # remove last non-NA per id
d22 <- df2 %>% gather(Rule, Value, starts_with("r"))  # long format
res <- inner_join(d11, d22)
cbind(df1[res$df1_id,],df2[res$df2_id,])
df1_id Partner Col1 Col2 Col3 Col4 df2_id lift rule1 rule2 rule3
1 1 A A1 A2 <NA> <NA> 2 10 A1 A3 <NA>
1.1 1 A A1 A2 <NA> <NA> 1 11 A2 A1 A9
2 2 B A2 B9 <NA> <NA> 1 11 A2 A1 A9
2.1 2 B A2 B9 <NA> <NA> 5 11 A2 B9 B1
2.2 2 B A2 B9 <NA> <NA> 3 11 B9 A2 D7
4 4 D Q1 Q3 Q4 <NA> 4 10 Q4 Q1 <NA>
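The answer above stops short of the EXIST? column. One consistent reading of the expected output is: EXIST? is YES exactly when the df1 row's dropped last non-NA value appears among the matched df2 row's rules. A self-contained sketch of that reading (the helper last_non_na and the pivot_longer rewrite are my own; it reproduces the answer's extra B match, which comes out NOPE):

```r
library(dplyr)
library(tidyr)

df1 <- read.table(text = "Partner Col1 Col2 Col3 Col4
A A1 A2 NA NA
B A2 B9 NA NA
C B7 V9 C1 N9
D Q1 Q3 Q4 NA", header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "lift rule1 rule2 rule3
11 A2 A1 A9
10 A1 A3 NA
11 B9 A2 D7
10 Q4 Q1 NA
11 A2 B9 B1", header = TRUE, stringsAsFactors = FALSE)

# Helper: the last non-NA Col value of each df1 row (kept for the EXIST? test).
last_non_na <- function(x) tail(x[!is.na(x)], 1)
df1$last <- apply(df1[paste0("Col", 1:4)], 1, last_non_na)

df1$df1_id <- seq_len(nrow(df1))
df2$df2_id <- seq_len(nrow(df2))

# Long format, dropping the last non-NA Col value before matching.
d11 <- df1 %>%
  pivot_longer(starts_with("Col"), values_to = "Value") %>%
  filter(!is.na(Value)) %>%
  group_by(df1_id) %>%
  slice(-n()) %>%
  ungroup()
d22 <- df2 %>%
  pivot_longer(starts_with("rule"), values_to = "Value") %>%
  filter(!is.na(Value))

pairs <- inner_join(d11, d22, by = "Value") %>% distinct(df1_id, df2_id)

out <- cbind(df1[pairs$df1_id, ], df2[pairs$df2_id, ])
# YES when the dropped value reappears among that df2 row's rules.
out$EXIST <- ifelse(
  mapply(function(i, j) df1$last[i] %in% unlist(df2[j, paste0("rule", 1:3)]),
         pairs$df1_id, pairs$df2_id),
  "YES", "NOPE")
out
```

On this data it yields 6 matched rows: the 5 from the expected df3 (with A/row-2 as NOPE, the rest YES) plus the extra B vs (11, A2, A1, A9) pairing, also NOPE.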
