R function to replace tricky merge in Excel (vlookup + hlookup) - r

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the column from v1 through v3: in the first row I match id = 1 and column v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example dataframes are simplified versions, as I have more records and the columns go from v1 up to v50.
Thanks!

You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)

You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the names of the data objects are d1 and d2, where d1 is the lookup table and d2 holds the id/v pairs. I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
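The one-liner above relies on each id being equal to its row position in d1 and each v being the column position once id is dropped. If that might not hold (for example, ids with gaps), a minimal sketch along the same lines is to look up the positions by name first. The helper name lookup_key is hypothetical, and the sketch assumes the lookup columns are named v1, v2, and so on:

```r
# Lookup table (the question's "inputs") and the id/v pairs (the question's "df").
d1 <- read.table(text = "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F", header = TRUE, stringsAsFactors = FALSE)

d2 <- read.table(text = "
id v
1 1
1 2
1 3
2 2
3 1", header = TRUE)

# Hypothetical helper: resolve row and column positions by name, then
# index with a two-column (row, column) matrix.
lookup_key <- function(lookup, query) {
  rows <- match(query$id, lookup$id)                  # row holding each id
  cols <- match(paste0("v", query$v), names(lookup))  # column named v<k>
  as.matrix(lookup)[cbind(rows, cols)]
}

result <- cbind(d2, key = lookup_key(d1, d2))
result
```

Pairs that have no matching id or column come back as NA rather than silently picking the wrong cell.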

Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
key <- append(df[df2$id[i],(df2$v[i] + 1L)] , key)
}
df2$key <- rev(key)
df2
#> id v key
#> 1 1 1 A
#> 2 1 2 A
#> 3 1 3 C
#> 4 2 2 D
#> 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

Related

R delete fathers row based on sons in hierarchycal data

I'm working with some data like these:
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.frame(id,name)
data
> data
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
7 3 e
8 3 f
9 3 k
10 4 f
11 4 u
My goal is this: if a father has even one son that I do not want, remove all the rows with the same father as the disliked son. For example, if I don't like the son e, the result should be:
> data_e
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Because the rows with id 2 and 3 have in their name e.
This could be also a task like " I do not like e and f together":
> data_eandf
id name
1 1 a
2 1 b
3 1 k
4 2 b
5 2 e
6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
10 4 f
11 4 u
Or, "I don't want you if you have e or f":
> data_eorf
id name
1 1 a
2 1 b
3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 3 e
# 8 3 f
# 9 3 k
# 10 4 f
# 11 4 u
As you've noticed, to be more clear, I've "commented" the must-be-deleted rows.
I've searched, but I've found a lot of questions based on only one column, like data[which(data$name=='e'),], which removes rows only at the sons' level, not all the rows of the related father. I've also thought of putting the data in wide format, pasting all the names of an id into a single cell, and checking whether e is present with a function like grepl(), but I think this could be a problem with a large dataset (these data are just an example).
Do you have any idea about how to manage this?
Thanks in advance
Here's a function to handle the different cases
dislike1 <- c('e')
dislike2 <- c('e', 'f')
myfun <- function(df, dislike, ops = NULL) {
require(dplyr)
if (is.null(ops) || ops == 'OR') {
df %>%
group_by(id) %>%
filter(!any(name %in% dislike)) %>%
ungroup
} else if (ops == 'AND') {
df %>%
group_by(id) %>%
filter(!all(dislike %in% name)) %>%
ungroup
}
}
myfun(data, dislike1)
# A tibble: 5 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 4 f
# 5 4 u
myfun(data, dislike2, 'AND')
# A tibble: 8 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
# 4 2 b
# 5 2 e
# 6 2 g
# 7 4 f
# 8 4 u
myfun(data, dislike2, 'OR')
# A tibble: 3 x 2
# id name
# <dbl> <fct>
# 1 1 a
# 2 1 b
# 3 1 k
data[!(data$id %in% unique(data[data$name == 'e', 'id'])),]
unique(data[data$name == 'e', 'id']) will get the unique id's that have 'e' in the name field. Then you can use the %in% operator to find all the rows with those id's. The ! is a negation operator.
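The same base R idea can probably be stretched to the "e and f together" case. This is only a sketch, assuming the data frame from the question; the names dislike, has_all and bad_ids are made up for illustration:

```r
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4)                          # fathers
name <- c("a", "b", "k", "b", "e", "g", "e", "f", "k", "f", "u")  # sons
data <- data.frame(id, name, stringsAsFactors = FALSE)

dislike <- c("e", "f")

# For each id, TRUE when every disliked name occurs among its sons.
has_all <- tapply(data$name, data$id, function(sons) all(dislike %in% sons))
bad_ids <- names(has_all)[has_all]

# Drop every row that belongs to one of those ids.
data_eandf <- data[!(as.character(data$id) %in% bad_ids), ]
data_eandf
```

Swapping all() for any() inside the function should give the "e or f" variant.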
I have a data.table solution
require(data.table)
id <- c(1,1,1,2,2,2,3,3,3,4,4) # fathers
name <- c('a','b','k','b','e','g','e','f','k','f','u') # sons
data <- data.table(id,name)
# names to be deleted
to_del <- c("e","f")
# returns only id's without any of the names to be deleted
data[ , .SD[ !any(name %in% to_del) ,name ] , by = "id"]
id V1
1: 1 a
2: 1 b
3: 1 k

R: Collapse duplicated values in a column while keeping the order

I'm sure this is super simple but just can't find the answer. I have a data frame like so
Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A
And I'd like to group by Id and collapse the distinct event values while keeping the event order like so
Id event
1 1 A
2 1 B
3 1 A
4 2 C
5 2 A
Most of my searches end up with using the distinct() or unique() functions but that leads losing the A event in row 3 for Id 1.
Thanks in advance!
We can use lead to compare each row with the next one and keep only the rows that differ from the row that follows. is.na(lead(Id)) makes sure the last row is also included.
library(dplyr)
dat2 <- dat %>%
filter(!(Id == lead(Id) & event == lead(event)) | is.na(lead(Id)))
dat2
# Id event
# 1 1 A
# 2 1 B
# 3 1 A
# 4 2 C
# 5 2 A
DATA
dat <- read.table(text = " Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header = TRUE, stringsAsFactors = FALSE)
You can just compare every row with the one after it.
df = read.table(text=" Id event
1 1 A
2 1 B
3 1 A
4 1 A
5 2 C
6 2 C
7 2 A",
header=TRUE)
df[c(rowSums(df[-1,] == head(df, -1)) != 2, TRUE), ]  # TRUE always keeps the last row instead of relying on recycling
Id event
1 1 A
2 1 B
4 1 A
6 2 C
7 2 A
Here is a solution with data.table:
library("data.table")
dt <- fread(
" Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A")
unique(dt[, r:=rleidv(event), Id])[, -3]
# Id event
# 1: 1 A
# 2: 1 B
# 3: 1 A
# 4: 2 C
# 5: 2 A
or
dt[, .SD[!duplicated(rleidv(event))], by = Id]
(thx to @mt1022 for the comment)
A base R solution using tapply and rle:
x <- tapply(dat$event,dat$Id,function(x) rle(x)$values)
do.call(rbind,Map(data.frame,Id=names(x),event=x))
# Id event
# 1.1 1 A
# 1.2 1 B
# 1.3 1 A
# 2.1 2 C
# 2.2 2 A
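If you would rather keep the original data frame and its row names, a minimal base R sketch is to flag the rows where a new run starts. It assumes the rows are already in the order you care about; starts_run and collapsed are just illustrative names:

```r
dat <- read.table(text = "Id event
1 A
1 B
1 A
1 A
2 C
2 C
2 A", header = TRUE, stringsAsFactors = FALSE)

n <- nrow(dat)
# A row starts a new run when its Id or event differs from the row above;
# the first row always starts a run.
starts_run <- c(TRUE, dat$Id[-1] != dat$Id[-n] | dat$event[-1] != dat$event[-n])
collapsed <- dat[starts_run, ]
collapsed
```

This keeps the first row of each run, so the surviving row names are 1, 2, 3, 5 and 7.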
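The distinct function also removes repeats, but note that it drops every repeated Id/event pair, not only consecutive ones, so the second A for Id 1 is lost (which is what the question wants to avoid):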
dat %>%
distinct(Id, event)

How to calculate the frequency of each value in a column corresponding to each value in another column in R?

I have a dataset as follows:
col1 col2
A 1
A 2
A 2
B 1
B 1
C 1
C 1
C 2
I want the output as:
col1 col2 Frequency
A 1 1
A 2 2
B 1 2
C 1 2
C 2 1
I tried using the aggregate function and also the table function but I am unable to get desired result.
You can add a dummy column or use the rownames to aggregate on:
aggregate(rownames(mydf) ~ ., mydf, length)
# col1 col2 rownames(mydf)
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 C 2 1
table also works fine but will report combinations that may not be in your data as "0":
data.frame(table(mydf))
# col1 col2 Freq
# 1 A 1 1
# 2 B 1 2
# 3 C 1 2
# 4 A 2 2
# 5 B 2 0
# 6 C 2 1
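If the zero rows are unwanted, one way that should work is to filter them out after converting the table. A small sketch, assuming mydf holds the data from the question (it is rebuilt here so the snippet runs on its own):

```r
mydf <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "C", "C", "C"),
  col2 = c(1, 2, 2, 1, 1, 1, 1, 2)
)

counts <- as.data.frame(table(mydf))   # all combinations, including zeros
counts <- counts[counts$Freq > 0, ]    # keep only combinations that occur
counts <- counts[order(counts$col1, counts$col2), ]
names(counts)[names(counts) == "Freq"] <- "Frequency"
counts
```

Note that table() turns col1 and col2 into factors, so convert them back if you need the original types.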
Another nice approach is to use "data.table":
library(data.table)
as.data.table(mydf)[, .N, by = names(mydf)]
if your data is
col1 <- c("A","A","A","B","B","C","C","C")
col2 <- c(1,2,2,1,1,1,1,2)
df <- data.frame(col1,col2)
you can use dplyr
1) group_by both variables, since your output is supposed to include every combination of them
2) count the number of observations for each group using n()
library(dplyr)
df %>% group_by(col1,col2) %>% summarize(frequency=n())
# output
col1 col2 frequency
1 A 1 1
2 A 2 2
3 B 1 2
4 C 1 2
5 C 2 1

R counting strings variables in each row of a dataframe

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe, where ideally the headers would be the string variables in the previous dataframe (a, b, c, d) and the contents of each row would be the number of occurrences of each respective variable from the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(mydf),
stack(lapply(mydf, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
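Another base R variant that avoids stacking: cross-tabulate the row number of every cell against its value. This is a sketch assuming all columns hold strings; m, counts and df2 are illustrative names:

```r
df <- read.table(text = "V1 V2 V3 V4 V5
a a d d b
c a b d a
d b a a b
d d a b c
c a d c c", header = TRUE, stringsAsFactors = FALSE)

m <- as.matrix(df)
# row(m) gives the row number of every cell, so this counts each value
# per row in a single call; columns come out in alphabetical order.
counts <- table(row(m), m)
df2 <- as.data.frame.matrix(counts)
df2
```

The column names are taken from the values found in the data, so nothing has to be listed by hand.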

melt data frame and split values

I have the following data frame with measurements concatenated into a single column, separated by some delimiter:
df <- data.frame(v1=c(1,2), v2=c("a;b;c", "d;e;f;g"))
df
v1 v2
1 1 a;b;c
2 2 d;e;f;g
I would like to melt/transforming it into the following format:
v1 v2
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 2 f
7 2 g
Is there an elegant solution?
Thx!
You can split the strings with strsplit.
Split the strings in the second column:
splitted <- strsplit(as.character(df$v2), ";")
Create a new data frame:
data.frame(v1 = rep.int(df$v1, sapply(splitted, length)), v2 = unlist(splitted))
The result:
v1 v2
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 2 f
7 2 g
