Merging two dataframes in R and arranging rows based on certain conditions


I have two data frames, df1 and df2, which I have combined into a third data frame, df3:
df1 <- data.frame(
  Name = c("A", "B", "C"),
  Value = c(1, 2, 3),
  Method = c("Indirect"))
df2 <- data.frame(
  Name = c("A", "B"),
  Value = c(4, 5),
  Method = c("Direct"))
df3 <- rbind(df1, df2)
So df3 looks like this:
  Name Value   Method
1    A     1 Indirect
2    B     2 Indirect
3    C     3 Indirect
4    A     4   Direct
5    B     5   Direct
Now I need to find every Name that occurs only once (C in this case) and, for each of them, add a row with the same Name, a Value of 0 and the opposite Method.
Finally, rows that share a Name should be placed one below the other, so the output should look like this:
  Name Value   Method
1    A     1 Indirect
2    A     4   Direct
3    B     2 Indirect
4    B     5   Direct
5    C     3 Indirect
6    C     0   Direct
My real data frame is huge, so I need to do this as efficiently as possible in R. How should I proceed?

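One way:
# Keep only the rows whose Name occurs exactly once in df3
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
# Flip the method for those rows
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")
tmp
#   Name Value Method
# 3    C     0 Direct
You can now rbind this to your original data and sort the result by Name.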
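The final step this answer alludes to might look like the sketch below. It rebuilds the question's df1 and df2 so that it runs on its own; the name out and the row-name reset are additions for illustration, not part of the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(Name = c("A", "B", "C"), Value = c(1, 2, 3), Method = "Indirect")
df2 <- data.frame(Name = c("A", "B"), Value = c(4, 5), Method = "Direct")
df3 <- rbind(df1, df2)

# Rows whose Name occurs exactly once, with Value zeroed and Method flipped
tmp <- df3[!(df3$Name %in% df3$Name[duplicated(df3$Name)]), ]
tmp$Value <- 0
tmp$Method <- ifelse(tmp$Method == "Direct", "Indirect", "Direct")

# Append the new rows and group equal names together.
# order() leaves ties in their original order, so the original rows
# stay ahead of the added ones within each Name.
out <- rbind(df3, tmp)
out <- out[order(out$Name), ]
rownames(out) <- NULL  # 'out' is a hypothetical name; row names reset for tidy printing
out
```

Sorting only by Name keeps the appended zero-value row last within its group; add further keys to order() if a different within-group order is wanted.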

Here is another solution, using data.table.
Reprex
Code
library(data.table)
library(magrittr) # for the %>% pipe
setDT(df3)
df3 <- rbindlist(list(df3,
                      df3[!(df3$Name %in% df3[duplicated(Name)]$Name)
                      ][, `:=` (Value = 0, Method = fifelse(Method == "Indirect", "Direct", "Indirect"))])) %>%
  setorder(., Name)
Output
df3
#> Name Value Method
#> 1: A 1 Indirect
#> 2: A 4 Direct
#> 3: B 2 Indirect
#> 4: B 5 Direct
#> 5: C 3 Indirect
#> 6: C 0 Direct
Created on 2021-12-15 by the reprex package (v2.0.1)

I think that with 10,000 rows you will barely notice it:
library(dplyr)
df3 |>
  add_count(Name) |>
  filter(n == 1) |>
  mutate(
    Value = 0,
    Method = c(Indirect = 'Direct', Direct = 'Indirect')[Method],
    n = NULL
  ) |>
  bind_rows(df3) |>
  arrange(Name, Value, Method)
# Name Value Method
# 1 A 1 Indirect
# 2 A 4 Direct
# 3 B 2 Indirect
# 4 B 5 Direct
# 5 C 0 Direct
# 6 C 3 Indirect

Related

Conditional creation (mutate) of new columns

I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
  mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in that way, preferably in a tidyverse style. I know I could loop over the missing columns, but I'm wondering whether there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)

This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
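As a hedged illustration of how this handles the cases from the question (one column missing, several missing, none missing), the sketch below wraps the same idea in a small helper. The function name add_missing_cols and the explicit guard for the nothing-missing case are assumptions added here, and plain data frames are used instead of tibbles so that no package is needed.

```r
# Hypothetical helper: add every column named in 'cols' that 'data' lacks
add_missing_cols <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))  # setdiff() is base R
  if (length(missing_cols) > 0) {             # nothing to do when all columns exist
    data[missing_cols] <- NA                  # NA is recycled across all new columns
  }
  data
}

col_vector <- c("A", "B", "C")
df  <- data.frame(A = 1:2, B = 1:2)           # one column missing
df2 <- data.frame(A = 1:2)                    # two columns missing
df3 <- data.frame(A = 1:2, B = 1:2, C = 1:2)  # no column missing

names(add_missing_cols(df, col_vector))
names(add_missing_cols(df2, col_vector))
identical(add_missing_cols(df3, col_vector), df3)
```

The new columns are logical NA columns, as in the question's expected output; assign NA_integer_ or NA_character_ instead if a specific type is needed.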

Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # For tibble(); setdiff() itself is available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
Advantages
An advantage of my solution over the alternative linked above by @geoff is that you need not hand-code the set of column names, as symbols and strings, within the dplyr workflow:
df %>% mutate(
  #####################################
  A = ifelse("A" %in% names(.), A, NA),
  B = ifelse("B" %in% names(.), B, NA),
  C = ifelse("C" %in% names(.), C, NA)
  # ...
  # etc.
  #####################################
)
My solution is by contrast more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.

R creating a function using plyr revalue with multiple inputs

I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code format for different columns that I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
                   replace = c("rural" = 1,
                               "suburban" = 2,
                               "urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
                    value1, num_val1,
                    value2, num_val2,
                    value3, num_val3) {
  # plyr using revalue
  data$columnX <- revalue(x = data$columnX,
                          replace = c(value1 = num_val1,
                                      value2 = num_val2,
                                      value3 = num_val3))
  # set as numeric
  data$columnX <- as.numeric(data$columnX)
  # return dataset
  return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I can simply convert the data with:
df$Area <- as.numeric(as.factor(df$Area))
but I want to assign a specific value to each category rather than let R choose.
Any assistance appreciated.

As already mentioned by @MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result is dplyr::recode. Additionally, if you want to write your own function, I would suggest passing the values to recode and their replacements as two vectors instead of passing each value and replacement as a separate argument:
library(dplyr)
set.seed(42)
df <- data.frame(
  Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
  recode_table <- setNames(replacements, values)
  data[[x]] <- recode(data[[x]], !!!recode_table)
  data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
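For context on why the function in the question fails: data$columnX looks for a column literally named "columnX", and c(value1 = num_val1) creates an element literally named "value1", so the arguments are never substituted. A minimal base R sketch of the same fix, without dplyr, is shown below; the function name recode_column and the small example data are assumptions made for illustration.

```r
# Hypothetical helper: map the categories in one column to chosen numbers
recode_column <- function(data, column, values, replacements) {
  lookup <- setNames(replacements, values)  # e.g. c(rural = 1, suburban = 2, urban = 3)
  # [[ ]] takes the column name as a string, unlike $;
  # indexing the named vector by the old values returns the new ones
  data[[column]] <- unname(lookup[as.character(data[[column]])])
  data
}

df <- data.frame(Area = c("rural", "urban", "suburban", "rural"))
df <- recode_column(df, "Area", c("rural", "suburban", "urban"), 1:3)
df$Area
```

Categories that are absent from values come back as NA, which makes unexpected levels easy to spot.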

You can use case_when with across.
If the columns that you want to change are called col1 and col2, you can do:
library(dplyr)
df <- df %>%
  mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
                                          . == 'suburban' ~ 2,
                                          . == 'urban' ~ 3)))
Depending on your actual column names, you can also select the columns in across with starts_with, ends_with, or a range of columns such as A:Z.

Replace sets of rows in a tibble with a single row

There is a similar question for data.table at Replace sets of rows in a data.table with a single row but I am looking for an equivalent solution in tidyverse. So, I have a tibble like:
DT <- tibble(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  location = c("a", "b", "c", "d", "a", "b", "d", "e"),
  seq = c(1, 2, 3, 4, 1, 2, 3, 4))
For every id, I want to look for the sequence b, c, d and, if it occurs, replace the b and c rows with a single row whose location is, say, z. The other variables (here id and seq) should keep the values from the b row.
So in this case, the new tibble should be
DT.Tobe <- tibble(
  id = c(1, 1, 1, 2, 2, 2, 2),
  location = c("a", "z", "d", "a", "b", "d", "e"),
  seq = c(1, 2, 4, 1, 2, 3, 4))
I was not able to find even a starting point for this...

library(dplyr)
# library(zoo) # rollapply
DT %>%
  group_by(id) %>%
  mutate(
    isseq = zoo::rollapply(location, 3, FUN = function(z) identical(z, c("b", "c", "d")), align = "left", partial = TRUE),
    isseq = isseq | lag(isseq, default = FALSE)
  ) %>%
  group_by(id, isseq) %>%
  summarize(
    across(everything(), ~ {
      if (cur_group()$isseq) {
        if (cur_column() == "location") "z" else first(.)
      } else .
    })
  ) %>%
  ungroup() %>%
  select(-isseq)
# # A tibble: 7 x 3
# id location seq
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 d 4
# 3 1 z 2
# 4 2 a 1
# 5 2 b 2
# 6 2 d 3
# 7 2 e 4
The order changes because group_by(id, isseq) keeps rows with the same isseq value together. It should be easy either to re-order afterwards (assuming "seq" is meaningful) or to add an order variable beforehand and sort by it later.
If a single id can contain several such sequences (if so, say something), then run-length encoding will be needed here as well, to distinguish the different b-c-d sequences within the same id.
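If several b-c-d runs per id are possible, or zoo is not available, a vectorised base R sketch along the following lines may be an alternative. It is not the answer's method: the function name collapse_bcd and its arguments are hypothetical, and it assumes the rows are already sorted by id and seq.

```r
# Hypothetical helper: collapse each pattern run (default b, c, d) into "z", "d"
collapse_bcd <- function(dt, pattern = c("b", "c", "d"), replacement = "z") {
  n <- nrow(dt)
  if (n < 3) return(dt)
  i <- seq_len(n - 2)                      # candidate start rows
  hit <- dt$location[i] == pattern[1] &
    dt$location[i + 1] == pattern[2] &
    dt$location[i + 2] == pattern[3] &
    dt$id[i] == dt$id[i + 2]               # the run must stay within one id
  starts <- i[hit]
  dt$location[starts] <- replacement       # the "b" row becomes "z"
  if (length(starts) > 0) {
    dt <- dt[-(starts + 1), ]              # drop the "c" row of every run
  }
  rownames(dt) <- NULL
  dt
}

DT <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2),
  location = c("a", "b", "c", "d", "a", "b", "d", "e"),
  seq = c(1, 2, 3, 4, 1, 2, 3, 4)
)
res <- collapse_bcd(DT)
res
```

Because rows are only relabelled and dropped in place, the original row order is preserved and no re-sorting is needed afterwards.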

One possible option is a loop over the rows. Here is a sketch, assuming the rows are already sorted by id and seq:
i <- 1
while (i <= nrow(DT) - 2) {              # leave room to look two rows ahead
  same_id <- DT$id[i] == DT$id[i + 2]    # all three rows belong to the same id
  if (same_id && identical(DT$location[i:(i + 2)], c("b", "c", "d"))) {
    DT$location[i] <- "z"                # replace "b" with "z"
    DT <- DT[-(i + 1), ]                 # delete the "c" row
  }
  i <- i + 1
}
The dplyr cheat sheet lists functions that may help with a more vectorised version of the if-statement part of this sketch.
What are your thoughts?

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
  id = 1:2,
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = F,
  stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
  id = c(1, 2, 1, 2),
  question = letters[1:4],
  points = c(0, 0, 0, 20),
  max_points = c(3, 5, 5, 20),
  stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Neither delivers the desired output. Furthermore, with more columns than shown here, gather no longer works at all, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
  gather(key, val, -id) %>%
  mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
  spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.

Try this:
do.call(rbind,
        lapply(seq(2, ncol(input), 3), function(i){
          input[, c(1, i:(i + 2))]
        })
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
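The answer above hard-codes a block width of 3. With 133 columns it may be safer to derive the width from the column names, as in the hedged sketch below. It assumes that the first column is the id and that every block repeats the same column names in the same order; the names width, starts, pieces and result are illustrative.

```r
input <- data.frame(
  id = 1:2,
  question = c("a", "b"), points = 0, max_points = c(3, 5),
  question = c("c", "d"), points = c(0, 20), max_points = c(5, 20),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Number of distinct non-id column names = width of one repeated block
width <- length(unique(names(input)[-1]))
stopifnot((ncol(input) - 1) %% width == 0)   # guard: blocks must be complete

# First column index of every block
starts <- seq(2, ncol(input), by = width)

# Cut out id + one block at a time, then stack the pieces
pieces <- lapply(starts, function(i) input[, c(1, i:(i + width - 1))])
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Selecting one block at a time avoids duplicate names within each piece, so rbind can match the columns by name.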

The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
  input,
  id.vars = "id",
  measure.vars = patterns("question", "^points$", "max_points"),
  value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.

Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20

You might need to clarify how you want the ID column to be handled, but perhaps something like this?
runme <- function(word , dat){
  grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
I am not sure how robust it is when columns are repeated different numbers of times, but it works for your test data and should work if your columns are all repeated the same number of times.

Deduplicating a data frame when the order of values may differ in R

Let's say I have a data.frame that looks like this:
df = data.frame(from=c(1, 1, 2, 1),
                to=c(2, 3, 1, 4),
                title=c("A", "B", "A", "A"),
                stringsAsFactors=F)
df is an object that holds all of the various connections for a network graph. I also have a second data.frame, which is the simplified graph data:
df2 = data.frame(from=c(1, 1, 3),
                 to=c(2, 4, 1),
                 stringsAsFactors=F)
What I need is to pull the title values from df into df2. I can't simply dedup df because a) from and to can be in different orders, and b) title is not unique between connections. The current condition I have is:
df2$title = df$title[df2$from == df$from & df2$to == df$to]
However, this results in too few rows because the order of from and to is reversed in row 3 of df2. If I introduce an OR condition, then I get too many results because the connection between 1 and 2 will be matched twice.
My question, then, is how do I effectively "dedup" the title variable to append it to df2?
The expected outcome is this:
from to title
1 1 2 A
2 1 4 A
3 3 1 B

library(dplyr)
merge(mutate(df2, from1 = pmin(from, to), to1 = pmax(from, to)),
      mutate(df, from1 = pmin(from, to), to1 = pmax(from, to)),
      by = c("from1", "to1"), all.x = TRUE) %>%
  select(from1, to1, title) %>%
  unique()
# from1 to1 title
#1 1 2 A
#3 1 3 B
#4 1 4 A
Another approach: the edgeSort function below builds an order-independent key for each edge by sorting its two vertices, so that (1, 3) and (3, 1) produce the same key, and match() then looks up each edge of df2 in df.
edgeSort <- function(df) apply(df, 1, function(row) paste0(sort(row[1:2]), collapse = ", "))
df2$title <- df$title[match(edgeSort(df2), edgeSort(df))]
df2
from to title
1 1 2 A
2 1 4 A
3 3 1 B
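The same key idea can be written without apply(), as in the sketch below. The helper name edge_key is hypothetical. One reason this variant may be preferable: as far as I know, apply() first converts a data frame with a character column to a character matrix, and numeric values can be padded with spaces in that conversion, which could make keys built from df and df2 disagree once vertex ids have different numbers of digits.

```r
df <- data.frame(from = c(1, 1, 2, 1),
                 to = c(2, 3, 1, 4),
                 title = c("A", "B", "A", "A"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(from = c(1, 1, 3),
                  to = c(2, 4, 1))

# Hypothetical helper: smaller vertex first, larger second,
# so that (3, 1) and (1, 3) map to the same key
edge_key <- function(from, to) {
  paste(pmin(from, to), pmax(from, to), sep = "-")
}

# match() returns the first row of df with the same key,
# so a connection that appears twice in df is used only once
df2$title <- df$title[match(edge_key(df2$from, df2$to),
                            edge_key(df$from, df$to))]
df2
```

Edges of df2 that do not occur in df get NA as their title.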

I guess you can do it in base R by 2 merge statements:
step1 <- merge(df2, df, all.x = TRUE)
step2 <- merge(df2[is.na(step1$title),], df, all.x = TRUE, by.x = c("to", "from"), by.y = c("from", "to"))
rbind(step1[!is.na(step1$title),], step2)
from to title
1 1 2 A
2 1 4 A
3 3 1 B
