Creating an additional data frame from two existing data frames in R

Input Data Frame
Data Frame 1 (example, nrow = 100)
Col A | Col B | Col C
a     | 1     | 2
a     | 3     | 4
b     | 5     | 6
c     | 9     | 10
Data Frame 2 (example, nrow = 200)
Col A | Col B | Col E
a     | 1     | 22
a     | 31    | 41
a     | 3     | 63
b     | 5     | 6
b     | 11    | 13
c     | 9     | 20
I want to create a third data set that contains, for each Col A entry, the additional rows found in Data Frame 2 but not in Data Frame 1.
Output File (nrow = 200 - 100 = 100)
Col A | Col B | Col E
a     | 31    | 41
b     | 11    | 13

You could add a within-group row number to each data frame, and then do an anti_join on the key and that row number:
library(tidyverse)
df2 %>%
  group_by(ColA) %>%
  mutate(rn = row_number()) %>%
  anti_join(df1 %>% group_by(ColA) %>% mutate(rn = row_number()),
            by = c("ColA", "rn")) %>%
  select(-rn)
Output
# A tibble: 2 x 3
# Groups: ColA [2]
  ColA   ColD  ColE
  <chr> <dbl> <dbl>
1 a        51    63
2 b        11    13
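The same idea can be sketched in base R without a join, assuming the ColA/ColD/ColE data listed under "data" later in this thread. The names rn2, n1 and extra are made up for this illustration. A row of df2 is kept when its position within its ColA group is larger than the number of df1 rows with the same key:

```r
# Data as listed in the "data" block of this thread
df1 <- data.frame(ColA = c("a", "a", "b", "c"),
                  ColB = c(1, 3, 5, 9),
                  ColC = c(2, 4, 6, 10))
df2 <- data.frame(ColA = c("a", "a", "a", "b", "b", "c"),
                  ColD = c(12, 31, 51, 71, 11, 93),
                  ColE = c(22, 41, 63, 86, 13, 20))

# Position of each df2 row within its ColA group (1, 2, 3, ...)
rn2 <- ave(seq_len(nrow(df2)), df2$ColA, FUN = seq_along)

# Number of df1 rows sharing each df2 row's key (0 if the key is absent)
n1 <- vapply(df2$ColA, function(k) sum(df1$ColA == k), integer(1))

# Keep the df2 rows whose within-group position exceeds the df1 count
extra <- df2[rn2 > n1, ]
extra
```

Like the anti_join version, this treats the first n rows of each df2 group as the ones already "covered" by df1, so it depends on row order rather than on the values in the other columns.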

If a loop is needed: create an empty data frame with the column names of the second data set, then loop over the unique values of 'ColA' in 'df2'. For each value, subset 'df2', compute the difference ('cnt') between the number of rows in that subset and the number of matching rows in 'df1', and rbind the last 'cnt' rows of the subset onto 'out'.
# Create an empty dataset structure
out <- data.frame(ColA = character(), ColD = numeric(), ColE = numeric())
# Get the unique values of the column
un1 <- unique(df2$ColA)
# Loop over the unique values
for (un in un1) {
  # subset the dataset df2
  tmp <- subset(df2, ColA == un)
  # get the difference in row count
  cnt <- nrow(tmp) - sum(df1$ColA == un)
  # use the count to take the tail of the subset,
  # then rbind and assign back to out
  out <- rbind(out, tail(tmp, cnt))
}
row.names(out) <- NULL
out
# ColA ColD ColE
#1 a 51 63
#2 b 11 13
For multiple key columns, we could paste them together to create a single key column. Example data:
df1 <- data.frame(ColA = c('a', 'a', 'b', 'c'), ColB = c(1, 3, 5, 9),
ColC = c(2, 4, 6, 10))
df2 <- data.frame(ColA = c('a', 'a', 'a', 'b', 'b', 'c'),
ColB = c(1, 31, 3, 5, 11, 9), ColE = c(22, 41, 63, 6, 13, 20))
Create the function:
f1 <- function(data1, data2, by_cols) {
  # Build a single key column by pasting the by_cols together
  data2$new <- do.call(paste, data2[by_cols])
  data1$new <- do.call(paste, data1[by_cols])
  # Create an empty dataset with the structure of data2
  out <- data2[0, ]
  # Get the unique key values
  un1 <- unique(data2$new)
  # Loop over the unique values
  for (un in un1) {
    # subset the second dataset
    tmp <- subset(data2, new == un)
    # get the difference in row count
    cnt <- nrow(tmp) - sum(data1$new == un)
    # use the count to take the tail of the subset,
    # then rbind and assign back to out
    out <- rbind(out, tail(tmp, cnt))
  }
  out$new <- NULL
  row.names(out) <- NULL
  out
}
f1(df1, df2, c("ColA", "ColB"))
# ColA ColB ColE
#1 a 31 41
#2 b 11 13
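One edge case worth checking in the loops above: if 'df1' has more rows for a key than 'df2', 'cnt' becomes negative, and tail() with a negative n drops the first |n| rows instead of returning none. A minimal sketch of the behaviour, with a made-up three-row subset and a guard using max(cnt, 0) (the guard is a suggestion, not part of the original answer):

```r
# Hypothetical subset of df2 for one key, and a negative count
tmp <- data.frame(ColA = c("a", "a", "a"), ColB = c(1, 2, 3))
cnt <- -1

# Negative n means "all but the first 1 row", so two rows come back
unguarded <- tail(tmp, cnt)

# Clamping at zero returns no rows, which is what the loop intends
guarded <- tail(tmp, max(cnt, 0))

nrow(unguarded)
nrow(guarded)
```

Replacing tail(tmp, cnt) with tail(tmp, max(cnt, 0)) inside the loop would leave the results for the example data unchanged, since no count is negative there.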
Data
df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1, 3, 5,
9), ColC = c(2, 4, 6, 10)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColD = c(12,
31, 51, 71, 11, 93), ColE = c(22, 41, 63, 86, 13, 20)), class = "data.frame",
row.names = c(NA,
-6L))

Related

How to filter out rows that follow a certain established threshold being reached in R?

I have the following data:
group <- rep(letters[seq(from = 1, to = 3)], each = 4)
date <- c("1999-01-01", "1999-01-02", "1999-10-01", "1999-10-05",
"1988-02-01", "1997-12-25", "1997-12-26", "1998-01-01",
"2000-05-01", "2000-07-01", "2000-12-01", "2000-12-02")
day <- c(1,2,274,278,
1,3616,3617,3623,
1, 62,215,216)
diff <- c(0, 1, 272, 4,
0, 3615, 1, 6,
0, 61, 153, 1)
matrix <- matrix(c(group, date, day, diff), ncol = 4, byrow = F)
df <- as.data.frame(matrix)
colnames(df) <- c("group", "date", "day", "diff")
df
In this case, "diff" is the difference between consecutive dates, in days, by group. I am trying to filter out all rows after an arbitrary threshold of "diff" has been reached. For example, let's say this threshold is a difference of 100 days. I would want to eliminate all rows on and after the first value of "diff" that is greater than 100, by group. In other words, my output would look as follows:
group2 <- c("a", "a",
"b",
"c", "c")
date2 <- c("1999-01-01", "1999-01-02",
"1988-02-01",
"2000-05-01", "2000-07-01")
day2 <- c(1, 2,
1,
1, 62)
diff2 <- c(0,1,
0,
0,61)
matrix2 <- matrix(c(group2, date2, day2, diff2), ncol = 4, byrow = F)
df2 <- as.data.frame(matrix2)
colnames(df2) <- c("group", "date", "day", "diff")
df2
Is there some way to get this output? There are similar questions on Stack Overflow, but they do not accommodate groups, or do not work for my data. Filtering any value of "diff" less than 100 is not the solution, as it leaves me with dates that occurred AFTER the 100 day gap, which I do not want.
df %>%
filter(diff < 100)
Again, I just want to find the first instance where diff > 100 and remove this row and all subsequent rows for that group. Any help here would be appreciated.
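Build the data frame with data.frame() rather than via matrix(), so that 'day' and 'diff' stay numeric (matrix() coerces every column to character). Then, within each group, keep rows only while the running count of diff > 100 is still zero:
df <- data.frame(group, date, day, diff)
library(dplyr)
df %>%
  group_by(group) %>%
  filter(cumsum(diff > 100) == 0)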
# # A tibble: 5 × 4
# # Groups: group [3]
# group date day diff
# <chr> <chr> <dbl> <dbl>
# 1 a 1999-01-01 1 0
# 2 a 1999-01-02 2 1
# 3 b 1988-02-01 1 0
# 4 c 2000-05-01 1 0
# 5 c 2000-07-01 62 61
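To see why the cumsum() filter works, it can help to look at the running counter itself. The sketch below uses base R's ave() as a stand-in for group_by(), and only the 'group' and 'diff' columns from the question; 'over' and 'keep' are names made up for this illustration:

```r
group <- rep(c("a", "b", "c"), each = 4)
diff <- c(0, 1, 272, 4,
          0, 3615, 1, 6,
          0, 61, 153, 1)
df <- data.frame(group, diff)

# Running count, per group, of rows seen so far with diff > 100.
# It is 0 until the first large gap and never returns to 0 afterwards.
over <- ave(as.integer(df$diff > 100), df$group, FUN = cumsum)

# Rows before the first large gap in each group
keep <- over == 0
df[keep, ]
```

Because the counter only ever increases, rows with a small 'diff' that come after the first large gap (such as the 4 in group a) are dropped as well, which a plain filter(diff < 100) would not do.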

Divide one table by another, with matching index

I have two tables with a shared index, and I want to divide one by the other. This can be done by dividing the two data frames directly, but that seems arbitrary (how would I know I am dividing the right numbers?) and it does not preserve the index. I want to do the division by matching rows with the same index. What is the best way to do this? Is there a best practice for table division in this case?
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
tb2 <- data.frame(index = c(1, 2, 3), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1[,-1]/tb2[,-1]
total_1 total_2
1 25 10
2 225 13
3 100 10
In another case, two index columns must match:
tb2 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
If both data frames have the same index values and the same number of rows, one way is to order both by 'index' so that the rows line up, and then do the division:
tb1new <- tb1[order(tb1$index), ]
tb2new <- tb2[order(tb2$index), ]
tb1new[-1] <- tb1new[-1] / tb2new[-1]
Or we can first check that both 'index' columns are equal and do the division only if they are. all.equal() returns a character vector when the inputs differ, so wrap it in isTRUE():
i1 <- isTRUE(all.equal(tb1$index, tb2$index))
if (i1) tb1[-1] / tb2[-1]
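If the two tables hold the same index values but in a different order, match() can align them without sorting either one. A minimal sketch, assuming each index value occurs exactly once in 'tb2'; here 'tb2' is deliberately shuffled for illustration, and 'pos', 'res', 'ratio_1' and 'ratio_2' are made-up names:

```r
tb1 <- data.frame(index = c(1, 2, 3),
                  total_1 = c(100, 450, 300),
                  total_2 = c(20, 39, 60))
# Same values as in the question, but with the rows in a different order
tb2 <- data.frame(index = c(3, 1, 2),
                  unit_1 = c(3, 4, 2),
                  unit_2 = c(6, 2, 3))

# For each row of tb1, the position of the matching index in tb2
pos <- match(tb1$index, tb2$index)

# Divide column by column, using the aligned rows of tb2
res <- data.frame(index = tb1$index,
                  ratio_1 = tb1$total_1 / tb2$unit_1[pos],
                  ratio_2 = tb1$total_2 / tb2$unit_2[pos])
res
```

An index present in 'tb1' but missing from 'tb2' would give NA in 'pos' and therefore NA in the result, which makes mismatches visible rather than silently dividing the wrong rows.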
Another option is an update join with data.table:
library(data.table)
nm1 <- c('total_1', 'total_2')
nm2 <- c('unit_1', 'unit_2')
setDT(tb1)[tb2, (nm1) := .SD/mget(nm2), on = .(index), .SDcols = nm1]
You can perform a join and divide the columns. In base R:
result <- merge(tb1, tb2, by = c('index_1', 'index_2'))
result
# index_1 index_2 total_1 total_2 unit_1 unit_2
#1 a c 100 20 4 2
#2 b b 300 60 3 6
#3 b d 450 39 2 3
total_cols <- grep('total', names(result), value = TRUE)
unit_cols <- grep('unit', names(result), value = TRUE)
result[total_cols]/result[unit_cols]
# total_1 total_2
#1 25 10
#2 100 10
#3 225 13
Maybe this is not the most efficient solution, but here is another way:
library(dplyr)
# For one index matching
tb1 %>%
  left_join(tb2, by = "index") %>%
  mutate(result_1 = total_1 / unit_1,
         result_2 = total_2 / unit_2) %>%
  select(!starts_with(c("total", "unit")))
  index result_1 result_2
1     1       25       10
2     2      225       13
3     3      100       10
# For two indices matching
tb1 %>%
  left_join(tb2, by = c("index_1", "index_2")) %>%
  mutate(result_1 = total_1 / unit_1,
         result_2 = total_2 / unit_2) %>%
  select(!starts_with(c("total", "unit")))
  index_1 index_2 result_1 result_2
1       a       c       25       10
2       b       d      225       13
3       b       b      100       10

Concatenate tibbles with different columns

I want to concatenate tibbles that have:
a few common columns,
a few columns with the same name but different values,
one column that is specific to each tibble.
I have created an example below. Can someone please show how to create the desired tibble?
Thanks
library(tidyverse)
# common columns in both tibble
x <- c(1, 2, 3)
y <- c(2, 3, 4)
# common column name and different value for each tibble
v <- c(15, 10, 20)
# specific column to tibble
t_a <- c(4, 5, 6)
tbl_a <- tibble(x, y, v, t_a)
# common column name and different value for each tibble
v <- c(7, 11, 13)
# specific column to tibble
t_b<- c(9, 14, 46)
tbl_b <- tibble(x, y, v, t_b)
# concatenate tbl such output looks like this
x <- c(1, 2, 3, 1, 2, 3)
y <- c(2, 3, 4, 2, 3, 4)
v <- c(15, 10, 20, 7, 11, 13)
t <- c(4, 5, 6, 9, 14, 46)
name <- c("a", "a", "a", "b", "b", "b")
# desired output
tbl <- tibble(x, y, v, t, name)
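Here, we can bind the datasets together and use pivot_longer. The '.value' sentinel in names_to turns the part of the column name before "_" ('t') into the output column, and the part after it ('a' or 'b') goes into 'name'. values_to is ignored when '.value' is used, so it is left out:
library(dplyr)
library(tidyr)
bind_rows(tbl_a, tbl_b) %>%
  pivot_longer(cols = c(t_a, t_b), names_to = c('.value', 'name'),
               names_sep = "_", values_drop_na = TRUE)
Output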
# A tibble: 6 x 5
# x y v name t
# <dbl> <dbl> <dbl> <chr> <dbl>
#1 1 2 15 a 4
#2 2 3 10 a 5
#3 3 4 20 a 6
#4 1 2 7 b 9
#5 2 3 11 b 14
#6 3 4 13 b 46
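An alternative sketch that avoids reshaping: rename the tibble-specific column to a common name before binding, and let bind_rows() record which tibble each row came from through its .id argument. This assumes the suffix of the specific column ('a', 'b') is known in advance, which is why the list names are typed out by hand here:

```r
library(dplyr)

tbl_a <- tibble(x = c(1, 2, 3), y = c(2, 3, 4),
                v = c(15, 10, 20), t_a = c(4, 5, 6))
tbl_b <- tibble(x = c(1, 2, 3), y = c(2, 3, 4),
                v = c(7, 11, 13), t_b = c(9, 14, 46))

# Give the specific column the same name in both tibbles, then stack them.
# The list names ("a", "b") become the values of the new 'name' column.
tbl <- bind_rows(
  list(a = rename(tbl_a, t = t_a),
       b = rename(tbl_b, t = t_b)),
  .id = "name"
) %>%
  select(x, y, v, t, name)
tbl
```

With many tibbles the pivot_longer approach scales better, since it derives the labels from the column names instead of requiring them to be listed.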

Match value from one dataframe to values from a second dataframe of different length

I have two dataframes like so
df_1 <- data.frame(Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
Symbol = c("B", "A", "D"))
I want to attach df_2$Symbol to df_1 based on whether or not df_2$Value falls between df_1$Min and df_1$Max. If there's no df_2$Value in the appropriate range I'd like NA instead:
df_target <- data.frame(
Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100),
Symbol = c("A", "B", NA, "D")
)
If df_1 and df_2 were of equal lengths this would be simple with findInterval or something with cut but alas...
A solution in either base or tidyverse would be appreciated.
We could use a non-equi join
library(data.table)
setDT(df_1)[df_2, Symbol := Symbol, on = .(Min < Value, Max > Value)]
df_1
# Min Max Symbol
#1: 1 3 A
#2: 4 7 B
#3: 9 14 <NA>
#4: 25 100 D
Or we can use fuzzy_left_join from the fuzzyjoin package (dplyr is loaded for %>%):
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(df_1, df_2,
                by = c('Min' = 'Value', 'Max' = 'Value'),
                match_fun = list(`<`, `>`)) %>%
  dplyr::select(-Value)
# Min Max Symbol
#1 1 3 A
#2 4 7 B
#3 9 14 <NA>
#4 25 100 D
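Since the question also asked for base R, here is a minimal sketch that needs no packages. It loops over the rows of 'df_1' and looks up the symbols whose 'Value' lies strictly between 'Min' and 'Max', as in the joins above. Taking the first hit when several values fall in the same range is an assumption; the question does not say what should happen in that case.

```r
df_1 <- data.frame(Min = c(1, 4, 9, 25),
                   Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
                   Symbol = c("B", "A", "D"),
                   stringsAsFactors = FALSE)

df_1$Symbol <- vapply(seq_len(nrow(df_1)), function(i) {
  # Symbols whose Value lies strictly inside the i-th interval
  hit <- df_2$Symbol[df_2$Value > df_1$Min[i] & df_2$Value < df_1$Max[i]]
  # First match if there is one, otherwise NA
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))
df_1
```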

Matching data replacement in R

I have two datasets with the same dimensions and the same column names. The goal is to check whether NA values exist in one of the datasets and replace them with the corresponding values from the other dataset, as shown in the example below.
I tried to solve the problem with nested for loops, but that did not work.
df is a new data frame created with NAs:
loop = for (a in 1:nrow(data1)) {
for (b in 1:ncol(data1)) {
for (c in 1:nrow(data2)) {
for (d in 1:ncol(data2)) {
for (x in 1:nrow(df)) {
for (y in 1:ncol(df)) {
df[x,y]<- ifelse(data1[a,b] != "NA", data1[a,b], data2[c,d])
return(df)
}
}
}
}
}
}
Example
# The first data frame
structure(list(age = c(23, 22, 21, 20), gender = c("M", "F",
NA, "F")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# age gender
# 1 23 M
# 2 22 F
# 3 21 NA
# 4 20 F
# The second data frame
structure(list(age = c(23, 22, 21, 20), gender = c("M", "F",
"M", "F")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
# age gender
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
Desired output
Age Gender
23 M
22 F
21 M
20 F
You might try this:
library(tibble)
df1 <- tibble(age = c(23,22,21,20),
gender = c("M", "F", NA, "F"))
# -------------------------------------------------------------------------
#> df1
# # A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 NA
# 4 20 F
# -------------------------------------------------------------------------
df2 <- tibble(age = c(23,22,21,20),
gender = c("M", "F", "M", "F"))
# -------------------------------------------------------------------------
#> df2
# # A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
# -------------------------------------------------------------------------
# get the na in df1 of gender var
df1.na <- is.na(df1$gender)
#> df1.na
# [1] FALSE FALSE TRUE FALSE
# -------------------------------------------------------------------------
# use the values in df2 to replace NA in df1 (position-based: assumes both data frames have their rows in the same order)
df1$gender[df1.na] <- df2$gender[df1.na]
df1
# -------------------------------------------------------------------------
#> df1
# A tibble: 4 x 2
# age gender
# <dbl> <chr>
# 1 23 M
# 2 22 F
# 3 21 M
# 4 20 F
# -------------------------------------------------------------------------
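The same position-based replacement can be written in one step with dplyr::coalesce(), which returns, element by element, the first non-missing value among its arguments. A short sketch on the 'gender' column, again assuming the rows of both data frames are in the same order:

```r
library(dplyr)

df1 <- tibble(age = c(23, 22, 21, 20),
              gender = c("M", "F", NA, "F"))
df2 <- tibble(age = c(23, 22, 21, 20),
              gender = c("M", "F", "M", "F"))

# Take df1's value where present, otherwise fall back to df2's value
df1$gender <- coalesce(df1$gender, df2$gender)
df1
```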
This can be done using the natural_join function from the rqdatatable package. The function requires an index to merge on, so we need to create one.
Creating a reproducible example will help other people help you. Here I've created two simple data frames that should cover most cases for your problem.
# Create example data
tbl1 <-
data.frame(
w = c(1, 2, 3, 4),
x = c(1, 2, 3, NA),
y = c(1, 2, 3, 4),
z = c(1, NA, NA, NA)
)
tbl2 <-
data.frame(
w = c(9, 9, 9, 9), # check that a value does not overwrite a value
x = c(1, 2, 3, 4), # check that NA gets filled in
y = c(1, 2, 3, NA), # check that NA does not overwrite a value
z = c(9, NA, NA, NA) # check that NA in both stays NA
)
# Create join index
tbl1$indx <- 1:nrow(tbl1)
tbl2$indx <- 1:nrow(tbl2)
# Use natural_join
library("rqdatatable")
natural_join(tbl1, tbl2, by = "indx")
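For comparison, the four cases in the example data can also be checked with a base R sketch that fills every column at once. It pairs up the columns of the two data frames by position and relies on the rows being in the same order, so no index column is needed; 'filled' is a name made up for this illustration:

```r
tbl1 <- data.frame(w = c(1, 2, 3, 4),
                   x = c(1, 2, 3, NA),
                   y = c(1, 2, 3, 4),
                   z = c(1, NA, NA, NA))
tbl2 <- data.frame(w = c(9, 9, 9, 9),
                   x = c(1, 2, 3, 4),
                   y = c(1, 2, 3, NA),
                   z = c(9, NA, NA, NA))

filled <- tbl1
# For each pair of columns, keep tbl1's value unless it is NA
filled[] <- Map(function(a, b) ifelse(is.na(a), b, a), tbl1, tbl2)
filled
```

Here 'w' and 'y' are unchanged, the NA in 'x' is filled from 'tbl2', and the entries of 'z' that are NA in both tables stay NA.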
