Divide one table by another, with matching index - r

I have two tables with a shared index, and I want to divide one by the other. This could be done by dividing the two data frames directly, but that seems arbitrary (how would I know I am dividing the right numbers?) and does not preserve the index, so I want to do the division by matching rows with the same index. What's the best way to do this? Is there a best practice for table division in this case?
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
tb2 <- data.frame(index = c(1, 2, 3), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1[,-1]/tb2[,-1]
total_1 total_2
1 25 10
2 225 13
3 100 10
In another case, two index columns must both match.
tb2 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))

If both data frames have the same index values and the same number of rows, one way is to order both by 'index' so that the rows line up, then do the division:
tb1new <- tb1[order(tb1$index),]
tb2new <- tb2[order(tb2$index),]
tb1new[-1] <- tb1new[-1]/tb2new[-1]
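Or we can first check that the two 'index' columns are equal and do the division only if they are. Note that all.equal() returns a character vector rather than FALSE on a mismatch, so it is wrapped in isTRUE():
i1 <- isTRUE(all.equal(tb1$index, tb2$index))
if(i1) tb1[-1]/tb2[-1]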
Another option is a join:
library(data.table)
nm1 <- c('total_1', 'total_2')
nm2 <- c('unit_1', 'unit_2')
setDT(tb1)[tb2, (nm1) := .SD/mget(nm2), on = .(index), .SDcols = nm1]
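
The ordering and checking approaches above both come down to lining up rows by 'index' before dividing. A minimal base R sketch of that idea uses match() to look up, for each row of tb1, the row of tb2 with the same index. The shuffled tb2 and the names pos and result are assumptions made for illustration, not part of the original answer.

```r
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
# tb2 rows are deliberately shuffled to show that alignment is by index, not by position
tb2 <- data.frame(index = c(3, 1, 2), unit_1 = c(3, 4, 2), unit_2 = c(6, 2, 3))

# For each row of tb1, the position of the row in tb2 with the same index
pos <- match(tb1$index, tb2$index)
# Stop early if some index in tb1 has no partner in tb2
stopifnot(!anyNA(pos))

# Divide the value columns after reordering tb2, and keep the index column
result <- cbind(tb1["index"], tb1[-1] / tb2[pos, -1])
result
```

This assumes each index value occurs at most once in tb2; with duplicates, match() silently picks the first occurrence, so a join is the safer choice there.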

You can perform a join and divide the columns. In base R :
result <- merge(tb1, tb2, by = c('index_1', 'index_2'))
result
# index_1 index_2 total_1 total_2 unit_1 unit_2
#1 a c 100 20 4 2
#2 b b 300 60 3 6
#3 b d 450 39 2 3
total_cols <- grep('total', names(result), value = TRUE)
unit_cols <- grep('unit', names(result), value = TRUE)
result[total_cols]/result[unit_cols]
# total_1 total_2
#1 25 10
#2 100 10
#3 225 13

Maybe this is not the most efficient solution but here is another way:
library(dplyr)
library(tidyr)
# For one index matching
tb1 %>%
left_join(tb2, by = "index") %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_"))) %>%
select(!starts_with(c("total", "unit")))
index result_1 result_2
1 1 25 10
2 2 225 13
3 3 100 10
# For two indices matching
tb1 %>%
left_join(tb2, by = c("index_1", "index_2")) %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_"))) %>%
select(!starts_with(c("total", "unit")))
index_1 index_2 result_1 result_2
1 a c 25 10
2 b d 225 13
3 b b 100 10

How to filter out rows that follow a certain established threshold being reached in R?

I have the following data:
group <- rep(letters[seq(from = 1, to = 3)], each = 4)
date <- c("1999-01-01", "1999-01-02", "1999-10-01", "1999-10-05",
"1988-02-01", "1997-12-25", "1997-12-26", "1998-01-01",
"2000-05-01", "2000-07-01", "2000-12-01", "2000-12-02")
day <- c(1,2,274,278,
1,3616,3617,3623,
1, 62,215,216)
diff <- c(0, 1, 272, 4,
0, 3615, 1, 6,
0, 61, 153, 1)
matrix <- matrix(c(group, date, day, diff), ncol = 4, byrow = F)
df <- as.data.frame(matrix)
colnames(df) <- c("group", "date", "day", "diff")
df
In this case, "diff" is the difference between consecutive dates, in days, by group. I am trying to filter out all rows after an arbitrary threshold of "diff" has been reached. For example, let's say this threshold is a difference of 100 days. I would want to eliminate all rows on and after the first value of "diff" that is greater than 100, by group. In other words, my output would look as follows:
group2 <- c("a", "a",
"b",
"c", "c")
date2 <- c("1999-01-01", "1999-01-02",
"1988-02-01",
"2000-05-01", "2000-07-01")
day2 <- c(1, 2,
1,
1, 62)
diff2 <- c(0,1,
0,
0,61)
matrix2 <- matrix(c(group2, date2, day2, diff2), ncol = 4, byrow = F)
df2 <- as.data.frame(matrix2)
colnames(df2) <- c("group", "date", "day", "diff")
df2
Is there some way to get this output? There are similar questions on Stack Overflow, but they do not accommodate groups, or do not work for my data. Filtering any value of "diff" less than 100 is not the solution, as it leaves me with dates that occurred AFTER the 100 day gap, which I do not want.
df %>%
filter(diff < 100)
Again, I just want to find the first instance where diff > 100 and remove this row and all subsequent rows for that group. Any help here would be appreciated.
df = data.frame(group, date, day, diff)
library(dplyr)
df %>%
group_by(group) %>%
filter(cumsum(diff > 100) == 0)
# # A tibble: 5 × 4
# # Groups: group [3]
# group date day diff
# <chr> <chr> <dbl> <dbl>
# 1 a 1999-01-01 1 0
# 2 a 1999-01-02 2 1
# 3 b 1988-02-01 1 0
# 4 c 2000-05-01 1 0
# 5 c 2000-07-01 62 61
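
The filter works because cumsum(diff > 100) stays at 0 within a group until the first gap over 100 days and is positive from that row onwards. A minimal base R sketch of the same idea follows; using ave() for the per-group running count is a substitution for group_by(), and the names over and kept are hypothetical.

```r
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  diff  = c(0, 1, 272, 4,
            0, 3615, 1, 6,
            0, 61, 153, 1),
  stringsAsFactors = FALSE
)

# Per group, a running count of how many gaps over the threshold have been seen so far
over <- ave(as.numeric(df$diff > 100), df$group, FUN = cumsum)

# Keep only the rows that come before the first over-threshold gap in their group
kept <- df[over == 0, ]
kept
```

The rows are assumed to be already sorted by date within each group, as they are in the question.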

How to automatically fill in a blank column

I am trying to get the sums of each pair of adjacent columns in my original data set, going from left to right.
I have made a loop:
for (i in 1:ncol(df)) {
m = i
n = i + 1
if (i %% 2 != 0) {
df_cum$V1 <- sum(df[,m] + df[,n])
}
}
But the way I add the value to the new list is wrong:
df_cum$V1 <- sum(df[,m] + df[,n])
It would be really appreciated if anyone knows how to do that in R.
You can try split.default(), i.e.
sapply(split.default(df, gsub('\\d+', '', names(df))), sum)
A B C
17 12 13
A base R option using tapply -
tapply(unlist(df),
rep(1:ncol(df), each = nrow(df) * 2, length.out = nrow(df) * ncol(df)),
sum)
# 1 2 3
#17 12 13
The logic here is to create a group for every 2 columns and sum each group.
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(A1 = c(0, 3, 2), A2 = c(2, 6, 4),
B1 = c(3, 0, 1), B2 = c(2, 3, 3),
C1 = c(7, 3, 2), C2 = c(1, 0, 0))
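
The grouping logic described above can also be spelled out step by step. This is a minimal sketch, assuming the columns are already ordered so that each pair sits side by side; the names pair_id and pair_sums are hypothetical.

```r
df <- data.frame(A1 = c(0, 3, 2), A2 = c(2, 6, 4),
                 B1 = c(3, 0, 1), B2 = c(2, 3, 3),
                 C1 = c(7, 3, 2), C2 = c(1, 0, 0))

# Column j belongs to pair ceiling(j / 2): columns 1-2 -> 1, 3-4 -> 2, 5-6 -> 3
pair_id <- ceiling(seq_along(df) / 2)

# Sum each column first, then add up the column totals within each pair
pair_sums <- tapply(colSums(df), pair_id, sum)
pair_sums
```

Grouping by position rather than by name means it does not matter how the columns are labelled, but it does depend on their order.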
We can do this in tidyverse
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(everything(), names_to = c(".value", "grp"),
names_sep ="(?<=[A-Z])(?=[0-9])") %>%
select(-grp) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
Output:
# A tibble: 1 x 3
A B C
<dbl> <dbl> <dbl>
1 17 12 13
Or using base R
aggregate(values ~ ., transform(stack(df1),
ind = sub("\\d+", "", ind)), FUN = sum)
ind values
1 A 17
2 B 12
3 C 13
Or another option with rowsum from base R
with(stack(df1), rowsum(values, group = trimws(ind, whitespace = "\\d+")))
[,1]
A 17
B 12
C 13
Or another option is with colSums and rowsum
{tmp <- colSums(df1); rowsum(tmp, group = substr(names(tmp), 1, 1))}
[,1]
A 17
B 12
C 13
data
df1 <- structure(list(A1 = c(0, 3, 2), A2 = c(2, 6, 4), B1 = c(3, 0,
1), B2 = c(2, 3, 3), C1 = c(7, 3, 2), C2 = c(1, 0, 0)),
class = "data.frame", row.names = c(NA,
-3L))

R grouped frequency table

I'm an R noob and I feel this should be simple but I cannot work it out. I have a survey dataset, with columns for ID, employer, practice_area and then a number of columns where the survey takers had to indicate which tools they use with a 'check all that apply' instruction. The data set now has a column for each tool option with either 1 or 0.
Sample df:
np1 <- data.frame(ID = c(1:10),
practice_area = c("A", "B", "C", "A", "A", "C", "B", "D", "C", "A"),
tool_1 = sample(0:1,10, replace = TRUE),
tool_2 = sample(0:1,10, replace = TRUE),
tool_3 = sample(0:1,10, replace = TRUE),
tool_4 = sample(0:1,10, replace = TRUE),
tool_5 = sample(0:1,10, replace = TRUE))
I'd like a frequency table grouped by practice_area, so that I can see, for example, that in practice_area A, x people use tool_1, x people use tool_2, etc.
# data
df <- data.frame(ID = c(1, 2, 3, 4 ,5),
employer = c("A", "B", "C", "D", "E"),
practice_area = c("X", "Y", "X", "X", "X"),
tool_1 = c(1, 0, 0, 1, 1),
tool_2 = c(1, 0, 0, 1, 0),
tool_3 = c(1, 1, 1, 1, 1),
tool_4 = c(0, 1, 1, 0, 1))
# code
library(dplyr)  # provides %>%, group_by() and summarise()
df %>%
group_by(practice_area) %>%
summarise(tool_1 = sum(tool_1), tool_2 = sum(tool_2),
tool_3 = sum(tool_3), tool_4 = sum(tool_4))
Ok, so let's start off with creating a dataset, to reproduce this problem.
library(tidyverse)
df <- data.frame(
ID = 1:50,
employer = rep(
c("employer.1","employer.2"),
25
),
practice_area = rep(
1:5,
10
),
tool.1 = sample(0:1, 50, replace=T),
tool.2 = sample(0:1, 50, replace=T)
)
So, if I want a table like this:
# A tibble: 10 x 3
# Groups: practice_area [5]
practice_area tool n
<int> <chr> <int>
1 1 tool.1 7
2 1 tool.2 2
3 2 tool.1 4
4 2 tool.2 2
5 3 tool.1 2
6 3 tool.2 4
7 4 tool.1 4
8 4 tool.2 6
9 5 tool.1 6
10 5 tool.2 5
I would do
df %>%
pivot_longer(
starts_with("tool"),
names_to = "tool",
values_to = "uses_tool"
) %>%
filter(uses_tool != 0) %>%
group_by(practice_area) %>%
count(tool)
In this piece of code, I reshape the table from wide to long so that there is one column holding the tool names (the tool columns are selected with starts_with, see https://dplyr.tidyverse.org/reference/select.html). After that, I keep only the rows where the tool is used (uses_tool != 0) and group them by practice area. The only thing left to do is count the occurrences per group.
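
Because the tool columns only hold 0 and 1, counting the rows where a tool is used gives the same number as summing the column. A minimal base R sketch that can be used as a cross-check, on the small fixed data set from the earlier answer so the result is reproducible; the names tool_cols and counts are hypothetical.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 employer = c("A", "B", "C", "D", "E"),
                 practice_area = c("X", "Y", "X", "X", "X"),
                 tool_1 = c(1, 0, 0, 1, 1),
                 tool_2 = c(1, 0, 0, 1, 0),
                 tool_3 = c(1, 1, 1, 1, 1),
                 tool_4 = c(0, 1, 1, 0, 1))

# Pick out the tool columns by name
tool_cols <- grep("^tool", names(df), value = TRUE)

# Sum the 0/1 indicators within each practice area; one row per area, one column per tool
counts <- rowsum(df[tool_cols], group = df$practice_area)
counts
```

This gives a wide table (one column per tool), whereas the pivot_longer() version gives a long one; which is more convenient depends on what is done with it next.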

Match value from one dataframe to values from a second dataframe of different length

I have two dataframes like so
df_1 <- data.frame(Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
Symbol = c("B", "A", "D"))
I want to attach df_2$Symbol to df_1 based on whether or not df_2$Value falls between df_1$Min and df_1$Max. If there's no df_2$Value in the appropriate range I'd like NA instead:
df_target <- data.frame(
Min = c(1, 4, 9, 25),
Max = c(3, 7, 14, 100),
Symbol = c("A", "B", NA, "D")
)
If df_1 and df_2 were of equal lengths this would be simple with findInterval or something with cut but alas...
A solution in either base or tidyverse would be appreciated.
We could use a non-equi join
library(data.table)
setDT(df_1)[df_2, Symbol := Symbol, on = .(Min < Value, Max > Value)]
df_1
# Min Max Symbol
#1: 1 3 A
#2: 4 7 B
#3: 9 14 <NA>
#4: 25 100 D
Or we can use fuzzy_left_join
library(fuzzyjoin)
library(dplyr)  # for %>%
fuzzy_left_join(df_1, df_2, by = c('Min' = 'Value',
'Max' = 'Value'), match_fun = list(`<`, `>`)) %>%
dplyr::select(-Value)
# Min Max Symbol
#1 1 3 A
#2 4 7 B
#3 9 14 <NA>
#4 25 100 D
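
Since the question also asks for a base R route, here is a minimal sketch that loops over the intervals and looks for a Value strictly between Min and Max, mirroring the inequalities used in the joins above. The name first_match is hypothetical, and taking the first hit when several values fall in one interval is an assumption the question does not settle.

```r
df_1 <- data.frame(Min = c(1, 4, 9, 25),
                   Max = c(3, 7, 14, 100))
df_2 <- data.frame(Value = c(5, 2, 33),
                   Symbol = c("B", "A", "D"),
                   stringsAsFactors = FALSE)

# For each interval, the row of df_2 whose Value lies strictly inside it (NA if none)
first_match <- vapply(seq_len(nrow(df_1)), function(i) {
  hit <- which(df_2$Value > df_1$Min[i] & df_2$Value < df_1$Max[i])
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1))

# Indexing with NA yields NA, so intervals without a match get a missing Symbol
df_1$Symbol <- df_2$Symbol[first_match]
df_1
```

This does one pass over df_2 per interval, which is fine for small tables; for large ones the non-equi join is likely the better fit.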

Insert specified values in R grouped df and fill up missing values using another df (R)

I have 2 dfs : df & xdf.
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
In df, for each group in the "id" column (a & b), I would like to insert only the row of xdf that has the same id. How can I do this? I have tried the following commands, but all of the values of xdf$x are inserted for each group.
ndf <- df %>%
group_by(id) %>%
do(add_row(., id = .$id[1], x = xdf$x))
> ndf
# A tibble: 12 x 3
# Groups: id [2]
id x y
<chr> <dbl> <dbl>
1 a 1 0.2
2 a 2 0
3 a 3 0.9
4 a 4 7
5 a 2 NA
6 a 3.5 NA
7 b 1 1
8 b 2 0.3
9 b 3 5
10 b 4 5.1
11 b 2 NA
12 b 3.5 NA
# expected result should be : ndf <- ndf[c(-6,-11),]
My end goal is to fill the newly created NA values in ndf with the approx() function. But my issue remains, because I'm using xout = xdf$x, which brings in the extra values from the other group. How can I overcome this? Can you help me write a function that makes xout vary by group?
f <- function(z)
{
fdf <- approx(z$x, z$y, xout = xdf$x, method = "linear")
return(data.frame(nx= fdf$x, y.out = fdf$y, id = unique(z$id)))
}
jdf <- as.data.frame(ddply(ndf, .(id), f))
zdf <- subset(jdf, select = c(id, nx, y.out))
> zdf
id nx y.out
1 a 2.0 0.00
2 a 3.5 3.95
3 b 2.0 0.30
4 b 3.5 5.05
# expected results
id nx y.out
1 a 2.0 0.00
2 b 3.5 5.05
Any helpful tips are welcome. Many thanks!
library(dplyr)
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
ndf <- df %>%
bind_rows(xdf) %>%
arrange(id)
zdf <- ndf %>%
group_by(id) %>%
group_modify(~mutate(., y_approx = approx(.$x, .$y, .$x, method = "linear")[["y"]])) %>%
ungroup() %>%
filter(is.na(y)) %>%
select(id, y_approx)
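
If only the interpolated values are needed, the row insertion can be skipped altogether by letting xout vary with the id, which is what the question asks for. A minimal base R sketch, assuming plain data frames instead of tibbles; the y.out column name follows the question's expected output.

```r
df <- data.frame(id = rep(c("a", "b"), each = 4),
                 x = rep(1:4, times = 2),
                 y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1),
                 stringsAsFactors = FALSE)
xdf <- data.frame(id = c("a", "b"),
                  x = c(2, 3.5),
                  stringsAsFactors = FALSE)

# For each row of xdf, interpolate y at its x using only the df rows with the same id
xdf$y.out <- mapply(function(id, xout) {
  grp <- df[df$id == id, ]
  approx(grp$x, grp$y, xout = xout, method = "linear")$y
}, xdf$id, xdf$x, USE.NAMES = FALSE)
xdf
```

Each id is paired with its own xout here, so the value 3.5 is never evaluated for group a and 2 is never evaluated for group b.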
