Compute the difference between two columns by pair in R

I have the following data:
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
I want to "extend" this data frame to make name pairs for every possible combination of names without repetition like so:
names_1 <- c("a", "a", "a", "b", "b", "c")
names_2 <- c("b", "c", "d", "c", "d", "d")
scores_1 <- c(95, 95, 95, 55, 55, 100)
scores_2 <- c(55, 100, 60, 100, 60, 60)
df_extended <- cbind.data.frame(names_1, names_2, scores_1, scores_2)
In the extended data, scores_1 are the scores for the corresponding name in names_1, and scores_2 are for names_2.
The following bit of code makes the appropriate name pairs. But I do not know how to get the scores in the right place after that.
t(combn(df$names,2))
The final goal is to get the row-wise difference between scores_1 and scores_2.
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)

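# For each pair of names, return the two names followed by their scores
df_ext <- data.frame(t(combn(df$names, 2, \(x) c(x, df$scores[df$names %in% x]))))
# Convert the score columns back to numeric and label the columns
df_ext <- setNames(type.convert(df_ext, as.is = TRUE), c('name_1', 'name_2', 'type_1', 'type_2'))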
df_ext
name_1 name_2 type_1 type_2
1 a b 95 55
2 a c 95 100
3 a d 95 60
4 b c 55 100
5 b d 55 60
6 c d 100 60
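The output above stops short of the absolute difference that the question names as its final goal. As a minimal base R sketch (the index matrix idx is a name introduced here for illustration, not part of the answer above), combn() can be run on row numbers instead of names, so the names and the scores are both looked up from the same indices:

```r
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- data.frame(names, scores)

# Each row of idx holds the row numbers of one unordered pair
idx <- t(combn(nrow(df), 2))

df_extended <- data.frame(
  names_1  = df$names[idx[, 1]],
  names_2  = df$names[idx[, 2]],
  scores_1 = df$scores[idx[, 1]],
  scores_2 = df$scores[idx[, 2]]
)

# Row-wise absolute difference between the paired scores
df_extended$score_diff <- abs(df_extended$scores_1 - df_extended$scores_2)
df_extended
```

Because the lookup goes through row numbers, this should also work when names are repeated, which a lookup by name would not handle.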

names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
library(tidyverse)
map(df, ~combn(x = .x, m = 2)%>% t %>% as_tibble) %>%
imap_dfc(~set_names(x = .x, nm = paste(.y, seq(ncol(.x)), sep = "_"))) %>%
mutate(score_diff = scores_1 - scores_2)
#> # A tibble: 6 × 5
#> names_1 names_2 scores_1 scores_2 score_diff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 a b 95 55 40
#> 2 a c 95 100 -5
#> 3 a d 95 60 35
#> 4 b c 55 100 -45
#> 5 b d 55 60 -5
#> 6 c d 100 60 40
Created on 2022-06-06 by the reprex package (v2.0.1)

First, we can create a new data frame with the unique combinations of names. Then, we can merge on the scores to match the names for both names_1 and names_2 to get the final data.frame.
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- cbind.data.frame(names, scores)
new_df <- data.frame(t(combn(df$names,2)))
names(new_df)[1] <- "names_1"; names(new_df)[2] <- "names_2"
new_df <- merge(new_df, df, by.x = 'names_1', by.y = 'names')
new_df <- merge(new_df, df, by.x = 'names_2', by.y = 'names')
names(new_df)[3] <- "scores_1"; names(new_df)[4] <- "scores_2"
> new_df
names_2 names_1 scores_1 scores_2
1 b a 95 55
2 c a 95 100
3 c b 55 100
4 d a 95 60
5 d b 55 60
6 d c 100 60
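merge() sorts its result by the key column and moves that column to the front, which is why the rows and columns above come out in a different order from the question's df_extended. A minimal sketch of an alternative lookup, assuming the names in df are unique: match() returns the position of each name in df$names, so the scores can be pulled in without reordering anything. The data frame name pairs is introduced here for illustration.

```r
names <- c("a", "b", "c", "d")
scores <- c(95, 55, 100, 60)
df <- data.frame(names, scores)

# One row per unordered pair of names, in combn() order
pairs <- data.frame(t(combn(df$names, 2)))
names(pairs) <- c("names_1", "names_2")

# Look up each name's score by position; row order is left untouched
pairs$scores_1 <- df$scores[match(pairs$names_1, df$names)]
pairs$scores_2 <- df$scores[match(pairs$names_2, df$names)]

# Final goal from the question: row-wise absolute difference
pairs$score_diff <- abs(pairs$scores_1 - pairs$scores_2)
pairs
```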

Related

How to round numbers and replace small values in table by "<0.1" in R

I have a large data table containing values ("c" and "d" in "exampledata") which I would like to alter in the following way:
if group >1000 and value >1: round to 0 digits
if group >1000 and value <1: paste "<1"
if group <1000 and value >0.1: round to 1 digit
if group <1000 and value <0.1: paste "<0.1"
exampledata <- data.table(
a = c("aa", "bb", "cc", "aa", "bb", "cc"),
b = c("a", "b", "c", "a", "a", "b"),
c = c(0, 0.05, 0.5, 50, 10, 6.898),
d = c(10000, 153.789, 123.22, 55.11, 0.0000025, 0.06),
group = c(11000, 50220, 10, 23, 62, 5)
)
This is the desired result I am looking for:
desired_solution <- data.table(
a = c("aa", "bb", "cc", "aa", "bb", "cc"),
b = c("a", "b", "c", "a", "a", "b"),
c = c(0, "<1", 0.5, 50.0, 10.0, 6.9),
d = c(10000, 154, 123.2, 55.1, "<0.1", "<0.1"),
group = c(11000, 50220, 10, 23, 62, 5)
)
> desired_solution
a b c d group
1: aa a 0 10000 11000
2: bb b <1 154 50220
3: cc c 0.5 123.2 10
4: aa a 50 55.1 23
5: bb a 10 <0.1 62
6: cc b 6.9 <0.1 5
I have tried:
desired_solution <- exampledata %>%
mutate_at(vars(c(3:4)), case_when(.>0.1 & group < 1000 ~ round(., digits = 1),
.<0.1 & group < 1000 ~ "<0.1",
.>1 & group > 1000 ~ round(., digits = 0),
.<1 & group > 1000 ~ "<1"))
This of course did not work. I don't know how to solve this problem. If I replace the smaller numbers (e.g. <0.1) by a character ("<0.1"), I can no longer round the remaining values, and if I round the values first, I will no longer have the smaller numbers to display!
I will be grateful for any ideas.
The issue is that with case_when all RHS values (see ?case_when) must evaluate to the same type of vector.
Hence, to fix your issue you have to convert the numerics to characters using as.character:
library(dplyr)
exampledata %>%
mutate(across(3:4, ~ case_when(
. > 0.1 & group < 1000 ~ as.character(round(., digits = 1)),
. < 0.1 & group < 1000 ~ "<0.1",
. > 1 & group > 1000 ~ as.character(round(., digits = 0)),
. < 1 & group > 1000 ~ "<1"
)))
#> a b c d group
#> 1 aa a <1 10000 11000
#> 2 bb b <1 154 50220
#> 3 cc c 0.5 123.2 10
#> 4 aa a 50 55.1 23
#> 5 bb a 10 <0.1 62
#> 6 cc b 6.9 <0.1 5
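Note that in the output above the first value of c (an exact 0 in a group above 1000) becomes "<1", whereas the desired output in the question shows 0, and a value sitting exactly on a threshold matches no condition and so becomes NA. As a rough base R sketch of one way around both points: format_small is a hypothetical helper introduced here, and its treatment of exact zeros and of group == 1000 is an assumption, since the question does not specify either case. A plain data.frame is used so that no extra package is needed.

```r
exampledata <- data.frame(
  a = c("aa", "bb", "cc", "aa", "bb", "cc"),
  b = c("a", "b", "c", "a", "a", "b"),
  c = c(0, 0.05, 0.5, 50, 10, 6.898),
  d = c(10000, 153.789, 123.22, 55.11, 0.0000025, 0.06),
  group = c(11000, 50220, 10, 23, 62, 5)
)

# Hypothetical helper, not part of the answer above
format_small <- function(value, group) {
  big <- group > 1000                  # assumption: group == 1000 counts as a small group
  limit <- ifelse(big, 1, 0.1)         # reporting threshold for each row
  out <- as.character(ifelse(big, round(value, 0), round(value, 1)))
  small <- value > 0 & value < limit   # assumption: exact zeros are shown as "0"
  out[small] <- paste0("<", limit[small])
  out
}

result <- exampledata
result$c <- format_small(exampledata$c, exampledata$group)
result$d <- format_small(exampledata$d, exampledata$group)
result
```

Values exactly equal to a threshold are rounded rather than flagged in this sketch; flip the comparison to <= if they should be reported as below the limit.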

How to add a row, which is a sum of some values in a column based on specific values in other column?

I am trying to add two rows to the data frame.
For the first row, MODEL should be X, and total_value and total_frequency should be the sums of those columns over the rows where MODEL is A or C.
For the second row, MODEL should be Z, and total_value and total_frequency should be the sums of those columns over the rows where MODEL is D, E or F.
I am stuck, as I do not know how to select specific values of MODEL and then sum these two other columns.
Here is my data
data.frame(MODEL=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), total_value= c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72))
You can try dplyr: calculate the "new rows" first, then bind them to the data frame df:
library(dplyr)
first <- df %>%
# select the models you need
filter(MODEL %in% c("A","C")) %>%
# call them x
mutate(MODEL = 'X') %>%
# grouping
group_by(MODEL) %>%
# calculate the sums
summarise_all(sum)
# same with the second
second <- df %>%
filter(MODEL %in% c("D","F","E")) %>%
mutate(MODEL = 'Z') %>%
group_by(MODEL) %>% summarise_all(sum)
# put together
rbind(df, first, second)
# A tibble: 12 x 3
MODEL total_value total_frequency
1 A 62 78
2 B 54 83
3 C 78 24
4 D 38 13
5 E 16 22
6 F 75 52
7 G 39 16
8 H 13 16
9 I 58 20
10 J 37 72
11 X 140 102
12 Z 129 87
The following code is a straightforward solution to the problem.
i1 <- df1$MODEL %in% c("A", "C")
total_value <- sum(df1$total_value[i1])
total_frequency <- sum(df1$total_frequency[i1])
df1 <- rbind(df1, data.frame(MODEL = "X", total_value, total_frequency))
i2 <- df1$MODEL %in% c("D", "E", "F")
total_value <- sum(df1$total_value[i2])
total_frequency <- sum(df1$total_frequency[i2])
df1 <- rbind(df1, data.frame(MODEL = "Z", total_value, total_frequency))
df1
# MODEL total_value total_frequency
#1 A 62 78
#2 B 54 83
#3 C 78 24
#4 D 38 13
#5 E 16 22
#6 F 75 52
#7 G 39 16
#8 H 13 16
#9 I 58 20
#10 J 37 72
#11 X 140 102
#12 Z 129 87
It is also possible to write a function to avoid repeating the same code.
fun <- function(X, M, vals){
i1 <- X$MODEL %in% vals
total_value <- sum(X$total_value[i1])
total_frequency <- sum(X$total_frequency[i1])
rbind(X, data.frame(MODEL = M, total_value, total_frequency))
}
df1 <- fun(df1, M = "X", vals = c("A", "C"))
df1 <- fun(df1, M = "Z", vals = c("D", "E", "F"))
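The function above names the two summed columns explicitly. As a sketch of a slightly more general variant, assuming every column other than MODEL is numeric, colSums() can total whatever value columns are present, and the groups can be kept in a named list. The names add_total_row and groups are introduced here for illustration.

```r
df1 <- data.frame(
  MODEL = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  total_value = c(62, 54, 78, 38, 16, 75, 39, 13, 58, 37),
  total_frequency = c(78, 83, 24, 13, 22, 52, 16, 16, 20, 72)
)

# Hypothetical helper: append one row holding the column sums for the chosen models
add_total_row <- function(X, M, vals) {
  keep <- X$MODEL %in% vals
  # Assumption: all columns except MODEL are numeric
  totals <- as.list(colSums(X[keep, names(X) != "MODEL", drop = FALSE]))
  rbind(X, data.frame(MODEL = M, totals))
}

# New row label -> models whose rows are summed
groups <- list(X = c("A", "C"), Z = c("D", "E", "F"))
for (m in names(groups)) {
  df1 <- add_total_row(df1, m, groups[[m]])
}
df1
```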

Convert information from rows to new columns

Is there a way in R to place every three values in the column "V" (below) into new columns? In other words, I need to reshape the data from long to wide, but only to three columns, where the values are what appears in column V. Below is a demonstration.
Thank you in advance!
data = structure(list(Key = c(200, 200, 200, 200, 200, 200, 300, 300,
300, 300, 300, 300, 400, 400, 400, 400, 400, 400),
V = c("a", "b", "c", "b", "d", "c", "d", "b", "c", "a", "f", "c", "d", "b",
"c", "a", "b", "c")),
row.names = c(NA, 18L),
class = "data.frame")
Here is one option
data %>%
group_by(Key) %>%
mutate(
grp = gl(n() / 3, 3),
col = c("x", "y", "z")[(row_number() + 2) %% 3 + 1]) %>%
group_by(Key, grp) %>%
spread(col, V) %>%
ungroup() %>%
select(-grp)
## A tibble: 6 x 4
# Key x y z
# <dbl> <chr> <chr> <chr>
#1 200 a b c
#2 200 b d c
#3 300 d b c
#4 300 a f c
#5 400 d b c
#6 400 a b c
Note: This assumes that the number of entries per Key is divisible by 3.
Instead of grp = gl(n() / 3, 3) you can also use grp = rep(1:(n() / 3), each = 3).
Update
In response to your comments, let's create sample data by removing some rows from data such that for Key = 200 and Key = 300 we don't have a multiple of 3 V entries.
data2 <- data %>% slice(-c(1, 8))
Then we can do
data2 %>%
group_by(Key) %>%
mutate(grp = gl(ceiling(n() / 3), 3)[1:n()]) %>%
group_by(Key, grp) %>%
mutate(col = c("x", "y", "z")[1:n()]) %>%
spread(col, V) %>%
ungroup() %>%
select(-grp)
## A tibble: 6 x 4
# Key x y z
# <dbl> <chr> <chr> <chr>
#1 200 b c b
#2 200 d c NA
#3 300 d c a
#4 300 f c NA
#5 400 d b c
#6 400 a b c
Note how "missing" values are filled with NA.
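For comparison, the same reshaping can be sketched in base R by filling a three-column matrix row by row for each Key. This is an illustration under the same assumption as above, namely that rows are already in the intended order within each Key; to_wide is a hypothetical helper name, not something from the answer.

```r
data <- data.frame(
  Key = rep(c(200, 300, 400), each = 6),
  V = c("a", "b", "c", "b", "d", "c", "d", "b", "c", "a", "f", "c",
        "d", "b", "c", "a", "b", "c")
)

# Hypothetical helper: one wide row per run of three V values within each Key
to_wide <- function(d) {
  pieces <- lapply(split(d, d$Key), function(g) {
    v <- g$V
    length(v) <- 3 * ceiling(length(v) / 3)   # pad with NA up to a multiple of 3
    m <- matrix(v, ncol = 3, byrow = TRUE)    # fill row by row
    data.frame(Key = g$Key[1], x = m[, 1], y = m[, 2], z = m[, 3])
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

wide <- to_wide(data)
wide
```

Dropping rows 1 and 8 first, as in the update above, should leave NA in the z column of the incomplete rows.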

R dplyr - Sum values for different factors

I have multiple factors ("a","b","c") in my dataset, each with corresponding values for Price and Cost.
dat <- data.frame(
ProductCode = c("a", "a", "b", "b", "c", "c"),
Price = c(24, 37, 78, 45, 20, 34),
Cost = c(10,15,45,25,10,17)
)
I am looking for the sum of Price and Cost for each ProductCode.
by.code <- group_by(dat, code)
by.code <- summarise(by.code,
SumPrice = sum(Price),
SumCost = sum(Cost))
This code does not work as it sums all values in the column, without breaking them into categories.
SumPrice SumCost
1 238 122
Thanks in advance for your help.
This is not dplyr, but this answer is for you if you don't mind using the sqldf or data.table package:
library(sqldf)
sqldf("select ProductCode, sum(Price) as PriceSum, sum(Cost) as CostSum from dat group by ProductCode")
ProductCode PriceSum CostSum
a 61 25
b 123 70
c 54 27
OR using the data.table package:
library(data.table)
MM<-data.table(dat)
MM[, list(sum(Price),sum(Cost)), by = ProductCode]
ProductCode V1 V2
1: a 61 25
2: b 123 70
3: c 54 27
Your code works fine apart from one mismatch: the data frame column is named ProductCode, but group_by() refers to code. Rename the column to code (or group by ProductCode instead) and R gives the proper output. Below is the code:
library(dplyr)
dat <- data.frame(
code = c("a", "a", "b", "b", "c", "c"),
Price = c(24, 37, 78, 45, 20, 34),
Cost = c(10,15,45,25,10,17)
)
dat
by.code <- group_by(dat, code)
by.code <- summarise(by.code,
SumPrice = sum(Price),
SumCost = sum(Cost))
by.code
We can use aggregate from base R
aggregate(.~ProductCode, dat, sum)
# ProductCode Price Cost
#1 a 61 25
#2 b 123 70
#3 c 54 27
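Since the question asks for dplyr, here is a minimal sketch that keeps the original column name and groups by ProductCode directly. The explicit dplyr:: prefixes are a precaution, not something the question calls for: a single ungrouped row like the one shown in the question is what plyr::summarise would return if plyr had been loaded after dplyr, but whether that happened in the asker's session is an assumption.

```r
library(dplyr)

dat <- data.frame(
  ProductCode = c("a", "a", "b", "b", "c", "c"),
  Price = c(24, 37, 78, 45, 20, 34),
  Cost = c(10, 15, 45, 25, 10, 17)
)

# Group by the column as it is actually named, then sum within each group
by_code <- dat %>%
  dplyr::group_by(ProductCode) %>%
  dplyr::summarise(SumPrice = sum(Price), SumCost = sum(Cost))
by_code
```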

Summation of variables by Groups in R

I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:
BEFORE:
data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60))
AFTER:
data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
sex = c("m", "f", "m", "f", "m", "f"),
value = c(10, 20, 30, 40, 50, 60),
sum = c(30, 30, 70, 70, 110, 110))
In Stata you can do this with the egen command quite easily. I've tried the aggregate function, and the ddply function but they create entirely new data frames, and I just want to add a column to the existing one.
You are looking for ave
> data2 <- transform(data1, sum=ave(value, month, FUN=sum))
month sex value sum
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform
Also data.table is helpful
library(data.table)
DT <- data.table(data1)
DT[, sum:=sum(value), by=month]
UPDATE
We can also use a tidyverse approach which is simple, yet elegant:
> library(tidyverse)
> data1 %>%
group_by(month) %>%
mutate(sum=sum(value))
# A tibble: 6 x 4
# Groups: month [3]
month sex value sum
<dbl> <fct> <dbl> <dbl>
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
