Using combn with apply in R

I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"), X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
That is, "Weight" is the sum of each possible pair of site values, calculated separately for each column (which ends up in "Type").
I suppose the answer has to involve using apply on a combn expression, as in Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand, taking the combinations for "Site"
sites <- combn(obs$Site, 2)
then the individual columns like so
combX <- combn(obs$X, 2, function(x) sum(x))
and binding those datasets together, but this obviously becomes tedious very quickly.
I have tried to do all the variable columns in one go like this
b <- apply(obs[, -1], 1, function(x){
sum(utils::combn(x, 2))
}
)
but there is something wrong with that.
Can anyone help, please?

One option would be to create a function and then map that function to all the columns that you have.
library(dplyr)
library(purrr)

func1 <- function(var){
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
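As a side note, the map-then-bind step can be collapsed with purrr's map_dfr(), which row-binds as it maps:
map_dfr(colnames(obs)[-1], func1)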

Here is an example using combn:
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        # for each pair of row indices k, put the two site names side by side and
        # stack the column sums of the numeric variables into long format
        FUN = function(k) cbind(data.frame(t(obs[k, 1])),
                                stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
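To get the column names requested in the question, assign the result above to a variable, say edges (a hypothetical name), and rename:
names(edges) <- c("Node1", "Node2", "Weight", "Type")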

Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187
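A small aside: merge() on two data frames with no common columns performs a cross join; tidyr::crossing() (attached with the tidyverse) does the same thing more explicitly:
crossing(sites, Type)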

Related

Find unique entries in otherwise identical rows

I am currently trying to find a way to find unique column values in otherwise duplicate rows in a dataset.
My dataset has the following properties:
The dataset's columns comprise an identifier variable (ID) and a large number of response variables (x1 - xn).
Each row should represent one individual, meaning the values in the ID column should all be unique (and not repeated).
Some rows are duplicated, with repeated entries in the ID column and seemingly identical response item values (x1 - xn). However, the dataset is too large to get a full overview of all variables.
As demonstrated in the code below, if rows are truly identical for all variables, then the duplicate row can be removed with the dplyr::distinct() function. In my case, not all "duplicate" rows are removed by distinct(), which can only mean that not all entries are identical.
I want to find a way to identify which entries are unique in these otherwise duplicate rows.
Example:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
# The dataframe with all entries
df
A data.frame: 6 × 7
ID x1 x2 x3 x4 x5 x6
1 4 a 7 d x g
1 4 a 10 d p g
2 5 b 8 e y h
2 5 b 8 e y h
3 6 c 9 f z i
3 6 d 11 f q i
# The dataframe
df %>%
  # with duplicates removed
  distinct() %>%
  # keep only rows whose ID value is duplicated
  janitor::get_dupes(ID)
ID dupe_count x1 x2 x3 x4 x5 x6
1 2 4 a 7 d x g
1 2 4 a 10 d p g
3 2 6 c 9 f z i
3 2 6 d 11 f q i
In the example above I demonstrate how dplyr::distinct() will remove fully duplicate rows (ID = 2), but not rows that differ in some columns (the rows where ID = 1 and 3, which differ in columns x2, x3 and x5).
What I want is an overview over which columns that are not duplicates for each value:
df %>%
  distinct() %>%
  janitor::get_dupes(ID) %>%
  # Here I want a way to find columns with unidentical entries:
  find_nomatch()
ID x2 x3 x5
1 7 x
1 10 p
3 c 9 z
3 d 11 q
A data.table alternative. Coerce data frame to a data.table (setDT). Melt data to long format (melt(df, id.vars = "ID")).
Within each group defined by 'ID' and 'variable' (corresponding to the columns in the wide format) (by = .(ID, variable)), count number of unique values (uniqueN(value)) and check if it's equal to the number of rows in the subgroup (== .N). If so (if), select the entire subgroup (.SD).
Finally, reshape the data back to wide format (dcast).
library(data.table)
setDT(df)
d = melt(df, id.vars = "ID")
dcast(d[, if (uniqueN(value) == .N) .SD, by = .(ID, variable)],
      ID + rowid(ID, variable) ~ variable)
# ID ID_1 x2 x3 x5
# 1: 1 1 <NA> 7 x
# 2: 1 2 <NA> 10 p
# 3: 3 1 c 9 z
# 4: 3 2 d 11 q
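The ID_1 helper column comes from rowid() and just numbers the rows within each ID; if it isn't needed, it can be dropped afterwards (assuming the result above was assigned to res):
res[, ID_1 := NULL]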
A bit simpler than yours, I think:
library(dplyr)
library(janitor)
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = c("a", "a", "b", "b", "c", "d"),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2),
"x5" = c("x", "p", "y", "y", "z", "q"),
"x6" = rep(letters[7:9], each = 2)
)
d <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)
d %>%
  group_by(ID) %>%
  # For each ID, check which elements of each row differ from the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) .x[1, ] != .y)) %>%
  do.call(what = cbind) %>%  # Bind results for all IDs
  apply(1, any) %>%          # TRUE for each column that differs anywhere
  c(T, .) %>%                # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-18 by the reprex package (v2.0.1)
Edit: the same idea, comparing elements with identical() instead of !=:
d %>%
  group_by(ID) %>%
  # For each ID, check which elements of each row differ from the first row
  group_map(\(.x, .id) apply(.x, 1, \(.y) !Vectorize(identical)(unlist(.x[1, ]), .y))) %>%
  do.call(what = cbind) %>%  # Bind results for all IDs
  apply(1, any) %>%          # TRUE for each column that differs anywhere
  c(T, .) %>%                # Keep the ID column
  `[`(d, .)
#> ID x2 x3 x5
#> 1 1 a 7 x
#> 2 1 a 10 p
#> 3 3 c 9 z
#> 4 3 d 11 q
Created on 2022-01-19 by the reprex package (v2.0.1)
I have been working on this issue for some time and I found a solution, though it took more steps than I would've thought necessary. I can only presume there's a more elegant solution out there. Anyway, this should work:
df <- df %>%
  distinct() %>%
  janitor::get_dupes(ID)

# Make a vector of the unique duplicated ID values
l <- distinct(df, ID) %>% unlist()

# lapply over each ID
df <- lapply(
  l,
  function(x) {
    # Filter rows for the duplicated ID
    dplyr::filter(df, ID == x) %>%
      # Transpose the data frame (converts it into a matrix)
      t() %>%
      # Convert back to a data frame
      as.data.frame() %>%
      # Keep only rows (former columns) whose entries are not all identical
      dplyr::filter(!if_all(everything(), ~ . == V1)) %>%
      # Transpose back
      t() %>%
      # Convert back to a data frame
      as.data.frame()
  }
) %>%
  # Bind the data frames in the list together
  bind_rows() %>%
  # Finally, move the columns back into ascending order
  relocate(x2, .before = x3)

# Remove row names (not necessary)
row.names(df) <- NULL
df
A data.frame: 4 × 3
x2 x3 x5
NA 7 x
NA 10 p
c 9 z
d 11 q
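One caveat worth noting with the transpose approach: t() on a data frame goes through a matrix, so every value is coerced to character, and numeric columns such as x3 come back as character. A hedged cleanup using base R's type.convert():
df <- type.convert(df, as.is = TRUE)  # restore numeric columns after the double transpose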
Feel free to comment
If you just want to keep the first instance of each identifier:
df <- data.frame(
"ID" = rep(1:3, each = 2),
"x1" = rep(4:6, each = 2),
"x2" = rep(letters[1:3], each = 2),
"x3" = c(7, 10, 8, 8, 9, 11),
"x4" = rep(letters[4:6], each = 2)
)
df %>%
  distinct(ID, .keep_all = TRUE)
Output:
ID x1 x2 x3 x4
1 1 4 a 7 d
2 2 5 b 8 e
3 3 6 c 9 f

How can I extract a subset of data based on another data frame and grab observations before and after that subset

I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub, where the resulting data frame is df_sub plus the observations that occur immediately before and after it.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))
They look like
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
  var1 var2
2    x    1
3    x    2
4    y   45
5    z   56
I was thinking of using an inner_join or something similar.
We could use match to get the indexes, then add and subtract 1 from those indexes, take the unique values, and subset the rows:
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)))
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))), ]
Output:
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
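If a match can occur in the very first or last row, the shifted indexes can step outside the data (index 0 is silently dropped, nrow + 1 yields an NA row); a small hedged variant clamps them to the valid range:
idx <- unique(pmin(pmax(v1 + rep(c(-1, 0, 1), each = length(v1)), 1), nrow(df)))
df[idx, ]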
Or create a 'flag' column in 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag':
library(dplyr)
df %>%
  left_join(df_sub %>%
              mutate(flag = TRUE)) %>%
  filter(flag | lag(flag) | lead(flag)) %>%
  select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via the join, then subset the data from the minimum row number - 1 through the maximum row number + 1:
library(dplyr)
tmp <- df %>%
  mutate(row = row_number()) %>%
  inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56

Dividing Counts only of certain groups

An example of the data:
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)
data <- data.frame(Var1, Var2, Count)
This produces (the counts are random, so your numbers will differ):
Var1 Var2 Count
1 X A 89
2 Y B 97
3 Z A 29
4 X B 38
5 Y A 50
6 Z B 88
I would like to divide only the counts where Var2 is "B" by two, to get
Var1 Var2 Count Count2
1 X A 89 89
2 Y B 97 48.5
3 Z A 29 29
4 X B 38 19
5 Y A 50 50
6 Z B 88 44
But I'm not sure how to only divide based on a variable.
I'm new to coding, so any help is appreciated!
Base R solution:
data$Count2 <- data$Count  # copy to a new variable
# Then change the subset to the desired value: the LHS subsets, the RHS provides the change
data$Count2[data$Var2 == "B"] <- data$Count[data$Var2 == "B"] / 2
And a tidyverse/dplyr solution:
library(dplyr)
data <- data %>%
  mutate(Count2 = ifelse(Var2 == "B", Count / 2, Count))
# alternatively, this is identical to the above
data <- mutate(data, Count2 = ifelse(Var2 == "B", Count / 2, Count))
A slight variation on Brian's dplyr solution: use replace to update a portion of the column inside the mutate call. Note that the compound assignment pipe %<>% comes from magrittr, which must be attached explicitly; it is not attached by the tidyverse.
library(tidyverse)
library(magrittr)  # for %<>%
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)
data <- data.frame(Var1, Var2, Count)
data %<>%
  mutate(Count = replace(Count, Var2 == 'B', Count[Var2 == 'B'] / 2))
With data.table:
library(data.table)
setDT(data)[, Count := as.numeric(Count)][Var2 == 'B', Count := Count/2]

Is it possible to define a `function` and insert it in `formula` to construct a `model.frame` in R?

I was trying to extract a model.frame in R by defining a function to use in the formula, as follows:
library(Formula)
df <- data.frame(c = LETTERS[1:2], a = c(74, 80), b = c(8, 10))
soln <- function(x, y){
  A <- matrix(c(1, 1, 1, -2), nrow = 2)
  B <- matrix(c(x, y), nrow = 2)
  return((as.matrix(solve(A) %*% B))[1, ])
}
F1 <- Formula::Formula(c ~ . | (soln(a, b)))
mf <- stats::model.frame(F1, data = df)
mf
c a b soln(a, b)
1 A 74 8 76.000000
2 B 80 10 8.666667
Here mf provides a data.frame that does NOT match the actual values from the soln() function. In fact soln(74, 8) = 52 and soln(80, 10) = 56.66667, but mf shows 76 and 8.666667. How is this function being evaluated in the model formula? Is it possible to define a function so as to get the correct values in the model.frame this way?
The stray values arise because model.frame() evaluates soln(a, b) just once, on the full columns: x and y arrive as length-2 vectors, so B becomes a 2 × 2 matrix and solve(A) %*% B mixes values from both rows. We can loop through the sequence of rows to get the expected output:
do.call(rbind, lapply(seq_len(nrow(df)), function(i) model.frame(F1, data = df[i,])))
# c a b soln(a, b)
#1 A 74 8 52.00000
#2 B 80 10 56.66667
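The numbers in the original mf can be reproduced by calling soln() on the full columns, which confirms the whole-column evaluation:
soln(df$a, df$b)
#> [1] 76.000000  8.666667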
Or using the tidyverse:
library(tidyverse)
df %>%
  group_split(rn = row_number()) %>%
  map_df(~ model.frame(F1, data = .x))
# c a b rn soln(a, b)
#1 A 74 8 1 52.00000
#2 B 80 10 2 56.66667
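Alternatively, a hedged sketch: wrapping soln in Vectorize() makes it apply row by row (via mapply), so the formula can be evaluated in a single pass; the resulting column is named after the expression:
F2 <- Formula::Formula(c ~ . | Vectorize(soln)(a, b))
stats::model.frame(F2, data = df)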

Count occurrence of a categorical variable, when grouping and summarising by a different variable in R

I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of column a, grouped by variable c, along with a column counting the occurrences of the b types within each group of c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <- c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the counts_b column shown in the example df2.
Here's a way using a little table magic:
df %>%
  group_by(c) %>%
  summarise(a_mean = mean(a),
            b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
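Note that the output keeps zero-count levels (the "r 0" for group a) because b is a factor here and table() tabulates every level. A hedged tweak that drops absent values, using a small hypothetical helper count_b():
# count_b() tabulates only the values actually present and formats them as "count value"
count_b <- function(b) {
  tab <- table(as.character(b))  # as.character() discards unused factor levels
  paste(tab, names(tab), collapse = ", ")
}
df %>%
  group_by(c) %>%
  summarise(a_mean = mean(a), b_list = count_b(b))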
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b gets its own column with the number of occurrences.
library(reshape2)
out1 <- dcast(df, c ~ b, value.var = "c", fun.aggregate = length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
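For completeness, a hedged sketch of the same wide layout using current tidyverse tools (count() plus pivot_wider()) instead of reshape2:
library(dplyr)
library(tidyr)
df %>%
  count(c, b) %>%                                                     # occurrences of each b within c
  pivot_wider(names_from = b, values_from = n, values_fill = 0) %>%  # one column per b value
  left_join(df %>% group_by(c) %>% summarise(a_m = mean(a)), by = "c")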
