R Find Distance Between Two values By Group - r

HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
WANT = data.frame(INSTRUCTOR = c(1, 2, 3),
SCORE.DIF = c(-9, NA, 6))
For each INSTRUCTOR, I wish to find the SCORE of the first and second STUDENT, and subtract their scores. The STUDENT code varies so I wish not to use '==1' vs '==2'
I try:
HAVE[, .SD[1:2], by = 'INSTRUCTOR']
but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'

library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
)[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
To use your new TIME variable, we can do
HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
One might be tempted to replace SCORE[1:2] with head(SCORE,2), but that won't work: head(SCORE,2) will return length-1 if the input is length-2, as it is with instructor 2 (who only has one student albeit multiple times). When you run diff on length-1 (e.g., diff(1)), it returns a 0-length vector, which in the above data.table code reduces to zero rows for instructor 2. However, when there is only one student, SCORE[1:2] resolves to c(SCORE[1], NA), for which the diff is length-1 (as needed) and NA (as needed).

Related

Dplyr: Create two columns based on specific conditions

In this dataset DF, we have 4 names and 4 professions.
DF<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,
"diana", 4, 1, 2, 3,
"shakira", 2, 1, 3, 4,
"armstrong", 3, 4, 1, 2,
"picasso", 1, 3, 1, 4
)
Assume that the cell values are some measure of their their profession. So, for instance, Diana has highest cell value for princess (correctly) but Shakira has highest cell value for painter (incorrectly).
I want to create two columns called "Compatible" and "Incompatible" where the program will pick value of 4 for Diana as it is under the correct profession Princess and assign it to column "Compatible" and in the "Incompatible" put an average of the other 3 values. For Shakira, it will pick the value 1 from the correct profession of singer, and assign it to Compatible; for Incompatible it average the other values. Similarly for other names
So the output will be like this:
DF1<-tribble(
~names, ~princess, ~singer, ~astronaut, ~painter,~Compatible,~Incompatible,
"diana", 4, 1, 2, 3, 4, 2,
"shakira", 2, 1, 3, 4, 1, 3,
"armstrong", 3, 4, 1, 2, 1, 3,
"picasso", 1, 3, 1, 4, 4, 1.66
)
Here is the dataset which shows the correct names and professions:
DF3<- tribble(
~names, ~professions,
"diana", "princess",
"shakira", "singer",
"armstrong", "astronaut",
"picasso", "painter"
)
DF1[1:5] %>%
pivot_longer(-names) %>%
left_join(DF3, 'names') %>%
group_by(names, name = if_else(name == professions, 'compatible', 'incompatible')) %>%
summarise(profession = first(professions), value = mean(value), .groups = 'drop') %>%
pivot_wider()
# A tibble: 4 x 4
names profession compatible incompatible
<chr> <chr> <dbl> <dbl>
1 armstrong astronaut 1 3
2 diana princess 4 2
3 picasso painter 4 1.67
4 shakira singer 1 3

Count co-occurrences between elements in one column based on second column, and count only if unequal in third column

I want to count how often each pairwise combination of unique elements in column c in data frame df co-occurs on the elements of column a, but with the addition that co-occurrences are only counted if the respective values in column b are unequal, i.e., conditional on a non-match in column b
a <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
b <- c(1,1,2,2,2,1,1,2,2,3,3,3,3,1,1,1,2,2,2,4)
c <- c(1,2,1,2,3,2,3,1,2,1,1,2,3,1,2,1,1,2,4,1)
df <- as.data.frame(cbind(a,b,c))
Without considering column b I could do the following to retain for each pair of elements of column c, on how many elements of a they co-occur
df <- unique(df[,c(1,3)])
df <- merge(df, df, by = "a")
df$count <- 1
df <- aggregate(count ~ ., df[, c(2:4)], sum)
df <- df[df$c.x != df$c.y,]
With the additional condition of a non-match in b, there is only one difference: elements 2 and 4 of column c both co-occur on element 4 of column a, but have the same value in b and should therefore not be counted to end up with:
c.x <- c(2,3,4,1,3,1,2,1)
c.y <- c(1,1,1,2,2,3,3,4)
count <- c(4,3,1,4,3,3,3,1)
result <- as.data.frame(cbind(c.x,c.y,count))
As the original data set is large (> 1,000,000 observations), I welcome fast solutions, i.e., without using loops or merges. Usually, I create co-occurrence matrices from three-column data frames using sparseMatrix()
I'm not sure from your description if this is what you had in mind, nor how fast this would turn out to be, but here is an approach with purrr:
library(purrr)
split(df, c) %>%
combn(2, simplify = F) %>%
set_names(map(., ~ paste(names(.x), collapse = "_"))) %>%
map_int(~ merge(.x[[1]], .x[[2]], by = NULL) %>%
dplyr::filter(a.x == a.y && b.x != b.y) %>%
nrow())
Returns:
1_2 1_3 1_4 2_3 2_4 3_4
0 27 0 21 0 0
# Data used:
df <- structure(list(a = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4), b = c(1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 4), c = c(1, 2, 1, 2, 3, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2, 4, 1)), class = "data.frame", row.names = c(NA, -20L))

How to calculate the number of edges between nodes of a certain group?

I have the following dataset and the following script:
library(GGally)
library(ggnet)
library(network)
library(sna)
library(ggplot2)
# edgelist
e <- data.frame(sender = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5),
receiver = c(2, 3, 4, 5, 1, 3, 1, 1, 2, 2, 4, 3, 2, 4))
# information about the nodes (vertices)
v <- data.frame(actors = c(1, 2, 3, 4, 5),
groups = c("A", "A", "B", "C", "D"))
net <- network(e, directed = TRUE)
x = data.frame(actors = network.vertex.names(net))
x = merge(x, v, by = "actors", sort = FALSE)$groups
net %v% "group" = as.character(x)
y = RColorBrewer::brewer.pal(9, "Set1")[ c(3, 1, 9, 6, 8) ]
names(y) = levels(x)
ggnet2(net, color = "group", palette = y, alpha = 0.75, size = 4, edge.alpha = 0.5, arrow.size = 8, arrow.gap = 0.01)
Is there an easy and fast way to calculate the number of edges from group of nodes (A, B, C, D) to the same group of nodes or to another group of nodes (i.e., from A-A, A-B, A-C, A-D, B-A, etc.)?
There is network.edgecountin the network-Package. But how can I apply it to my question?
The mixingmatrix function from network does the trick. It displays the number of ties within and between groups, getting at homophily and assortive mixing.
> mixingmatrix(net, "group")
To
From A B C D Total
A 3 2 1 1 7
B 3 0 1 0 4
C 1 1 0 0 2
D 0 0 1 0 1
Total 7 3 3 1 14

Changing rank across time in R as opposed to by group

I am having an issue in R where I want to add a rank (or Index) column, though as opposed to the rankings changing every time the combination changes. I want it to change every time the previous combination changes. I will illustrate what I mean in the code below.
df <- data.frame(id = c(1, 1, 1, 1, 1),
time = c(1, 2, 3, 4, 5),
group = c(1, 2, 2, 1, 3),
rank1 = c(1, 2, 2, 1, 3),
rank2 = c(1, 2, 2, 3, 4))
In the example I am ranking by group. rank1 is consistent with what I have been able do so far, which is at time 4 the rank is 1 because there was a previous instance of that group. I want something similar to rank2, because it accounts for there being a gap between the group == 1 instances, and assigns a different rank accordingly (i.e. at time 4 in rank2 is 3 as opposed to 1).

count non-NA values and group by variable

I am trying to show how many complete observations there are per variabie ID without using the complete.cases package or any other package.
If I use na.omit to filter out the NA values, I will lose all of the IDs which might have ZERO complete cases.
In the end, I'd like a frequency table with two columns: ID and Number of Complete Observations
> length(unique(data$ID))
[1] 332
> head(data)
ID value
1 1 NA
2 1 NA
3 1 NA
4 1 NA
5 1 NA
6 1 NA
> dim(data)
[1] 772087 2
When I try to create my own function z - which counts non-NA values and apply that in the aggregate() function, the IDs with zero complete observations are left out. I should be left with 332 rows, not 323. How does one resolve this using base functions?
z <- function(x){
sum(!is.na(x))
}
aggregate(value ~ ID, data = data , FUN = "z")
> nrow(aggregate(isna ~ ID, data = data , FUN = "z"))
[1] 323
One of the ways to do this is using table:
df2 <- table(df$Id, !is.na(df$value))[,2]
data.frame(ID = names(df2), value = df2)
Data
structure(list(Id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4), value = c(NA,
1, 1, 2, 2, NA, 3, NA, 3, 3, 4, 4)), .Names = c("Id", "value"
), row.names = c(NA, -12L), class = "data.frame")
Base R you can use your utility function like this:
stack(by(data$value, data$ID, FUN=function(x) sum(!is.na(x))))
you can directly use table for this purpose. Below is the sample code:
df1 <- structure(list(Id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4), value = c(2,
1, 1, NA, NA, NA, 3, NA, 3, 3, 4, 4)), .Names = c("Id", "value"
), row.names = c(NA, -12L), class = "data.frame")
df2 <- as.data.frame.matrix(with(df1, table(Id, value)))
resultDf <- data.frame(Id=row.names(df2), count=apply(df2, 1, sum))
resultDf
The code makes a table of id and value. Then it just sums the non-na values from the table. Hope this is easy to understand and helps.

Resources