R: Selecting highest count cells conditional on two columns

Apologies if this is a duplicate; please let me know and I'll gladly delete it.
I am attempting to select the four highest values for different values of another column.
Dataset:
A B COUNT
1 1 2 2
2 1 3 6
3 1 4 3
4 1 5 9
5 1 6 2
6 1 7 7
7 1 8 0
8 1 9 5
9 1 10 2
10 1 11 7
11 2 1 5
12 2 3 1
13 2 4 8
14 2 5 9
15 2 6 5
16 2 7 2
17 2 8 2
18 2 9 4
19 3 1 7
20 3 2 5
21 3 4 2
22 3 5 8
23 3 6 6
24 3 7 1
25 3 8 9
26 3 9 5
27 4 1 8
28 4 2 1
29 4 3 1
30 4 5 3
31 4 6 9
For example, I would like to select four highest counts when A=1 (9,7,7,6) then when A=2 (9,8,5,5) and so on...
I would also like the corresponding B column value to be beside each count, so for when A=1 my desired output would be something like:
B A Count
5 1 9
7 1 7
11 1 7
3 1 6
I have looked at various answers on 'selecting highest values' but was struggling to find an example that conditions on other columns.
Many thanks

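We can group by A, sort by descending COUNT, and keep the first four rows of each group:
library(dplyr)
df1 %>%
  group_by(A) %>%
  arrange(desc(COUNT)) %>%
  # row_number() restarts at 1 within each group of A
  filter(row_number() < 5)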

Or, using slice():
library(dplyr)
data %>%
  group_by(A) %>%
  arrange(A, desc(COUNT)) %>%
  slice(1:4)
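For comparison, the same "sort, then keep the first four rows per group" idea can be sketched in base R. This is only an illustration, not part of either answer: the data frame below holds just the A = 1 and A = 2 rows from the question, and the names `sorted`, `rank_in_group` and `top4` are made up for the example.

```r
# Rows for A = 1 and A = 2, copied from the question's dataset
df1 <- data.frame(
  A     = c(rep(1, 10), rep(2, 8)),
  B     = c(2:11, 1, 3:9),
  COUNT = c(2, 6, 3, 9, 2, 7, 0, 5, 2, 7,
            5, 1, 8, 9, 5, 2, 2, 4)
)

# Sort by A, then by descending COUNT (ties keep their original order)
sorted <- df1[order(df1$A, -df1$COUNT), ]

# Position of each row within its A group: 1, 2, 3, ...
rank_in_group <- ave(seq_len(nrow(sorted)), sorted$A, FUN = seq_along)

# Keep the first four rows of every group
top4 <- sorted[rank_in_group <= 4, c("B", "A", "COUNT")]
print(top4)
```

Ties are resolved by row order here, so if the fourth and fifth counts in a group are equal, only the one that appears first in the data is kept.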

Related

Assign ID based on a sequence of consecutive days in R

I have a dataset with repeated measures which I want to use to assign IDs. The repeated measures are from a sequence of consecutive days. However, the sequence itself may be unbalanced (e.g., some have more days while others have fewer, some start with day 1 but a few others may start with 2 or 3). My question is how to create and assign the same ID within each block of the sequence. Here is a toy dataset:
days <- data.frame(
day = c(1L,2L,3L,4L,5L,6L,8L,9L,10L,
2L,3L,4L,5L,6L,7L,9L,10L,
1L,2L,4L,5L,6L,8L,9L,10L,
1L,2L,3L,4L,5L,6L,7L,8L,9L,10L)
)
Here is the end result I expect:
id day
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 8
8 1 9
9 1 10
10 2 2
11 2 3
12 2 4
13 2 5
14 2 6
15 2 7
16 2 9
17 2 10
18 3 1
19 3 2
20 3 4
21 3 5
22 3 6
23 3 8
24 3 9
25 3 10
26 4 1
27 4 2
28 4 3
29 4 4
30 4 5
31 4 6
32 4 7
33 4 8
34 4 9
35 4 10
Take the difference between adjacent elements, flag where it is negative (the day number drops, so a new block starts), and take the cumulative sum of those flags:
days$id <- cumsum(c(TRUE, diff(days$day) < 0))
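To see why this works, it may help to look at the intermediate steps. The sketch below uses the toy data from the question; the names `step` and `starts_block` are only illustrative. It relies on the assumption that every new block starts at a lower day than the last day of the previous block, which holds for this data.

```r
days <- data.frame(
  day = c(1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L,
          2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L,
          1L, 2L, 4L, 5L, 6L, 8L, 9L, 10L,
          1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L)
)

# Change from one row to the next; negative where the day number drops
step <- diff(days$day)

# The first row always starts a block; later rows start one when step < 0
starts_block <- c(TRUE, step < 0)

# Each TRUE adds 1 to the running total, giving the block id
days$id <- cumsum(starts_block)

# Number of rows assigned to each id
print(table(days$id))
```

If a new block could start at a day equal to or higher than the previous block's last day, the drop would not be visible and the two blocks would be merged.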

Use apply to create a list of adjacency matrices from dataframe in R

I have an edgelist of friendships with 5 different schools over 3 waves. I'd like to create a list for each school that contains 3 adjacency matrices (one for each wave). I can do this one by one, but I would like to use a loop or an apply function to automate it.
This is the code I have used for one school and wave:
school1_w1 <- filter(edges, school == 1 & wave == 1) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1_w2 <- filter(edges, school == 1 & wave == 2) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1_w3 <- filter(edges, school == 1 & wave == 3) %>%
graph_from_data_frame(., directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
school1 <- list(school1_w1, school1_w2, school1_w3)
How can I do this for all 5 schools with an apply or loop? Sample data below:
ego alter wave school
1 4 1 1
1 4 2 1
1 3 3 1
2 3 1 1
2 4 2 1
2 4 3 1
3 1 1 1
3 2 2 1
3 3 3 1
4 1 1 1
4 1 2 1
4 1 3 1
5 8 1 2
5 6 2 2
5 7 3 2
6 7 1 2
6 7 2 2
6 7 3 2
7 8 1 2
7 6 2 2
7 6 3 2
8 7 1 2
8 7 2 2
8 7 3 2
9 10 1 3
9 11 2 3
9 12 3 3
10 11 1 3
10 11 2 3
10 9 3 3
11 12 1 3
11 10 2 3
11 12 3 3
12 9 1 3
12 10 2 3
12 10 3 3
13 14 1 4
13 15 2 4
13 16 3 4
14 16 1 4
14 16 2 4
14 13 3 4
15 16 1 4
15 16 2 4
15 16 3 4
16 15 1 4
16 15 2 4
16 15 3 4
17 20 1 5
17 18 2 5
17 18 3 5
18 19 1 5
18 20 2 5
18 19 3 5
19 17 1 5
19 17 2 5
19 17 3 5
20 18 1 5
20 17 2 5
20 17 3 5
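We can use split + lapply, which returns a flat list with one matrix per school/wave combination (named "1.1", "2.1", and so on, i.e. school.wave):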
library(igraph)
result <- lapply(split(edges, list(edges$school, edges$wave)), function(x) {
graph_from_data_frame(x, directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
})
Or with by:
result <- by(edges, list(edges$school, edges$wave), function(x) {
graph_from_data_frame(x, directed = TRUE) %>%
as_adjacency_matrix() %>% as.matrix()
})
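Since the question asks for one list per school containing one matrix per wave, a nested split may be closer to the requested shape. The following is a minimal sketch under some assumptions: it uses only the first twelve rows of the sample data, and `to_adjacency` is a hypothetical base R stand-in for the igraph pipeline above (it sorts the node ids, so row and column order may differ from what igraph returns).

```r
edges <- data.frame(
  ego    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
  alter  = c(4, 4, 3, 4, 1, 2, 1, 1, 8, 6, 7, 7),
  wave   = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  school = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
)

# Hypothetical helper: each cell counts the ego -> alter rows in x
to_adjacency <- function(x) {
  nodes <- as.character(sort(unique(c(x$ego, x$alter))))
  counts <- table(factor(x$ego, levels = nodes),
                  factor(x$alter, levels = nodes))
  matrix(as.vector(counts), nrow = length(nodes),
         dimnames = list(nodes, nodes))
}

# Outer split by school, inner split by wave
nested <- lapply(split(edges, edges$school), function(s) {
  lapply(split(s, s$wave), to_adjacency)
})

# Matrix for school 1, wave 1
print(nested[["1"]][["1"]])
```

Replacing the body of `to_adjacency` with the `graph_from_data_frame()` pipeline from the answer should give the same nested structure with igraph-built matrices.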

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
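You need a data frame (here a tibble), then use mutate to add the chunk column. Note that applying ntile directly to the id column still splits id 4 across chunks 1 and 2, as the output shows:
library(dplyr)
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
               value = 1:16)
out <- test %>%
  mutate(new_column = ntile(id, 3))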
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or given Frank's comment you could run the ntile function on distinct/unique values of the id - then join the original table back on id:
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
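The same "assign chunks to distinct ids, then map back to rows" idea can also be sketched in base R, which makes it easy to pull out one chunk at a time as the question asks. This is an illustrative alternative rather than the answer's method: `n_chunks`, `chunk_of_id` and the `chunk` column are names chosen for the example, and the ceiling-based split may place ids differently from ntile when the number of ids is not divisible by the number of chunks.

```r
test <- data.frame(id  = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
                   val = 1:16)
n_chunks <- 3

# Work on distinct ids so that an id can never straddle two chunks
ids <- unique(test$id)

# Spread the distinct ids as evenly as possible over the chunks
chunk_of_id <- ceiling(seq_along(ids) * n_chunks / length(ids))

# Map each row back to the chunk of its id
test$chunk <- chunk_of_id[match(test$id, ids)]

# Look at a single chunk
print(test[test$chunk == 2, ])
```

Chunks are balanced by number of ids, not by number of rows, so a chunk containing a frequently repeated id will have more rows.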

Grouping cases with at least three variables in common in R

I want to group my dataset by multiple variables and then ID those groups. I can ID groups when I group by only one variable, using dplyr with group_indices.
But I want to group cases that share the same value on at least one of a certain set of variables, and then identify the group each case belongs to. How can I do this in R?
I have the following dataset
NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9
I want cases to be grouped when they have at least one variable of the three I listed (name, adress, phonenumber) in common.
Cases with most in common to each other should be grouped over cases that have the least in common.
So I want to create a grouping variable which gives cases the same value if they're in the same group.
You can assume the hierarchy of name>address>phone
NPI name adress phone org
1 1 1 1 1
2 1 1 1 1
3 2 2 2 2
4 2 3 3 2
5 3 4 4 3
6 3 4 5 3
7 4 5 6 4
8 5 6 6 4
9 6 7 7 5
10 7 8 8 6
11 1 9 9 1
In my real dataset I don't have numbers but names, actual addresses and phone numbers. So all the variables I'm working with are string variables.
Try this with dplyr:
library(dplyr)
df %>%
arrange(name, adress, phone) %>%
mutate(group = c(1, ifelse((name != lag(name)) & (adress != lag(adress)) & (phone != lag(phone)), 1, 0)[-1]),
group = cumsum(group)) %>%
arrange(NPI)
Result:
NPI name adress phone group
1 1 1 1 1 1
2 2 1 1 1 1
3 3 2 2 2 2
4 4 2 3 3 2
5 5 3 4 4 3
6 6 3 4 5 3
7 7 4 5 6 4
8 8 5 6 6 4
9 9 6 7 7 5
10 10 7 8 8 6
11 11 1 9 9 1
Note:
This works even if name, adress, and phone are all characters. As long as the id column (NPI) is numeric, the final data.frame will be in the correct order.
Data:
df = read.table(text = " NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9 ", header = TRUE)
library(dplyr)
df = df %>% mutate_at(vars(-NPI), as.character)
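The logic of the dplyr answer can be traced step by step in base R. The sketch below is an illustration under the same assumptions as the answer: rows are sorted by name, adress and phone, and a new group starts only when all three values differ from the previous sorted row. The names `s`, `cur`, `prev` and `all_differ` are made up for the example.

```r
df <- data.frame(
  NPI    = 1:11,
  name   = as.character(c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 1)),
  adress = as.character(c(1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9)),
  phone  = as.character(c(1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)),
  stringsAsFactors = FALSE
)

# Sort so that rows sharing a name (then adress, then phone) are adjacent
s <- df[order(df$name, df$adress, df$phone), ]

# Compare every row with the row just before it
cur  <- 2:nrow(s)
prev <- cur - 1
all_differ <- s$name[cur]   != s$name[prev] &
              s$adress[cur] != s$adress[prev] &
              s$phone[cur]  != s$phone[prev]

# A new group starts at the first row and wherever nothing is shared
s$group <- cumsum(c(TRUE, all_differ))

# Restore the original order
result <- s[order(s$NPI), ]
print(result)
```

Because only adjacent rows are compared, two cases that share, say, a phone number but end up far apart after sorting would not be linked by this approach.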

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and jumps by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment, the groupings can be arbitrary; you simply need to recode them as consecutive integers in order of appearance. There are a few ways to do this: @akrun has shown that it can be done with the match function, or you can use as.numeric on a factor if that is easier for you to understand.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
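Put differently, each value is the row number plus a fixed gap for every group that has already been passed. A small sketch that wraps this in a function with an adjustable gap is shown below; `uneven_seq` and its `gap` argument are hypothetical names for illustration, and the function assumes that rows belonging to the same x group are contiguous.

```r
# Hypothetical helper: row number plus `gap` for every earlier group
uneven_seq <- function(x, gap = 2) {
  group_index <- match(x, unique(x))  # 1, 2, 3, ... in order of appearance
  seq_along(x) + (group_index - 1) * gap
}

x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)
print(uneven_seq(x))

# With gap = 0 the result is just the row numbers
print(uneven_seq(x, gap = 0))
```

With the default gap of 2 this reproduces the desired output from the question.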
To create the sequence 0,0,4,4,4,9,..., take the minimum of newNumber within each group and subtract 1. The easiest way to do this is with dplyr:
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
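If dplyr is not available, the per-group minimum can also be computed with base R's ave(). This is a sketch of an equivalent, not the answer's code; `new_number` and `group_start` are illustrative names.

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)

# Same sequence as df$newNumber above
new_number <- seq_along(x) + (match(x, unique(x)) - 1) * 2

# ave() applies min() within each x group and repeats the result per row
group_start <- ave(new_number, x, FUN = min) - 1
print(group_start)
```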
