Subset multiple dataframes by a range of dates, combining into a single dataframe - r

I have 24 dataframes (named b1 - b24) in a list that follow the general format
b1 <- structure(list(Timestamp = c("13/7/1995", "14/7/1995", "15/7/1995",
"16/7/1995", "17/7/1995", "18/7/1995", "19/7/1995", "20/7/1995",
"21/7/1995", "22/7/1995", "23/7/1995", "24/7/1995", "25/7/1995",
"26/7/1995", "27/7/1995", "28/7/1995", "29/7/1995", "30/7/1995",
"31/7/1995", "1/8/1995"), Rainfall = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
CodeofStandard = c(83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L, 83L)),
row.names = c(NA, 20L), class = "data.frame")
b1
#> Timestamp Rainfall CodeofStandard
#> 1 13/7/1995 0 83
#> 2 14/7/1995 0 83
#> 3 15/7/1995 0 83
#> 4 16/7/1995 0 83
#> 5 17/7/1995 0 83
#> 6 18/7/1995 0 83
#> 7 19/7/1995 0 83
#> 8 20/7/1995 0 83
#> 9 21/7/1995 0 83
#> 10 22/7/1995 0 83
#> 11 23/7/1995 0 83
#> 12 24/7/1995 0 83
#> 13 25/7/1995 0 83
#> 14 26/7/1995 0 83
#> 15 27/7/1995 0 83
#> 16 28/7/1995 0 83
#> 17 29/7/1995 0 83
#> 18 30/7/1995 0 83
#> 19 31/7/1995 0 83
#> 20 1/8/1995 0 83
I have sorted each individual dataframe for the top 30 rainfall events. I want to take those timestamps and subset all 24 dataframes for the corresponding Rainfall value at each Timestamp. Some dates may be missing from the other dataframes, so where a value is missing I would like NA extracted in place of the Rainfall value.
I have currently been doing this in Excel. Please see the attached screenshot to get an idea of my pain.
So for the column b1/136001B (column G), I have copied and pasted the dates for the top 30 rainfall events for that dataframe, which retrieves the corresponding rainfall value for that dataframe, as well as for the other dataframes along the columns (I:AB). The left side of the image (columns B:D) contains all the rows from dataframes b1 to b24, which is where the matrix is indexing and matching from.
For example, the values for b1 (column H, 136001B) consider all the values in column D and index the corresponding value from column C if the date matches. Please see the image below for the equation. Dragging this along the length of b1:b24 and down the length of the top 30 days gives the matrix above.
As you can see, there are some NA values, which mean that the row containing that particular date does not exist.
Then I would like to convert that matrix into a dataframe with three columns, where Station is the station (b1:b24) from which the rainfall value was retrieved, Rainfall is that rainfall value, and Origin is the value from the top 30 rainfall events of the original dataframe (in this instance, b1).
Please see below for the output that I would like to end up with:
all_b1 <- structure(list(Station = c("b1", "b1", "b1", "b1", "b1", "b1",
"b1", "b1", "b1", "b1", "b1", "b2", "b2", "b2", "b2", "b2", "b2",
"b2", "b2", "b2", "b2"), Rainfall = c(66L, 64L, 64L, 62L, 62L,
61L, 61L, 60L, 59L, 59L, 59L, 211L, 176L, 68L, 134L, 135L, 220L,
100L, 57L, 27L, 98L), Origin = c(66L, 64L, 64L, 62L, 62L, 61L,
61L, 60L, 59L, 59L, 59L, 210L, 162L, 146L, 143L, 141L, 125L,
102L, 101L, 95L, 92L)), row.names = 20:40, class = "data.frame")
all_b1
#> Station Rainfall Origin
#> 20 b1 66 66
#> 21 b1 64 64
#> 22 b1 64 64
#> 23 b1 62 62
#> 24 b1 62 62
#> 25 b1 61 61
#> 26 b1 61 61
#> 27 b1 60 60
#> 28 b1 59 59
#> 29 b1 59 59
#> 30 b1 59 59
#> 31 b2 211 210
#> 32 b2 176 162
#> 33 b2 68 146
#> 34 b2 134 143
#> 35 b2 135 141
#> 36 b2 220 125
#> 37 b2 100 102
#> 38 b2 57 101
#> 39 b2 27 95
#> 40 b2 98 92
Please give me an idea of where to start. I would love to basically start again in R, because I have to do this multiple times with other sets of dataframes.
Also, please don't be afraid to ask for clarification; I know my rambling mess needs it. And my title isn't as clear as it could be, so please suggest an alternative if you can.
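One way to start, as a rough sketch: if the 24 dataframes live in a named list, you can take the top events (with their dates) from the origin station and `left_join` every station onto those dates, which produces NA automatically wherever a date is missing. The list below uses two toy stations and the top 2 events instead of 24 stations and the top 30; the names `dfs` and `n_events` are illustrative, not from the question.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the real list of 24 dataframes (assumed named b1..b24)
dfs <- list(
  b1 = data.frame(Timestamp = c("13/7/1995", "14/7/1995", "15/7/1995"),
                  Rainfall  = c(66, 64, 5)),
  b2 = data.frame(Timestamp = c("13/7/1995", "15/7/1995"),
                  Rainfall  = c(211, 3))
)

n_events <- 2  # use 30 for the real data

# Top events from the origin station (b1 here), kept with their dates
origin <- dfs$b1 %>%
  arrange(desc(Rainfall)) %>%
  slice_head(n = n_events) %>%
  select(Timestamp, Origin = Rainfall)

# For every station, look up the rainfall on those dates;
# left_join fills NA where a date is missing from that station
all_b1 <- imap_dfr(dfs, function(df, station) {
  origin %>%
    left_join(df, by = "Timestamp") %>%
    transmute(Station = station, Rainfall, Origin)
})
all_b1
```

Wrapping that in a function over the origin station would then let you repeat it for other sets of dataframes.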


Drop observations if there are inconsistent variables within same ID [duplicate]

df <- structure(list(id = c(123L, 123L, 123L, 45L, 45L, 9L, 103L, 103L,
22L, 22L, 22L), age = c(69L, 23L, 70L, 29L, 29L, 37L, 25L, 54L,
40L, 40L, 41L)), class = "data.frame", row.names = c(NA, -11L
))
id age
1 123 69
2 123 23
3 123 70
4 45 29
5 45 29
6 9 37
7 103 25
8 103 54
9 22 40
10 22 40
11 22 41
I would like to drop all observations for an id if it is associated with different values for age. How can I do that?
I would be left with:
id age
45 29
45 29
9 37
A dplyr approach:
library(dplyr)
df |>
group_by(id) |>
filter(n_distinct(age) == 1)
Without external packages, you could use ave():
df |>
subset(ave(age, id, FUN = \(x) length(unique(x))) == 1)
# id age
# 4 45 29
# 5 45 29
# 6 9 37
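For larger data, the same filter can also be sketched with data.table (an equivalent approach, not from the answers above; `uniqueN` counts distinct values per group):

```r
library(data.table)

df <- data.frame(id = c(123L, 123L, 123L, 45L, 45L, 9L, 103L, 103L, 22L, 22L, 22L),
                 age = c(69L, 23L, 70L, 29L, 29L, 37L, 25L, 54L, 40L, 40L, 41L))

dt <- as.data.table(df)
# keep only the groups whose age is constant
result <- dt[, if (uniqueN(age) == 1L) .SD, by = id]
result
```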

How to use stringr or grepl to make a new character variable?

I have data on subject codes and grades for students. Each student has to take 4 subjects, of which English is mandatory; it is represented by code 301.
The columns SUB.1, SUB.2, and so on hold the subject codes for the other modules, and the column following each one holds the marks.
Based on these codes I am trying to do two things:
First thing:
I am trying to create a course column: PCB if the student has subject codes 42, 43, and 44; PCM if the student has codes 41, 42, and 43; Commerce if the codes are 55, 54, and 30.
The issue I am facing is that the codes are spread across the columns, and I am having difficulty standardizing them.
Second thing:
Based on the identified course, I am trying to sum the grades obtained by each student in these subjects. However, I want to add the English grade to it as well.
Example of the data:
structure(list(Roll.No. = c(12653771L, 12653813L, 12653787L,
12653850L, 12653660L, 12653553L, 12653902L, 12653888L, 12653440L,
12653487L, 12653469L, 12653832L, 12653382L, 12653814L, 12653587L,
12653508L, 12653449L, 12653445L, 12653776L, 12653806L), SUB = c(301L,
301L, 301L, 301L, 301L, 301L, 301L, 301L, 301L, 301L, 301L, 301L,
301L, 301L, 301L, 301L, 301L, 301L, 301L, 301L), MRK = c(93L,
82L, 85L, 74L, 85L, 80L, 67L, 77L, 78L, 94L, 89L, 65L, 88L, 89L,
82L, 85L, 91L, 77L, 97L, 76L), SUB.1 = c(30L, 30L, 30L, 30L,
42L, 41L, 30L, 30L, 41L, 41L, 41L, 30L, 41L, 30L, 42L, 41L, 41L,
41L, 30L, 30L), MRK.1 = c(74L, 97L, 75L, 73L, 72L, 81L, 62L,
71L, 63L, 75L, 93L, 74L, 89L, 91L, 67L, 87L, 81L, 94L, 86L, 69L
), SUB.2 = c(48L, 41L, 48L, 54L, 43L, 42L, 48L, 48L, 42L, 42L,
42L, 41L, 42L, 41L, 43L, 42L, 42L, 42L, 48L, 48L), MRK.2 = c(76L,
95L, 79L, 74L, 72L, 75L, 67L, 74L, 60L, 72L, 93L, 56L, 79L, 68L,
68L, 91L, 62L, 75L, 95L, 67L), SUB.3 = c(54L, 54L, 54L, 55L,
44L, 43L, 54L, 54L, 43L, 43L, 43L, 54L, 43L, 54L, 44L, 43L, 43L,
43L, 54L, 54L), MRK.3 = c(80L, 96L, 77L, 44L, 94L, 69L, 63L,
74L, 57L, 67L, 80L, 67L, 72L, 89L, 95L, 87L, 68L, 82L, 94L, 69L
), SUB.4 = c(55L, 55L, 55L, 64L, 265L, 48L, 55L, 55L, 48L, 48L,
283L, 55L, 48L, 55L, 64L, 283L, 283L, 48L, 55L, 55L), MRK.4 = c(45L,
95L, 46L, 76L, 91L, 74L, 44L, 52L, 82L, 92L, 92L, 60L, 81L, 49L,
83L, 89L, 90L, 83L, 93L, 61L), SUB.5 = c(NA, 64L, NA, 41L, 48L,
NA, NA, NA, NA, NA, NA, 64L, NA, 49L, NA, 48L, 49L, NA, NA, NA
), MRK.5 = c(NA, "97", NA, "AB", "87", NA, NA, NA, NA, NA, NA,
"71", NA, "97", NA, "83", "98", NA, NA, NA)), row.names = c(NA,
20L), class = "data.frame")
This should also answer your first question:
library(dplyr)
library(stringr)
library(tidyr)

df <- df %>%
# convert to numeric and induce NAs in MRK.5 (losing "AB")
mutate(MRK.5 = as.numeric(MRK.5)) %>%
# convert NAs to 0
mutate(across(where(anyNA), ~ replace_na(., 0))) %>%
# create a new string of subjects
mutate(subjects = str_c(SUB.1, SUB.2, SUB.3, SUB.4, SUB.5, sep = "_")) %>%
# use case_when to specify course
mutate(course = case_when(str_detect(subjects, "42") & str_detect(subjects, "43") & str_detect(subjects, "44") ~ "PCB",
str_detect(subjects, "41") & str_detect(subjects, "42") & str_detect(subjects, "43") ~ "PCM",
str_detect(subjects, "30") & str_detect(subjects, "54") & str_detect(subjects, "55") ~ "Commerce",
TRUE ~ "Other"))
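One caveat with this approach: str_detect does plain substring matching, so a code like 42 would also match inside a longer code such as 142 or 242. With the underscore-separated subjects string, wrapping each code in word boundaries avoids that (the strings below are made up for illustration):

```r
library(stringr)

# Plain substring matching can misfire on multi-digit codes...
str_detect("142_43_44", "42")        # matches, but code 42 is not present
# ...while word boundaries only match a whole code between separators
str_detect("142_43_44", "\\b42\\b")  # no match
str_detect("42_43_44",  "\\b42\\b")  # matches
```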
I am not sure whether you want the summary scores for just the three subjects plus English in each course, or for all subjects, but with a bit of wrangling you can get the following dataframe, from which you should be able to get what you need.
df %>%
# drop the helper column before pivoting
select(-subjects) %>%
# create consistent naming for the pivot separation
rename(SUB.E = SUB, MRK.E = MRK) %>%
# pivot into tidy form
pivot_longer(cols = -c(Roll.No., course),
names_pattern = "(\\w{3})\\.(.{1})",
names_to = c(".value", "course2"))
Giving:
A tibble: 120 × 5
Roll.No. course course2 SUB MRK
<int> <chr> <chr> <dbl> <dbl>
1 12653771 Commerce E 301 93
2 12653771 Commerce 1 30 74
3 12653771 Commerce 2 48 76
4 12653771 Commerce 3 54 80
5 12653771 Commerce 4 55 45
6 12653771 Commerce 5 0 0
7 12653813 Commerce E 301 82
8 12653813 Commerce 1 30 97
9 12653813 Commerce 2 41 95
10 12653813 Commerce 3 54 96
To calculate the summed scores per student, you then need to filter by condition and use group_by and summarise:
# filter for each condition
df %>% filter(course == "Commerce" & SUB %in% c(301,30,54,55) |
course == "PCB" & SUB %in% c(301,42,43,44) |
course == "PCM" & SUB %in% c(301,41,42,43)) %>%
# summarise by group
group_by(Roll.No.,course) %>%
summarise(score = sum(MRK))
Combining the parts above into one block of code:
df %>% mutate(across(everything(), as.numeric)) %>%
mutate(across(where(anyNA), ~ replace_na(., 0))) %>%
rename(SUB.E = SUB, MRK.E = MRK) %>%
mutate(subjects = str_c(SUB.1, SUB.2, SUB.3, SUB.4, SUB.5, sep = "_")) %>%
mutate(course = case_when(str_detect(subjects, "42") & str_detect(subjects, "43") & str_detect(subjects, "44") ~ "PCB",
str_detect(subjects, "41") & str_detect(subjects, "42") & str_detect(subjects, "43") ~ "PCM",
str_detect(subjects, "30") & str_detect(subjects, "54") & str_detect(subjects, "55") ~ "Commerce",
TRUE ~ "Other")) %>%
pivot_longer(cols = -c(Roll.No., subjects, course),
names_pattern = "(\\w{3})\\.(.{1})",
names_to = c(".value", "course2")) %>%
filter(course == "Commerce" & SUB %in% c(301,30,54,55) |
course == "PCB" & SUB %in% c(301,42,43,44) |
course == "PCM" & SUB %in% c(301,41,42,43)) %>%
group_by(Roll.No., course) %>%
summarise(combined_mark = sum(MRK))
Giving:
Roll.No. course combined_mark
<dbl> <chr> <dbl>
1 12653382 PCM 328
2 12653440 PCM 258
3 12653445 PCM 328
4 12653449 PCM 302
5 12653469 PCM 355
6 12653487 PCM 308
7 12653508 PCM 350
8 12653553 PCM 305
9 12653587 PCB 312
10 12653660 PCB 323
11 12653771 Commerce 247
12 12653776 Commerce 277
13 12653787 Commerce 237
14 12653806 Commerce 214
15 12653813 Commerce 275
16 12653814 Commerce 269
17 12653832 Commerce 206
18 12653850 Commerce 221
19 12653888 Commerce 222
20 12653902 Commerce 192
Answering your first question, assuming that you are interested in students taking ALL of these subjects:
I am trying to create a course column consisting of PCB if the student has subject codes 42, 43, and 44. PCM if the student has subject codes 41, 42, and 43. Commerce if the codes are 55, 54, and 30.
library(tidyverse)
df %>%
mutate(across(-Roll.No., ~ as.numeric(.))) %>%
rowwise() %>%
mutate(subject_summary = case_when(sum(c_across(starts_with("SUB")) %in% c(42, 43, 44)) == 3 ~"PCB",
sum(c_across(starts_with("SUB")) %in% c(41, 42, 43)) == 3 ~ "PCM",
sum(c_across(starts_with("SUB")) %in% c(55, 54, 30)) == 3 ~ "Commerce",
TRUE ~ NA_character_)) %>%
ungroup()
gives:
# A tibble: 20 x 14
Roll.No. SUB MRK SUB.1 MRK.1 SUB.2 MRK.2 SUB.3 MRK.3 SUB.4 MRK.4 SUB.5 MRK.5 subject_summary
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 12653771 301 93 30 74 48 76 54 80 55 45 NA NA Commerce
2 12653813 301 82 30 97 41 95 54 96 55 95 64 97 Commerce
3 12653787 301 85 30 75 48 79 54 77 55 46 NA NA Commerce
4 12653850 301 74 30 73 54 74 55 44 64 76 41 NA Commerce
5 12653660 301 85 42 72 43 72 44 94 265 91 48 87 PCB
6 12653553 301 80 41 81 42 75 43 69 48 74 NA NA PCM
7 12653902 301 67 30 62 48 67 54 63 55 44 NA NA Commerce
8 12653888 301 77 30 71 48 74 54 74 55 52 NA NA Commerce
9 12653440 301 78 41 63 42 60 43 57 48 82 NA NA PCM
10 12653487 301 94 41 75 42 72 43 67 48 92 NA NA PCM
11 12653469 301 89 41 93 42 93 43 80 283 92 NA NA PCM
12 12653832 301 65 30 74 41 56 54 67 55 60 64 71 Commerce
13 12653382 301 88 41 89 42 79 43 72 48 81 NA NA PCM
14 12653814 301 89 30 91 41 68 54 89 55 49 49 97 Commerce
15 12653587 301 82 42 67 43 68 44 95 64 83 NA NA PCB
16 12653508 301 85 41 87 42 91 43 87 283 89 48 83 PCM
17 12653449 301 91 41 81 42 62 43 68 283 90 49 98 PCM
18 12653445 301 77 41 94 42 75 43 82 48 83 NA NA PCM
19 12653776 301 97 30 86 48 95 54 94 55 93 NA NA Commerce
20 12653806 301 76 30 69 48 67 54 69 55 61 NA NA Commerce
Your second question can be answered by continuing the above code (but showing the full code here for convenience):
df %>%
mutate(across(-Roll.No., ~ as.numeric(.))) %>%
rowwise() %>%
mutate(subject_summary = case_when(sum(c_across(starts_with("SUB")) %in% c(42, 43, 44)) == 3 ~"PCB",
sum(c_across(starts_with("SUB")) %in% c(41, 42, 43)) == 3 ~ "PCM",
sum(c_across(starts_with("SUB")) %in% c(55, 54, 30)) == 3 ~ "Commerce",
TRUE ~ NA_character_)) %>%
ungroup() %>%
pivot_longer(cols = -c(Roll.No., subject_summary, SUB, MRK),
names_pattern = "(.*)(\\d{1})",
names_to = c(".value", "course")) %>%
group_by(Roll.No.) %>%
summarize(subject_summary = first(subject_summary),
grade = case_when(all(subject_summary == "PCB") ~ sum(MRK.[SUB. %in% c(42, 43, 44)]) + first(MRK),
all(subject_summary == "PCM") ~ sum(MRK.[SUB. %in% c(41, 42, 43)]) + first(MRK),
all(subject_summary == "Commerce") ~ sum(MRK.[SUB. %in% c(55, 54, 30)]) + first(MRK)))
gives:
# A tibble: 20 x 3
Roll.No. subject_summary grade
<int> <chr> <dbl>
1 12653382 PCM 328
2 12653440 PCM 258
3 12653445 PCM 328
4 12653449 PCM 302
5 12653469 PCM 355
6 12653487 PCM 308
7 12653508 PCM 350
8 12653553 PCM 305
9 12653587 PCB 312
10 12653660 PCB 323
11 12653771 Commerce 292
12 12653776 Commerce 370
13 12653787 Commerce 283
14 12653806 Commerce 275
15 12653813 Commerce 370
16 12653814 Commerce 318
17 12653832 Commerce 266
18 12653850 Commerce 265
19 12653888 Commerce 274
20 12653902 Commerce 236

Create a new table from an existing one based on a criterion in R

I've done a self-paced reading experiment in which 151 participants read 112 sentences divided into three lists, and I'm having some problems cleaning the data in R. I'm not a programmer, so I'm kind of struggling with all this!
I've got the results file which looks something like this:
results
part item word n.word rt
51 106 * 1 382
51 106 El 2 286
51 106 asistente 3 327
51 106 del 4 344
51 106 carnicero 5 394
51 106 que 6 274
51 106 abaplía 7 2327
51 106 el 8 1104
51 106 sabor 9 409
51 106 del 10 360
51 106 pollo 11 1605
51 106 envipió 12 256
51 106 un 13 4573
51 106 libro 14 660
51 106 *. 15 519
part=participant; item=sentence; n.word=word number; rt=reading time.
In the results file, I have the reading times of every word of every sentence read by every participant. Every participant read roughly 40 sentences. My problem is that I am interested in the reading times of specific words, such as the main verb or the last word of each sentence. But as every sentence is a bit different, the main verb is not always in the same position in each sentence. So I've made another table with the positions of the words I'm interested in for every sentence.
rules
item v1 v2 n1 n2
106 12 7 3 5
107 11 8 3 6
108 11 8 3 6
item=sentence; v1=main verb; v2=secondary verb; n1=first noun; n2=second noun.
So this should be read: for sentence 106, the main verb is word number 12, the secondary verb is word number 7, and so on.
I want to have a final table that looks like this:
results2
part item v1 v2 n1 n2
51 106 256 2327 327 394
51 107 ...
52 106 ...
Does anyone know how to do this? It's kind of a long-to-wide problem, but with a more complex scenario.
If anyone could help me, I would really appreciate it! Thanks!!
You can try the following code, which joins your results data to a reshaped rules data, and then reshapes the result into a wider form.
library(tidyr)
library(dplyr)
inner_join(select(results, -word),
pivot_longer(rules, -item), c("item", "n.word"="value")) %>%
select(-n.word) %>%
pivot_wider(names_from=name, values_from=rt) %>%
select(part, item, v1, v2, n1, n2)
# A tibble: 1 x 6
# part item v1 v2 n1 n2
# <int> <int> <int> <int> <int> <int>
#1 51 106 256 2327 327 394
Data:
results <- structure(list(part = c(51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L), item = c(106L, 106L, 106L,
106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L,
106L), word = c("*", "El", "asistente", "del", "carnicero", "que",
"abaplía", "el", "sabor", "del", "pollo", "envipió", "un", "libro",
"*."), n.word = 1:15, rt = c(382L, 286L, 327L, 344L, 394L, 274L,
2327L, 1104L, 409L, 360L, 1605L, 256L, 4573L, 660L, 519L)), class = "data.frame", row.names = c(NA,
-15L))
rules <- structure(list(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L,
8L, 8L), n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-3L))
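For completeness, the same join-then-widen idea can be sketched in base R with merge() and reshape(), which may help if you want to avoid packages. This uses a stripped-down copy of the data above (the word column is omitted, since it is dropped anyway):

```r
# Stripped-down copy of the question's data
results <- data.frame(part = 51L, item = 106L, n.word = 1:15,
                      rt = c(382L, 286L, 327L, 344L, 394L, 274L, 2327L, 1104L,
                             409L, 360L, 1605L, 256L, 4573L, 660L, 519L))
rules <- data.frame(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L, 8L, 8L),
                    n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L))

# Reshape rules to long form: one row per item x word position of interest
rules_long <- data.frame(
  item   = rep(rules$item, times = 4),
  name   = rep(c("v1", "v2", "n1", "n2"), each = nrow(rules)),
  n.word = c(rules$v1, rules$v2, rules$n1, rules$n2)
)

# Keep only the words of interest, then spread their rt values back out
merged <- merge(results, rules_long, by = c("item", "n.word"))
results2 <- reshape(merged[, c("part", "item", "name", "rt")],
                    idvar = c("part", "item"), timevar = "name",
                    direction = "wide")
names(results2) <- sub("^rt\\.", "", names(results2))
results2
```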

Finding common links in a matrix and classification by common intersection

Suppose I have a matrix of distance costs, in which the cost in the destination direction and the cost in the origin direction both need to be below a certain threshold - say, US$100 - for two localities to share a link. My difficulty lies in achieving a common set after classifying these localities: A1 links (costs in both directions below the threshold) with A2, A3 and A4; A2 links with A1 and A4; A4 links with A1 and A2. So A1, A2 and A4 would be classified in the same group, as the group with the highest frequency of links between themselves. Below is an example matrix:
A1 A2 A3 A4 A5 A6 A7
A1 0 90 90 90 100 100 100
A2 80 0 90 90 90 110 100
A3 80 110 0 90 120 110 90
A4 90 90 110 0 90 100 90
A5 110 110 110 110 0 90 80
A6 120 130 135 100 90 0 90
A7 105 110 120 90 90 90 0
I am programming this in Stata, and I haven't put the matrix above into matrix form (as in Mata). The column listing the letter A plus a number is a variable holding the row names of the matrix, and the rest of the columns are named after each locality (e.g. A1 and so on).
I have produced the list of links between localities with the following code, which is admittedly brute-force since I was in a hurry:
clear all
set more off
//inputting matrix
input A1 A2 A3 A4 A5 A6 A7
0 90 90 90 100 100 100
80 0 90 90 90 100 100
80 110 0 90 120 110 90
90 90 110 0 90 100 90
110 110 110 110 0 90 90
120 130 135 100 90 0 90
105 110 120 90 90 90 0
end
//generate row variable
gen locality=""
forv i=1/7{
replace locality="A`i'" in `i'
}
*
order locality, first
//generating who gets below the threshold of 100
forv i=1/7{
gen r_`i'=0
replace r_`i'=1 if A`i'<100 & A`i'!=0
}
*
//checking if both ways (origin and destiny below threshold)
forv i=1/7{
gen check_`i'=.
forv j=1/7{
local v=r_`i'[`j']
local vv=r_`j'[`i']
replace check_`i'=`v'+`vv' in `j'
}
*
}
*
//creating list of links
gen locality_x=""
forv i=1/7{
preserve
local name = locality[`i']
keep if check_`i'==2
replace locality_x="`name'"
keep locality locality_x
save "C:\Users\user\Desktop\temp_`i'", replace
restore
}
*
use "C:\Users\user\Desktop\temp_1", clear
forv i=2/7{
append using "C:\Users\user\Desktop\temp_`i'"
}
*
//now locality_x lists whether A1 has links with A2, A3, etc., and so on.
//the difficulty lies in finding a common intersection between the groups.
Which returns the following listing:
locality_x locality
A1 A2
A1 A3
A1 A4
A2 A1
A2 A4
A3 A1
A4 A1
A4 A2
A4 A7
A5 A6
A5 A7
A6 A5
A6 A7
A7 A4
A7 A5
A7 A6
I am trying to get familiar with set intersection, but I haven't a clue how to do this in Stata. I want something where I could reprogram the threshold and find the common set. I would be thankful if you could produce a solution in R, given that I can program a bit in it.
A similar way of obtaining the list in R (as #user2957945 put in his answer below):
m <- structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
# get values less than threshold
id = m < 100
# make sure both values are less than threshold, and don't include the diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
I'm also adding the "graph theory" tag, since I believe this is not exactly an intersection problem in which I could turn the lists into vectors and use the intersect function in R. The code needs to produce a new id in which some localities must share the same new id (group). As in the example above, if the set for A1 has A2 and A4, A2 has A1 and A4, and A4 has A1 and A2, these three localities must be in the same id (group). In other words, I need the biggest intersection grouping for each locality. I understand that there might be problems with a different matrix, such as A1 having A2 and A6, A2 having A1 and A6, and A6 having A1 and A2 (but A6 not having A4, still considering the first example above). In that situation, I welcome a solution that adds A6 to the grouping, or some other arbitrary one in which the code just groups the first set together, removes A1, A2 and A4 from the listing, and leaves A6 with no new grouping.
In R you can do
# get values less than threshold
id = m < 100
# make sure both values are less than threshold, and don't include the diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
Data:
m <- structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
Assuming what you want are the largest complete subgraphs, you can use the igraph package:
# Load necessary libraries
library(igraph)
# Define global parameters
threshold <- 100
# Compute the adjacency matrix
# (distances in both directions need to be smaller than the threshold)
am <- m < threshold & t(m) < threshold
# Make an undirected graph given the adjacency matrix
# (we set diag to FALSE so as not to draw links from a vertex to itself)
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)
# Find all the largest complete subgraphs
lc <- largest_cliques(gr)
# Output the list of complete subgraphs as a list of vertex names
lapply(lc, function(e) e$name)
As far as I know, there is no similar functionality in Stata. However, if you were looking for the largest connected subgraph (which is the whole graph in your case), then you could have used clustering commands in Stata (i.e. clustermat).
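To see why connectivity-based clustering would not separate the groups here, igraph's components() confirms that the whole example graph is one connected component, while largest_cliques() finds the two complete subgraphs. A self-contained sketch with the same matrix and threshold:

```r
library(igraph)

# The question's cost matrix
m <- structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
  90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
  90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
  90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
  90L, 80L, 90L, 0L), .Dim = c(7L, 7L),
  .Dimnames = list(paste0("A", 1:7), paste0("A", 1:7)))

threshold <- 100
# Link only where both directions are below the threshold
am <- m < threshold & t(m) < threshold
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)

# Connected components: every locality ends up in a single group here
comp <- components(gr)
comp$no

# Largest complete subgraphs, as vertex-name vectors
cliques <- lapply(largest_cliques(gr), function(e) sort(e$name))
cliques
```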

Data format conversion to be combined with string split in R

I have the following data frame oridf:
test_name gp1_0month gp2_0month gp1_1month gp2_1month gp1_3month gp2_3month
Test_1 136 137 152 143 156 150
Test_2 130 129 81 78 86 80
Test_3 129 128 68 68 74 71
Test_4 40 40 45 43 47 46
Test_5 203 201 141 134 149 142
Test_6 170 166 134 116 139 125
oridf <- structure(list(test_name = structure(1:6, .Label = c("Test_1",
"Test_2", "Test_3", "Test_4", "Test_5", "Test_6"), class = "factor"),
gp1_0month = c(136L, 130L, 129L, 40L, 203L, 170L), gp2_0month = c(137L,
129L, 128L, 40L, 201L, 166L), gp1_1month = c(152L, 81L, 68L,
45L, 141L, 134L), gp2_1month = c(143L, 78L, 68L, 43L, 134L,
116L), gp1_3month = c(156L, 86L, 74L, 47L, 149L, 139L), gp2_3month = c(150L,
80L, 71L, 46L, 142L, 125L)), .Names = c("test_name", "gp1_0month",
"gp2_0month", "gp1_1month", "gp2_1month", "gp1_3month", "gp2_3month"
), class = "data.frame", row.names = c(NA, -6L))
I need to convert it to following format:
test_name month group value
Test_1 0 gp1 136
Test_1 0 gp2 137
Test_1 1 gp1 152
Test_1 1 gp2 143
.....
Hence, the conversion involves splitting gp1 and 0month, etc., out of columns 2:7 of the original data frame oridf so that I can plot it with the following command:
qplot(data=newdf, x=month, y=value, geom=c("point","line"), color=test_name, linetype=group)
How can I convert these data? I tried the melt command, but I cannot combine it with the strsplit command.
First I would use melt, as you had done.
library(reshape2)
mm <- melt(oridf)
then there is also a colsplit function in the reshape2 library. Here we use it on the variable column to split at the underscore and at the "m" in month (ignoring the rest):
info <- colsplit(mm$variable, "(_|m)", c("group","month", "xx"))[,-3]
Then we can recombine the data
newdf <- cbind(mm[,1, drop=F], info, mm[,3, drop=F])
# head(newdf)
# test_name group month value
# 1 Test_1 gp1 0 136
# 2 Test_2 gp1 0 130
# 3 Test_3 gp1 0 129
# 4 Test_4 gp1 0 40
# 5 Test_5 gp1 0 203
# 6 Test_6 gp1 0 170
And we can plot it using the qplot command you supplied above.
Use gather from the tidyr package to convert from wide to long, then use separate from the same package to split the group_month column into group and month columns. Finally, using mutate from dplyr and extract_numeric from tidyr, extract the numeric part of month.
library(dplyr)
# devtools::install_github("hadley/tidyr")
library(tidyr)
newdf <- oridf %>%
gather(group_month, value, -test_name) %>%
separate(group_month, into = c("group", "month")) %>%
mutate(month = extract_numeric(month))
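Since this answer was written, gather and extract_numeric have been superseded; with current tidyr (>= 1.0), pivot_longer does the reshape and the name-splitting in a single call via names_pattern. A sketch on the same data:

```r
library(dplyr)
library(tidyr)

oridf <- structure(list(test_name = structure(1:6, .Label = c("Test_1",
  "Test_2", "Test_3", "Test_4", "Test_5", "Test_6"), class = "factor"),
  gp1_0month = c(136L, 130L, 129L, 40L, 203L, 170L),
  gp2_0month = c(137L, 129L, 128L, 40L, 201L, 166L),
  gp1_1month = c(152L, 81L, 68L, 45L, 141L, 134L),
  gp2_1month = c(143L, 78L, 68L, 43L, 134L, 116L),
  gp1_3month = c(156L, 86L, 74L, 47L, 149L, 139L),
  gp2_3month = c(150L, 80L, 71L, 46L, 142L, 125L)),
  class = "data.frame", row.names = c(NA, -6L))

# Capture group and month digits from names like "gp1_0month" in one pass
newdf <- oridf %>%
  pivot_longer(-test_name,
               names_pattern = "(gp\\d)_(\\d+)month",
               names_to = c("group", "month"),
               values_to = "value") %>%
  mutate(month = as.numeric(month))
head(newdf)
```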
