Why did my dataset change after running setDT()? - r

I have 2 CSV files and want to find the rows they have in common. After reading them as data frames, I converted them to data.tables and merged them. But my code is not working: after using setDT() my dataset has changed, and I am not getting any common rows!
My dataset before running the code:
nodeA nodeB scr
1 ID08918 ID04896 1
2 ID00402 ID01198 1
3 ID00182 ID01576 1
4 ID06413 ID00745 1
5 ID00215 ID01175 1
6 ID00448 ID05351 1
7 ID00860 ID00959 0.996197718631179
8 ID01110 ID01127 0.99604743083004
9 ID00497 ID01192 0.995436766623207
10 ID00877 ID01590 0.993939393939394
11 ID01192 ID01183 0.992202729044834
12 ID00361 ID00570 0.988354430379747
13 ID01045 ID01201 0.98766954377312
14 ID11641 ID00541 0.986875315497224
15 ID11641 ID00570 0.98685540950455
16 ID00458 ID01151 0.986813186813187
17 ID00199 ID01211 0.981416957026713
18 ID00570 ID00309 0.981151299032094
19 ID00541 ID00309 0.978161503301168
20 ID00603 ID06789 0.977272727272727
library(dplyr)
df_1 <- read.csv("~/df_1.csv", stringsAsFactors = FALSE)
df_2 <- read.csv("~/df_2.csv", stringsAsFactors = FALSE)
library(data.table)
setDT(df_1)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
setDT(df_2)[,c("nodeA", "nodeB") := list(pmin(nodeA,nodeB), pmax(nodeA,nodeB))]
result <- merge(df_1[!duplicated(df_1),], df_2, allow.cartesian=TRUE)
After running the code, my dataset has changed:
nodeA nodeB scr
1: ID08918 ID08918 1
2: ID00402 ID00402 1
3: ID00182 ID00182 1
4: ID06413 ID06413 1
5: ID00215 ID00215 1
6: ID00448 ID00448 1
7: ID00860 ID00860 0.996197718631179
8: ID01110 ID01110 0.99604743083004
9: ID00497 ID00497 0.995436766623207
10: ID00877 ID00877 0.993939393939394
11: ID01192 ID01192 0.992202729044834
12: ID00361 ID00361 0.988354430379747
13: ID01045 ID01045 0.98766954377312
14: ID11641 ID11641 0.986875315497224
15: ID11641 ID11641 0.98685540950455
16: ID00458 ID00458 0.986813186813187
17: ID00199 ID00199 0.981416957026713
18: ID00570 ID00570 0.981151299032094
19: ID00541 ID00541 0.978161503301168
20: ID00603 ID00603 0.977272727272727
Reproducible Dataset
df_1
structure(list(query = structure(c(18L, 5L, 1L, 17L, 3L, 6L,
12L, 15L, 8L, 13L, 16L, 4L, 14L, 19L, 19L, 7L, 2L, 10L, 9L, 11L
), .Label = c("ID00182", "ID00199", "ID00215", "ID00361", "ID00402",
"ID00448", "ID00458", "ID00497", "ID00541", "ID00570", "ID00603",
"ID00860", "ID00877", "ID01045", "ID01110", "ID01192", "ID06413",
"ID08918", "ID11641"), class = "factor"), target = structure(c(16L,
11L, 14L, 4L, 8L, 17L, 5L, 6L, 10L, 15L, 9L, 3L, 12L, 2L, 3L,
7L, 13L, 1L, 1L, 18L), .Label = c("ID00309", "ID00541", "ID00570",
"ID00745", "ID00959", "ID01127", "ID01151", "ID01175", "ID01183",
"ID01192", "ID01198", "ID01201", "ID01211", "ID01576", "ID01590",
"ID04896", "ID05351", "ID06789"), class = "factor"), new_ssp = structure(c(15L,
15L, 15L, 15L, 15L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 7L,
6L, 5L, 4L, 3L, 2L, 1L), .Label = c("0.977272727272727", "0.978161503301168",
"0.981151299032094", "0.981416957026713", "0.986813186813187",
"0.98685540950455", "0.986875315497224", "0.98766954377312",
"0.988354430379747", "0.992202729044834", "0.993939393939394",
"0.995436766623207", "0.99604743083004", "0.996197718631179",
"1"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
df_2
structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L,
9L, 3L, 4L, 2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L), .Label = c("ID00309",
"ID00361", "ID00541", "ID00570", "ID00615", "ID00696", "ID00762",
"ID01200", "ID05109"), class = "factor"), nodeB = structure(c(8L,
3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L, 9L, 7L, 4L, 4L, 6L, 9L, 6L,
7L, 5L, 5L), .Label = c("ID00361", "ID00541", "ID00570", "ID00615",
"ID00696", "ID01200", "ID05109", "ID11641", "ID11691"), class = "factor"),
scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472",
"1.90762235378507", "1.94364188077133", "1.95883206119256",
"2.08440437841349", "2.26408172709962", "2.3223132020942",
"2.46120775935034", "2.49647215035727", "2.50432367561777",
"2.57541320006514", "2.65099330092281", "2.75209155741549",
"2.93717640337986", "2.99596628688011", "3.21209741517806",
"3.21997803385465", "3.48788394772132", "3.81389707587156"
), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Note: I am also using dplyr for some things, such as %>%. Could dplyr and data.table be conflicting somehow?

One possible solution uses inner_join and union from dplyr, joining once in each direction:
# inner join in both directions, then combine the results
df_2 %>%
  dplyr::inner_join(df_1, by = c("nodeA" = "query", "nodeB" = "target")) %>%
  dplyr::mutate(GROUP = 1) %>%
  dplyr::union(df_2 %>%
    dplyr::inner_join(df_1, by = c("nodeB" = "query", "nodeA" = "target")) %>%
    dplyr::mutate(GROUP = 2))
nodeA nodeB scr new_ssp GROUP
1 ID00361 ID00570 3.48788394772132 0.988354430379747 1
2 ID00570 ID11641 3.81389707587156 0.98685540950455 2
3 ID00309 ID00570 3.21997803385465 0.981151299032094 2
4 ID00309 ID00541 2.99596628688011 0.978161503301168 2
5 ID00541 ID11641 2.57541320006514 0.986875315497224 2
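If the goal is to match pairs regardless of direction, another possible approach is to put each pair into a canonical order before merging. The following is only a base R sketch on made-up data: the toy tables edges_1 and edges_2 and the helper canon() are hypothetical, and the IDs are assumed to be character vectors rather than factors.

```r
# Toy edge lists (hypothetical data); IDs are character, not factor
edges_1 <- data.frame(query = c("ID2", "ID1", "ID5"),
                      target = c("ID1", "ID3", "ID4"),
                      new_ssp = c(0.9, 0.8, 0.7),
                      stringsAsFactors = FALSE)
edges_2 <- data.frame(nodeA = c("ID1", "ID4", "ID6"),
                      nodeB = c("ID2", "ID5", "ID7"),
                      scr = c(3.1, 2.5, 1.9),
                      stringsAsFactors = FALSE)

# Hypothetical helper: order each pair so that (a, b) and (b, a) compare equal
canon <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  data.frame(lo = pmin(a, b), hi = pmax(a, b), stringsAsFactors = FALSE)
}

e1 <- cbind(canon(edges_1$query, edges_1$target), new_ssp = edges_1$new_ssp)
e2 <- cbind(canon(edges_2$nodeA, edges_2$nodeB), scr = edges_2$scr)

# Rows whose unordered pair occurs in both tables
common <- merge(e1, e2, by = c("lo", "hi"))
print(common)
```

Because the pair is normalized first, a single merge covers both directions and no union of two joins is needed.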

Related

How to get the rest of the rows after taking some rows randomly from a dataframe in R

I have 2 data frames, df_1 and df_2. I need to select some rows at random from df_1 and then merge the remaining rows of df_1 (those not selected) with df_2.
I am using this code:
set.seed(9999)
df_1 <- # the whole dataset
test_dataset1 <- sample_n(df_1, 10)
train_part_1 <- df_1[which(!df_1 %in% test_dataset1)] # Not working
train_1 <- rbind(df_2, train_part_1)
But when I try to extract the rows that were not selected, my code does not work: I get the same data as df_1, i.e. all 20 rows.
Edited: Actually, I have to make 3 test and 3 train datasets. How can I use set.seed() so that I get the same datasets every time, for reproducibility?
Reproducible data (only df_1):
structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L,
9L, 3L, 4L, 2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L), .Label = c("ID00309",
"ID00361", "ID00541", "ID00570", "ID00615", "ID00696", "ID00762",
"ID01200", "ID05109"), class = "factor"), nodeB = structure(c(8L,
3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L, 9L, 7L, 4L, 4L, 6L, 9L, 6L,
7L, 5L, 5L), .Label = c("ID00361", "ID00541", "ID00570", "ID00615",
"ID00696", "ID01200", "ID05109", "ID11641", "ID11691"), class = "factor"),
scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472",
"1.90762235378507", "1.94364188077133", "1.95883206119256",
"2.08440437841349", "2.26408172709962", "2.3223132020942",
"2.46120775935034", "2.49647215035727", "2.50432367561777",
"2.57541320006514", "2.65099330092281", "2.75209155741549",
"2.93717640337986", "2.99596628688011", "3.21209741517806",
"3.21997803385465", "3.48788394772132", "3.81389707587156"
), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Get your sample using random row numbers and then use negative indexing (-) to get the rest:
df_1 <- structure(list(nodeA = structure(c(4L, 2L, 1L, 1L, 1L, 4L, 1L, 9L, 3L, 4L,
2L, 8L, 2L, 1L, 5L, 7L, 3L, 6L, 2L, 1L),
.Label = c("ID00309", "ID00361", "ID00541",
"ID00570", "ID00615", "ID00696",
"ID00762", "ID01200", "ID05109"),
class = "factor"),
nodeB = structure(c(8L, 3L, 3L, 1L, 2L, 7L, 9L, 8L, 8L, 6L,
9L, 7L, 4L, 4L, 6L, 9L, 6L, 7L, 5L, 5L),
.Label = c("ID00361", "ID00541", "ID00570",
"ID00615", "ID00696", "ID01200",
"ID05109", "ID11641", "ID11691"),
class = "factor"),
scr = structure(20:1, .Label = c("1.85284606048794", "1.90444166064472",
"1.90762235378507", "1.94364188077133",
"1.95883206119256", "2.08440437841349",
"2.26408172709962", "2.3223132020942",
"2.46120775935034", "2.49647215035727",
"2.50432367561777", "2.57541320006514",
"2.65099330092281", "2.75209155741549",
"2.93717640337986", "2.99596628688011",
"3.21209741517806", "3.21997803385465",
"3.48788394772132", "3.81389707587156"
), class = "factor")),
class = "data.frame", row.names = c(NA, -20L))
set.seed(9999)
Selected <- sample.int(nrow(df_1), 10)
# index the selected rows; use the [row, col] pattern to select rows
test_dataset1 <- df_1[ Selected, ]
# use -index to remove rows
train_part_1 <- df_1[-Selected, ]
test_dataset1
#> nodeA nodeB scr
#> 6 ID00570 ID05109 2.93717640337986
#> 9 ID00541 ID11641 2.57541320006514
#> 19 ID00361 ID00696 1.90444166064472
#> 3 ID00309 ID00570 3.21997803385465
#> 10 ID00570 ID01200 2.50432367561777
#> 2 ID00361 ID00570 3.48788394772132
#> 20 ID00309 ID00696 1.85284606048794
#> 8 ID05109 ID11641 2.65099330092281
#> 12 ID01200 ID05109 2.46120775935034
#> 18 ID00696 ID05109 1.90762235378507
train_part_1
#> nodeA nodeB scr
#> 1 ID00570 ID11641 3.81389707587156
#> 4 ID00309 ID00361 3.21209741517806
#> 5 ID00309 ID00541 2.99596628688011
#> 7 ID00309 ID11691 2.75209155741549
#> 11 ID00361 ID11691 2.49647215035727
#> 13 ID00361 ID00615 2.3223132020942
#> 14 ID00309 ID00615 2.26408172709962
#> 15 ID00615 ID01200 2.08440437841349
#> 16 ID00762 ID11691 1.95883206119256
#> 17 ID00541 ID01200 1.94364188077133
Created on 2021-03-14 by the reprex package (v1.0.0)
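For the edited part of the question (three reproducible test/train splits), one possible approach is to set the seed once and then draw the three samples in sequence. This is a minimal sketch on made-up data; the toy df_1 and the name splits are assumptions, not part of the original answer.

```r
# Toy stand-in for df_1 (hypothetical data, 20 rows)
df_1 <- data.frame(id = 1:20, scr = seq(2, 0.1, length.out = 20))

set.seed(9999)  # setting the seed once makes all three draws reproducible
splits <- lapply(1:3, function(i) {
  selected <- sample.int(nrow(df_1), 10)
  list(test = df_1[selected, ], train = df_1[-selected, ])
})

# Each element holds one test/train pair
str(splits[[1]], max.level = 1)
```

Re-running the whole block from set.seed() onwards should give the same three splits again; each train part can then be combined with df_2 using rbind() as in the question.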

Read data set into well-formatted table with pre-specified number of columns

I have a .txt file like this:
0003 MPARTNER SALZ S 150112 22:30:45 160304 08:38:13 2 BUY 2 BUY 12380 165426 150109 08:00:00
0003 SPROTTSE HUGHES S 140407 02:30:50 141120 13:55:06 2 BUY 2 BUY 3764 57379 140401 10:05:00
0003 SPROTTSE HUGHES S 141223 09:06:13 160715 08:42:56 3 MARKETPERFORM 3 HOLD 3764 57379 141223 08:02:00
001V MPARTNER PEARLSTEIN D 140821 02:44:05 150312 09:17:13 2 BUY 2 BUY 12380 163717 140820 08:16:00
001V MPARTNER PEARLSTEIN D 151016 15:07:40 160411 08:40:35 2 BUY 2 BUY 12380 163717 151009 08:12:00
001W CANACCOR K 140321 04:06:40 140609 23:06:44 SPECULATIVE BUY 1 STRONG BUY 406 150412 140319 23:19:00
001W CANACCOR WRIGHT K 140714 12:47:31 160228 22:57:45 BUY 1 STRONG BUY 406 150412 140714 12:38:00
001W CLARUS OFIR E 140515 11:40:00 150515 09:27:09 SPECULATIVE BUY 1 STRONG BUY 202 115944 140515 11:40:00
001W CLARUS MACKAY D 150813 09:40:45 160812 09:40:02 BUY 1 STRONG BUY 202 73763 150813 09:23:00
001W DEACON OFIR E 150119 22:03:46 170328 06:45:14 1 BUY 1 STRONG BUY 704 115944 150112 07:24:00
001W DEACON OFIR E 171115 06:48:47 171115 06:48:47 1 BUY 1 STRONG BUY 704 115944 171115 06:42:00
#70L MORGAN MARTINEZ J 100226 07:12:51 100708 04:51:16 8 EQUALWT/NO RATING 3 HOLD 1595 56947 100226 07:12:00
#70L MORGAN MARTINEZ DE O J 100708 05:09:02 100910 00:48:28 6 EQUALWT/IN-LINE 3 HOLD 1595 56947 100708 03:14:00
#70L MORGAN MARTINEZ DE O J 100910 21:16:07 101110 21:55:52 2 OVERWT/IN-LINE 2 BUY 1595 56947 100910 19:18:00
#70L MORGAN OLCOZ CERDAN J 101112 01:32:41 120618 21:04:56 2 OVERWT/IN-LINE 2 BUY 1595 56947 101111 20:03:00
#70L MORGAN OLCOZ CERDAN J 120712 03:19:26 131216 19:49:59 6 EQUALWT/IN-LINE 3 HOLD 1595 56947 120711 19:20:00
#70L MORGAN OLCOZ CERDAN J 140226 22:20:19 150417 13:07:31 2 OVERWT/IN-LINE 2 BUY 1595 56947 140226 22:20:00
#70L MORGAN J 150608 01:25:35 171106 00:16:05 1 OVERWT/ATTRACTIVE 2 BUY 1595 56947 150608 01:25:00
I would like to produce a table in R with the same structure as the .txt file, with its apparent 16 columns.
I tried this code:
max(count.fields("BSP.txt", sep="")) # 18 columns
df= read.delim("BSP.txt", sep = "" ,header = FALSE,col.names = c("V1", "VS","V3", "V4", "V5","V6",
"V7", "V8", "V9", "V10",
"V11", "V12", "V13", "V14",
"V15","V16","V17","V18"))
But I received a weirdly structured table:
structure(list(V1 = structure(c(2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("#70L", "0003",
"001V", "001W"), class = "factor"), VS = structure(c(5L, 6L,
6L, 5L, 5L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L
), .Label = c("CANACCOR", "CLARUS", "DEACON", "MORGAN", "MPARTNER",
"SPROTTSE"), class = "factor"), V3 = structure(c(9L, 1L, 1L,
8L, 8L, 3L, 10L, 6L, 4L, 6L, 6L, 5L, 5L, 5L, 7L, 7L, 7L, 2L), .Label = c("HUGHES",
"J", "K", "MACKAY", "MARTINEZ", "OFIR", "OLCOZ", "PEARLSTEIN",
"SALZ", "WRIGHT"), class = "factor"), V4 = structure(c(9L, 9L,
9L, 4L, 4L, 1L, 8L, 6L, 4L, 6L, 6L, 7L, 5L, 5L, 3L, 3L, 3L, 2L
), .Label = c("140321", "150608", "CERDAN", "D", "DE", "E", "J",
"K", "S"), class = "factor"), V5 = structure(c(9L, 4L, 8L, 7L,
12L, 2L, 6L, 5L, 11L, 10L, 13L, 3L, 15L, 15L, 14L, 14L, 14L,
1L), .Label = c("01:25:35", "04:06:40", "100226", "140407", "140515",
"140714", "140821", "141223", "150112", "150119", "150813", "151016",
"171115", "J", "O"), class = "factor"), V6 = structure(c(16L,
1L, 5L, 2L, 13L, 12L, 9L, 8L, 6L, 15L, 3L, 4L, 17L, 17L, 7L,
10L, 11L, 14L), .Label = c("02:30:50", "02:44:05", "06:48:47",
"07:12:51", "09:06:13", "09:40:45", "101112", "11:40:00", "12:47:31",
"120712", "140226", "140609", "15:07:40", "171106", "22:03:46",
"22:30:45", "J"), class = "factor"), V7 = structure(c(10L, 6L,
12L, 7L, 11L, 17L, 9L, 8L, 13L, 14L, 15L, 4L, 4L, 5L, 2L, 3L,
16L, 1L), .Label = c("00:16:05", "01:32:41", "03:19:26", "100708",
"100910", "141120", "150312", "150515", "160228", "160304", "160411",
"160715", "160812", "170328", "171115", "22:20:19", "23:06:44"
), class = "factor"), V8 = structure(c(5L, 13L, 7L, 8L, 6L, 18L,
17L, 9L, 10L, 3L, 4L, 1L, 2L, 16L, 12L, 14L, 15L, 11L), .Label = c("04:51:16",
"05:09:02", "06:45:14", "06:48:47", "08:38:13", "08:40:35", "08:42:56",
"09:17:13", "09:27:09", "09:40:02", "1", "120618", "13:55:06",
"131216", "150417", "21:16:07", "22:57:45", "SPECULATIVE"), class = "factor"),
V9 = structure(c(6L, 6L, 8L, 6L, 6L, 10L, 10L, 12L, 10L,
1L, 1L, 9L, 2L, 3L, 7L, 5L, 4L, 11L), .Label = c("1", "100910",
"101110", "13:07:31", "19:49:59", "2", "21:04:56", "3", "8",
"BUY", "OVERWT/ATTRACTIVE", "SPECULATIVE"), class = "factor"),
V10 = structure(c(6L, 6L, 8L, 6L, 6L, 2L, 2L, 6L, 2L, 6L,
6L, 7L, 1L, 4L, 3L, 5L, 3L, 3L), .Label = c("00:48:28", "1",
"2", "21:55:52", "6", "BUY", "EQUALWT/NO", "MARKETPERFORM"
), class = "factor"), V11 = structure(c(2L, 2L, 3L, 2L, 2L,
9L, 9L, 1L, 9L, 1L, 1L, 8L, 4L, 2L, 7L, 6L, 7L, 5L), .Label = c("1",
"2", "3", "6", "BUY", "EQUALWT/IN-LINE", "OVERWT/IN-LINE",
"RATING", "STRONG"), class = "factor"), V12 = structure(c(4L,
4L, 6L, 4L, 4L, 4L, 4L, 8L, 4L, 8L, 8L, 3L, 5L, 7L, 2L, 3L,
2L, 1L), .Label = c("1595", "2", "3", "BUY", "EQUALWT/IN-LINE",
"HOLD", "OVERWT/IN-LINE", "STRONG"), class = "factor"), V13 = structure(c(1L,
5L, 5L, 1L, 1L, 6L, 6L, 8L, 3L, 8L, 8L, 9L, 4L, 2L, 8L, 9L,
8L, 7L), .Label = c("12380", "2", "202", "3", "3764", "406",
"56947", "BUY", "HOLD"), class = "factor"), V14 = structure(c(5L,
7L, 7L, 4L, 4L, 1L, 1L, 6L, 9L, 8L, 8L, 3L, 11L, 10L, 3L,
3L, 3L, 2L), .Label = c("150412", "150608", "1595", "163717",
"165426", "202", "57379", "704", "73763", "BUY", "HOLD"), class = "factor"),
V15 = structure(c(8L, 4L, 7L, 6L, 10L, 3L, 5L, 2L, 9L, 2L,
2L, 12L, 11L, 11L, 12L, 12L, 12L, 1L), .Label = c("01:25:00",
"115944", "140319", "140401", "140714", "140820", "141223",
"150109", "150813", "151009", "1595", "56947"), class = "factor"),
V16 = structure(c(2L, 7L, 3L, 5L, 4L, 16L, 10L, 13L, 6L,
14L, 15L, 8L, 17L, 17L, 9L, 11L, 12L, 1L), .Label = c("",
"08:00:00", "08:02:00", "08:12:00", "08:16:00", "09:23:00",
"10:05:00", "100226", "101111", "12:38:00", "120711", "140226",
"140515", "150112", "171115", "23:19:00", "56947"), class = "factor"),
V17 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 1L, 4L,
2L, 3L, 5L, 6L, 9L, 8L, 10L, 1L), .Label = c("", "06:42:00",
"07:12:00", "07:24:00", "100708", "100910", "11:40:00", "19:20:00",
"20:03:00", "22:20:00"), class = "factor"), V18 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 1L,
1L, 1L), .Label = c("", "03:14:00", "19:18:00"), class = "factor")), .Names = c("V1",
"VS", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11",
"V12", "V13", "V14", "V15", "V16", "V17", "V18"), class = "data.frame", row.names = c(NA,
-18L))
As stated above, I would like to get a table with 16 columns that has the same structure as the .txt file. Even the empty fields (e.g. in row 6) should remain.
E.g. for row 6:
Can you help me with this?
Many thanks.
One option is to use read.fwf:
df <- read.fwf("tst.txt", widths = c(8, 10, 14, 28, 7, 10, 7, 10, 7, 29, 3,
21, 9, 8, 7, 8), header = FALSE)
# The next step is to remove leading/trailing whitespace from the character fields.
library(dplyr)
df <- df %>% mutate_if(is.factor, function(x)trimws(as.character(x)))
The data frame looks like this:
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
# 1 0003 MPARTNER SALZ S 150112 22:30:45 160304 08:38:13 2 BUY 2 BUY 12380 165426 150109 08:00:00
# 2 0003 SPROTTSE HUGHES S 140407 02:30:50 141120 13:55:06 2 BUY 2 BUY 3764 57379 140401 10:05:00
# 3 0003 SPROTTSE HUGHES S 141223 09:06:13 160715 08:42:56 3 MARKETPERFORM 3 HOLD 3764 57379 141223 08:02:00
# 4 001V MPARTNER PEARLSTEIN D 140821 02:44:05 150312 09:17:13 2 BUY 2 BUY 12380 163717 140820 08:16:00
# 5 001V MPARTNER PEARLSTEIN D 151016 15:07:40 160411 08:40:35 2 BUY 2 BUY 12380 163717 151009 08:12:00
# 6 001W CANACCOR K 140321 04:06:40 140609 23:06:44 NA SPECULATIVE BUY 1 STRONG BUY 406 150412 140319 23:19:00
# 7 001W CANACCOR WRIGHT K 140714 12:47:31 160228 22:57:45 NA BUY 1 STRONG BUY 406 150412 140714 12:38:00
# 8 001W CLARUS OFIR E 140515 11:40:00 150515 09:27:09 NA SPECULATIVE BUY 1 STRONG BUY 202 115944 140515 11:40:00
# 9 001W CLARUS MACKAY D 150813 09:40:45 160812 09:40:02 NA BUY 1 STRONG BUY 202 73763 150813 09:23:00
# 10 001W DEACON OFIR E 150119 22:03:46 170328 06:45:14 1 BUY 1 STRONG BUY 704 115944 150112 07:24:00
# 11 001W DEACON OFIR E 171115 06:48:47 171115 06:48:47 1 BUY 1 STRONG BUY 704 115944 171115 06:42:00
# 12 #70L MORGAN MARTINEZ J 100226 07:12:51 100708 04:51:16 8 EQUALWT/NO RATING 3 HOLD 1595 56947 100226 07:12:00
# 13 #70L MORGAN MARTINEZ DE O J 100708 05:09:02 100910 00:48:28 6 EQUALWT/IN-LINE 3 HOLD 1595 56947 100708 03:14:00
# 14 #70L MORGAN MARTINEZ DE O J 100910 21:16:07 101110 21:55:52 2 OVERWT/IN-LINE 2 BUY 1595 56947 100910 19:18:00
# 15 #70L MORGAN OLCOZ CERDAN J 101112 01:32:41 120618 21:04:56 2 OVERWT/IN-LINE 2 BUY 1595 56947 101111 20:03:00
# 16 #70L MORGAN OLCOZ CERDAN J 120712 03:19:26 131216 19:49:59 6 EQUALWT/IN-LINE 3 HOLD 1595 56947 120711 19:20:00
# 17 #70L MORGAN OLCOZ CERDAN J 140226 22:20:19 150417 13:07:31 2 OVERWT/IN-LINE 2 BUY 1595 56947 140226 22:20:00
# 18 #70L MORGAN J 150608 01:25:35 171106 00:16:05 1 OVERWT/ATTRACTIVE 2 BUY 1595 56947 150608 01:25:00
The resulting data.frame has 16 columns and 18 rows.
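The fixed-width idea can be tried end to end on a tiny file. The sketch below is an illustration under assumptions: the three-column layout with widths 5, 10 and 4 is made up (it is not the layout of the real file), and the file is written to a temporary path.

```r
# Build two fixed-width lines (hypothetical layout: widths 5, 10, 4)
rows <- list(c("0003", "SALZ S", "2"),
             c("001W", "K", ""))          # third field left empty on purpose
lines <- vapply(rows,
                function(r) sprintf("%-5s%-10s%-4s", r[1], r[2], r[3]),
                character(1))
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Read by column width; keep everything as character so "0003" keeps its zeros
df <- read.fwf(path, widths = c(5, 10, 4), header = FALSE,
               colClasses = "character")

# Trim the padding and turn empty fields into NA
df[] <- lapply(df, function(x) {
  x <- trimws(x)
  x[!is.na(x) & x == ""] <- NA
  x
})
print(df)
```

Because the split is by position rather than by whitespace, a name with an embedded space stays in one column and an empty field stays in its own column.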

Sampling distribution and sum of tables

I ran a few experiments, and each experiment produced a color.
As I can't run more experiments, I want to draw samples of size 30 and see what frequency table (of colors) I would obtain over 1000 samples. The resulting frequency table should be the sum of the 1000 individual frequency tables.
I thought about concatenating the tables as follows and then aggregating, but it did not work:
mydata=structure(list(Date = structure(c(11L, 1L, 9L, 9L, 10L, 1L, 2L,
3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 7L, 4L, 4L, 4L, 6L, 6L, 11L,
5L, 4L, 7L, 10L, 6L, 6L, 2L, 5L, 7L, 11L, 1L, 9L, 11L, 11L, 11L,
1L, 1L), .Label = c("01/02/2016", "02/02/2016", "03/02/2016",
"08/02/2016", "10/02/2016", "11/02/2016", "16/02/2016", "22/02/2016",
"26/01/2016", "27/01/2016", "28/01/2016"), class = "factor"),
Color = structure(c(30L, 33L, 11L, 1L, 18L, 18L, 11L,
16L, 19L, 19L, 22L, 1L, 18L, 18L, 13L, 14L, 13L, 18L, 24L,
24L, 11L, 24L, 2L, 33L, 25L, 1L, 30L, 5L, 24L, 18L, 13L,
35L, 19L, 19L, 18L, 23L, 19L, 8L, 19L, 14L), .Label = c("ARD",
"ARP", "BBB", "BIE", "CFX", "CHR", "DDD", "DOO", "EAU", "ELY",
"EPI", "ETR", "GEN", "GER", "GGG", "GIS", "ISE", "JUV", "LER",
"LES", "LON", "LYR", "MON", "NER", "NGY", "NOJ", "NYO", "ORI",
"PEO", "RAY", "RRR", "RSI", "SEI", "SEP", "VIL", "XQU", "YYY",
"ZYZ"), class = "factor"), Categorie = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "1,2", "1,2,3",
"1,3", "2", "2,3", "3", "4", "5"), class = "factor"), Portion_Longueur = c(3L,
4L, 1L, 1L, 2L, 4L, 5L, 6L, 7L, 7L, 8L, 8L, 9L, 8L, 8L, 9L,
11L, 7L, 7L, 7L, 9L, 8L, 3L, 8L, 7L, 11L, 2L, 9L, 8L, 5L,
8L, 12L, 3L, 4L, 1L, 3L, 3L, 3L, 4L, 5L)), .Names = c("Date",
"Color", "Categorie", "Portion_Longueur"), row.names = c(NA,
40L), class = "data.frame")
for (i in 1:1000) {
mysamp= sample(mydata$Color,size=30)
x=data.frame(table(mysamp))
if (i==1) w=x
else w <- c(w, x)
}
aggregate(w$Freq, by=list(Color=w$mysamp), FUN=sum)
For example, with 3 samples (for (i in 1:3)) I expect to get the sum, as follows:
But I do not get the sum; instead I get:
Color x
1 ARD 2
2 ARP 1
3 BBB 0
4 BIE 0
5 CFX 0
6 CHR 0
7 DDD 0
8 DOO 1
9 EAU 0
10 ELY 0
11 EPI 3
12 ETR 0
13 GEN 2
14 GER 2
15 GGG 0
16 GIS 1
17 ISE 0
18 JUV 4
19 LER 5
20 LES 0
21 LON 0
22 LYR 1
23 MON 1
24 NER 2
25 NGY 1
26 NOJ 0
27 NYO 0
28 ORI 0
29 PEO 0
30 RAY 1
31 RRR 0
32 RSI 0
33 SEI 2
34 SEP 0
35 VIL 1
36 XQU 0
37 YYY 0
38 ZYZ 0
How can I do this?
Thanks a lot.
Your for loop is what's causing your issues. You end up creating a big list that is somewhat difficult to perform calculations on (check out names(w) to see what I mean). A better data structure would allow for easier calculations:
x = NULL # initialize
for (i in 1:1000) {
  mysamp = sample(mydata$Color, size = 30) # sample
  mysamp = data.frame(table(mysamp))       # frequency
  x = rbind(x, mysamp)                     # bind to x
}
aggregate(Freq ~ mysamp, data = x, FUN = sum) # perform calculation
Note that this loop runs a bit slower than your loop, because rbind() copies the growing data frame on every iteration. See this post. Maybe someone will come along with a more efficient solution.
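One way to avoid growing a data frame inside the loop is to collect the counts in a matrix and sum its rows. This is a sketch on a small made-up factor (the vector colors and its levels are hypothetical stand-ins for mydata$Color); it relies on table() reporting every factor level, including those with a count of zero.

```r
set.seed(1)
# Hypothetical color observations; "ZYZ" is a level that never occurs
colors <- factor(c("ARD", "EPI", "JUV", "JUV", "LER", "LER", "LER", "GEN"),
                 levels = c("ARD", "EPI", "GEN", "JUV", "LER", "ZYZ"))
n_rep <- 1000
size <- 5

# Each column is the frequency table of one sample drawn without replacement
counts <- replicate(n_rep, as.vector(table(sample(colors, size = size))))

# Sum over the 1000 samples, one total per color
total <- setNames(rowSums(counts), levels(colors))
print(total)
```

With the real data, colors would be mydata$Color and size would be 30.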

Conditionally remove a row based on another id code

In a dataset that contains many ids, I only want to manipulate rows that have id 7 or 9 and leave everything else untouched.
I want to conditionally remove a row from id 7 or 9 whenever the other id has no row with the same itemcode. In the dput example below, I want to remove the itemcode=2 row from id=9 because id=7 does not have an itemcode=2. Conversely, I want to remove the itemcode=9 row from id=7 because id=9 does not have it.
id client item itemcode unit X2001 X2002 X2003 X2004 X2005 X2006 X2007
...
7 7 Bob eighth 8 100 13 18 15 NA NA NA NA
8 7 Bob ninth 9 100 11 21 10 NA NA NA NA
9 9 Bob_new first 1 100 NA NA NA 23 18 25 18
Code:
structure(list(id = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 10L), client = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), .Label = c("Bob",
"Bob_new", "Mark"), class = "factor"), item = structure(c(3L,
9L, 4L, 2L, 8L, 7L, 1L, 5L, 3L, 6L, 9L, 4L, 2L, 8L, 7L, 1L, 3L
), .Label = c("eighth", "fifth", "first", "fourth", "ninth",
"second", "seventh", "sixth", "third"), class = "factor"), itemcode = c(1L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L
), unit = c(100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), X2001 = structure(c(5L,
6L, 1L, 4L, 2L, 5L, 3L, 1L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L
), .Label = c("11", "12", "13", "22", "24", "25", "NA"), class = "factor"),
X2002 = structure(c(4L, 8L, 1L, 3L, 7L, 2L, 5L, 6L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L), .Label = c("13", "14", "15",
"17", "18", "21", "22", "24", "NA"), class = "factor"), X2003 = structure(c(5L,
1L, 4L, 2L, 6L, 1L, 3L, 1L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L), .Label = c("10", "11", "15", "19", "23", "24", "NA"), class = "factor"),
X2004 = structure(c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 5L, 4L,
2L, 6L, 1L, 3L, 4L, 3L, 4L), .Label = c("11", "14", "15",
"20", "23", "25", "NA"), class = "factor"), X2005 = structure(c(6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 3L, 2L, 4L, 3L, 5L, 3L, 1L, 4L,
3L), .Label = c("11", "13", "18", "19", "25", "NA"), class = "factor"),
X2006 = structure(c(9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 8L, 6L,
1L, 2L, 5L, 3L, 7L, 8L, 4L), .Label = c("10", "15", "18",
"19", "20", "22", "23", "25", "NA"), class = "factor"), X2007 = structure(c(8L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 4L, 7L, 6L, 2L, 4L, 1L, 5L, 5L,
3L), .Label = c("12", "13", "16", "18", "19", "21", "24",
"NA"), class = "factor")), .Names = c("id", "client", "item",
"itemcode", "unit", "X2001", "X2002", "X2003", "X2004", "X2005",
"X2006", "X2007"), class = "data.frame", row.names = c(NA, -17L
))
————————————————————————————————————————
ANOTHER SCENARIO:
before:
structure(list(id = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L), client = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L), .Label = c("Bob", "Bob_new", "Mark"), class = "factor"),
item = structure(c(3L, 9L, 10L, 9L, 4L, 2L, 8L, 7L, 7L, 1L,
5L, 3L, 6L, 9L, 4L, 2L, 8L, 7L, 1L, 3L), .Label = c("eighth",
"fifth", "first", "fourth", "ninth", "second", "seventh",
"sixth", "third", "third "), class = "factor"), itemcode = c(1L,
3L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 1L), type = structure(c(1L, 1L, 2L, 3L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("A",
"B", "C"), class = "factor"), unit = c(100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L), X2001 = c(24L,
25L, 30L, 26L, 11L, 22L, 12L, 25L, 24L, 13L, 11L, NA, NA,
NA, NA, NA, NA, NA, NA, NA), X2002 = c(17L, 24L, 12L, 96L,
13L, 15L, 22L, 21L, 14L, 18L, 21L, NA, NA, NA, NA, NA, NA,
NA, NA, NA), X2003 = c(23L, 10L, 46L, 94L, 19L, 11L, 24L,
19L, 10L, 15L, 10L, NA, NA, NA, NA, NA, NA, NA, NA, NA),
X2004 = c(NA, NA, 43L, 83L, NA, NA, NA, 6L, NA, NA, NA, 23L,
20L, 14L, 25L, 11L, 15L, 20L, 15L, 20L), X2005 = c(NA, NA,
97L, 86L, NA, NA, NA, 17L, NA, NA, NA, 18L, 13L, 19L, 18L,
25L, 18L, 11L, 19L, 18L), X2006 = c(NA, NA, 11L, 91L, NA,
NA, NA, 11L, NA, NA, NA, 25L, 22L, 10L, 15L, 20L, 18L, 23L,
25L, 19L), X2007 = c(NA, NA, 19L, 27L, NA, NA, NA, 15L, NA,
NA, NA, 18L, 24L, 21L, 13L, 18L, 12L, 19L, 19L, 16L)), .Names = c("id",
"client", "item", "itemcode", "type", "unit", "X2001", "X2002",
"X2003", "X2004", "X2005", "X2006", "X2007"), class = "data.frame", row.names = c(NA,
-20L))
after:
structure(list(id = c(7L, 7L, 7L, 7L, 7L, 7L, 9L, 9L, 9L, 9L,
9L, 9L, 10L), client = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L), .Label = c("Bob", "Bob_new", "Mark"), class = "factor"),
item = structure(c(2L, 7L, 3L, 1L, 5L, 4L, 2L, 6L, 3L, 1L,
5L, 4L, 2L), .Label = c("fifth", "first", "fourth", "seventh",
"sixth", "third", "third "), class = "factor"), itemcode = c(1L,
3L, 4L, 5L, 6L, 7L, 1L, 3L, 4L, 5L, 6L, 7L, 1L), type = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), unit = c(100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), X2001 = c(24L,
10L, 11L, 22L, 12L, 17L, NA, NA, NA, NA, NA, NA, NA), X2002 = c(17L,
87L, 13L, 15L, 22L, 19L, NA, NA, NA, NA, NA, NA, NA), X2003 = c(23L,
47L, 19L, 11L, 24L, 17L, NA, NA, NA, NA, NA, NA, NA), X2004 = c(NA,
28L, NA, NA, NA, 28L, 23L, 14L, 25L, 11L, 15L, 20L, 20L),
X2005 = c(NA, 43L, NA, NA, NA, 16L, 18L, 19L, 18L, 25L, 18L,
11L, 18L), X2006 = c(NA, 69L, NA, NA, NA, 5L, 25L, 10L, 15L,
20L, 18L, 23L, 19L), X2007 = c(NA, 72L, NA, NA, NA, 20L,
18L, 21L, 13L, 18L, 12L, 19L, 16L)), .Names = c("id", "client",
"item", "itemcode", "type", "unit", "X2001", "X2002", "X2003",
"X2004", "X2005", "X2006", "X2007"), class = "data.frame", row.names = c(NA,
-13L))
I was able to apply the filter code above to remove items that have no counterpart under the other id (7 or 9).
But what if items have sub-levels, such as a type? I also want to remove items whose type has no match under the other id.
You could use filter from dplyr:
library(dplyr)
filter(df_all, itemcode %in% intersect(itemcode[id == 7],
                                       itemcode[id == 9]) | !id %in% c(7, 9))
# id client item itemcode unit X2001 X2002 X2003 X2004 X2005 X2006 X2007
#1 7 Bob first 1 100 24 17 23 NA NA NA NA
#2 7 Bob third 3 100 25 24 10 NA NA NA NA
#3 7 Bob fourth 4 100 11 13 19 NA NA NA NA
#4 7 Bob fifth 5 100 22 15 11 NA NA NA NA
#5 7 Bob sixth 6 100 12 22 24 NA NA NA NA
#6 7 Bob seventh 7 100 24 14 10 NA NA NA NA
#7 7 Bob eighth 8 100 13 18 15 NA NA NA NA
#8 9 Bob_new first 1 100 NA NA NA 23 18 25 18
#9 9 Bob_new third 3 100 NA NA NA 14 19 10 21
#10 9 Bob_new fourth 4 100 NA NA NA 25 18 15 13
#11 9 Bob_new fifth 5 100 NA NA NA 11 25 20 18
#12 9 Bob_new sixth 6 100 NA NA NA 15 18 18 12
#13 9 Bob_new seventh 7 100 NA NA NA 20 11 23 19
#14 9 Bob_new eighth 8 100 NA NA NA 15 19 25 19
#15 10 Mark first 1 100 NA NA NA 20 18 19 16
Update
Based on the new dataset, perhaps this helps:
library(dplyr)
library(tidyr)
dfnew %>%
  unite(itemtype, itemcode, type) %>%
  filter(itemtype %in% intersect(itemtype[id == 7],
                                 itemtype[id == 9]) | !id %in% c(7, 9)) %>%
  separate(itemtype, c('itemcode', 'type'))
# id client item itemcode type unit X2001 X2002 X2003 X2004 X2005 X2006
# 1 7 Bob first 1 A 100 24 17 23 NA NA NA
# 2 7 Bob third 3 B 100 30 12 46 43 97 11
# 3 7 Bob fourth 4 A 100 11 13 19 NA NA NA
# 4 7 Bob fifth 5 A 100 22 15 11 NA NA NA
# 5 7 Bob sixth 6 A 100 12 22 24 NA NA NA
# 6 7 Bob seventh 7 A 100 25 21 19 6 17 11
# 7 9 Bob_new first 1 A 100 NA NA NA 23 18 25
# 8 9 Bob_new third 3 B 100 NA NA NA 14 19 10
# 9 9 Bob_new fourth 4 A 100 NA NA NA 25 18 15
# 10 9 Bob_new fifth 5 A 100 NA NA NA 11 25 20
# 11 9 Bob_new sixth 6 A 100 NA NA NA 15 18 18
# 12 9 Bob_new seventh 7 A 100 NA NA NA 20 11 23
# 13 10 Mark first 1 A 100 NA NA NA 20 18 19
# X2007
#1 NA
#2 19
#3 NA
#4 NA
#5 NA
#6 15
#7 18
#8 21
#9 13
#10 18
#11 12
#12 19
#13 16
If I understand the problem: every itemcode in the id=9 subset must have an identical itemcode in the id=7 subset (and vice versa). If it does not, we filter out the row with the unpaired itemcode, but keep everything whose id is not 7 or 9. Here is one way of doing it:
First get the common item codes:
items_9 <- df_all$itemcode[ df_all$id==9 ]
items_7 <- df_all$itemcode[ df_all$id==7 ]
items_common <- items_9[ items_9 %in% items_7 ]
Then select the rows with common itemcodes for ids 7 and 9, plus all remaining rows:
df_new <- df_all[
  which(
    (df_all$id %in% c(7, 9) &
       df_all$itemcode %in% items_common) |
      !df_all$id %in% c(7, 9)
  ),
]
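Another option is to filter on a combined itemcode/type key:
library(dplyr)
# Build a temporary key from itemcode and type
df$remove <- paste(df$itemcode, df$type)
# Keep ids 7 and 9 only where the key occurs under both ids; keep all other ids
df <- filter(df,
  remove %in% intersect(remove[id == 7],
                        remove[id == 9]) | !id %in% c(7, 9))
# Remove the additional column after filtering
df$remove <- NULL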
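The combined-key idea (itemcode plus type) can also be checked on a few rows in base R. The data below are made up for illustration, and the names key, shared and keep are hypothetical.

```r
# Toy data (hypothetical): ids 7 and 9 share only the key "1 A"
df <- data.frame(
  id       = c(7, 7, 7, 9, 9, 9, 10),
  itemcode = c(1, 3, 9, 1, 2, 3, 1),
  type     = c("A", "B", "A", "A", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# One key per row, built from itemcode and type
key <- paste(df$itemcode, df$type)

# Keys that occur under both id 7 and id 9
shared <- intersect(key[df$id == 7], key[df$id == 9])

# Keep shared keys for ids 7 and 9; keep every row of any other id
keep <- !(df$id %in% c(7, 9)) | key %in% shared
result <- df[keep, ]
print(result)
```

Here itemcode 3 is dropped for both ids because its type differs (B under id 7, A under id 9), while the id 10 row is left untouched.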
You could do something like this, which runs setdiff in both directions. The cl() function wasn't really necessary, but I really don't like writing the same expression over and over again.
f <- function(x, y) setdiff(union(x, y), x)
cl <- function(var) substitute(df$itemcode[df$id == x], list(x = var))
So now you can call f() on c(id7, id9) and then reverse it and get the c(id9, id7) result.
do.call(f, x <- list(cl(7), cl(9)))
# [1] 2
do.call(f, rev(x))
# [1] 9
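As a side note, setdiff(union(x, y), x) returns the elements of y that are not in x, which is the same set as setdiff(y, x). A small sketch with made-up item codes (items_7 and items_9 are hypothetical vectors mirroring the example) shows both directions:

```r
# Hypothetical item codes for the two ids
items_7 <- c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
items_9 <- 1:8

# Codes present under one id but missing under the other
only_in_9 <- setdiff(items_9, items_7)
only_in_7 <- setdiff(items_7, items_9)

# Same result as the f() formulation, setdiff(union(x, y), x)
same <- identical(setdiff(union(items_7, items_9), items_7), only_in_9)

print(only_in_9)
print(only_in_7)
```

Rows with these codes under id 7 or 9 are the ones to drop.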

R: aggregate multiple strings

I'm trying to aggregate multiple rows by column in a data frame. I managed to use aggregate for one column \o/ but I don't understand how to use it for several columns. Here is an example of my data:
Gene_Title ID_Affymetrix GB_Acc.x Gene_Symbol.x Entrez ID_Agl GB_Acc.y Gene_Symbol.y Unigene Ensembl Chr_location
trafficking protein particle complex 4 1429632_at AK005276 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
aldo-keto reductase family 1, member B3 (aldose reductase) 1437133_x_at AV127085 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
sodium channel, voltage-gated, type I, alpha 1450120_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
sodium channel, voltage-gated, type I, alpha 1450121_at AV336781 Scn1a 20265 58 NM_018733 Scn1a Mm.439704 ENSMUST00000094951 chr2:66173557-66173498
aldo-keto reductase family 1, member B3 (aldose reductase) 1456590_x_at BB469763 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
dolichol-phosphate (beta-D) mannosyltransferase 2 1415675_at BC008256 Dpm2 13481 33459 NM_010073 Dpm2 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825
proline rich 13 1423686_a_at BC016234 Prr13 66151 4 NM_025385 Prr13 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149
transmembrane protein 2 1424711_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
transmembrane protein 2 1451458_at BC019745 Tmem2 83921 23 NM_031997 Tmem2 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251
lipase, endothelial 1450188_s_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421261_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
lipase, endothelial 1421262_at BC020991 Lipg 16891 52 NM_010720 Lipg Mm.299647 ENSMUST00000066532 chr18:75099688-75099629
coatomer protein complex, subunit gamma 1415670_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
coatomer protein complex, subunit gamma 1416017_at BC024686 Copg 54161 25829 NM_017477 Copg Mm.258785 ENSMUST00000113607 chr6:87862890-87862949
leucine rich repeat containing 1 1452411_at BG966295 Lrrc1 214345 29 NM_172528 Lrrc1 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939
aldo-keto reductase family 1, member B3 (aldose reductase) 1448319_at NM_009658 Akr1b3 11677 22 NM_009658 Akr1b3 Mm.451 ENSMUST00000166583 chr6:34253982-34253930
ATPase, H+ transporting, lysosomal V0 subunit D1 1415671_at NM_013477 Atp6v0d1 11972 11826 NM_013477 Atp6v0d1 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778
golgi autoantigen, golgin subfamily a, 7 1415672_at NM_020585 Golga7 57437 54944 NM_020585 Golga7 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919
trafficking protein particle complex 4 1415674_a_at NM_021789 Trappc4 60409 10239 NM_021789 Trappc4 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859
phosphoserine phosphatase 1415673_at NM_133900 Psph 100678 57142 NM_133900 Psph Mm.271784 ENSMUST00000031399 chr5:130271500-130271441
Some Gene_Title (and Gene_Symbol) values appear several times, but with different IDs (Affymetrix or Agilent) or different GB_Acc values. In general I want only one row per gene, with the different values collected in the ID, GB_Acc and other columns.
Here is my data aggregated on ID_Affymetrix:
>f=function(x){return(paste(x,collapse=","))}
>tab4=aggregate(ID_Affymetrix ~ GB_Acc.x+ Gene_Title+GB_Acc.y+Gene_Symbol.x+Entrez+Unigene+Ensembl+Chr_location+ID_Agl,data=tab3,f)
GB_Acc.x Gene_Title GB_Acc.y Gene_Symbol.x Entrez Unigene Ensembl Chr_location ID_Agl ID_Affymetrix
BC016234 proline rich 13 NM_025385 Prr13 66151 Mm.393955 ENSMUST00000164688 chr15:102291090-102291149 4 1423686_a_at
AV127085 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1437133_x_at
BB469763 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1456590_x_at
NM_009658 aldo-keto reductase family 1, member B3 (aldose reductase) NM_009658 Akr1b3 11677 Mm.451 ENSMUST00000166583 chr6:34253982-34253930 22 1448319_at
BC019745 transmembrane protein 2 NM_031997 Tmem2 83921 Mm.329776 ENSMUST00000096194 chr19:21930192-21930251 23 1424711_at,1451458_at
BG966295 leucine rich repeat containing 1 NM_172528 Lrrc1 214345 Mm.28534 ENSMUST00000049755 chr9:77278998-77278939 29 1452411_at
BC020991 lipase, endothelial NM_010720 Lipg 16891 Mm.299647 ENSMUST00000066532 chr18:75099688-75099629 52 1450188_s_at,1421261_at,1421262_at
AV336781 sodium channel, voltage-gated, type I, alpha NM_018733 Scn1a 20265 Mm.439704 ENSMUST00000094951 chr2:66173557-66173498 58 1450120_at,1450121_at
AK005276 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1429632_at
NM_021789 trafficking protein particle complex 4 NM_021789 Trappc4 60409 Mm.29814 ENSMUST00000170082 chr9:44211918-44211859 10239 1415674_a_at
NM_013477 ATPase, H+ transporting, lysosomal V0 subunit D1 NM_013477 Atp6v0d1 11972 Mm.17708 ENSMUST00000013304 chr8:108048837-108048778 11826 1415671_at
BC024686 coatomer protein complex, subunit gamma NM_017477 Copg 54161 Mm.258785 ENSMUST00000113607 chr6:87862890-87862949 25829 1415670_at,1416017_at
BC008256 dolichol-phosphate (beta-D) mannosyltransferase 2 NM_010073 Dpm2 13481 Mm.22001 ENSMUST00000150419 chr2:32428766-32428825 33459 1415675_at
NM_020585 golgi autoantigen, golgin subfamily a, 7 NM_020585 Golga7 57437 Mm.196269 ENSMUST00000121783 chr8:24351978-24351919 54944 1415672_at
NM_133900 phosphoserine phosphatase NM_133900 Psph 100678 Mm.271784 ENSMUST00000031399 chr5:130271500-130271441 57142 1415673_at
As you can see, for Tmem2, Copg, Lipg and Scn1a I now have several ID_Affymetrix values in the same row. For these genes the only difference was in this column. But for Akr1b3 and Trappc4 there were also differences in the GB_Acc.x column, so they still span several rows.
So, more generally, I would like to aggregate every column (except Gene_Title and Gene_Symbol, which should always be the same for a given gene) and end up with, for example:
Gene_Title Gene_Symbol GB_Acc ID_Affy ...
trafficking protein particle complex 4 Trappc4 AK005276,NM_021789 1429632_at,1415674_a_at ...
Does anyone have an idea?
Thanks!
EDIT:
This is the output of dput(head(mydata, 20)). There are some errors at the end, but I didn't know this function or what it is for.
structure(list(Gene_Title = structure(c(1L, 1L, 1L, 2L, 3L, 3L,
4L, 5L, 6L, 7L, 7L, 7L, 8L, 9L, 10L, 10L, 11L, 11L, 12L, 12L), .Label = c("aldo-keto reductase family 1, member B3 (aldose reductase)",
"ATPase, H+ transporting, lysosomal V0 subunit D1", "coatomer protein complex, subunit gamma",
"dolichol-phosphate (beta-D) mannosyltransferase 2", "golgi autoantigen, golgin subfamily a, 7",
"leucine rich repeat containing 1", "lipase, endothelial", "phosphoserine phosphatase",
"proline rich 13", "sodium channel, voltage-gated, type I, alpha",
"trafficking protein particle complex 4", "transmembrane protein 2"
), class = "factor"), ID_Affymetrix = structure(c(13L, 14L, 20L,
2L, 1L, 7L, 6L, 3L, 19L, 17L, 8L, 9L, 4L, 10L, 15L, 16L, 5L,
12L, 11L, 18L), .Label = c("1415670_at", "1415671_at", "1415672_at",
"1415673_at", "1415674_a_at", "1415675_at", "1416017_at", "1421261_at",
"1421262_at", "1423686_a_at", "1424711_at", "1429632_at", "1437133_x_at",
"1448319_at", "1450120_at", "1450121_at", "1450188_s_at", "1451458_at",
"1452411_at", "1456590_x_at"), class = "factor"), GB_Acc.x = structure(c(2L,
11L, 4L, 12L, 9L, 9L, 5L, 13L, 10L, 8L, 8L, 8L, 15L, 6L, 3L,
3L, 14L, 1L, 7L, 7L), .Label = c("AK005276", "AV127085", "AV336781",
"BB469763", "BC008256", "BC016234", "BC019745", "BC020991", "BC024686",
"BG966295", "NM_009658", "NM_013477", "NM_020585", "NM_021789",
"NM_133900"), class = "factor"), Gene_Symbol.x = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Entrez = c(11677L, 11677L,
11677L, 11972L, 54161L, 54161L, 13481L, 57437L, 214345L, 16891L,
16891L, 16891L, 100678L, 66151L, 20265L, 20265L, 60409L, 60409L,
83921L, 83921L), ID_Agl = c(22L, 22L, 22L, 11826L, 25829L, 25829L,
33459L, 54944L, 29L, 52L, 52L, 52L, 57142L, 4L, 58L, 58L, 10239L,
10239L, 23L, 23L), GB_Acc.y = structure(c(1L, 1L, 1L, 4L, 5L,
5L, 2L, 7L, 12L, 3L, 3L, 3L, 11L, 9L, 6L, 6L, 8L, 8L, 10L, 10L
), .Label = c("NM_009658", "NM_010073", "NM_010720", "NM_013477",
"NM_017477", "NM_018733", "NM_020585", "NM_021789", "NM_025385",
"NM_031997", "NM_133900", "NM_172528"), class = "factor"), Gene_Symbol.y = structure(c(1L,
1L, 1L, 2L, 3L, 3L, 4L, 5L, 7L, 6L, 6L, 6L, 9L, 8L, 10L, 10L,
12L, 12L, 11L, 11L), .Label = c("Akr1b3", "Atp6v0d1", "Copg",
"Dpm2", "Golga7", "Lipg", "Lrrc1", "Prr13", "Psph", "Scn1a",
"Tmem2", "Trappc4"), class = "factor"), Unigene = structure(c(12L,
12L, 12L, 1L, 4L, 4L, 3L, 2L, 6L, 8L, 8L, 8L, 5L, 10L, 11L, 11L,
7L, 7L, 9L, 9L), .Label = c("Mm.17708", "Mm.196269", "Mm.22001",
"Mm.258785", "Mm.271784", "Mm.28534", "Mm.29814", "Mm.299647",
"Mm.329776", "Mm.393955", "Mm.439704", "Mm.451"), class = "factor"),
Ensembl = structure(c(11L, 11L, 11L, 1L, 7L, 7L, 9L, 8L,
3L, 4L, 4L, 4L, 2L, 10L, 5L, 5L, 12L, 12L, 6L, 6L), .Label = c("ENSMUST00000013304",
"ENSMUST00000031399", "ENSMUST00000049755", "ENSMUST00000066532",
"ENSMUST00000094951", "ENSMUST00000096194", "ENSMUST00000113607",
"ENSMUST00000121783", "ENSMUST00000150419", "ENSMUST00000164688",
"ENSMUST00000166583", "ENSMUST00000170082"), class = "factor"),
Chr_location = structure(c(7L, 7L, 7L, 9L, 8L, 8L, 4L, 10L,
12L, 2L, 2L, 2L, 6L, 1L, 5L, 5L, 11L, 11L, 3L, 3L), .Label = c("chr15:102291090-102291149",
"chr18:75099688-75099629", "chr19:21930192-21930251", "chr2:32428766-32428825",
"chr2:66173557-66173498", "chr5:130271500-130271441", "chr6:34253982-34253930",
"chr6:87862890-87862949", "chr8:108048837-108048778", "chr8:24351978-24351919",
"chr9:44211918-44211859", "chr9:77278998-77278939"), class = "factor")), .Names = c("Gene_Title",
"ID_Affymetrix", "GB_Acc.x", "Gene_Symbol.x", "Entrez", "ID_Agl",
"GB_Acc.y", "Gene_Symbol.y", "Unigene", "Ensembl", "Chr_location"
), row.names = c(NA, 20L), class = "data.frame")
Error in `?`(dput(head(tab3, 20)), dput(head(tab3, 20))) :
c("no documentation of type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 12)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(13, 14, 20, 2, 1, 7, 6, 3, 19, 17, 8, 9, 4, 10, 15, 16, 5, 12, 11, 18)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(2, 11, 4, 12, 9, 9, 5, 13, 10, 8, 8, 8, 15, 6, 3, 3, 14, 1, 7, 7)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)",
"no documentation of type ‘c(1, 1, 1, 2, 3, 3, 4, 5, 7, 6, 6, 6, 9, 8, 10, 10, 12, 12, 11, 11)’ and topic ‘dput(head(tab3, 20))’ (or error in processing help)", "no documentation of type ‘c(11677, 11677, 11677, 11972, 54161, 54161, 13481, 57437, 214345, 16891, 16891, 16891, 100678, 66151, 20265, 20265, 60409, 60409, 83921, 83921)’ and topic ‘dput(head(tab3, 20))’ (or error in processing
Maybe this is what you're looking for?
library(dplyr)
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol) %>%
summarise_each(funs(paste(., collapse = ",")))
I didn't test it with your data though, because I couldn't copy and paste it into my session.
Update:
In your data you have two columns Gene_Symbol.x and Gene_Symbol.y which were probably created during a merge. I assume they have the same information, and hence you could adjust the code to:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(., collapse = ",")), -Gene_Symbol.y)
Or, if you only want unique entries in each column (as in @juba's answer), you can write:
dfcollapsed <- df %>% # replace df with the name of your data frame
group_by(Gene_Title, Gene_Symbol.x) %>%
summarise_each(funs(paste(unique(.), collapse = ",")), -Gene_Symbol.y)
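A note for newer dplyr versions: summarise_each() and funs() have since been deprecated. Below is a minimal sketch of the same collapse written with across(), assuming dplyr 1.0.0 or later. The small data frame is a hand-copied subset of the question's columns, used only so the example runs on its own.

```r
library(dplyr)

# Small subset of the question's data (five rows, five columns).
df <- data.frame(
  Gene_Title    = c(rep("trafficking protein particle complex 4", 2),
                    rep("lipase, endothelial", 3)),
  Gene_Symbol.x = c("Trappc4", "Trappc4", "Lipg", "Lipg", "Lipg"),
  Gene_Symbol.y = c("Trappc4", "Trappc4", "Lipg", "Lipg", "Lipg"),
  GB_Acc.x      = c("AK005276", "NM_021789",
                    "BC020991", "BC020991", "BC020991"),
  ID_Affymetrix = c("1429632_at", "1415674_a_at",
                    "1450188_s_at", "1421261_at", "1421262_at"),
  stringsAsFactors = FALSE
)

dfcollapsed <- df %>%
  group_by(Gene_Title, Gene_Symbol.x) %>%
  summarise(
    # Collapse the unique values of every non-grouping column
    # except Gene_Symbol.y, which is dropped from the result.
    across(-Gene_Symbol.y, ~ paste(unique(.x), collapse = ",")),
    .groups = "drop"
  )

print(dfcollapsed)
```

Grouping columns are excluded from across() automatically, so only GB_Acc.x and ID_Affymetrix are collapsed here.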
Hope that helps.
Maybe the following, with aggregate:
f <- function(v) {paste(unique(v), collapse=", ")}
aggregate(tab3, list(tab3$Gene_Title, tab3$Gene_Symbol.x), f)
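The call above also runs f over the grouping columns themselves and adds Group.1 and Group.2 columns to the result. A small variation, sketched here on a hand-copied subset of the question's data, passes only the non-grouping columns as values and names the groups through by. The names collapse_unique, group_cols and value_cols are just illustrative.

```r
# Small subset of the question's data (four rows, four columns).
tab3 <- data.frame(
  Gene_Title    = c(rep("trafficking protein particle complex 4", 2),
                    rep("transmembrane protein 2", 2)),
  Gene_Symbol.x = c("Trappc4", "Trappc4", "Tmem2", "Tmem2"),
  GB_Acc.x      = c("AK005276", "NM_021789", "BC019745", "BC019745"),
  ID_Affymetrix = c("1429632_at", "1415674_a_at",
                    "1424711_at", "1451458_at"),
  stringsAsFactors = FALSE
)

# Collapse the distinct values of a column into one comma-separated string.
collapse_unique <- function(v) paste(unique(v), collapse = ",")

group_cols <- c("Gene_Title", "Gene_Symbol.x")
value_cols <- setdiff(names(tab3), group_cols)

# One row per gene; every other column holds its collapsed values.
res <- aggregate(tab3[value_cols], by = tab3[group_cols],
                 FUN = collapse_unique)
print(res)
```

Because by is a named list (a data frame subset), the grouping columns keep their original names in the result.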

Resources