Improve speed of 3 for loops in R - r

I am working with a matrix set_onco of 206 rows x 196 cols and I have a vector, genes_100 (it's a matrix but I take only the first col), with 101 names.
here's a snippet of how they look
> set_onco[1:10,1:10]
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
GLI1_UP.V1_DN COPZ1 C10orf46 C20orf118 TMEM181 CCNL2 YIPF1 GTDC1 OPN3 RSAD2 SLC22A1
GLI1_UP.V1_UP IGFBP6 HLA-DQB1 CCND2 PTH1R TXNDC12 M6PR PPT2 STAU1 IGJ TMOD3
E2F1_UP.V1_DN TGFB1I1 CXCL5 POU5F1 SAMD10 KLF2 STAT6 ENTPD6 VCAN HMGCS1 ANXA8
E2F1_UP.V1_UP RRP1B HES1 ADCY6 CHAF1B VPS37B GRSF1 TLX2 SSX2IP DNA2 CMA1
EGFR_UP.V1_DN NPY1R PDZK1 GFRA1 GREB1 MSMB DLC1 MYB SLC6A14 IFI44 IFI44L
EGFR_UP.V1_UP FGG GBP1 TNFRSF11B FGB GJA1 DUSP6 S100A9 ADM ITGB6 DUSP4
ERB2_UP.V1_DN NPY1R PDZK1 ANXA3 GREB1 HSPB8 DLC1 NRIP1 FHL2 EGR3 IFI44
FAM18B1
ERB2_UP.V1_UP CYP1A1 CEACAM5 FAM129A TNFRSF11B DUSP4 CYP1B1 UPK2 DAB2 CEACAM6 KIAA1199
GCNP_SHH_UP_EARLY.V1_DN SRRM2 KIAA1217 DEFA1 DLK1 PITX2 CCL2 UPK3B SEZ6 TAF15 EMP1
genes_100[1:10,1]
[1] AL591845.1 B3GALT6 RAP1GAP HSPG2 BX293535.1 RP1-159A19.1 IFI6 FAM76A FAM176B CSF3R
101 Levels: 5_8S_rRNA AC018470.1 AC091179.2 AC103702.3 AC138972.1 ACVR1B AL049829.5 AL137797.2 AL139260.2 AL450326.2 AL591845.1 AL607122.2 B3GALT6 BX293535.1 ... ZNF678
what I want to do is to parse through the matrix and count the frequency at which each row contains the names in genes_100
to do that I created 3 for loops: the first one moves down one row at the time, the second one moves into the row and the third one loops over the list genes_100 checking for matches.
at the end I save in a matrix how many times genes_100 matched with the terms in each row, saving also the row names from the matrix (so that I know which one is which)
the code works and gives me the correct output...but it's just really slow!!
a snippet of the output is:
head(result_matrix_100)
freq_100
[1,] "GLI1_UP.V1_DN" "0"
[2,] "GLI1_UP.V1_UP" "0"
[3,] "E2F1_UP.V1_DN" "0"
[4,] "E2F1_UP.V1_UP" "0"
[5,] "EGFR_UP.V1_DN" "0"
[6,] "EGFR_UP.V1_UP" "0"
I used system.time() and I get:
user system elapsed
525.38 0.06 530.34
which is way too slow since I have even bigger matrices to parse, and in some cases I have to repeat this 10k times!!!
the code is:
result_matrix_100 <- matrix(nrow=0, ncol=2)
for (q in seq(1,nrow(set_onco),1)) {
for (j in seq(1, length(set_onco[q,]),1)) {
for (x in seq(1,101,1)) {
if (as.character(genes_100[x,1]) == as.character(set_onco[q,j])) {
freq_100 <- freq_100+1
}
}
}
result_matrix_100 <- rbind(result_matrix_100, cbind(row.names(set_onco)[q], freq_100))
}
what would you suggest?
thanks in advance :)

#joran's will possibly be faster although it may not be "factor-safe". Your set_onco values are probably encoded as factor variables (because your genes_100 object clearly is.) This will be safer:
set_onco[] <- lapply(set_onco, as.character)
# that converts a data.frame with factor columns to character valued
# at that point #joran's solution could be used safely
freq100 <- apply(set_onco, 1, function(x) sum(x %in% genes_100) )
# that does a row-by-row count of the number of matches to genes_100
freq100
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN
0 0 0
E2F1_UP.V1_UP EGFR_UP.V1_DN EGFR_UP.V1_UP
0 0 0
ERB2_UP.V1_DN ERB2_UP.V1_UP GCNP_SHH_UP_EARLY.V1_DN
0 0 0
The size of your dataset (206 rows x 196 cols) is quite small so this will be virtually immediate. These dput statements and output can be used to construct what I think your objects look like internally:
dput(set_onco)
structure(list(V2 = structure(c(1L, 4L, 8L, 6L, 5L, 3L, 5L, 2L,
7L), .Label = c("COPZ1", "CYP1A1", "FGG", "IGFBP6", "NPY1R",
"RRP1B", "SRRM2", "TGFB1I1"), class = "factor"), V3 = structure(c(1L,
6L, 3L, 5L, 8L, 4L, 8L, 2L, 7L), .Label = c("C10orf46", "CEACAM5",
"CXCL5", "GBP1", "HES1", "HLA-DQB1", "KIAA1217", "PDZK1"), class = "factor"),
V4 = structure(c(3L, 4L, 8L, 1L, 7L, 9L, 2L, 6L, 5L), .Label = c("ADCY6",
"ANXA3", "C20orf118", "CCND2", "DEFA1", "FAM129A", "GFRA1",
"POU5F1", "TNFRSF11B"), class = "factor"), V5 = structure(c(7L,
5L, 6L, 1L, 4L, 3L, 4L, 8L, 2L), .Label = c("CHAF1B", "DLK1",
"FGB", "GREB1", "PTH1R", "SAMD10", "TMEM181", "TNFRSF11B"
), class = "factor"), V6 = structure(c(1L, 8L, 5L, 9L, 6L,
3L, 4L, 2L, 7L), .Label = c("CCNL2", "DUSP4", "GJA1", "HSPB8",
"KLF2", "MSMB", "PITX2", "TXNDC12", "VPS37B"), class = "factor"),
V7 = structure(c(8L, 6L, 7L, 5L, 3L, 4L, 3L, 2L, 1L), .Label = c("CCL2",
"CYP1B1", "DLC1", "DUSP6", "GRSF1", "M6PR", "STAT6", "YIPF1"
), class = "factor"), V8 = structure(c(2L, 5L, 1L, 7L, 3L,
6L, 4L, 8L, 9L), .Label = c("ENTPD6", "GTDC1", "MYB", "NRIP1",
"PPT2", "S100A9", "TLX2", "UPK2", "UPK3B"), class = "factor"),
V9 = structure(c(4L, 8L, 9L, 7L, 6L, 1L, 3L, 2L, 5L), .Label = c("ADM",
"DAB2", "FHL2", "OPN3", "SEZ6", "SLC6A14", "SSX2IP", "STAU1",
"VCAN"), class = "factor"), V10 = structure(c(8L, 6L, 4L,
2L, 5L, 7L, 3L, 1L, 9L), .Label = c("CEACAM6", "DNA2", "EGR3",
"HMGCS1", "IFI44", "IGJ", "ITGB6", "RSAD2", "TAF15"), class = "factor"),
V11 = structure(c(8L, 9L, 1L, 2L, 6L, 3L, 5L, 7L, 4L), .Label = c("ANXA8",
"CMA1", "DUSP4", "EMP1", "IFI44", "IFI44L", "KIAA1199", "SLC22A1",
"TMOD3"), class = "factor")), .Names = c("V2", "V3", "V4",
"V5", "V6", "V7", "V8", "V9", "V10", "V11"), class = "data.frame", row.names = c("GLI1_UP.V1_DN",
"GLI1_UP.V1_UP", "E2F1_UP.V1_DN", "E2F1_UP.V1_UP", "EGFR_UP.V1_DN",
"EGFR_UP.V1_UP", "ERB2_UP.V1_DN", "ERB2_UP.V1_UP", "GCNP_SHH_UP_EARLY.V1_DN"
))
dput(factor(genes_100) )
structure(c(1L, 2L, 9L, 7L, 3L, 10L, 8L, 6L, 5L, 4L), .Label = c("AL591845.1",
"B3GALT6", "BX293535.1", "CSF3R", "FAM176B", "FAM76A", "HSPG2",
"IFI6", "RAP1GAP", "RP1-159A19.1"), class = "factor")

Something like this will probably be quite fast:
#Sample data
m <- matrix(sample(letters,206*196,replace = TRUE),206,196)
genes_100 <- letters[1:5]
m1 <- matrix(m %in% genes_100,206,196)
rowSums(m1)

Related

Issues with pivot_wider and unique identifiers because of duplicate values

I'm trying to use pivot_wider move my dataset from long to wide so I can use it in a different programme.
I have seen the other posts on this topic but the solutions don't address my problem.
I have measurement variable called "rating" which has a value for each "rock" and each test ("gentest", first and second). I have an id variable called "turkcode".
For each individual in the dataset, there are 18 ratings. The problem is that there are 4 ratings for rock #8 and I think this is why the data won't pivot wider the way I want them to.
Here's a subset of the data
structure(list(turkcode = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("100879",
"104655", "108505", "110324", "110600", "112445", "114083", "115814",
"116573", "117411", "117817", "118651", "119324", "121548", "121883",
"121918", "123275", "123718", "125491", "127450", "127825", "128062",
"129061", "131404", "135358", "135594", "135671", "135945", "137951",
"138675", "139469", "140924", "145730", "147222", "148533", "150851",
"153455", "158882", "164468", "166907", "169260", "171463", "172398",
"175565", "177108", "179000", "180270", "183953", "185574", "185880",
"185948", "186371", "187787", "189220", "190014", "192550", "193904",
"195308", "196755", "197493", "198368", "200155", "200297", "201915",
"214519", "215994", "217903", "218771", "219302", "220434", "222740",
"223223", "224721", "225118", "225223", "229856", "229874", "231301",
"232576", "233842", "234215", "237581", "239567", "240609", "241098",
"241423", "242108", "244633", "246055", "251597", "252929", "255252",
"256652", "259936", "274962", "277053", "279422", "280317", "282602",
"283750", "285737", "286259", "287544", "288507", "290503", "291401",
"291835", "292160", "294117", "297863", "298061", "299347", "299499",
"301399", "304875", "305231", "306312", "307410", "308979", "311157",
"311524", "311630", "318956", "318988", "319995", "321405", "324288",
"327086", "327559", "328345", "328401", "330318", "330909", "332723",
"334115", "334517", "335811", "335831", "337145", "338323", "338542",
"338575", "340083", "341182", "343612", "343947", "344554", "346476",
"349874", "350117", "350433", "350972", "351187", "355311", "356717",
"359366", "360048", "360058", "361191", "361971", "362827", "363543",
"367244", "374254", "374965", "376278", "377622", "382139", "382916",
"384586", "385229", "386782", "388951", "389029", "390299", "390662",
"396335", "396732", "398076", "398573", "399276", "399587", "403388",
"406073", "406160", "411977", "412935", "417350", "420060", "421393",
"422944", "424462", "427143", "429291", "430758", "431629", "431638",
"431935", "432218", "433788", "434291", "436681", "437087", "439385",
"439499", "440477", "440834", "441253", "441876", "443826", "444080",
"447597", "452643", "454649", "457055", "457946", "463512", "464079",
"464123", "467897", "468650", "470211", "471115", "471512", "475493",
"476937", "479198", "482871", "484066", "484070", "485462", "486402",
"491701", "491835", "499644", "501833", "502335", "502373", "504800",
"507439", "507946", "507987", "509066", "513078", "515519", "517017",
"517988", "519144", "519210", "519858", "522847", "523683", "525315",
"528577", "532463", "532630", "533028", "539033", "539852", "540690",
"546773", "546916", "549652", "551599", "554198", "556066", "559920",
"560804", "560857", "562080", "562420", "563841", "565668", "565776",
"566509", "569039", "572553", "575364", "576421", "576694", "576877",
"577120", "577155", "577534", "577605", "578463", "578820", "578995",
"580213", "581893", "582433", "582905", "583887", "584569", "585314",
"585566", "587393", "589144", "592284", "594463", "596863", "601837",
"602632", "604254", "605885", "609296", "609963", "610062", "612437",
"612949", "613161", "614372", "614777", "615372", "615384", "616927",
"618118", "620041", "620336", "621634", "622289", "624098", "626163",
"626612", "627019", "627856", "630003", "630255", "634018", "634478",
"635801", "638606", "640012", "641078", "641366", "641436", "641821",
"642076", "642446", "643329", "643942", "644015", "646792", "647254",
"647700", "649516", "650792", "650810", "651229", "652387", "652671",
"654778", "657964", "658894", "660500", "660607", "664469", "666754",
"666796", "668996", "669712", "671682", "673516", "675712", "677835",
"678008", "679262", "680295", "686455", "690471", "691175", "692489",
"694023", "696001", "698716", "700133", "700641", "707812", "707953",
"708010", "708881", "713657", "715255", "715386", "716764", "718936",
"719956", "725348", "727753", "728436", "729588", "730513", "731928",
"732013", "732438", "733366", "733559", "734672", "735174", "735675",
"737044", "737127", "741264", "745262", "748173", "748414", "748943",
"749221", "749963", "750363", "753518", "754512", "754970", "758639",
"760838", "761642", "766250", "770646", "772574", "773054", "775271",
"776762", "778208", "779453", "781378", "781861", "782257", "785763",
"785860", "787011", "790280", "791735", "791903", "792178", "796650",
"796822", "796970", "798621", "802731", "804701", "805606", "807848",
"809142", "810539", "812182", "812321", "814029", "814545", "814774",
"815079", "816572", "824215", "825063", "827763", "829973", "829983",
"830126", "832112", "832666", "833066", "834756", "835270", "835340",
"837413", "837746", "839882", "846097", "847975", "848746", "851745",
"851975", "856622", "858918", "859174", "859182", "859726", "859850",
"862222", "864356", "865028", "869700", "871576", "872256", "873350",
"873597", "875873", "883140", "886308", "886592", "886706", "892144",
"893930", "894959", "896820", "900374", "901373", "902879", "904147",
"905194", "906305", "908049", "908798", "911505", "913314", "915390",
"915833", "919057", "922432", "924120", "925640", "927671", "932006",
"936810", "936916", "938349", "940727", "941945", "942271", "943188",
"944548", "945783", "947164", "948322", "949181", "951414", "952632",
"955090", "956428", "956985", "959916", "960349", "962224", "962980",
"964665", "967160", "967588", "969929", "972543", "972893", "977734",
"978083", "978981", "980427", "980782", "981541", "981850", "982220",
"983781", "985193", "986366", "988934", "989056", "991218", "991914",
"995411", "995630", "995873", "995936", "996309"), class = "factor"),
aid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("mem",
"noMem"), class = "factor"), gentest = structure(c(1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 2L), .Label = c("first", "second"), class = "factor"),
rocks = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L,
6L, 6L, 7L, 7L, 8L, 8L, 8L, 8L, 1L, 1L), .Label = c("R1",
"R2", "R3", "R4", "R5", "R6", "R7", "R8"), class = "factor"),
rating = c(7L, 5L, 2L, 7L, 4L, 2L, 6L, 3L, 3L, 2L, 3L, 3L,
2L, 1L, 3L, 6L, 3L, 2L, 2L, 4L), condition = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("baseline", "category", "property"
), class = "factor"), order = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("after", "before", "none"), class = "factor")), row.names = c(NA,
-20L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
turkcode = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("100879",
"104655", "108505", "110324", "110600", "112445", "114083",
"115814", "116573", "117411", "117817", "118651", "119324",
"121548", "121883", "121918", "123275", "123718", "125491",
"127450", "127825", "128062", "129061", "131404", "135358",
"135594", "135671", "135945", "137951", "138675", "139469",
"140924", "145730", "147222", "148533", "150851", "153455",
"158882", "164468", "166907", "169260", "171463", "172398",
"175565", "177108", "179000", "180270", "183953", "185574",
"185880", "185948", "186371", "187787", "189220", "190014",
"192550", "193904", "195308", "196755", "197493", "198368",
"200155", "200297", "201915", "214519", "215994", "217903",
"218771", "219302", "220434", "222740", "223223", "224721",
"225118", "225223", "229856", "229874", "231301", "232576",
"233842", "234215", "237581", "239567", "240609", "241098",
"241423", "242108", "244633", "246055", "251597", "252929",
"255252", "256652", "259936", "274962", "277053", "279422",
"280317", "282602", "283750", "285737", "286259", "287544",
"288507", "290503", "291401", "291835", "292160", "294117",
"297863", "298061", "299347", "299499", "301399", "304875",
"305231", "306312", "307410", "308979", "311157", "311524",
"311630", "318956", "318988", "319995", "321405", "324288",
"327086", "327559", "328345", "328401", "330318", "330909",
"332723", "334115", "334517", "335811", "335831", "337145",
"338323", "338542", "338575", "340083", "341182", "343612",
"343947", "344554", "346476", "349874", "350117", "350433",
"350972", "351187", "355311", "356717", "359366", "360048",
"360058", "361191", "361971", "362827", "363543", "367244",
"374254", "374965", "376278", "377622", "382139", "382916",
"384586", "385229", "386782", "388951", "389029", "390299",
"390662", "396335", "396732", "398076", "398573", "399276",
"399587", "403388", "406073", "406160", "411977", "412935",
"417350", "420060", "421393", "422944", "424462", "427143",
"429291", "430758", "431629", "431638", "431935", "432218",
"433788", "434291", "436681", "437087", "439385", "439499",
"440477", "440834", "441253", "441876", "443826", "444080",
"447597", "452643", "454649", "457055", "457946", "463512",
"464079", "464123", "467897", "468650", "470211", "471115",
"471512", "475493", "476937", "479198", "482871", "484066",
"484070", "485462", "486402", "491701", "491835", "499644",
"501833", "502335", "502373", "504800", "507439", "507946",
"507987", "509066", "513078", "515519", "517017", "517988",
"519144", "519210", "519858", "522847", "523683", "525315",
"528577", "532463", "532630", "533028", "539033", "539852",
"540690", "546773", "546916", "549652", "551599", "554198",
"556066", "559920", "560804", "560857", "562080", "562420",
"563841", "565668", "565776", "566509", "569039", "572553",
"575364", "576421", "576694", "576877", "577120", "577155",
"577534", "577605", "578463", "578820", "578995", "580213",
"581893", "582433", "582905", "583887", "584569", "585314",
"585566", "587393", "589144", "592284", "594463", "596863",
"601837", "602632", "604254", "605885", "609296", "609963",
"610062", "612437", "612949", "613161", "614372", "614777",
"615372", "615384", "616927", "618118", "620041", "620336",
"621634", "622289", "624098", "626163", "626612", "627019",
"627856", "630003", "630255", "634018", "634478", "635801",
"638606", "640012", "641078", "641366", "641436", "641821",
"642076", "642446", "643329", "643942", "644015", "646792",
"647254", "647700", "649516", "650792", "650810", "651229",
"652387", "652671", "654778", "657964", "658894", "660500",
"660607", "664469", "666754", "666796", "668996", "669712",
"671682", "673516", "675712", "677835", "678008", "679262",
"680295", "686455", "690471", "691175", "692489", "694023",
"696001", "698716", "700133", "700641", "707812", "707953",
"708010", "708881", "713657", "715255", "715386", "716764",
"718936", "719956", "725348", "727753", "728436", "729588",
"730513", "731928", "732013", "732438", "733366", "733559",
"734672", "735174", "735675", "737044", "737127", "741264",
"745262", "748173", "748414", "748943", "749221", "749963",
"750363", "753518", "754512", "754970", "758639", "760838",
"761642", "766250", "770646", "772574", "773054", "775271",
"776762", "778208", "779453", "781378", "781861", "782257",
"785763", "785860", "787011", "790280", "791735", "791903",
"792178", "796650", "796822", "796970", "798621", "802731",
"804701", "805606", "807848", "809142", "810539", "812182",
"812321", "814029", "814545", "814774", "815079", "816572",
"824215", "825063", "827763", "829973", "829983", "830126",
"832112", "832666", "833066", "834756", "835270", "835340",
"837413", "837746", "839882", "846097", "847975", "848746",
"851745", "851975", "856622", "858918", "859174", "859182",
"859726", "859850", "862222", "864356", "865028", "869700",
"871576", "872256", "873350", "873597", "875873", "883140",
"886308", "886592", "886706", "892144", "893930", "894959",
"896820", "900374", "901373", "902879", "904147", "905194",
"906305", "908049", "908798", "911505", "913314", "915390",
"915833", "919057", "922432", "924120", "925640", "927671",
"932006", "936810", "936916", "938349", "940727", "941945",
"942271", "943188", "944548", "945783", "947164", "948322",
"949181", "951414", "952632", "955090", "956428", "956985",
"959916", "960349", "962224", "962980", "964665", "967160",
"967588", "969929", "972543", "972893", "977734", "978083",
"978981", "980427", "980782", "981541", "981850", "982220",
"983781", "985193", "986366", "988934", "989056", "991218",
"991914", "995411", "995630", "995873", "995936", "996309"
), class = "factor"), rocks = structure(c(1L, 1L, 2L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 1L, 1L), .Label = c("R1",
"R2", "R3", "R4", "R5", "R6", "R7", "R8"), class = "factor"),
gentest = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("first",
"second"), class = "factor"), .rows = list(1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15:16, 17:18,
19L, 20L)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE))
Does anyone know how I can modify the second set of ratings for rock #8 so that I can pivot the data wider or even exclude this data from the dataset altogether?
EDIT:
Here is an example of how I'd like the output to look
id <- rep("100879", times = 6)
aid <- rep("mem", times = 6)
test <- rep(c("first", "second"), times = 3)
order <- rep("after", times = 6)
condition <- rep ("cat", times = 6)
R1 <- sample(0:9, 6, replace=T)
R2 <- sample(0:9, 6, replace=T)
R3 <- sample(0:9, 6, replace=T)
R4 <- sample(0:9, 6, replace=T)
R5 <- sample(0:9, 6, replace=T)
R6 <- sample(0:9, 6, replace=T)
R7 <- sample(0:9, 6, replace=T)
R8 <- sample(0:9, 6, replace=T)
df <- cbind(id, aid, test, order, condition, R1, R2, R3, R4, R5, R6, R7, R8)
a data.table suggestion
library( data.table )
#set data as data.table
setDT( mydata )
#create rowid by group
mydata[, row_id := rowidv( mydata, cols = c("turkcode", "aid", "gentest", "condition", "order", "rocks") ) ]
#create new rocks-column to group on
mydata[, rocks2 := paste0( rocks, ifelse( row_id == 1, "", paste0("_",row_id ) ) ) ]
#now cast to wide
dcast( mydata, turkcode + aid + gentest + condition + order ~ rocks2, value.var = "rating" )
# turkcode aid gentest condition order R1 R2 R3 R4 R5 R6 R7 R8 R8_2
# 1: 100879 mem first category after 7 2 4 6 3 3 2 3 6
# 2: 100879 mem second category after 5 7 2 3 2 3 1 3 2
# 3: 104655 mem first category after 2 NA NA NA NA NA NA NA NA
# 4: 104655 mem second category after 4 NA NA NA NA NA NA NA NA
Another option using pivot_wider and separate
library(dplyr)
library(tidyr)
#short version, but you will end up with R1-R8 in list foramt
df %>%
pivot_wider(id_cols = c("turkcode", "aid", "gentest", "condition", "order"),
names_from = "rocks", values_from = "rating", values_fn = list(rating = list))
#clean version
df %>%
#id_cols: A set of columns that uniquely identifies each observation.
#Defaults to all columns in data except for the columns specified in names_from and values_from.
pivot_wider(id_cols = c("turkcode", "aid", "gentest", "condition", "order"),
names_from = "rocks",
values_from = "rating",
values_fn = list(rating = ~paste(., collapse = ","))
#values_fn = list(rating = mean)
#,values_fill = list(rating=0)
) %>%
separate(R8, into = c('R8','R8_1'))
# A tibble: 4 x 14
# Groups: turkcode, gentest [1,118]
turkcode aid gentest condition order R1 R2 R3 R4 R5 R6 R7 R8 R8_1
<fct> <fct> <fct> <fct> <fct> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 100879 mem first category after 7 2 4 6 3 3 2 3 6
2 100879 mem second category after 5 7 2 3 2 3 1 3 2
3 104655 mem first category after 2 NA NA NA NA NA NA NA NA
4 104655 mem second category after 4 NA NA NA NA NA NA NA NA

Return one out of multiple rows with partially matching entries

I have a dataset with the name of proteome. It has 14 columns and thousands of rows.
dput(Proteome)
structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L,
3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L,
5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L,
5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY",
"TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L,
2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name",
"X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
), class = "data.frame", row.names = c(NA, -6L))
The column of interest in this dataset is "Sequence". In row 2 of this column, first two letters of row 1 are missing; in row 3, last three letters of row 1 are missing; in row 4, first seven and last three letters of row 1 are missing.
Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.
I want R to return only one of the four rows, ideally row 1, and remove the rest. The way R can do it is by first finding all rows with a matching string of letters and then eliminating such rows while keeping only one. For example, in the above data set, GCNFHAESTR match in all four rows, so I want R to return me only one row, ideally the top one. But I don't know how to do this.
To further clarify, "Sequence" has hundreds of rows with partially matching entries but the matching entries in those rows are different from the one shown in the example above. For example, it is possible that row no. 35 and 39 have the following entries (Row 35: GNYTCAGCWPFK, and Row 36: YTCAGCWPFK). As matching entries in these rows are totally different than the ones in the example above, I can not declare the string beforehand. So, I want to come up with a mechanism that allows me to detect all those rows which have a partially matching entries and then keep only one of them, while delete others.
I look forward to hearing from the experts.
Thanks!
If I understood correctly, you just need to subset your data according to the presence of the string you want. Use grepl for that.
aa <- structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L, 3L),
.Label = c("HCTF", "IFT", "ROSF"),
class = "factor"),
X..Proteins = c(5L, 5L, 5L, 5L, 3L, 7L),
X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L),
Previous.5.amino.acids = structure(c(4L, 5L, 4L, 2L, 3L, 1L),
.Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY", "TMYFC"),
class = "factor"),
Sequence = structure(c(5L, 1L, 4L, 2L, 3L, 6L),
.Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"),
class = "factor")),
.Names = c("Protein.name", "X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"),
class = "data.frame", row.names = c(NA, -6L))
It is good for you to declare the string beforehand
myStrToDetect <-'GCNFHAESTR'
#the following line filters the data set into those where "Sequence" has the pattern you provided (4 rows)
matching_df <- aa[grepl(myStrToDetect , aa$Sequence),]
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR
2 HCTF 5 1 TMYFC FCLKPGCNFHAESTRGYR
3 HCTF 5 6 NCTMY GHFCLKPGCNFHAESTR
4 HCTF 5 2 FCLKP GCNFHAESTR
# This next command chooses only the first line, if there are multiple occurrences
head(matching_df, 1)
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR

R spread vs gather in tidyr

I have a dataframe in the following form:
person currentTest beforeValue afterValue
1 1 A 1.284297055 2.671763513
2 2 A -0.618359548 -2.354926905
3 3 A 0.039457430 -0.091709968
4 4 A -0.448608324 -0.362851832
5 5 A -0.961777124 -1.416284339
6 6 A 0.702471895 2.052181444
7 7 A -0.455222045 -2.125684279
8 8 A -1.231549132 -2.777425148
9 9 A -0.797234990 -0.558306183
10 10 A -0.709734963 -1.244159550
11 1 B -0.472799377 -0.869472343
12 2 B 0.059720737 1.444855389
13 3 B 0.924201532 2.731049485
14 4 B 0.658884183 1.017542475
15 5 B -1.989807256 -4.712671740
16 6 B 0.660241305 1.971232718
17 7 B 0.089636952 -0.564457911
18 8 B -0.828399941 0.507659171
19 9 B -0.838074237 -0.316996942
20 10 B -1.659197101 -3.317623686
...
What I'd like is to get a data frame of:
person A_Before A_After B_Before, B_After, ...
1 1.284297055 2.671763513 -0.472799377 -0.869472343
2 -0.618359548 -2.354926905 0.059720737 1.444855389
...
I've tried gather and spread but that's not quite what I need as there's the creation of new columns. Any suggestions?
The dput version for easy access is below:
resultsData <- dput(resultsData)
structure(list(person = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L), currentTest = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("A", "B", "C",
"D", "E", "F"), class = "factor"), beforeValue = c(1.28429705541541,
-0.618359548370402, 0.039457429902531, -0.448608324038257, -0.961777123997687,
0.702471895259405, -0.455222044740939, -1.23154913153736, -0.797234989892673,
-0.709734963076803, -0.47279937661921, 0.0597207367403981, 0.924201531911827,
0.658884182599422, -1.98980725637449, 0.660241304554785, 0.0896369516528346,
-0.828399941497236, -0.838074236572976, -1.65919710134782, 0.577469369909437,
1.92748171699512, -0.245593641496638, 0.126104785456265, -0.559338325961641,
1.29802115505785, 0.719406692531958, 0.969414499181256, -0.814697072724845,
0.86465983690719, -0.709539159817187, 1.02775240926492, -0.50490096148732,
0.40769259465753, -0.868531009656408, 0.949518511358715, 2.32458579520932,
-0.257578702370506, -0.789761851618986, 0.0979274657020477, -0.00803566278013502,
1.42984177159549, 1.45485678109231, -0.956556613290905, 0.443323691839299,
-0.261951072972966, -1.30990441429799, 0.0921741874883992, -1.02612779569131,
0.81550719514697, -0.403037731404182, -0.384422139459082, 0.417074857491798,
-1.37128032791855, -0.0796160137501127, 1.35302483988882, -0.752751140138746,
0.812453275384099, -1.32443072805549, -1.66986584340583), afterValue = c(2.67176351335094,
-2.35492690509713, -0.0917099675669388, -0.362851831626841, -1.4162843393352,
2.05218144382074, -2.12568427901904, -2.77742514848958, -0.558306182843248,
-1.24415954975022, -0.869472343362331, 1.44485538931333, 2.73104948477609,
1.01754247530805, -4.71267174035743, 1.9712327179732, -0.564457911016569,
0.507659170771878, -0.31699694238194, -3.31762368638082, 1.09068172988414,
4.37537723545199, -0.116850493406969, 1.9533832597394, -1.69003563933244,
2.62250581307257, -0.00837379068728961, 1.84192937988371, -0.675899868505659,
2.08506660046288, -0.583526785879512, 0.699298693972492, -1.26172199141024,
1.23589313451783, -1.56008919968504, 0.436686458587792, 0.11699090169902,
-1.07206510594109, 1.21204947218164, -0.812406581646911, 0.50373332256566,
-0.084945367568491, -0.236015748624917, -0.479606239480476, -0.596799139055039,
-0.562575023441403, -0.339935276865152, -0.213813544612318, -0.265296303857373,
-1.12545083569158, 0.0105156062602101, 0.635695183644557, 0.767433440961415,
0.16648012185356, 0.544633089427927, -0.904001384160196, -0.429299134808951,
0.764224744168297, -0.166062348771635, -0.101892580202475)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -60L), .Names = c("person",
"currentTest", "beforeValue", "afterValue"))
We can use dcast from reshape2
library(reshape2)
meltdf <- melt(resultsData, id.vars=1:2)
dcast(meltdf, person ~ currentTest + variable)
> dcast(meltdf, person ~ currentTest + variable)
person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue E_beforeValue
1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268 -0.008035663
2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987 1.429841772
3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220 1.454856781
4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931 -0.956556613
5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892 0.443323692
6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865 -0.261951073
7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909 -1.309904414
8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651 0.092174187
9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495 -1.026127796
10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066 0.815507195
E_afterValue F_beforeValue F_afterValue
1 0.50373332 -0.40303773 0.01051561
2 -0.08494537 -0.38442214 0.63569518
3 -0.23601575 0.41707486 0.76743344
4 -0.47960624 -1.37128033 0.16648012
5 -0.59679914 -0.07961601 0.54463309
6 -0.56257502 1.35302484 -0.90400138
7 -0.33993528 -0.75275114 -0.42929913
8 -0.21381354 0.81245328 0.76422474
9 -0.26529630 -1.32443073 -0.16606235
10 -1.12545084 -1.66986584 -0.10189258
You can use a combined gather + spread approach; Gather the *Values columns and combine with currentTest to form the new header, then spread to wide format:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value)
# A tibble: 10 x 13
# person A_afterValue A_beforeValue B_afterValue B_beforeValue C_afterValue C_beforeValue
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.67176351 1.28429706 -0.8694723 -0.47279938 1.090681730 0.5774694
# 2 2 -2.35492691 -0.61835955 1.4448554 0.05972074 4.375377235 1.9274817
# 3 3 -0.09170997 0.03945743 2.7310495 0.92420153 -0.116850493 -0.2455936
# 4 4 -0.36285183 -0.44860832 1.0175425 0.65888418 1.953383260 0.1261048
# 5 5 -1.41628434 -0.96177712 -4.7126717 -1.98980726 -1.690035639 -0.5593383
# 6 6 2.05218144 0.70247190 1.9712327 0.66024130 2.622505813 1.2980212
# 7 7 -2.12568428 -0.45522204 -0.5644579 0.08963695 -0.008373791 0.7194067
# 8 8 -2.77742515 -1.23154913 0.5076592 -0.82839994 1.841929380 0.9694145
# 9 9 -0.55830618 -0.79723499 -0.3169969 -0.83807424 -0.675899869 -0.8146971
#10 10 -1.24415955 -0.70973496 -3.3176237 -1.65919710 2.085066600 0.8646598
# ... with 6 more variables: D_afterValue <dbl>, D_beforeValue <dbl>, E_afterValue <dbl>,
# E_beforeValue <dbl>, F_afterValue <dbl>, F_beforeValue <dbl>
If you need to rename the columns:
resultsData %>%
gather(key, value, -person, -currentTest) %>%
unite(header, c('currentTest', 'key'), sep = "_") %>%
spread(header, value) %>%
rename_at(vars(matches("Value$")), funs(gsub("Value$", "", .)))
We could do this in a single line using recast
reshape2::recast(resultsData, person ~currentTest + variable, id.var = 1:2)
#person A_beforeValue A_afterValue B_beforeValue B_afterValue C_beforeValue C_afterValue D_beforeValue D_afterValue
#1 1 1.28429706 2.67176351 -0.47279938 -0.8694723 0.5774694 1.090681730 -0.70953916 -0.5835268
#2 2 -0.61835955 -2.35492691 0.05972074 1.4448554 1.9274817 4.375377235 1.02775241 0.6992987
#3 3 0.03945743 -0.09170997 0.92420153 2.7310495 -0.2455936 -0.116850493 -0.50490096 -1.2617220
#4 4 -0.44860832 -0.36285183 0.65888418 1.0175425 0.1261048 1.953383260 0.40769259 1.2358931
#5 5 -0.96177712 -1.41628434 -1.98980726 -4.7126717 -0.5593383 -1.690035639 -0.86853101 -1.5600892
#6 6 0.70247190 2.05218144 0.66024130 1.9712327 1.2980212 2.622505813 0.94951851 0.4366865
#7 7 -0.45522204 -2.12568428 0.08963695 -0.5644579 0.7194067 -0.008373791 2.32458580 0.1169909
#8 8 -1.23154913 -2.77742515 -0.82839994 0.5076592 0.9694145 1.841929380 -0.25757870 -1.0720651
#9 9 -0.79723499 -0.55830618 -0.83807424 -0.3169969 -0.8146971 -0.675899869 -0.78976185 1.2120495
#10 10 -0.70973496 -1.24415955 -1.65919710 -3.3176237 0.8646598 2.085066600 0.09792747 -0.8124066
# E_beforeValue E_afterValue F_beforeValue F_afterValue
#1 -0.008035663 0.50373332 -0.40303773 0.01051561
#2 1.429841772 -0.08494537 -0.38442214 0.63569518
#3 1.454856781 -0.23601575 0.41707486 0.76743344
#4 -0.956556613 -0.47960624 -1.37128033 0.16648012
#5 0.443323692 -0.59679914 -0.07961601 0.54463309
#6 -0.261951073 -0.56257502 1.35302484 -0.90400138
#7 -1.309904414 -0.33993528 -0.75275114 -0.42929913
#8 0.092174187 -0.21381354 0.81245328 0.76422474
#9 -1.026127796 -0.26529630 -1.32443073 -0.16606235
#10 0.815507195 -1.12545084 -1.66986584 -0.10189258

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueNamedoes not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repxampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"). You could edit the repexampleEdit line based on your needs.
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.

working with months in zoo

I'd like to augment a zoo object with a variable which I could use to test for month changes. I'm sure there are more general ways to do this. Suggestions there would be great, but I'd like to understand why this simple approach fails. I'd feel better if I understood what I'm missing here ;-)
e.g. for a zoo object
library(zoo)
tz <- structure(c(7L, 7L, 1L, 6L, 0L, 9L, 0L, 1L, 6L, 0L, 3L, 3L, 5L,
0L, 8L, 2L, 0L, 3L, 2L, 5L, 2L, 3L, 4L, 7L, 8L, 9L, 0L, 1L, 4L,
5L, 6L, 7L, 8L, 2L, 3L, 4L, 5L, 8L, 9L, 0L), .Dim = c(20L, 2L
), .Dimnames = list(NULL, c("x", "y")), index = structure(c(13880,
13881, 13913, 13916, 13946, 13947, 13948, 13980, 13983, 13984,
13985, 14016, 14048, 14082, 14083, 14115, 14147, 14180, 14212,
14243), class = "Date"), class = "zoo")
Add a year/month variable using as.yearmon() seems easy enough. If I were in a data frame this would yield a fine character variable, but in zoo tragedy ensues if you forget to wrap in as.numeric()
tz$yrmo <- as.numeric(as.yearmon(index(tstz)))
> head(tz)
x y yrmo
2008-01-02 7 2 2008.000
2008-01-03 7 3 2008.000
2008-02-04 1 4 2008.083
2008-02-07 6 7 2008.083
2008-03-08 0 8 2008.167
2008-03-09 9 9 2008.167
This looks great and I can compare data elements successfully
(tz$x[6] != tz$y[6])
2008-03-09
FALSE
but why do I get this result when I compare the year/month variable?
> (tz$yrmo[2] != tz$yrmo[1])
Data:
logical(0)
Index:
character(0)
and why does testing the yearmon or data items with identical() fail in this way? (both should be true)
> identical(tz$yrmo[2] , tz$yrmo[1])
[1] FALSE
> identical(tz$x[2] , tz$x[1])
[1] FALSE
Am I just playing with fire in using yearmon() which creates an index class in zoo?
Should I switch to something like Dirk Eddelbuettel's 'turning a date into a monthnumber'? Number of months between two dates
Q1: The clue in the output having a Data and a Index section is that these are zoo objects. So they have Index attributes that are compared as well and they are not equal. If you wanted to compare the values then you could access the coredata():
> (coredata(tz$yrmo[2]) != coredata(tz$yrmo[1]))
[1] FALSE
> coredata(tz$yrmo[2])
[1] 2008
> coredata(tz$yrmo[1])
[1] 2008
Q2: identical checks more than just the numeric values. It also determines equality of all attributes.
> attributes(tz$yrmo[2])
$index
[1] "2008-01-03"
$class
[1] "zoo"

Resources