I have the following ID data and am trying to put it back into its correct form.
The first 20 observations of the "incorrect ID" look like:
[1] 11820096867 11820053047 13410057602 13410015341 14257205715 28382012393 13410001306 11820000771 11820000784 11820000884 11820011030
[12] 15230002545 13410015602 17336011108 11820000769 11820096867 11820053030 13410050602 11820053030 14257205715
Each ID can be split into 4 sections: S, G, V and I.
I want to add back the missing leading zeros and separate the data into 4 columns.
S = 2 digits long
G = 1 digit long
V = 5 digits
I = 5 digits
I have been working backwards through these "incorrect IDs". For example, the observation 11820000771 would be split so that the last 5 digits (minus the leading zero) are I, the next 5 digits (minus the leading zero) are V, and so on:
Example 1:
11820000771 would be:
I = 0771
V = 82000
G = 1
S = 1
Example 2:
14257205715 would be:
I = 5715
V = 25720
G = 4
S = 1
Example 3:
13410015602 would be:
I = 15602
V = 4100
G = 3
S = 1
Example 4:
10943900008 would be:
I = 0008
V = 94390
G = 0
S = 1
In the documentation it states that the leading zeros are not shown for the "incorrect ID" data and have been removed.
In a second "correct" data frame this is what the S, G, V and I look like:
        S G     V     I
 [91,]  0 1 18200 97341
 [92,]  0 1 71990 15340
 [93,]  0 1 18200 87418
 [94,]  6 1 18200 38602
 [95,] 27 1 34100  1640
 [96,]  0 1 19699 30069
 [97,]  0 2 84694 59574
 [98,]  0 1 71990  1640
 [99,]  0 1 18200   771
[100,]  0 1 18200  1640
So the first objective is to split the "incorrect IDs" into the correct S, G, V and I, similar to the above.
The second objective is to create a new ID key which looks like the following:
[1] "00-01-73360-50661" "00-01-87692-30040" "00-01-34100-57509" "00-01-18200-53047" "00-03-70310-30703" "00-01-82000-72385"
[7] "00-01-68213-09410" "00-01-18200-00771" "00-01-34100-50340" "00-03-73360-97341"
Where the S, G, V and I are combined and split by a - and leading zeros are added back to the data.
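Both objectives can be sketched in R as follows. This is only a sketch under an assumption suggested by the examples in EDIT 3 further down: only the leading zeros of S and G were dropped, so I is always the last 5 digits, V the 5 digits before it, G a single non-zero digit, and S whatever remains at the front. The helper name `make_key` is hypothetical.

```r
# Hypothetical helper: rebuild "SS-GG-VVVVV-IIIII" from a collapsed numeric ID.
# Assumes G is a single non-zero digit and that V and I kept all five digits.
make_key <- function(id) {
  x <- sprintf("%.0f", id)        # digit string without scientific notation
  n <- nchar(x)
  I <- substr(x, n - 4, n)        # last 5 digits
  V <- substr(x, n - 9, n - 5)    # the 5 digits before I
  G <- substr(x, n - 10, n - 10)  # the single digit before V
  S <- substr(x, 1, n - 11)       # 0, 1 or 2 remaining digits
  S[S == ""] <- "0"               # S was "00" and vanished entirely
  data.frame(
    S = as.integer(S), G = as.integer(G), V = V, I = I,
    key = sprintf("%02d-%02d-%s-%s", as.integer(S), as.integer(G), V, I),
    stringsAsFactors = FALSE
  )
}

res <- make_key(c(11820000771, 613598510062, 2710000106005))
res$key
# "00-01-18200-00771" "06-01-35985-10062" "27-01-00001-06005"
```

V and I are kept as character so their zeros survive; converting them with `as.integer()` would give the zero-stripped columns shown in the "correct" data frame.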
Overview:
I am trying to add back leading zeros to the segments of an ID variable, which is split into 4 sections of fixed maximum length. If a segment begins with a 0, that zero has been removed. If it begins with a digit greater than 0, no leading zero needs to be added.
Hopefully I am clear; if any part is not, let me know and I will clarify.
DATA:
ID <- c(11820096867, 11820053047, 13410057602, 13410015341, 14257205715,
28382012393, 13410001306, 11820000771, 11820000784, 11820000884,
11820011030, 15230002545, 13410015602, 17336011108, 11820000769,
11820096867, 11820053030, 13410050602, 11820053030, 14257205715,
11820011168, 27336097343, 13410015509, 12556924173, 13410001222,
18769227102, 18769210012, 13410048574, 13410057602, 28066095605,
17199030030, 11820011047, 13410057509, 13410017256, 13410050306,
18200072518, 13410001306, 11820053168, 11820053168, 11820096867,
11820043047, 18200072385, 11820043218, 13410029602, 13410030341,
17199030030, 17199000048, 18066095615, 15230002540, 13410015341,
17199030030, 13410057306, 11820011168, 13410059505, 17336011214,
11820096867, 11820000884, 13410003602, 31820000042, 13410015341,
11820000891, 13410000355, 11820096867, 13410031306, 17289010016,
11820053218, 11820053030, 11820000016, 11820011030, 17336011214,
13410015340, 2710000106005, 11820061030, 17089701331, 23410017306,
11820000016, 27199077005, 13410003256, 13410057341, 17199030030,
15230000435, 11820053218, 13410015341, 18769241103, 15230000434,
11820043218, 11820000842, 13410057340, 11820011047, 13410001340,
33410000354, 12210000170, 11820041218, 27336097343, 13410046874,
13410015340, 31820000697, 13410015306, 13410000007, 613598510062,
15230000022, 618516510505, 11820053218, 13410001602, 15146051460,
15230000022, 17031000024, 11820000884, 14182700012, 11820000784,
2710000106005, 18769233103, 17199010074, 17199030030, 18200072385,
11820011168, 11820000769, 16821309117, 11820053168, 13410050505,
11820043218, 11820053030, 13410017509, 17231163001, 15230002540,
33410000354, 18769210014, 15230002545, 27031030701, 15230000002,
18769240020, 12210000170, 23410017306, 13410050340, 17199000048,
15230000434, 11820096867, 15230002903, 13410057340, 28066095605,
11820079047, 17199000048, 11820011030, 17199000048, 27336097343,
13410057341, 13410000555, 13410050574, 18769230050, 11820096867,
11820000884, 18769210014, 21820086167, 11820053168, 11820041218,
13410015306, 715643501208, 11820002990, 613598512001, 16821309117,
13410000355, 33410000354, 13410057602, 11820000126, 17089701331,
11820027168, 17336035201, 27336097343, 13410057340, 11820000769,
11820053218, 11820011168, 16206705142, 11820000884, 11820053168,
11820011168, 18066095615, 15230000017, 11820003982, 11820043218,
17199030030, 11820000466, 27336097343, 11820096867, 11820011030,
15230002966, 611969902000, 11820011030, 17289010011, 711820053025,
23410017306, 11820096867, 12210000170, 13410057341, 18382072553,
15230000434, 13410057306, 13410048574, 12556971416, 618516510505,
13410014574, 13410017340, 27336082341, 13410001306, 18200072385,
13410015341, 11820079047, 15230000435, 17336035201, 13410015341,
13410051574, 17289010011, 11820096867, 13410050574, 13410001306,
15230000434, 21820000801, 13410001602, 17089701331, 23410017306,
13410050306, 11820053030, 11820000771, 11820000016, 11820000884,
18200072385, 15230002903, 17143945712, 11820004989, 16206705155,
11820011030, 13410050602, 16821309117, 18769233103, 11820011030,
13410003602, 17199030069, 23410017306, 17336013661, 15230002540,
13410050340, 15230002903, 18769283102, 13410057602, 17336011108,
27336097343, 17199070002, 13410057306, 15230000966, 13410072805,
11820000693, 17336035301, 21820000115, 15230000536, 31820000042,
13410057340, 17143932012, 11820053047, 13410017256, 13410001222,
18769241103, 17199030030, 13410015340, 10948700007, 11820086031,
11820043218, 13410031306, 13410057602, 17199030030, 11820003982,
11820011168, 17336011214, 16206705155, 11820053030, 13410057340,
15230002545, 613598510062, 13410057340, 2710000106005, 13410057306,
11820004990, 18200072518, 17336013343, 18066095615, 11820053218,
13410048574, 13410015306, 11820096867, 13410015340, 18469400001,
13410048574, 11820053218, 13410001340, 11820053168, 18769233103,
13410050306, 13410010602, 15230002545, 18066095615, 11820000106,
11820002992, 11820000693, 17199000048, 13410057306, 11820000771,
13410015341, 17031000009, 13410078574, 27336097343, 21820000647,
13410015341, 13410057256, 31820000697, 15230000017, 13410030341,
13410000175, 16821309117, 11820000771, 21820086167, 613598510062,
13410048505, 13410001306, 13410007306, 13410001505, 11820079047,
18806705542, 37336097341, 12210007500, 13410072805, 18066095615,
11820011047, 13410078574, 31820000697, 18417341130, 16206705155,
11820053168, 13410015341, 13410057306, 13410017256, 18382023473,
15230000435, 613598512001, 14182700712, 13410057340, 13410057509,
11820053168, 11820011218, 15230000434, 15230002966, 13410001602,
17199000027, 13410057306, 13410050340, 13410057341, 15230000434,
13410057602, 11820053047, 15146051460, 27199077008, 13410057340,
13410001306, 23410000005, 11820053218, 11820003982, 23410068505,
11820000833, 17031037037, 11820000466, 16206705155, 11820043218,
11820011030, 27336082341, 11820003982, 23410017306, 11820043218,
17336013302, 13410057341, 17336035201, 17199030005, 11820000884,
18200072385, 13410017505, 11820096418, 15230000540, 11820015168,
715643501201, 16821302112, 613598512001, 11820053168, 11820053047,
13410010505, 13410000554, 21820086167, 15230000416, 13410001340,
11820053030, 13410001340, 11820096867, 23410003505, 11820053218,
23410000005, 18200072385, 15230002545, 23410000005, 11820096867,
11820001991, 21820086167, 13410001602, 13410015341, 13410057602,
13410000355, 13410007306, 13410057602, 18066095615, 18382012368,
12210001640, 15230000434, 13410057340, 13410015256, 28382012393,
13410050306, 11820053047, 11820000891, 13410000559, 11820000466,
18015761194, 11820096418, 11820000891, 11820096418, 17199030030,
13410057509, 18769241103, 11820096867, 16821309117, 16821309117,
11820079047, 27336097343, 2710000106744, 11820000784, 11820000884,
18066095675, 11820096418, 13410015341, 11820053168, 11820053168,
11820096867, 11820004990, 613598510062, 15230000434, 2710000106005,
15230000434, 11820053047, 613598512001, 31820000042, 11820096379,
15230000435, 11820011030, 11820053030, 12210001640, 13410003306,
18200072385, 18417340130, 11820053168, 13410072805, 11820053218,
11820015168, 13410001509, 13410031306, 17089701325, 17199048004,
11820096867, 13410001509, 18549811113, 18066095937, 17336011341,
11820011025, 11820011030, 11820096418, 18066095935, 11820015168,
18200072385, 13410007341, 17336011348, 13410007306, 13410057602,
13410001341, 18769241102, 13410057340, 13410001602, 17199036400,
17289000016, 11820096867, 16821302117, 13410057306, 13410057306,
11820000833, 14182700712, 11820011030, 11820011030, 15230000440
)
EDIT 2:
Following a suggestion in the comments to remove the leading zeros from the data below:
This is the "correct" data in the correct format. What I am now trying to do is just remove the leading zeros from each section. So 00-01-18200-00987 would be split into 4 columns as before, with the leading zeros removed:
S = 0
G = 1
V = 18200
I = 0987
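A minimal sketch of this split in R, using only base functions. The vector `ids_demo` is a hypothetical three-element stand-in for the full `IDs` vector given below. Note that converting to integer drops every leading zero, so 00987 becomes 987.

```r
# Illustrative subset of the IDs vector; `ids_demo` is a hypothetical name.
ids_demo <- c("00-01-18200-00987", "06-01-85165-10504", "27-01-00001-06502")

# One row per ID, one column per section
parts <- do.call(rbind, strsplit(ids_demo, "-", fixed = TRUE))
colnames(parts) <- c("S", "G", "V", "I")

# Coercing the character matrix to integer removes the leading zeros
storage.mode(parts) <- "integer"
sections <- as.data.frame(parts)
sections
```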
Data:
IDs <- c("00-01-41827-00712", "00-01-52300-01540", "00-01-18200-00987",
"00-01-83820-07131", "00-01-34100-01222", "00-01-34100-50602",
"00-01-52300-00536", "00-01-42572-05715", "00-01-34100-25574",
"00-01-73360-73149", "00-01-34100-51574", "00-01-34100-07602",
"00-01-89961-00420", "00-01-71990-90029", "00-01-34100-31341",
"00-02-34100-30602", "00-01-34100-17536", "00-01-34100-57602",
"00-01-18200-11047", "00-01-34100-00880", "00-01-34100-07602",
"07-01-67084-27455", "00-01-34100-07340", "00-01-80660-95615",
"00-01-34100-50222", "00-01-34100-15509", "00-01-72311-63009",
"00-01-18200-54028", "06-01-19699-02000", "00-01-73360-35201",
"06-01-85165-10504", "06-01-34986-10003", "00-03-70310-30703",
"00-01-18200-53168", "00-01-18200-01991", "00-01-89961-10120",
"00-01-82000-72385", "00-01-18200-00784", "00-01-71990-30030",
"00-01-72890-00011", "00-01-34100-00622", "00-01-18200-15168",
"00-01-52300-00440", "00-01-34100-00355", "00-01-71990-00048",
"00-01-34100-77435", "00-01-80157-11125", "00-01-52300-01301",
"06-01-85165-10505", "00-01-87692-83102", "00-01-34100-50505",
"00-01-34100-00355", "00-01-52300-00440", "00-01-34100-50340",
"00-01-73360-13343", "00-01-80660-95301", "00-01-34100-14505",
"00-01-34100-59574", "00-01-34100-07306", "00-01-18200-53168",
"00-01-34100-15256", "27-01-00001-06502", "00-01-71990-77828",
"00-01-18200-43218", "00-01-73360-13343", "00-01-72311-63001",
"00-01-18200-00987", "00-01-18200-79047", "00-01-18200-00466",
"00-01-82000-72385", "00-01-34100-57602", "00-02-34100-25505",
"00-01-34100-01341", "00-03-73360-97341", "00-01-18200-00987",
"00-01-34100-00488", "00-01-18200-15168", "00-01-34100-01306",
"00-02-18200-29031", "00-01-34100-48602", "00-01-85498-73837",
"00-02-34100-62509", "00-01-34100-00009", "00-02-34100-17306",
"00-01-18200-00106", "00-01-41827-00712", "00-01-71990-70002",
"00-01-82488-12700", "00-01-72890-00030", "00-01-18200-00956",
"00-01-84173-32130", "00-01-52300-00536", "00-01-80660-95625",
"00-01-22100-00157", "00-01-34100-03306", "00-01-18200-00639",
"00-01-18200-15047", "00-01-85498-73837", "00-01-22100-00170",
"00-01-52300-02540", "00-01-52300-02540", "00-01-34100-68574",
"00-01-34100-03509", "00-01-18200-00978", "00-01-71990-10006",
"00-01-52300-02540", "00-01-18200-01991", "00-03-34100-00354",
"00-01-18200-03982", "07-01-18200-53025", "00-01-18200-03982",
"00-01-72890-00016", "00-01-34100-15509", "00-01-84173-10545",
"00-01-34100-03340", "00-01-71990-48004", "00-01-34100-62340",
"00-01-71990-77828", "00-01-34100-00904", "00-01-71990-00047",
"00-01-87692-10012", "00-01-34100-07341", "00-01-18200-79047",
"00-01-85725-00005", "00-01-52300-00540", "00-01-71990-30030",
"00-01-34100-50574", "00-02-73360-82341", "00-01-34100-57306",
"00-01-72311-63011", "00-01-73360-35201", "00-01-34100-50574",
"00-01-71990-10033", "00-01-71990-00048", "00-01-34100-57536",
"00-01-70897-01331", "00-01-52300-00434", "00-01-71990-48016",
"00-01-34100-31602", "00-01-18200-00834", "00-01-34100-31306",
"00-01-18200-11168", "00-01-34100-00252", "00-02-72890-00012",
"00-01-52300-00022", "00-02-34100-17306", "00-01-52300-00017",
"00-01-82488-12356", "00-01-18200-04989", "00-01-34100-01222",
"00-03-34100-00354", "00-01-34100-14505", "00-01-18200-00933",
"00-01-52300-00416", "00-02-18200-29031", "00-01-18200-00865",
"00-01-82488-12910", "00-01-80660-95625", "00-01-41827-00076",
"00-01-18200-27168", "00-01-34100-53505", "00-01-34100-01340",
"00-01-18200-02989", "00-01-34100-62505", "00-01-73360-50202",
"00-01-34100-01256", "00-01-71250-40205", "00-01-34100-15340",
"00-02-18200-29031", "00-01-72311-63012", "00-03-18200-00697",
"00-02-18200-00166", "00-01-34100-00491", "00-01-52300-02966",
"00-01-22100-00171", "00-01-34100-14574", "00-01-49483-18000",
"00-01-71990-09511", "00-01-34100-50222", "00-02-71250-00019",
"00-01-34100-03509", "00-01-18200-53168", "00-01-34100-57306",
"00-01-34100-17505", "00-02-34100-17306", "00-01-87000-50882",
"00-01-34100-50574", "00-01-83820-12360", "00-01-34100-10505",
"00-01-71990-70002", "00-03-70897-01123", "00-01-18200-00833",
"00-01-34100-57256", "00-01-34100-62340", "07-01-19256-00058",
"00-01-71250-40205", "00-01-09487-00007", "00-01-18200-00833",
"00-01-83820-23473", "00-01-34100-00355", "00-01-34100-01256",
"00-01-71439-34806", "00-01-34100-51306", "00-01-34100-50306",
"06-01-33745-13000", "00-01-34100-00904", "00-01-18200-03982",
"00-01-18200-00769", "00-01-52300-00966", "00-01-52300-00022",
"00-01-52300-00540", "00-01-71990-10074", "00-02-18200-00801",
"00-01-71990-30030", "00-01-18200-96867", "00-02-18200-87418",
"00-01-34100-15222", "00-01-34100-15340", "00-01-87692-40020",
"00-01-18200-00126", "00-01-71439-34806", "00-01-34100-15256",
"00-02-18200-00701", "00-02-73360-82301", "00-01-68213-03112",
"00-01-73360-80301", "00-01-34100-46805", "00-01-18200-11025",
"00-01-34100-53505", "00-02-18200-00647", "00-01-18200-00974",
"00-01-62067-05172", "00-01-71990-30069", "00-01-34100-01528",
"00-02-83820-12393", "00-02-18200-87418", "00-01-34100-01509",
"00-01-34100-57602", "00-01-34100-15509", "00-01-34100-03509",
"00-01-34100-01602", "00-01-34100-50222", "00-01-34100-67505",
"00-01-84173-37133", "00-02-34100-25505", "00-01-18200-00834",
"00-01-71990-00028", "00-01-34100-03602", "00-01-22100-00171",
"00-01-18200-00106", "00-01-83741-10012", "00-01-73360-11348",
"00-01-80660-95935", "00-01-18200-86418", "00-01-22100-01640",
"00-01-84173-32130", "00-01-71990-48016", "00-01-62067-05172",
"00-01-18200-00891", "00-01-52300-00022", "00-01-34100-62340",
"00-01-34100-50306", "00-01-34100-17256", "00-01-34100-57306",
"00-01-62067-05172", "00-01-85725-11508", "00-03-18200-00697",
"00-01-34100-01505", "00-01-18200-00466", "00-01-34100-00271",
"00-01-18200-43218", "00-01-70897-01331", "00-01-18200-00974",
"00-03-34100-00304", "00-02-34100-00005", "00-01-80157-11016",
"00-01-34100-57256", "00-01-34100-17505", "06-01-13008-71310",
"00-01-34100-57306", "00-01-34100-00559", "00-01-52300-02540",
"00-01-82054-80441", "00-01-71990-10033", "00-02-73360-82341",
"00-01-83820-12360", "00-02-18200-00166", "00-01-18200-00834",
"00-01-62067-05172", "00-01-52300-02903", "00-02-34100-17306",
"00-01-80660-95937", "00-01-52300-00536", "00-01-34100-77435",
"00-01-70310-37037", "00-01-73360-35201", "00-01-34100-57306",
"00-01-18200-61047", "00-01-62067-38072", "00-01-34100-50574",
"07-01-19256-00054", "00-01-34100-62505", "00-02-83741-00006",
"00-03-70897-01123", "00-01-34100-57341", "00-01-34100-25574",
"00-01-34100-00554", "00-03-18200-00042", "06-01-35985-00016",
"00-01-34100-15340", "00-01-18200-04990", "00-01-73360-50661",
"00-01-52300-00022", "00-01-34100-50340", "00-02-18200-00801",
"00-01-18200-00769", "00-03-34100-00354", "00-01-49483-11200",
"00-01-73360-35301", "00-01-34100-50602", "07-02-39165-00125",
"00-01-71990-10074", "00-01-70897-01331", "00-01-71439-22033",
"00-02-82488-00006", "00-01-18200-00670", "06-01-35985-00016",
"00-01-71990-48016", "00-01-22100-07500", "00-01-34100-17602",
"00-01-73360-11214", "00-01-34100-10602", "00-01-18200-11168",
"00-01-34100-31306", "00-01-18200-00468", "00-02-82488-00006",
"00-01-87692-10012", "00-02-82488-00006", "00-01-18200-79047",
"00-01-87692-30040", "00-01-34100-01509", "00-02-83741-00006",
"27-01-00001-06505", "06-01-85165-10505", "00-01-18200-86418",
"00-01-18200-53168", "00-01-34100-67602", "00-01-80660-95625",
"00-01-71990-00048", "00-01-62067-05155", "00-01-71990-48004",
"00-01-18200-61047", "00-01-18200-00313", "00-02-83820-12393",
"00-01-71990-77828", "00-01-18200-00126", "00-01-71990-30030",
"00-01-34100-01602", "00-01-82488-12345", "00-01-71670-04064",
"00-01-34100-03306", "00-01-18200-00964", "00-01-34100-50505",
"00-01-18200-00974", "06-01-85165-10707", "00-02-18200-29031",
"00-01-68213-03112", "00-01-34100-10505", "00-01-18200-04989",
"00-01-34100-17505", "00-01-72890-00020", "00-01-72311-63011",
"00-01-34100-01222", "00-01-84173-32130", "07-01-60890-95602",
"00-01-70897-18331", "00-01-72890-00020", "00-01-87692-27102",
"06-01-35985-12001", "00-01-73360-35301", "00-01-70897-01331",
"00-01-18200-04990", "00-01-18200-00769", "00-01-18200-04997",
"00-01-70897-01125", "00-01-18200-41218", "00-01-18200-92867",
"00-04-34100-00152", "00-01-18200-53218", "00-01-34100-10505",
"00-01-84694-00001", "00-01-34100-62340", "00-01-52300-00435",
"00-01-34100-25602", "00-01-34100-62340", "00-01-62067-05155",
"00-01-34100-50505", "00-01-18200-79047", "00-01-34100-00555",
"00-01-18200-00466", "07-01-18200-53025", "00-01-71990-00007",
"00-01-34100-07341", "00-01-89961-00120", "06-01-19699-00006",
"00-02-18200-86167", "00-01-71439-22033", "00-01-09487-00007",
"00-01-72311-63009", "00-01-73360-11214", "00-01-42572-05715",
"00-01-34100-50340", "00-01-34100-31341", "00-02-22100-02500",
"00-02-80660-95785", "00-01-71990-70002", "07-01-98373-12603",
"00-01-18200-00865", "00-01-71990-00027", "00-01-85498-73837",
"00-02-71250-00019", "00-01-80660-95615", "00-02-70310-30701",
"00-01-85498-12345", "00-01-18200-86031", "00-01-87692-33103",
"00-01-62067-05155", "00-01-18200-53218", "00-01-87000-50901",
"00-01-71990-48016", "00-01-73360-11214", "00-01-34100-00579",
"00-01-34100-62340", "00-01-87692-10012", "00-01-34100-62340",
"00-01-70310-00012", "00-01-18200-00016", "00-01-80157-61147",
"00-01-18200-04997", "00-01-18200-00784", "00-01-71439-45712",
"00-01-18200-00833", "00-01-71990-77603", "00-01-34100-15340",
"00-01-71990-30030", "00-01-18200-61047", "00-01-34100-30306",
"00-01-34100-15505", "00-03-18200-00697", "00-04-25569-19231",
"00-01-18200-04997", "00-01-34100-15602", "00-01-71990-47712",
"00-01-22100-01640", "00-01-34100-15256", "06-01-85165-10502",
"00-01-71990-30005", "00-02-18200-29031", "00-02-71250-00019",
"06-01-35985-10062", "06-01-19699-00002", "00-01-18200-00468",
"00-01-34100-17505", "00-02-71990-77005", "00-01-34100-80706",
"00-02-18200-00801", "00-01-34100-48602", "00-01-34100-00904",
"00-01-73360-50202", "00-01-34100-30306", "00-01-89961-00120",
"00-01-34100-10602", "00-01-34100-03306", "00-02-72890-00012",
"00-01-62067-05142", "00-01-18200-53168", "00-01-34100-77435",
"00-01-34100-48574", "00-01-72890-00011", "00-01-83820-07531",
"00-01-34100-01222", "07-01-18200-53025", "00-01-62067-04955",
"00-01-18200-79047", "00-03-41827-00046", "00-01-18200-15047",
"06-01-85165-10106", "00-02-18200-87418", "00-02-18200-29031",
"00-01-18200-00773", "00-01-82488-13000", "00-01-73360-13343",
"00-01-62067-38055", "00-01-34100-50222", "00-01-71990-00008",
"00-01-85498-73837", "00-01-34100-00009", "00-01-71990-90029",
"00-01-34100-00009", "00-01-34100-01509")
EDIT 3:
Using the data in EDIT 2: I have the following examples.
Example 1: 00-01-34100-01509, which is one of the IDs in the second edit data, should collapse to 1341001509.
Example 2: 00-01-62067-05155 should collapse to 16206705155
Example 3: 00-01-82488-12356 should collapse to 18248812356
Example 4: 06-01-19699-00002 should collapse to 611969900002
Example 5: 00-01-09439-00008 should collapse to 10943900008
Example 6: 00-01-09439-00008 should collapse to 10943900008
The common theme here is that only the first leading zeros are removed, that is, the leading zeros in S and G.
So what I am now trying to do is gsub the IDs data to remove the -, giving data that looks like 00010943900008 (taking Example 6), and then remove the leading zeros so that it becomes 10943900008. This is much simpler than what I had previously thought.
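One caveat with the plain "remove dashes, then strip leading zeros" approach is that it leaves G's zero in place whenever S is non-zero: 06-01-19699-00002 would become 6011969900002 rather than the 611969900002 of Example 4. The sketch below handles S and G separately. It assumes S is always two digits in the dashed form, and `collapse_id` is a hypothetical helper name.

```r
# Hypothetical helper: collapse "SS-GG-VVVVV-IIIII", dropping only the
# leading zeros of S and G and keeping all digits of V and I.
collapse_id <- function(ids) {
  x <- sub("^([0-9]{2})-0*", "\\1", ids)  # drop first dash and G's leading zeros
  x <- gsub("-", "", x, fixed = TRUE)     # drop the remaining dashes
  sub("^0+", "", x)                       # drop S's leading zeros
}

collapsed <- collapse_id(c("00-01-62067-05155", "00-01-82488-12356",
                           "06-01-19699-00002", "00-01-09439-00008"))
collapsed
# "16206705155" "18248812356" "611969900002" "10943900008"
```

Under this rule 00-01-34100-01509 becomes 13410001509, because the zero inside I is kept as in Examples 2 to 6.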
EDIT 4:
When I run my version, I get the following console output:
> df_panel$COLUPC <- gsub("-","",df_panel$UPC)
> df_panel$COLUPC <- sub("^[0]+", "", df_panel$COLUPC)
> beer_PANEL_GR$COLUPCmatch <- beer_PANEL_GR$COLUPC %in% df_panel$COLUPC
> sum(beer_PANEL_GR$COLUPCmatch == FALSE)
[1] 896
> sum(beer_PANEL_GR$COLUPCmatch == TRUE)
[1] 19119
>
> beer_PANEL_GR$COLUPC <- as.character(beer_PANEL_GR$COLUPC)
> df <- full_join(df_panel, beer_PANEL_GR, by = "COLUPC") #Joining with UPC causes us to lose a lot of observations
> dim(df)
[1] 5293488 40
When I run your version, I get the following console output:
> # remove 0s at the beginning of the string, or preceded by "-"
> df_panel$COLUPC <- gsub("(?<=^|-)0","", df_panel$UPC, perl = TRUE)
>
> # remove dashes
> df_panel$COLUPC <- gsub("-", "", df_panel$COLUPC)
> # remove leading zeros
> df_panel$COLUPC <- gsub("^0+", "", df_panel$COLUPC)
>
> beer_PANEL_GR$COLUPCmatch <- beer_PANEL_GR$COLUPC %in% df_panel$COLUPC
> sum(beer_PANEL_GR$COLUPCmatch == FALSE)
[1] 7382
> sum(beer_PANEL_GR$COLUPCmatch == TRUE)
[1] 12633
>
> df2 <- full_join(df_panel, beer_PANEL_GR, by = "COLUPC")
> dim(df2)
[1] 3564132 40
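To see which keys fail to match, rather than only counting them, the two key columns can be compared directly. The sketch below uses small made-up data frames (`panel_demo` and `gr_demo` are hypothetical stand-ins for `df_panel` and `beer_PANEL_GR`). It also converts the numeric key with `sprintf("%.0f", ...)`, on the assumption that the key is stored as a double: `as.character()` can switch to scientific notation for some round values (for instance `1e10`), which would silently break the match.

```r
# Toy stand-ins for the two data frames; the values are illustrative only.
panel_demo <- data.frame(COLUPC = c("11820000771", "611969900002"),
                         stringsAsFactors = FALSE)
gr_demo <- data.frame(COLUPC = c(11820000771, 611969900002, 15230000000))

# Convert numeric keys to plain digit strings (no scientific notation)
gr_demo$COLUPC <- sprintf("%.0f", gr_demo$COLUPC)

# Keys present on one side only: these are the values that fail to match
only_in_gr    <- setdiff(gr_demo$COLUPC, panel_demo$COLUPC)
only_in_panel <- setdiff(panel_demo$COLUPC, gr_demo$COLUPC)
only_in_gr
```

Inspecting a handful of the unmatched keys usually shows which zero-stripping rule the other data set actually follows.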
Addressing your edit, how about:
library(dplyr)
# remove 0s at the beginning of the string, or preceded by "-"
gsub("(?<=^|-)0","", IDs, perl = TRUE) %>%
# remove dashes
gsub("-", "", .) %>%
# remove leading zeros
gsub("^0+", "", .)
[1] "1418270712" "1523001540" "1182000987" "1838207131" "1341001222"
[6] "13410050602"
Related
I want to remove the rows of exp.normal whose row names end with the substring ".1".
Then I want to retain only the rows of exp.normal whose row names match those of another data frame, exp.kirp, and combine the two data frames column-wise.
My code returns two data frames with different numbers of rows.
exp.normal <- exp.normal[!is.infinite(rowSums(exp.normal)),]
exp.normal <- na.omit(exp.normal)
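# escape the dot so that only a literal ".1" suffix matches (an unescaped "." matches any character)
exp.normal <- exp.normal[!grepl('\\.1$', rownames(exp.normal)),]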
exp.kirp.samp <- exp.kirp[rownames(exp.kirp) %in% rownames(exp.normal),]
exp.norm <- exp.normal[rownames(exp.normal) %in% rownames(exp.kirp.samp),]
Output:
> dim(exp.normal)
[1] 19947 32
> dim(exp.kirp)
[1] 12097 202
Traceback:
Error in cbind(...) : number of rows of matrices must match (see arg 2)
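The error says that the two objects passed to `cbind()` have different numbers of rows. One way to guarantee equal row counts and identical row order is to index both objects by their shared row names. The sketch below uses small made-up matrices (`normal_demo` and `kirp_demo` are hypothetical stand-ins for exp.normal and exp.kirp) and assumes row names are unique within each object.

```r
# Toy matrices standing in for exp.normal and exp.kirp (values are made up).
normal_demo <- matrix(1:8, nrow = 4,
                      dimnames = list(c("A1BG", "A2M", "AAMP.1", "ABCA1"),
                                      c("n1", "n2")))
kirp_demo <- matrix(11:16, nrow = 3,
                    dimnames = list(c("ABCA1", "A1BG", "ZZZ3"),
                                    c("k1", "k2")))

# Drop rows whose name ends in a literal ".1"
normal_demo <- normal_demo[!endsWith(rownames(normal_demo), ".1"), , drop = FALSE]

# Index both objects by the shared row names so the rows line up
common <- intersect(rownames(normal_demo), rownames(kirp_demo))
combined <- cbind(normal_demo[common, , drop = FALSE],
                  kirp_demo[common, , drop = FALSE])
dim(combined)
```

Subsetting by name instead of by a logical `%in%` mask also fixes the row order, which `%in%` alone does not do.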
Example data:
dput(exp.norm)
structure(c(45.7005, 14525.5304, 2691.0051, 3648.1196, 3785.6462,
508.7428, 3386.262, 1189.0624, 375.0458, 1767.0259, 27.3361,
17196.2434, 2821.7784, 3730.9721, 8095.7046, 955.9156, 2899.3971,
1115.2977, 457.7995, 1821.7784, 45.3806, 19112.246, 3016.1901,
4261.1092, 9791.2504, 967.2683, 2105.4082, 795.0396, 419.566,
1941.4399, 50.9688, 14891.6723, 3558.9722, 3323.9259, 4598.989,
451.9671, 1407.3294, 410.278, 435.5518, 1647.4305, 108.6162,
9145.2729, 2705.0238, 2702.0338, 991.7092, 820.278, 3857.5328,
2015.6448, 332.5631, 1643.2139, 57.4382, 16482.7822, 2320.9426,
2881.0338, 5242.4173, 434.1923, 2427.6701, 985.4808, 369.4413,
1952.1095, 17.378, 16061.5305, 2530.1829, 4110.9756, 16517.6829,
1559.0915, 1917.0732, 607.9268, 558.2317, 1896.3415, 20.626,
16527.3632, 2925.5966, 4156.8309, 10196.4714, 635.9169, 2173.8045,
781.4243, 469.7868, 1987.3547, 9.4012, 21390.7875, 2698.0142,
4198.1283, 15663.0906, 1081.0287, 1256.9429, 591.6457, 585.8632,
1796.5457, 70.6982, 36833.3219, 2334.0517, 3118.3636, 3214.0421,
1119.1551, 2489.5293, 1626.5344, 372.9573, 1927.1034, 27.0724,
41553.7462, 2221.6031, 3961.8517, 7251.1787, 1841.8174, 1867.5525,
1033.0047, 366.9096, 2048.0069, 93.6277, 20812.3777, 3167.1196,
3068.6141, 1033.9674, 238.3764, 3262.2283, 927.9891, 368.2065,
1723.5054, 61.5543, 29531.5934, 3283.48, 3599.218, 531.7693,
224.8289, 2014.6628, 1123.1672, 321.6031, 1888.563, 28.7045,
14833.5053, 2878.5984, 3181.6764, 840.2164, 340.3734, 2266.8694,
1306.6222, 430.3965, 1614.8444, 32.3228, 20450.3576, 2638.1234,
3181.1024, 7459.6457, 791.6043, 2367.126, 982.2835, 444.2257,
1665.3543, 9.6095, 16474.3224, 2779.0965, 4393.5681, 14546.3247,
1160.0268, 1175.3446, 641.2711, 529.4793, 1929.9387, 33.5985,
13559.3697, 2940.1548, 3312.0163, 741.4452, 493.1579, 4346.1672,
1557.4905, 370.7226, 1643.4799, 26.9466, 26391.888, 2614.1732,
3922.3097, 11886.2642, 1467.3491, 2908.8364, 740.1575, 431.4961,
1638.1452, 51.9065, 20112.0953, 2840.2768, 3421.635, 1817.6237,
567.9274, 2244.2088, 867.8801, 385.6534, 1525.9852, 91.0713,
21614.0244, 2868.2575, 2628.6353, 408.8988, 486.2043, 1470.0472,
2422.3217, 404.9217, 1386.5275, 28.0177, 16412.3814, 2963.8349,
4044.6946, 7816.1037, 625.0426, 3087.001, 940.9758, 393.0399,
1762.8796, 45.7438, 30459.0464, 2345.8452, 3155.2153, 9420.2711,
1107.0516, 1614.3134, 818.9615, 484.3678, 1613.9445, 17.0691,
14674.209, 3037.4688, 3741.8818, 5809.3256, 688.5928, 4134.4713,
828.4763, 341.3822, 1690.2581, 44.1648, 12330.3918, 2558.2269,
3040.4316, 2229.2957, 686.3165, 4200.9857, 959.8567, 420.3641,
1570.6747, 17.6861, 19647.681, 2496.6604, 4041.9243, 6503.5269,
749.796, 2689.1762, 1000.3566, 423.1116, 1590.8995, 41.1018,
40847.6969, 2306.5671, 3025.5213, 3386.5546, 861.7896, 1909.7417,
1051.6651, 394.6467, 1680.361, 53.3787, 12392.7638, 2879.6926,
3503.1038, 7530.2986, 559.5507, 2631.9834, 796.9258, 437.1859,
1609.8138, 20.4024, 15847.5751, 2082.0815, 2966.7382, 1372.3176,
962.441, 1013.412, 810.0858, 478.0043, 1715.6652, 44.1964, 27787.4338,
2517.8366, 3247.7944, 3787.8788, 680.8477, 2439.9693, 955.1208,
431.5305, 2040.2762, 38.2464, 11097.4617, 3136.9164, 3515.104,
1173.7936, 411.0357, 3917.6148, 1790.5061, 351.9027, 1664.5743,
41.0752, 12219.0158, 3045.3933, 3753.4768, 371.6285, 612.8901,
2541.1433, 1990.7641, 445.4353, 1827.0053, 29.0631, 22662.2977,
2744.6112, 4235.6337, 13412.0641, 1127.8605, 1697.7296, 904.7316,
513.0433, 1923.9901), dim = c(10L, 32L), dimnames = list(c("A1BG",
"A2M", "AAMP", "AARS", "ABAT", "ABCA1", "ABCA2", "ABCA3", "ABCB7",
"ABCF1"), c("TCGA-BQ-5887-11A-01R-1965-07", "TCGA-DZ-6133-11A-01R-1965-07",
"TCGA-BQ-5884-11A-01R-1592-07", "TCGA-BQ-7044-11A-01R-1965-07",
"TCGA-BQ-5888-11A-01R-1592-07", "TCGA-BQ-7051-11A-02R-1965-07",
"TCGA-BQ-5879-11A-01R-1592-07", "TCGA-BQ-5882-11A-01R-1592-07",
"TCGA-BQ-5894-11A-01R-1592-07", "TCGA-DZ-6132-11A-01R-1965-07",
"TCGA-BQ-7045-11A-01R-1965-07", "TCGA-GL-A59R-11A-11R-A26U-07",
"TCGA-P4-A5E8-11A-12R-A28H-07", "TCGA-P4-A5ED-11A-11R-A28H-07",
"TCGA-BQ-7059-11A-01R-1965-07", "TCGA-BQ-5877-11A-01R-1592-07",
"TCGA-A4-A4ZT-11A-11R-A26U-07", "TCGA-BQ-5891-11A-01R-1592-07",
"TCGA-A4-A57E-11A-11R-A26U-07", "TCGA-BQ-7061-11A-01R-1965-07",
"TCGA-GL-6846-11A-01R-1965-07", "TCGA-BQ-5890-11A-01R-1592-07",
"TCGA-DZ-6131-11A-01R-1965-07", "TCGA-BQ-7055-11A-01R-1965-07",
"TCGA-B9-4115-11A-01R-1758-07", "TCGA-BQ-5875-11A-01R-1592-07",
"TCGA-BQ-7046-11A-01R-1965-07", "TCGA-GL-7966-11A-01R-2204-07",
"TCGA-DZ-6134-11A-01R-1965-07", "TCGA-GL-A9DE-11A-11R-A37K-07",
"TCGA-Y8-A8RY-11A-11R-A37K-07", "TCGA-BQ-5878-11A-01R-1592-07"
)))
> dput(exp.kirp)
structure(c(7.65342121905285, 14.3511850042327, 10.3737643425674,
10.0819596419255, 9.44832324553207, 5.36085937172008, 9.78880184184623,
10.3776687573505, 11.16757118884, 9.53872845925388, 9.59256168492636,
11.6467250199966, 9.64995723240483, 7.72893066783674, 7.8938008505495,
10.2355345297148, 4.7113572383413, 10.307474405626, 5.1591164988591,
6.82029258613417, 11.8163747078537, 10.6949102196217, 11.2422803547626,
8.59352109955772, 3.36085586100493, 9.85271363828624, 8.10771088195776,
11.3942834720292, 7.22380769346318, 10.1931428909004, 8.38100686984167,
10.16821542222, 7.89538927254103, 9.15267449482738, 10.3931817074315,
12.7214881212179, 10.1435311751243, 11.5928015984057, 9.61257956084687,
6.32433538607651, 9.96819071167712, 10.0874192149778, 10.14614948871,
9.68818857795196, 10.4090748876599, 10.6687443412299, 10.8848143975244,
11.5872726722286, 6.84628770347992, 10.2373774088459, 5.09389393824392,
12.4136523086721, 11.1918237390263, 10.1912122382252, 9.9623840324273,
6.22565668754941, 10.3398477765017, 10.3103072842012, 11.1287210937383,
9.12552428380532, 10.0220514952103, 11.5514020529076, 9.33154847897224,
7.15215907641367, 8.30013562492319, 10.2772713571793, 3.07471093384149,
9.90686083971644, 5.84486014570242, 7.85453119568792, 10.1727165187711,
11.3218051787965, 11.6207019570471, 8.14964399360302, 2.57221059331231,
9.770250762185, 8.2978654585063, 11.084297043801, 5.60196387752462,
8.88994408176988, 8.11137409958519, 10.6640911447013, 7.63002607924622,
8.98301980527478, 9.95231536364664, 9.50590581220822, 12.3983662709548,
9.74297479204394, 9.19670588373063, 5.03164662255944, 9.10229620506499,
10.4765987884097, 10.4203162794891, 9.17212793704954, 10.6963363394643,
10.517658858193, 10.6383042190661, 9.89639568127639, 4.96453263672961,
10.153415608879, 4.70168212029528, 10.1689338100564, 9.96839262629172,
9.87305770150294, 9.75535162798677, 6.170389794965, 10.238532641469,
9.94050095178643, 11.0690397931313, 9.18120494347434, 10.1026705963882,
12.2134644534149, 9.76558563096394, 7.17400985460863, 8.51390402420206,
10.6841464369114, 4.46450490913191, 10.2220147915752, 6.26732229936107,
8.34111937214434, 9.4563488702199, 11.3165579369515, 11.3692227748971,
7.56443152168505, 3.28865386360517, 9.3642795739989, 7.97047579449743,
4.39174710202512, 1.99320334863612, 9.09364811306327, 7.99728901996415,
11.0296194917623, 7.76961109298212, 9.33304596974955, 9.89296486273821,
13.2154969925887, 10.5733689401427, 10.0743768517653, 8.12622125684463,
4.20618996652279, 9.22288882589791, 10.7973270736212, 10.0356016423526,
9.9017250769121, 10.2083981309069, 11.6540022938525, 11.3739522748552,
8.46892945061225, 3.85569076543445, 10.0627068729804, 7.99645936536463,
13.6832285748428, 10.3714563849361, 10.4176870383992, 10.0652444551968,
5.98935248863472, 10.1079093507719, 11.2050505161752, 11.6645692817891,
9.14131578384568, 10.4504026669882, 11.7286541625055, 9.0418201112886,
7.18088382351455, 8.02888128245278, 10.4406656190814, 4.29335576875801,
9.41768714387146, 5.17869870850833, 6.88790782664042, 10.6188892562694,
10.3842123858415, 11.7658466970944, 7.8638220194434, 4.34994552290862,
9.22504206592907, 7.63249652719248, 10.3723102883537, 5.34994198553451,
9.48564780343169, 8.04610739496541, 10.4463535299365, 7.68066614055097,
9.04610766787879, 9.95556447653129, 11.8525448023587, 10.4640851987958,
10.4308625115247, 9.91159766583895, 5.37742225475561, 9.4487843137183,
9.99933682779041, 10.2770366214296, 9.0418201112886, 10.5578953924552,
10.4808099887997, 10.9764939558956, 10.4242897631097, 4.93490330712903,
10.9389453617464, 5.13719199914349, 12.0367193976262, 10.8202555636581,
10.3262700402849, 9.91216810777653, 5.52284042813955, 10.0653680664815,
10.5954686012028, 11.2355920880251, 9.10909645558853, 9.94472260124272,
11.5783590097537, 9.29504775099388, 7.53455423776441, 9.73144632203062,
10.3357853667922, 2.69972929338667, 11.0827581319988, 5.12711734972685,
7.20385010562547, 10.2823150418909, 9.98084618229907, 10.5982618153731,
8.87118417188149, 4.52284042813955, 8.94910562068286, 8.00490358291707,
11.0657044429112, 7.06536766254031, 8.7728136979997, 8.38298319671723,
10.0327018868117, 7.20326283373569, 9.47196961605789, 9.90466577812603,
9.49413435666035, 10.19403569598, 10.4724772111563, 11.1448925980171,
7.28004970633421, 9.06939981937328, 10.8754120177161, 10.2352930230415,
9.35902847180107, 10.7267739242029, 10.5529081501417, 11.2956215011293,
8.85725977829128, 4.55387039485638, 11.0056049014623, 6.95117512427229,
10.9656928148969, 10.5991523452113, 10.4556415168452, 9.46537845450025,
6.72337288854289, 10.0139441477751, 9.28408724134641, 11.4833270722276,
9.49617650025352, 10.4425557194467, 11.449658606754, 8.69606512512137,
6.7730033396692, 9.79804905625499, 10.6006257403548, 6.41043269493023,
9.33218159403931, 6.79787807595744, 3.66448284036468, 9.36947738932311,
12.1844424384667, 11.1649886100581, 8.30833493415398, 1.07949758402178,
8.50996853808744, 8.25444192773356, 11.0514188710826, 7.25444192773355,
10.2092692396731, 8.12117551878819, 10.7242737250263, 7.07511328163257,
9.24190737941017, 9.38899238984847, 11.5242722006977, 10.8138023505891,
9.39692881604932, 7.33690372047785, 5.99837831118633, 9.20104960820531,
10.271192322104, 10.3421982587672, 9.9583334671237, 10.8104108086575,
11.3588364978417, 10.3011032299907, 9.79976033549459, 4.03371371384131,
9.4502035528435, 3.61712213221935, 11.9975977242282, 9.91379851626213,
9.95999222715916, 9.90779350021794, 7.85831302551685, 10.3997246047534,
11.7402171909708, 11.7246448152361, 9.75167259046219, 9.68598603763708,
11.889881014998, 10.3276744540453, 8.04315308658688, 9.09516358798289,
10.1351204181247, 2.58736499093646, 10.6921133041207, 6.09727578619146,
10.2388193351428, 11.1387141417635, 9.60973686632844, 9.91029869099744,
7.0969238361025, 1.22478119439559, 7.85207251858217, 8.9132990393591,
11.4041748741886, 7.92325557174226, 11.9555213595621, 8.66299657133128,
9.99675988081849, 6.52046939465285, 9.19469924394784, 11.4453499696318,
13.7764385227361, 10.9582985177207, 9.73202730526663, 11.8793070895146,
5.674830656291, 9.30694800222568, 11.6556994451055, 10.4213132544326,
10.4922544904181, 10.5534738814693, 10.6996637140919, 10.2332275555211,
9.52531486289245, 6.37164075239763, 11.3566558699168, 3.15639661659767,
10.9506421354115, 10.8015070949819, 9.51027580369917, 10.0114476969219,
5.9232585434528, 10.1410206692489, 10.9093204661635, 11.1601119970792,
9.35889972298825, 9.89445931817658, 11.7097666903094, 10.3085136675886,
7.97175793055237, 9.39274501543653, 10.3787624907905, 6.94163806584425,
11.8483258820978, 5.28766436585954, 6.24271872469928, 10.0904144275087,
10.7658517928209, 10.6194247493219, 8.72532024089938, 6.12970957941129,
7.37932464409351, 8.97175821789909, 13.5632882613554, 6.31028194625214,
8.64822705939532, 9.18732138776524, 10.4159346076083, 6.46711478729057,
8.74638197005123, 9.80945126361765, 11.0458332956411, 12.6596471469673,
7.83425486655716, 10.9194011668402, 4.66760264911311, 8.44306371542639,
11.6735729859861, 10.592053792505, 10.0394481089966, 10.738081785684,
10.6804287993454, 10.4484290791818, 8.73413344315171, 6.43659995275074,
10.6685246199734, 8.20162111992077, 12.7954861179578, 10.0094938620897,
9.93400880935778, 9.75330735360058, 7.0082672553098, 9.74081003982032,
10.7382235152475, 11.6720072357516, 8.44494819471576, 9.99223068991282,
11.0703963722836, 9.47126368282115, 7.81421970385892, 9.50816775056871,
11.4983748936735, 6.23485957395859, 11.5903647810385, 6.26655044400704,
5.21798676951157, 10.0355914707199, 10.2338111446746, 11.8068330986116,
7.98850462265062, 2.26801511865652, 8.75403900144463, 8.99470954056443,
13.8050246222504, 9.39637273263616, 9.25174607188694, 8.59323012958596,
9.64227593422842, 7.73613603148148, 8.71848835304267, 11.3433845026852,
11.2390478373181, 10.8940677306605, 10.8800576375586, 10.6070670903737,
5.6912382257558, 9.46054395023954, 9.9164661944151, 9.35162700413489,
9.86387775885121, 10.0058120312308, 10.7880204671486, 10.3793997040699,
9.6038232649825, 4.26803009485916, 10.2782371039023, 6.26475409489153,
12.8198023802381, 11.7916439373724, 9.53606689182685, 10.3591574288036,
5.71406413888836, 10.0630462305102, 10.2195932783632, 11.455724780085,
8.42239448735832, 9.24745049099849, 12.1236985893816, 9.92518871297862,
7.89993146943836, 8.72944026793844, 10.4251666934438, 3.90670784266208,
9.20041190448898, 6.65613736826174, 10.2389705995139, 9.53818563335118,
11.7495320050765, 12.0646887131882, 8.44896824561834, 3.71406963555001,
9.8415805435622, 8.31270283399943, 5.24299494486063, 6.49167354977479,
9.98614168480194, 8.4582261368972, 9.59531026428392, 8.57655917226128,
8.95437954428976, 10.6495249285462, 14.9290954211312, 10.5422470205379,
9.17443338491354, 3.94233621005718, 5.67224428652735, 8.7869060376421,
10.8632316688763, 9.68413003969761, 10.0472843996015, 10.3973606067328,
10.2873877943623, 10.1185985217259, 9.54984491532867, 3.90670784266208,
8.67224499381939, 5.31084177712261, 13.1978705809921, 9.83862580682156,
9.54417855835282, 10.3479785104067, 6.95803180123955, 10.5275135616911,
10.7557094705532, 11.5723066760841, 9.32468663898682, 9.95090752477796,
11.895394953164, 10.1730449599892, 7.96865297090106, 8.25269807010003,
10.3946717952085, 3.87765591897716, 9.72523791794449, 6.55356786048423,
6.95803180123955, 10.5069909263905, 10.2877565830483, 10.5166977513248,
8.4726051481255, 4.38016145559557, 8.85875002019905, 8.67423888569003,
8.68286505123176, 5.31506259026344, 10.3333282396821, 9.31228692654063,
10.4142296315944, 7.85981553329836, 8.44751431678991, 11.0858929239232,
9.28987549215634, 11.1939401219005, 9.81036549777261, 8.34254452326977,
5.73563860631975, 9.00522312099486, 10.3688436329726, 9.94464471906402,
10.1522308353126, 10.7920437057173, 10.4456146711682, 10.5376544440661,
9.5106538108757, 6.00694202058687, 10.5239172935582, 6.07015744755466,
13.1209957412573, 10.9576837919022, 9.8945869439465, 9.23740931858995,
4.16326524153571, 10.523017510386, 10.532865308348, 10.619622367405,
9.45558963540716, 11.5136546699209, 11.209027561542, 8.50311772962086,
6.94462773667725, 9.43239470040379, 10.9762474117231, 3.74822505823549,
8.22396403818439, 5.28179467379438, 5.07015744755466, 9.40882053452885,
9.15195265459219, 11.7917134419519, 8.19669113197608, 4.74823042627325,
10.2400835226029, 7.76318049103673, 10.5252116829711, 5.07015744755466,
9.41356636990191, 8.69464935705204, 10.5448108847847, 6.4670479262821,
8.25602512546126, 10.0954826060933, 9.24807628369884, 10.2848014332547,
9.6103510748559, 7.88436713970017, 6.04591033802914, 7.98344684046261,
9.83392420760322, 10.426537121557, 8.77797787921888, 11.1932452952465,
10.889486091222, 10.329430966151, 9.26655583156462, 3.62269612816767,
10.5609425545211, 5.59521719360418, 13.9452588877727, 11.136365018286,
10.0338699199786, 9.01178744914112, 7.04236745871339, 9.77739489226773,
10.1944306108943, 11.0829417734315, 9.2483306049673, 10.2526052400654,
11.2392043296817, 9.78826123015354, 7.37009277188134, 9.28978490856306,
10.770835486621, 4.07112766139071, 8.79617819177588, 6.50274907789938,
4.95248267443048, 10.0124053945822, 10.8791588020481, 10.1400457232732,
9.64018220112663, 7.9855897990336, 8.44341851204054, 8.1851425389976,
11.3009567982918, 7.87032217794857, 9.12571179465448, 8.10396912418597,
10.0952450966319, 7.11418480239357, 9.3240403215499, 9.55790821677093,
12.5506935970456, 11.2890199649612, 9.74578287914873, 8.83302700041379,
6.80619256918401, 8.91595554395471, 10.4010993451482, 10.426361210739,
9.65687923171916, 10.3180541086121, 10.2515587448026, 10.4679560946132,
9.3710569683502, 5.63055399711973, 9.66161458306801, 3.49639843291562,
12.046947416625, 7.48250063297666, 10.3197777792499, 9.47365522397164,
6.95629018836077, 9.81146637438849, 9.67304984423569, 11.9690923038785,
9.99940969863312, 10.4101699932371, 12.2230772873117, 9.90523370112278,
7.10116469832952, 8.84409058546722, 11.2100883041446, 2.89400283481279,
12.1723192712031, 7.02813039834545, 7.76654547216671, 9.33005660517673,
9.71268261052322, 10.1695833227844, 7.83721098924632, 1.12849137851042,
7.14639353071401, 7.90194088386521, 10.2858186675831, 4.09194519835092,
9.82683604431664, 9.28935150377015, 10.3990006824976, 7.03632419457704,
9.25259305937555, 9.96009564626238, 13.6495813760682, 11.2087562804992,
8.62991100391626, 10.8093586506308, 4.36112967113519, 8.66452479510135,
10.680372183305, 10.4092425218598, 10.166842081886, 11.1148826551454,
11.5218621620828, 10.8830252212276, 8.50698349076537, 6.57472823565416,
10.546746035965, 7.53012147669141, 15.4681548425702, 9.79693513247838,
9.38793395115019, 8.98389796929759, 5.4990039364996, 9.68780798972713,
10.2221443965538, 11.2238327266469, 8.86631579791574, 9.0839648421511,
11.7688762222962, 9.43339244300495, 7.75707059749995, 7.30848010698361,
10.3788698682696, 5.10854118358023, 9.397938764198, 5.14397700761083,
5.54294517859488, 11.3477092783888, 10.1821779329165, 10.5068822633094,
7.95165772727702, 6.07899829224459, 9.17521394310747, 8.44018606106649,
10.2131062858423, 5.98655201209625, 9.76585079853419, 8.60644889080884,
9.71847037220014, 7.04446188071599, 9.27521498813589, 10.9901943954469,
11.5209125420426, 9.65147026322989, 10.6316810922306, 5.74458092333422,
8.26977703134258, 10.0141332026215, 9.03736135461269, 9.96719613631738,
9.93388860627356, 10.7248434525252, 10.2215812006449, 10.3621030533842,
11.4668080776894, 7.99184662847717, 9.33580121861008, 5.66820140316967,
11.9947030005322, 9.64794224178718, 10.1925126545798, 10.2674620603105,
7.5327081297796, 9.69101099541322, 11.2548391298699, 11.7689577047578,
9.19050300947907, 9.56006949444913, 11.2747943605218, 10.0095406093405,
8.00727063981421, 9.97974599674609, 10.3602156709497, 3.94458674388554,
12.5277668843542, 5.61301589155362, 10.2348465241763, 11.2408106474539,
10.7817980294952, 11.3929731417988, 8.56239088440517, 2.39709102017243,
8.66820104853111, 9.02757458085238, 12.6567500587125, 6.6095466615844,
11.4664830513311, 8.87780988126623, 10.3650994493445, 8.35292135920633,
7.93983129663749, 9.4812924824758, 12.99012924981, 10.4898544816469,
8.58540087815148, 10.8618786972288, 7.71067834148438, 8.93864071660805,
11.4031536223651, 10.7979767770093, 9.84830839236378, 9.47801724034487,
10.6468489299639, 10.3611048811716, 8.96812071185483, 5.42457950821431,
10.7176375280321, 5.99997294921438, 11.249476392449, 10.9207261360998,
10.3485991467441, 9.01994649367039, 6.53868559158395, 9.41478373733666,
10.432374028565, 11.0171355028297, 9.38017184771789, 10.5614675309011,
11.5465057188577, 9.17787519818794, 7.04065904752138, 11.0984136358699,
10.7076152049804, 2.37928824997193, 12.014117542008, 6.92057199739813,
4.98912992235826, 10.120606346187, 10.2084736088023, 10.4873799256012,
9.27923169328214, 1.52972101613862, 9.22178985762886, 7.95596194920281,
12.8045414410067, 7.92201464494711, 9.47902214483086, 7.4885399127133,
10.1807490909157, 6.70138409679205, 9.69962242189537, 10.1778751981879,
9.210584414225, 10.547619289895, 10.7420721677751, 11.6561438053443,
5.89426967756534, 9.09193991076902, 10.4509432609399, 10.2168981211661,
9.50697734395297, 10.1421973186731, 10.289252816822, 11.0579561962947,
8.44378357961471, 4.8801957289431, 11.4458754454719, 4.65344988200111,
12.0454913320053, 10.7230307474822, 9.54770815725829, 9.91346136214138,
6.41298301444827, 10.2791008470934, 11.2856294138809, 11.2550839389669,
9.39663175119314, 10.3644682533465, 12.3720956896476, 9.96448281969001,
7.26859346154833, 9.52846036210416, 10.6382351536235, 4.31017286300409,
11.0690011128305, 5.82641628715103, 6.80493638177069, 9.89786451034248,
9.94481833785949, 11.6774877172589, 9.01061202172265, 6.17266952106643,
9.5752544613585, 8.43194131360949, 11.0206656082034, 7.72110534236959,
10.5081439758347, 8.68994452434197, 10.8685929396885, 6.5444359272107,
8.68679113793189, 10.8525006396094, 11.1533795693109, 11.5873539730162,
8.99794551516942, 9.87102611145061, 3.93820109105871, 9.06141196830279,
10.3168170965086, 10.0968873453464, 9.78516864210074, 10.6089638880433,
10.6772891469749, 10.6027030400136, 10.5354892373923, 5.69623034672501,
10.4314703380043, 3.77147362287765, 11.9414284466395, 8.62911943026167,
9.47885474304884, 9.32603460205856, 7.52941181265447, 10.1897935554257,
10.6869529617348, 10.9824945325978, 9.54925041577186, 9.24348510543789,
12.6197561225204, 10.4229927556855, 7.78641143939821, 10.3518043181888,
10.2176900900617, 2.72462844562171, 13.9420477990178, 6.05669840185354,
8.84668667779751, 10.976770882955, 11.2310420929738, 11.9128532309879,
8.82897202130395, 2.92626515369416, 9.02841610083176, 9.70477480883974,
13.4396963975546, 6.34522248637767, 10.2353559766448, 8.78641143939821,
10.5578417013582, 6.64287771079513, 9.06181129266044, 10.6360273009065,
12.9150155941383, 11.9719694740862, 8.95056581144122, 10.0640399701312,
7.2355968744756, 9.21728961537032, 11.5561009223631, 9.95201054312933,
10.0229270264038, 10.7583335783419, 10.7564054571028, 10.3137246220883,
8.76465102394471, 5.79502400957859, 9.85443348382268, 5.2076547248541,
12.2695992772649, 10.7969615679355, 9.78033864553782, 9.39835742542764,
6.02727350425616, 9.5278442796299, 11.038246992221, 11.1457431696807,
9.00232447215102, 9.03646624386732, 12.062072892785, 10.0526651781329,
7.29576250712518, 7.86933843816656, 9.61223637364334, 3.32380238041871,
8.70355167900769, 5.73415681989236, 8.69454663545621, 11.3998442086989,
11.3703586767908, 12.3235644563292, 9.05367152133834, 10.196743500022,
9.91601721982026, 8.03544796058228, 12.1704251930224, 5.87162190579696,
8.07993112355535, 8.03136652207373, 9.91767640083016, 5.98915036355025,
9.10990689967612, 11.1609999554059, 11.726925238744, 12.2657872278982,
9.73588750953117, 11.4370505772681, 5.57906848944612, 9.27950936832482,
9.65861638042757, 9.17054127127384, 9.78579998093652, 10.5632561528303,
10.254342831578, 10.148100924125, 10.2726105411389, 4.18254929202166,
10.6197309312023), dim = c(50L, 20L), dimnames = list(c("A1BG",
"A2M", "A4GALT", "AAAS", "AACS", "AADAT", "AAGAB", "AAK1", "AAMP",
"AARS2", "AARSD1", "AARS", "AASDHPPT", "AASDH", "AASS", "AATF",
"AATK", "ABAT", "ABCA11P", "ABCA12", "ABCA1", "ABCA2", "ABCA3",
"ABCA5", "ABCA6", "ABCA7", "ABCB10", "ABCB1", "ABCB4", "ABCB6",
"ABCB7", "ABCB8", "ABCB9", "ABCC10", "ABCC1", "ABCC3", "ABCC4",
"ABCC5", "ABCC6", "ABCC9", "ABCD1", "ABCD3", "ABCD4", "ABCE1",
"ABCF1", "ABCF2", "ABCF3", "ABCG1", "ABCG2", "ABHD10"), c("TCGA.2K.A9WE.01A",
"TCGA.2Z.A9J1.01A", "TCGA.2Z.A9J3.01A", "TCGA.2Z.A9J5.01A", "TCGA.2Z.A9J6.01A",
"TCGA.2Z.A9J7.01A", "TCGA.2Z.A9J8.01A", "TCGA.2Z.A9JD.01A", "TCGA.2Z.A9JI.01A",
"TCGA.2Z.A9JJ.01A", "TCGA.2Z.A9JO.01A", "TCGA.2Z.A9JQ.01A", "TCGA.4A.A93W.01A",
"TCGA.4A.A93X.01A", "TCGA.4A.A93Y.01A", "TCGA.5P.A9JU.01A", "TCGA.5P.A9JY.01A",
"TCGA.5P.A9KE.01A", "TCGA.A4.7288.01A", "TCGA.A4.7583.01A")))
We can use intersect() to keep only the row names that occur in both matrices:
nm1 <- intersect(row.names(exp.normal), row.names(exp.kirp))
exp.kirp.samp <- exp.kirp[nm1,]
exp.norm <- exp.normal[nm1,]
dim(exp.norm)
#[1] 8 32
dim(exp.kirp.samp)
#[1] 8 20
In the sample data shown, there are no duplicate row names. Still, it may be worth checking for duplicates with a frequency count of the row names:
> table(row.names(exp.normal))
A1BG A2M AAMP AARS ABAT ABCA2 ABCA3 ABCB7
1 1 1 1 1 1 1 1
> table(row.names(exp.kirp))
A1BG A2M A4GALT AAAS AACS AADAT AAGAB AAK1 AAMP AARS AARS2 AARSD1 AASDH AASDHPPT
1 1 1 1 1 1 1 1 1 1 1 1 1 1
AASS AATF AATK ABAT ABCA1 ABCA11P ABCA12 ABCA2 ABCA3 ABCA5 ABCA6 ABCA7 ABCB1 ABCB10
1 1 1 1 1 1 1 1 1 1 1 1 1 1
ABCB4 ABCB6 ABCB7 ABCB8 ABCB9 ABCC1 ABCC10 ABCC3 ABCC4 ABCC5 ABCC6 ABCC9 ABCD1 ABCD3
1 1 1 1 1 1 1 1 1 1 1 1 1 1
ABCD4 ABCE1 ABCF1 ABCF2 ABCF3 ABCG1 ABCG2 ABHD10
1 1 1 1 1 1 1 1
If there are duplicates, the post does not make clear how they should be handled.
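As a rough sketch of one way to handle duplicates before intersecting: the toy matrices `a` and `b` below are made up for illustration, and keeping only the first occurrence of a duplicated row name is an assumption, not something the post specifies.

```r
# Toy matrices (hypothetical); `b` has a duplicated row name "A2M".
a <- matrix(1:6, nrow = 3,
            dimnames = list(c("A1BG", "A2M", "AAMP"), c("s1", "s2")))
b <- matrix(1:8, nrow = 4,
            dimnames = list(c("A2M", "AAMP", "ABAT", "A2M"), c("t1", "t2")))

# anyDuplicated() returns the index of the first duplicate, or 0 if there is none.
first_dup <- anyDuplicated(rownames(b))

# Assumption: keep the first occurrence of each duplicated row name.
b_unique <- b[!duplicated(rownames(b)), , drop = FALSE]

# Subset both matrices to the shared row names, in the same order.
common   <- intersect(rownames(a), rownames(b_unique))
a_common <- a[common, , drop = FALSE]
b_common <- b_unique[common, , drop = FALSE]
```

Other policies, such as averaging duplicated rows, may suit expression data better; that choice depends on what the duplicates represent.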
My first data frame (df) contains EntryDate and ExitDate columns. Another data frame (n1) has all trading dates. I need a new column in the first data frame holding the number of trading days between EntryDate and ExitDate, counted from n1. How do I call this dayCount function for each row of df? When I try to use mapply, I am unable to pass n1 as a parameter.
dayCount <- function (startDate, endDate, n1) {
return (nrow(subset(n1, Date >= startDate & Date <= endDate)))
}
df<- structure(list(EntryDate = structure(c(11355, 11418, 11436, 11449,
11520, 11523, 11548, 11620, 11768, 11773), class = "Date"), ExitDate = structure(c(11360,
11422, 11438, 11457, 11522, 11526, 11554, 11625, 11772, 11778
), class = "Date")), row.names = c(22L, 65L, 76L, 84L, 135L,
138L, 155L, 204L, 305L, 307L), class = "data.frame")
n1<- structure(c(11354, 11355, 11358, 11359, 11360, 11361, 11362,
11365, 11366, 11367, 11368, 11369, 11372, 11373, 11374, 11375,
11376, 11379, 11380, 11381, 11382, 11383, 11386, 11388, 11389,
11390, 11393, 11394, 11395, 11396, 11397, 11400, 11401, 11402,
11403, 11404, 11407, 11408, 11409, 11410, 11411, 11414, 11415,
11416, 11418, 11421, 11422, 11423, 11424, 11428, 11429, 11430,
11431, 11432, 11435, 11436, 11437, 11438, 11439, 11442, 11444,
11445, 11446, 11449, 11450, 11451, 11452, 11453, 11456, 11457,
11458, 11459, 11460, 11463, 11464, 11465, 11466, 11467, 11470,
11471, 11472, 11473, 11474, 11477, 11478, 11479, 11480, 11481,
11484, 11485, 11486, 11487, 11488, 11491, 11492, 11493, 11494,
11495, 11498, 11499, 11500, 11501, 11502, 11505, 11506, 11507,
11508, 11509, 11512, 11513, 11514, 11515, 11516, 11519, 11520,
11521, 11522, 11523, 11526, 11527, 11528, 11529, 11530, 11533,
11534, 11535, 11536, 11537, 11540, 11541, 11542, 11543, 11544,
11547, 11548, 11550, 11551, 11554, 11555, 11557, 11558, 11561,
11562, 11563, 11564, 11565, 11568, 11569, 11570, 11571, 11572,
11575, 11576, 11577, 11578, 11579, 11582, 11583, 11584, 11585,
11586, 11589, 11590, 11591, 11592, 11593, 11596, 11598, 11599,
11600, 11603, 11604, 11605, 11606, 11607, 11610, 11611, 11612,
11613, 11614, 11617, 11618, 11619, 11620, 11624, 11625, 11626,
11627, 11628, 11631, 11632, 11633, 11634, 11635, 11638, 11639,
11640, 11641, 11645, 11646, 11647, 11648, 11649, 11652, 11653,
11654, 11655, 11659, 11660, 11661, 11662, 11663, 11666, 11667,
11668, 11669, 11670, 11674, 11675, 11676, 11677, 11680, 11682,
11683, 11684, 11687, 11688, 11689, 11690, 11691, 11694, 11695,
11696, 11697, 11698, 11701, 11702, 11703, 11704, 11705, 11708,
11709, 11710, 11711, 11712, 11715, 11716, 11717, 11718, 11719,
11722, 11723, 11724, 11725, 11726, 11729, 11730, 11731, 11732,
11733, 11736, 11737, 11738, 11739, 11740, 11743, 11744, 11745,
11746, 11747, 11750, 11751, 11752, 11753, 11754, 11757, 11758,
11759, 11760, 11761, 11764, 11765, 11766, 11767, 11768, 11772,
11773, 11774, 11778), class = "Date")
You can use %in% to count the number of days in n1 that fall between each EntryDate and ExitDate.
df$dayCount <- colSums(mapply(function(x, y) n1 %in% seq(x, y, by = '1 day'),
df$EntryDate, df$ExitDate))
df
# EntryDate ExitDate dayCount
#22 2001-02-02 2001-02-07 4
#65 2001-04-06 2001-04-10 3
#76 2001-04-24 2001-04-26 3
#84 2001-05-07 2001-05-15 7
#135 2001-07-17 2001-07-19 3
#138 2001-07-20 2001-07-23 2
#155 2001-08-14 2001-08-20 4
#204 2001-10-25 2001-10-30 3
#305 2002-03-22 2002-03-26 2
#307 2002-03-27 2002-04-01 3
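The original question was how to hand n1 to the function when using mapply. One option is mapply's MoreArgs argument, which passes the same object to every call. The sketch below assumes the trading dates are a plain Date vector, as the dput output suggests, so it compares dates directly instead of using subset(); `day_count`, `trades`, and `trading_days` are illustrative names and the data is a small made-up sample.

```r
# Count the trading days that fall within [start_date, end_date].
day_count <- function(start_date, end_date, trading_days) {
  sum(trading_days >= start_date & trading_days <= end_date)
}

# Small stand-ins for n1 and df.
trading_days <- as.Date(c(11354, 11355, 11358, 11359, 11360, 11361),
                        origin = "1970-01-01")
trades <- data.frame(
  EntryDate = as.Date(c(11355, 11359), origin = "1970-01-01"),
  ExitDate  = as.Date(c(11360, 11361), origin = "1970-01-01")
)

# EntryDate and ExitDate are iterated in parallel;
# MoreArgs supplies the full date vector to each call.
trades$dayCount <- mapply(day_count, trades$EntryDate, trades$ExitDate,
                          MoreArgs = list(trading_days = trading_days))
```

With the objects from the question, the equivalent call would be mapply(day_count, df$EntryDate, df$ExitDate, MoreArgs = list(trading_days = n1)).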
I have a string in the following format. How can I convert it to a list in R?
"{1948, 2507, 2510, 7030, 7110, 9009, 00027, 00206, 00399, 00717, 00814, 00828, 00848, 00917, 01050, 01105, 01144, 02130, 02768, 03037, 03752, 03754, 04070, 04110, 05050, 05255, 05289, 05564, 05595, 06100, 06330, 06671, 07041, 07119, 07137, 07273, 07313, 07454, 07871, 08104, 08714, 08726, 08995, 09059, 09073, 09525, 09949, 09981, 10092, 10439, 10782, 11185, 11507, 11712, 11806, 11858, 11980, 12067, 12113, 12139, 12643, 13820, 14534, 15007, 15014, 15549, 15953, 16151, 16174, 16634, 16733, 16888, 17111, 17207, 17377, 17721, 17900, 18118, 18400, 18686, 18880, 19080, 19342, 19444, 19772, 19790, 19891, 20091, 20245, 20402, 20811, 21114, 21345, 21811, 21881, 22222, 22311, 22320, 22831, 22969, 23251, 23572, 23734, 23862, 23889, 24034, 24463, 25172, 25688, 26143, 26221, 26803, 26850, 26898, 27497, 28291, 28343, 29411, 29419, 30024, 30561, 30923, 31345, 31351, 31555, 31927, 32198, 32861, 33020, 33040, 33095, 33188, 33311, 33368, 33377, 33475, 33519, 33574, 33592, 34207, 34235, 34272, 34484, 34854, 34872, 34875, 34876, 34880, 35222, 35292, 35344, 36177, 36266, 37038, 37060, 37548, 37686, 37700, 38139, 39368, 39369, 39633, 40132, 40698, 40704, 40744, 40819, 41311, 41971, 42102, 42616, 43055, 43211, 43234, 43428, 43494, 43934, 44117, 44252, 44272, 44301, 44336, 44619, 44866, 44888, 45049, 45197, 45412, 45718, 46694, 46736, 47000, 48046, 48540, 49078, 49109, 49216, 49388, 49464, 50056, 50155, 50217, 50477, 50692, 51122, 51445, 51946, 52475, 52537, 52982, 54011, 54031, 54160, 54963, 55000, 55537, 56080, 56163, 56282, 56760, 56787, 57102, 57727, 57871, 58101, 58558, 58882, 59902, 60225, 60397, 60501, 60619, 60703, 60890, 61075, 61894, 61944, 62322, 62337, 62380, 62413, 62729, 62766, 62923, 63010, 63234, 63977, 64127, 65359, 65428, 65542, 65750, 65863, 66184, 66636, 66712, 67201, 67439, 67953, 68133, 68854, 69251, 69959, 70107, 70725, 70768, 71081, 71099, 71948, 72013, 72377, 72400, 72420, 72735, 73000, 73015, 73142, 73223, 73455, 73717, 74049, 74492, 74854, 74941, 75142, 75399, 
75464, 75587, 75618, 75642, 75887, 76357, 76651, 77199, 77302, 77456, 77579, 77601, 77649, 77668, 77694, 77745, 78006, 78010, 78178, 78335, 78656, 78729, 78808, 78824, 78844, 78945, 79416, 79471, 79915, 80077, 80111, 80189, 80262, 80409, 80470, 80529, 80539, 80838, 81272, 81513, 81658, 81740, 81743, 81762, 81843, 82001, 82070, 82106, 82342, 82472, 82719, 83670, 84009, 84151, 84299, 84430, 84450, 84460, 84945, 86411, 86443, 86446, 86668, 86942, 87286, 87317, 87624, 87785, 88023, 88517, 88696, 88787, 88868, 88977, 89206, 90108, 90440, 90734, 90802, 90849, 90920, 90931, 91011, 91031, 91133, 91777, 91949, 92162, 92494, 93012, 93172, 94300, 94517, 95142, 95410, 95559, 95859, 96112, 97255, 97787, 97986, 98240, 98817, 99050, 99198, 99222, 99241, 99295, 99326, 99335, 99503, 99603, 99643, 99803, 99968}"
This is not a duplicate of "convert json to list in a vectorized way in R": the format here is completely different.
Try this one-liner:
as.numeric(sapply(strsplit(substr(j,2,nchar(j)-1),split = ","),trimws))
[1] 1948 2507 2510 7030 7110 9009 27 206 399 717 814 828 848 917 1050 1105 1144
[18] 2130 2768 3037 3752 3754 4070 4110 5050 5255 5289 5564 5595 6100 6330 6671 7041 7119
[35] 7137 7273 7313 7454 7871 8104 8714 8726 8995 9059 9073 9525 9949 9981 10092 10439 10782
[52] 11185 11507 11712 11806 11858 11980 12067 12113 1213 ..
Your input:
j<-"{1948, 2507, 2510, 7030, 7110, 9009, 00027, 00206, 00399, 00717, 00814, 00828, 00848, 00917, 01050, 01105, 01144, 02130, 02768, 03037, 03752, 03754, 04070, 04110, 05050, 05255, 05289, 05564, 05595, 06100, 06330, 06671, 07041, 07119, 07137, 07273, 07313, 07454, 07871, 08104, 08714, 08726, 08995, 09059, 09073, 09525, 09949, 09981, 10092, 10439, 10782, 11185, 11507, 11712, 11806, 11858, 11980, 12067, 12113, 12139, 12643, 13820, 14534, 15007, 15014, 15549, 15953, 16151, 16174, 16634, 16733, 16888, 17111, 17207, 17377, 17721, 17900, 18118, 18400, 18686, 18880, 19080, 19342, 19444, 19772, 19790, 19891, 20091, 20245, 20402, 20811, 21114, 21345, 21811, 21881, 22222, 22311, 22320, 22831, 22969, 23251, 23572, 23734, 23862, 23889, 24034, 24463, 25172, 25688, 26143, 26221, 26803, 26850, 26898, 27497, 28291, 28343, 29411, 29419, 30024, 30561, 30923, 31345, 31351, 31555, 31927, 32198, 32861, 33020, 33040, 33095, 33188, 33311, 33368, 33377, 33475, 33519, 33574, 33592, 34207, 34235, 34272, 34484, 34854, 34872, 34875, 34876, 34880, 35222, 35292, 35344, 36177, 36266, 37038, 37060, 37548, 37686, 37700, 38139, 39368, 39369, 39633, 40132, 40698, 40704, 40744, 40819, 41311, 41971, 42102, 42616, 43055, 43211, 43234, 43428, 43494, 43934, 44117, 44252, 44272, 44301, 44336, 44619, 44866, 44888, 45049, 45197, 45412, 45718, 46694, 46736, 47000, 48046, 48540, 49078, 49109, 49216, 49388, 49464, 50056, 50155, 50217, 50477, 50692, 51122, 51445, 51946, 52475, 52537, 52982, 54011, 54031, 54160, 54963, 55000, 55537, 56080, 56163, 56282, 56760, 56787, 57102, 57727, 57871, 58101, 58558, 58882, 59902, 60225, 60397, 60501, 60619, 60703, 60890, 61075, 61894, 61944, 62322, 62337, 62380, 62413, 62729, 62766, 62923, 63010, 63234, 63977, 64127, 65359, 65428, 65542, 65750, 65863, 66184, 66636, 66712, 67201, 67439, 67953, 68133, 68854, 69251, 69959, 70107, 70725, 70768, 71081, 71099, 71948, 72013, 72377, 72400, 72420, 72735, 73000, 73015, 73142, 73223, 73455, 73717, 74049, 74492, 74854, 74941, 75142, 
75399, 75464, 75587, 75618, 75642, 75887, 76357, 76651, 77199, 77302, 77456, 77579, 77601, 77649, 77668, 77694, 77745, 78006, 78010, 78178, 78335, 78656, 78729, 78808, 78824, 78844, 78945, 79416, 79471, 79915, 80077, 80111, 80189, 80262, 80409, 80470, 80529, 80539, 80838, 81272, 81513, 81658, 81740, 81743, 81762, 81843, 82001, 82070, 82106, 82342, 82472, 82719, 83670, 84009, 84151, 84299, 84430, 84450, 84460, 84945, 86411, 86443, 86446, 86668, 86942, 87286, 87317, 87624, 87785, 88023, 88517, 88696, 88787, 88868, 88977, 89206, 90108, 90440, 90734, 90802, 90849, 90920, 90931, 91011, 91031, 91133, 91777, 91949, 92162, 92494, 93012, 93172, 94300, 94517, 95142, 95410, 95559, 95859, 96112, 97255, 97787, 97986, 98240, 98817, 99050, 99198, 99222, 99241, 99295, 99326, 99335, 99503, 99603, 99643, 99803, 99968}"
This code removes the first and last characters of the string (the "{" and "}"), splits the values on ",", and strips the surrounding whitespace with trimws. Finally, it converts the values to numeric, which drops the leading zeros (00027 becomes 27).
If your data actually is JSON, stick with the rjson package. This answer assumes your data is not JSON (since rjson::fromJSON throws an error on it).
Try:
string <- "{1948, 2507, 2510, 7030, 7110, 9009, 00027, 00206, 00399, 00717, 00814, 00828, 00848, 00917, 01050, 01105, 01144, 02130, 02768, 03037, 03752, 03754, 04070, 04110, 05050, 05255, 05289, 05564, 05595, 06100, 06330, 06671, 07041, 07119, 07137, 07273, 07313, 07454, 07871, 08104, 08714, 08726, 08995, 09059, 09073, 09525, 09949, 09981, 10092, 10439, 10782, 11185, 11507, 11712, 11806, 11858, 11980, 12067, 12113, 12139, 12643, 13820, 14534, 15007, 15014, 15549, 15953, 16151, 16174, 16634, 16733, 16888, 17111, 17207, 17377, 17721, 17900, 18118, 18400, 18686, 18880, 19080, 19342, 19444, 19772, 19790, 19891, 20091, 20245, 20402, 20811, 21114, 21345, 21811, 21881, 22222, 22311, 22320, 22831, 22969, 23251, 23572, 23734, 23862, 23889, 24034, 24463, 25172, 25688, 26143, 26221, 26803, 26850, 26898, 27497, 28291, 28343, 29411, 29419, 30024, 30561, 30923, 31345, 31351, 31555, 31927, 32198, 32861, 33020, 33040, 33095, 33188, 33311, 33368, 33377, 33475, 33519, 33574, 33592, 34207, 34235, 34272, 34484, 34854, 34872, 34875, 34876, 34880, 35222, 35292, 35344, 36177, 36266, 37038, 37060, 37548, 37686, 37700, 38139, 39368, 39369, 39633, 40132, 40698, 40704, 40744, 40819, 41311, 41971, 42102, 42616, 43055, 43211, 43234, 43428, 43494, 43934, 44117, 44252, 44272, 44301, 44336, 44619, 44866, 44888, 45049, 45197, 45412, 45718, 46694, 46736, 47000, 48046, 48540, 49078, 49109, 49216, 49388, 49464, 50056, 50155, 50217, 50477, 50692, 51122, 51445, 51946, 52475, 52537, 52982, 54011, 54031, 54160, 54963, 55000, 55537, 56080, 56163, 56282, 56760, 56787, 57102, 57727, 57871, 58101, 58558, 58882, 59902, 60225, 60397, 60501, 60619, 60703, 60890, 61075, 61894, 61944, 62322, 62337, 62380, 62413, 62729, 62766, 62923, 63010, 63234, 63977, 64127, 65359, 65428, 65542, 65750, 65863, 66184, 66636, 66712, 67201, 67439, 67953, 68133, 68854, 69251, 69959, 70107, 70725, 70768, 71081, 71099, 71948, 72013, 72377, 72400, 72420, 72735, 73000, 73015, 73142, 73223, 73455, 73717, 74049, 74492, 74854, 74941, 
75142, 75399, 75464, 75587, 75618, 75642, 75887, 76357, 76651, 77199, 77302, 77456, 77579, 77601, 77649, 77668, 77694, 77745, 78006, 78010, 78178, 78335, 78656, 78729, 78808, 78824, 78844, 78945, 79416, 79471, 79915, 80077, 80111, 80189, 80262, 80409, 80470, 80529, 80539, 80838, 81272, 81513, 81658, 81740, 81743, 81762, 81843, 82001, 82070, 82106, 82342, 82472, 82719, 83670, 84009, 84151, 84299, 84430, 84450, 84460, 84945, 86411, 86443, 86446, 86668, 86942, 87286, 87317, 87624, 87785, 88023, 88517, 88696, 88787, 88868, 88977, 89206, 90108, 90440, 90734, 90802, 90849, 90920, 90931, 91011, 91031, 91133, 91777, 91949, 92162, 92494, 93012, 93172, 94300, 94517, 95142, 95410, 95559, 95859, 96112, 97255, 97787, 97986, 98240, 98817, 99050, 99198, 99222, 99241, 99295, 99326, 99335, 99503, 99603, 99643, 99803, 99968}"
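The string as a list of character values (note that [[1]] must be applied to the strsplit result before as.list, otherwise the result is a plain character vector):
string_as_list_char <- as.list(strsplit(gsub('\\{|\\}', '', string), ", ")[[1]])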
or converted to numeric:
string_as_list_num <- as.list(as.numeric(strsplit(gsub('\\{|\\}', '', string), ", ")[[1]]))
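The difference between the character and numeric results is easiest to see on a short input. The sketch below uses a shortened stand-in string `x` (not the full data from the question) and splits on a comma followed by any whitespace, which is an assumption made so that stray line breaks after a comma do not end up in the values.

```r
# Short stand-in for the long string in the question.
x <- "{1948, 2507, 00027, 00206, 10092}"

# Drop the braces, then split on a comma followed by optional whitespace.
parts <- strsplit(gsub("[{}]", "", x), ",\\s*")[[1]]

as_chars <- as.list(parts)              # keeps leading zeros, e.g. "00027"
as_nums  <- as.list(as.numeric(parts))  # leading zeros are lost, 00027 becomes 27
```

If the values are identifiers rather than quantities, keeping them as character is probably the safer choice.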
I am trying to do leave-one-out cross-validation (LOOCV). I am following all the instructions but don't really understand what I am doing, and I am getting an error. Maybe my dataset is too small; I include it here:
clay oc ph_h2o avg_N2O sum_tmax
31.54643 2.654043 6.725000 5.8397204 1644.0
31.54643 2.654043 6.725000 8.9456498 1626.0
31.54643 2.654043 6.725000 36.6636187 1846.5
31.54643 2.654043 6.725000 27.9717408 1651.5
31.54643 2.654043 6.725000 13.7662532 1433.5
31.54643 2.654043 6.725000 28.4065759 1597.5
31.54643 2.654043 6.725000 9.7437375 1585.5
20.15455 1.371111 6.090909 2.8604854 1644.0
20.15455 1.371111 6.090909 11.4821949 1626.0
20.15455 1.371111 6.090909 20.1477475 1846.5
20.15455 1.371111 6.090909 3.9438700 1651.5
20.15455 1.371111 6.090909 4.8634605 1597.5
30.14316 3.224697 7.221811 10.2540652 802.5
30.14316 3.224697 7.221811 17.7039395 841.0
30.14316 3.224697 7.221811 19.3734159 983.5
30.14316 3.224697 7.221811 17.2422255 781.0
30.14316 3.224697 7.221811 17.9839534 412.5
18.06667 1.852857 5.911111 4.1653732 1012.5
18.06667 1.852857 5.911111 4.5732676 1201.0
18.06667 1.852857 5.911111 8.1417138 1003.5
8.11250 0.886250 6.650000 0.4631667 818.0
8.11250 0.886250 6.650000 2.1779861 397.5
8.11250 0.886250 6.650000 1.6355573 641.5
8.11250 0.886250 6.650000 2.8754931 259.5
22.47405 1.816556 5.684229 4.5025055 1324.0
22.47405 1.816556 5.684229 3.6881473 1634.5
22.47405 1.816556 5.684229 4.7470418 1370.0
22.47405 1.816556 5.684229 8.2378739 1559.5
The code I am running on this data is:
train_control<-trainControl(method="LOOCV")
control<-train(avg_N2O ~., data=slim, trControl=train_control, method="nb")
All classes should be numeric, and they are.
I've used linear regression to look at the relationship of these variables to avg_N2O, but it has been suggested that I use LOOCV. I would like to have a predictive model in the end and this is my training set.
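The error message is not shown, so the following is only a guess at the cause: in caret, method="nb" is naive Bayes, a classification model, while avg_N2O is numeric, so a regression method such as method="lm" may be what is needed. To make clear what LOOCV does, here is a minimal base-R sketch that performs it by hand with lm on a few rows copied from the table above. The name `slim` follows the question; restricting the predictors to clay and sum_tmax is an arbitrary simplification for the example.

```r
# A few rows copied from the question's table (subset of columns).
slim <- data.frame(
  clay     = c(31.54643, 31.54643, 20.15455, 20.15455, 30.14316,
               30.14316, 18.06667, 8.11250, 8.11250, 22.47405),
  sum_tmax = c(1644.0, 1626.0, 1644.0, 1626.0, 802.5,
               841.0, 1012.5, 818.0, 397.5, 1324.0),
  avg_N2O  = c(5.8397204, 8.9456498, 2.8604854, 11.4821949, 10.2540652,
               17.7039395, 4.1653732, 0.4631667, 2.1779861, 4.5025055)
)

# Leave-one-out: fit on every row except i, then predict the held-out row i.
loo_pred <- sapply(seq_len(nrow(slim)), function(i) {
  fit <- lm(avg_N2O ~ ., data = slim[-i, ])
  predict(fit, newdata = slim[i, , drop = FALSE])
})

# Root mean squared error of the held-out predictions.
loo_rmse <- sqrt(mean((slim$avg_N2O - loo_pred)^2))
```

Each observation is predicted by a model that never saw it, so loo_rmse estimates out-of-sample error rather than goodness of fit on the training rows.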
I have a list of text sections that need to be split into sentences, which I do as follows:
> textList <- list(sections=sections[(length(sections)-2):length(sections)])
> textList$sentences <- sapply(textList$sections, function(x) strsplit(as.character(x), "(?<=und/KON)\\s(?!\\S+/V)|(?<=oder/KON)\\s|(?<=/\\$[[:punct:]])\\s(?!dass/KOUS)(?!dann/ADV)(?!weil/KOUS)", perl=TRUE))
> sent <- textList$sentences
The final goal is to add IDs to all sentences and arrange them into a list of data frames, one data frame per section.
> sent.list <- lapply(seq_along(sent), function(i)
+ data.frame(ID=paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])), sep = ""),
+ Sentence=sent[[i]]))
Error in data.frame(ID = paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])), :
arguments imply differing number of rows: 1, 0
ISSUE: However I vary the split in the first step, I always seem to get a list with exactly one character(0) element (the last one). This prevents the second step, creating the list of data frames, from running and produces the error above.
Please note that the structure of the list seems somehow corrupted. In the R console output pasted below, the first two sections begin (at #*) with $..., which I cannot interpret meaningfully. The third section (at #**), however, starts with [[3]].
> sent
$... #*
[1] "Das/ART Spiel/NN besteht/VVFIN aus/APPR mehreren/PIAT Früchten/NN -LRB-/TRUNC rote/ADJA Kirschen/NN ,/$,"
.
.
.
[51] "-RRB-/TRUNC sie/PPER bleiben/VVFIN die/ART ganze/ADJA Zeit/NN über/APPR konzetriert/ADJD bei/APPR der/ART Sache/NN ./$."
[52] "Das/ART Spiel/NN ist/VAFIN eine/ART absolue/ADJA Kaufempfehlung/NN !!!!/CARD "
$... #*
[1] "Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$."
.
.
.
[36] "hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$."
[[3]] #**
character(0)
I tried hard to reproduce the error on artificial data, without much success, so please excuse the complicated code.
The smallest version of textList for which I could reproduce the error when executed in the R console:
> textList
$sections
[1] "Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$. Preis/NN führt/VVFIN ,/$, aus/APPR einem/ART einfachen/ADJA Spiel/NN schnell/ADJD einen/ART hochwertigen/ADJA und/KON hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$. "
[2] ""
Below is the dput output of the smallest version of textList that reproduces the error.
structure(list(sections = c("Obstgarten/NN ist/VAFIN DAS/NE Einsteigerspiel/NN für/APPR Kinder/NN ab/APPR zwei/CARD Jahren/NN ./$. Die/ART Spielidee/NN ist/VAFIN wie/KOKOM bei/APPR allen/PIDAT Spielen/NN mit/APPR dieser/PDAT Zielaltersklasse/NN außerordentlich/ADJD einfach/ADJD ./$. Hier/ADV geht/VVFIN es/PPER darum/PROAV ,/$, reihum/ADV zu/PTKZU würfeln/VVINF ./$. Der/ART Würfel/NN zeigt/VVFIN keine/PIAT Zahlen/NN ,/$, sondern/KON vier/CARD Farben/NN ,/$, einen/ART Raben/NN und/KON einen/ART Obstkorb/NN ./$. Bei/APPR einer/ART Farbe/NN darf/VMFIN man/PIS ein/ART Stück/NN Obst/NN von/APPR einem/ART der/ART vier/CARD Obstbäume/NN im/APPRART Obstgarten/NN pflücken/VVFIN ,/$, bei/APPR einem/ART Raben/NN muss/APPR eines/ART von/APPR neun/CARD Rabenpuzzleteilen/NN gelegt/VVPP werden/VAINF ,/$, bei/APPR einem/ART Obstkorb/NN darf/VMFIN man/PIS zwei/CARD Obststücke/NN nach/APPR Wahl/NN abräumen/VVINF ./$. Entweder/KON es/PPER gewinnen/VVFIN alle/PIS ,/$, weil/KOUS alles/PIS Obst/NN abgeerntet/VVPP ist/VAFIN ,/$, bevor/KOUS der/ART Rabe/NN fertig/ADJD gepuzzlet/VVPP wurde/VAFIN oder/KON es/PPER verlieren/VVFIN alle/PIDAT gegen/APPR den/ART fertigen/ADJA Raben/NN ./$. Die/ART Idee/NN eines/ART ``/CARD kooperativen/ADJA ''/ADJA Spiels/NN hat/VAFIN viele/PIDAT Freunde/NN ,/$, macht/VVFIN das/ART Spiel/NN aber/ADV noch/ADV langweiliger/ADJD ,/$, als/KOUS es/PPER unbedingt/ADV nötig/ADJD wäre/VAFIN ./$. Unser/PPOSAT vierjähriger/ADJA Sohn/NN versucht/VVFIN schon/ADV so/ADV zu/PTKZU mogeln/VVINF ,/$, dass/KOUS der/ART Rabe/NN gewinnt/VVFIN -/$( einfach/ADV um/APPR mehr/PIAT Pepp/NN in/APPR das/ART Spiel/NN zu/PTKZU bringen/VVINF ./$. 
Selbst/ADV unsere/PPOSAT zweijährige/ADJA Tochter/NN wagt/VVFIN sich/PRF schon/ADV an/APPR die/ART Regeln/NN ,/$, wenn/KOUS sie/PPER sich/PRF spielerisch/ADJD dem/ART Diktat/NN des/ART Würfels/NN verweigert/VVFIN und/KON erklärt/VVFIN ,/$, jedes/PIDAT Obst/NN zu/PTKZU pflücken/VVINF ,/$, aber/KON bei/APPR einem/ART roten/ADJA Würfel/NN keine/PIAT rote/ADJA Kirsche/NN ./$. Das/ART Spiel/NN besticht/VVFIN vor/APPR allem/PIS durch/APPR die/ART Qualität/NN seiner/PPOSAT Verarbeitung/NN ./$. Die/ART Obstsorten/NN sind/VAFIN gut/ADJD gestaltete/ADJA und/KON lackierte/ADJA Holzstücke/NN ./$. Die/ART Kirschen/NN hängen/VVFIN paarweise/ADV am/APPRART Baum/NN und/KON auch/ADV die/ART Obstkörbe/NN sind/VAFIN liebevoll/ADJD geflochten/VVPP ./$. Solch/PIDAT ein/ART Spiel/NN packt/VVFIN man/PIS immer/ADV wieder/ADV gerne/ADV aus/PTKVZ ./$. Besonders/ADV schön/ADJD ist/VAFIN die/ART Sonderedition/NN im/APPRART Blechkasten/NN statt/APPR im/APPRART Pappkarton/NN ./$. Warum/PWAV Spielehersteller/NN sich/PRF immer/ADV wieder/ADV vor/APPR den/ART Kosten/NN einer/ART hochwertigen/ADJA Herstellung/NN drücken/VVINF bleibt/VVFIN ein/ART ungeklärtes/ADJA Geheimnis/NN ,/$, zumal/KOUS so/ADV schöne/ADJA Spiele/NN wie/KOKOM Obstgarten/NN beweisen/VVFIN ,/$, dass/KOUS eine/ART hochwertige/ADJA und/KON liebevolle/ADJA Gestaltung/NN ,/$, die/PRELS selbstverständlich/ADJD zu/APPR einem/ART etwas/ADV höheren/ADJA Preis/NN führt/VVFIN ,/$, aus/APPR einem/ART einfachen/ADJA Spiel/NN schnell/ADJD einen/ART hochwertigen/ADJA und/KON hochgelobten/ADJA Klassiker/NN werden/VAFIN lassen/VVINF kann/VMFIN ./$. ",
"")), .Names = "sections")
Just remove the elements whose length is 0:
sent <- unlist(sent,recursive=FALSE)
sent <- sent[lapply(sent,length)>0]
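A variant of the above, sketched under the assumption that sent is already a list with one character vector per section (which is what the printed output in the question suggests). In that case unlist(sent, recursive=FALSE) would merge all sections into a single vector, so this sketch skips that step and filters with lengths() instead. The toy sent below is made up for illustration only.

```r
# Toy stand-in for the real data: two sections with sentences, one empty.
sent <- list(c("s1 a", "s1 b"), "s2 a", character(0))

# Keep only the sections that contain at least one sentence.
keep <- lengths(sent) > 0
sent <- sent[keep]

# One data frame per remaining section;
# ID = 2-digit section number followed by 3-digit sentence number.
sent.list <- lapply(seq_along(sent), function(i)
  data.frame(ID = paste0(sprintf("%02d", i),
                         sprintf("%03d", seq_along(sent[[i]]))),
             Sentence = sent[[i]],
             stringsAsFactors = FALSE))
```

Note that the sections are renumbered after filtering. If the original section numbers matter, which(keep) (computed before subsetting) gives the original positions and could be used to build the ID instead.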
EDIT: The OP seems to have trouble reproducing the error, so here is how to reproduce it.
For example, use this as sent:
sent = list("a",character(0)) ## you get an error because of character(0)
lapply(seq_along(sent),
function(i)
data.frame(ID=paste(sprintf("%02d", i),
sprintf("%03d", seq_along(sent[[i]])), sep = ""),
Sentence=sent[[i]]))
This reproduces the error:
Error in data.frame(ID = paste(sprintf("%02d", i), sprintf("%03d", seq_along(sent[[i]])), :
arguments imply differing number of rows: 1, 0
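Why the message says "1, 0" can be traced step by step. The following is a minimal sketch in base R; the separator "\\s" and the section index 3L are placeholders chosen for illustration, not the regex or index from the question.

```r
# strsplit() on an empty string yields a zero-length character vector,
# which is where the character(0) element comes from.
empty <- strsplit("", "\\s", perl = TRUE)[[1]]
length(empty)

# seq_along() and sprintf() keep it zero-length ...
nums <- sprintf("%03d", seq_along(empty))
length(nums)

# ... but paste() treats a zero-length argument as "", so the ID column
# ends up with one element while the Sentence column has none.
id <- paste(sprintf("%02d", 3L), nums, sep = "")
length(id)

# sapply() over an unnamed character vector uses the strings themselves as
# names. An empty section therefore gets an empty name, which would explain
# why the last element prints as [[3]] instead of $<name>.
res <- sapply(c("a b", ""), function(x) strsplit(x, "\\s", perl = TRUE))
names(res)
```

So the empty section "" in textList$sections is the root cause: dropping empty strings from sections before splitting, or dropping zero-length elements afterwards, should both avoid the error.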