Displaying Rows conditional on values of two columns - R [duplicate]

This question already has answers here:
How to combine multiple conditions to subset a data-frame using "OR"?
(5 answers)
Closed 2 years ago.
treat age education black hispanic married nodegree re74 re75
1:    0   23        10    1        0       0        1    0
2:    0   26        12    0        0       0        0    0
3:    0   22        9     1        0       0        1    0
4:    0   18        9     1        0       0        1    0
I'm trying to display only the rows where re74 == 0, re75 == 0, or both are zero, which means I'm disregarding the rows where both are nonzero.

df <- data.frame(
  stringsAsFactors = FALSE,
  treat = c("1:", "2:", "3:", "4:"),
  age = c(0L, 0L, 0L, 0L),
  education = c(23L, 26L, 22L, 18L),
  black = c(10L, 12L, 9L, 9L),
  hispanic = c(1L, 0L, 1L, 1L),
  married = c(0L, 0L, 0L, 0L),
  nodegree = c(0L, 0L, 0L, 0L),
  re74 = c(1L, 0L, 1L, 1L),
  re75 = c(1L, 0L, 0L, 0L)
)
df[df$re74 == 0 | df$re75 == 0, ]
treat age education black hispanic married nodegree re74 re75
2 2: 0 26 12 0 0 0 0 0
3 3: 0 22 9 1 0 0 1 0
4 4: 0 18 9 1 0 0 1 0

You can use filter from dplyr
library(dplyr)
df %>% filter(re74 == 0 | re75 == 0)

We can use subset
subset(df, re74 == 0 | re75 == 0)
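Equivalently, the condition can be written as the negation of "both nonzero" (a small sketch using the same df as above):
subset(df, !(re74 != 0 & re75 != 0))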

Related

My balanced panel data shows up as unbalanced panel data; "make.pbalanced()" and "is.pbalanced()" do not help

I have a panel data set with no missing values; all columns are numeric except the date, which is in "month/day/year" format at quarterly frequency.
There are no missing values at all, yet I do not understand why my data is reported as unbalanced when I run "is.pbalanced()".
I also cannot make it balanced with "make.pbalanced()", which throws errors I do not understand.
Even when I run "table(DATA$Firm, DATA$Date)" (screenshot attached below), the table output only shows 1s and 0s, so no id-time combination appears more than once.
DATA is the data file I used (a snapshot is attached at the bottom as well); the full data and output are too large to attach here, so please understand.
I would appreciate knowing how to make this panel data usable despite its apparently balanced structure. Thank you.
> DATA=pdata.frame(data,index=c("Firm","Date"))
> DATA<-make.pbalanced(DATA)
Error in seq.default(from = min_value, to = max_value, by = 1) :
'from' must be a finite number
In addition: Warning messages:
1: In make.pconsecutive.indexes(x, balanced = balanced, ...) :
NAs introduced by coercion
2: In min(df_index[, "times"]) :
no non-missing arguments to min; returning Inf
3: In max(df_index[, "times"]) :
no non-missing arguments to max; returning -Inf
> is.pbalanced(DATA)
[1] FALSE
If I run dput(DATA) on the first 20 rows of DATA, the output is as follows:
> dput(DATA)
structure(list(Date = structure(c(3L, 18L, 35L, 52L, 54L, 27L,
28L, 44L, 45L, 60L, 61L, 10L, 11L, 12L, 13L, 28L, 30L, 45L, 47L,
63L), .Label = c("12/31/1998", "12/31/1999", "12/31/2000", "12/31/2002",
"12/31/2003", "12/31/2004", "12/31/2005", "12/31/2008", "12/31/2009",
"12/31/2010", "12/31/2011", "12/31/2013", "12/31/2014", "12/31/2015",
"12/31/2016", "12/31/2019", "12/31/2020", "3/31/1998", "3/31/1999",
"3/31/2000", "3/31/2001", "3/31/2004", "3/31/2005", "3/31/2006",
"3/31/2007", "3/31/2009", "3/31/2010", "3/31/2011", "3/31/2012",
"3/31/2015", "3/31/2016", "3/31/2017", "3/31/2018", "3/31/2021",
"6/30/1998", "6/30/1999", "6/30/2000", "6/30/2001", "6/30/2004",
"6/30/2005", "6/30/2006", "6/30/2007", "6/30/2009", "6/30/2010",
"6/30/2011", "6/30/2012", "6/30/2015", "6/30/2016", "6/30/2017",
"6/30/2018", "6/30/2021", "9/30/1998", "9/30/1999", "9/30/2000",
"9/30/2003", "9/30/2004", "9/30/2005", "9/30/2006", "9/30/2009",
"9/30/2010", "9/30/2011", "9/30/2012", "9/30/2014", "9/30/2015",
"9/30/2016", "9/30/2017", "9/30/2020"), class = "factor"), Firm = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), .Label = c("296", "10718", "52239", "100263", "100432",
"101273", "102798", "102931", "103660", "105309", "105334", "106599",
"106801", "107495", "107501", "107559", "107574", "107737", "107755",
"107766", "107791", "108048", "108297", "108299", "108679", "108729",
"108731", "110803", "111464", "111469", "111483", "111484", "111487",
"111489", "111492", "111493", "111503", "111506", "111509", "111514",
"111522", "111536", "111555", "111589", "111590", "111600", "111695",
"111703", "111716", "111727", "111750", "111751", "111752", "111775",
"111796", "111808", "111938", "111940", "111941", "111942", "111955",
"111956", "112001", "112028", "112066", "112137", "112153", "112347",
"112367", "112371", "112427", "112472", "112666", "112738", "112852",
"113501", "113582", "113848", "113959", "114957", "114958", "116126",
"116324", "117005", "135894", "135939", "146262", "154189", "1000001",
"1000132", "1000181", "1000198", "1000234", "1000242", "1000517",
"1000757", "1000858", "1000881", "1000897", "1001061", "1001172",
"1001283", "1001526", "1001577", "1001616", "1001915", "1002018",
"1002061", "1002312", "1002320", "1002374", "1002376", "1002587",
"1002650", "1002815", "1002827", "1002835", "1002839", "1002923",
"1003021", "1003053", "1003057", "1003059", "1003229", "1003260",
"1003405", "1003495", "1003683", "1003698", "1003943", "1004349",
"1004369", "1004594", "1004595", "1004628", "1004823", "1005002",
"1005330", "1005419", "1005420", "1005519", "1005570", "1005575",
"1005625", "1005629", "1006105", "1006110", "1006155", "1006217",
"1006232", "1006379", "1006460", "1006474", "1006511", "1006676",
"1006720", "1006781", "1006799", "1007050", "1007451", "1007518",
"1007544", "1007561", "1007564", "1007606", "1007631", "1007708",
"1007780", "1007831", "1007879", "1007890", "1007923", "1008222",
"1008290", "1008336", "1008494", "1008501", "1008521", "1008541",
"1008974", "1009297", "1009608", "1009702", "1009707", "1010040",
"1010079", "1010118", "1010171", "1010179", "1010218", "1010383",
"1010384", "1010456", "1010469", "1010513", "1010515", "1010523",
"1010559", "1010680", "1010697", "1010871", "1010884", "1010892",
"1011249", "1011315", "1011369", "1011532", "1011549", "1011550",
"1011601", "1011608", "1011628", "1011633", "1011636", "1011666",
"1011793", "1011813", "1011965", "1012183", "1012304", "1012356",
"1012472", "1012850", "1012854", "1021617", "1024280", "1028649",
"1032627", "1032628", "1037429", "1047191", "1078689", "1079592",
"1085370", "1094670", "1095030", "1095890", "1098870", "1103990",
"1116650", "1130830", "1136430", "1150911", "1164070", "1165550",
"1167911", "1169072", "1169451", "1169570", "1169574", "1177670",
"1199506", "1200034", "1200141", "1200336", "1201617", "1203212",
"1203998", "1204112", "1204249", "1205697", "1205991", "1206695",
"1209238", "1209508", "1231250", "1236950", "1239130", "1254611",
"1261831", "1278491", "1299590", "1308650", "1349851", "1364272",
"1371810", "1373451", "1415470", "1461924", "1462905", "1468726",
"1470067", "1471922", "1475575", "1492469", "1493548", "1494156",
"1497186", "1502005", "1503676", "1510039"), class = "factor"),
Country = c(30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), X1 = c(3.28071e+11,
2.47603e+11, 2.68036e+11, 2.8843e+11, 3.12936e+11, 1.63006e+11,
1.62064e+11, 1.62003e+11, 1.75994e+11, 1.66539e+11, 1.90875e+11,
8.48942e+11, 9.11332e+11, 9.38555e+11, 9.11507e+11, 8.80528e+11,
9.15665e+11, 8.83188e+11, 8.59914e+11, 9.23223e+11), X2 = c(420.9,
109.62, 115.46, 170.57, 256.84, 245.79, 28.1, 320.61, 37.39,
51.84, 24.73, 28.56, 149.12, 176.7, 204.86, 241.27, 245.69,
328.73, 270.17, 225.57), X3 = c(1.397e+09, 6.826e+09, 8.407e+09,
6.218e+09, 1.96e+09, 4.39e+08, 3.011e+09, 4.27e+08, 2.918e+09,
1.738e+09, 3.219e+09, 2e+05, 2e+05, 2e+05, 2e+05, 1.7844e+10,
2e+05, 1.7161e+10, 2e+05, 2e+05), X4 = c(41.6563, 21.4688,
29.8125, 37.0938, 33.6875, 28.25, 30.88, 29.31, 24.69, 28.99,
26.13, 168.84, 168.16, 127.56, 177.26, 170.63, 163.85, 131.27,
167.44, 158.21), X5 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X6 = c(0.931431568,
0.931431568, 0.931431568, 0.931431568, 0.931431568, 0.931431568,
0.931431568, 0.931431568, 0.931431568, 0.931431568, 0.931431568,
0.931431568, 0.931431568, 0.931431568, 0.931431568, 0.931431568,
0.931431568, 0.931431568, 0.931431568, 0.931431568), X7 = c(0.270710059,
0.277063061, 0.291823581, 0.431113358, 0.165191665, 0.031089322,
0.00946807, 0.040553104, 0.012598261, 0.006557103, 0.008332575,
0.003612478, 0.050244788, 0.387559494, 0.250535044, 0.081293992,
0.179216725, 0.110762938, 0.197073477, 0.275862491), X8 = c(0.000327816,
0.002603413, 0.002742109, 0.004050941, 0.000200038, 0.000115447,
1.27133e-05, 0.000150589, 1.69164e-05, 2.43491e-05, 1.11887e-05,
1.34145e-05, 6.74667e-05, 0.00044596, 0.000278949, 0.000109158,
0.000133225, 0.000148728, 0.0001465, 0.000307149), X9 = c(0.270893931,
0.318213603, 0.391916461, 0.289869936, 0.38006593, 0.011946002,
0.04790828, 0.01161946, 0.046428549, 0.047294196, 0.051217786,
4.59817e-07, 4.59817e-07, 4.59817e-07, 4.59817e-07, 0.283917419,
4.59817e-07, 0.273050148, 4.59817e-07, 4.59817e-07), X10 = c(0.0002414,
0.003316004, 0.004084039, 0.003020644, 0.000338686, 8.42066e-06,
5.44127e-05, 8.19048e-06, 5.27321e-05, 3.33374e-05, 5.81716e-05,
4.80818e-09, 4.80818e-09, 4.80818e-09, 4.80818e-09, 0.000322465,
4.80818e-09, 0.000310122, 4.80818e-09, 4.80818e-09), X11 = c(0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), X12 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), X13 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), X14 = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L), X15 = c(7L,
4L, 2L, 2L, 3L, 25L, 12L, 24L, 15L, 22L, 10L, 18L, 10L, 9L,
9L, 12L, 15L, 15L, 20L, 8L), X16 = c(0.579324474, 0.67828913,
0.797352123, 0.65619424, 0.643443626, 1.362389303, 1.076926757,
1.158663556, 1.326649492, 1.070007097, 1.263384865, 0.976581992,
1.299930534, 1.665132357, 1.202016572, 1.076926757, 1.111254141,
1.326649492, 0.885775977, 1.319718044), X17 = c(13.07130455,
8.316492188, 8.766790769, 9.539716667, 12.43554545, 8.205709375,
11.61956875, 9.233861538, 11.37893538, 10.52281061, 11.20804697,
11.50165758, 12.36111364, 13.1170197, 16.01679545, 11.61956875,
16.46697656, 11.37893538, 16.98400462, 15.10651515)), row.names = c("296- 12/31/2000",
"296-3/31/1998", "296-6/30/1998", "296-9/30/1998", "296-9/30/2000",
"10718-3/31/2010", "10718-3/31/2011", "10718-6/30/2010", "10718-6/30/2011",
"10718-9/30/2010", "10718-9/30/2011", "52239-12/31/2010", "52239-12/31/2011",
"52239-12/31/2013", "52239-12/31/2014", "52239-3/31/2011", "52239-3/31/2015",
"52239-6/30/2011", "52239-6/30/2015", "52239-9/30/2014"), class = c("pdata.frame",
"data.frame"), index = structure(list(Firm = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L), .Label = c("296", "10718", "52239"), class = "factor"),
Date = structure(c(1L, 6L, 10L, 14L, 15L, 7L, 8L, 11L, 12L,
16L, 17L, 2L, 3L, 4L, 5L, 8L, 9L, 12L, 13L, 18L), .Label = c("12/31/2000",
"12/31/2010", "12/31/2011", "12/31/2013", "12/31/2014", "3/31/1998",
"3/31/2010", "3/31/2011", "3/31/2015", "6/30/1998", "6/30/2010",
"6/30/2011", "6/30/2015", "9/30/1998", "9/30/2000", "9/30/2010",
"9/30/2011", "9/30/2014"), class = "factor")), row.names = c(220L,
7L, 11L, 13L, 196L, 1713L, 2161L, 1816L, 2274L, 1930L, 2379L,
2052L, 2504L, 2983L, 3278L, 2162L, 3413L, 2275L, 3554L, 3121L
), class = c("pindex", "data.frame")))
The first 20 lines (rows) of DATA are below. This is after applying pdata.frame(), so the first column (the row names) has been added automatically:
DATA=pdata.frame(data,index=c("Firm","Date"))
DATA[1:20,]
Date Firm Country X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17
296-12/31/2000 12/31/2000 296 30 3.28071e+11 420.90 1.3970e+09 41.6563 0 0.9314316 0.270710059 0.0003278160 2.708939e-01 2.414000e-04 0 0 0 0 7 0.5793245 13.071305
296-3/31/1998 3/31/1998 296 30 2.47603e+11 109.62 6.8260e+09 21.4688 0 0.9314316 0.277063061 0.0026034130 3.182136e-01 3.316004e-03 0 0 0 0 4 0.6782891 8.316492
296-6/30/1998 6/30/1998 296 30 2.68036e+11 115.46 8.4070e+09 29.8125 0 0.9314316 0.291823581 0.0027421090 3.919165e-01 4.084039e-03 0 0 0 0 2 0.7973521 8.766791
296-9/30/1998 9/30/1998 296 30 2.88430e+11 170.57 6.2180e+09 37.0938 0 0.9314316 0.431113358 0.0040509410 2.898699e-01 3.020644e-03 1 0 0 0 2 0.6561942 9.539717
296-9/30/2000 9/30/2000 296 30 3.12936e+11 256.84 1.9600e+09 33.6875 0 0.9314316 0.165191665 0.0002000380 3.800659e-01 3.386860e-04 0 0 0 0 3 0.6434436 12.435545
10718-3/31/2010 3/31/2010 10718 30 1.63006e+11 245.79 4.3900e+08 28.2500 0 0.9314316 0.031089322 0.0001154470 1.194600e-02 8.420660e-06 0 0 0 1 25 1.3623893 8.205709
10718-3/31/2011 3/31/2011 10718 30 1.62064e+11 28.10 3.0110e+09 30.8800 0 0.9314316 0.009468070 0.0000127133 4.790828e-02 5.441270e-05 0 0 0 1 12 1.0769268 11.619569
10718-6/30/2010 6/30/2010 10718 30 1.62003e+11 320.61 4.2700e+08 29.3100 0 0.9314316 0.040553104 0.0001505890 1.161946e-02 8.190480e-06 0 0 0 1 24 1.1586636 9.233862
10718-6/30/2011 6/30/2011 10718 30 1.75994e+11 37.39 2.9180e+09 24.6900 0 0.9314316 0.012598261 0.0000169164 4.642855e-02 5.273210e-05 0 0 0 1 15 1.3266495 11.378935
10718-9/30/2010 9/30/2010 10718 30 1.66539e+11 51.84 1.7380e+09 28.9900 0 0.9314316 0.006557103 0.0000243491 4.729420e-02 3.333740e-05 0 0 0 1 22 1.0700071 10.522811
10718-9/30/2011 9/30/2011 10718 30 1.90875e+11 24.73 3.2190e+09 26.1300 0 0.9314316 0.008332575 0.0000111887 5.121779e-02 5.817160e-05 0 0 0 1 10 1.2633849 11.208047
52239-12/31/2010 12/31/2010 52239 30 8.48942e+11 28.56 2.0000e+05 168.8400 0 0.9314316 0.003612478 0.0000134145 4.598170e-07 4.808180e-09 0 0 0 1 18 0.9765820 11.501658
52239-12/31/2011 12/31/2011 52239 30 9.11332e+11 149.12 2.0000e+05 168.1600 0 0.9314316 0.050244788 0.0000674667 4.598170e-07 4.808180e-09 0 0 0 1 10 1.2999305 12.361114
52239-12/31/2013 12/31/2013 52239 30 9.38555e+11 176.70 2.0000e+05 127.5600 0 0.9314316 0.387559494 0.0004459600 4.598170e-07 4.808180e-09 0 0 0 0 9 1.6651324 13.117020
52239-12/31/2014 12/31/2014 52239 30 9.11507e+11 204.86 2.0000e+05 177.2600 0 0.9314316 0.250535044 0.0002789490 4.598170e-07 4.808180e-09 0 0 0 0 9 1.2020166 16.016795
52239-3/31/2011 3/31/2011 52239 30 8.80528e+11 241.27 1.7844e+10 170.6300 0 0.9314316 0.081293992 0.0001091580 2.839174e-01 3.224650e-04 0 0 0 1 12 1.0769268 11.619569
52239-3/31/2015 3/31/2015 52239 30 9.15665e+11 245.69 2.0000e+05 163.8500 0 0.9314316 0.179216725 0.0001332250 4.598170e-07 4.808180e-09 0 0 0 0 15 1.1112541 16.466977
52239-6/30/2011 6/30/2011 52239 30 8.83188e+11 328.73 1.7161e+10 131.2700 0 0.9314316 0.110762938 0.0001487280 2.730501e-01 3.101220e-04 0 0 0 1 15 1.3266495 11.378935
52239-6/30/2015 6/30/2015 52239 30 8.59914e+11 270.17 2.0000e+05 167.4400 0 0.9314316 0.197073477 0.0001465000 4.598170e-07 4.808180e-09 0 0 0 0 20 0.8857760 16.984005
52239-9/30/2014 9/30/2014 52239 30 9.23223e+11 225.57 2.0000e+05 158.2100 0 0.9314316 0.275862491 0.0003071490 4.598170e-07 4.808180e-09 0 0 0 0 8 1.3197180 15.106515
From the definition of is.pbalanced:
Balanced data are data for which each individual has the same time periods
As an example:
library(plm)
data("Grunfeld", package = "plm")
is.pbalanced(Grunfeld)
#> [1] TRUE
table(Grunfeld$firm,Grunfeld$year)
#>
#> 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
non.balanced <- Grunfeld[-sample(200,10),]
is.pbalanced(non.balanced)
#> [1] FALSE
table(non.balanced$firm,non.balanced$year)
#> 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 2 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
#> 3 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
#> 4 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
#> 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 7 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 8 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
#> 9 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
#> 10 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
As shown above, the table of a balanced panel dataset doesn't have zeros: the periods are the same for every firm.
You can verify this in the is.pbalanced source code:
is.pbalanced.default <- function(x, y, ...) {
  if (length(x) != length(y)) stop("The length of the two vectors differs\n")
  x <- x[drop = TRUE] # drop unused factor levels so that table
  y <- y[drop = TRUE] # gives only needed combinations
  z <- table(x, y)
  if (any(v <- as.vector(z) == 0L)) {
    balanced <- FALSE # Any zero means FALSE
  } else {
    balanced <- TRUE
  }
  # ... (remainder of the function omitted)
}
The table of the dataset you're using has many zeros, which explains why is.pbalanced(DATA) == FALSE.
It would be useful to provide dput(data) to find out why make.pbalanced doesn't work.
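The warnings above ("NAs introduced by coercion", min/max returning Inf/-Inf) suggest that make.pbalanced() tries to coerce the "month/day/year" factor levels of the time index to numbers and fails. A hedged sketch of a possible remedy (my assumption, not part of the original answer; it assumes the zoo package and that data is the raw data frame) is to build a numeric quarterly index first:
library(plm)
library(zoo)
# convert the "month/day/year" strings to a numeric quarterly index (e.g. 1998.00 for 1998 Q1)
data$Quarter <- as.numeric(as.yearqtr(as.Date(data$Date, format = "%m/%d/%Y")))
DATA <- pdata.frame(data, index = c("Firm", "Quarter"))
DATA <- make.pbalanced(DATA)   # the coercion warnings should no longer appear
is.pbalanced(DATA)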

Order dataframe by colnames

I have a dataframe like this:
G2_ref G10_ref G12_ref G2_alt G10_alt G12_alt
20011953 3 6 0 5 1 5
12677336 0 0 0 1 3 6
20076754 0 3 0 12 16 8
2089670 0 4 0 1 11 9
9456633 0 2 0 3 10 0
468487 0 0 0 0 0 0
And I'm trying to sort the columns so that the final column order is:
G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
I tried: df[, order(colnames(df))]
But I got this order:
G10_alt G10_ref G12_alt G12_ref G2_alt G2_ref
If anyone has any idea, that would be great.
One option would be to extract the numeric part of the column names as well as the suffix at the end, and then order on both:
df[order(as.numeric(gsub("\\D+", "", names(df))),
factor(sub(".*_", "", names(df)), levels = c('ref', 'alt')))]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
#20011953 3 5 6 1 0 5
#12677336 0 1 0 3 0 6
#20076754 0 12 3 16 0 8
#2089670 0 1 4 11 0 9
#9456633 0 3 2 10 0 0
#468487 0 0 0 0 0 0
data
df <- structure(list(G2_ref = c(3L, 0L, 0L, 0L, 0L, 0L), G10_ref = c(6L,
0L, 3L, 4L, 2L, 0L), G12_ref = c(0L, 0L, 0L, 0L, 0L, 0L), G2_alt = c(5L,
1L, 12L, 1L, 3L, 0L), G10_alt = c(1L, 3L, 16L, 11L, 10L, 0L),
G12_alt = c(5L, 6L, 8L, 9L, 0L, 0L)), .Names = c("G2_ref",
"G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt"),
class = "data.frame", row.names = c("20011953",
"12677336", "20076754", "2089670", "9456633", "468487"))
I am guessing your data is from genetics and follows a pretty standard layout: first the ref-allele columns for all variants, followed by the alt-allele columns.
That means we can interleave the column indices from the two halves of the data frame, i.e. build the index c(1, 4, 2, 5, 3, 6) and then subset:
ix <- c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))
ix
# [1] 1 4 2 5 3 6
df1[, ix]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
# 20011953 3 5 6 1 0 5
# 12677336 0 1 0 3 0 6
# 20076754 0 12 3 16 0 8
# 2089670 0 1 4 11 0 9
# 9456633 0 3 2 10 0 0
# 468487 0 0 0 0 0 0
# or all in one line
df1[, c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))]
An easy solution using dplyr:
library(dplyr)
df <- df %>%
  select(G2_ref, G2_alt, G10_ref, G10_alt, G12_ref, G12_alt)
Perhaps this is less complicated code than @akrun's answer, but it is only really suitable when you want to order a small number of columns.
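If there are many such columns, a more programmatic variant of the same idea (a sketch that assumes the names always follow the G<number>_<ref|alt> pattern and a recent dplyr) could be:
library(dplyr)
# compute the column order from the names, then select in that order
ord <- order(as.numeric(gsub("\\D+", "", names(df))),
             match(sub(".*_", "", names(df)), c("ref", "alt")))
df <- df %>% select(all_of(names(df)[ord]))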

Apply function across multiple columns

Please find here a very small subset of a long data.table I am working with
dput(dt)
structure(list(id = 1:15, pnum = c(4298390L, 4298390L, 4298390L,
4298558L, 4298558L, 4298559L, 4298559L, 4299026L, 4299026L, 4299026L,
4299026L, 4300436L, 4300436L, 4303566L, 4303566L), invid = c(15L,
101L, 102L, 103L, 104L, 103L, 104L, 106L, 107L, 108L, 109L, 87L,
111L, 2L, 60L), fid = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
4L, 4L, 4L, 4L, 3L, 3L, 2L, 2L), .Label = c("CORN", "DowCor",
"KIM", "Texas"), class = "factor"), dom_kn = c(1L, 0L, 0L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), prim_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), pat_kn = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), net_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L), age_kn = c(1L,
0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), legclaims = c(5L,
0L, 0L, 2L, 5L, 2L, 5L, 0L, 0L, 0L, 0L, 5L, 0L, 5L, 2L), n_inv = c(3L,
3L, 3L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), .Names = c("id",
"pnum", "invid", "fid", "dom_kn", "prim_kn", "pat_kn", "net_kn",
"age_kn", "legclaims", "n_inv"), class = "data.frame", row.names = c(NA,
-15L))
I am looking to apply a tweaked greater-than comparison across 5 different columns.
Within each pnum (patent), there are multiple invid (inventors). I want to compare the values of the columns dom_kn, prim_kn, pat_kn, net_kn, and age_kn per row, to the values in the other rows with the same pnum. The comparison is simply > and if the value is indeed bigger than the other, one "point" should be attributed.
So for the first row, pnum == 4298390 and invid == 15, you can see the values in the five columns are all 1, while the values for invid == 101 | 102 are all zero. This means that if we individually compare (using >) each value in the first row with the corresponding cells in the second and third rows, the total sum would be 10 points: in every single comparison the value in the first row is bigger, and there are 10 comparisons.
The number of comparisons is by design 5 * (n_inv - 1).
The result I am looking for, for row 1, should then be 10 / 10 = 1.
For pnum == 4298558, the columns net_kn and age_kn both have value 1 in both rows (for invid 103 and 104), so each row should get 0.5 points for those columns (if three inventors had value 1, each would get 0.33 points). The same goes for pnum == 4298559.
For the next pnum == 4299026 all values are zero so every comparison should result in 0 points.
Thus note the difference: There are three different dyadic comparisons
1 > 0 --> assign 1
1 = 1 --> assign 1 / number of positive values in column subset
0 = 0 --> assign 0
Desired result
An extra column result in the data.table with values 1 0 0 0.2 0.8 0.2 0.8 0 0 0 0 1 0 0.8 0.2
Any suggestions on how to compute this efficiently?
Thanks!
library(data.table)
setDT(dt)   # the dput above gives a data.frame, so convert it to a data.table first
vars = grep('_kn', names(dt), value = T)
# all you need to do is simply assign the correct weight and sum the numbers up
dt[, res := 0]
for (var in vars)
dt[, res := res + get(var) / .N, by = c('pnum', var)]
# normalize
dt[, res := res/sum(res), by = pnum]
# id pnum invid fid dom_kn prim_kn pat_kn net_kn age_kn legclaims n_inv res
# 1: 1 4298390 15 CORN 1 1 1 1 1 5 3 1.0
# 2: 2 4298390 101 CORN 0 0 0 0 0 0 3 0.0
# 3: 3 4298390 102 CORN 0 0 0 0 0 0 3 0.0
# 4: 4 4298558 103 DowCor 0 0 0 1 1 2 2 0.2
# 5: 5 4298558 104 DowCor 1 1 1 1 1 5 2 0.8
# 6: 6 4298559 103 DowCor 0 0 0 1 1 2 2 0.2
# 7: 7 4298559 104 DowCor 1 1 1 1 1 5 2 0.8
# 8: 8 4299026 106 Texas 0 0 0 0 0 0 4 NaN
# 9: 9 4299026 107 Texas 0 0 0 0 0 0 4 NaN
#10: 10 4299026 108 Texas 0 0 0 0 0 0 4 NaN
#11: 11 4299026 109 Texas 0 0 0 0 0 0 4 NaN
#12: 12 4300436 87 KIM 1 1 1 1 1 5 2 1.0
#13: 13 4300436 111 KIM 0 0 0 0 0 0 2 0.0
#14: 14 4303566 2 DowCor 1 1 1 1 1 5 2 0.8
#15: 15 4303566 60 DowCor 1 0 0 1 0 2 2 0.2
Dealing with the above NaN case (arguably the correct answer) is left to the reader.
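If you would rather report those groups as 0 instead of NaN, one possible follow-up (my addition, not part of the answer above) is:
dt[is.nan(res), res := 0]   # replace NaN results for groups with no winners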
Here's a fastish solution using dplyr:
library(dplyr)
dt %>%
group_by(pnum) %>% # group by pnum
mutate_each(funs(. == max(.) & max(.) != 0), ends_with('kn')) %>%
#give a 1 if the value is the max, and not 0. Only for the column with kn
mutate_each(funs(. / sum(.)) , ends_with('kn')) %>%
#correct for multiple maximums
select(ends_with('kn')) %>%
#remove all non kn columns
do(data.frame(x = rowSums(.[-1]), y = sum(.[-1]))) %>%
#make a new data frame with x = the row sums for each individual
# and y the column total over the whole group
mutate(out = x/y)
#divide by y (we could just use /5 if we always have five columns)
giving your desired output in the column out:
Source: local data frame [15 x 4]
Groups: pnum [6]
pnum x y out
(int) (dbl) (dbl) (dbl)
1 4298390 5 5 1.0
2 4298390 0 5 0.0
3 4298390 0 5 0.0
4 4298558 1 5 0.2
5 4298558 4 5 0.8
6 4298559 1 5 0.2
7 4298559 4 5 0.8
8 4299026 NaN NaN NaN
9 4299026 NaN NaN NaN
10 4299026 NaN NaN NaN
11 4299026 NaN NaN NaN
12 4300436 5 5 1.0
13 4300436 0 5 0.0
14 4303566 4 5 0.8
15 4303566 1 5 0.2
The NaNs come from the groups with no winners; convert them back to zero using e.g.:
x[is.na(x)] <- 0

Group 15 columns by 1 factor column, then sum each of the 15 individually

I don't know the vocabulary for this, otherwise I am sure I would be able to search for it effectively. So far I have not found anything and I am running out of time.
I have 16 columns of information: 1 of them is a factor column (we'll assume dates), and the other 15 are hour columns (6 am - 8 pm, representing the hour only) containing either a 1 or a 0, representing active or inactive state. What I want to do is:
Group the data by the factor column (Dates)
After everything is grouped, individually sum each of the 15 columns per grouping
Display a 2-dimensional table with the dates running vertically and the hour sums running horizontally
If you can help, please use the proper vocabulary so I can not only learn it myself, but also look up documentation and teach it to others.
An example would be
Date Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 0 0 0 1 1 ... 0
9-15 0 1 1 1 1 ... 0
9-16 0 1 1 1 0 ... 0
9-16 0 0 0 0 0 ... 1
9-16 1 1 0 0 0 ... 1
9-18 0 1 0 1 1 ... 0
.
.
.
11-7 0 1 1 1 0 ... 0
What I want is
Hour1 Hour2 Hour3 Hour4 Hour5 ... Hour15
9-15 5 10 15 25 45 ... 20
9-16 5 6 25 28 15 ... 11
9-17 3 45 42 6 17 ... 32
9-18 5 10 15 25 45 ... 20
.
.
.
11-7 12 36 84 9 7 ... 21
where each entry is the sum of that column within the date group rather than a 1-or-0 frequency count.
You can do that quite easily with dplyr - first group by column "Date", then summarise each of the other columns with sum:
require(dplyr)
df %>%
group_by(Date) %>%
summarise_each(funs(sum))
#Source: local data frame [4 x 7]
#
# Date Hour1 Hour2 Hour3 Hour4 Hour5 Hour15
#1 11-7 0 1 1 1 0 0
#2 9-15 0 1 1 2 2 0
#3 9-16 1 2 1 1 0 2
#4 9-18 0 1 0 1 1 0
data
df <- structure(list(Date = structure(c(2L, 2L, 3L, 3L, 3L, 4L, 1L), .Label = c("11-7",
"9-15", "9-16", "9-18"), class = "factor"), Hour1 = c(0L, 0L,
0L, 0L, 1L, 0L, 0L), Hour2 = c(0L, 1L, 1L, 0L, 1L, 1L, 1L), Hour3 = c(0L,
1L, 1L, 0L, 0L, 0L, 1L), Hour4 = c(1L, 1L, 1L, 0L, 0L, 1L, 1L
), Hour5 = c(1L, 1L, 0L, 0L, 0L, 1L, 0L), Hour15 = c(0L, 0L,
0L, 1L, 1L, 0L, 0L)), .Names = c("Date", "Hour1", "Hour2", "Hour3",
"Hour4", "Hour5", "Hour15"), class = "data.frame", row.names = c(NA,
-7L))
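Note that summarise_each() and funs() have since been superseded in dplyr. On recent versions (1.0+), the same group-and-sum can be written with across(); a small equivalent sketch:
library(dplyr)
df %>%
  group_by(Date) %>%
  summarise(across(starts_with("Hour"), sum))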

R summaryBy or other summary method

I am trying to create a summary table and am having a mental hang-up. Essentially, what I think I want is a summaryBy statement that computes colSums for the subsets of ALL columns except the factor I am summarizing on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was the plyr function ddply:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the return values represent, but they do not represent the colSums
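As an aside, a likely explanation for those identical rows: the anonymous function ignores its argument x and calls colSums() on the full GO_matrix_MF, so every group gets the grand column totals. A hedged sketch of the corrected ddply call (same objects as above) would be:
library(plyr)
# sum the columns of the group subset x, not of the whole data frame
res <- ddply(GO_matrix_MF, groupColumns, function(x) colSums(x[dataColumns]))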
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
foo %>%
group_by(Cluster) %>%
summarise_each(funs(sum))
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1
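For completeness, base R's rowsum() computes within-group column sums directly; a small sketch using the modified data foo from above:
# group-wise column sums without any extra packages
rowsum(foo[, -1], group = foo$Cluster)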
