R quick way to write multiple lines of code with slight variations - r

I am working on a project in R which is fairly code heavy at least compared to my previous R projects. The code is using multiple ifelse statements on previous columns data then creating a new column with the results. As the data I am using is a 5 minute timeframe, therefore I have to write a new line of code for every 5 minute slice of time. The data I have is from 09:30 to 16:00 so that is a lot of lines of code, around 75 by my calculations. Example of my data;
Date Open High Low Close doy
1 2015-09-21 09:30:00 164.6700 164.7100 164.3700 164.5300 264
2 2015-09-21 09:35:00 164.5300 164.9000 164.5300 164.6400 264
3 2015-09-21 09:40:00 164.6600 164.8900 164.6000 164.8900 264
4 2015-09-21 09:45:00 164.9100 165.0900 164.9100 164.9736 264
5 2015-09-21 09:50:00 164.9399 165.0980 164.8200 164.8200 264
This data is then filtered onto a table like this;
data <- structure(list(doy = c(264, 265, 266, 267, 268, 271, 272, 11,12, 13), Date = structure(c(1442824200, 1442910600, 1442997000,1443083400, 1443169800, 1443429000, 1443515400, 1452504600, 1452591000,1452677400), class = c("POSIXct", "POSIXt"), tzone = ""), Or_High = c(164.71,162.96, 163.38, 161.37, 163.91, 162.06, 160.22, 164.5, 165.23,165.84), OR_Low = c(164.37, 162.62, 162.98, 161.06, 163.57, 161.66,159.7, 164.06, 164.84, 165.4), HOD = c(165.56, 163.36, 163.38,162.24, 164.43, 162.06, 160.96, 164.5, 165.78, 165.84), LOD = c(165.22,163.1, 162.98, 161.95, 164.24, 161.66, 160.75, 164.06, 165.56,165.4), Close = c(164.92, 163.02, 162.58, 161.85, 162.94, 159.84,160.19, 163.83, 165.02, 161.38), Range = c(0.340000000000003,0.260000000000019, 0.400000000000006, 0.29000000000002, 0.189999999999998,0.400000000000006, 0.210000000000008, 0.439999999999998, 0.219999999999999,0.439999999999998), `A-val` = c(NA, NA, NA, NA, NA, NA, NA, 0.0673439999999994,0.0659639999999996, 0.0729499999999996), `A-up` = c(NA, NA, NA,NA, NA, NA, NA, 164.567344, 165.295964, 165.91295), `A-down` = c(NA,NA, NA, NA, NA, NA, NA, 163.992656, 164.774036, 165.32705), `09:35` = structure(c(NA,NA, NA, NA, NA, NA, NA, 0, 0, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `09:40` = structure(c(NA, NA, NA, NA, NA,NA, NA, -1, 1, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL,"Low")), `09:45` = structure(c(NA, NA, NA, NA, NA, NA, NA,0, 1, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")),`09:50` = structure(c(NA, NA, NA, NA, NA, NA, NA, -1, 1,0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `09:55` = structure(c(NA,NA, NA, NA, NA, NA, NA, -1, 0, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:00` = structure(c(NA, NA, NA, NA,NA, NA, NA, -1, 0, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:05` = structure(c(NA, NA, NA, NA,NA, NA, NA, -1, 0, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:10` = structure(c(NA, NA, NA, NA,NA, NA, NA, -1, 0, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:15` = structure(c(NA, NA, NA, NA,NA, NA, NA, -2, 0, -1), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:20` = structure(c(NA, NA, NA, NA,NA, NA, NA, 0, 0, -1), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:25` = structure(c(NA, NA, NA, NA,NA, NA, NA, -2, -1, -1), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:30` = structure(c(NA, NA, NA, NA,NA, NA, NA, 0, 0, -1), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:35` = structure(c(NA, NA, NA, NA,NA, NA, NA, 0, 0, -1), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:40` = structure(c(NA, NA, NA, NA,NA, NA, NA, 0, -1, -2), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:45` = structure(c(NA, NA, NA, NA,NA, NA, NA, 0, -1, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:50` = structure(c(NA, NA, NA, NA,NA, NA, NA, -1, -1, -2), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low")), `10:55` = structure(c(NA, NA, NA, NA,NA, NA, NA, -1, -1, 0), .Dim = c(10L, 1L), .Dimnames = list(NULL, "Low"))), .Names = c("doy", "Date", "Or_High","OR_Low", "HOD", "LOD", "Close", "Range", "A-val", "A-up", "A-down","09:35", "09:40", "09:45", "09:50", "09:55", "10:00", "10:05","10:10", "10:15", "10:20", "10:25", "10:30", "10:35", "10:40","10:45", "10:50", "10:55"), row.names = c(1L, 2L, 3L, 4L, 5L,6L, 7L, 78L, 79L, 80L), class = "data.frame")
This is what the lines of code looks like;
data[,14] <- ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 45) %>% select(Low) > data[,10], 1, ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 45) %>% select(High) < data[,11], -1, 0))
Then the next line of code would look like;
data[,15] <- ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 50) %>% select(Low) > data[,10], 1, ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 50) %>% select(High) < data[,11], -1, 0))
And the next like this etc;
data[,16] <- ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 55) %>% select(Low) > data[,10], 1, ifelse(df %>% filter(hour(Date) == 09 & minute(Date) == 55) %>% select(High) < data[,11], -1, 0))
As you can see with each new line of code only certain parts of the code are changed, such as the hours, minutes and column references for summing. Perhaps the below example will make it clearer.
Example;
colnames(data)[14] <- "09:45"
colnames(data)[15] <- "09:50"
colnames(data)[16] <- "09:55"
colnames(data)[17] <- "10:00"
colnames(data)[18] <- "10:05"
In this code would there be anyway to change the [#col ref#] and times without individually changing each line of code by hand? I realise that copy and paste can be used with notepad but that still means having write the individual changes. My main concern is not about the time taken to write this but moreover the risk of errors from human input.
If anyone has any tips or tricks as to how this can be done, or another way of achieving the same without using multiple if statements on the structure of my existing code I would be most grateful for your help. This question is related to previous question I posted here and may add clarity for what I am trying to achieve.
Thanks.

As vanao veneri mentioned it is better to use a text editor for writing bulk code quickly.
I found that Sublime 3 with Text Pastry add-on did exactly what I needed using the insert nuns command.
Thanks for the help.

Related

How to visualize a correlation of one variable to many using the corrplot package in r

I am trying to visualize the correlation of one variable (cl_wet) on 16 other variables in R using the corrplot package. My data does contain a fair amount of NA values, but I have been able to omit them in my correlation code I have used before. It just won't work for the visualization:
Here is the code I have been running:
liqcor <- cor(x = liquid.wet$cl_wet, y = liquid.wet[2:17], use = "complete.obs")
corrplot(liqcor)
And I have been receiving these errors:
Error in symbols(Pos, add = TRUE, inches = FALSE, rectangles = matrix(1, :
invalid symbol coordinates
Warning in min(corr, na.rm = TRUE) :
no non-missing arguments to min; returning Inf
Warning in max(corr, na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
Error in symbols(Pos, add = TRUE, inches = FALSE, rectangles = matrix(1, :
invalid symbol coordinates
Any advice is appreciated! Below is a sample of my code for reproducibility
liquid.dataset <-
structure(
list(
cl_wet = c(
0.15738,
0.07897,
0.21313,
0.20552,
0.21005,
0.3,
0.30583,
0.29432,
0.22091,
0.14322,
0.17247,
0.29264,
0.12911,
0.2439,
0.32264,
0.333,
0.4097,
0.1386,
0.25436,
0.52432,
0.44101,
0.20917,
0.14436,
0.17538,
0.13455
),
Moisture = c(
95,
98,
95,
96,
95,
93,
89,
91,
88,
96,
96,
93,
96,
91,
89,
92,
88,
NA,
NA,
89,
89,
96,
96,
96,
97
),
Dry.matter = c(
5L,
2L,
5L,
4L,
5L,
7L,
11L,
9L,
12L,
4L,
4L,
7L,
4L,
9L,
11L,
8L,
12L,
NA,
NA,
11L,
11L,
4L,
4L,
4L,
3L
),
TN = c(
0.530443645,
0.28263789,
0.512529976,
0.497601918,
0.491630695,
0.666786571,
0.745407674,
0.723513189,
0.980275779,
0.617026379,
0.330407674,
0.719532374,
0.447841727,
0.768297362,
1.062877698,
0.91558753,
1.188273381,
NA,
NA,
0.878764988,
0.860851319,
0.468741007,
0.436894484,
0.371211031,
0.301546763
),
P2O5 = c(
0.179856115,
0.082733813,
0.179856115,
0.191846523,
0.191846523,
0.167865707,
0.179856115,
0.20383693,
0.383693046,
0.167865707,
0.101918465,
0.287769784,
0.143884892,
0.251798561,
0.419664269,
0.35971223,
0.575539568,
NA,
NA,
0.323741007,
0.35971223,
0.116306954,
0.143884892,
0.131894484,
0.10911271
),
K2O = c(
0.275779376,
0.1558753,
0.347721823,
0.323741007,
0.335731415,
0.419664269,
0.431654676,
0.431654676,
0.503597122,
0.347721823,
0.275779376,
0.455635492,
0.251798561,
0.419664269,
0.575539568,
0.575539568,
0.815347722,
NA,
NA,
0.539568345,
0.551558753,
0.383693046,
0.239808153,
0.227817746,
0.287769784
),
Na = c(
0.065947242,
0.037170264,
0.088729017,
0.082733813,
0.085131894,
NA,
NA,
NA,
NA,
NA,
0.082733813,
NA,
0.049160671,
0.100719424,
0.179856115,
0.179856115,
0.251798561,
NA,
NA,
0.143884892,
0.143884892,
0.088729017,
0.053956835,
0.076738609,
0.088729017
),
Ca = c(
0.083932854,
0.049160671,
0.075539568,
0.087529976,
0.086330935,
NA,
NA,
NA,
NA,
NA,
0.080335731,
NA,
0.085131894,
0.09352518,
0.131894484,
0.131894484,
0.167865707,
NA,
NA,
0.112709832,
0.117505995,
0.052757794,
0.073141487,
0.086330935,
0.083932854
),
Mg = c(
0.059952038,
0.034772182,
0.07793765,
0.081534772,
0.081534772,
NA,
NA,
NA,
NA,
NA,
0.064748201,
NA,
0.050359712,
0.098321343,
0.119904077,
0.10911271,
0.167865707,
NA,
NA,
0.215827338,
0.227817746,
0.047961631,
0.056354916,
0.080335731,
0.061151079
),
Zn = c(
0.028776978,
0.013189448,
0.044364508,
0.044364508,
0.045563549,
NA,
NA,
NA,
NA,
NA,
0.004916067,
NA,
0.029976019,
0.007074341,
0.007553957,
0.006235012,
0.009952038,
NA,
NA,
0.006594724,
0.006714628,
0.013189448,
0.023980815,
0.005755396,
0.001115108
),
Fe = c(
0.008992806,
0.004916067,
0.023980815,
0.025179856,
0.026378897,
NA,
NA,
NA,
NA,
NA,
0.01558753,
NA,
0.010551559,
0.00911271,
0.017985612,
0.017985612,
0.022781775,
NA,
NA,
0.013189448,
0.01558753,
0.005635492,
0.008513189,
0.014388489,
0.002877698
),
Mn = c(
0.001918465,
0.000791367,
0.003597122,
0.003717026,
0.00383693,
NA,
NA,
NA,
NA,
NA,
0.001558753,
NA,
0.001798561,
0.001798561,
0.003117506,
0.00323741,
0.003956835,
NA,
NA,
0.003956835,
0.004316547,
0.001318945,
0.001558753,
0.001438849,
0.000683453
),
Cu = c(
0.003117506,
0.002398082,
0.007913669,
0.008393285,
0.008633094,
NA,
NA,
NA,
NA,
NA,
0.001558753,
NA,
0.002517986,
0.006354916,
0.003357314,
0.002278177,
0.004796163,
NA,
NA,
0.001558753,
0.001558753,
0.001438849,
0.00383693,
0.002398082,
0.000959233
),
S = c(
0.071942446,
0.028776978,
0.106714628,
0.107913669,
0.107913669,
NA,
NA,
NA,
NA,
NA,
0.028776978,
NA,
0.037170264,
0.068345324,
0.131894484,
0.131894484,
0.179856115,
NA,
NA,
0.095923261,
0.094724221,
0.081534772,
0.061151079,
0.034772182,
0.023980815
),
NH3 = c(
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
0.851318945,
0.731414868,
0.923261391,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA
),
TC = c(
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_,
NA_real_
),
NO3 = c(
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
0.001318945,
7.79e-05,
3.6e-05,
NA,
NA,
NA,
NA,
NA,
NA,
NA,
NA
)
),
row.names = c(NA, 25L),
class = "data.frame"
)
Since you have missing values and the number of missing values are not equal between your features, it is not recommended to use 'comple.obs' option within the cor() function because every row containing missing value will be completely removed and you will loose information for some pairwise comparisons that are not missing indeed. Instead, better to use 'pairwise.comlpete.obs' option where missing values will be removed only a particular pairwise correlation. Thus the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables.
#calculate correlation
liqcor <- cor(x = liquid.dataset$cl_wet, y =liquid.dataset[2:17], use = "pairwise.complete.obs")
# plot correlation matrix
library(corrplot)
corrplot(liqcor
,addgrid.col = T
,type = 'upper'
,addCoef.col = T
,number.cex = .7
,diag = T
,tl.cex = .9)

transpose from one variable under another in R

here example of my data
mydat=structure(list(ADR.N.14.0 = c(8140010250001, 8140010250002),
NOMYAR.N.16.6 = c(1, 1), KOFPOR1.N.16.6 = c(7, 10), POR1.C.254 = c("о",
"BB"), VOZPOR1.N.16.6 = c(80, 45), VYSPOR1.N.16.6 = c(24,
17), DEMPOR1.N.16.6 = c(36, 16), POLNOT1.N.16.6 = c(0.6,
0.9), ZAPZAH1.N.16.6 = c(210, 160), NOMYAR2.N.16.6 = c(1,
1), KOFSOCT2.N.16.6 = c(3, 0), POR2.C.254 = c("BB", "о"),
VOZPOR2.N.16.6 = c(70, 45), VYSPOR2.N.16.6 = c(22, 17), DEMPOR2.N.16.6 = c(26,
22), POLNOT2.N.16.6 = c(0, 0), ZAPZAH2.N.16.6 = c(0, 0)), class = "data.frame", row.names = c(NA,
-2L))
how for each value of ADR,N,14,0move data from one variable under another.
To be more clear
here variables with prefix1
NOMYAR,N,16,6 KOFPOR**1**,N,16,6 POR**1**,C,254 VOZPOR**1**,N,16,6 VYSPOR**1**,N,16,6 DEMPOR**1**,N,16,6 POLNOT**1**,N,16,6 ZAPZAH**1**,N,16,6
and near rows with prefix2
NOMYAR**2**,N,16,6 KOFPOR**2**,N,16,6 POR**2**,C,254 VOZPOR**2**,N,16,6 VYSPOR**1**,N,16,6 DEMPOR**2**,N,16,6 POLNOT**2**,N,16,6 ZAPZAH**2**,N,16,6
so i need that for for ADR,N,14,0 =8140010250001
the content of the fields with the prefix 2 was under the content of the fields with the prefix 1
like this
result=structure(list(ADR.N.14.0 = c(8140010250001, 8140010250001, 8140010250002,
8140010250002, NA, NA, NA, NA, NA, NA), NOMYAR.N.16.6 = c(1,
1, 1, 1, NA, NA, NA, NA, NA, NA), KOFPOR1.N.16.6 = c(7, 3, 10,
0, NA, NA, NA, NA, NA, NA), POR1.C.254 = c("о", "BB", "BB", "о",
"", "", "", "", "", ""), VOZPOR1.N.16.6 = c(80, 70, 45, 45, NA,
NA, NA, NA, NA, NA), VYSPOR1.N.16.6 = c(24, 22, 17, 17, NA, NA,
NA, NA, NA, NA), DEMPOR1.N.16.6 = c(36, 26, 16, 22, NA, NA, NA,
NA, NA, NA), POLNOT1.N.16.6 = c(0.6, 0, 0.9, 0, NA, NA, NA, NA,
NA, NA), ZAPZAH1.N.16.6 = c(210, 0, 160, 0, NA, NA, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -10L))
How can i do such transpose?
You can use pivot_longer and specify names_pattern to include pattern of names that you want together.
tidyr::pivot_longer(mydat, cols = -ADR.N.14.0,
names_to = c('.value'),
names_pattern = '(.*?)\\d?\\..*')
# ADR.N.14.0 NOMYAR KOFPOR POR VOZPOR VYSPOR DEMPOR POLNOT ZAPZAH KOFSOCT
# <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 8140010250001 1 7 о 80 24 36 0.6 210 3
#2 8140010250001 1 NA BB 70 22 26 0 0 NA
#3 8140010250002 1 10 BB 45 17 16 0.9 160 0
#4 8140010250002 1 NA о 45 17 22 0 0 NA

How to create sub data frames for each row with data before and after a row

I've got a data like below:
ex <- structure(list(timestamp = structure(c(1502480763.554, 1502480763.554,
1502480764.968, 1502480765.554, 1502480768.554, 1502480770.554,
1502480773.519, 1502480775.72, 1502480777.43, 1502480778.278,
1502480778.288, 1502480778.759, 1502480780.472, 1502480782.815,
1502480785.521, 1502480785.531, 1502480785.707, 1502480787.639,
1502480789.1, 1502480790.682, 1502480791.554, 1502480793.322,
1502480794.363, 1502480795.923, 1502480799.239, 1502480800.27,
1502480800.554, 1502480802.554, 1502480805.63, 1502480805.959,
1502480807.327, 1502480809.554, 1502480809.564, 1502480810.554,
1502480812.8, 1502480813.838, 1502480813.848, 1502480816.24,
1502480816.24, 1502480835.56, 1502480838.576, 1502480848.384,
1502480851.859, 1502480853.554, 1502480856.375, 1502480857.688,
1502480905.554, 1502480910.554, 1502480910.945, 1502480911.816
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), order = c(NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 2L, 2L, 2L, 2L, NA, NA, NA,
3L, NA, 4L, 4L, 4L, 4L, 4L, NA, 5L, 5L, 5L, 6L, 6L, 6L, NA, NA,
NA, NA, NA, 7L, NA, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 10L,
10L), cat = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 99, 99, 1, 1, 1, 99,
99, 21, 1, 1, 1, 94, 1, 1, 1, 1, 1, 1, 1, 94, 1, 1, 99, 99, 1,
61, 10, 3, 4, 4, 1, 1, 1, 1, 1, 1, 16, 1, 1, 13, 94), var1 = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L,
0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L,
0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L), var2 = c(NA, NA, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, NA, NA, 0.9,
0.9, 0.9, 0.9, NA, NA, NA, NA, NA, 5.3, 5.3, 5.3, 5.3, 5.3, NA,
8.6, 8.6, 8.6, 14.5, 14.5, 14.5, NA, NA, NA, NA, NA, 7.4, NA,
7.4, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 4.6, 4.6, -4.1, -4.1),
var3 = c(NA, NA, 35.8, 59.3, 51.3, 57.3, 77.5, 82.4, 41.6,
NA, NA, 66.8, 53, 77.1, NA, NA, 55.8, 81.4, 45.8, 37.9, NA,
38.5, 32, 72, 46.9, 76.4, 76.9, 88, NA, 11.7, 49.4, NA, NA,
64.1, NA, NA, NA, NA, NA, 72.5, 77.7, 83.3, 96.4, 83.3, 95.3,
NA, 69.8, 78.9, NA, NA), var4 = c(NA, NA, 26.6, 24, 9.7,
12.7, 21, 12.7, 9.7, NA, NA, 14, 20.3, 25.6, NA, NA, 18.6,
25.3, 15.7, 10.7, NA, 12.8, 8, 41.9, 12.8, 8.5, 10.2, 14.3,
NA, 19.3, 40, NA, NA, 1.2, NA, NA, NA, NA, NA, 10, 21.9,
19, 42, 11.8, 18.4, NA, 33.5, 3.7, NA, NA), var5 = c(NA,
NA, 2.8, 5.2, 2.3, 4.4, -0.9, 0.3, -0.8, NA, NA, 1.3, 1.5,
5.2, NA, NA, -0.7, -0.9, -0.3, 2.8, NA, 0.3, 1.8, 5.3, -0.9,
4.9, 0.9, 4.8, NA, 1.6, -0.8, NA, NA, -0.7, NA, NA, NA, NA,
NA, 0.4, 0.4, 2.2, 4.2, 1.5, -0.1, NA, 0.3, 1.8, NA, NA),
var6 = c(NA, NA, NA, NA, NA, TRUE, NA, NA, TRUE, NA, NA,
TRUE, TRUE, NA, NA, NA, NA, NA, TRUE, TRUE, NA, NA, NA, NA,
TRUE, TRUE, NA, NA, NA, NA, NA, NA, NA, TRUE, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -50L))
Within the same values of column order I need to create (for each row) two nested sub-dataframes - one with data before and one with data in this row and after. So let's take for example a block of data where order == 1:
ex %>% filter(order == 1) %>% print()
# A tibble: 6 x 9
timestamp order cat var1 var2 var3 var4 var5 var6
<dttm> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <lgl>
1 2017-08-11 19:46:04 1 1 1 2.5 35.8 26.6 2.8 NA
2 2017-08-11 19:46:05 1 1 1 2.5 59.3 24 5.20 NA
3 2017-08-11 19:46:08 1 1 1 2.5 51.3 9.7 2.3 NA
4 2017-08-11 19:46:10 1 1 1 2.5 57.3 12.7 4.40 TRUE
5 2017-08-11 19:46:13 1 1 1 2.5 77.5 21 -0.9 NA
6 2017-08-11 19:46:15 1 1 0 2.5 82.4 12.7 0.300 NA
I need two additional columns with nested data frames: data_before and data_after. For first row data_before would be empty and data_after would contain all the rows. For second row, data_before would contain only first row and data_after would contain rows from 2 to 6. For third row, data_before would contain first two rows and data_after would contains rows from 3 to 6 and so on... Such an operation need to be performed for every value of order in original data frame. How it can be accomplished?
Expected output for one block of data (with order == 1) would be:
structure(list(order = c(1, 1, 1, 1, 1, 1), data_before = list(
structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame"),
structure(list(timestamp = structure(1502480764.968, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = 1, var1 = 1L, var2 = 2.5,
var3 = 35.8, var4 = 26.6, var5 = 2.8, var6 = NA), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L)), structure(list(
timestamp = structure(c(1502480764.968, 1502480765.554
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(1,
1), var1 = c(1L, 1L), var2 = c(2.5, 2.5), var3 = c(35.8,
59.3), var4 = c(26.6, 24), var5 = c(2.8, 5.2), var6 = c(NA,
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L)), structure(list(timestamp = structure(c(1502480764.968,
1502480765.554, 1502480768.554), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), cat = c(1, 1, 1), var1 = c(1L, 1L, 1L),
var2 = c(2.5, 2.5, 2.5), var3 = c(35.8, 59.3, 51.3),
var4 = c(26.6, 24, 9.7), var5 = c(2.8, 5.2, 2.3), var6 = c(NA,
NA, NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L)), structure(list(timestamp = structure(c(1502480764.968,
1502480765.554, 1502480768.554, 1502480770.554), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = c(1, 1, 1, 1), var1 = c(1L,
1L, 1L, 1L), var2 = c(2.5, 2.5, 2.5, 2.5), var3 = c(35.8,
59.3, 51.3, 57.3), var4 = c(26.6, 24, 9.7, 12.7), var5 = c(2.8,
5.2, 2.3, 4.4), var6 = c(NA, NA, NA, TRUE)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
timestamp = structure(c(1502480764.968, 1502480765.554,
1502480768.554, 1502480770.554, 1502480773.519), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = c(1, 1, 1, 1, 1), var1 = c(1L,
1L, 1L, 1L, 1L), var2 = c(2.5, 2.5, 2.5, 2.5, 2.5), var3 = c(35.8,
59.3, 51.3, 57.3, 77.5), var4 = c(26.6, 24, 9.7, 12.7,
21), var5 = c(2.8, 5.2, 2.3, 4.4, -0.9), var6 = c(NA,
NA, NA, TRUE, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))), data_after = list(structure(list(
timestamp = structure(c(1502480764.968, 1502480765.554, 1502480768.554,
1502480770.554, 1502480773.519, 1502480775.72), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = c(1, 1, 1, 1, 1, 1), var1 = c(1L,
1L, 1L, 1L, 1L, 0L), var2 = c(2.5, 2.5, 2.5, 2.5, 2.5, 2.5
), var3 = c(35.8, 59.3, 51.3, 57.3, 77.5, 82.4), var4 = c(26.6,
24, 9.7, 12.7, 21, 12.7), var5 = c(2.8, 5.2, 2.3, 4.4, -0.9,
0.3), var6 = c(NA, NA, NA, TRUE, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L)), structure(list(
timestamp = structure(c(1502480765.554, 1502480768.554, 1502480770.554,
1502480773.519, 1502480775.72), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), cat = c(1, 1, 1, 1, 1), var1 = c(1L, 1L,
1L, 1L, 0L), var2 = c(2.5, 2.5, 2.5, 2.5, 2.5), var3 = c(59.3,
51.3, 57.3, 77.5, 82.4), var4 = c(24, 9.7, 12.7, 21, 12.7
), var5 = c(5.2, 2.3, 4.4, -0.9, 0.3), var6 = c(NA, NA, TRUE,
NA, NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L)), structure(list(timestamp = structure(c(1502480768.554,
1502480770.554, 1502480773.519, 1502480775.72), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = c(1, 1, 1, 1), var1 = c(1L,
1L, 1L, 0L), var2 = c(2.5, 2.5, 2.5, 2.5), var3 = c(51.3, 57.3,
77.5, 82.4), var4 = c(9.7, 12.7, 21, 12.7), var5 = c(2.3, 4.4,
-0.9, 0.3), var6 = c(NA, TRUE, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
timestamp = structure(c(1502480770.554, 1502480773.519, 1502480775.72
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(1,
1, 1), var1 = c(1L, 1L, 0L), var2 = c(2.5, 2.5, 2.5), var3 = c(57.3,
77.5, 82.4), var4 = c(12.7, 21, 12.7), var5 = c(4.4, -0.9,
0.3), var6 = c(TRUE, NA, NA)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -3L)), structure(list(timestamp = structure(c(1502480773.519,
1502480775.72), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
cat = c(1, 1), var1 = 1:0, var2 = c(2.5, 2.5), var3 = c(77.5,
82.4), var4 = c(21, 12.7), var5 = c(-0.9, 0.3), var6 = c(NA,
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L)), structure(list(timestamp = structure(1502480775.72, class = c("POSIXct",
"POSIXt"), tzone = "UTC"), cat = 1, var1 = 0L, var2 = 2.5, var3 = 82.4,
var4 = 12.7, var5 = 0.3, var6 = NA), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L)))), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -6L))
Check this:
library(tidyverse)
slice_dataframe <- function(r, ord = 1) {
tibble("order" = ord,
"data_before" = list(slice(ex, row_number() <= (r - ord))),
"data_after" = list(slice(ex, row_number() >= (r + ord))))
}
map_df(1:nrow(ex), slice_dataframe)
Or this:
ex.list <- lapply(split(ex, ex$order), function(x){
ex.x <- as.data.frame(do.call(rbind,
lapply(1:nrow(x), function(i){
c(x$order[i], ifelse(i==1, list(data.frame()), list(x[1:(i-1), ])), list(x[i:nrow(x), ]))
})
))
names(ex.x) <- c('order', 'data_before', 'data_after')
ex.x
})
Edit:
Trying to give some more explanation to the code posted before:
# lapply() applies a function (input arg 2) to each element of a list (input arg 1)
# and returns a list of return values of the function applied on each input element
ex.list <- lapply(
# the split() function returns a list of data.frames, subsets of ex
# splitted by ex$order; these will be the input for the 1. lapply() call
split(ex, ex$order),
# the following function will be applied to each of these data.farmes
# to create the return values
function(x){ # 'x' will be a data.frame, subset ox 'ex' with one single value of ex$order
list.of.rows <- lapply(# we now loop over each row in the data.frame
# containing data with one single value of ex$order,
# 'i' is the row number
1:nrow(x),
# the functions will create 1 row for the resulting data.frame
function(i){
c(# the row is 1 vector containing the following 3 values
# the first column of the putput data.frame is the value of ex$order
x$order[i],
# the value for row i of data_before
ifelse(i==1, list(data.frame()), list(x[1:(i-1), ])), # for the first row we return an empty list, else the data.frame with previous (1:(i-1)) rows
# the values for row i of data_after
list(x[i:nrow(x), ]) # subset of rows as off row i
)
})
# now that we have a list (list.of.rows) that contains one row for the output data.frame
# we rbind these into one data.frame
ex.x <- as.data.frame(do.call(rbind, # do.call(rbind, ...) cobines elements of ... using rbind()
list.of.rows
))
names(ex.x) <- c('order', 'data_before', 'data_after') # give column names to the output data.frame
ex.x # define the return value of the function of the 1. lapply() call
})
Using tidyverse we can split on order and for each dataframe create two new columns data_before and data_after which would contain a list of dataframes based on the conditions.
library(tidyverse)
ex %>%
group_split(order) %>%
map_dfr(. %>%
mutate(data_before = map(seq_len(nrow(.)), function(y) .[seq_len(y - 1), ]),
data_after = map(seq_len(nrow(.)), function(y)
if (y == nrow(.)) .[0,] else .[(y + 1):nrow(.), ]))) %>%
select(order, data_before, data_after)
# A tibble: 50 x 3
# order data_before data_after
# <int> <list> <list>
# 1 1 <tibble [0 × 9]> <tibble [5 × 9]>
# 2 1 <tibble [1 × 9]> <tibble [4 × 9]>
# 3 1 <tibble [2 × 9]> <tibble [3 × 9]>
# 4 1 <tibble [3 × 9]> <tibble [2 × 9]>
# 5 1 <tibble [4 × 9]> <tibble [1 × 9]>
# 6 1 <tibble [5 × 9]> <tibble [0 × 9]>
# 7 2 <tibble [0 × 9]> <tibble [3 × 9]>
# 8 2 <tibble [1 × 9]> <tibble [2 × 9]>
# 9 2 <tibble [2 × 9]> <tibble [1 × 9]>
#10 2 <tibble [3 × 9]> <tibble [0 × 9]>
# … with 40 more rows
This can also be translated in base R in the following way
do.call(rbind, lapply(split(ex, ex$order), function(x) {
x$data_before <- lapply(seq_len(nrow(x)), function(y) x[seq_len(y - 1), ])
x$data_after <- lapply(seq_len(nrow(x)), function(y)
if (y == nrow(x)) x[0,] else x[(y + 1):nrow(x), ])
x
}))

Keep column names for createDummyFeatures "reference" (n-1)

I have this kind of data.
library(dplyr)
glimpse(samp)
Observations: 5
Variables: 5
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol <fct> full_bar, NA, full_bar, beer_and_wi...
$ BikeParking <fct> True, NA, False, NA, NA
$ BusinessAcceptsBitcoin <fct> NA, NA, NA, NA, NA
$ BusinessAcceptsCreditCards <fct> True, NA, NA, True, True
I want to create 1-p dummy features. The createDummyFeatures function of the mlr package has the option reference to do this.
library(mlr)
dummy = createDummyFeatures(samp, target = "review_count", method = "reference")
The problem is that it doesn´t keep the original column names.
glimpse(dummy)
Observations: 5
Variables: 6
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol.full_bar <dbl> 1, NA, 1, 0, NA
$ Alcohol.none <dbl> 0, NA, 0, 0, NA
$ True <dbl> 1, NA, 0, NA, NA
$ True.1 <dbl> NA, NA, NA, NA, NA
$ True.2 <dbl> 1, NA, NA, 1, 1
The question is how can I keep them?
An Idea is to create them by the 1-of-nmethod and then remove all columns which contain "False".
dummy2 = createDummyFeatures(samp, target = "review_count")
dummy2 = dummy2 %>%
select(-contains("False"))
glimpse(dummy2)
Observations: 5
Variables: 7
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol.beer_and_wine <dbl> 0, NA, 0, 1, NA
$ Alcohol.full_bar <dbl> 1, NA, 1, 0, NA
$ Alcohol.none <dbl> 0, NA, 0, 0, NA
$ BikeParking.True <dbl> 1, NA, 0, NA, NA
$ BusinessAcceptsBitcoin.True <dbl> NA, NA, NA, NA, NA
$ BusinessAcceptsCreditCards.True <dbl> 1, NA, NA, 1, 1
However, I don´t know if it is the same as n-1 especially for the factors with more then 2 levels (The dummy coding is for an XGBoost regression where "review count" is the target variable).
dput(samp)
structure(list(review_count = c(68L, 3L, 7L, 9L, 5L), Alcohol = structure(c(2L,
NA, 2L, 1L, NA), .Label = c("beer_and_wine", "full_bar", "none"
), class = "factor"), BikeParking = structure(c(2L, NA, 1L, NA,
NA), .Label = c("False", "True"), class = "factor"), BusinessAcceptsBitcoin = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("False",
"True"), class = "factor"), BusinessAcceptsCreditCards = structure(c(2L,
NA, NA, 2L, 2L), .Label = c("False", "True"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
Edit
For those who have the same problem, I fixed this issue using caret.
library(caret)
dummy_dat = dummyVars("~ .", data = samp, fullRank = T)
dat = data.frame(predict(dummy_dat, newdata = samp))

Calculating the mean of 3 columns in data frame

I have 3 data frames and they are just replicates. So I want to bind them and calculate the mean of each fraction.
Three data frames:
Nr.1
> dput(head(tbl_gel1))
structure(list(Name = c("yal003w", "yal005c", "yal012w", "yal016w",
"yal035w", "yal038w"), `1_1` = c(1.08346521189121, NA, NA, NA,
NA, NA), `1_10` = c(0.267721905361376, 1.43303883148383, 1.61684304894131,
NA, NA, NA), `1_11` = c(0.189487668138674, 0.75522363065885,
1, NA, NA, NA), `1_12` = c(NA, 1.01340492119247, NA, NA, NA,
NA), `1_13` = c(0.374782308020683, 0.945489433731933, NA, NA,
NA, 0.0317297633029047), `1_14` = c(0.437488212634424, 1.18763709680314,
NA, NA, NA, 0.0278039649538794), `1_15` = c(1, 0.963283876302253,
NA, NA, NA, 0.101985769564935), `1_16` = c(0.933864874212228,
0.534233379286527, NA, NA, NA, 0.216767470594226), `1_17` = c(1,
0.665519263271478, NA, NA, 1, 1), `1_18` = c(0.666036574750145,
0.570465125348879, NA, NA, NA, 1.42894349812116), `1_19` = c(0.514337131747938,
0.23204076838128, NA, NA, 1, 1.2521214021452), `1_2` = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), `1_20` = c(NA,
NA, NA, NA, NA, 1.40803677399372), `1_21` = c(1.09990599806138,
NA, NA, NA, NA, 1.04631699593704), `1_22` = c(1.26442418472118,
NA, NA, NA, NA, 0.928872017485782), `1_23` = c(1.11596921281805,
NA, NA, NA, 1, 0.34698227364696), `1_24` = c(0.754496014447251,
NA, NA, NA, 1, 0.222234793614252), `1_3` = c(6.29254185223621,
NA, NA, 0.693642968439352, NA, NA), `1_4` = c(1.36347593974479,
NA, NA, 1, NA, NA), `1_5` = c(0.765885344543765, NA, NA, 1, NA,
NA), `1_6` = c(0.238118001668604, 0.679584207611477, NA, NA,
NA, NA), `1_7` = c(0.847897771442355, 0.277348019879946, NA,
NA, NA, NA), `1_8` = c(0.356154192700505, 1, 0.409523853881517,
NA, NA, NA), `1_9` = c(0.180109142324181, 1, 0.578310191227172,
NA, NA, 0.093113736249161)), .Names = c("Name", "1_1", "1_10",
"1_11", "1_12", "1_13", "1_14", "1_15", "1_16", "1_17", "1_18",
"1_19", "1_2", "1_20", "1_21", "1_22", "1_23", "1_24", "1_3",
"1_4", "1_5", "1_6", "1_7", "1_8", "1_9"), row.names = c(NA,
6L), class = "data.frame")
Nr. 2
> dput(head(tbl_gel2))
structure(list(Name = c("yal003w", "yal005c", "yal012w", "yal016w",
"yal035w", "yal038w"), `2_1` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `2_2` = c(1.0548947840373, NA,
NA, NA, NA, NA), `2_3` = c(1.61794716486303, 0.346821796129205,
NA, NA, NA, NA), `2_4` = c(1, NA, NA, 0.378254379051086, NA,
NA), `2_5` = c(0.670710809411423, NA, NA, 1, NA, NA), `2_6` = c(0.313872585645673,
NA, NA, NA, NA, NA), `2_7` = c(0.299293639466945, 0.13920907824675,
NA, NA, NA, NA), `2_8` = c(0.311431376422469, 0.511742245543671,
0.342807141055383, NA, NA, NA), `2_9` = c(0.243672215177189,
1, 0.689138745271004, NA, NA, 0.0540861571772987), `2_10` = c(0.154732102234279,
1.08973258347909, 1, NA, NA, NA), `2_11` = c(0.149365726324845,
1.1210733533474, 1.0427649268992, NA, NA, 0.0955468461925663),
`2_12` = c(0.153741630869067, 2.96276072446013, 1, NA, NA,
NA), `2_13` = c(0.629371115599316, 0.952868912207058, 0.0771105403237483,
NA, NA, 0.0885212695236819), `2_14` = c(0.907644486740723,
1.43000783337778, NA, NA, NA, 0.138102409899801), `2_15` = c(1.09683345304359,
0.423641943213571, NA, NA, NA, 0.255699738225622), `2_16` = c(0.913095779338154,
0.510977400533081, NA, NA, 0.520556617688936, 0.284898552722227
), `2_17` = c(0.935941553863477, 0.388225948821767, NA, NA,
1.14984991998928, 1), `2_18` = c(2.21746156904543, 0.642743615867438,
NA, NA, NA, 2.22716071647178), `2_19` = c(0.500618035526774,
0.282924681750454, NA, NA, NA, 1), `2_20` = c(0.701627311828743,
0.254001731153973, NA, NA, 1, 1.15996914621286), `2_21` = c(1.97359874904275,
NA, NA, NA, 1.67526802494991, 1.38709456754353), `2_22` = c(2.09198896289293,
NA, NA, NA, NA, 0.921672834103247), `2_23` = c(1.18791465369551,
NA, NA, NA, NA, 0.576309066193914), `2_24` = c(0.473199477125101,
0.176144702328764, NA, NA, 1, 0.130236848112641)), .Names = c("Name",
"2_1", "2_2", "2_3", "2_4", "2_5", "2_6", "2_7", "2_8", "2_9",
"2_10", "2_11", "2_12", "2_13", "2_14", "2_15", "2_16", "2_17",
"2_18", "2_19", "2_20", "2_21", "2_22", "2_23", "2_24"), row.names = c(NA,
6L), class = "data.frame")
Nr.3
> dput(head(tbl_gel3))
structure(list(Name = c("yal003w", "yal005c", "yal012w", "yal016w",
"yal035w", "yal038w"), `3_1` = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), `3_2` = c(1, 1.4605309655311,
NA, NA, NA, NA), `3_3` = c(1.74480713727388, 0.42825619952525,
NA, NA, NA, NA), `3_4` = c(1, 0.431712121875013, NA, 0.395182020245312,
NA, NA), `3_5` = c(2.26247329056518, 0.644462177666441, NA, 1,
NA, NA), `3_6` = c(0.619783374266709, 0.472094874244026, NA,
NA, NA, NA), `3_7` = c(0.45731912574756, 0.176354321796083, NA,
NA, NA, NA), `3_8` = c(0.271829278733367, 0.517232771669986,
0.153774052052871, NA, NA, NA), `3_9` = c(0.141017619508583,
1.41279969394534, 0.651948154271122, NA, NA, NA), `3_10` = c(NA,
1.64435171100405, 0.998807430240956, NA, NA, NA), `3_11` = c(0.110046035477971,
1.33684444261939, 1.25595310581771, NA, NA, 0.0236163735479745
), `3_12` = c(NA, 0.982250906830292, 0.39283619985401, NA, NA,
0.0688303458902568), `3_13` = c(0.136798076436642, 0.55729642483448,
0.176525038283566, NA, NA, 0.0251189412372225), `3_14` = c(0.316623893146817,
1, NA, NA, NA, 0.0727823461722849), `3_15` = c(NA, 0.607991038574375,
NA, NA, NA, 0.133968257432001), `3_16` = c(0.362994392402489,
0.547183167896534, NA, NA, NA, 0.0777347708647245), `3_17` = c(1,
0.116561118715651, NA, NA, 0.710972173471528, 1), `3_18` = c(NA,
3.63330458071475, NA, NA, NA, 3.24019081192985), `3_19` = c(NA,
NA, NA, NA, NA, 2.46635222132474), `3_20` = c(0.452303676849426,
0.0896715384025126, NA, NA, 1, 1), `3_21` = c(1.50169299468485,
0.513442106966708, NA, NA, 1.45124841710635, 1.02529618467026
), `3_22` = c(0.565232592993276, 0.748536315065533, NA, NA, 2.9089322117881,
0.782555457293307), `3_23` = c(1.62622280168665, 0.704926586534075,
NA, NA, NA, 0.584486806995139), `3_24` = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Name",
"3_1", "3_2", "3_3", "3_4", "3_5", "3_6", "3_7", "3_8", "3_9",
"3_10", "3_11", "3_12", "3_13", "3_14", "3_15", "3_16", "3_17",
"3_18", "3_19", "3_20", "3_21", "3_22", "3_23", "3_24"), row.names = c(NA,
6L), class = "data.frame")
I used function below to bind them. There are different number of rows in each data frame and in some cases different names so in the final table should be more rows than in each of them.
mylist <- list(tbl_gel1,tbl_gel2,tbl_gel3)
tbl_all <- Reduce(function(x, y) merge(x, y, all=T,by="Name",sort=F),
mylist, accumulate=F)
Everything goes fine until this moment.
Now I want to calculate the mean of each fraction (there is 24 fractions in total)
## Calculating the mean
tbl_all1 <- tbl_all[-1]
ind <- c(1, 25, 49)
tbl_mean <- cbind(tbl_all[1], sapply(0:23, function(i) rowMeans(tbl_all1[ind+i])))
There is something wrong with that function because sum of many rows gives 0. That's definitely wrong because in tbl_gel1 and others are only rows with atleast one number in any fraction.
If I take a look on tbl_mean I see that rows with sum of 0 are in the bottom.

Resources