Automatically creating bins for a numeric variable in R

So I have a variable as below.
var <- c(0L, 5L, 4L, 115L, 0L, 0L, 0L, 2L, 365L, 4L, 20L, 61L, 365L,
0L, 365L, 0L, 14L, 0L, 0L, 72L, 0L, 0L, 6L, 105L, 150L, 0L, 365L,
0L, 1L, 28L, 161L, 6L, 0L, 2L, 12L, 0L, 10L, 49L, 7L, 2L, 51L,
0L, 0L, 11L, 0L, 0L, 17L, 0L, 0L, 7L, 0L, 28L, 0L, 0L, 0L, 44L,
0L, 3L, 0L, 0L, 0L, 1L, 1L, 0L, 4L, 87L, 0L, 321L, 0L, 0L, 0L,
0L, 9L, 0L, 0L, 0L, 140L, 0L, 0L, 0L, 0L, 0L, 1L, 8L, 20L, 0L,
4L, 14L, 3L, 0L, 0L, 0L, 39L, 4L, 9L, 0L, 0L, 0L, 1L, 7L)
I want to create bins (of equal or different sizes, either is fine) to categorize this variable and plot it as a bar chart.
I know that automatic/recommended binning methods exist, but I am unsure how to apply them in R.
I tried using the bin() function, to no avail. I also read about the Jenks method. Is there a way to create the best possible bins in R?
I would like to use the result for a bar plot in ggplot.

Your description sounds like you want to plot a histogram of var. This can be done easily in ggplot using geom_histogram(). The key is that ggplot expects a data frame, so you have to wrap your variable in one, which you can do inside the ggplot() call:
library(ggplot2)
ggplot(data.frame(var), aes(var)) + geom_histogram(color='black', alpha=0.2)
This gives you a histogram of var (plot not shown).
The default is to use 30 bins, but you can specify either the number of bins via bins= or the width of the bins via binwidth=:
ggplot(data.frame(var), aes(var)) + geom_histogram(bins=10, color='black', alpha=0.2)
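On the "automatic/recommended" part of the question: base R ships a few rule-of-thumb bin selectors. The following is a minimal sketch, not a statement about which rule is best for this data; the short vector x is a made-up stand-in for var.

```r
# Sample values standing in for the question's `var` (skewed, many zeros).
x <- c(0, 5, 4, 115, 0, 0, 2, 365, 4, 20, 61, 365, 0, 14, 72, 6, 105, 150, 1, 28)

# Rule-of-thumb bin counts from base R (grDevices).
n_sturges <- nclass.Sturges(x)  # ceiling(log2(n) + 1)
n_fd      <- nclass.FD(x)       # Freedman-Diaconis rule, based on the IQR

# hist() can apply the same rules and return "pretty" break points
# without drawing anything.
h <- hist(x, breaks = "FD", plot = FALSE)
h$breaks   # bin edges
h$counts   # number of observations per bin
```

The edges in h$breaks could then be reused, for example as breaks= in cut() or in geom_histogram(). For Jenks natural breaks specifically, the classInt package provides classIntervals() with style = "jenks"; it is not shown here because it is an extra dependency.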
If you want to plot the basic bar geom, then geom_histogram() works just fine. If you change to use the stat_bin() function instead, it will perform the same binning method, but then you can apply and use a different geom if you want to:
ggplot(data.frame(var), aes(var)) +
stat_bin(geom='area', bins=10, alpha=0.2, color='black')
If you just want the numbers/data from "binning" a variable like yours, one of the simplest ways is to use cut() from base R.
Using cut() is simple: you pass the vector and a breaks= argument. Breaks can be given as a vector of the points where you want to "cut" (or "bin") your data, or you can set breaks=10 and get 10 bins of equal width spanning the range of the data. The result is a factor whose levels are the ranges of the bins. For var with breaks=10, you get the following:
> var_cut <- cut(var, breaks = 10)
> levels(var_cut)
[1] "(-0.365,36.5]" "(36.5,73]" "(73,110]" "(110,146]" "(146,182]" "(182,219]" "(219,256]"
[8] "(256,292]" "(292,328]" "(328,365]"
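Because the data is heavily skewed toward zero, equal-width bins put almost everything in the first bin. A sketch of the other option, cut() with explicit, unequal breaks, follows; the edges are hand-picked for illustration only and x is a made-up stand-in for var.

```r
x <- c(0, 5, 4, 115, 0, 0, 2, 365, 4, 20, 61, 365, 0, 14, 72, 6, 105, 150, 1, 28)

# Hand-picked, unequal bin edges (an assumption for illustration).
edges <- c(0, 1, 7, 30, 90, 365)

# right = FALSE makes intervals closed on the left, [a, b);
# include.lowest = TRUE closes the last interval so that 365 is kept.
bins <- cut(x, breaks = edges, right = FALSE, include.lowest = TRUE)

# Count observations per bin and turn the result into a data frame.
counts    <- table(bins)
counts_df <- as.data.frame(counts)   # columns: bins, Freq
counts_df
```

A data frame like counts_df can be passed to ggplot with geom_col() to get a bar chart of the bins rather than a histogram.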

Related

Update dataframe by Comparing Date field records in a second dataframe and append new records only

I want to compare the Date field of two data frames and append only the new records. The first data frame has the latest records, which are updated daily from a website. The second one holds the records read from a CSV file that I saved the previous day.
Data I read from the internet:
df_new<-structure(list(DCounter = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
CCounter = c(125L, 36L, 22L, 17L, 11L, 8L, 4L, 20L, 8L, 3L),
RCounter = c(24L, 33L, 34L, 50L, 33L, 21L, 62L, 10L, 20L, 31L),
CrCounter = c(1L, 1L, 8L, 2L, 2L, 8L, 2L, 3L, 0L, 1L),
Date = c("20/03/2020", "19/03/2020", "18/03/2020", "17/03/2020", "16/03/2020", "15/03/2020", "14/03/2020", "13/03/2020", "12/03/2020","11/03/2020")),
class = "data.frame", row.names = c(NA, 10L))
Format the date field to be Date type and rename field
df_new$Date = as.Date(df_new$Date, format = "%d/%m/%y")
colnames(df_new)<-c("D","C","R","Cr","Date")
#old data- read from csv file has data from yesterday
#----------------------
#df_old <- read.csv("df_Saved.csv",header=T)
df_old<-structure(list(D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
C = c(6L, 12L, 7L, 11L, 8L, 4L, 20L, 8L, 3L, 4L, 1L, 3L, 3L, 0L, 2L, 0L, 0L, 10L, 1L, 0L, 2L, 17L, 15L, 6L, 5L),
R = c(3L,3L, 0L, 3L, 2L, 2L, 0L, 0L, 3L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
Cr = c(1L, 0L, 0L, 0L, 0L, 2L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
Date = structure(c(17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 9L, 8L, 25L, 24L, 23L, 22L, 21L, 20L, 19L, 18L, 7L, 6L, 5L, 4L, 3L, 2L, 1L),
.Label = c("2/24/2020", "2/25/2020", "2/26/2020", "2/27/2020",
"2/28/2020", "2/29/2020", "3/1/2020", "3/10/2020", "3/11/2020", "3/12/2020",
"3/13/2020", "3/14/2020", "3/15/2020", "3/16/20", "3/17/20", "3/18/20",
"3/19/20", "3/2/2020", "3/3/2020", "3/4/2020", "3/5/2020", "3/6/2020", "3/7/2020", "3/8/2020", "3/9/2020"), class = "factor")),
class = "data.frame", row.names = c(NA, -25L))
Get today's date and format it
#--------------
dateToAdd<-format(Sys.time(), "%Y/%m/%d")
#extract ONLY updated dates
df_newExtracted<- with(df_new, df_new[(Date >= dateToAdd), ])
if(df_old$Date[1]< df_newExtracted$Date[1] ){
df_final<-rbind(df_newExtracted,df_old)
cat("Add New records\n")
}else{
df_final<-df_old
cat("Nothing new \n")
}
df_final
write.csv(df_final, "df_Saved.csv", row.names=FALSE)
I couldn't figure out the root cause of the problem. Sometimes it works when the dates differ by one day, and sometimes it fails when they differ by two days. Also, if df_newExtracted refers to a date the site has not published yet (for example, if we run the code today and they still haven't updated their records), the variable is empty and every later calculation crashes.
Some suggest the issue is related to writing and reading the CSV file, which changes the date format and makes the file unreliable, and that I should use lubridate; that is why I added the formatting lines. Any suggestions?
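One possible direction, sketched under assumptions rather than offered as a verified fix for the code above: make sure both Date columns are real Date objects, then keep only the rows of the new data that are later than the newest date already saved. The helper append_new_rows() and the two small data frames are hypothetical and only illustrate the idea.

```r
# Hypothetical helper (not from the original post): keep only rows of `new`
# whose Date is strictly later than the most recent Date already in `old`.
# Assumes both data frames have the same columns and `old` is not empty.
append_new_rows <- function(old, new) {
  stopifnot(inherits(old$Date, "Date"), inherits(new$Date, "Date"))
  cutoff <- max(old$Date)
  fresh  <- new[new$Date > cutoff, , drop = FALSE]
  # rbind() with a zero-row data frame just returns the old rows,
  # so an empty `fresh` needs no special case.
  out <- rbind(fresh, old)
  out[order(out$Date, decreasing = TRUE), , drop = FALSE]
}

# Made-up sample data; note %Y for the four-digit year.
old <- data.frame(C = c(11L, 8L),
                  Date = as.Date(c("17/03/2020", "16/03/2020"), format = "%d/%m/%Y"))
new <- data.frame(C = c(22L, 17L, 11L),
                  Date = as.Date(c("19/03/2020", "18/03/2020", "17/03/2020"), format = "%d/%m/%Y"))

merged <- append_new_rows(old, new)
merged
```

Comparing against max(old$Date) instead of today's date means a site that has not updated yet simply contributes no rows. write.csv() stores Date columns as yyyy-mm-dd text, which as.Date() can read back without a format string, so converting right after read.csv() keeps the saved file consistent.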

Kruskal.wallis gives out equal p-values

Friends,
I'm having an issue with the Kruskal-Wallis test in R while testing for stable seasonality. The p-values for each variable come out the same. I'm using kruskal.test(formula, data = mydata) from the stats package, and I have a hard time believing that the p-values would be identical.
My dataset is monthly with 163 observations, 3 macroeconomic variables in the model, and two seasonal dummies.
I'm testing each independent macroeconomic variable against the dependent variable with kruskal.test(y ~ x, data = mydata). So for the data example below it would be kruskal.test(pr ~ mev06_mp_lag2, data = mydata), repeated for each mev in the dataset. The p-values for all 3 mevs (mev06_mp_lag2, mev29_lag2, mev108_lag1) come out as in this output:
data: pr by mev29_lag2
Kruskal-Wallis chi-squared = 162, df = 162, p-value = 0.4852
Here is the data:
structure(list(date = structure(c(28L, 56L, 42L, 97L, 1L, 111L,
83L, 70L, 15L, 151L, 138L, 125L, 29L, 57L, 43L, 98L, 2L, 112L,
84L, 71L, 16L, 152L, 139L, 126L, 30L, 58L, 44L, 99L, 3L, 113L,
85L, 72L, 17L, 153L, 140L, 127L, 31L, 59L, 45L, 100L, 4L, 114L,
86L, 73L, 18L, 154L, 141L, 128L, 32L, 60L, 46L, 101L, 5L, 115L,
87L, 74L, 19L, 155L, 142L, 129L, 33L, 61L, 47L, 102L, 6L, 116L,
88L, 75L, 20L, 156L, 143L, 130L, 34L, 62L, 48L, 103L, 7L, 117L,
89L, 76L, 21L, 157L, 144L, 131L, 35L, 63L, 49L, 104L, 8L, 118L,
90L, 77L, 22L, 158L, 145L, 132L, 36L, 64L, 50L, 105L, 9L, 119L,
91L, 78L, 23L, 159L, 146L, 133L, 37L, 65L, 51L, 106L, 10L, 120L,
92L, 79L, 24L, 160L, 147L, 134L, 38L, 66L, 52L, 107L, 11L, 121L,
93L, 80L, 25L, 161L, 148L, 135L, 39L, 67L, 53L, 108L, 12L, 122L,
94L, 81L, 26L, 162L, 149L, 136L, 40L, 68L, 54L, 109L, 13L, 123L,
95L, 82L, 27L, 163L, 150L, 137L, 41L, 69L, 55L, 110L, 14L, 124L,
96L), .Label = c("01APR2006", "01APR2007", "01APR2008", "01APR2009",
"01APR2010", "01APR2011", "01APR2012", "01APR2013", "01APR2014",
"01APR2015", "01APR2016", "01APR2017", "01APR2018", "01APR2019",
"01AUG2006", "01AUG2007", "01AUG2008", "01AUG2009", "01AUG2010",
"01AUG2011", "01AUG2012", "01AUG2013", "01AUG2014", "01AUG2015",
"01AUG2016", "01AUG2017", "01AUG2018", "01DEC2005", "01DEC2006",
"01DEC2007", "01DEC2008", "01DEC2009", "01DEC2010", "01DEC2011",
"01DEC2012", "01DEC2013", "01DEC2014", "01DEC2015", "01DEC2016",
"01DEC2017", "01DEC2018", "01FEB2006", "01FEB2007", "01FEB2008",
"01FEB2009", "01FEB2010", "01FEB2011", "01FEB2012", "01FEB2013",
"01FEB2014", "01FEB2015", "01FEB2016", "01FEB2017", "01FEB2018",
"01FEB2019", "01JAN2006", "01JAN2007", "01JAN2008", "01JAN2009",
"01JAN2010", "01JAN2011", "01JAN2012", "01JAN2013", "01JAN2014",
"01JAN2015", "01JAN2016", "01JAN2017", "01JAN2018", "01JAN2019",
"01JUL2006", "01JUL2007", "01JUL2008", "01JUL2009", "01JUL2010",
"01JUL2011", "01JUL2012", "01JUL2013", "01JUL2014", "01JUL2015",
"01JUL2016", "01JUL2017", "01JUL2018", "01JUN2006", "01JUN2007",
"01JUN2008", "01JUN2009", "01JUN2010", "01JUN2011", "01JUN2012",
"01JUN2013", "01JUN2014", "01JUN2015", "01JUN2016", "01JUN2017",
"01JUN2018", "01JUN2019", "01MAR2006", "01MAR2007", "01MAR2008",
"01MAR2009", "01MAR2010", "01MAR2011", "01MAR2012", "01MAR2013",
"01MAR2014", "01MAR2015", "01MAR2016", "01MAR2017", "01MAR2018",
"01MAR2019", "01MAY2006", "01MAY2007", "01MAY2008", "01MAY2009",
"01MAY2010", "01MAY2011", "01MAY2012", "01MAY2013", "01MAY2014",
"01MAY2015", "01MAY2016", "01MAY2017", "01MAY2018", "01MAY2019",
"01NOV2006", "01NOV2007", "01NOV2008", "01NOV2009", "01NOV2010",
"01NOV2011", "01NOV2012", "01NOV2013", "01NOV2014", "01NOV2015",
"01NOV2016", "01NOV2017", "01NOV2018", "01OCT2006", "01OCT2007",
"01OCT2008", "01OCT2009", "01OCT2010", "01OCT2011", "01OCT2012",
"01OCT2013", "01OCT2014", "01OCT2015", "01OCT2016", "01OCT2017",
"01OCT2018", "01SEP2006", "01SEP2007", "01SEP2008", "01SEP2009",
"01SEP2010", "01SEP2011", "01SEP2012", "01SEP2013", "01SEP2014",
"01SEP2015", "01SEP2016", "01SEP2017", "01SEP2018"), class = "factor"),
pr = c(0.1691759261, 0.1975689455, 0.1701795466, 0.1889038722,
0.1743304586, 0.1850822209, 0.1725476026, 0.1806130453, 0.1769864586,
0.1546961801, 0.18850436, 0.1695999754, 0.1660947088, 0.1929270116,
0.1629685381, 0.1716883769, 0.1782082767, 0.177316379, 0.1586548395,
0.1816295787, 0.1634939904, 0.1653658139, 0.1669465832, 0.1547769918,
0.17154596, 0.1824150313, 0.1600967574, 0.1819462462, 0.1625842114,
0.1605423212, 0.174298958, 0.16859091, 0.1567519737, 0.1549443922,
0.1528250707, 0.1563427163, 0.1562236709, 0.1544731644, 0.1595362963,
0.1749852828, 0.1536175907, 0.1668984941, 0.1532514745, 0.152745466,
0.1590015917, 0.1500819546, 0.1504755171, 0.1583227453, 0.1546476157,
0.1634331963, 0.1565167637, 0.1699421465, 0.1657200266, 0.1642684245,
0.1675084975, 0.1617848489, 0.1662501795, 0.1648139984, 0.1645302595,
0.169286769, 0.1707244798, 0.1845315559, 0.1752391568, 0.1899788506,
0.1784046029, 0.1842806875, 0.1836403012, 0.1753696341, 0.1738240496,
0.1747609205, 0.1724421753, 0.1803992831, 0.1763816185, 0.187630168,
0.1877238382, 0.1860668525, 0.1854666743, 0.1860146483, 0.1781037416,
0.185259322, 0.1879122146, 0.178520754, 0.1875367517, 0.18694397,
0.1860777227, 0.1979044449, 0.1833497201, 0.192027271, 0.1926325454,
0.1916103719, 0.1851319974, 0.1864458557, 0.1832327814, 0.1808570791,
0.1851145899, 0.1815387272, 0.1870942258, 0.1943564723, 0.1862582923,
0.1907279007, 0.1859213896, 0.1865372709, 0.1898453914, 0.1847275775,
0.1736567497, 0.1771092243, 0.1822902114, 0.1840752276, 0.1892670811,
0.1923250842, 0.1852956789, 0.1917880299, 0.18771724, 0.1857801687,
0.1868263217, 0.1867604143, 0.1824500898, 0.1758283625, 0.1829290332,
0.1808247326, 0.183507277, 0.1852845389, 0.1808714285, 0.1818222883,
0.1755951829, 0.1774808136, 0.1775837234, 0.1696830467, 0.172385402,
0.1694350722, 0.168336944, 0.1680335702, 0.1684147459, 0.1726731413,
0.1633235864, 0.1707780779, 0.1606329755, 0.1634684695, 0.1652849939,
0.15803428, 0.1616158193, 0.1527704105, 0.1584612931, 0.1550232032,
0.1534022945, 0.164970584, 0.1565023361, 0.1622506128, 0.1551517442,
0.1539405645, 0.152548495, 0.1516353176, 0.1523898229, 0.1477241538,
0.1502876518, 0.1515682192, 0.1540217905, 0.1589165786, 0.1531622236,
0.1583882529, 0.1532322761, 0.157552401, 0.1621688871), month = c(12L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 1L, 2L, 3L, 4L, 5L, 6L), mon1 = c(0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L), mon3 = c(0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), mev06_mp_lag2 = c(0.2779810102,
0.1874272639, 0.1332826385, 0.1128640237, 0.1247535199, 0.1545791804,
0.2106891929, 0.2757365926, 0.329455103, 0.3808671396, 0.4450555294,
0.5340975751, 0.5971738413, 0.5881040948, 0.4793350636, 0.3124264887,
0.2197636246, 0.2206435437, 0.3113169675, 0.4196078671, 0.5003884945,
0.5494487995, 0.5369484545, 0.4606922562, 0.3338162715, 0.278520389,
0.3170366404, 0.4156696136, 0.4787532552, 0.4443344043, 0.3681819294,
0.2878537618, 0.2048228841, 0.1251537938, 0.0382989338, -0.058589422,
-0.142185008, -0.153725768, -0.074125689, 0.0484987522, 0.0608517463,
-0.079803144, -0.303655154, -0.429635585, -0.363580402, -0.1573843,
0.0420304555, 0.1835101363, 0.2542206609, 0.2533515836, 0.1774048348,
0.0536834552, -0.031620066, -0.048554527, -0.010029088, 0.0691957026,
0.1865379823, 0.314751579, 0.3867383564, 0.3849543674, 0.3270672177,
0.3352052154, 0.4333568873, 0.5807725419, 0.6594152281, 0.5820169704,
0.4614498827, 0.382189864, 0.3472850124, 0.3700953746, 0.4332794073,
0.5388940866, 0.6346031107, 0.6722549883, 0.6226019329, 0.5308626721,
0.5406836123, 0.652356085, 0.8470071782, 0.9341209812, 0.8264468016,
0.612419938, 0.5006911837, 0.5691599433, 0.7307708771, 0.8473791813,
0.8590757515, 0.7900410964, 0.7171039073, 0.6076028502, 0.5505395263,
0.5661995614, 0.631423817, 0.7324609809, 0.776800689, 0.7461146765,
0.6396693594, 0.5909067989, 0.6163303443, 0.6923212327, 0.7608602548,
0.7385415186, 0.7245230167, 0.735008075, 0.7303155287, 0.7306620594,
0.7216900251, 0.710357153, 0.668241137, 0.6465248078, 0.6386886106,
0.644503099, 0.6750915049, 0.6733980993, 0.707678618, 0.7411667711,
0.7159390625, 0.6659808449, 0.6197029436, 0.5965547889, 0.5673138317,
0.5608362128, 0.5669008884, 0.5795942214, 0.5905982279, 0.556992012,
0.5359266787, 0.5449271219, 0.5753646848, 0.6196930073, 0.6313425488,
0.6047324646, 0.5262327459, 0.4680502206, 0.4339327769, 0.422330442,
0.4388551617, 0.4449027001, 0.4724310877, 0.4603556503, 0.3559313099,
0.2192993453, 0.1752438701, 0.2708768468, 0.4398555582, 0.5419383533,
0.5258750189, 0.4264906744, 0.3512451556, 0.3047050285, 0.3177822041,
0.3703341357, 0.4374805453, 0.5119974656, 0.5479752418, 0.5383546522,
0.4763979544, 0.4418530239, 0.4423212346, 0.4638361889, 0.4725955269,
0.4199050848, 0.3677860365), mev29_lag2 = c(12052.672746,
12155.974991, 12259.977269, 12364.551523, 12471.923335, 12575.751994,
12681.578091, 12792.424151, 12903.799861, 13014.933326, 13125.644747,
13237.759633, 13347.540807, 13456.257594, 13563.261568, 13668.005405,
13772.061616, 13868.872889, 13963.208033, 14057.010446, 14145.406294,
14227.079383, 14301.142959, 14368.046479, 14424.924247, 14471.887375,
14508.019112, 14532.668323, 14547.065728, 14552.236417, 14550.020205,
14541.465439, 14527.537817, 14509.400483, 14488.246542, 14464.991414,
14441.692779, 14419.373969, 14399.416496, 14382.82297, 14369.044585,
14358.108259, 14348.715697, 14340.186543, 14332.550823, 14325.428273,
14318.322395, 14310.559769, 14301.864431, 14291.633935, 14279.435535,
14264.935547, 14247.97805, 14230.01465, 14210.49904, 14189.108376,
14166.881283, 14144.225632, 14121.472414, 14098.568702, 14076.59218,
14055.590158, 14035.983138, 14018.088095, 14001.533115, 13987.079436,
13973.759653, 13961.158726, 13949.839264, 13939.826368, 13931.070165,
13923.347123, 13916.816802, 13911.291278, 13906.706121, 13903.022798,
13900.161493, 13898.209865, 13897.051213, 13896.655547, 13897.047312,
13898.205564, 13900.125572, 13902.837452, 13906.230209, 13910.294112,
13914.960492, 13920.218961, 13926.287609, 13932.889015, 13940.451345,
13949.327157, 13959.352267, 13970.583834, 13983.14564, 13997.391872,
14012.965904, 14030.139859, 14048.917902, 14069.304752, 14091.541249,
14113.971365, 14137.471712, 14162.48361, 14187.783215, 14212.951734,
14237.687089, 14262.119284, 14285.160082, 14306.785799, 14326.567908,
14344.249129, 14360.498045, 14374.927988, 14388.841191, 14403.027623,
14417.285193, 14431.921345, 14447.347759, 14464.280067, 14482.60458,
14503.01009, 14525.873936, 14551.515778, 14580.356316, 14610.776601,
14643.555251, 14679.101052, 14716.763371, 14756.356798, 14797.710201,
14841.323243, 14885.552108, 14930.758122, 14976.563876, 15022.743933,
15070.254048, 15116.300407, 15163.332681, 15212.634721, 15262.129309,
15311.443993, 15360.633228, 15410.700926, 15460.012042, 15508.70943,
15555.948922, 15601.38129, 15647.017242, 15691.593748, 15737.814211,
15784.098257, 15824.336441, 15857.184087, 15890.739854, 15937.050823,
15997.292301, 16049.370568, 16063.033239, 16023.148233, 15962.775179,
15932.931115, 15961.380588), mev108_lag1 = c(3.4265582593,
3.8373450191, 4.1211669551, 4.2500265274, 4.2336477943, 4.1032530543,
3.9050112432, 3.691568661, 3.5215361911, 3.4547437295, 3.5245107487,
3.6740870118, 3.8205614376, 3.9060148228, 3.9500668579, 3.9928147249,
4.056423068, 4.097207087, 4.0423248638, 3.8590572205, 3.6249134397,
3.4534377102, 3.419037145, 3.448572797, 3.4287569276, 3.3235979183,
3.3376619007, 3.7361174237, 4.6156476062, 5.5516500424, 5.9018553329,
5.3364327802, 4.406525535, 3.9641497661, 4.5369688556, 5.6155652665,
6.3806850947, 6.3128039966, 5.8286655665, 5.6572058382, 6.1906323861,
7.0408483819, 7.4827400214, 7.0669869294, 6.1581569245, 5.3936717805,
5.2364436715, 5.4913612016, 5.777206406, 5.8339229216, 5.7719456704,
5.8170713396, 6.1029576358, 6.5263492298, 6.8736849118, 6.9975096947,
6.9363923153, 6.7924979551, 6.6668133872, 6.6299076039, 6.7439828613,
7.0243025303, 7.3370606372, 7.4869066644, 7.3844430207, 7.1374881632,
6.940002926, 6.9245088132, 7.0301738798, 7.1305865095, 7.1405475978,
7.1156467585, 7.1524809409, 7.3303394277, 7.6756343523, 8.1680801673,
8.7542261364, 9.1808145707, 9.1010680729, 8.4114150872, 7.6844861301,
7.7270955321, 8.9146989491, 10.361039125, 10.796323189, 9.4618739177,
7.2049954246, 5.5270537994, 5.2221817889, 5.905531143, 6.7592672119,
7.1298927381, 7.0304213613, 6.697874346, 6.3607611025, 6.1569021347,
6.2001333982, 6.5397429639, 7.0184856606, 7.3825719382, 7.5069332339,
7.4599546294, 7.377008726, 7.3638030204, 7.3988155209, 7.4176473452,
7.3829883718, 7.3415942425, 7.3652515353, 7.492033304, 7.6543284954,
7.7427624077, 7.7070473944, 7.6101649913, 7.5623895662, 7.6286991237,
7.7329248639, 7.7505651547, 7.6137269809, 7.4246691851, 7.337208565,
7.4360967197, 7.5892255476, 7.5910082105, 7.3256377393, 6.9067676469,
6.5375463809, 6.3577677595, 6.320229607, 6.3124546301, 6.2662262884,
6.2427837167, 6.3428922976, 6.6124818018, 6.9249171793, 7.0836464531,
6.9995311857, 6.784745399, 6.6375952256, 6.6797395345, 6.7927792813,
6.775540136, 6.5260699355, 6.2318486432, 6.1687507324, 6.4951667771,
7.0000862167, 7.3264282363, 7.2857205376, 6.9859881738, 6.6532338989,
6.4623367973, 6.4024537545, 6.3988018644, 6.3987025271, 6.4148188331,
6.4801548851, 6.6043861168, 6.7236064103, 6.7473536828, 6.6336225214,
6.4408520391, 6.2759289867), p_pr = c(0.1841979358, 0.1909299357,
0.1800235425, 0.1873193897, 0.1778321909, 0.1771717461, 0.1769871609,
0.1769369574, 0.1767002661, 0.1766514006, 0.1772474365, 0.1786372508,
0.1793958093, 0.1873407005, 0.1744738837, 0.1779058647, 0.1660300916,
0.165123522, 0.1662612377, 0.1675426585, 0.1680743656, 0.1680322376,
0.1668552618, 0.1643117778, 0.1604937471, 0.1674889291, 0.1589809185,
0.1707308583, 0.1656141418, 0.1669016231, 0.1658465865, 0.1626002246,
0.1584857239, 0.1556467109, 0.1550484409, 0.1554116407, 0.1553698903,
0.1642789961, 0.1562188049, 0.1676637554, 0.1607636607, 0.159365876,
0.154912779, 0.1508778098, 0.1504706517, 0.1538985266, 0.1585854408,
0.1628016268, 0.1653325485, 0.1746734474, 0.1636385773, 0.1694169075,
0.1595285254, 0.1602916429, 0.1622777106, 0.1647745096, 0.1677972871,
0.170901438, 0.1726448513, 0.1727558383, 0.1718106875, 0.182016627,
0.1762909312, 0.1891248658, 0.1824141631, 0.1800526397, 0.1767170916,
0.1748339829, 0.1743303929, 0.1752424115, 0.1769369171, 0.17959844,
0.182145123, 0.1926835257, 0.1831830764, 0.190698247, 0.1837433962,
0.1875573393, 0.1922445975, 0.1928025222, 0.1883983926, 0.1831397417,
0.1831222451, 0.1882066078, 0.1932319714, 0.2020834894, 0.1878958952,
0.1907776136, 0.179564677, 0.1783669915, 0.1788699402, 0.1800391448,
0.1813284168, 0.1829512395, 0.1831328753, 0.181735949, 0.1790137171,
0.1875337053, 0.1799754626, 0.191124027, 0.1842840392, 0.1833786054,
0.1825845794, 0.182550754, 0.1822481672, 0.1820347832, 0.1814673532,
0.18082831, 0.1795880318, 0.1882358605, 0.1790916575, 0.1878672726,
0.1797660056, 0.1793430747, 0.1799398102, 0.1807822543, 0.180246357,
0.1788849577, 0.1772437109, 0.1760414846, 0.1749113359, 0.1838871358,
0.1750360156, 0.1836953752, 0.1744313344, 0.1722844661, 0.170542729,
0.1699684655, 0.1702419601, 0.1709120463, 0.1706566897, 0.1694752567,
0.1672817086, 0.175105, 0.1653820849, 0.1735863964, 0.1646891174,
0.1638476083, 0.1636914003, 0.1629671545, 0.1601006771, 0.1561250286,
0.1539170317, 0.1550840353, 0.1586350423, 0.1705586865, 0.1617244458,
0.1681380973, 0.1570702457, 0.1547307475, 0.1537854739, 0.1541593825,
0.155270079, 0.1567753976, 0.1573188283, 0.1566263272, 0.154594785,
0.1625938782, 0.1536205501, 0.1632453909, 0.1552261163, 0.1537721633,
0.1517811103), r_pr = c(-0.01502201, 0.0066390098, -0.009843996,
0.0015844825, -0.003501732, 0.0079104748, -0.004439558, 0.003676088,
0.0002861925, -0.02195522, 0.0112569236, -0.009037275, -0.013301101,
0.0055863112, -0.011505346, -0.006217488, 0.0121781851, 0.0121928571,
-0.007606398, 0.0140869202, -0.004580375, -0.002666424, 9.13213e-05,
-0.009534786, 0.0110522129, 0.0149261022, 0.0011158389, 0.0112153879,
-0.00302993, -0.006359302, 0.0084523714, 0.0059906854, -0.00173375,
-0.000702319, -0.00222337, 0.0009310756, 0.0008537806, -0.009805832,
0.0033174915, 0.0073215274, -0.00714607, 0.0075326181, -0.001661304,
0.0018676562, 0.0085309399, -0.003816572, -0.008109924, -0.004478882,
-0.010684933, -0.011240251, -0.007121814, 0.000525239, 0.0061915012,
0.0039767816, 0.0052307869, -0.002989661, -0.001547108, -0.00608744,
-0.008114592, -0.003469069, -0.001086208, 0.0025149289, -0.001051774,
0.0008539848, -0.00400956, 0.0042280478, 0.0069232096, 0.0005356512,
-0.000506343, -0.000481491, -0.004494742, 0.0008008432, -0.005763504,
-0.005053358, 0.0045407618, -0.004631395, 0.0017232781, -0.001542691,
-0.014140856, -0.0075432, -0.000486178, -0.004618988, 0.0044145066,
-0.001262638, -0.007154249, -0.004179044, -0.004546175, 0.0012496574,
0.0130678684, 0.0132433805, 0.0062620573, 0.0064067109, 0.0019043646,
-0.00209416, 0.0019817146, -0.000197222, 0.0080805087, 0.0068227671,
0.0062828296, -0.000396126, 0.0016373504, 0.0031586655, 0.007260812,
0.0021768236, -0.008591417, -0.004925559, 0.0008228582, 0.0032469176,
0.0096790493, 0.0040892237, 0.0062040214, 0.0039207574, 0.0079512344,
0.006437094, 0.0068865115, 0.0059781601, 0.0022037328, -0.003056595,
0.0056853223, 0.004783248, 0.008595941, 0.0013974031, 0.0058354128,
-0.001873087, 0.0011638485, 0.0051963475, 0.0070409944, -0.000285419,
0.0021434419, -0.001476974, -0.002319746, -0.001441687, 0.0011330373,
-0.002431859, -0.002058499, -0.002808318, -0.004056142, -0.000379139,
0.0015935936, -0.004932874, 0.0015151421, -0.003354618, 0.0045442614,
-6.0832e-05, -0.005232748, -0.005588103, -0.00522211, -0.005887484,
-0.001918502, -0.000790183, -0.001236979, -0.002524065, -0.002880256,
-0.009051244, -0.007031176, -0.005058108, -0.000572995, -0.0036773,
-0.000458327, -0.004857138, -0.00199384, 0.0037802378, 0.0103877768
)), .Names = c("date", "pr", "month", "mon1", "mon3", "mev06_mp_lag2",
"mev29_lag2", "mev108_lag1", "p_pr", "r_pr"), class = "data.frame", row.names = c(NA,
-163L))
Am I missing something with the nuances of this test? Thoughts?
A Kruskal-Wallis test compares the dependent variable across groups defined by the unique values of the independent variable (analogous to one-way ANOVA). Your independent variables are continuous, so each splits your 163 observations into the same 163 different groups, each with one observation. This is why the tests come out the same.
A clue was in the output - the test had 162 degrees of freedom on 163 observations!
Kruskal-Wallis chi-squared = 162, df = 162, p-value = 0.4852
So the Kruskal-Wallis test isn't appropriate here. Either bin your independent variables first (although a K-W test still wouldn't be ideal, as the resulting groups would be ordered), or use a test for correlation.
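The effect is easy to reproduce. The sketch below uses simulated data (not the poster's dataset) to show that any continuous grouping variable with all-distinct values gives the same statistic, n - 1, on n - 1 degrees of freedom, and contrasts that with a rank correlation test and with grouping by a genuinely categorical variable such as month.

```r
set.seed(1)
n  <- 36
y  <- rnorm(n)
x1 <- rnorm(n)   # continuous predictor, all values distinct
x2 <- runif(n)   # a second, unrelated continuous predictor

# Each distinct x value becomes its own group of size one.
kw1 <- kruskal.test(y ~ x1)
kw2 <- kruskal.test(y ~ x2)
c(kw1$statistic, kw1$parameter, p = kw1$p.value)
c(kw2$statistic, kw2$parameter, p = kw2$p.value)

# A rank-based test of association between two continuous variables.
sp <- cor.test(x1, y, method = "spearman")
sp$p.value

# Kruskal-Wallis with a real grouping factor (12 months, 3 obs each).
month    <- rep(1:12, length.out = n)
kw_month <- kruskal.test(y ~ factor(month))
kw_month$parameter   # df = 11
```

With one observation per group the rank sums are just the ranks themselves, so the statistic carries no information about the predictor at all.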

How to superpose barplots in ggplot2

I would like to superpose three barplots.
Plot 1, Plot 2 and Plot 3 (images not preserved) are produced by the three ggplot() calls below.
fzg <- structure(list(start = c(40L, 22L, 37L, 32L, 72L, 41L, 2L, 11L, 57L, 10L, 102L, 40L, 17L, 48L, 86L, 46L, 49L, 7L, 1L, 2L, 13L, 69L, 42L, 31L, 39L, 64L, 39L, 29L, 67L, 5L, 1L, 54L, 32L, 7L, 4L, 67L, 14L, 26L, 20L, 42L, 26L, 57L, 0L, 34L, 114L), period = 1:45, zug = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), typ = c(2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), dyn1 = c(0L, 0L, 0L, 0L, 203L, 0L, 0L, 0L, 111L, 0L, 112L, 0L, 0L, 0L, 191L, 0L, 95L, 0L, 0L, 0L, 0L, 92L, 0L, 0L, 0L, 176L, 0L, 0L, 135L, 0L, 0L, 60L, 0L, 0L, 0L, 110L, 0L, 0L, 0L, 0L, 0L, 185L, 0L, 0L, 148L), dyn2 = c(0L, 0L, 0L, 0L, 203L, 0L, 0L, 0L, 0L, 0L, 223L, 0L, 0L, 0L, 0L, 0L, 286L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 268L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 305L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 333L)), .Names = c("start", "period", "zug", "typ", "dyn1", "dyn2"), row.names = c(NA, -45L), class = "data.frame")
x_scale_max <- max(fzg$start, fzg$dyn1, fzg$dyn2)
ggplot(fzg, aes(x=period, y=start, fill=typ)) + geom_bar(stat="identity", position="dodge") + ylim(0,x_scale_max)
ggplot(fzg, aes(x=period, y=dyn1, fill=typ)) + geom_bar(stat="identity", position="dodge") + ylim(0,x_scale_max)
ggplot(fzg, aes(x=period, y=dyn2, fill=typ)) + geom_bar(stat="identity", position="dodge")+ ylim(0,x_scale_max)
The resulting barplot should:
show all the small bars from plot 1 in color 0
show all the highlighted bars from plot 1 in color 1
show the added portions from plot 2 in color 2
show the added portions from plot 3 in color 3
I managed to get all in one plot
library(reshape2)
mdat <- melt(fzg[c("start", "period", "dyn1", "dyn2")], measured=c("start","dyn1","dyn2"), id="period")
ggplot(mdat, aes(x=period, y=value, fill=variable)) + geom_bar(stat="identity", position="stack") + ylim(0,x_scale_max)
But the color highlighting of the different steps does not work well.
If you are looking for this plot (image not preserved), just modify your code:
mdat <- melt(fzg[c("start", "period", "dyn1", "dyn2", "typ")], id.vars=c("period", "typ"), measure.vars=c("start","dyn1","dyn2"))
mdat <- mdat[mdat$value != 0,]
ggplot(mdat, aes(x=period, y=value, fill=interaction(variable,typ))) + geom_bar(stat = "identity")
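What makes the colors separate is interaction(), which builds one factor level per combination of its arguments. A small base R sketch with made-up values (not the fzg data) shows the labels it generates:

```r
# Made-up values mirroring the melted columns `variable` and `typ`.
variable <- factor(c("start", "dyn1", "start", "dyn2"))
typ      <- c(2L, 1L, 1L, 1L)

# One level per (variable, typ) combination, labelled "variable.typ".
grp <- interaction(variable, typ)
as.character(grp)
levels(grp)                                      # all 3 x 2 combinations

# drop = TRUE keeps only the combinations that actually occur.
levels(interaction(variable, typ, drop = TRUE))
```

In the plot, each of these combined levels gets its own fill color, so the small bars (typ 2) and the highlighted bars (typ 1) of start are colored differently from the dyn1 and dyn2 portions.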

scale x-axis correctly for barplot

When plotting a stacked barplot using {graphics} I get a problem with the x-axis not scaling correctly, the ticks aren't aligned to the bars properly, leaving the axis too short.
# dummy data
mat <- structure(c(0L, 5L, 7L, 10L, 12L, 14L, 16L, 18L, 20L, 22L, 24L,
26L, 28L, 30L, 32L, 34L, 36L, 38L, 40L, 42L, 44L, 46L, 48L, 50L,
52L, 54L, 56L, 58L, 60L, 62L, 63L, 64L, 0L, 0L, 0L, 3L, 0L, 0L,
1L, 0L, 0L, 0L, 5L, 0L, 1L, 4L, 0L, 9L, 0L, 0L, 1L, 0L, 8L, 0L,
7L, 0L, 1L, 1L, 6L, 0L, 1L, 3L, 4L, 0L, 1L, 1L, 5L, 6L, 1L, 6L,
0L, 0L, 5L, 4L, 1L, 8L, 0L, 1L, 1L, 3L, 1L, 3L, 1L, 0L, 1L, 1L,
1L, 1L, 0L, 3L, 3L, 5L, 4L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 6L, 0L,
11L, 0L, 7L, 0L, 6L, 0L, 0L, 5L, 4L, 0L, 1L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 9L, 0L), .Dim = c(32L, 5L
), .Dimnames = list(NULL, c("Time", "Var1", "Var2", "Var3", "Var4"
)))
# barplot
barplot(t(mat[,2:4]), beside=F, legend=levels(mat), col=c("blue",'red','forestgreen','purple'))
# manually assign x-axis
axis(1,at=c(1:32),labels=mat[,1])
Any pointers on this would be highly appreciated. I'm not interested in a ggplot2 solution. Thanks!
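For your axis, get the x-coordinates of the bars first; barplot() returns the bar midpoints.
# levels(mat) is NULL for a matrix, so take the legend labels from the column names
bp <- barplot(t(mat[,2:5]), beside=F,
legend.text = colnames(mat)[2:5], col = c("blue",'red','forestgreen','purple'))
Now use bp for the x-tick positions:
axis(1,at=bp,labels=mat[,1])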
Also, if you widen the plot window/device, R has room to draw all of the tick labels.
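The reason at=c(1:32) drifts out of alignment is that bars are not centered on the integers. A minimal sketch with a tiny made-up matrix, using plot = FALSE so that nothing is drawn:

```r
# 2 stacked series, 3 bars (made-up numbers).
m <- matrix(1:6, nrow = 2)

# barplot() returns the x-coordinate of each bar's midpoint.
# With the defaults (width = 1, space = 0.2) these are not 1, 2, 3.
mids <- barplot(m, plot = FALSE)
mids   # 0.7 1.9 3.1
```

Each bar advances by 1.2 units rather than 1, so by the 32nd bar the integer positions are several bars short of the true midpoint, which matches the "axis too short" symptom in the question.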

Searching Lists in R based on closest date

I'm currently trying to write some code that returns the last date from an ordered list that is earlier than date X.
Right now I have this: it takes a list of dates, the index of the day we are searching from, and a range that says how far back we want to go.
It then checks whether the resulting date exists (e.g. Feb 30th does not). If the date doesn't exist, it steps the date back by 1 day and applies the filter again (otherwise it tries to subtract 1 day from NA and fails).
library(lubridate)
getDate <- function(dates, day, range) {
  if (range == 'single') {
    return(day - 1)
  }
  # look-back period for each range keyword
  z <- switch(range,
    single = days(1),
    month = days(30),
    month3 = months(3),
    month6 = months(6),
    year = years(1)
  )
  new_day <- dates[day] - z
  # subtracting months/years can give a non-existent date (NA);
  # step back one more day each time until the date is valid
  i <- 1
  while (is.na(new_day)) {
    new_day <- dates[day] - days(i) - z
    i <- i + 1
  }
  ind <- which.min(abs(diff <- (new_day - dates)))
  if (diff[ind] < 0) {
    ind <- ind - 1
  }
  return(ind[1])
}
While this function works, the problem is the speed efficiency. I have a feeling that which.min(abs()) is far from the quickest and I'm wondering if there are any better alternatives (outside of also writing my own function to search lists).
stocks <- list(structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), mday = c(2L, 3L, 4L, 7L, 8L, 9L, 10L, 11L, 14L, 15L, 16L, 17L,
18L, 22L, 23L, 24L, 25L, 28L, 29L, 30L, 31L, 1L, 4L, 5L, 6L), mon = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L), year = c(108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L,
108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L, 108L,
108L, 108L, 108L), wday = c(3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), yday = c(1L, 2L, 3L, 6L, 7L,
8L, 9L, 10L, 13L, 14L, 15L, 16L, 17L, 21L, 22L, 23L, 24L, 27L, 28L, 29L, 30L,
31L, 34L, 35L, 36L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min",
"hour", "mday", "mon", "year", "wday", "yday", "isdst"), tzone = "UTC",
class = c("POSIXlt", "POSIXt")))
old_pos <- getDate(stocks[[1]],21,"month") #should return 0
old_pos <- getDate(stocks[[1]],22,"month") #should return 1
This does not return a vector or a date, only an index. The main question isn't about getting it to work (it does), but about optimizing it.
The value is later used in another function. One possible speed-up is to first match all of the old indexes to new ones and return that as another list; however, I'm not sure whether that would actually be faster.
Using @agstudy's reformulation, with sDate and x.Date as defined in that answer:
data.table
We can perform the calculations in data.table like this where the first column shows the original date in sDate and the second column is the corresponding x.Date date:
> library(data.table)
> data.table(date = x.Date, x.Date, key = "date")[J(sDate),, roll = TRUE]
date x.Date
1: 2003-02-03 2003-02-02
2: 2003-02-12 2003-02-10
3: 2003-02-16 2003-02-15
sqldf
Using sqldf, it's like this:
> library(sqldf)
> sDateDF <- data.frame(sDate = sDate)
> xDateDF <- data.frame(xDate = x.Date)
>
> sqldf("select s.sdate sDate, max(x.xdate) xDate
+ from sDateDF s join xDateDF x on x.xDate <= s.sDate
+ group by s.sDate")
sDate xDate
1 2003-02-03 2003-02-02
2 2003-02-12 2003-02-10
3 2003-02-16 2003-02-15
zoo
Using zoo, we create two zoo series, merge them and use na.locf like this. The result is the x.Date corresponding to each sDate (i.e. the second column in either of the above solutions):
> library(zoo)
>
> zx <- zoo(seq_along(x.Date), x.Date)
> zs <- zoo(seq_along(sDate), sDate)
> x.Date[na.locf(merge(zx, zs))[sDate, "zx"]]
[1] "2003-02-02" "2003-02-10" "2003-02-15"
If I understand correctly, you have a vector of dates, for example:
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14,20)
"2003-02-02" "2003-02-04" "2003-02-08" "2003-02-10" "2003-02-15" "2003-02-21"
and, given a vector of search dates, for example:
sDate <- as.Date("2003-02-01") + c(2,11,15)
you want, for each of them, the closest date in x.Date that is not later than it:
lapply(sDate,function(x)max(x.Date[x.Date-x <=0]))
[[1]]
[1] "2003-02-02"
[[2]]
[1] "2003-02-10"
[[3]]
[1] "2003-02-15"
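Since the question is about speed and the dates are already ordered, base R's findInterval() is another option that avoids looping in R. This is a sketch using the same example vectors, not a benchmarked replacement for getDate():

```r
x.Date <- as.Date("2003-02-01") + c(1, 3, 7, 9, 14, 20)   # sorted lookup dates
sDate  <- as.Date("2003-02-01") + c(2, 11, 15)            # dates to search for

# findInterval() does a binary search on a sorted vector and returns, for each
# query, the index of the last element that is <= the query (0 if there is none).
idx <- findInterval(as.numeric(sDate), as.numeric(x.Date))
idx            # 1 4 5
x.Date[idx]    # "2003-02-02" "2003-02-10" "2003-02-15"

# For a strict "earlier than" match, left.open = TRUE excludes equal dates.
idx_strict <- findInterval(as.numeric(x.Date[2]), as.numeric(x.Date),
                           left.open = TRUE)
idx_strict     # 1, because 2003-02-04 itself is excluded
```

An index of 0 means no earlier date exists; indexing x.Date with 0 silently drops that element, so such cases are worth checking before subsetting.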
