How to merge bins in R

I am trying to merge the bins of a histogram whenever the number of observations in a bin is less than 6.
library(fitdistrplus)
mydata <- read.csv("Book2.csv", stringsAsFactors = FALSE)
QF3 <- as.numeric(mydata[, 1])
histrv <- hist(QF3, breaks = "FD")
binbreak <- histrv$breaks
freq <- histrv$counts
datmean <- mean(QF3)
datsigma <- sd(QF3)
# probability of each bin under a normal distribution with the sample mean and sd
# (named binprob rather than pi, so the built-in constant pi is not masked)
binprob <- diff(pnorm(binbreak, datmean, datsigma))
# chi-square goodness-of-fit statistic
chisqvec <- (freq - length(QF3) * binprob)^2 / (length(QF3) * binprob)
xstat <- sum(chisqvec)
The above code produces a histogram with five bins that contain fewer than 6 observations: 6000-7000, 7000-8000, 8000-9000, 9000-10000, and 10000-11000. These 5 bins contain 2, 5, 2, 2, and 1 observations, respectively. I would like to merge these bins so that they contain more than 5 observations.
In other words, I would like to end up with the two bins 6000-8000 and 8000-11000, which would contain 7 observations and 5 observations.
Does anyone have any idea how to approach this problem?
QF3 looks like the following:
> QF3
[1] 2016 1425 2000 785 823 2484 1870 770 1220 3454 1056 2745 2830
[14] 950 601 1245 2663 1500 1717 1070 1704 2517 1090 3310 3389 2200
[27] 882 2113 600 1900 4417 745 530 1630 1600 4530 948 2764 2202
[40] 1052 2685 1120 1275 2300 1590 1935 3957 4283 3215 5684 4092 7548
[53] 4547 3510 3063 5549 6460 5204 4626 4965 5023 8111 5525 4804 5994
[66] 8471 4767 7142 3420 4061 5102 9135 3861 5372 7274 5054 7318 3791
[79] 4901 3549 4758 4859 10190 5609 7624 5841 4908 4974 6691 5713 3235
[92] 4464 2656 4399 9581 3993 4061
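One possible approach, as a minimal sketch rather than a definitive answer: walk through the bins from left to right, accumulate counts until the running total reaches the threshold, and close a merged bin at that break. The helper name merge_small_bins and its min_count parameter are made up for illustration. The example uses the five small counts quoted above (2, 5, 2, 2, 1); the first two counts (10 and 8) are invented just to have some large bins in front.

```r
# Hypothetical helper: merge adjacent histogram bins until every merged bin
# holds at least min_count observations.
merge_small_bins <- function(breaks, counts, min_count = 6) {
  new_breaks <- breaks[1]
  new_counts <- numeric(0)
  acc <- 0
  for (i in seq_along(counts)) {
    acc <- acc + counts[i]
    if (acc >= min_count) {
      # enough observations: close the merged bin at the right edge of bin i
      new_breaks <- c(new_breaks, breaks[i + 1])
      new_counts <- c(new_counts, acc)
      acc <- 0
    }
  }
  if (acc > 0) {
    # leftover bins at the right end that never reached min_count
    if (length(new_counts) == 0) {
      new_breaks <- c(new_breaks, breaks[length(breaks)])
      new_counts <- acc
    } else {
      # fold them into the last merged bin
      new_breaks[length(new_breaks)] <- breaks[length(breaks)]
      new_counts[length(new_counts)] <- new_counts[length(new_counts)] + acc
    }
  }
  list(breaks = new_breaks, counts = new_counts)
}

# Toy input: breaks every 1000; counts 10 and 8 are invented,
# 2, 5, 2, 2, 1 are the small bins from the question.
breaks <- seq(4000, 11000, by = 1000)
counts <- c(10, 8, 2, 5, 2, 2, 1)
merged <- merge_small_bins(breaks, counts)
merged$breaks  # 4000 5000 6000 11000
merged$counts  # 10 8 12
```

Note that with a strict threshold of 6 the trailing group 8000-11000 (5 observations) is too small on its own, so this sketch folds it into 6000-8000. Passing min_count = 5 would instead keep 6000-8000 and 8000-11000 as separate bins. The merged breaks could then be passed back as hist(QF3, breaks = merged$breaks); with unequal bin widths hist() draws densities rather than counts by default.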

Related

hts method for creating hierarchical time series

I am trying to convert a time series into a hierarchical one using the hts package and my structure is as follows:
My data is already stored as a ts object in R with a weekly frequency and all columns are labeled:
Then I try to make the hts object but I get this error message, probably because I do not fully understand how to split the columns in the argument "characters".
x <- hts(abc, c(1,1,2,2))
Since argument characters are not specified, the default labelling system is used.
Error: Argument nodes must be a list.
Thanks in advance
First, you need to pass the characters argument by name, because it is not the second argument of hts(). The second positional argument is nodes, which is why the error complains about nodes. Second, you need to specify characters according to the names of your columns. In this case, the first character is the first level of disaggregation (A, B, C, D, E), while the next two characters specify the next level of disaggregation. So your argument should be characters = c(1,2).
Here is an example with synthetic data, but with the same structure as yours.
library(hts)
abc <- matrix(sample(1:100, 32*140, replace=TRUE), ncol=32)
colnames(abc) <- c(
  paste0("A0",1:5),
  paste0("B0",1:9),"B10",
  paste0("C0",1:8),
  paste0("D0",1:5),
  paste0("E0",1:4)
)
abc <- ts(abc, start=2019, frequency=365.25/7)
x <- hts(abc, characters = c(1,2))
x
#> Hierarchical Time Series
#> 3 Levels
#> Number of nodes at each level: 1 5 32
#> Total number of series: 38
#> Number of observations per series: 140
#> Top level series:
#> Time Series:
#> Start = 2019
#> End = 2021.66392881588
#> Frequency = 52.1785714285714
#> [1] 1735 1645 1472 1638 1594 1722 1525 1761 1500 1746 1331 1567 1853 1652 1587
#> [16] 1540 1453 1989 1629 1587 1596 1474 1320 1599 1762 1419 1931 1447 2102 1608
#> [31] 1439 1909 1331 1742 1428 1677 1534 1657 1741 1612 1574 1954 1542 2067 1512
#> [46] 1850 1650 1666 1321 1332 1924 1786 1496 1695 1363 1437 1740 1448 1260 1371
#> [61] 1661 1726 1786 1641 1463 1616 1641 1895 1503 1430 1972 1705 1722 1447 1515
#> [76] 1636 1544 1727 1960 1647 1682 1569 1616 1628 1706 1837 1738 1659 1574 1716
#> [91] 1409 1428 1411 1708 1606 1501 1413 1707 1552 1567 1693 1748 2034 1557 1402
#> [106] 1649 1637 1653 1857 1401 1519 1600 1844 1585 1796 1612 1456 1626 1390 1368
#> [121] 1492 1765 1644 1773 1302 2027 1810 1652 1819 1628 1574 1655 1650 1817 1605
#> [136] 1422 1793 1999 1489 1667
Created on 2021-10-18 by the reprex package (v2.0.1)
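To see what characters = c(1,2) means for names like these, here is a small base R illustration. It only mimics the splitting rule described above with substr() and is not how hts works internally; the names bnames, level1 and level2 are made up for the example.

```r
# Column names shaped like the ones in the example above (A and B groups only)
bnames <- c(paste0("A0", 1:5), paste0("B0", 1:9), "B10")

# characters = c(1, 2): the first 1 character identifies the top-level group,
# the following 2 characters identify the series within that group
level1 <- substr(bnames, 1, 1)
level2 <- substr(bnames, 2, 3)

# number of bottom-level series under each top-level group
group_sizes <- table(level1)
group_sizes
```

The widths in characters have to add up to the length of every column name (1 + 2 = 3 here), which is why the names are zero-padded as A01 ... B10.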

How do I paste data from the first day of the month for the whole month?

I have a data frame containing weights for stocks in a portfolio, with around 1000 stocks and 4000 days of data. I want to apply the weights of the first day of each month to all days of that month. However, I still want to retain the structure of daily data.
My data is similar to this:
data <- as.data.frame(matrix(1:4000, nrow = 200, ncol = 20))
rownames(data) <- seq(as.Date("2018/01/01"), as.Date("2018/07/19"), 1)
So I want to have the values of the first of January copied to all days of January, the values of the first day of February copied to all days in February, etc.
I have no clue how to handle this.
Any tips?
You might want to use grouped dplyr::mutate_at:
library(dplyr)
library(tibble)  # rownames_to_column() and column_to_rownames() come from tibble
data %>%
  rownames_to_column("Date") %>%
  mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
  group_by(Month) %>%
  mutate_at(vars(starts_with("V")), .funs = list(weight = ~first(.))) %>%
  column_to_rownames("Date") %>%
  select(Month, starts_with("V"))
Output
# Month V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V1_weight
# 2018-01-01 2018-01 1 201 401 601 801 1001 1201 1401 1601 1801 2001 2201 2401 2601 2801 3001 3201 3401 3601 3801 1
# 2018-01-02 2018-01 2 202 402 602 802 1002 1202 1402 1602 1802 2002 2202 2402 2602 2802 3002 3202 3402 3602 3802 1
# 2018-01-03 2018-01 3 203 403 603 803 1003 1203 1403 1603 1803 2003 2203 2403 2603 2803 3003 3203 3403 3603 3803 1
# 2018-01-04 2018-01 4 204 404 604 804 1004 1204 1404 1604 1804 2004 2204 2404 2604 2804 3004 3204 3404 3604 3804 1
# ...
# 2018-01-21 2018-01 21 221 421 621 821 1021 1221 1421 1621 1821 2021 2221 2421 2621 2821 3021 3221 3421 3621 3821 1
# 2018-01-22 2018-01 22 222 422 622 822 1022 1222 1422 1622 1822 2022 2222 2422 2622 2822 3022 3222 3422 3622 3822 1
# 2018-01-23 2018-01 23 223 423 623 823 1023 1223 1423 1623 1823 2023 2223 2423 2623 2823 3023 3223 3423 3623 3823 1
# 2018-01-24 2018-01 24 224 424 624 824 1024 1224 1424 1624 1824 2024 2224 2424 2624 2824 3024 3224 3424 3624 3824 1
# V2_weight V3_weight V4_weight V5_weight V6_weight V7_weight V8_weight V9_weight V10_weight V11_weight V12_weight
# 2018-01-01 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-02 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-03 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-04 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# ...
# 2018-01-21 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-22 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-23 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# 2018-01-24 201 401 601 801 1001 1201 1401 1601 1801 2001 2201
# V13_weight V14_weight V15_weight V16_weight V17_weight V18_weight V19_weight V20_weight
# 2018-01-01 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-02 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-03 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-04 2401 2601 2801 3001 3201 3401 3601 3801
# ...
# 2018-01-20 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-21 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-22 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-23 2401 2601 2801 3001 3201 3401 3601 3801
# 2018-01-24 2401 2601 2801 3001 3201 3401 3601 3801
Here is a solution using the dplyr package:
library(dplyr)
library(zoo)
# convert row names to a variable
tibble::rownames_to_column(data, "Date") -> data
# add the first value of the month (for V1) to all rows
data %>%
  mutate(yearmon = zoo::as.yearmon(Date)) %>%  # extract year-month
  group_by(yearmon) %>%                        # group by it
  mutate(first = first(V1)) -> data_new        # add the value
One option would be the following:
Using lubridate, identify the rows that are not on the first of the month and set their values to NA.
Using the zoo package, forward-fill the NA values with the last available non-missing data point using na.locf.
Like this:
library(lubridate)
library(zoo)
data[day(row.names(data)) != 1, ] = NA
data = na.locf(data)
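For comparison, a base R sketch of the same idea that does not depend on a row dated the 1st being present. It assumes the rows are sorted by date and that "first day of the month" means the first row available for that month; the names dates, month_key, first_row and monthly are made up for the example.

```r
# Same toy data as in the question: 200 days starting 2018-01-01
data <- as.data.frame(matrix(1:4000, nrow = 200, ncol = 20))
dates <- seq(as.Date("2018-01-01"), by = "day", length.out = 200)
rownames(data) <- as.character(dates)

# Year-month key for every row, e.g. "2018-01"
month_key <- format(dates, "%Y-%m")

# match() returns the position of the first occurrence of each key,
# i.e. the first row of the month each day belongs to
first_row <- match(month_key, month_key)

# Repeat the first-of-month rows for every day, keeping the daily row names
monthly <- data[first_row, ]
rownames(monthly) <- rownames(data)

monthly["2018-01-31", "V1"]  # 1, copied from 2018-01-01
monthly["2018-02-01", "V1"]  # 32, the first row of February
```

Because the rows are picked by position, this keeps all columns and the daily structure in one indexing step.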

Adding a second y-axis as a formula of the first x and y-axis

I've got the following data:
ClusterID AvgGenes nCoreGenes Ratio
20001 1941 1572 0.809892
20005 1599 1374 0.859287
20008 2017 1712 0.848785
20009 1808 1590 0.879425
20013 1823 1469 0.805815
20015 2056 1677 0.815661
20019 2135 1783 0.835129
20020 3152 2625 0.832805
20026 2028 1586 0.782051
20028 1835 1420 0.773842
20030 2885 2189 0.758752
20031 1772 1485 0.838036
20032 1722 1473 0.855401
20034 1801 1459 0.810105
20035 1677 1339 0.798450
20042 2193 1651 0.752850
20047 1747 1345 0.769891
20049 1306 1008 0.771822
20051 1738 1358 0.781358
20052 1552 1188 0.765464
20062 2179 1509 0.692520
20065 2047 1894 0.925256
20074 1948 1568 0.804928
20088 2588 2192 0.846986
20103 1916 1341 0.699896
20109 2511 2190 0.872162
20117 1668 1278 0.766187
20162 1936 1601 0.826963
20167 2068 1856 0.897485
20168 4375 3992 0.912457
20170 3961 3252 0.821005
20190 2327 2013 0.865062
20196 3350 2522 0.752836
20198 3028 2302 0.760238
20207 1522 1241 0.815375
20208 1791 1546 0.863205
20215 3013 1853 0.615002
20219 2803 2043 0.728862
20225 4604 2931 0.636620
20247 1927 1567 0.813181
20248 2510 1732 0.690040
20251 2252 1674 0.743339
20279 2843 1775 0.624340
20293 1611 1245 0.772812
20313 2277 1914 0.840580
20314 2320 1915 0.825431
20318 2201 1762 0.800545
20320 2287 1943 0.849585
20321 2060 1645 0.798544
20323 2242 1524 0.679750
20327 2132 1845 0.865385
20328 1685 1402 0.832047
20329 2393 1727 0.721688
20341 2190 1729 0.789498
20368 3906 2991 0.765745
20370 3245 2325 0.716487
20373 2608 1935 0.741948
20374 3632 2380 0.655286
20388 1787 1435 0.803022
20408 1506 1262 0.837981
20423 1979 1428 0.721577
20433 2452 1646 0.671289
20459 2118 1649 0.778565
20462 1778 1496 0.841395
20478 1653 1447 0.875378
20492 2709 1895 0.699520
20494 2686 1773 0.660089
20498 2676 1909 0.713378
20508 1425 1092 0.766316
20517 2461 1983 0.805770
20548 2752 2059 0.748183
20565 2239 1764 0.787852
20566 2368 1882 0.794764
20569 2285 1877 0.821444
20572 2179 1703 0.781551
20573 1609 1355 0.842138
20577 1753 1379 0.786651
20579 1786 1426 0.798432
20589 1811 1239 0.684152
20600 2293 1822 0.794592
20650 1693 1422 0.839929
20677 1904 1485 0.779937
20729 1680 1362 0.810714
20742 2210 1855 0.839367
20744 1583 1372 0.866709
20746 2087 1743 0.835170
20750 1859 1418 0.762776
20753 1701 1496 0.879483
20758 1480 1169 0.789865
20759 1839 1406 0.764546
20772 2068 1786 0.863636
20773 2321 2024 0.872038
20775 2528 2012 0.795886
20784 1869 1592 0.851792
20788 1843 1516 0.822572
20809 1541 1352 0.877352
20811 1569 1346 0.857871
20824 1594 1323 0.829987
20836 2287 1688 0.738085
20857 2252 1704 0.756661
20890 1884 1340 0.711253
20903 1681 1404 0.835217
20966 1826 1455 0.796824
20967 1877 1605 0.855088
20990 2125 1605 0.755294
21002 1743 1345 0.771658
21027 1866 1504 0.806002
21047 2866 2191 0.764480
21049 2163 1596 0.737864
21059 2298 1847 0.803742
21085 1640 1490 0.908537
21258 3002 1950 0.649567
21325 2945 2117 0.718846
21326 2343 1996 0.851899
21348 2362 1809 0.765876
21370 2313 1553 0.671422
21384 1932 1383 0.715839
21405 1948 1398 0.717659
21477 1852 1538 0.830454
21584 2514 1838 0.731106
21586 1247 910 0.729751
21734 1619 1452 0.896850
21818 1593 1363 0.855618
21826 2688 2009 0.747396
21845 2595 1854 0.714451
21889 1678 1285 0.765793
22085 1718 1314 0.764843
22153 1290 1139 0.882946
22347 2356 1629 0.691426
22359 2170 1552 0.715207
22396 1648 1337 0.811286
I would like to use AvgGenes as my x-axis and nCoreGenes as my primary y-axis. In addition, I would like to add a second y-axis for the ratio nCoreGenes/AvgGenes*100 (pCoreGenes). However, I couldn't work out how to express the formula y-axis/x-axis*100 in scale_y_continuous(sec.axis = sec_axis()) in ggplot2.
cluster2core$pCoreGenes <- cluster2core$Ratio*100
g6 <- ggplot(cluster2core, aes(AvgGenes, nCoreGenes))
g6 <- g6 + geom_point(aes(y = nCoreGenes)) + geom_smooth(method = lm)
g6 <- g6 + geom_line(aes(y = pCoreGenes))
g6 <- g6 + labs(y = "Number of core genes", x = "Average number of genes")
#g6 <- g6 + scale_y_continuous(sec.axis = sec_axis())
The mean value of the ratio is 78.7%, so I expect to get a horizontal line indicating that, on average, genomes have 78% core genes.
Using secondary axes in ggplot requires a cheat. You need to pretend that your secondary y axis data are in the same range as the primary y axis data, so scale it accordingly. Multiplying by 100 does not suffice, as you want to have the data in the range around 1000 or so. Multiplying by 4000 should get you there.
Then, you need to reverse the process for the axis, specifying an argument to sec_axis. Normally, you would divide by 4000, but since you want percentage, divide by 40:
library(ggplot2)
# df is the data frame from the question (called cluster2core there)
ggplot(df, aes(x=AvgGenes, y=nCoreGenes)) + geom_point() +
  geom_smooth(method=lm) +
  geom_line(aes(y=Ratio*4000)) +
  scale_y_continuous(sec.axis=sec_axis( ~ . / 40))
Also, there is no need to specify the aesthetics in geom_point, since they are inherited from the aesthetics in the ggplot() call.
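As a quick arithmetic check of the scaling described above, the round trip can be reproduced without plotting anything. The three ratios are taken from the question's data; scale_factor, plotted and labels are names made up for this sketch.

```r
scale_factor <- 4000   # the multiplier suggested in the answer

# Three Ratio values from the question's table
ratio <- c(0.809892, 0.859287, 0.615002)

# Where the line is drawn, in units of the primary axis (nCoreGenes)
plotted <- ratio * scale_factor

# What the secondary axis displays: sec_axis(~ . / 40) divides by 4000 / 100
labels <- plotted / (scale_factor / 100)

plotted  # roughly 3240, 3437, 2460
labels   # the ratios as percentages, roughly 81.0, 85.9, 61.5
```

So dividing by 40 instead of 4000 is what turns the plotted values back into percentages on the secondary axis.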

How to get indices of outliers in a dataframe boxplot?

I have a dataframe and I want to get the indices of the outliers in each column.
Here is part of my dataframe;
mediamarkt[,48]
[1] 7126 4012 3711 3237 3432 2671 2861 7065 3158 4023 4770 3861
[13] 4108 7408 9071 3596 3889 4093 4446 6059 8345 10291 5546 5129
[25] 4683 4670 5694 8619 11047 5743 5775 5216 5283 4854 7871 9944
[37] 3797 3821 3834 3999 4577 8898 11396 4508 5459 3668 3885 4021
[49] 7491 8831 3513 3606 3332 3189 3656 6859 9167 3306 3305 3379
[61] 3507 3912 6562 8245 3420 3445 3530 3404 3847 7187 9128 3623
[73] 3581 3401 2784 3024 6342 7835 2766 2718 2578 2591 2737 5479
[85] 7064 2528 2550 2287 1893 1846
First of all, I tried to get the outlier values with this code:
boxplot(mediamarkt[,48])$out
and I get 2 outliers:
[1] 11047 11396
Everything is fine so far, but when I try to get the indices of the outliers with the code below:
which(mediamarkt[,48] %in% boxplot_mediamarkt$out)
[1] 5 18 29 43 59
I get more than 2 indices, which does not match the result above.
What is wrong with my code?
Could anyone help me solve this problem?
@G5W has asked a question that remains open. This code shows an easy way to input your data and suggests that your boxplot_mediamarkt is not the output of boxplot or boxplot.stats from your data.
dat <- scan()
1: 7126 4012 3711 3237 3432 2671 2861 7065 3158 4023 4770 3861
13: 4108 7408 9071 3596 3889 4093 4446 6059 8345 10291 5546 5129
25: 4683 4670 5694 8619 11047 5743 5775 5216 5283 4854 7871 9944
37: 3797 3821 3834 3999 4577 8898 11396 4508 5459 3668 3885 4021
49: 7491 8831 3513 3606 3332 3189 3656 6859 9167 3306 3305 3379
61: 3507 3912 6562 8245 3420 3445 3530 3404 3847 7187 9128 3623
73: 3581 3401 2784 3024 6342 7835 2766 2718 2578 2591 2737 5479
85: 7064 2528 2550 2287 1893 1846
91:
Read 90 items
> boxplot(dat)$out
[1] 11047 11396
> which(dat %in% boxplot(dat)$out)
[1] 29 43
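To get the indices for every column without drawing a plot each time, boxplot.stats() can be applied column by column. This is a minimal sketch on a made-up data frame (toy, with columns a and b); the helper name outlier_indices is hypothetical.

```r
# Hypothetical helper: positions of the values that boxplot.stats() flags as outliers
outlier_indices <- function(x) {
  which(x %in% boxplot.stats(x)$out)
}

# Made-up data: column a has one obvious outlier, column b has none
toy <- data.frame(
  a = c(1, 2, 3, 4, 100),
  b = c(10, 11, 12, 13, 14)
)

# One vector of indices per column
res <- lapply(toy, outlier_indices)
res$a  # 5
res$b  # integer(0)
```

The key point is that the outliers are computed from the same vector that is being searched, which avoids the mismatch seen with boxplot_mediamarkt above.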

Barplot using three columns

The data in the table is given below:
Year NSW Vic. Qld SA WA Tas. NT ACT Aust.
1 1917 1904 1409 683 440 306 193 5 3 4941
2 1927 2402 1727 873 565 392 211 4 8 6182
3 1937 2693 1853 993 589 457 233 6 11 6836
4 1947 2985 2055 1106 646 502 257 11 17 7579
5 1957 3625 2656 1413 873 688 326 21 38 9640
6 1967 4295 3274 1700 1110 879 375 62 103 11799
7 1977 5002 3837 2130 1286 1204 415 104 214 14192
8 1987 5617 4210 2675 1393 1496 449 158 265 16264
9 1997 6274 4605 3401 1480 1798 474 187 310 18532
I want to plot a graph with (Year) on my x-axis and (total value) on my y-axis. The barplot should depict the ACT and NT values for the respective (Years).
I tried the following command:
barplot(as.matrix(r_data$ACT, r_data$NT), main="r_data", ylab="Total", beside=TRUE)
The above command showed the bars of the ACT column per year, but didn't show the bars of the NT column.
You have to create the matrix in a different way:
barplot(as.matrix(r_data[c("ACT", "NT")]),
        main="r_data", ylab="Total", beside=TRUE)
You can also use cbind instead of as.matrix and keep the rest of your original approach:
barplot(cbind(r_data$ACT, r_data$NT),
        main="r_data", ylab="Total", beside=TRUE)
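If the years should run along the x-axis with ACT and NT side by side within each year, one variation is to transpose the matrix so that each year becomes a column. This is a sketch under that assumption, using only the first three rows of the table above; the name m is made up.

```r
# First three rows of the table from the question (Year, NT and ACT columns only)
r_data <- data.frame(
  Year = c(1917, 1927, 1937),
  NT   = c(5, 4, 6),
  ACT  = c(3, 8, 11)
)

# barplot() draws one group of bars per matrix column,
# so transpose to get one group per year
m <- t(as.matrix(r_data[c("ACT", "NT")]))
colnames(m) <- r_data$Year

barplot(m, beside = TRUE, legend.text = TRUE,
        main = "r_data", xlab = "Year", ylab = "Total")
```

With legend.text = TRUE the row names of the matrix (ACT and NT) are used for the legend.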
