I would like to generate a summary of a histogram in table format. With plot=FALSE, I am able to get the histogram object.
> hist(y,plot=FALSE)
$breaks
[1] 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8
$counts
[1] 48 1339 20454 893070 1045286 24284 518 171 148
[10] 94 42 42 37 25 18 21 14 5
$density
[1] 0.00012086929 0.00337174962 0.05150542703 2.24884871999 2.63214538964
[6] 0.06114978928 0.00130438111 0.00043059685 0.00037268032 0.00023670236
[11] 0.00010576063 0.00010576063 0.00009317008 0.00006295276 0.00004532598
[16] 0.00005288032 0.00003525354 0.00001259055
$mids
[1] 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7
$xname
[1] "y"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Is there a way to summarize this object like a Pareto chart summary? (The summary below is for different data; I'm including it as an example.)
Pareto chart analysis for counts
Frequency Cum.Freq. Percentage Cum.Percent.
c 2294652 2294652 33.689225770 33.68923
f 1605467 3900119 23.570868362 57.26009
g 896893 4797012 13.167848880 70.42794
i 464220 5261232 6.815505091 77.24345
b 365399 5626631 5.364651985 82.60810
j 332239 5958870 4.877809219 87.48591
h 215313 6174183 3.161145249 90.64705
l 129871 6304054 1.906717637 92.55377
e 107001 6411055 1.570948818 94.12472
k 104954 6516009 1.540895526 95.66562
d 103648 6619657 1.521721321 97.18734
m 56172 6675829 0.824696377 98.01203
o 51093 6726922 0.750128391 98.76216
n 49320 6776242 0.724097865 99.48626
p 32321 6808563 0.474524881 99.96079
q 1334 6809897 0.019585291 99.98037
r 620 6810517 0.009102609 99.98947
s 247 6810764 0.003626362 99.99310
u 182 6810946 0.002672056 99.99577
t 162 6811108 0.002378424 99.99815
z 126 6811234 0.001849885 100.00000
You can write a wrapper function that will convert the relevant parts of the hist output into a data.frame:
myfun <- function(x) {
  h <- hist(x, plot = FALSE)
  data.frame(Frequency = h$counts,
             Cum.Freq = cumsum(h$counts),
             Percentage = h$density / sum(h$density),
             Cum.Percent = cumsum(h$density) / sum(h$density))
}
Here's an example on the built-in iris dataset:
myfun(iris$Sepal.Width)
# Frequency Cum.Freq Percentage Cum.Percent
# 1 4 4 0.026666667 0.02666667
# 2 7 11 0.046666667 0.07333333
# 3 13 24 0.086666667 0.16000000
# 4 23 47 0.153333333 0.31333333
# 5 36 83 0.240000000 0.55333333
# 6 24 107 0.160000000 0.71333333
# 7 18 125 0.120000000 0.83333333
# 8 10 135 0.066666667 0.90000000
# 9 9 144 0.060000000 0.96000000
# 10 3 147 0.020000000 0.98000000
# 11 2 149 0.013333333 0.99333333
# 12 1 150 0.006666667 1.00000000
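A hedged variant of the wrapper (not part of the original answer): label each row with its bin interval, built from h$breaks, so the table reads more like the Pareto summary above:

```r
# Sketch: myfun with bin labels and percentages scaled to 100.
# Uses counts directly; for equidistant breaks this matches the
# density-based version.
myfun2 <- function(x) {
  h <- hist(x, plot = FALSE)
  data.frame(Bin = paste0("(", head(h$breaks, -1), ",", tail(h$breaks, -1), "]"),
             Frequency = h$counts,
             Cum.Freq = cumsum(h$counts),
             Percentage = 100 * h$counts / sum(h$counts),
             Cum.Percent = 100 * cumsum(h$counts) / sum(h$counts))
}
myfun2(iris$Sepal.Width)
```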
Let's say I have the following dataset:
Industry Country Year AUS AUS AUT AUT ...
A AUS 1 0.5 0.2 0.1 0.01
B AUS 2 0.3 0.5 2 0.1
A AUT 3 1 1.2 1.3 0.3
B AUT 4 0.5 0 0.8 2
... ... ... ... ... ... ....
VA 11 10 47 55
tot 24 23 50 70
How can I subtract ONLY the last two rows (FI = tot - VA) to get:
Industry Country Year AUS AUS AUT AUT ...
A AUS 1 0.5 0.2 0.1 0.01
B AUS 2 0.3 0.5 2 0.1
A AUT 3 1 1.2 1.3 0.3
B AUT 4 0.5 0 0.8 2
... ... ... ... ... ... ....
VA 11 10 47 55
FI 13 13 3 15
FI/VA 1.2 1.3 0.06 0.27
Where FI is simply tot-VA
You could try this:
- check which columns are numeric
- use sapply to calculate the new row FI
- bind them together
num <- sapply(df, class) == "numeric"
df_tot <- data.frame(as.list(sapply(df[, num], function(x) x[length(x)] - x[length(x) - 1])))
df_tot$Industry <- "FI"
df <- data.table::rbindlist(list(df, df_tot), fill = TRUE)
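A quick sanity check of this approach on a toy data frame (the column names AUS/AUT are illustrative):

```r
# Toy data mirroring the structure in the question
df <- data.frame(Industry = c("A", "B", "VA", "tot"),
                 AUS = c(0.5, 0.3, 11, 24),
                 AUT = c(0.1, 2, 47, 50))
num <- sapply(df, class) == "numeric"
# difference of the last two rows of each numeric column (tot - VA)
df_tot <- data.frame(as.list(sapply(df[, num], function(x) x[length(x)] - x[length(x) - 1])))
df_tot$Industry <- "FI"
df <- data.table::rbindlist(list(df, df_tot), fill = TRUE)
df
# the last row is FI with AUS = 13 (24 - 11) and AUT = 3 (50 - 47)
```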
EDIT:
If you just want to sum up all rows but the last one, then you could try this:
num <- sapply(df, class) == "numeric"
df_tot <- data.frame(as.list(sapply(df[1:(nrow(df)-1), num], sum)))
df_tot$Industry <- "FI"
df <- data.table::rbindlist(list(df, df_tot), fill = TRUE)
Here's a tidyverse approach to the issue of subtracting selected rows across columns:
library(tidyverse)
df %>%
  # subtract across the relevant columns:
  summarise(across(matches("^AU"), ~(.x[Industry == "tot"] - .x[Industry == "VA"]))) %>%
  # add the 'new' column `Industry`:
  mutate(Industry = "FI") %>%
  # bind result back into `df`:
  bind_rows(df, .)
Industry AU1 AU2 AU3
1 A 0.1 0.4 7.0
2 B 0.7 3.0 1.0
3 A 3.0 2.5 0.1
4 VA 11.0 10.0 47.0
5 tot 24.0 23.0 50.0
6 FI 13.0 13.0 3.0
If you no longer need rows #4 and #5, add this to the pipe:
filter(!Industry %in% c("VA", "tot"))
Data:
df <- data.frame(
  Industry = c("A", "B", "A", "VA", "tot"),
  AU1 = c(0.1, 0.7, 3, 11, 24),
  AU2 = c(0.4, 3, 2.5, 10, 23),
  AU3 = c(7, 1, 0.1, 47, 50)
)
I'm a bit new to R and want to remove a column from a matrix by the name of that column. I know that X[,2] gives the second column and X[,-2] gives every column except the second one. What I really want to know is whether there's a similar command using column names. I've got a matrix and want to remove the "sales" column, but X[,-"sales"] doesn't seem to work. How should I do this? I would use the column number, except that I want to be able to reuse this for other matrices later, which have different dimensions. Any help would be much appreciated.
I'm not sure why all the answers are solutions for data frames and not matrices.
Per @Sotos's and @Moody_Mudskipper's comments, here is an example with the built-in state.x77 data matrix.
dat <- head(state.x77)
dat
#> Population Income Illiteracy Life Exp Murder HS Grad Frost Area
#> Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
#> Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
#> Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
#> Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
#> California 21198 5114 1.1 71.71 10.3 62.6 20 156361
#> Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
# for removing one column
dat[, colnames(dat) != "Area"]
#> Population Income Illiteracy Life Exp Murder HS Grad Frost
#> Alabama 3615 3624 2.1 69.05 15.1 41.3 20
#> Alaska 365 6315 1.5 69.31 11.3 66.7 152
#> Arizona 2212 4530 1.8 70.55 7.8 58.1 15
#> Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
#> California 21198 5114 1.1 71.71 10.3 62.6 20
#> Colorado 2541 4884 0.7 72.06 6.8 63.9 166
# for removing more than one column
dat[, !colnames(dat) %in% c("Area", "Life Exp")]
#> Population Income Illiteracy Murder HS Grad Frost
#> Alabama 3615 3624 2.1 15.1 41.3 20
#> Alaska 365 6315 1.5 11.3 66.7 152
#> Arizona 2212 4530 1.8 7.8 58.1 15
#> Arkansas 2110 3378 1.9 10.1 39.9 65
#> California 21198 5114 1.1 10.3 62.6 20
#> Colorado 2541 4884 0.7 6.8 63.9 166
#be sure to use `colnames` and not `names`
names(state.x77)
#> NULL
Created on 2020-06-27 by the reprex package (v0.3.0)
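If you remove columns by name often, the pattern is easy to wrap in a small helper (a sketch; drop_cols is a hypothetical function, not part of base R):

```r
# Hypothetical helper: drop named columns from a matrix.
# drop = FALSE keeps the result a matrix even if only one column remains.
drop_cols <- function(m, cols) {
  m[, !colnames(m) %in% cols, drop = FALSE]
}

dat <- head(state.x77)
colnames(drop_cols(dat, c("Area", "Life Exp")))
#> [1] "Population" "Income"     "Illiteracy" "Murder"     "HS Grad"    "Frost"
```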
My favorite way:
# create data
df <- data.frame(x = runif(100),
                 y = runif(100),
                 remove_me = runif(100),
                 remove_me_too = runif(100))
# remove column
df <- df[,!names(df) %in% c("remove_me", "remove_me_too")]
so this dataframe:
> df
x y remove_me remove_me_too
1 0.731124508 0.535219259 0.33209113 0.736142042
2 0.612017350 0.404128030 0.84923974 0.624543223
3 0.415403559 0.369818154 0.53817387 0.661263087
4 0.199780006 0.679946936 0.58782429 0.085624708
5 0.343304259 0.892128112 0.02827132 0.038203599
becomes this:
> df
x y
1 0.731124508 0.535219259
2 0.612017350 0.404128030
3 0.415403559 0.369818154
4 0.199780006 0.679946936
5 0.343304259 0.892128112
As always in R there are many potential solutions. You can use the package dplyr and select() to easily remove or select columns in a data frame.
df <- data.frame(x = runif(100),
                 y = runif(100),
                 remove_me = runif(100),
                 remove_me_too = runif(100))
library(dplyr)
select(df, -remove_me, -remove_me_too) %>% head()
#> x y
#> 1 0.35113636 0.134590652
#> 2 0.72545356 0.165608839
#> 3 0.81000067 0.090696049
#> 4 0.29882204 0.004602398
#> 5 0.93492918 0.256870750
#> 6 0.03007377 0.395614901
You can read more about dplyr and its verbs in the dplyr documentation.
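When many unwanted columns share a name pattern, dplyr's tidy-select helpers such as starts_with() save listing them one by one:

```r
library(dplyr)

df <- data.frame(x = runif(5),
                 y = runif(5),
                 remove_me = runif(5),
                 remove_me_too = runif(5))

# drop every column whose name starts with "remove"
select(df, -starts_with("remove"))
```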
As a general point, if you remove so many columns that only one column remains, R will drop the dimensions and return a plain vector. You can prevent this by setting drop = FALSE.
(df <- data.frame(x = runif(6),
                  y = runif(6),
                  remove_me = runif(6),
                  remove_me_too = runif(6)))
# x y remove_me remove_me_too
# 1 0.4839869 0.18672217 0.0973506 0.72310641
# 2 0.2467426 0.37950878 0.2472324 0.80133920
# 3 0.4449471 0.58542547 0.8185943 0.57900456
# 4 0.9119014 0.12089776 0.2153147 0.05584816
# 5 0.4979701 0.04890334 0.7420666 0.44906667
# 6 0.3266374 0.37110822 0.6809380 0.29091746
df[, -c(3, 4)]
# x y
# 1 0.4839869 0.18672217
# 2 0.2467426 0.37950878
# 3 0.4449471 0.58542547
# 4 0.9119014 0.12089776
# 5 0.4979701 0.04890334
# 6 0.3266374 0.37110822
# Result is a numeric vector
df[, -c(2, 3, 4)]
# [1] 0.4839869 0.2467426 0.4449471 0.9119014 0.4979701 0.3266374
# Keep the data.frame structure
df[, -c(2, 3, 4), drop = FALSE]
# x
# 1 0.4839869
# 2 0.2467426
# 3 0.4449471
# 4 0.9119014
# 5 0.4979701
# 6 0.3266374
Hi, I'm currently trying to extract some of the inner node information stored in the constparty object returned by ctree in partykit, but I'm finding it a bit difficult to navigate the object. I'm able to display the information on a plot, but I'm not sure how to extract it - I think it requires nodeapply or another function in partykit?
library(partykit)
irisct <- ctree(Species ~ .,data = iris)
plot(irisct, inner_panel = node_barplot(irisct))
Plot with inner node details
All the information is accessible by the functions to plot, but I'm after a text output similar to:
Example output
The main trick (as previously pointed out by @G5W) is to take the [id] subset of the party object and then extract the data (by either $data or using the data_party() function), which contains the response. I would recommend building a table with absolute frequencies first and then computing the relative and marginal frequencies from that. Using the irisct object, the plain table can be obtained by
tab <- sapply(1:length(irisct), function(id) {
  y <- data_party(irisct[id])
  y <- y[["(response)"]]
  table(y)
})
tab
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## setosa 50 50 0 0 0 0 0
## versicolor 50 0 50 49 45 4 1
## virginica 50 0 50 5 1 4 45
Then we can add a little bit of formatting to a nice table object:
colnames(tab) <- 1:length(irisct)
tab <- as.table(tab)
names(dimnames(tab)) <- c("Species", "Node")
And then use prop.table() and margin.table() to compute the frequencies we are interested in. The as.data.frame() method transform from the table layout to a "long" data.frame:
as.data.frame(prop.table(tab, 1))
## Species Node Freq
## 1 setosa 1 0.500000000
## 2 versicolor 1 0.251256281
## 3 virginica 1 0.322580645
## 4 setosa 2 0.500000000
## 5 versicolor 2 0.000000000
## 6 virginica 2 0.000000000
## 7 setosa 3 0.000000000
## 8 versicolor 3 0.251256281
## 9 virginica 3 0.322580645
## 10 setosa 4 0.000000000
## 11 versicolor 4 0.246231156
## 12 virginica 4 0.032258065
## 13 setosa 5 0.000000000
## 14 versicolor 5 0.226130653
## 15 virginica 5 0.006451613
## 16 setosa 6 0.000000000
## 17 versicolor 6 0.020100503
## 18 virginica 6 0.025806452
## 19 setosa 7 0.000000000
## 20 versicolor 7 0.005025126
## 21 virginica 7 0.290322581
as.data.frame(margin.table(tab, 2))
## Node Freq
## 1 1 150
## 2 2 50
## 3 3 100
## 4 4 54
## 5 5 46
## 6 6 8
## 7 7 46
And the split information can be obtained with the (still unexported) .list.rules.party() function. You just need to ask for all node IDs (the default is to use just the terminal node IDs):
partykit:::.list.rules.party(irisct, i = nodeids(irisct))
## 1
## ""
## 2
## "Petal.Length <= 1.9"
## 3
## "Petal.Length > 1.9"
## 4
## "Petal.Length > 1.9 & Petal.Width <= 1.7"
## 5
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length <= 4.8"
## 6
## "Petal.Length > 1.9 & Petal.Width <= 1.7 & Petal.Length > 4.8"
## 7
## "Petal.Length > 1.9 & Petal.Width > 1.7"
Most of the information that you want is accessible without much work.
I will show how to get the information, but leave you to format the
information into a pretty table.
Notice that your tree structure irisct is just a list of each of the nodes.
length(irisct)
[1] 7
Each node has a field data that contains the points that have made it down
this far in the tree, so you can get the number of observations at the node
by counting the rows.
dim(irisct[4]$data)
[1] 54 5
nrow(irisct[4]$data)
[1] 54
Or doing them all at once to get your table 2
NObs = sapply(1:7, function(n) { nrow(irisct[n]$data) })
NObs
[1] 150 50 100 54 46 8 46
The first column of the data at a node is the class (Species),
so you can get the count of each class and the probability of each class
at a node
table(irisct[4]$data[1])
setosa versicolor virginica
0 49 5
table(irisct[4]$data[1]) / NObs[4]
setosa versicolor virginica
0.00000000 0.90740741 0.09259259
The split information in your table 3 is a bit more awkward. Still,
you can get a text version of what you need just by printing out the
top level node
irisct[1]
Model formula:
Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
Fitted party:
[1] root
| [2] Petal.Length <= 1.9: setosa (n = 50, err = 0.0%)
| [3] Petal.Length > 1.9
| | [4] Petal.Width <= 1.7
| | | [5] Petal.Length <= 4.8: versicolor (n = 46, err = 2.2%)
| | | [6] Petal.Length > 4.8: versicolor (n = 8, err = 50.0%)
| | [7] Petal.Width > 1.7: virginica (n = 46, err = 2.2%)
Number of inner nodes: 3
Number of terminal nodes: 4
To save the output for parsing and display
TreeSplits = capture.output(print(irisct[1]))
I have a gridded field that I plotted with the image function
df <- datainSUB
yr mo dy hr lon lat cell sst avg moavg
1900 6 5 17 -73.5 -60.5 83 2.4 2.15 3.15
1900 6 7 17 -74.5 -60.5 83 3.9 2.15 3.15
1900 8 17 17 -70.5 -60.5 83 -0.9 2.15 0.60
1900 8 18 17 -73.5 -60.5 83 2.1 2.15 0.60
1900 9 20 17 -71.5 -60.5 83 0.2 2.15 2.20
1900 9 21 17 -74.5 -61.5 83 1.6 2.15 2.20
gridplot <- function(df, space = 1){  # `space` was undefined in the original; passed as an argument here
  pdf(paste0(df$mo[1], ".pdf"))
  # Compute the ordered x- and y-values
  LON <- seq(-180, 180, by = space)
  LAT <- seq(-90, 90, by = space)
  # Build the matrix to be plotted
  moavg <- matrix(NA, nrow = length(LON), ncol = length(LAT))
  moavg[cbind(match(round(df$lon, -1), LON), match(round(df$lat, -1), LAT))] <- df$moavg
  # Plot the image
  image(LON, LAT, moavg)
  map(add = TRUE, col = "saddlebrown", interior = FALSE, database = "world")  # from the maps package
  dev.off()
}
I want to add a colour legend to the plot but I don't know how to do that. Maybe ggplot is better?
Many thanks
Add the following line after plotting your data:
legend(x="topright", "your legend goes here", fill="saddlebrown")
I'm working on two datasets derived from cats, a built-in R dataset (from the MASS package).
> cats
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
7 F 2.1 8.1
8 F 2.1 8.2
9 F 2.1 8.3
10 F 2.1 8.5
11 F 2.1 8.7
12 F 2.1 9.8
...
137 M 3.6 13.3
138 M 3.6 14.8
139 M 3.6 15.0
140 M 3.7 11.0
141 M 3.8 14.8
142 M 3.8 16.8
143 M 3.9 14.4
144 M 3.9 20.5
I want to find the 99% confidence interval on the difference of mean Bwt values between the male and female specimens (Sex == M and Sex == F respectively).
I know that t.test does this, among other things, but if I break cats up into two datasets that contain the Bwt of males and females, t.test() complains that the two datasets are not of the same length, which is true: there are only 47 females in cats, and 97 males.
Is it doable some other way or am I misinterpreting data by breaking them up?
EDIT:
I have a function, suggested to me by an answerer on another question, that gets the CI of the mean of a dataset; it may come in handy:
ci_func <- function(data, ALPHA){
  c(
    mean(data) - qnorm(1 - ALPHA/2) * sd(data)/sqrt(length(data)),
    mean(data) + qnorm(1 - ALPHA/2) * sd(data)/sqrt(length(data))
  )
}
You should apply the t.test with the formula interface:
t.test(Bwt ~ Sex, data=cats, conf.level=.99)
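The 99% interval can then be pulled straight out of the returned htest object:

```r
library(MASS)  # the cats dataset lives in MASS

tt <- t.test(Bwt ~ Sex, data = cats, conf.level = 0.99)
tt$conf.int  # 99% CI for mean(F) - mean(M) of Bwt
```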
As an alternative to t.test, if you are really only interested in the difference of means, you can use:
DescTools::MeanDiffCI(cats$Bwt, cats$Sex)
which gives something like
meandiff lwr.ci upr.ci
-23.71474 -71.30611 23.87662
This is calculated with 999 bootstrapped samples by default. If you want more, you can specify this via the R argument:
DescTools::MeanDiffCI(cats$Bwt, cats$Sex, R = 1000)