Counts, bars, bins for each pandas DataFrame histogram subplot - count

I am making separate histograms of travel distance per departure hour. However, for making further calculations I'd like to have the value of each bin in a histogram, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color = 'red',
edgecolor = 'black',figsize=(15,15),sharex=True,density=True)
This creates in my case a figure with 21 small histograms.
With a single histogram, I would put counts, bins, bars = in front of the call, and the variable counts would hold the data I am looking for. In this case, however, that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Thanks in advance!
Edit:
Data I'm using, about 2500 columns of this, Distance is float64, the Departuretime is str
Histogram output I'm receiving
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns

By using the cut function you can extract the requested data directly from your dataframe instead of from the graph. This is less error-prone.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then you can use pivot_table to obtain a table of counts for each combination of DistanceBin and Departuretime, as rows and columns respectively, as you asked. Passing values='Distance' restricts the count to that one column.
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
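If densities rather than raw counts are needed, as in the question, one possible approach is to call np.histogram per departure hour. The following is a minimal sketch under assumptions: the sample data is made up, and all hours share one set of bin edges, which may differ from the bins each subplot in the original figure chose for itself.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Distance": rng.uniform(0, 100, size=500),
    "Departuretime": rng.choice(["07", "08", "09"], size=500),
})

# Assumption: every hour uses the same 10 distance bins.
edges = np.histogram_bin_edges(df["Distance"], bins=10)

# One density histogram per departure hour.
densities = {
    hour: np.histogram(group["Distance"], bins=edges, density=True)[0]
    for hour, group in df.groupby("Departuretime")
}

# Rows: distance bins; columns: departure hours.
density_table = pd.DataFrame(
    densities,
    index=pd.IntervalIndex.from_breaks(edges, name="DistanceBin"),
)
print(density_table)
```

Each column, multiplied by the bin widths, sums to 1, which is the same normalisation that density=True applies in the plot.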

Related

Re-ordering box plots by mean in R

I have some code to plot boxplots for 150 variables, each with 3 replicates. This works no problem, but I want to reorder them so they appear from lowest mean to highest mean along the x-axis. Any suggestions on what I can use to do this?
You can order the variables according to their mean value by just writing:
ordered.names = names(df)[order(colMeans(df))]
Then you'll know the order in which to draw them, and you can use this order in the loop that draws the plots or to reorder the data.frame itself (df = df[, ordered.names]).

How to create a 100% stacked bar chart in R by counting data?

I am trying to create a bar chart using ggplot that adds up difference scores and groups them with positive or negative values and then creates a graph of the percentage. I can't seem to figure out the right code to do this however and could use some guidance.
I have two columns I am focusing on: one for the grade level and then another column with the difference score. I tried summing up the values of positive and negative for an aggregate total, but kept running into errors manipulating that data.
I ended up making a new column and merged it to the data frame if the values in a row were less than or greater than 0. I was able to graph this, but I struggle to create a 100% stacked bar chart.
Ideally what I hope to do is to create a stacked bar chart with grades 6th - 10th in the X-axis and the y-axis being the percentage of students in that grade with a positive difference score against the % with a negative score.
# Attempting to create a new column of boolean values to create the chart
Pos_Neg_df <- c(Fall_Math_Data$RITDifference >0)
Percentage_Math_Data <- cbind(Fall_Math_Data, Pos_Neg_df)
# Plotted this
ggplot(Percentage_Math_Data) + geom_bar(aes(x = Grade, fill = Pos_Neg_df))
Can you provide some sample data? It's difficult to see exactly what you're trying to do. That said, adding position = "fill" to your geom_bar may be what you're looking for: it scales every bar to 100%, whereas position = "stack" is the default and keeps the raw counts (see the ggplot2 documentation).
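The thread above uses R. For readers working in pandas, as in the question at the top of this page, the per-grade percentages behind a 100% stacked bar chart could be computed roughly as follows. This is only a sketch: the sample data is invented, the column names are borrowed from the question, and a difference of exactly zero is counted as "Negative" here.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
scores = pd.DataFrame({
    "Grade": [6, 6, 6, 6, 7, 7, 8, 8, 8, 8],
    "RITDifference": [3, -1, 5, 2, -4, 6, -2, -3, 1, -5],
})

# Label each row by the sign of its difference score
# (assumption: zero is treated as not positive).
sign = (
    scores["RITDifference"].gt(0)
    .map({True: "Positive", False: "Negative"})
    .rename("Sign")
)

# normalize="index" makes each grade's row sum to 1.
shares = pd.crosstab(scores["Grade"], sign, normalize="index")
print(shares)

# With matplotlib installed, this draws the 100% stacked bars:
# shares.plot(kind="bar", stacked=True)
```

Because every row of shares sums to 1, stacking the two columns fills each bar to exactly 100%.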

How to plot a barchart or histogram in R, indicating the probability (discrete data)?

I am new to R. I have discrete data. I want to plot a chart (barchart or histogram) indicating for each existing value (in my data) the normalized number of occurrences (actual count for that value divided by total records). For the moment I have figured out to use:
hist(mydata$x,5,probability = TRUE)
where the number 5 corresponds to the number of rectangles. This example works if the base of the rectangle is length=1, therefore I would always need to know the range of results and I could not have data like {0, 0.5, 1, 1.5, ...}. How to make a more general solution? I really think that there is a single line solution, for something so basic.
Thanks
I assume you are looking for the combination
table()
barplot()
e.g.
counts <- table(mtcars$gear)
barplot(counts / sum(counts), main="Car Distribution", xlab="Number of Gears")
Yes. There is a line for this.
barplot(prop.table(table(data$x)))
data$x is a discrete variable.
table(data$x) gives a table whose first row holds the distinct values of data$x and
whose second row holds the frequency of each of those values.
prop.table(table(data$x)) gives the same table, but with each frequency divided by the
total count, so you get the proportion of records taking each value.
barplot draws the bar chart: the x-axis shows the first row of prop.table(table(data$x)) and the y-axis shows the second row.
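The answers above are in R. As a rough pandas counterpart, for readers following the pandas question at the top of this page, value_counts(normalize=True) computes the same proportions. The data below is hypothetical and includes non-integer values such as 0.5, as in the question.

```python
import pandas as pd

# Hypothetical discrete data, including non-integer values.
x = pd.Series([0, 0.5, 0.5, 1, 1, 1, 1.5, 1.5])

# Share of records per distinct value, ordered by value.
proportions = x.value_counts(normalize=True).sort_index()
print(proportions)

# With matplotlib installed, this draws the bar chart:
# proportions.plot(kind="bar")
```

No bin count or value range has to be known in advance, because each distinct value gets its own bar.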

switching the place of x and y axis data in r

I have a vector of data consisting of 20,000 numbers between 0 and 1. I want to plot this data with the number values on the x-axis and their frequencies on the y-axis.
    |
Freq|
    |
    |
    |______________
         values
But when I use plot(vector) in R, the x-axis is labelled Index and the number values are on the y-axis.
I couldn't find anything helpful among the arguments of the plot() function.
Does anybody know how I could do this?
If you want a plot of frequencies, the best type of plot to make would be a barplot and the easiest way to make a barplot is just to pass a table to barplot(). For example
barplot(table(vector))
or if you just want a needle-style plot
plot(table(vector))
would also work.
If you want to trim outliers from the table, you could try
barplot( table( vector[vector<quantile(vector, .98)] ) )
Here we drop samples at or above the 98th percentile.

Can qplot directly display percentages without intermediate columns?

I have a bunch of histograms to plot on data that is still coming. As the sample sizes vary, in order to compare them I need to plot the histograms with percentages not counts.
qplot (field, data=mydata, geom="histogram", binwidth=10)
The above qplot displays the counts. The density option is not applicable, as it divides the counts within a bin by the bin's width, whereas I need to divide by the total number of samples.
I can precalculate a column containing the percentage, but it's cumbersome (I have many data sets).
Is there a better way to tell qplot to directly plot the histogram with percentages (ideally, also displayed as percentages (as 69%) and not as 0.69)?
Thanks!
Try this:
ggplot(movies, aes(x = rating)) +
geom_histogram(aes(y = after_stat(count / sum(count)))) +
scale_y_continuous(labels = scales::percent)
