I would like to produce in R the equivalent of Stata's binscatter command.
I found the statar package, whose stat_binmean function should give the same result.
I am having problems setting the bins, though. I want to specify the exact values of x at which the bins are constructed. So far I have only managed to set the number of bins, leaving R to choose the corresponding values of x.
The following is my code:
library(statar)
library(ggplot2)
g<-ggplot( df , aes(x=var_x , y=var_y))
g + stat_binmean(n=0)
The statar documentation says: "Set (n) to zero if you want to use distinct value of x for grouping", but how do I specify the particular values to group on?
PS: I am also fine with other commands, like stat_summary_bin, but my problem stays the same.
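One possible approach, sketched under assumptions rather than taken from the statar documentation, is to build the bins yourself with base R's cut(), passing the x values you want as breaks, and then average within each bin. The data frame df and the edges vector below are made up for illustration; only var_x and var_y come from the question.

```r
# Toy data standing in for the real df (assumption: var_x and var_y are numeric)
set.seed(1)
df <- data.frame(var_x = runif(500, 0, 10))
df$var_y <- 2 * df$var_x + rnorm(500)

# Hypothetical bin edges: the specific x values at which bins start and end
edges <- c(0, 1, 2.5, 5, 10)

# Assign every observation to the bin defined by those edges
df$bin <- cut(df$var_x, breaks = edges, include.lowest = TRUE)

# Mean of x and y within each bin: the points a binned scatter plot would show
bin_means <- aggregate(cbind(var_x, var_y) ~ bin, data = df, FUN = mean)
print(bin_means)
```

The resulting bin_means can be passed to ggplot() with geom_point(). If you prefer to stay inside ggplot2, stat_summary_bin() also takes a breaks argument, so in recent ggplot2 versions something like g + stat_summary_bin(fun = "mean", breaks = edges, geom = "point") should work; check the argument names against your installed version.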
I have a CSV file with roughly 180 columns (called df in my example). So far I have managed to plot the data with the ggstatsplot::ggbetweenstats function. One column, Group, contains the treatment condition and is on the x-axis. The y-axis variable changes for each plot (in the example below it is Bcells.CD45).
ggstatsplot::ggbetweenstats (df, x = Group, y = Bcells.CD45 , plot.type = "violin")
Now I tried to use a for loop to replace the y-axis variable for each generated plot.
for (i in names(df) [1:ncol(df)]) { ggstatsplot::ggbetweenstats(df, x = Group, y = i , plot.type = "violin")}
R returns the following error:
can't subset columns that don't exist.
x The column i doesn't exist.
Run rlang::last_error() to see where the error occurred.
I have the impression that either the ggstatsplot package can't handle i as a placeholder for changing column names, or I'm making a mistake in defining i.
Thanks for your help!
Best Martin
Please have a look at the FAQ-vignette on ggstatsplot website, which documents how to use ggstatsplot functions in a for loop:
https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/faq.html
Relevant text:
I am trying to decile my data into equal bins and summarise it to see if there are any existing patterns with respect to the Dependent Variable. While summarising the data, I also want to see the lower bound and the upper bound of a variable for each decile.
I have written the code below in R:
telecom_final_Analyse<-read.csv("sampletelecomfinal.csv")
col_name_final<-colnames(telecom_final_Analyse)
Variable_profile<-vector("list",79) #I have 79 variables
names(Variable_profile)<-col_name_final
for (j in 1:79) {
if(class(telecom_final_Analyse[,col_name_final[j]])=="numeric" || class(telecom_final_Analyse[,col_name_final[j]])=="integer"){
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse[,col_name_final[j]],10))->telecom_final_Analyse
z<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan = min(z,na.rm=TRUE),lessthan=max(z,na.rm=TRUE))->Variable_profile[[col_name_final[j]]]
}
else{
x<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by_(x)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N)->Variable_profile[[col_name_final[j]]]
}
}
I am getting the following error - Error in min(z, na.rm = TRUE) : invalid 'type' (symbol) of argument
The following is the code I used for one variable to get the desired output. I want the same output for all integer/numeric variables in the dataset.
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse$eqpdays ,10))->telecom_final_Analyse
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan=min(eqpdays,na.rm=TRUE),lessthan=max(eqpdays,na.rm=TRUE))
I am able to do it manually for 1 variable, this is the output I got. The same way I want for my other continuous variables as well
I've not run this (no reprex), but you can extend your single-variable code with mutate_if(is.numeric, {a function}, {some parameters}).
See: https://dplyr.tidyverse.org/reference/mutate_all.html
So try...
telecom_final_Analyse%>%mutate_if(is.numeric, ntile, 10)
Note that this will mutate the existing columns. If you want to keep the old ones and create new ones, you can wrap multiple mutate functions in "list(first_function, second_function)", and the output data set will then be wider than before. It's all there in the online help.
Hope this works for you
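As far as I can tell, the error in the question arises because as.name() produces a symbol, and min(z) then receives that symbol object rather than the column it names. Inside dplyr verbs the usual way to refer to a column held in a string is the .data pronoun, e.g. min(.data[[col_name_final[j]]], na.rm = TRUE). The sketch below shows the same idea in base R only, so it runs without extra packages. The helper decile_profile, the toy data and the column mou are hypothetical; eqpdays and churn are taken from the question. The rank-based deciles are similar in spirit to ntile() but may split ties and remainders slightly differently.

```r
# Hypothetical helper: decile profile of one column, selected by its name
decile_profile <- function(data, col, target = "churn") {
  v <- data[[col]]                       # look the column up by string
  n_ok <- sum(!is.na(v))
  # Rank-based deciles 1..10 (NA values stay NA and are dropped by split)
  dec <- ceiling(rank(v, ties.method = "first", na.last = "keep") * 10 / n_ok)
  groups <- split(data, dec)
  out <- lapply(names(groups), function(d) {
    g <- groups[[d]]
    data.frame(
      dec = as.integer(d),
      n = sum(g[[target]]),              # number of churners in the decile
      N = nrow(g),                       # number of rows in the decile
      churn_percentage = sum(g[[target]]) / nrow(g),
      greaterthan = min(g[[col]], na.rm = TRUE),  # lower bound of the decile
      lessthan = max(g[[col]], na.rm = TRUE)      # upper bound of the decile
    )
  })
  do.call(rbind, out)
}

# Toy data standing in for sampletelecomfinal.csv (values are made up)
set.seed(123)
telecom <- data.frame(
  eqpdays = sample(1:1500, 200, replace = TRUE),
  mou = round(runif(200, 0, 900), 1),
  churn = rbinom(200, 1, 0.25)
)

# Profile every numeric column except the target, stored in a named list
num_cols <- setdiff(names(telecom)[sapply(telecom, is.numeric)], "churn")
Variable_profile <- lapply(num_cols, function(col) decile_profile(telecom, col))
names(Variable_profile) <- num_cols
print(Variable_profile[["eqpdays"]])
```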
Matlab's [n,mapx] = histc(x, bin_edged) returns the number of elements of x in each bin as n, and a map mapx of the same length as x that gives the index of the bin each element of x was placed into.
I can do the same thing in Julia as follows:
using StatsBase
x = rand(1000)
bin_e = 0:0.1:1
h = fit(Histogram, x, bin_e)
yx = map((z) -> findnext(z.<=h.edges[1],1),x) .- 1
Is this the "right way" to do this? It seems a bit kludgy.
Inspired by this python question you should be able to define a small function that delivers the desired mapping (modulo conventions):
binindices(edges, data) = searchsortedlast.(Ref(edges), data)
Note that the bin edges are sorted, so we can use searchsortedlast to get the last bin edge less than or equal to a data point. Broadcasting this over all of the data gives the mapping. The Ref(edges) tells broadcasting to treat edges as a scalar (that is, the full array is passed to each call).
Although conceptually identical to your solution, this approach is about 13x faster on my machine.
I filed an issue over at StatsBase.jl's github page suggesting to add this as a feature.
After looking through the code for Histogram.jl I found that they already included a function binindex. So this solution is probably the best:
x = 0:0.001:10
h1 = fit(Histogram,x,0:10,closed=:left)
xmap1 = StatsBase.binindex.(Ref(h1), x)
h2 = fit(Histogram,x,0:10,closed=:right)
xmap2 = StatsBase.binindex.(Ref(h2), x)
I stumbled across this question when I was trying to figure out how many occurrences of each value I had in a list of values. If each value is in its own bin (as for categorical data, or integer data with a small number of unique values), this is what one would be plotting in a histogram.
If that is what you want, then countmap() in the StatsBase package is just what you need.
I am trying to replace all values of r for which r<=10 with the value of the 1st observation in x (which is 1). This is just a very simplified example of what I am trying to do, so please do not question why I'm trying to do this in a complicated way because the full code is more complicated. The only thing I need help with is figuring out how to use the vector I created (p1) to replace r[p1] or equivalently r[c(1,2,3,4)] with x[ 1 ] (which is equal to 1). I can not write p1 explicitly because it will be generated in a loop (not shown in code).
x=c(1,2,3)
r=c(1,3,7,10,15)
assign(paste0("p", x[1]), which(r<=10))
p1
r[paste0("p", x[1])]=x[1]
In the code above, I tried using r[paste0("p", x[1])]=x[1], but instead of replacing the first four values this appends a new element named "p1" to r.
What I would like to get instead is r equal to 1 1 1 1 15.
Basically, I need to figure out a way to call p1 in this code r[??]=x[1] without explicitly typing p1.
I have included the full code I am attempting below in case context is needed.
##Creates a function to generate discrete random values from a specified pmf
##n is the number of random values you wish to generate
##x is a vector of discrete values (e.g. c(1,2,3))
##pmf is the associated pmf for the discrete values (e.g. c(.3,.2,.5))
r.dscrt.pmf=function(n,x,pmf){
set.seed(1)
##Generate n uniform random values from 0 to 1
r=runif(n)
high=0
low=0
for (i in 1:length(x)){
##High will establish the appropriate upper bound to consider
high=high+pmf[i]
if (i==1){
##Creates the variable p1 which contains the positions of all
##r less than or equal to the first value of pmf
assign(paste0("p", x[i]), which(r<=pmf[i]))
} else {
##Creates the variable p2,p3,p4,etc. which contains the positions of all
##r between the appropriate interval of high and low
assign(paste0("p", x[i]), which(r>low & r<=high))
}
##Low will establish the appropriate lower bound to consider
low=low+pmf[i]
}
for (i in 1:length(x)){
##Loops to replace the values of r at the positions stored in
##p1,p2,p3,etc. with x[1],x[2],x[3],etc. respectively.
r[paste0("p", x[i])]=x[i]
}
##Returns the new r
r
}
##Example call of the function
r.dscrt.pmf(10,c(0,1,3),c(.3,.2,.5))
get is like assign, in that it lets you refer to variables by string instead of name.
r[get(paste0("p", x[1]))]=x[1]
But needing get is usually a red flag that the code could be written in a much clearer and safer way.
Would this suit your needs?
ifelse(r<=10, x[1], r)
[1] 1 1 1 1 15
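A common alternative to assign() and get() is to keep the index vectors in a single list, so that position i is reached as p[[i]] with no name construction at all. The sketch below first redoes the small example and then rewrites the sampler along the same lines. It is an illustration under assumptions, not the asker's final code: the name r_dscrt_pmf, the use of cumsum() for the interval bounds, and writing the results into a separate vector out are my choices, and the pmf is assumed to sum to 1.

```r
x <- c(1, 2, 3)
r <- c(1, 3, 7, 10, 15)

# Keep the index vectors in one list instead of variables p1, p2, ...
p <- list()
p[[1]] <- which(r <= 10)
r[p[[1]]] <- x[1]
print(r)  # 1 1 1 1 15

# Hypothetical rewrite of the sampler using the same list idea
r_dscrt_pmf <- function(n, x, pmf) {
  u <- runif(n)                       # n uniform draws on (0, 1)
  upper <- cumsum(pmf)                # upper bound of each interval
  upper[length(upper)] <- 1           # guard against rounding in the last sum
  lower <- c(0, head(upper, -1))      # lower bound of each interval
  pos <- vector("list", length(x))
  for (i in seq_along(x)) {
    # positions of the draws that fall in the i-th interval
    pos[[i]] <- which(u > lower[i] & u <= upper[i])
  }
  out <- numeric(n)
  for (i in seq_along(x)) {
    out[pos[[i]]] <- x[i]             # replace by the i-th discrete value
  }
  out
}

set.seed(1)
draws <- r_dscrt_pmf(1000, c(0, 1, 3), c(.3, .2, .5))
print(table(draws) / length(draws))   # proportions should be near the pmf
```

Writing into out rather than back into the uniform draws keeps the original draws intact, and set.seed() is called outside the function so that repeated calls do not return the same sample.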
I have a very large data set that I have binned, and stored each bin (subset) as a list so that I can easily call any given subset. My problem is in calling for a specific column within a subset.
For example my data (which has diameters and strengths as the columns), is broken up into 20 bins, by diameter. I manually binned the data, like so:
subset.1 <- subset(mydata, Diameter <= 0.01)
Similar commands were used, to make 20 bins. Then I stored the names (subset.1 through subset.20) into a list:
diameter.bin<-list(subset.1, ... , subset.20)
I can successfully call each diameter bin using:
diameter.bin[x]
Now, if I only want to see the strength values for a given diameter bin, I can use the original name (that is stored in the list):
subset.x$Strength
But I cannot get this information using the list call:
diameter.bin[x]$Strength
This command returns NULL
Note that when I call any subset (either by diameter.bin[x], subset.x or even subset.x$Strength) my column headers do show up. When I use:
names(subset.1)
This returns "Diameter" and "Strength"
But when I use:
names(diameter.bin[1])
This returns NULL.
I'm assuming that the column header is part of the problem, but I'm not sure how to fix it, other than take the headers off of the original data file. I would prefer not to do this if at all possible.
The end goal is to look at the distribution of strength values for each diameter bin, so I will be doing things like drawing histograms, calculating parameters etc. I was hoping to do something along these lines to produce the histograms:
n=length(diameter.bin)
for(i in (1:n))
{
hist(diameter.bin[i]$Strength)
}
And do something similar to this to store median values for each bin in a new vector.
Any tips are greatly appreciated, as right now I'm doing it all 1 bin at a time, and I know a loop (or something similar) would really speed up my analysis.
You need two square brackets. Here is a reproducible example demonstrating the issue:
> diam <- data.frame(x=rnorm(5), y=rnorm(5))
>
> diam.l <- list(diam, diam)
> diam.l[1]$x
NULL
> diam.l[[1]]$x
[1] -0.5389441 -0.5155441 -1.2437108 -2.0044323 -0.6914124
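With double brackets the loop from the question can be sketched as follows. The data here are simulated, and building the bins with cut() and split() is an assumption made to keep the example short; a list made by hand from subset.1 through subset.20 behaves the same way.

```r
# Simulated stand-in for mydata (columns named as in the question)
set.seed(42)
mydata <- data.frame(
  Diameter = runif(200, 0, 0.2),
  Strength = rnorm(200, mean = 50, sd = 5)
)

# 20 diameter bins of width 0.01, stored as a list of data frames
diameter.bin <- split(mydata, cut(mydata$Diameter, breaks = seq(0, 0.2, by = 0.01)))

# Single brackets return a list of length one, so $Strength is NULL;
# double brackets return the data frame itself
is.null(diameter.bin[1]$Strength)     # TRUE
head(diameter.bin[[1]]$Strength)

# Median strength per bin, collected in one named vector
bin.medians <- sapply(diameter.bin, function(b) median(b$Strength))

# One histogram per bin; drop plot = FALSE to actually draw them
for (i in seq_along(diameter.bin)) {
  s <- diameter.bin[[i]]$Strength
  if (length(s) > 0) {
    h <- hist(s, plot = FALSE)
  }
}
```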