Finding values for a range of variables against the same constant - r

I am currently attempting to find the bias for a range of means previously calculated against the same constant. I have been using the code
b_all<-bias(1,c(x2:x6))
But it's only returning the bias of the first variable, x2. I'm sure there is a simple fix that I'm just not seeing. Thanks for the help.

Hard to say without any data to verify, but this could work:
b_all <- sapply(2:6, function(i){bias(1,get(paste0("x", i)))})
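For what it's worth, the original call fails because x2:x6 does not mean "the variables x2 through x6": the colon operator builds a numeric sequence running from the value of x2 to the value of x6, so bias() never sees the other objects. Below is a self-contained sketch of the same idea using mget() to collect the variables by name; the bias() definition here is a made-up stand-in (the thread never shows the real one), so swap in yours:
# hypothetical stand-in for the asker's bias(): mean deviation from the constant
bias <- function(constant, x) mean(x - constant)
# some made-up means x2 ... x6
x2 <- 1.2; x3 <- 0.9; x4 <- 1.5; x5 <- 1.1; x6 <- 0.8
# mget() fetches the variables by name; sapply() applies bias() to each
b_all <- sapply(mget(paste0("x", 2:6)), function(x) bias(1, x))
print(b_all)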


ArgumentError: quantiles are undefined in presence of NaNs or missing values

I would like to create a boxplot that contains some missing values in Julia. Here is some reproducible code:
using DataFrames
using StatsPlots
df = DataFrame(y = [1,2,3,2,1,2,4,NaN,NaN,2,1])
boxplot(df[!, "y"])
Output:
ArgumentError: quantiles are undefined in presence of NaNs or missing values
I know that the error happens because of the NaN values, but is there no option in boxplot to still plot the values instead of removing the missing values beforehand? I would assume that it might be designed in a way that works in the presence of missing values. In R it will still plot the boxplot, so I was wondering why in Julia you must remove these missing values, and what is an appropriate way to do this?
so I was wondering why in Julia you must remove these missing values
The general reason is a difference in design philosophy between R and Julia.
R was designed to be maximally convenient, at the risk of sometimes doing the incorrect thing. It tries to guess what you most likely want and does that. In this case, you most likely want NaN values to be ignored.
Julia is designed for safety and production use. If you have NaN in your data, it means your data preparation process had some serious issue (like dividing 0 by 0). In production scenarios you want your code to error in such cases, as otherwise it is hard to identify the root cause of the problem.
Now, seconding what Dan Getz commented: most likely your NaN values are actually missing (you refer to them as missing yourself). The two should not be mixed, as they have significantly different interpretations. NaN is a value that is undefined or unrepresentable, especially in floating-point arithmetic (e.g. 0 divided by 0), while missing is a value that is absent (e.g. we have not collected a measurement).
Still, even if your data contained missing you would get an error, for the same safety reason.
what is an appropriate way to do this?
NaNs are very rare in practice, so what Dan Getz recommended is a typical way to filter them out. Another would be [x for x in df.y if !isnan(x)].
If you had missing values in your data (as this is most likely what you want) you should write boxplot(skipmissing(df.y)).

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works when dividing the data into 2 groups, but when I try to adjust it for a 4-group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
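No fix was recorded in this thread, but one hedged guess based on the error message: KernSmooth::dpik is the bandwidth selector prodlim uses when it treats a covariate as continuous, and an integer-coded strength may be going down that path. If the four strengths are really categories, converting the column to a factor should force stratified (discrete) handling; untested against this data:
# hypothetical fix: make strength categorical so prodlim stratifies
# instead of kernel-smoothing over a "continuous" covariate
medpop$strength <- factor(medpop$strength)
N <- prodlim(Hist(dz_time, dz_status) ~ strength, data = medpop)
plot(N)  # one curve per strength level, as with med==0 / med==1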

Constraints on BsplinesComp

I am using BsplinesComp for a sample problem.
The objective is to maximize the area under the line.
My problem arises when I want to set a constraint on one of the values in the output array that the bspline gives, i.e. a value that the spline must pass through no matter what configuration it is in.
I tried this in two ways and I have uploaded the code. Both attempts are very badly coded, so I think there is a neater way to do this. Link to the code:
https://gist.github.com/stackoverflow38/5eae1e86c5802a4df91becdf580d28c5
1- Using an extra explicit component in which the middle array value is imposed to be a selected value
2- Tried to use an ExecComp, but I get an error: target shapes do not match.
I vaguely remember reading such a question but could not find it.
Overall I am trying to constrain either the first, middle, or last value of the bspline to some range it should lie in.
Similar to the plots here
So, I think you want to know the best way to do this, and the best way is to not use any extra components at all. You can directly constrain a single point in the output of the BsplinesComp by using the "indices" argument in the add_constraint call. Here, I constrain the first point in the spline to lie on the interval [-1, 1].
model.add_constraint('interp.h', lower=-1, upper=1, indices=[0])
Running the model gives me a shape that looks more like one of the ones you included.
Just for reference, for the errors you got with 1 and 2:
Not sure what is wrong here, but maybe the version you uploaded isn't the latest. You never used the AeraComp in a constraint, so it didn't do anything.
The exception was due to a size mismatch in connecting the vector output of the Bsplines comp to a scalar expression. You can do this by specifying "src_indices", giving it a list of which indices in the array to connect to the target:
model.connect('interp.h', 'execcomp.x', src_indices=[0])

Getting the values of some elements by having their differences

I am currently working on a particular algorithm, but I face a problem that I'm not sure how to resolve. I'd appreciate it if anyone could help me out.
There are some objects {O1, O2, O3, ...}, each of which has a value whose amount we don't know; call these values {v1, v2, v3, ...}. There is also another set of elements {w1, w2, w3, ...} which gives the differences between successive values, i.e. w1 = v2 - v1, w2 = v3 - v2, w3 = v4 - v3, and so on. I'm wondering if there is any way to get the values of v1, v2, v3, etc. without having the value of v1?
Looking forward to your replies,
Thanks.
Not in general. Knowing the differences between successive numbers in a list of numbers under-determines the set of numbers. This is particularly obvious in the case when w1 = w2 = w3 = ... = wk = 1. That would tell you that the vi are consecutive numbers, but nothing else could be inferred. You wouldn't be able to distinguish 3,4,5,6,7 from 10,11,12,13,14 (for example).
Having said that, it would of course be possible if you know one of the numbers, and the known number wouldn't need to be the first one. Knowing any single one of the numbers would suffice. Furthermore, knowing something like the sum of the vi would be sufficient since you could express the sum as a function of the unknown number v1 and solve the resulting equation.
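A small sketch of that last point in R (made-up numbers, not from the thread): once any single value, or the sum, is known, the rest follow from a cumulative sum of the differences.
# differences w_i = v_{i+1} - v_i for the hidden series 3,4,5,6,7
w <- c(1, 1, 1, 1)
# knowing any single value pins the series down; say we learn v1 = 3
v1 <- 3
v <- v1 + c(0, cumsum(w))   # reconstructs 3 4 5 6 7
# knowing only the sum S of the v_i also works:
# S = n*v1 + sum(c(0, cumsum(w))), so solve for v1
S <- 25
n <- length(w) + 1
v1_from_sum <- (S - sum(c(0, cumsum(w)))) / n   # 3 again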

Prediction with cpdist using "probabilities" as evidence

I have a very quick question, with an easy reproducible example, related to my work on prediction with bnlearn:
library(bnlearn)
# toy training set: a binary Cause and a numeric consequence Cons
Learning.set4=cbind(c("Yes","Yes","Yes","No","No","No"),c(9,10,8,3,2,1))
Learning.set4=as.data.frame(Learning.set4)
Learning.set4[,c(2)]=as.numeric(as.character(Learning.set4[,c(2)]))
colnames(Learning.set4)=c("Cause","Cons")
# empty graph over the two variables, then set the arc Cause -> Cons
b.network=empty.graph(colnames(Learning.set4))
struct.mat=matrix(0,2,2)
colnames(struct.mat)=colnames(Learning.set4)
rownames(struct.mat)=colnames(struct.mat)
struct.mat[1,2]=1
bnlearn::amat(b.network)=struct.mat
# fit the parameters of the network
haha=bn.fit(b.network,Learning.set4)
#Some predictions with "lw" method
#Here is the approach I know with a SET particular modality.
#(So it's happening with certainty, here for example I know Cause is "Yes")
classic_prediction=cpdist(haha,nodes="Cons",evidence=list("Cause"="Yes"),method="lw")
print(mean(classic_prediction[,c(1)]))
#What if I wanted to predict the value of Cons when Cause has a 60% chance of being Yes and a 40% chance of being No?
#I decided to do this, according to the help.
#I could also make a function that generates "Yes" or "No" with the proper probabilities.
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
Here is what the help says:
"In the case of a discrete or ordinal node, two or more values can also be provided. In that case, the value for that node will be sampled with uniform probability from the set of specified values"
When I predict the value of a variable using categorical variables, so far I have just used a particular modality of the variable, as in the first prediction in the example. (Having the evidence set at "Yes" gets Cons to take a high value.)
But if I wanted to predict Cons without knowing the exact modality of the variable Cause with certainty, could I use what I did in the second prediction (just knowing the probabilities)?
Is this an elegant way to do it, or are there better implemented ones I don't know of?
I got in touch with the creator of the package, and I will paste his answer related to the question here:
The call to cpdist() is wrong:
prediction_idea=cpdist(haha,nodes="Cons",evidence=list("Cause"=c("Yes","Yes","Yes","No","No")),method="lw")
print(mean(prediction_idea[,c(1)]))
A query with the 40%-60% soft evidence requires you to place these new probabilities in the network first:
haha$Cause = c(0.40, 0.60)
and then run the query without an evidence argument. (Because you do not have any hard evidence, really, just a different probability distribution for Cause.)
Here is the code that lets me do what I wanted, starting from the fitted network in the example:
change=haha$Cause$prob
change[1]=0.4  # P(Cause = "No"); factor levels are sorted, so "No" comes first
change[2]=0.6  # P(Cause = "Yes")
haha$Cause=change
# evidence=TRUE means no hard evidence; the new marginal on Cause does the work
new_prediction=cpdist(haha,nodes="Cons",evidence=TRUE,method="lw")
print(mean(new_prediction[,c(1)]))
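As a quick sanity check (my addition, not part of the thread): under likelihood weighting, the soft-evidence mean should come out close to the probability-weighted mix of the two hard-evidence means, so something like the following should roughly agree with the result above:
# conditional means under each hard setting of Cause
yes_pred <- cpdist(haha, nodes = "Cons", evidence = list(Cause = "Yes"), method = "lw")
no_pred <- cpdist(haha, nodes = "Cons", evidence = list(Cause = "No"), method = "lw")
# 60/40 mixture of the two conditional means
print(0.6 * mean(yes_pred[, 1]) + 0.4 * mean(no_pred[, 1]))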
