Mean value for different groups - r

I am stuck with a 'for' loop and would greatly appreciate some help.
I have a data frame called 'df' that includes the number of people per household (household_size), ranging from 0 (I replaced the missing values with 0) to 8, as well as the number of cars (car).
My aim is to write a short piece of code that computes the average number of cars for each household size.
I tried the following:
avg <- function(df){
i <- df$household_size
for (i in 0 : 8){
print(mean(df$car))
}
}
I'm pretty sure I'm missing something really basic here, but I don't know what.
Thanks everyone for your input.
I wouldn't have used a function for this. However, this is an exercise as part of an introductory coding with R module that specifically requires a for-loop.

Here is a solution that prints the mean for each household size using a for loop. Let me know if it works.
for(i in unique(df$household_size)){
print(paste(i,' : ',mean(df[df$household_size%in%i,"car"])))
}
As mentioned in a comment, I removed the function wrapper because I don't see the point of having it. But if it's mandatory, you can use lapply, which in my view behaves a bit like a for loop:
lapply(unique(df$household_size), function(i){
return(paste(i,' : ',mean(df[df$household_size%in%i,"car"])))
}
)
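Since the exercise asks for a for loop inside a function, here is a minimal sketch of that version. The original attempt prints the same number nine times because mean(df$car) never uses i; the loop has to subset the rows for the current size first. The sample data below is made up for illustration; only the column names household_size and car come from the question.

```r
# Hypothetical data; only the column names come from the question
df <- data.frame(
  household_size = c(0, 1, 1, 2, 2, 2),
  car            = c(0, 1, 0, 2, 1, 3)
)

avg <- function(df) {
  sizes <- sort(unique(df$household_size))
  result <- numeric(length(sizes))
  names(result) <- sizes
  for (k in seq_along(sizes)) {
    # Keep only the rows belonging to the current household size
    in_group <- df$household_size == sizes[k]
    result[k] <- mean(df$car[in_group])
  }
  result
}

means <- avg(df)
print(means)
```

Outside of the exercise, tapply(df$car, df$household_size, mean) should give the same numbers without an explicit loop.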

Related

Matrix help: Finding average without the zeros

I'm creating a Monte Carlo model using R. My model creates matrices that are filled with either zeros or values that fall within the constraints. I'm running a couple hundred thousand n values through my model, and I want to find the average of the nonzero entries of the matrices that I've created. I'm guessing I can do something in the last section.
Thanks for the help!
Code:
n<-252500
PaidLoss_1<-numeric(n)
PaidLoss_2<-numeric(n)
PaidLoss_3<-numeric(n)
PaidLoss_4<-numeric(n)
PaidLoss_5<-numeric(n)
PaidLoss_6<-numeric(n)
PaidLoss_7<-numeric(n)
PaidLoss_8<-numeric(n)
PaidLoss_9<-numeric(n)
for(i in 1:n){
claim_type<-rmultinom(1,1,c(0.00166439057698873, 0.000810856947763742, 0.00183509730283373, 0.000725503584841243, 0.00405428473881871, 0.00725503584841243, 0.0100290201433936, 0.00529190850119495, 0.0103277569136224, 0.0096449300102424, 0.00375554796858996, 0.00806589279617617, 0.00776715602594742, 0.000768180266302492, 0.00405428473881871, 0.00226186411744623, 0.00354216456128371, 0.00277398429498122, 0.000682826903379993))
claim_type<-which(claim_type==1)
claim_Amanda<-runif(1, min=34115, max=2158707.51)
claim_Bob<-runif(1, min=16443, max=413150.50)
claim_Claire<-runif(1, min=30607.50, max=1341330.97)
claim_Doug<-runif(1, min=17554.20, max=969871)
if(claim_type==1){PaidLoss_1[i]<-1*claim_Amanda}
if(claim_type==2){PaidLoss_2[i]<-0*claim_Amanda}
if(claim_type==3){PaidLoss_3[i]<-1* claim_Bob}
if(claim_type==4){PaidLoss_4[i]<-0* claim_Bob}
if(claim_type==5){PaidLoss_5[i]<-1* claim_Claire}
if(claim_type==6){PaidLoss_6[i]<-0* claim_Claire}
}
PaidLoss1<-sum(PaidLoss_1)/2525
PaidLoss3<-sum(PaidLoss_3)/2525
PaidLoss5<-sum(PaidLoss_5)/2525
PaidLoss7<-sum(PaidLoss_7)/2525
partial output of my numeric matrix
First, let me make sure I've wrapped my head around what you want to do: you have several columns -- in your example, PaidLoss_1, ..., PaidLoss_9, which have many entries. Some of these entries are 0, and you'd like to take the average (within each column) of the entries that are not zero. Did I get that right?
If so:
Comment 1: At the very end of your code, you might want to avoid using sum and dividing by a number to get the mean you want. It obviously works, but it opens you up to a risk: if you ever change the value of n at the top, then in the best case scenario you have to edit several lines down below, and in the worst case scenario you forget to do that. So, I'd suggest something more like mean(PaidLoss_1) to get your mean.
Right now, you have n as 252500, and your denominator at the end is 2525, which has the effect of inflating your mean by a factor of 100. Maybe that's what you wanted; if so, I'd recommend mean(PaidLoss_1) * 100 for the same reasons as above.
Comment 2: You can do what you want via subsetting. Take a smaller example as a demonstration:
test <- c(10, 0, 10, 0, 10, 0)
mean(test) # gives 5
test!=0 # a vector of TRUE/FALSE for which are nonzero
test[test!=0] # the subset of test which we found to be nonzero
mean(test[test!=0]) # gives 10, the average of the nonzero entries
The middle three lines are just for demonstration; the only necessary lines to do what you want are the first (to declare the vector) and the last (to get the mean). So your code should be something like PaidLoss1 <- mean(PaidLoss_1[PaidLoss_1 != 0]), or perhaps that times 100.
Comment 3: You might consider organizing your data into a single structure. Instead of typing PaidLoss_1, PaidLoss_2, etc., it might make sense to put all the PaidLoss values into a matrix with one column per claim type. You could then access elements of the matrix with [ , ] indexing. This would clean up some of the code and save you from typing the same thing repeatedly; you could also use the apply() family of functions to run the same command (such as the mean) on every column. A data frame or another container would work too; the point is that having some structure would make your life easier.
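As a rough sketch of Comments 2 and 3 combined, the following uses a small hand-written matrix in place of the simulated PaidLoss columns (the values and the names paid and nonzero_mean are made up for illustration) and takes the mean of the nonzero entries in each column:

```r
# Hypothetical 4 x 3 matrix standing in for the PaidLoss_* vectors;
# matrix() fills column by column
paid <- matrix(c(10, 0, 30, 0,
                  0, 0,  5, 0,
                  0, 0,  0, 0), nrow = 4)

# Mean of the nonzero entries of one column
nonzero_mean <- function(col) mean(col[col != 0])

# Apply it to every column (MARGIN = 2)
col_means <- apply(paid, 2, nonzero_mean)
print(col_means)
```

Note that a column containing only zeros yields NaN, because the mean of an empty vector is undefined; whether that should be reported as 0 or left as missing depends on the model.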
(And to be super clear, your code is exactly what my code looked like when I first started writing in R. You can decide if it's worth pursuing some of that optimization; it probably just depends how much time you plan to eventually spend in R.)

error in function - argument is a length of zero in R-studio

deadcheck<-function(a,t){ #function to check if dead for specific age at a time age sending to function
roe<-which( birthmort$age[i]==fertmortc$min & fertmortc$max) #checks row in fertmortc(hart) to pick an age that meets min and max age requirements I think this could be wrong...
prob<-1-(((1-fertmortc$mortality[roe])^(1/365))^t) #finds the prob for the row that meets the above requirements
if(runif(1,0,1)<=prob) {d<-TRUE} else {d<-FALSE} #I have a row that has the probability of death every 7 days.
return(d) #outputs if dead
}
Background: I am creating an agent-based model, stored as a population in a data frame, that simulates how tuberculosis spreads in a population. (I know that there are probably 10000 better ways of doing this.) So far I have created a loop that populates my data frame with people, ages, etc. I am now trying to create a function that looks up the probability of death per year in a chart, based on an age bracket: 0-5, 5-10, 10-15, etc. (I have math in there because I want it to check who lives, dies, or has babies every 7 days.) I have a similar function that checks who is pregnant, and it works. However, I can't for the life of me figure out why this function is not working. I keep getting the following error.
Error in if (runif(1, 0, 1) <= prob) { : argument is of length zero
I am unsure how to fix this.
I apologize in advance if this is a dumb question; I have been trying to teach myself to code over the last 4-5 months. If I asked this question in the wrong format or incorrectly, please let me know how to do so correctly.
The value of prob has length zero, so if() has nothing to test. It means something like
prob = NULL
(or an empty vector such as numeric(0)) in this case. Try altering your code to add
print(prob)
so you can check the intermediate result.
As you suspected in your comments, the expression
birthmort$age[i]==fertmortc$min & fertmortc$max
is problematic. What this does is evaluate the comparison birthmort$age[i]==fertmortc$min, and then takes the result of that comparison and combines it with fertmortc$max using the and operator. This involves forming the and of a Boolean value and an integer, which is unlikely to make much sense.
Just guessing, you perhaps want:
birthmort$age[i] >= fertmortc$min & birthmort$age[i] <= fertmortc$max
I don't know if this will fix your problem -- you haven't given enough to test it. For optimal help, you should give a reproducible example. See this for how to do so in R
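To illustrate how the zero-length value can arise: when no row satisfies the condition, which() returns an empty vector, so the lookup and prob are empty too. The sketch below uses a made-up fertmortc chart (only the column names min, max and mortality come from the question), passes the age in as the argument a instead of reading birthmort$age[i], and assumes half-open brackets [min, max) so that an age on a boundary, such as 5, matches exactly one row. These choices are assumptions, not the asker's actual setup.

```r
# Hypothetical mortality chart; only the column names come from the question
fertmortc <- data.frame(
  min       = c(0, 5, 10),
  max       = c(5, 10, 15),
  mortality = c(0.02, 0.01, 0.005)
)

deadcheck <- function(a, t) {
  # Assumed half-open brackets [min, max): each age matches at most one row
  roe <- which(a >= fertmortc$min & a < fertmortc$max)
  # No bracket covers this age: return NA instead of failing inside if()
  if (length(roe) == 0) return(NA)
  # Yearly mortality converted to a probability over t days
  prob <- 1 - (1 - fertmortc$mortality[roe])^(t / 365)
  runif(1) <= prob
}

empty_lookup <- fertmortc$mortality[which(99 >= fertmortc$min & 99 < fertmortc$max)]
length(empty_lookup)  # length zero: this is what breaks the if() in the question
deadcheck(3, 7)       # TRUE or FALSE, at random
deadcheck(99, 7)      # NA, because no bracket matches
```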

Getting a for loop to ignore missing values - R

I've got this for loop:
for(i in 1:length(class.data$ID)) {
class.data$FinalExam_GroupMCScore[i]=mc.data$PSYC.260.Exam....2017.3.[which(mc.data$SIS.User.ID == class.data$FinalExam_MCGroupNumber[i])]
}
I'm using it to merge two class grade files. Students did part of their final exam in groups. The problem I'm having is that not everyone opted to do the group portion, so those students are missing a code for class.data$FinalExam_MCGroupNumber. The for loop gets hung up on these missing values and I can't get past them. I suspect I need an if statement embedded in there, but I'm not familiar enough with R yet to write one.
I've looked at some of the other posts on this, but they don't help because I'm having a tough time seeing how to embed an if or ifelse with a more complicated expression following it. Any help would be appreciated! I just want it to assign NA to FinalExam_GroupMCScore for all students with NA in FinalExam_MCGroupNumber and carry on as normal!
Thank you!!
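One possible approach, sketched under assumptions: the loop likely stops because which() returns an empty vector when the group number is NA, and assigning a zero-length value to a single element is an error. match() performs the same lookup for all students at once and returns NA where there is nothing to match. The miniature data frames below are made up, and the column score is a hypothetical stand-in for the long exam column name; the sketch also assumes each group ID appears only once in mc.data.

```r
# Hypothetical miniature versions of the two grade files
class.data <- data.frame(
  ID = 1:4,
  FinalExam_MCGroupNumber = c(101, NA, 102, 101)
)
mc.data <- data.frame(
  SIS.User.ID = c(101, 102),
  score       = c(18, 15)   # stand-in for the exam score column
)

# Row of mc.data for each student's group; NA where the group number is NA
idx <- match(class.data$FinalExam_MCGroupNumber, mc.data$SIS.User.ID)

# Indexing with NA yields NA, so students without a group get NA scores
class.data$FinalExam_GroupMCScore <- mc.data$score[idx]
print(class.data)
```

If the loop has to stay, wrapping the assignment in if (!is.na(class.data$FinalExam_MCGroupNumber[i])) { ... } and initializing the new column to NA beforehand should have the same effect.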

How to use apply with a function that required 2 parameters

I looked at the existing posts but could not get a clear answer... I have a data frame and I would like to modify each value using a calculation that takes into account the min and max of its row.
I would like to use apply together with a function:
sc=function(x,seg) {(x-seg[2])*100/(seg[1]-seg[2])}
or
sc=function(x,a,b) {(x-b)*100/(a-b)}
where x is a row of the data frame and seg=c(a,b) is calculated as follows:
d=dim(data) ## data is my dataframe
for (i in (1:d[1])) ## the calculation has to be done for each line, according
## the min and max of the specific line
{
seg=c(max(data[i,]),min(data[i,]))
data[i,]=apply(data[i,],1,sc)
return(data)
}
This does not work, obviously, because I do not know how to tell apply that it needs to take more than one parameter into account...
There is probably an R function that does this specific calculation, but since I am an R beginner, I would really like to understand how to write such code myself.
Thanks for the help!
Stéphane
Update:
Here is what I found for a solution, but it does not sound completely logical to me...
for (i in (1:d[1])) {
t=apply(data,2,sc,seg=range(data[i,]))
data[i,]=t[i,] }
The third parameter you pass to apply should be a function. Also, there's no reason to loop when you use apply.
apply(data,1,function(x) c(min(x), max(x)))
will return a 2-row matrix with the min and max values for each row of data. There is also a built-in function, range, that returns the min and max:
apply(data,1,range)
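To connect this back to the rescaling in the question, here is a small sketch. The data frame is made up, and sc is rewritten with named arguments lo and hi (hypothetical names) instead of seg. Extra arguments given to apply after the function are passed on to it, and an anonymous function can compute per-row values such as the min and max on the fly.

```r
# Hypothetical data frame with two rows
data <- data.frame(a = c(1, 10), b = c(3, 20), c = c(5, 40))

# Rescale x to 0-100 given a lower and an upper bound
sc <- function(x, lo, hi) (x - lo) * 100 / (hi - lo)

# Fixed bounds: extra arguments after FUN are forwarded to sc
fixed <- t(apply(data, 1, sc, lo = 0, hi = 50))

# Per-row bounds: compute min and max inside an anonymous function
scaled <- t(apply(data, 1, function(row) sc(row, lo = min(row), hi = max(row))))
print(scaled)
```

apply returns one column per processed row, hence the t() to get back the original orientation. A row whose values are all equal would divide by zero and produce NaN.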

Cumulative sum for n rows

I have been trying to write a command in R that produces a new vector in which each element is the sum of 25 rows of an existing vector.
I've tried writing a function to do this, which gives me a result for one data point.
Here is where I have got to. I realise this is probably a fairly basic question, but it is one I have been struggling with, and any help would be greatly appreciated:
example<-c(1;200)
fun.1<-function(x)
{sum(x[1:25])}
checklist<-sapply(check,FUN=fun.1)
This then supplies me with a vector of length 200 where all values are NA.
Can anybody help at all?
Your example has a few errors: c(1;200) is not valid R (you probably want 1:200, or something built with rep if you want a list of vectors), and there is no check variable (it should have been example).
Here is the code that I think you probably need (as far as I was able to understand the question):
x <- rep(list(1:200), 5)
f <- function(y) {y[1:20]}
sapply(x, f)
Next time, please be more specific and run the code you post as an example before submitting a question.
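For reference, the NA values in the question arise because sapply calls the function once per element, so x is a single number and x[1:25] is that number followed by 24 NAs. Below is a sketch of two possible readings of "sum of 25 rows", assuming the input is the vector 1:200: sums over consecutive blocks of 25, and a rolling sum over a sliding window of 25. The variable names are made up.

```r
x <- 1:200

# Reading 1: consecutive blocks of 25 (needs length(x) to be a multiple of 25).
# Each matrix column holds one block, so colSums gives one sum per block.
block_sums <- colSums(matrix(x, nrow = 25))

# Reading 2: rolling sum over a window of 25, computed from cumulative sums.
# Element j is sum(x[j:(j + 24)]).
cs <- cumsum(x)
n <- length(x)
rolling_sums <- cs[25:n] - c(0, cs[1:(n - 25)])

length(block_sums)    # 8 blocks
length(rolling_sums)  # 176 windows
```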

Resources