How do I perform operations on variables omitting missing values - r

I'm sure this question will seem a little bit basic for many but here is my problem:
I want to create a new variable which is an equation of other variables in RStudio:
D$satisfaction.conditions <- (D$imp.distance * D$sat.distance
+ D$imp.salaire * D$sat.salaire
+ D$imp.horaires * D$sat.horaires
+ D$imp.chargetravail * D$sat.chargetravail
+ D$imp.nbservice * D$sat.nbservice
+ D$imp.locaux * D$sat.locaux
+ D$imp.equipements * D$sat.equipements
+ D$imp.ambiance * D$sat.ambiance
+ D$imp.relationcollegues * D$sat.relationcollegues
+ D$imp.stress * D$sat.stress)
The issue is that some of the variables in the equation have missing values, so I get an NA result for some observations.
I know that there is something to do with na.rm=TRUE but I can't find where to put it. I tried at the end but I get a
Error: unexpected symbol in:
" + D$imp.relationcollegues * D$sat.relationcollegues
+ D$imp.stress * D$sat.stress) na.rm"
How can I get my new variable {satisfaction.conditions} omitting NA values?

Add this line of code before performing the calculation:
D[is.na(D)] <- 0
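This replaces every NA in D with 0, so a missing value contributes nothing to the sum. Note that it overwrites the missing values in the whole data frame, not only for this calculation.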
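If overwriting the data is not wanted, one possible alternative is to multiply the importance and satisfaction columns as matrices and let rowSums() drop the missing products, since rowSums() is where na.rm = TRUE can actually go. The sketch below uses made-up toy data with only two of the ten items; the `items` vector and the values are assumptions for illustration.

```r
# Toy data (made up): two of the ten items from the question
D <- data.frame(
  imp.distance = c(1, 2, NA),
  sat.distance = c(3, 4, 5),
  imp.salaire  = c(2, NA, 1),
  sat.salaire  = c(1, 2, 3)
)

# Hypothetical helper vector: the shared suffixes of the imp.* / sat.* columns
items <- c("distance", "salaire")

imp <- as.matrix(D[paste0("imp.", items)])
sat <- as.matrix(D[paste0("sat.", items)])

# Element-wise products, then a row-wise sum that skips NA products
D$satisfaction.conditions <- rowSums(imp * sat, na.rm = TRUE)
```

With this approach the original NAs stay in D. A row in which every product is NA gets 0, the same result the zero-filling approach would give.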

Related

Given an imported dataset, how do I assign the value of 0 to a new column if a value of a previous column = 1? On R

I'm a beginner to R so please excuse me if this is an overly simple question.
With n-back data, I'm trying to calculate the bias of the one-back trials.
I imported a dataset from Excel to R successfully and everything works except for the line
if (df$one_back_aprime == 1) {
df$one_back_bias <- 0
}
in the following code:
library(readxl)
nback_mornings_data <- read_excel("NFS4and5_mornings_R.xlsx", sheet = 1) #replace with file name
df <- data.frame(nback_mornings_data)
if (df$one_back_hit_rate > df$one_back_fa_rate) {
df$one_back_aprime <- 0.5 + ((df$one_back_hit_rate - df$one_back_fa_rate) * (1 + df$one_back_hit_rate - df$one_back_fa_rate))/(4 * df$one_back_hit_rate * (1 - df$one_back_fa_rate))
} else if (df$one_back_hit_rate < df$one_back_fa_rate) {
df$one_back_aprime <- 0.5 + ((df$one_back_fa_rate - df$one_back_hit_rate) * (1 + df$one_back_fa_rate - df$one_back_hit_rate))/(4 * df$one_back_fa_rate * (1 - df$one_back_hit_rate))
}
if (df$one_back_aprime == 1) {
df$one_back_bias <- 0
} else {
df$one_back_bias <- (((1 - df$one_back_hit_rate) * (1 - df$one_back_fa_rate)) - (df$one_back_hit_rate * df$one_back_fa_rate)) / (((1 - df$one_back_hit_rate) * (1 - df$one_back_fa_rate)) + (df$one_back_hit_rate * df$one_back_fa_rate))
}
When I run the code, everything works, except when the one_back_aprime == 1 , it prints NaN (because that's what the equation at else churns out).
However, I'm confused as to why this is because I already put that it should be assigned the value of 0. So clearly I'm doing that wrong.
Can anyone please help to change that portion of the code so that when one_back_aprime = 1, the new one_back_bias column will show the value of 0? Note: I've tried print("0") but that doesn't work either.
Any help would be much appreciated!
I'm sure I'm missing something very basic, but again, I'm just a beginner.
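A likely cause is that if () expects a single TRUE or FALSE, while df$one_back_aprime == 1 is a whole column of them. Older versions of R used only the first element (with a warning), and R 4.2.0 and later raise an error. A minimal sketch of a row-wise version with ifelse() follows; the toy values are made up, and leaving A' as NA when the two rates are equal is an assumption, since the question's code does not cover that case.

```r
# Toy data standing in for the imported spreadsheet (values are made up)
df <- data.frame(
  one_back_hit_rate = c(1.0, 0.9),
  one_back_fa_rate  = c(0.0, 0.2)
)
H  <- df$one_back_hit_rate
FA <- df$one_back_fa_rate

# A' per row; rows with H == FA are left as NA (assumption, see lead-in)
df$one_back_aprime <- ifelse(
  H > FA,
  0.5 + ((H - FA) * (1 + H - FA)) / (4 * H * (1 - FA)),
  ifelse(H < FA,
         0.5 + ((FA - H) * (1 + FA - H)) / (4 * FA * (1 - H)),
         NA_real_)
)

# Bias per row; the formula yields NaN where its denominator is 0
bias_raw <- ((1 - H) * (1 - FA) - H * FA) / ((1 - H) * (1 - FA) + H * FA)

# Pick 0 wherever A' is exactly 1, otherwise the computed bias
df$one_back_bias <- ifelse(df$one_back_aprime == 1, 0, bias_raw)
```

ifelse() computes both candidate vectors in full and then chooses element by element, so the NaN in the first row is simply never selected.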

Question about the Division operator in R not returning the correct value

I am trying to calculate Bayes' Theorem for cancer and tried to plug the correct values into my formula like this:
cancer <- (1 * (1/100000)) / (1*(1/100000)) + ((10/99999) * (99999/100000))
In this case, cancer = 1.0001
However, the correct answer should be: 0.09090909, as proven by running the code separately, like this:
num = (1 * (1/100000))
den = (1*(1/100000)) + ((10/99999) * (99999/100000))
num / den
> 0.09090909
Can you please let me know why this is the case and how I should run the combined equation in the future to get the proper result?
Parentheses are needed around the whole denominator, because / binds more tightly than +:
cancer <- (1 * (1/100000)) / ((1*(1/100000)) + ((10/99999) * (99999/100000)))
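The difference can be seen by naming the pieces first. This is only a small sketch; the variable names are hypothetical and not from the question.

```r
# Hypothetical names for the pieces of the formula in the question
num       <- 1 * (1 / 100000)
false_pos <- (10 / 99999) * (99999 / 100000)

# Without parentheses R divides first, then adds: (num / num) + false_pos
wrong <- num / num + false_pos

# With the denominator grouped explicitly
right <- num / (num + false_pos)

print(c(wrong = wrong, right = right))
```

Here `wrong` reproduces the 1.0001 from the question and `right` is 1/11, about 0.0909.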

R How to apply an equation to each row of 4 columns where each column is a parameter in the equation?

Below is a 3*4 matrix, where 2 columns represent the lat/lon coordinates of one location and the other two are coordinates of a second location. I'm trying to apply the great circle distance formula to each row. I'm pretty sure I should use something in the apply family, but can't figure out how.
d=as.data.frame(split(as.data.frame(c( 33.43527 ,-112.01194 , 37.72139 , -122.22111, -3.78444 , -73.30833 , -12.02667 , -77.12278,37.43555,38.88333,40.97667,28.81528)* pi/180),1:4))
colnames(d)=c('lat','lon','lat2','lon2')
This is the equation I would like to be applied to each of the 3 rows:
sum(acos(sin(lat) * sin(lat2) + cos(lat) * cos(lat2) * cos(lon2 -lon)) * 6371)*0.62137
The lat, lon, lat2, lon2 represent the column names in matrix d.
The final vector would look like this:
answer= 645.0978, 626.3632, 591.4725
Any help would be much appreciated.
You can use mapply and pass all four columns as arguments to the function:
mapply(function(lat,lon,lat2,lon2)sum(acos(sin(lat) * sin(lat2) +
cos(lat) * cos(lat2) * cos(lon2 -lon)) * 6371)*0.62137,
d[,"lat"],d[,"lon"],d[,"lat2"],d[,"lon2"])
#Result: With updated data
#[1] 645.0978 626.3632 591.4725
We subset the columns of 'd' with [ (as it is a matrix - for data.frame, $ can also work), and then do the arithmetic
(acos(sin(d[,"lat"]) * sin(d[,"lat2"]) +
cos(d[,"lat"]) * cos(d[,"lat2"]) *
cos(d[,"lon2"] -d[,"lon"])) * 6371)*0.62137
#[1] 3153.471 10892.893 6324.854
This can also be done in a loop with apply
apply(d, 1, function(x) (acos(sin(x[1]) * sin(x[3]) +
cos(x[1]) * cos(x[3]) * cos(x[4] - x[2])) * 6371)* 0.62137)
#[1] 3153.471 10892.893 6324.854
The with function would allow you to use the expression:
(acos(sin(lat) * sin(lat2) + cos(lat) * cos(lat2) * cos(lon2 -lon)) * 6371)*0.62137
but you would need to convert the matrix d to a data frame:
with(data.frame(d), ( acos( sin(lat) * sin(lat2) +
cos(lat) * cos(lat2) * cos(lon2 -lon) ) * 6371) *
0.62137
)
[1] 3153.471 10892.893 6324.854
sum should not be used here: +, sin, cos, and acos are all vectorized and return one value per row, whereas sum collapses its input to a single number. I've tried to rearrange the indentation so the terms are easier to recognize.
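The vectorized approach can also be wrapped in a small function. The sketch below is one way to do it, assuming the question's data is laid out as three rows of lat, lon, lat2, lon2 in degrees; the function name great_circle_miles and its radius_km argument are hypothetical.

```r
# Hypothetical helper: spherical law of cosines, inputs in radians
great_circle_miles <- function(lat, lon, lat2, lon2, radius_km = 6371) {
  central_angle <- acos(sin(lat) * sin(lat2) +
                        cos(lat) * cos(lat2) * cos(lon2 - lon))
  central_angle * radius_km * 0.62137  # km to miles, factor from the question
}

# The question's coordinates in degrees, one row per pair of locations
deg <- data.frame(
  lat  = c(33.43527, -3.78444, 37.43555),
  lon  = c(-112.01194, -73.30833, 38.88333),
  lat2 = c(37.72139, -12.02667, 40.97667),
  lon2 = c(-122.22111, -77.12278, 28.81528)
)
d <- deg * pi / 180  # convert every column to radians

# Every operation is vectorized, so one call handles all rows
miles <- with(d, great_circle_miles(lat, lon, lat2, lon2))
print(miles)
```

For this data the question expects 645.0978, 626.3632 and 591.4725. No apply-family call is needed because each argument is already a whole column.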

Why is the width.cutoff argument in the deparse function limited to 500 bytes?

Why is the width.cutoff argument in deparse limited to 500 bytes?
Consider the following reproducible example:
a <- substitute(12345.6789 * x0 + 123523.623529 * x1 + 1235235.6734636 * x2
+ 657567.6756756 * x3 + 756765.23523 * x4 + 54645.65464 * x5)
deparse(a)
[1] "12345.6789 * x0 + 123523.623529 * x1 + 1235235.6734636 * x2 + "
[2] " 657567.6756756 * x3 + 756765.23523 * x4 + 54645.65464 * x5"
The default value of width.cutoff is 60, meaning that the function will attempt to split the character string into 60 character chunks. If you set the argument to an integer above 500, you get the following warning and the default is used instead:
deparse(a, width.cutoff = 501)
[1] "12345.6789 * x0 + 123523.623529 * x1 + 1235235.6734636 * x2 + "
[2] " 657567.6756756 * x3 + 756765.23523 * x4 + 54645.65464 * x5"
Warning message:
In deparse(a, width.cutoff = 501) :
invalid 'cutoff' value for 'deparse', using default
I could understand if the default was set to 60 because of some concern about storing a massive character string in memory. However, I don't understand why I am not allowed to set this argument to 1000 or 10000. Why is there an upper bound on the number of characters that deparse can return in a single string?
R can clearly create a string with more than 500 characters.
nchar(paste(rep('a', 1000), collapse = ''))
[1] 1000
I tried to understand what was going on by going to the source code on github. I found this hilarious comment
The previous issue with the global "cutoff" variable is now implemented
by creating a deparse1WithCutoff() routine which takes the cutoff from
the caller and passes this to the different routines as a member of the
LocalParseData struct. Access to the deparse1() routine remains unaltered.
This is exactly as Ross had suggested ...
One possible fix is to restructure the code with another function which
takes a cutoff value as a parameter. Then "do_deparse" and "deparse1"
could each call this deeper function with the appropriate argument.
I wonder why I didn't just do this? -- it would have been quicker than
writing this note. I guess it needs a bit more thought ...
Before I dive deeper into the source code, can someone explain to me why there is an upper bound on this argument? Thanks.
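This does not explain the limit, but when the goal is simply one string, a common workaround is to deparse at the largest accepted cutoff and paste the pieces together. A minimal sketch follows, using quote() instead of substitute() for the same expression. R 4.0.0 and later also provide deparse1(), which wraps the same idea.

```r
a <- quote(12345.6789 * x0 + 123523.623529 * x1 + 1235235.6734636 * x2
           + 657567.6756756 * x3 + 756765.23523 * x4 + 54645.65464 * x5)

# 500 is the largest cutoff deparse() accepts; longer calls still come
# back as several strings, so collapse them into one
pieces   <- deparse(a, width.cutoff = 500L)
one_line <- paste(pieces, collapse = " ")

nchar(one_line)
```

The collapsed string can be turned back into a call with parse(text = one_line)[[1]].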

Get derivative in R

I'm trying to take the derivative of an expression:
x = read.csv("export.csv", header=F)$V1
f = expression(-7645/2* log(pi) - 1/2 * sum(log(w+a*x[1:7644]^2)) + (x[2:7645]^2/(w + a*x[1:7644]^2)),'a')
D(f,'a')
x is simply an integer vector, a and w are the variables I'm trying to find by deriving. However, I get the error
"Function '[' is not in Table of Derivatives"
Since this is my first time using R I'm rather clueless what to do now. I'm assuming R has got some problem with my sum function inside of the expression?
After following the advice I now did the following:
y <- x[1:7644]
z <- x[2:7645]
f = expression(-7645/2* log(pi) - 1/2 * sum(log(w+a*y^2)) + (z^2/(w + a*y^2)),'a')
Deriving this gives me the error "sum is not in the table of derivatives". How can I make sure the expression considers each value of y and z?
Another Update:
y <- x[1:7644]
z <- x[2:7645]
f = expression(-7645/2* log(pi) - 1/2 * log(w+a*y^2) + (z^2/(w + a*y^2)))
d = D(f,'a')
uniroot(eval(d),c(0,1000))
I've eliminated the "sum" function and just entered y and z. Now, 2 questions:
a) How can I be sure that this is still the expected behaviour?
b) Uniroot doesn't seem to like "w" and "a" since they're just symbolic. How would I go about fixing this issue? The error I get is "object 'w' not found"
This should work:
Since you have two terms being added f+g, the derivative D(f+g) = D(f) + D(g), so let's separate both like this:
g = expression((z^2/(w + a*y^2)))
f = expression(- 1/2 * log(w+a*y^2))
Note that sum() was removed from expression f: the multiplying constant can be moved inside the sum(), and D(sum()) = sum(D()). The first constant was also dropped because its derivative is 0.
So:
D(sum(-7645/2* log(pi) - 1/2 * log(w+a*y^2)) + (z^2/(w + a*y^2))) = D( constant + sum(f) + g ) = sum(D(f)) + D(g)
Which should give:
sum(-(1/2 * (y^2/(w + a * y^2)))) + -(z^2 * y^2/(w + a * y^2)^2)
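The two pieces can be differentiated with D() and checked numerically. The sketch below uses a short made-up vector in place of export.csv and arbitrary values for a and w; the helper names h and gradient are hypothetical. It works element by element and leaves the question of where sum() belongs to the reader.

```r
# Made-up stand-in for the data read from export.csv
x <- c(0.5, -1.2, 0.3, 2.0, -0.7)
y <- x[-length(x)]  # plays the role of x[1:7644]
z <- x[-1]          # plays the role of x[2:7645]

# The two terms, without sum() and without the constant
f <- quote(-1/2 * log(w + a * y^2))
g <- quote(z^2 / (w + a * y^2))

# Symbolic derivatives with respect to a
df_da <- D(f, "a")
dg_da <- D(g, "a")

# Hypothetical helpers: eval() looks up a and w in the function call,
# and y and z in the enclosing environment
h        <- function(a, w) eval(f) + eval(g)
gradient <- function(a, w) eval(df_da) + eval(dg_da)

# Central finite difference at an arbitrary point as a sanity check
a0 <- 0.3; w0 <- 1.5; eps <- 1e-6
numeric_grad <- (h(a0 + eps, w0) - h(a0 - eps, w0)) / (2 * eps)

print(all.equal(gradient(a0, w0), numeric_grad, tolerance = 1e-6))
```

Wrapping the derivative in a function of a (with w fixed) is also the form uniroot() expects, rather than a bare evaluated expression.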
expression takes only a single expr input, not a vector, and vectorizing that is beyond R's abilities.
You can also do this with a for loop:
foo <- c("1+2","3+4","5*6","7/8")
result <- numeric(length(foo))
foo <- parse(text=foo)
for(i in seq_along(foo))
result[i] <- eval(foo[[i]])
