Why am I getting an NA when calculating the mean? - r

Every time I try to calculate this line "DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)" I receive an NA answer. Calculating mean(ahe2008) works but calculating mean(ahebachelors2008) does not work.
setwd("~/Google Drive/R Data")
data <- read.csv('cps92_08.csv')
year <- data$year
year1992 <- subset(data,year<2000)
year2008 <- subset(data,year>2000)
ahe1992 <- (year1992$ahe)
ahe2008 <- (year2008$ahe)
max(ahe1992)
min(ahe1992)
mean(ahe1992)
median(ahe1992)
sd(ahe1992)
max(ahe2008)
min(ahe2008)
mean(ahe2008)
median(ahe2008)
sd(ahe2008)
adjahe <- ahe1992*(215.2/140.3)
max(adjahe)
min(adjahe)
mean(adjahe)
median(adjahe)
sd(adjahe)
D <- mean(ahe2008) - mean(adjahe)
education <- data$bachelor
ahebachelors1992 <- subset(adjahe, education>0)
ahehighschool1992 <- subset(adjahe,education<1)
ahebachelors2008 <- subset(ahe2008,education>0)
ahehighschool2008 <- subset(ahe2008,education<1)
DHS <- mean(ahebachelors2008) - mean(ahebachelors1992)

education has one element per row of data, whereas ahe2008 comes from a subset of data and is therefore shorter. So when you pass education as the condition on ahe2008, every TRUE that falls beyond the end of ahe2008 produces an NA, because there is no corresponding element in ahe2008.
Here's a simpler example:
d1<-c(1:5)
d2<-c(1:5,1:5)
subset(d1,d2==1)
[1] 1 NA
Possible solutions would be to create separate bachelor vectors for each year, or to not continuously subset but just use multiple conditions where you need them.
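As a rough sketch of the first suggestion (separate bachelor vectors per year), here is a toy version. The data values are invented; only the column names year, ahe and bachelor come from the question, and education2008 is a name I made up for illustration:

```r
# Toy stand-in for cps92_08.csv; the values are invented for illustration
data <- data.frame(
  year     = c(1992, 1992, 1992, 2008, 2008, 2008),
  ahe      = c(10, 12, 20, 15, 18, 30),
  bachelor = c(0, 0, 1, 0, 1, 1)
)

year2008 <- subset(data, year > 2000)
ahe2008 <- year2008$ahe
education2008 <- year2008$bachelor   # same length as ahe2008

# Mismatched lengths: the condition has 6 elements, ahe2008 only 3,
# so TRUE positions past the end of ahe2008 come back as NA
bad <- subset(ahe2008, data$bachelor > 0)

# Matched lengths: no NA is introduced
ahebachelors2008 <- subset(ahe2008, education2008 > 0)
mean(ahebachelors2008)
```

The same pattern would apply to the 1992 subset.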
If you're trying to avoid typing the full data$something every time, consider using with(), or even better - the dplyr package.
For example, all the code leading up to the last line could be replaced with this (assuming I didn't miss anything):
DHS <- mean(with(data, ahe[year>2000 & bachelor>0])) -
  mean(with(data, ahe[year<2000 & bachelor>0]*(215.2/140.3)))
(If you're new to R, note that indexing with [] is a more direct way of doing what subset does here.)
You might also want to consider using summary, which will give you min, median, mean, and max, leaving you with just sd to add manually:
summary(with(data,ahe[year>2000]))
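A small sketch of combining summary with sd in one named vector, using invented numbers in place of the real ahe column (the name stats is just for illustration):

```r
ahe <- c(2, 4, 4, 4, 5, 5, 7, 9)   # invented values, not the CPS data

# summary() gives min, quartiles, median, mean and max;
# unclass() turns it into a plain named vector so sd can be appended
stats <- c(unclass(summary(ahe)), SD = sd(ahe))
stats
```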

If the values you are calculating the mean of contain NA, the output will be NA. You can overcome this by adding na.rm = TRUE to your mean call:
DHS <- mean(ahebachelors2008, na.rm=TRUE) - mean(ahebachelors1992, na.rm=TRUE)
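A minimal illustration of what na.rm does, on a made-up vector. Keep in mind that na.rm only drops the missing values; it does not tell you why they are there:

```r
x <- c(1, NA, 3)

mean(x)                 # NA, because one element is missing
mean(x, na.rm = TRUE)   # 2, the mean of the non-missing values 1 and 3
```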

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my coding has to be as efficient as possible or else 1) I get errors like - R cannot allocate vector of size XX or 2) the calculations take forever.
I am to the point where I need to calculate r (in the example below r = 3) deviations from the mean for each sample (1 sample per row of the df)(Labeled dev1 - dev3 in pic below):
These are my data in R:
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, I get:
I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector of X1$x.bar instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r<-3. WARNING: This computes all possible combinations, so the data frames get very large very quickly.
options(scipen = 999)
dp <- function(x) {
dp1<-nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))), ".",
fixed=TRUE),function(x) x[2]))
ifelse(is.na(dp1),0,dp1)
}
retain1<-function(x,minuni) length(unique(floor(x)))>=minuni
# =======================================================
r<-3
x0<-seq(80,120,.25)
X0<-data.frame(t(combn(x0,r)))
names(X0)<-paste("x",1:r,sep="")
X<-X0[apply(X0,1,retain1,minuni=r),]
rm(X0)
gc()
X$x.bar<-rowMeans(X)
dp1<-dp(X$x.bar)
X1<-X[dp1<=2,]
rm(X)
gc()
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2 and x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
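To see this on something small, here is a sketch with two invented rows whose means match the 81 and 81.25 mentioned in the question. The dev1 to dev3 column names are taken from the question's description:

```r
# Two toy samples of size r = 3 (values invented for illustration)
X1 <- data.frame(x1 = c(80, 80), x2 = c(81, 81.25), x3 = c(82, 82.5))
X1$x.bar <- rowMeans(X1[, 1:3])

# Data frame minus vector: x.bar is recycled down each column,
# so every row has its own mean subtracted
devs <- X1[, 1:3] - X1$x.bar
names(devs) <- paste0("dev", 1:3)
X1devs <- cbind(X1, devs)
X1devs
```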
I think you just got the margin wrong, in apply you're using 1 as in row wise, but you want to do column wise so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better in performance than loops, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
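A quick way to check the margin point is to compare the shapes of the two results on a toy data frame (six invented rows; by_col and by_row are names I made up). With margin 1, each 3-element row is recycled against the whole x.bar vector, which is why the result blows up:

```r
X1 <- data.frame(x1 = rep(80, 6), x2 = 81:86, x3 = seq(82, 92, by = 2))
X1$x.bar <- rowMeans(X1[, 1:3])

# Margin 2: each column (length 6) minus x.bar (length 6)
by_col <- apply(X1[, 1:3], 2, function(x) x - X1$x.bar)
# Margin 1: each row (length 3) recycled against x.bar (length 6)
by_row <- apply(X1[, 1:3], 1, function(x) x - X1$x.bar)

dim(by_col)   # one row per sample, one column per deviation
dim(by_row)   # number of samples squared, which is what exhausts memory
```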

R programming Function (Returning a subset of Real Mean Squared)

I am new to R and am working on writing some cool functions while I learn statistics in parallel. I'm trying to make a function that will take a numeric vector, perform the "root mean squared" operations and then have the output return essentially same vector with the possible outliers removed.
For example, if the vector is c(2,4,9,10,100) the resulting RMS would be about 37.
Therefore, I want the output to return the same vector with the possible outlier (in this case, 100) removed from the dataset. So the result would be 2, 4, 9, 10
I put my code below but the output isn't working. I tried it 2 different ways. Everything up to the line that says RMS final works. But below that it does not.
How can I modify this function so that it does what I want? Also, as a bonus, and this might be asking a lot but based on my coding below, any tips for a newbie on making functions would be something I'd be grateful for as well. Thanks so much!
RMS_x <- c(2,4,9,10,100)
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
for(i in length(RMS_x)){
if(abs(RMS_x[i]) > RMS_FINAL){
output <- RMS_x[i]}
else {NULL} }
return(output)
}
#Root Mean Squared Function - Takes a numeric vector
RMS <- function(RMS_x){
RMS_MEAN <- mean(RMS_x)
RMS_DIFF <- (RMS_x-RMS_MEAN)
RMS_DIFF_SQ <- RMS_DIFF^2
RMS_FINAL <- sqrt(sum(RMS_DIFF_SQ)/length(RMS_x))
#output <- ifelse(abs(RMS_x) > RMS_FINAL,RMS_x, NULL)
return(RMS_FINAL)
}
Try the following in the first lines of the RMS function.
RMS <- function(RMS_x) {
  bp <- boxplot(RMS_x, plot = FALSE)
  RMS_x <- RMS_x[!(RMS_x %in% bp$out)]
  ...
Now, you have RMS_x sans the outliers.
The boxplot function has a way of determining the outliers. Here, I am using that to remove them.
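Run on the vector from the question, the idea looks roughly like this (kept is just an illustrative name):

```r
RMS_x <- c(2, 4, 9, 10, 100)

# plot = FALSE returns the boxplot statistics without drawing anything
bp <- boxplot(RMS_x, plot = FALSE)
bp$out                                   # values flagged as outliers

kept <- RMS_x[!(RMS_x %in% bp$out)]      # the vector without them
kept
```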
Since you are asking more specifically about R and R functions I’ll focus my response on that. There are a couple errors I'll point out then provide a few alternative solutions.
Your first function isn’t producing the output you want for two reasons:
The logic instructs the function to return a single value rather than a vector. If you're trying to fill a vector within your for loop (one without the outlier), make sure to initialize the vector before the loop starts: output <- vector() (note that my solution below does not require this). Also, the value it returns is just a value in your vector RMS_x that is greater than the RMS rather than an outlier as such, just FYI in case that's what you wanted.
There's an error and/or typo in your for loop argument. It's minor, but it means your for loop doesn't loop at all, which is the opposite of what you intended. The for loop needs a vector to loop through, so the argument should be: for(i in 1:length(RMS_x))
In your code the loop jumps straight to i = 5 because that is the length of your vector (length(RMS_x) is 5). Given that the values in RMS_x were already in ascending order, your code happens to give the "right" answer, but only because of how you initially loaded the vector. This may have been a typo in your question, and it's a difference of only 2 characters of code, but it totally changes what the function looks for.
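For reference, a minimal sketch of the first function with both points addressed. The name rms_filter is made up, and the cut-off rule (drop values whose absolute value exceeds the RMS) is simply the one from the question, not a general outlier test:

```r
rms_filter <- function(x) {
  rms <- sqrt(sum((x - mean(x))^2) / length(x))
  keep <- logical(length(x))        # initialized before the loop
  for (i in seq_along(x)) {         # visits every index, not just the last
    keep[i] <- abs(x[i]) <= rms
  }
  x[keep]                           # returns a vector, not a single value
}

rms_filter(c(2, 4, 9, 10, 100))
```

The loop is only there to mirror the original; x[abs(x) <= rms] does the same thing in one vectorized step.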
Solution:
To get what you are trying to accomplish, you need two functions: one that defines what's considered an outlier in your data set, and a second that strips out the outliers and calculates the RMS. From there, either make the functions independent or nest them to pass variables (this ties in with your bonus request as well, since it shows multiple ways of writing functions).
Function to identify outliers:
outlrs <- function(vec){
  Q1 <- summary(vec)["1st Qu."]
  Q3 <- summary(vec)["3rd Qu."]
  # defining outliers can get complicated depending on your sample data but
  # your data set is super simple so we'll keep it that way
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5*(IQR)
  upper_bound <- Q3 + 1.5*(IQR)
  bounds <- c(lower_bound, upper_bound)
  # assign() creates an actual object in your global environment called
  # non_outlier_range that you can access directly. It has to come before
  # return(), because nothing after return() is executed.
  assign("non_outlier_range", bounds, envir = globalenv())
  # return() just means the result will be printed to the console or stored
  # in whatever variable you assign the function call to
  return(bounds)
}
Now moving on to the second function, a few options here:
First Way: Input bounds argument into RMS_func()
RMS_func <- function(dat, bounds){
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call function from approach 1 - outlrs() has to be run first, and the
# assign() in its definition is required for non_outlier_range to exist:
outlrs(RMS_x)
RMS_func(dat = RMS_x, bounds = non_outlier_range)
Second Way: Call outlrs() inside the second function
RMS_func <- function(dat){
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
# Call RMS_func - here the assign() in outlrs() is not needed because the
# output exists within the function's temporary environment and is
# passed to RMS_func
RMS_func(dat = RMS_x)
Third Way: Nest outlrs() definition within the RMS_Func - in this case you only need one nested function to accomplish your task
RMS_Func <- function(dat){
outlrs <- function(vec){
Q1 <- summary(vec)["1st Qu."]
Q3 <- summary(vec)["3rd Qu."]
# equivalent alternative using quantile():
#Q1 <- quantile(vec)["25%"]
#Q3 <- quantile(vec)["75%"]
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5*(IQR)
upper_bound <- Q3 + 1.5*(IQR)
bounds <- c(lower_bound, upper_bound)
return(bounds)
}
bounds <- outlrs(vec = dat)
dat <- dat[!(dat < min(bounds)) & !(dat > max(bounds))]
dat_MEAN <- mean(dat)
dat_DIFF <- (dat-dat_MEAN)
dat_DIFF_SQ <- dat_DIFF^2
dat_FINAL <- sqrt(sum(dat_DIFF_SQ)/length(dat))
return(dat_FINAL)
}
P.S. Wrote this pretty quickly - will likely re-test and edit later. Hopefully for now this helps.
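A compact sketch of the same two-function idea, run on the vector from the question. The function names outlier_bounds and rms_without_outliers are invented here, and quantile() is swapped in for summary(); the 1.5 * IQR rule is the one used above:

```r
# Bounds outside of which a value counts as an outlier (1.5 * IQR rule)
outlier_bounds <- function(vec) {
  q <- quantile(vec, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Drops the outliers, then returns both the kept values and their RMS
rms_without_outliers <- function(dat) {
  b <- outlier_bounds(dat)
  kept <- dat[dat >= b["lower"] & dat <= b["upper"]]
  list(kept = kept, rms = sqrt(mean((kept - mean(kept))^2)))
}

res <- rms_without_outliers(c(2, 4, 9, 10, 100))
res$kept   # the original vector with 100 removed
res$rms    # RMS deviation of the remaining values
```

Returning a list is one way to get both the cleaned vector (which is what the question asked for) and the RMS from a single call.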

R Matching closest number from columns

I have a list of responses to 7 questions from a survey, each their own column, and am trying to find the response within the first 6 that is closest (numerically) to the 7th. Some won't be the exact same, so I want to create a new variable that produces the difference between the closest number in the first 6 and the 7th. The example below would produce 0.
s <- c(1,2,3,4,5,6,3)
s <- t(s)
s <- as.data.frame(s)
s
Any help is deeply appreciated. I apologize for not having attempted code as nothing I have tried has actually gotten close.
How about this?
which.min( abs(s[1, 1:6] - s[1, 7]))
I'm assuming you want it generalized somehow, but you'd need to provide more info for that. Or just run it through a loop :-)
EDIT: added the loop from the comment and changed exactly 2 tiny things.
s <- c(1,2,3,4,5,6,3)
t <- c(1,2,3,4,5,6,7)
p <- c(1,2,3,4,5,6,2)
s <- data.frame(s,t,p)
k <- t(s)
k <- as.data.frame(k)
k$t <- NA ### need to initialize the column
for(i in 1:3){
## need to refer to each line of k when populating the t column
k[i,]$t <- which.min(abs(k[i, 1:6] - k[i, 7])) }
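Note that which.min() gives the position of the closest response, while the question asks for the difference itself, which is min() of the same absolute gaps. A loop-free sketch that returns both for every row, using the three example rows above (responses, gaps, closest_col and closest_diff are names I made up):

```r
responses <- data.frame(rbind(
  s = c(1, 2, 3, 4, 5, 6, 3),
  t = c(1, 2, 3, 4, 5, 6, 7),
  p = c(1, 2, 3, 4, 5, 6, 2)
))

m <- as.matrix(responses)
# Column 7 is recycled down each of the first six columns,
# so each row is compared with its own 7th response
gaps <- abs(m[, 1:6] - m[, 7])

responses$closest_col  <- apply(gaps, 1, which.min)  # which question is closest
responses$closest_diff <- apply(gaps, 1, min)        # how far away it is
responses
```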

R studio doesn't find objects in my function

I’m new to programming and I’m currently writing a function to go through hundreds of csv files in the working directory.
The files have tons of NA values in them.
The function (which I call corr) has two parameters: the directory, and a threshold value (a numeric vector of length 1 indicating the number of complete cases).
The purpose of the function is to take the complete cases for two columns that are sulfate and nitrate(second and third column in the spreadsheet) and calculate the correlation between them if the number of complete cases is greater than the threshold parameter.
The function should return a vector with the correlation if it met the threshold requirement (the default threshold value is 0).
When I run the code I get back one of the following:
1. A + sign in the console
OR
2. The objects I created in the function can't be found.
Any help would be much appreciated. Thank you in advance!
corr <- function(directory, threshold=0){
filelist2<- data.frame(list.files(path=directory,
pattern=".csv", full.names=TRUE))
corvector <- numeric()
for(i in 1:length(filelist2)){
data <-data.frame(read.csv(filelist2[i]))
removedNA<-complete.cases(data)
newdata<-data[removedNA,2:3]
if(nrow(removedNA) > threshold){
corvector<-c(corvector, cor(data$sulfate, data$nitrate ))
}
}
corvector
}
I don't think your nrow(removedNA) does what you think it does. To replicate the example I use the mtcars dataset.
data <- mtcars # create dataset
data[2:4, 2] <- NA # create some missings in column 2
data[15:17, 3] <- NA # create some missing in column 3
removedNA <- complete.cases(data)
table(removedNA) # 6 missings indeed
nrow(removedNA) # NULL: removedNA is a logical vector, not a data.frame, so nrow() doesn't work
newdata <- data[removedNA, 2:3] # this works though
nrow(newdata) # and this shows the rows in 'newdata'
#---- therefore instead of nrow(removedNA) try
if(nrow(data)-nrow(newdata) < threshold) {
...
}
NB: I changed the > to < in the line with threshold. I guess it depends on whether you want the threshold to be an absolute minimum number of complete rows (in which case you could simply use nrow(newdata) > threshold), or whether you want it to reflect the difference in the number of rows between the original data and the 'new' data.
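Putting the pieces together, a hedged sketch of what corr could look like with the absolute-minimum reading of the threshold. It assumes the columns are named sulfate and nitrate (as in the question's cor() call); the two temporary CSV files and their values are invented purely so the example runs:

```r
corr <- function(directory, threshold = 0) {
  # Keep the file paths as a plain character vector, not a data.frame
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  corvector <- numeric()
  for (f in files) {
    dat <- read.csv(f)
    # Rows where both sulfate and nitrate are present
    complete <- dat[complete.cases(dat[, c("sulfate", "nitrate")]), ]
    if (nrow(complete) > threshold) {
      corvector <- c(corvector, cor(complete$sulfate, complete$nitrate))
    }
  }
  corvector
}

# Invented demo files in a temporary directory
demo_dir <- file.path(tempdir(), "corr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = 1:5, sulfate = c(1, 2, NA, 4, 5),
                     nitrate = c(2, 4, 6, NA, 10)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = 1:3, sulfate = c(NA, 1, 2),
                     nitrate = c(1, NA, 3)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# Only 001.csv has more than 2 complete cases, so one correlation comes back
res <- corr(demo_dir, threshold = 2)
res
```

Compared with the code in the question, the correlation is computed on the complete rows rather than on the raw data, and the loop runs over the file paths themselves.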

Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables - the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores,gender,age) {
if(gender=="Male") {
if(8 <= age & age <= 11) {convertvec <- c(1:12)}
if(12 <= age & age <= 14) {convertvec <- c(13:24)}
}
else if(gender=="Female") {
if(8 <= age & age <= 11) {convertvec <- c(25:36)}
if(12 <= age & age <= 14) {convertvec <- c(37:48)}
}
converted_scores <- rep(0,12)
for(z in 1:12) {
converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
convertvec[z]]
}
rm(z)
return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
  mydata[x,21:32] <- converter(mydata[x,7:18],
                               mydata[x,"gender"],
                               mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather tries to take them as the argument as a whole. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows, instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite 'converter' so that it takes vectors of gender, age, and a row index, which you then use to do lookups and assignments to converted_scores using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it coerces the data frame to a matrix, so if a "character" column such as gender is included, every value becomes "character". It wasn't clear whether your code normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
# rawscores is assumed to hold only the 12 numeric score columns of mydata
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ] # the genders alternate
converter <- function(idx,gender,age) {
  gidx <- match(gender, c("Male", "Female") )
  aidx <- findInterval(age, c(8,12,15) )
  # rows of cvt.arr: 1 = Male 8-11, 2 = Female 8-11, 3 = Male 12-14, 4 = Female 12-14
  ag.idx <- gidx + 2*(aidx - 1)
  cvt <- cvt.arr[ ag.idx, ]
  # matrix indexing: one (row, column) pair of normdf per subscale
  normdf[cbind(unlist(rawscores[idx, ]) + 1, cvt)]
}
# mapply returns one column per record, so transpose to one row per record
converted_scores <- t(mapply(converter, 1:NROW(rawscores), mydata$gender, mydata$age))
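On the narrower question of getting mapply() to work over rows: one pattern is to iterate over the row index alongside the other per-row arguments. A toy sketch, where scores, bonus, by_row and result are invented stand-ins rather than anything from the question:

```r
# Hypothetical toy data: three score columns plus one per-row argument
scores <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, 6))
bonus <- c(10, 100)

# mapply() walks the row index and bonus in parallel, one call per row
by_row <- mapply(
  function(i, add) unlist(scores[i, ]) + add,
  seq_len(nrow(scores)), bonus
)

# mapply() returns one column per call, so transpose to one row per record
result <- t(by_row)
result
```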
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is this idxs data frame, which combines all the rest. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here as a first thought, and see whether I can come up with a better answer, either based on joran's comment, or using a three- or four-dimensional array for normdf. Not sure yet.
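To make the matrix-indexing idea easier to follow, here is a scaled-down sketch with 3 subscales instead of 12. Everything in it is an assumption for illustration: the contents of normdf and mydata are invented, and n_sub, raw and col_idx are names I made up:

```r
n_sub <- 3   # 3 subscales here instead of 12

# Invented lookup table: row = raw score + 1, column = group offset + subscale
normdf <- matrix(seq_len(3 * 4 * n_sub), nrow = 3)

mydata <- data.frame(
  s1 = c(0, 2), s2 = c(1, 1), s3 = c(2, 0),
  gender = c("Male", "Female"), age = c(9, 13)
)

# Column offset of each record's normalization table within normdf
offset <- with(mydata, 2 * n_sub * (gender == "Female") + n_sub * (age >= 12))

raw <- as.matrix(mydata[, 1:n_sub])
col_idx <- outer(offset, 1:n_sub, "+")   # one normdf column per cell of raw

# A two-column matrix index pulls one normdf cell per (record, subscale)
converted <- matrix(normdf[cbind(c(raw) + 1, c(col_idx))], nrow = nrow(raw))
converted
```

Because raw and col_idx have the same shape, flattening both with c() keeps the cells paired up, and the result can be folded back into a matrix with one row per record.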
