na.omit function is not removing rows containing NA - r

Hi there, I have been searching the internet to find out what is wrong: the na.omit() function is not removing the rows with NA. Could you please help me?
library(TTR)
library(quantmod)
library(doParallel) #this library is for parallel core processing
StartDate = "2010-01-01"
EndDate = "2020-03-20"
myStock <- c("AMZN")
getSymbols(myStock, src="yahoo", from=StartDate, to=EndDate)
gdat <-coredata(AMZN$AMZN.Close) # Create a 2-d array of all the data. Or...
Data <- data.frame(date=index(AMZN), coredata(AMZN)) # Create a data frame with the data and (optionally) maintain the date as an index
Data$rsi22 <- data.frame(RSI(Cl(Data), n=22))
Data$rsi44 <- data.frame(RSI(Cl(Data), n=44))
colnames(Data)
DatanoNA <- na.omit(Data) #remove rows with NAs

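I think the real problem is how the data frame is built rather than na.omit() itself. na.omit() does remove rows containing NA from a data frame, but it only checks atomic (vector or matrix) columns. Your rsi22 and rsi44 columns are themselves data frames, so their NAs are never seen. Once every column is a plain vector, either na.omit() or complete.cases() will work.
Your data frame construction is a little wonky (see below for more explanation). Try this: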
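# Compute RSI on the original xts object and store each result as a plain numeric vector
Data <- data.frame(date=index(AMZN), coredata(AMZN),
rsi22=as.numeric(RSI(Cl(AMZN), n=22)),
rsi44=as.numeric(RSI(Cl(AMZN), n=44)))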
nrow(Data)
nrow(Data[complete.cases(Data),])
Normally every column of a data frame is a vector. The results of RSI() are stored as a vector. When you say
Data$rsi22 <- data.frame(RSI(Cl(Data), n=22))
what you're doing is wrapping the results in a data frame and then embedding it in another data frame (Data), which is something you can legally do in R but which is unusual and confuses a lot of the standard data-processing functions.
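
A small sketch of the effect on made-up data, using only base R. The names flat and nested are hypothetical. It compares na.omit() on a data frame whose column is a plain vector with one whose column was wrapped in data.frame() first, as in the question.

```r
# Toy data with a plain vector column and one NA (values are made up).
flat <- data.frame(id = 1:3, score = c(1.5, NA, 2.5))

# Same values, but the second column is wrapped in a data frame first,
# mirroring Data$rsi22 <- data.frame(...) from the question.
nested <- data.frame(id = 1:3)
nested$score <- data.frame(score = c(1.5, NA, 2.5))

# Compare how many rows survive in each case.
rows_flat   <- nrow(na.omit(flat))
rows_nested <- nrow(na.omit(nested))

# A data frame column is a list, not an atomic vector.
nested_is_atomic <- is.atomic(nested$score)

print(c(flat = rows_flat, nested = rows_nested))
print(nested_is_atomic)
```

If the two row counts differ, the embedded data frame is what hides the NA from na.omit().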

You could try complete.cases
DatanoNA <- Data[complete.cases(Data),]

Related

Combine imputed data by group in r using mice

My question is a follow-up to this question on imputation by group using "mice":
multiple imputation and multigroup SEM in R
The code in the answer works fine as far as the imputation part goes. But afterwards I am left with a list that holds more than one complete data set. The sample looks as follows:
# Set up data frame
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
# Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
df
# Impute values by group:
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
df.clean
As you can see, df.clean is a list of 3, one element per group. But each element contains a complete data set of the kind I am looking for.
The original answer suggests calling rbind() on the data in df.clean, which leaves me with a new data set of 45 observations (3x the original size).
Here is the original code for the last step:
imputed.both <- do.call(args = df.clean, what = rbind)
Which data is the "right" one? And why the last step?
Thanks a bunch!
There's a bug in the code. I have an edited version below that works:
#Set up data frame
set.seed(12345)
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
#Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
# check NAs
colSums(is.na(df))
#Impute values by group:
# here's the fix: mice() is called on x (the group subset), not on df
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
imputed.both <- do.call(args = df.clean, what = rbind)
dim(imputed.both)
# returns 15,4
In the code in the question, you have
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
dim(do.call(rbind,df.clean))
#this returns 45,4
The function's argument is "x", but the body uses df from the global environment. Hence every call imputes on the complete df, and you get the full data set three times.
So to answer your question, if you do this step:
split(df,df$ID)
You split your data frame into a list of data frames, each containing only the A, B, or C rows. Then if you lapply through this list, you get
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
names(df.clean)
lapply(df.clean,dim)
Each item of the list df.clean contains a subset of the original df, with ID being A, B, or C. Now you combine this list into a single data.frame using:
imputed.both <- do.call(rbind,df.clean)
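
The split, lapply, rbind pattern and the effect of the bug can be sketched without mice. This is a minimal illustration on made-up data: fill_with_mean() is a hypothetical stand-in for mice::complete(mice(x)) that simply fills NAs with the mean, and it is not a substitute for real multiple imputation.

```r
# Toy data: two groups, one missing value each (values are made up).
toy <- data.frame(ID = rep(c("A", "B"), each = 3),
                  x1 = c(1, NA, 3, 10, 20, NA))

# Hypothetical stand-in for mice::complete(mice(x)): fill NAs with the
# mean of whatever data frame it receives.
fill_with_mean <- function(x) {
  x$x1[is.na(x$x1)] <- mean(x$x1, na.rm = TRUE)
  x
}

pieces <- split(toy, toy$ID)

# Correct: the function uses its argument, so each group is handled alone.
good <- do.call(rbind, lapply(pieces, function(x) fill_with_mean(x)))

# Buggy: the function ignores x and processes the whole data frame each time.
bad <- do.call(rbind, lapply(pieces, function(x) fill_with_mean(toy)))

print(dim(good))  # same number of rows as toy
print(dim(bad))   # one full copy of toy per group
```

The row count of the combined result is a quick check: it should equal the row count of the input.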

Finding Mean of a column in an R Data Set, by using FOR Loops to remove Missing Values

I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
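
For the reproducible vector from the question, a minimal sketch looks like this. The data frame name x and the column name first are placeholders, not names from the real air quality data.

```r
# The example values from the question, stored as the first column
# of a data frame (the column name "first" is a placeholder).
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- data.frame(first = y)

# Without na.rm the NAs propagate and the result is NA.
mean_with_na <- mean(x$first)

# With na.rm = TRUE the NAs are dropped before averaging.
col_mean <- mean(x$first, na.rm = TRUE)

print(mean_with_na)
print(col_mean)
```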
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column (rather than just compute the mean). By default, na.aggregate uses FUN = mean.
I can't see your data, but it is probably something like this. The vector needed to be initialized before the loop. It is better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in seq_len(nrow(myDataFrame))){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
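
Since hw1_data.csv is not available here, the same loop idea can be tried on the matrix from the question. This is a sketch that keeps a running total and count instead of growing a vector; the names total, count and loop_mean are made up for the illustration.

```r
# The reproducible example from the question.
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

# Initialise the accumulators before the loop.
total <- 0
count <- 0

for (i in seq_len(nrow(x))) {
  # Only use the value if it is not missing.
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    count <- count + 1
  }
}

loop_mean <- total / count
print(loop_mean)
```

The result should agree with mean(x[, 1], na.rm = TRUE).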

Remove row names from a list of dataframes?

My code:
library(quantmod)
library(tseries)
library(ggplot2)
companies = c("IOC.BO", "BPCL.BO", "ONGC.BO", "HINDPETRO.BO", "GAIL.BO")
stocks = list()
for(i in 1:5){
stocks[[i]] = getSymbols(companies[i], auto.assign = FALSE)
}
stocks is a list of dataframes. Now I'm trying to bind all the $adjusted columns of the dataframes stored in stocks, but to do that I need to remove the row names (someone please tell me if there's a better method to do this):
for(i in 1:5)
rownames(stocks[[i]])<- NULL
but the resulting dataframes still have their row names. Could someone please tell me where I'm going wrong?
P.S. Further my end goal is to have a dataframe with only the adjusted columns of the dataframes in the list stocks for which I did this:
adjusted=data.frame()
for(i in 1:5)
coln=stocks[[1]][,6]
adjusted=cbind(ajusted,coln)
adjusted
but this returns adjusted as a list.
Row names
Regarding row names after running the code in the question
rownames(stocks[[1]])
## NULL
so it is not true that stocks have row names afterwards.
Adjusted series
To create a time series of adjusted values use Ad as shown below.
Adjusted <- do.call("merge", lapply(stocks, Ad))
Putting it all together
Note that we don't need any of the row-name processing; the following is sufficient. The second-to-last line is optional, as its only purpose is to make the column names nicer. The last line converts the xts object Adjusted to a data frame and may not be needed either, since you may find an xts object more convenient to work with than a data frame.
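library(quantmod)
library(ggplot2)
companies <- c("IOC.BO", "BPCL.BO", "ONGC.BO", "HINDPETRO.BO", "GAIL.BO")
stocks <- lapply(companies, getSymbols, auto.assign = FALSE)
Adjusted <- do.call("merge", lapply(stocks, Ad))
# fixed = TRUE so the dots are matched literally rather than as regex wildcards
names(Adjusted) <- sub(".BO.Adjusted", "", names(Adjusted), fixed = TRUE)
adjustedDF <- fortify(Adjusted)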

Reading csv in r - numbers like "general"

First, I am new to R.
Some numbers in my CSV are treated as "general" rather than numeric, so I can't do math with the data. Is there any solution for this?
I have tried data <- as.numeric(as.character(data)) but it failed.
data <- read.csv(file="TC.csv", header=TRUE, sep=",")
data[ data == "?" ] <- NA
for(i in 1:ncol(data)) {
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
I get this message:
In mean.default(results) : argument is not numeric or logical: returning NA
I think the problem is related to numbers like the one in the yellow cell.
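You shouldn't need to loop over the data set to remove rows. Also, I don't believe the highlighted rows are the root of the problem. To make it easier, I would make sure the data is a data frame (read.csv() already returns one, so the as.data.frame() wrapper is just a safeguard).
data <- as.data.frame(read.csv(file="TC.csv", header=TRUE, sep=","))
To remove the rows containing the '?' character, you should be able to run the code below, where Column stands for the name of the affected column. I think it is easier to drop these rows directly than to convert the values to NA and then drop them. Note the fixed = TRUE argument: without it, '?' is treated as a regular-expression operator rather than a literal character. The column also has to be converted to numeric, because the '?' entries caused it to be read as text.
data <- data[!grepl('?', data$Column, fixed = TRUE),]
data$Column <- as.numeric(as.character(data$Column))
mean(data$Column)
summary(data)
In summary, you should make sure the data is a data frame, replace or drop the rows that have values that aren't numeric, convert the affected columns to numeric, and then perform your summary stats.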
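You are getting that message because mean() is being applied to a column that is not numeric. A single "?" in a column makes read.csv() read the whole column as text (character or factor), and mean() only operates on numeric or logical values.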
In R, the usual way of dealing with multi-dimensional data is not to loop over it, but to use one of the various apply functions, which perform an operation on one dimension of your data. Here you are looking for the column mean, which you get by:
TC.csv
a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6
get_mean.R
data <- read.csv(file="TC.csv", header=TRUE, sep=",", na.strings="?")
# apply( data, dimension, function, function_args )
col_means <- apply( data, 2, mean, na.rm=TRUE )
See also the R help pages ?apply (Apply Functions Over Array Margins) and ?lapply (Apply a Function over a List or Vector).
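
To go one step further and fill the missing cells with the column means, which is what the loop in the question was attempting, here is a minimal sketch. It reads the sample rows above from a string instead of TC.csv so that it runs on its own; csv_text is a made-up name, and colMeans() is used as a shortcut for apply(data, 2, mean).

```r
# The sample rows from TC.csv, inlined so the sketch is self-contained.
csv_text <- "a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6
"

# na.strings = "?" turns the question marks into NA while reading,
# so every column stays numeric.
data <- read.csv(text = csv_text, header = TRUE, na.strings = "?")

# Column means, ignoring the missing cells.
col_means <- colMeans(data, na.rm = TRUE)

# Replace each NA with the mean of its own column.
for (col in names(data)) {
  data[[col]][is.na(data[[col]])] <- col_means[[col]]
}

print(data)
```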

R: Split-Apply-Combine... Apply Functions via Aggregate to Row-Bound Data Frames Subset by Class

Update: My NOAA GHCN-Daily weather station data functions have since been cleaned and merged into the rnoaa package, available on CRAN or here: https://github.com/ropensci/rnoaa
I'm designing an R function to calculate statistics across a data set composed of multiple data frames. In short, I want to pull data frames by class based on a reference data frame containing the names. I then want to apply statistical functions to values for the metrics listed for each given day. In effect, I want to call and then overlay a list of data frames to calculate functions on a vector of values for every unique date and metric where values are not NA.
The data frames are iteratively read into the workspace from file based on a class variable, using the 'by' function. After importing the files for a given class, I want to rbind() the data frames for that class and each user-defined metric within a range of years. I then want to apply a set of user-provided statistical functions to each metric within a class that corresponds to a given value for the year, month, and day (i.e., the mean [function] low temperature [metric] on July 1st, 1990 [date] reported across all locations [data frames] within a given region [class]).
I want the end result to be new data frames containing values for every date within a region and a year range for each metric and statistical function applied. I am very close to having this result using the aggregate() function, but I am having trouble getting reasonable results out of it: it currently outputs NA and NaN for most functions other than the mean temperature. Any advice would be much appreciated! Here is my code thus far:
# Example parameters
w <- c("mean","sd","scale") # Statistical functions to apply
x <- "C:/Data/" # Folder location of CSV files
y <- c("MaxTemp","AvgTemp","MinTemp") # Metrics to subset the data
z <- c(1970:2000) # Year range to subset the data
CSVstnClass <- data.frame(CSVstations,CSVclasses)
by(CSVstnClass, CSVstnClass[,2], function(a){ # Station list by class
suppressWarnings(assign(paste(a[,2]),paste(a[,1]),envir=.GlobalEnv))
apply(a, 1, function(b){ # Data frame list, row-wise
classData <- data.frame()
sapply(y, function(d){ # Element list
CSV_DF <- read.csv(paste(x,b[2],"/",b[1],".csv",sep="")) # Read in CSV files as data frames
CSV_DF1 <- CSV_DF[!is.na("Value")]
CSV_DF2 <- CSV_DF1[which(CSV_DF1$Year %in% z & CSV_DF1$Element == d),]
assign(paste(b[2],"_",d,sep=""),CSV_DF2,envir=.GlobalEnv)
if(nrow(CSV_DF2) > 0){ # Remove empty data frames
classData <<- rbind(classData,CSV_DF2) # Bind all data frames by row for a class and element
assign(paste(b[2],"_",d,"_bound",sep=""),classData,envir=.GlobalEnv)
sapply(w, function(g){ # Function list
# Aggregate results of bound data frame for each unique date
dataFunc <- aggregate(Value~Year+Month+Day+Element,data=classData,FUN=g,na.action=na.pass)
assign(paste(b[2],"_",d,"_",g,sep=""),dataFunc,envir=.GlobalEnv)
})
}
})
})
})
I think I am pretty close, but I am not sure if rbind() is performing properly, nor why the aggregate() function is outputting NA and NaN for so many metrics. I was concerned that the data frames were not being bound together or that missing values were not being handled well by some of the statistical functions. Thank you in advance for any advice you can offer.
Cheers,
Adam
You've tackled this problem in a way that makes it very hard to debug. I'd recommend switching things around so you can more easily check each step. (Using informative variable names also helps!) The code is unlikely to work as is, but it should be much easier to work iteratively, checking that each step has succeeded before continuing to the next.
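# List the CSV files with full paths (add recursive = TRUE if they sit in per-class subfolders)
paths <- dir("C:/Data/", pattern = "\\.csv$", full.names = TRUE)
# Read in CSV files as data frames
raw <- lapply(paths, read.csv, stringsAsFactors = FALSE)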
# Extract needed rows
filter_metrics <- c("MaxTemp", "AvgTemp", "MinTemp")
filter_years <- 1970:2000
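# Wrap subset() in a function: passing the condition straight through lapply() breaks its non-standard evaluation
filtered <- lapply(raw, function(df) subset(df,
!is.na(Value) & Year %in% filter_years & Element %in% filter_metrics))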
# Drop any empty data frames
rows <- vapply(filtered, nrow, integer(1))
filtered <- filtered[rows > 0]
# Compute aggregates
my_aggregate <- function(df, fun) {
aggregate(Value ~ Year + Month + Day + Element, data = df, FUN = fun,
na.action = na.pass)
}
means <- lapply(filtered, my_aggregate, mean)
sds <- lapply(filtered, my_aggregate, sd)
scales <- lapply(filtered, my_aggregate, scale)
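
Two likely sources of the NA results can be checked on a tiny made-up data frame. This is only a sketch with invented values, assuming the same column names as in the question: with na.action = na.pass the NA rows reach the function, and sd() of a single observation is NA.

```r
# Synthetic stand-in for the bound station data (values are made up).
weather <- data.frame(
  Year    = 1990,
  Month   = 7,
  Day     = c(1, 1, 1, 2),
  Element = "MinTemp",
  Value   = c(10, 12, NA, 15)
)

# na.pass keeps the NA row, so mean() without na.rm returns NA for day 1.
with_na <- aggregate(Value ~ Year + Month + Day + Element, data = weather,
                     FUN = mean, na.action = na.pass)

# The default na.action (na.omit) drops the NA row before aggregating.
no_na <- aggregate(Value ~ Year + Month + Day + Element, data = weather,
                   FUN = mean)

# sd() needs at least two observations: day 2 has only one reading.
sds <- aggregate(Value ~ Year + Month + Day + Element, data = weather,
                 FUN = sd)

print(with_na)
print(no_na)
print(sds)
```

Note also that scale() returns one value per input row rather than a single summary per group, so it does not fit the same aggregate() pattern as mean and sd.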

Resources