I am trying to calculate the mean, IQR, and CV (coefficient of variation). The data set is "flights_DTW" and the column of interest is "DEP_DELAY_NEW"; the NA values have not been removed.
Is it possible to calculate the CV using the following code?
mean(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
mean(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
IQR(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
IQR(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
CV(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE)
CV(flights_DTW$ARR_DELAY_NEW, na.rm = TRUE)
cat(sprintf("16.76 = %.2f", flights_DTW$DEP_DELAY_NEW))
I came up with the following result:
[1] 16.75676
[1] 16.43083
[1] 8
[1] 9
Error in CV(flights_DTW$DEP_DELAY_NEW, na.rm = TRUE) :
could not find function "CV"
What I want is to put everything into a single command.
CV is not a base R function, which is why R cannot find it. If you don't want to install a package just for one function, you can define your own CV function as:
CV <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
> CV(mtcars$mpg)
[1] 0.2999881
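To get all three statistics from a single command, one option is to wrap them in a small helper and apply it to several columns at once. This is only a sketch: summarise_delay is a hypothetical name, and mtcars stands in for flights_DTW, which is not available here.

```r
# Coefficient of variation, as defined in the answer above.
CV <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Hypothetical helper: mean, IQR and CV of one numeric vector, ignoring NAs.
summarise_delay <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    IQR  = IQR(x, na.rm = TRUE),
    CV   = CV(x))
}

# mtcars is a stand-in data set; sapply returns one column of results per variable.
stats <- sapply(mtcars[c("mpg", "hp")], summarise_delay)
print(round(stats, 2))
```

With the real data, the call would presumably be sapply(flights_DTW[c("DEP_DELAY_NEW", "ARR_DELAY_NEW")], summarise_delay).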
This is my first time using any custom functions, so bear with me. I made a function for standard error that I'd like to use with aggregate. It worked until I tried to exclude NAs.
Dummy data frame to work with:
se <- function(x) sd(x)/sqrt(length(x))
df <- data.frame(site = c('N','N','N','S','S','S'),
birds = c(NA,4,2,9,3,1),
worms = c(2,1,2,4,0,5))
means <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = mean)
error <- aggregate(df[,2:3], na.rm = T, list(site = df$site), FUN = se)
So aggregate worked before I excluded NAs (e.g. error <- aggregate(df[,2:3], list(site = df$site), FUN = se)), and it works when finding the mean (using the rest of the values to take the mean and ignoring the missing value). How can I exclude NAs in that same manner when using my custom se function?
The problem is that you do not have an explicit argument for na.rm in your se function. If you add that to your function, it should work:
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}
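For completeness, here is a sketch of the full round trip using the dummy data frame from the question. The by argument is named explicitly and na.rm is passed after FUN; this is one way to write the call, not the only one.

```r
# Standard error that ignores NAs (same definition as in the answer above).
se <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(sum(!is.na(x)))
}

df <- data.frame(site  = c("N", "N", "N", "S", "S", "S"),
                 birds = c(NA, 4, 2, 9, 3, 1),
                 worms = c(2, 1, 2, 4, 0, 5))

# Extra arguments given after FUN (here na.rm) are passed on to FUN.
means <- aggregate(df[, 2:3], by = list(site = df$site), FUN = mean, na.rm = TRUE)
error <- aggregate(df[, 2:3], by = list(site = df$site), FUN = se, na.rm = TRUE)
print(error)
```

Dividing by sqrt(sum(!is.na(x))) rather than sqrt(length(x)) matters here: otherwise the missing bird count at site N would still be counted in the sample size.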
I am new to R. I have a data frame showing 8 trials per participant (in rows) for 4 different tasks/measures (in columns). I would like to remove outliers* (per participant per task) and convert them to NAs while keeping pre-existing NAs.
The code I am using is below. It throws out the pre-existing NAs (i.e., the NAs that exist within the raw data frame), with the additional result that I cannot get a data frame back (it won't accept as.data.frame), I think because of unequal sizes. I presume the problem is the remove_outliers function, but:
(1) I thought that when the action on the NAs was within a function, it was just stating how to deal with NAs during the function application only, and
(2) I have tried to change the function with variations on na.rm = FALSE throughout, but that won't run.
Any help much appreciated.
fname = "VSA perceptual controls_right.csv"
ctrl_vsa_trials = read.csv(fname, header = TRUE, stringsAsFactors = FALSE, na.strings = c(""))
remove_outliers = function(x, na.rm = TRUE, ...){
qnt = quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
H = 1.5 * IQR(x, na.rm = na.rm)
y = x
y[x < (qnt[1] - H)] = NA
y[x > (qnt[2] + H)] = NA
y
}
ctrl_vsa_trials_clean = aggregate(cbind(Pre_first,Post_first,Pre_adj,Post_adj) ~ Ppt, ctrl_vsa_trials, remove_outliers, na.action = NULL)
* This is due to issues I had with the measuring device; I feel it is justified!
I am not sure I understand exactly what you need, but if you are trying to replace the columns with cleaned columns, you can try this:
ctrl_vsa_trials_clean <- ctrl_vsa_trials
cols <- c("Pre_first", "Post_first", "Pre_adj", "Post_adj")
ctrl_vsa_trials_clean[, cols] <- apply(ctrl_vsa_trials_clean[, cols], 2,
remove_outliers)
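Note that the snippet above cleans each column across all rows at once. If the outliers should instead be judged per participant, as the question describes, one possible approach is ave(), which applies a function within groups and returns a vector of the original length. The sketch below uses made-up toy data (only the column names come from the question) and wraps the comparisons in which() so that pre-existing NAs never enter the index.

```r
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[which(x < (qnt[1] - H))] <- NA  # which() drops NA comparisons
  y[which(x > (qnt[2] + H))] <- NA
  y
}

# Toy data (assumption): two participants, eight trials each.
toy <- data.frame(
  Ppt        = rep(c("A", "B"), each = 8),
  Pre_first  = c(10, 11, 12, 11, 10, 12, 11, 100, 5, NA, 6, 5, 6, 5, 6, 5),
  Post_first = c(20, 21, 22, 23, 24, 25, 26, 27, 4, 5, 4, 5, 4, 5, 4, NA)
)

cols <- c("Pre_first", "Post_first")
clean <- toy
# ave() runs remove_outliers within each participant and keeps the row order.
clean[cols] <- lapply(toy[cols], function(v) ave(v, toy$Ppt, FUN = remove_outliers))
print(clean)
```

Because ave() returns one value per input row, the result stays a data frame of the same shape, and the pre-existing NAs remain where they were.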
I'm currently working with data in a .csv file. I've got an effective script to plot my data already, but I'm stumped by what seems to be the simplest task. I'm trying to write a script that takes my data (arranged in columns), calculates the mean of each column, and writes it to a new document (./testAVG).
Also, I'm trying to take the same data, calculate the SD (by column) and append that data to the end of the original document (preferably in a repeat for the total number of rows of data I have).
Here's the script I have so far:
#Number of lines with data
Nlines = 5
#Number of lines to skip
Nskip = 0
chem <- read.table("./test.csv", skip=Nskip, sep=",", col.names = c("Sample", "SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O", "Total", "eSiO2", "eAl2O3", "eFeO", "eMgO", "eCaO", "eNa2O", "eK2O"), fill=TRUE, header = TRUE, nrow=Nlines)
sd1 <- sd(chem$SiO2)
sd2 <- sd(chem$Al2O3)
sd3 <- sd(chem$FeO)
sd4 <- sd(chem$MgO)
sd5 <- sd(chem$CaO)
sd6 <- sd(chem$Na2O)
sd7 <- sd(chem$K2O)
avg1 <- colMeans(chem$SiO2, na.rm = FALSE, dims=1)
avg2 <- colMeans(chem$Al2O3, na.rm = FALSE, dims=1)
avg3 <- colMeans(chem$FeO, na.rm = FALSE, dims=1)
avg4 <- colMeans(chem$MgO, na.rm = FALSE, dims=1)
avg5 <- colMeans(chem$CaO, na.rm = FALSE, dims=1)
avg6 <- colMeans(chem$Na2O, na.rm = FALSE, dims=1)
avg7 <- colMeans(chem$K2O, na.rm = FALSE, dims=1)
write <- write.table(sd1,sd2,sd3,sd4,sd5,sd6,sd7, file="./test.csv", append=TRUE, sep=",", dec=".", col.names = c("eSiO2", "eAl2O3", "eFeO", "eMgO", "eCaO", "eNa2O", "eK2O"))
write <- write.table(avg1, avg2, avg3, avg4, avg5, avg6, avg7, file="./testAVG.csv", append=FALSE, sep=",", dec=".", col.names = c("Sample", "SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O", "Total"))
The data I'm working with is this
Sample, SiO2, Al2O3, FeO, MgO, CaO, Na2O, K2O, Total,eSiO2,eAl2O3,eFeO,eMgO,eCaO,eNa2O,eK2O
01,65.01,14.77,0.34,1.31,17.27,1.14,0.2,100,,,,,,,
02,72.6,16.27,0.53,0.06,1.27,5.55,3.71,100,,,,,,,
03,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100,,,,,,,
04,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100,,,,,,,
I get this error:
Error in colMeans(chem$SiO2, na.rm = FALSE, dims = 1) :
'x' must be an array of at least two dimensions
Any advice? Thanks
The comments already hint at how to do it, but it seems that you are rather new to R, so let me explicitly show you how you could do it better, using the mtcars dataset:
df <- mtcars
df_sd <- apply(df, 2, sd) # this is how to use apply. See ?apply
df_avg <- colMeans(df) # this is how to use colMeans. See ?colMeans
write.table(df_sd, file="test.csv") # no assignment necessary.
write.table(df_avg, file="testAVG.csv") # writing the file is a desired side effect...
Moreover, please consider the following line:
avg1 <- colMeans(chem$SiO2, na.rm = FALSE, dims=1)
The cool thing about colMeans is that it computes the columnwise means for many columns at once. Here, you are supplying only one vector, namely chem$SiO2. If this is really what you want to do, you would just write
avg1 <- mean(chem$SiO2)
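Applied to the data from the question, the same idea might look like the sketch below. It makes a few assumptions: only the first nine columns of the sample data are reproduced inline, the object names (oxides, chem_avg, chem_sd, avg_file) are illustrative, and the output goes to a temporary directory so that the original file is not overwritten.

```r
# Sample rows from the question, read from inline text instead of ./test.csv.
chem <- read.csv(text = "Sample,SiO2,Al2O3,FeO,MgO,CaO,Na2O,K2O,Total
01,65.01,14.77,0.34,1.31,17.27,1.14,0.2,100
02,72.6,16.27,0.53,0.06,1.27,5.55,3.71,100
03,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100
04,64.95,14.65,0.18,1.29,17.48,1.21,0.23,100")

oxides <- c("SiO2", "Al2O3", "FeO", "MgO", "CaO", "Na2O", "K2O")

chem_avg <- colMeans(chem[oxides])    # one mean per oxide column
chem_sd  <- sapply(chem[oxides], sd)  # one standard deviation per oxide column

# t() turns the named vector into a single row so the names become the header.
avg_file <- file.path(tempdir(), "testAVG.csv")
write.csv(t(chem_avg), avg_file, row.names = FALSE)
print(chem_sd)
```

Selecting the columns by name keeps Sample and Total out of the calculation, and a single call replaces the seven separate sd() and colMeans() lines.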
I have about 30 lines of code that do just this (getting Z scores):
data$z_col1 <- (data$col1 - mean(data$col1, na.rm = TRUE)) / sd(data$col1, na.rm = TRUE)
data$z_col2 <- (data$col2 - mean(data$col2, na.rm = TRUE)) / sd(data$col2, na.rm = TRUE)
data$z_col3 <- (data$col3 - mean(data$col3, na.rm = TRUE)) / sd(data$col3, na.rm = TRUE)
data$z_col4 <- (data$col4 - mean(data$col4, na.rm = TRUE)) / sd(data$col4, na.rm = TRUE)
data$z_col5 <- (data$col5 - mean(data$col5, na.rm = TRUE)) / sd(data$col5, na.rm = TRUE)
Is there some way, maybe using apply() or something, that I can just essentially do (python):
for col in ['col1', 'col2', 'col3']:
data{col} = ... z score code here
Thanks R friends.
A data.frame is a list, thus you can use lapply. Don't use apply on a data.frame as this will coerce to a matrix.
lapply(data, function(x) (x - mean(x,na.rm = TRUE))/sd(x, na.rm = TRUE))
Or you could use scale, which performs this calculation for you (given a vector, it returns a one-column matrix).
lapply(data, scale)
You can translate the Python-style approach directly:
for(col in names(data)){
data[[col]] <- scale(data[[col]])
}
Note that this approach is not memory efficient in R as [[<.data.frame copies the entire data.frame each time.
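To keep the original columns and add the z scores under new names, as the question's z_col1, z_col2, ... suggest, the lapply result can be assigned to several new columns in one step. This is a sketch on made-up data; z_score is a hypothetical helper name.

```r
# Hypothetical helper: z score of a numeric vector, ignoring NAs.
z_score <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Toy data (assumption) with the column names used in the question.
data <- data.frame(col1 = c(1, 2, 3, NA),
                   col2 = c(10, 20, 30, 40),
                   col3 = c(5, 5, 6, 8))

cols <- c("col1", "col2", "col3")
# One assignment creates z_col1, z_col2 and z_col3 next to the originals.
data[paste0("z_", cols)] <- lapply(data[cols], z_score)
print(data)
```

Because the assignment happens once for all columns, it also avoids the repeated copying of the for loop.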
I think you're right, apply() may be the way to go here.
For example:
data <- array(1:20, dim=c(4, 5))
data.zscores <- apply(data, 2, function(x)
(x-mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE))
The function apply() takes a matrix or array as its first argument. The "2" refers to the dimension the function is iterated over, which in our case is columns. If we wanted to do it by row, we'd go with "1". Lastly, we have the function we want to apply to each column. See ?apply for more details.
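One detail worth knowing when switching from "2" to "1": if the function returns a vector, apply() stores each result as a column, so the row-wise output comes back transposed. A short sketch, where z is just an illustrative name for the z-score function:

```r
data <- array(1:20, dim = c(4, 5))
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

col_z <- apply(data, 2, z)     # by column: result is 4 x 5, same shape as data
row_z <- t(apply(data, 1, z))  # by row: apply() returns 5 x 4, so transpose back

print(dim(col_z))
print(dim(row_z))
```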
Check this out: I iterate over the columns of the data frame and print, for each one, the number of rows that contain NA.
for(i in names(houseDF)){
print(i)
print(nrow(houseDF[is.na(houseDF[i]),]))
print("---------------------")
}
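If only the counts are needed, the loop can probably be replaced by a single vectorised call. In this sketch houseDF is made-up toy data, since the real data frame is not shown.

```r
# Toy stand-in for houseDF (assumption): two columns contain NAs, one does not.
houseDF <- data.frame(price = c(100, NA, 300, NA),
                      rooms = c(3, 4, NA, 2),
                      city  = c("A", "B", "C", "D"))

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(houseDF))
print(na_counts)
```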