undefined columns selected and cannot xtfrm data frame error - r

I am trying to write code that checks for outliers based on the IQR and changes those values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
IDX <- l[i]
BP <- boxplot.stats(df[IDX])
OutIDX <- which(df[IDX] %in% BP$out)
if (vector.is.empty(OutIDX)==FALSE){
for (u in 1:length(OutIDX)){
IDX2 <- OutIDX[u]
df[IDX2,IDX] <- NA
}
}
}
So, when I run this code, I get these error messages:
I've searched online for good answers, but I'm not sure why R claims that the column is undefined. Any clues?

I would do something like this to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
for (i in 1:ncol(df)){
l[[i]] <- boxplot.stats(df[,i])
# Use %in% rather than ==: the outlier vector is usually shorter than the column
df[df[,i] %in% l[[i]]$out, i] <- NA
}
# Which values have been replaced?
lapply(l, function(x) x$out)
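As for the error in the question: df[IDX] with single brackets returns a one-column data frame rather than a numeric vector, and boxplot.stats() needs to sort its input, which is presumably where the "undefined columns selected" and "cannot xtfrm data frame" messages come from. A minimal sketch of the original loop restricted to the chosen columns, assuming that double brackets (df[[idx]]) are acceptable; the names cols, idx and out are only illustrative:

```r
set.seed(42)
df <- data.frame(dt = rnorm(200), dg = rnorm(200), dh = rnorm(200))
cols <- c(1, 3)  # columns to check for outliers

for (idx in cols) {
  # df[[idx]] extracts the column as a plain numeric vector
  out <- boxplot.stats(df[[idx]])$out
  # Logical indexing is safe even when 'out' is empty
  df[df[[idx]] %in% out, idx] <- NA
}

colSums(is.na(df))  # number of values replaced per column
```

Column 2 (dg) is left untouched because it is not listed in cols.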

Related

How to create new variable at the end of each loop iteration in R

I am trying to create a variable that is a function of 4 other variables. I have the following code:
set.seed(123)
iter <- 1000
group <- c('A','B','C','D','E','F')
for (i in group) {
df <- df1[df1$group == i,]
x_ <- vector(mode="numeric", length=1000)
assign(eval(paste0("X_", i)), globalenv()) #This is the issue
a <- rnorm(iter, mean=df$a, sd=df$sea)
b <- rnorm(iter, mean=df$b, sd=df$seb)
c <- rnorm(iter, mean=df$c, sd=df$sec)
z <- rnorm(iter, mean=df$zbar, sd=df$se_z)
X_[i] = (a + c*(z-df$zbar))/(-b)
}
I am unable to create a unique group-specific variable (e.g. X_A, X_B, ...) and I am unsure why the assign() function is not working properly. The dataframe df1 has 6 rows (one for each group), and its columns are the variables plus a string variable for group. I am not trying to append the new variables X_[i] to the dataset; I am just trying to place them in the global environment. I believe the issue lies in how I assign the variable, since the code isn't generating a numeric variable X.
df1 is a dataframe with 6 observations of 9 variables containing a, sea, b, seb, c, sec, zbar, se_z. These are just the means and standard deviations of a, b, c, and z, respectively. The 9th variable is group which contains A, B, ..., F. When I use the code df <-df1[df1$group == i,] I am trying to create a unique X variable for each group entity.
Try something like this:
dynamicVariableName <- paste0("X_", i)
assign(dynamicVariableName, (a + c*(z-df$zbar))/(-b))
As an alternative to the answer from @ErrorJordan, you can write your loop like this:
set.seed(123)
iter <- 1000
group <- c('A','B','C','D','E','F')
for(i in group)
{
df <- df1[df1$group == i,]
a <- rnorm(iter, mean=df$a, sd=df$sea)
b <- rnorm(iter, mean=df$b, sd=df$seb)
c <- rnorm(iter, mean=df$c, sd=df$sec)
z <- rnorm(iter, mean=df$zbar, sd=df$se_z)
X <- (a + c*(z-df$zbar))/(-b)
assign(paste0("X_",i),X,.GlobalEnv)
}
As suggested by @MrFlick, you can also store your data in a list. To do so, just modify your loop like this:
set.seed(123)
iter <- 1000
group <- c('A','B','C','D','E','F')
X = vector("list",length(group))
names(X) = group
for(i in 1:length(group))
{
df <- df1[df1$group == group[i],]
a <- rnorm(iter, mean=df$a, sd=df$sea)
b <- rnorm(iter, mean=df$b, sd=df$seb)
c <- rnorm(iter, mean=df$c, sd=df$sec)
z <- rnorm(iter, mean=df$zbar, sd=df$se_z)
X[[i]] <- (a + c*(z-df$zbar))/(-b)
}
Example df1 dataframe:
df1 = data.frame(a = c(1:6),
b = c(1:6),
c = c(1:6),
zbar = c(1:6),
sea = rep(1,6),
seb = rep(1,6),
sec = rep(1,6),
se_z = rep(1,6),
group = group)
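The list approach can also be written without an explicit loop. This is only a sketch under the example df1 above; simulate_x is a hypothetical helper name, not something from the question:

```r
set.seed(123)
iter <- 1000
group <- c("A", "B", "C", "D", "E", "F")
df1 <- data.frame(a = 1:6, b = 1:6, c = 1:6, zbar = 1:6,
                  sea = 1, seb = 1, sec = 1, se_z = 1,
                  group = group)

# Hypothetical helper: simulate X for one row (one group) of df1
simulate_x <- function(row, n) {
  a <- rnorm(n, mean = row$a, sd = row$sea)
  b <- rnorm(n, mean = row$b, sd = row$seb)
  c <- rnorm(n, mean = row$c, sd = row$sec)
  z <- rnorm(n, mean = row$zbar, sd = row$se_z)
  (a + c * (z - row$zbar)) / (-b)
}

# split() gives one single-row data frame per group, named by group
X <- lapply(split(df1, df1$group), simulate_x, n = iter)
str(X$A)
```

Each element is then available by name, e.g. X$A or X[["B"]], without creating separate global variables.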
It's a little hard to parse what you want to do, but I'm assuming it's something like
for each value in group make an object (in the global env) called X_A, X_B, ...
for each one of those objects, assign it the value (a + c*(z-df$zbar))/(-b)
I think this should do that for you:
set.seed(123)
iter <- 1000 # number of draws per group, used by rnorm() below
group <- c('A','B','C','D','E','F')
for (i in group) {
df <- df1[df1$group == i,]
a <- rnorm(iter, mean=df$a, sd=df$sea)
b <- rnorm(iter, mean=df$b, sd=df$seb)
c <- rnorm(iter, mean=df$c, sd=df$sec)
z <- rnorm(iter, mean=df$zbar, sd=df$se_z)
assign(paste0("X_", i), (a + c*(z-df$zbar))/(-b), globalenv())
}
Note that in the code example you gave, the command x_ <- vector(mode="numeric", length=1000) has no effect. By that I mean, you make that object, but never subsequently use it in any further computation. (iter, by contrast, is used as the number of draws in each rnorm() call, so it must be defined before the loop.) If that command should do something meaningful I'll need your help in explaining its intended purpose.

For loop function is looping too many times

I am calculating a community-weighted mean of functional trait values (I study forestry). I have to multiply the relative abundance of each species (tree) by its trait values. I have 2 dataframes: one with the relative abundances of each species within each site and one with the average trait values for each species. I made a loop to automate the calculation, but the end results return the multiplication 13 times instead of once (I have 13 plots, so maybe it has something to do with this). I have been working on this script for several days since I'm new to R, and I need it for my master's thesis. I think I reached my limit of logical thinking today and can't find my error :) Can someone help me please? I'll paste the script below:
# Load data, apply some column names, fill NAs with 0
library(data.table)
traits <- read.csv("Trait value.csv", sep = ";")
plots_Maiz <- read.csv("CWM Maiz plot.csv", sep = ";")
plots_Maiz[is.na(plots_Maiz)] <- 0
colnames(plots_Maiz) <- c("site", "species","y0","y1", "y2", "y3", "y4", "y5")
traits[,1:17][is.na(traits[,1:17])] <- 0
#function for finding the corresponding species for a plot in the traitlist
traitsf <- function(df, traitlist){
plottraits <- subset(traitlist, species %in% df[,2])
return(plottraits)
}
traitcalc <- function(traits, plots_Maiz){
multlist <- list()
blist <- list()
vmult <- vector()
tickcount <- 0
plotsplit <- split.data.frame(plots_Maiz, plots_Maiz$site)
testlist <- lapply(plotsplit, traitsf, traitlist = traits)
for (q in 1:length(plotsplit)){
df1 <- testlist[[q]]
df2 <- plotsplit[[q]]
plot <- as.character(plotsplit[[q]][1,1])
for (i in 1:nrow(df1)){
v <- as.numeric(as.vector(t(df1[i,2:ncol(df1)])))
species <- as.character(df1[i,1])
for (j in 1:(ncol(df2)-2)){
tickcount <- tickcount + 1
vmult <-as.vector(v * (as.numeric(as.vector(df2[i,j+2]))))
vmult <- as.list(c(vmult, j-1, species, plot))
multlist[[tickcount]] <- vmult
}
}
b <- do.call(rbind, multlist)
b <- data.table::rbindlist(multlist)
blist[[q]] <- b
}
return(blist)
}
endresults <- traitcalc(traits,plots_Maiz)
endresultsdf2<- do.call("rbind", endresults)
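Judging only from the code shown (the CSV files are not available), one likely cause is that multlist and tickcount are created once, before the loop over plots, so each blist[[q]] also contains the rows of all earlier plots. The toy sketch below uses made-up data and a hypothetical accumulate() helper to show the effect, and the fix of resetting the accumulator at the start of each plot:

```r
# Toy data: three "plots" with two values each (hypothetical, not the real files)
groups <- list(p1 = 1:2, p2 = 3:4, p3 = 5:6)

accumulate <- function(groups, reset) {
  rows <- list()
  n <- 0
  out <- list()
  for (q in seq_along(groups)) {
    if (reset) {
      # Start with an empty accumulator for every plot
      rows <- list()
      n <- 0
    }
    for (v in groups[[q]]) {
      n <- n + 1
      rows[[n]] <- data.frame(plot = names(groups)[q], value = v)
    }
    out[[q]] <- do.call(rbind, rows)
  }
  do.call(rbind, out)
}

nrow(accumulate(groups, reset = FALSE))  # 12: earlier plots are repeated
nrow(accumulate(groups, reset = TRUE))   # 6: one row per value
```

In traitcalc() the equivalent change would be to move multlist <- list() and tickcount <- 0 inside the for (q in ...) loop, assuming the duplication really does come from this accumulation.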

Simpler method to insert dataframe variable and name when creating many dataframes from raster type

Is there a simpler way to designate new dataframe rows and rownames in the creation of a data frame from raster data?
rastA <- raster("rasterA.txt")
rastB <- raster("rasterB.txt")
rastC <- raster("rasterC.txt")
rastD <- raster("rasterD.txt")
rastE <- raster("rasterE.txt")
dfA <- as.data.frame(rastA)
dfB <- as.data.frame(rastB)
dfC <- as.data.frame(rastC)
dfD <- as.data.frame(rastD)
dfE <- as.data.frame(rastE)
# Renaming column in dataframe
names(dfA)[1] <- 'values'
names(dfB)[1] <- 'values'
names(dfC)[1] <- 'values'
names(dfD)[1] <- 'values'
names(dfE)[1] <- 'values'
# Adding new column with classifier 'X'
dfA$type <- 'X'
dfB$type <- 'X'
dfC$type <- 'X'
dfD$type <- 'X'
dfE$type <- 'X'
df_AB <- rbind.data.frame(dfA, dfB)
df_AC <- rbind.data.frame(dfA, dfC)
df_AD <- rbind.data.frame(dfA, dfD)
The final combined data frames are fed into ggplot to generate various histogram and density plots. This line-by-line method is easy enough, but I am wondering what efficiencies can be gained by using different methods.
Here is an approach that simplifies part of this
f <- system.file("external/test.grd", package="raster")
fls <- c(f, f, f, f, f)
s <- stack(fls) * 1:5
names(s) <- LETTERS[1:5]
df <- as.data.frame(s)
df <- na.omit(df)
I would expect that for most plots, df is what you want to use, and that you would not need to create all these separate objects. However, if that is what you want, perhaps do
x <- reshape(df, varying=colnames(df), v.names='values', timevar='group', times=colnames(df), direction='long', new.row.names=NULL)
# see http://www.ats.ucla.edu/stat/r/faq/reshape.htm
rownames(x) <- NULL
x$id <- NULL
x$type <- 'X'
df_AB <- x[x$group %in% c('A', 'B'), ]
# etc
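If the separate objects are still wanted, the repeated renaming and labelling steps from the question can be applied to a named list instead. This is a rough sketch only: the small data frames stand in for as.data.frame(raster(...)) so that it runs without the raster package, and the names dfs and pairs are made up for illustration:

```r
# Stand-ins for the data frames obtained from the rasters (hypothetical values)
dfs <- list(A = data.frame(layer = 1:3),
            B = data.frame(layer = 4:6),
            C = data.frame(layer = 7:9))

# Rename the first column and add the classifier in one pass
dfs <- lapply(dfs, function(d) {
  names(d)[1] <- "values"
  d$type <- "X"
  d
})

# Combine A with each of the other data frames
pairs <- lapply(dfs[-1], function(d) rbind(dfs$A, d))
names(pairs) <- paste0("df_A", names(dfs)[-1])

names(pairs)
```

The combined data frames are then available as pairs$df_AB, pairs$df_AC, and so on.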

creating a function for processing my dataframe calculations

I am doing systematic calculations on a dataframe I created. I have the code for the calculations, but I would like to:
1) Write it as a function and call it on the dataframe I created.
2) Reset the calculations for the next ID in the dataframe.
I would appreciate your help and advice on this.
The dataframe is created in R using the following code:
#Create a dataframe
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
The other thing is that the previous code assumes subject ID=1 for all time points. Suppose the subject ID changes to 2 when WT (weight) changes to 120. How can I reset the calculations and automate them for all subject IDs in the dataframe? In this case the original dataframe would be like this:
#code:
rm(list=ls(all=TRUE))
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
df$ID[(df$WT>=120)==T] <- 2
df$TIME[df$ID==2] <- c(seq(0,20,1))
Thank you in advance!
In general, when doing calculations on different subjects' data, I like to split the dataframe by ID, pass the list of per-subject data frames into a for loop, do all the calculations, store each result in a list, and then collapse that list and return a dataframe with all the numbers you want. This allows for a lot of control over what you do for each subject.
subjects = split(df, df$ID)
forResults = vector("list", length=length(subjects))
# initialize these constants
C <- 2
V <- 10
k <- C/V
myFunc = function(data, resultsArray){
# loop over the list passed in as 'data'; the index is 's' so it does not overwrite the rate constant k
for(s in seq_along(data)){
df = data[[s]]
df$A1 = 100 # I assume this should be 100 for t=0 for each subject?
# you could vectorize this nested for loop..
for(i in 2:nrow(df)) {
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
# you can add all sorts of other calculations you want to do on each subject's data
# when you're done doing calculations, put the resultant into
# the resultsArray and we'll rebuild the dataframe with all the new variables
resultsArray[[s]] = df
# if you're not using RStudio, then you want to use dev.new() to instantiate a new plot canvas
# dev.new() # dont need this if you're using RStudio (which doesnt allow multiple plots open)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
# collapse the results vector into a dataframe
resultsDF = do.call(rbind, resultsArray)
return(resultsDF)
}
results = myFunc(subjects, forResults)
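A more compact variant of the same split, calculate, recombine idea is sketched below. It assumes that each subject's first row carries the starting dose (so A1 starts at AMT) and uses a small made-up data frame; calc_a1 is a hypothetical function name:

```r
# Hypothetical helper: fill A1 for one subject's rows, ordered by TIME
calc_a1 <- function(d, k) {
  d$A1[1] <- d$AMT[1]  # assumption: the first row holds the initial dose
  for (i in seq_len(nrow(d))[-1]) {
    dt <- d$TIME[i] - d$TIME[i - 1]
    d$A1[i] <- d$AMT[i] + d$A1[i - 1] * exp(-dt * k)
  }
  d
}

# Small made-up example with two subjects
df <- data.frame(ID   = rep(1:2, each = 4),
                 TIME = rep(c(0, 1, 2, 3), times = 2),
                 AMT  = rep(c(100, 0, 100, 0), times = 2),
                 A1   = NA_real_)
k <- 2 / 10  # C / V as in the question

# Apply the function to each subject and stack the results
res <- do.call(rbind, lapply(split(df, df$ID), calc_a1, k = k))
res
```

Because the calculation restarts inside calc_a1() for every piece returned by split(), the amount is reset automatically whenever the ID changes.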
Do you want this:
ddf <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
myfn = function(df){
dosetimes <- c(0,6,12,18)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
myfn(ddf)
For multiple calls (where N is the number of subject IDs):
for(i in 1:N) {
myfn(ddf[ddf$ID==i,])
readline(prompt="Press <Enter> to continue...")
}

return identical DF or vector instead of NULL

users,
I have data.frames which are NULL in my results, but I don't want them to be NULL. I want them to be the same as at the beginning (unchanged). I'm working on a list of files, and the aim of my code is to fill all the NAs with data from my other data.frames (chosen by the best correlation coefficient). Here's a small example:
Imagine these are my 3 input data frames (10 rows each):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,NA,NA,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
The aim is, for example, that if there are NAs in ST1, they are filled with data from the file best correlated with ST1 (either ST2 or ST3 in this example).
As ST3 has no data here, no correlation coefficient can be computed for it. So the NAs in ST3 cannot be filled, and ST3 cannot be used to fill another file either. ST3 is effectively useless, but I still want to keep it unchanged throughout my code.
So the problem in my code comes from data.frames that contain only NAs.
At the moment my code gives this for "refill" (the end of my code, with the NAs in my data.frames filled):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- NULL
But actually, I want for results in "refill" this:
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
So for data.frames with only NAs, I don't want them to be NULL in "refill"; I want them to be identical to the input. I need this so that the input and output data.frames have the same dimensions.
If they are NULL (as they are at the moment, though I don't understand why), the data.frame has 0 rows instead of 10 rows like the other data.frames.
So I think there's something wrong in my code in the function "process.all" or "na.fill", or maybe in "lst".
Here's my code; it is a reproducible example to help you understand my error (you'll see in head(refill) that ST2 is set to NULL).
Sorry if it is a bit long, but my error depends on other functions used earlier. I hope you've understood my problem and what I'm trying to do. Thanks for your help!
(For information, in function "process.all" and "na.fill": x is the data.frame I want to fill, and y is the file which will be used to fill x (so the best correlated file with x)).
Geoffrey
# my data for example
DF1 <- data.frame(x1=c(NA,NA,rnorm(3:20)),x2=c(31:50))
write.table(DF1,"ST001_2008.csv",sep=";")
DF2 <- data.frame(x1=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,rnorm(1:10)),x2=c(1:20))
write.table(DF2,"ST002_2008.csv",sep=";")
DF3 <- data.frame(x1=rnorm(81:100),x2=NA)
write.table(DF3,"ST003_2008.csv",sep=";")
DF4 <- data.frame(x1=c(21:40),x2=rnorm(1:20))
write.table(DF4,"ST004_2008.csv",sep=";")
# Correlation table
corhiver2008capt1 <- read.table(text=" ST001 ST002 ST003 ST004
ST001 1.0000000 NA -0.4350665 0.3393549
ST002 NA NA NA NA
ST003 -0.4350665 NA 1.0000000 -0.4992513
ST004 0.3393549 NA -0.4992513 1.0000000",header=T)
lst <- lapply(list.files(pattern="\\_2008.csv$"), read.table,sep=";", header=TRUE, stringsAsFactors=FALSE)
Stations <-c("ST001","ST002","ST003","ST004")
names(lst) <- Stations
# searching the highest correlation for each data.Frame
get.max.cor <- function(station, mat){
mat[row(mat) == col(mat)] <- -Inf
m <- max(mat[station, ],na.rm=TRUE)
if (is.finite(m)) {return(which( mat[station, ] == m ))}
else {return(NA)}
}
# fill the data.frame with the data.frame which has the highest correlation coefficient
na.fill <- function(x, y){
if(all(!is.finite(y[1:10,1]))) return(y)
i <- is.na(x[1:10,1])
xx <- y[1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude),new)[i]
x
}
process.all <- function(df.list, mat){
f <- function(station)
na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
g <- function(station){
x <- df.list[[station]]
if(any(!is.finite(x[1:10,1]))){
mat[row(mat) == col(mat)] <- -Inf
nas <- which(is.na(x[1:10,1]))
ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
for(y in ord){
if(all(!is.na(df.list[[y]][1:10,1][nas]))){
xx <- df.list[[y]][1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][nas] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[nas]
break
}
}
}
x
}
n <- length(df.list)
nms <- names(df.list)
max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
df.list <- lapply(seq.int(n), f)
df.list <- lapply(seq.int(n), g)
names(df.list) <- nms
df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill) ########## HERE IS THE PROBLEM ######
refill
How about
if(sum(!is.na(ST3)) == 0) {
# skip whatever you normally would do and go to the next vector
}
This assumes, of course, that you don't have any problems with, say, a vector of 1999 NAs and one numerical value.
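To make that check concrete: in the question's na.fill(), the early exit is return(y), and for a station whose best correlation is NA the lookup df.list[[NA]] yields NULL, which is probably why NULL ends up in the result. Returning x instead keeps the input unchanged. The sketch below is an assumption-laden illustration on the small ST1/ST2/ST3 example, with a hypothetical helper fill_or_keep() and a plain linear model as in the question:

```r
# Hypothetical helper: fill NAs in x from donor y, or return x untouched
fill_or_keep <- function(x, y) {
  # No donor, or nothing to fit on either side: keep x exactly as it is
  if (is.null(y) || all(is.na(x[[1]])) || all(is.na(y[[1]]))) return(x)
  miss <- is.na(x[[1]])
  fit <- lm(x1 ~ y1, data = data.frame(x1 = x[[1]], y1 = y[[1]]))
  x[[1]][miss] <- predict(fit, newdata = data.frame(y1 = y[[1]]))[miss]
  x
}

ST1 <- data.frame(x1 = 1:10)
ST2 <- data.frame(x2 = c(1:5, NA, NA, 8:10))
ST3 <- data.frame(x3 = rep(NA, 10))

fill_or_keep(ST2, ST1)   # NAs in rows 6 and 7 are predicted from ST1
fill_or_keep(ST3, ST1)   # all-NA input comes back with its 10 rows
fill_or_keep(ST3, NULL)  # a missing donor also leaves the input unchanged
```

With this guard every data.frame keeps its original number of rows, so combining the results afterwards no longer meets a NULL element.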
