What is wrong with this loop?

I can't understand what is wrong with this R code. I have several rows and columns with either a measurement or NA, and I basically want to get the min and max value in each row, looking across the several columns:
require(plyr)
census <- read.csv("sps_census.csv")
info <- read.csv("sps_info.csv")
for (i in 1:nrow(census)) {
  trans <- census[i, c("dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8","dbh9")]
  index.1 <- which(trans != "NA") # some NAs are in the data
  census$min.dbh <- min(trans[1, index.1])
  census$min.dbh.index <- min(index.1)
  census$max.dbh <- max(trans[1, index.1])
  census$max.dbh.index <- max(index.1)
}

In this line (and the other three similar lines):
census$min.dbh <- min(trans[1,index.1])
you are assigning an entire column to a single value on every iteration, which is clearly not what you intend.
Perhaps you want something like this:
census$min.dbh[i] <- min(trans[1,index.1])
Note that you can use apply to do this sort of thing. It would be a lot easier for someone to write a working apply example, if you supply example data (i.e., make your question a reproducible example).
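For completeness, here is a hedged sketch of what the corrected loop could look like (column names as in the question; the new columns are pre-initialised so that element-wise assignment works on the first pass):
dbh.cols <- paste("dbh", 1:9, sep = "")
census$min.dbh <- census$min.dbh.index <- NA
census$max.dbh <- census$max.dbh.index <- NA
for (i in 1:nrow(census)) {
  trans <- census[i, dbh.cols]
  index.1 <- which(!is.na(trans))  # is.na() is the cleaner test for missing values
  census$min.dbh[i] <- min(trans[1, index.1])
  census$min.dbh.index[i] <- min(index.1)
  census$max.dbh[i] <- max(trans[1, index.1])
  census$max.dbh.index[i] <- max(index.1)
}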

You can use apply:
index <- c"dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8", "dbh9") #or paste("dbh",1:9,sep="")
census$min.dbh <- apply(census[index], 1, min, na.rm=T)
census$min.dbh.index <- apply(census[index], 1, function(x){ min(which(!is.na(x))) })
census$max.dbh <- apply(census[index], 1, max, na.rm=T)
census$max.dbh.index <- apply(census[index], 1, function(x){ max(which(!is.na(x))) })
Note that I'm using is.na(x) instead of x != "NA".
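A small, self-contained illustration of the apply() approach, on hypothetical data with only three dbh columns:
census <- data.frame(dbh1 = c(10, NA), dbh2 = c(NA, 7), dbh3 = c(12, 9))
index <- paste("dbh", 1:3, sep = "")
census$min.dbh <- apply(census[index], 1, min, na.rm = TRUE)
census$max.dbh <- apply(census[index], 1, max, na.rm = TRUE)
census$min.dbh.index <- apply(census[index], 1, function(x){ min(which(!is.na(x))) })
census
#   dbh1 dbh2 dbh3 min.dbh max.dbh min.dbh.index
# 1   10   NA   12      10      12             1
# 2   NA    7    9       7       9             2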


short ifelse for specific use case, set vector elements to NA

I use ifelse to sanitise data in a real-world database that is subject to typing errors.
Let's say I want to sanitise a variable X which I know can't be above 100 in real-world situations, so I just want to turn everything above 100 into NA values that won't be included in the analysis.
So I would do:
df$x <- ifelse(df$x > 100, NA, df$x)
This turns all values above 100 into NA and keeps the other ones.
This feels quite cumbersome and makes the code unreadable when I use the real variable names, which are quite long.
Is there any shorter way to do what I am trying to perform?
Thanks!
The simplest way I am aware of is with function is.na<-.
is.na(df$x) <- df$x > 100
Explanation.
Function is.na<- is a generic function defined in file
src/library/base/R/is.R as
`is.na<-` <- function(x, value) UseMethod("is.na<-")
One method is defined in the file, the default method.
`is.na<-.default` <- function(x, value)
{
    x[value] <- NA
    x
}
This is what S3's method dispatch mechanism calls in the answer's code line. An alternative way of calling it is the functional form.
`is.na<-`(df$x, df$x > 100)
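As a quick illustration on a hypothetical df:
df <- data.frame(x = c(50, 150, 99, 250))
is.na(df$x) <- df$x > 100
df$x
# [1] 50 NA 99 NA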
Use data.table
setDT(df)
df[x > 100, x := NA]
If the operation is to be applied for several columns,
# column.names is assumed to already hold the candidate column names;
# the next line keeps only those actually present in df
column.names <- names(df)[names(df) %in% column.names]
for(i.col in column.names){
  set(df, which(df[[i.col]] > 100), i.col, NA)
}
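For instance, with a hypothetical data.table dt and two columns to sanitise:
library(data.table)
dt <- data.table(x = c(50, 150, 99), y = c(200, 10, 30))
column.names <- c("x", "y")  # the columns to sanitise
for (i.col in column.names) {
  set(dt, which(dt[[i.col]] > 100), i.col, NA)
}
dt  # the values 150 and 200 have been replaced with NA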
Try this; I hope this answer helps.
df <- data.frame('X'=c(1,2,3,4,NA,100,101,102))
df$X <- as.numeric(df$X)
df$X <- ifelse((is.na(df$X) | df$X >100),NA,df$X)
You can use the column index instead of column names then.
col <- which(names(df) == 'x')
df[[col]] <- df[[col]] * c(1, NA)[(df[[col]] > 100) + 1]
Or
df[[col]] <- with(df, replace(df[[col]], df[[col]] > 100, NA))
So here you use column name only once.
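The c(1, NA)[... + 1] trick works because FALSE + 1 is 1 and TRUE + 1 is 2, so values over 100 pick the NA multiplier; a quick illustration on a hypothetical vector:
x <- c(50, 150, 99)
(x > 100) + 1                 # 1 2 1
c(1, NA)[(x > 100) + 1]       # 1 NA 1
x * c(1, NA)[(x > 100) + 1]   # 50 NA 99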

How to substitute negative values with a calculated value in an entire dataframe

I've got a huge data frame with many negative values in different columns that should be replaced with their original value * 0.5.
I've tried to apply many R functions but it seems I can't find a single function to work for the entire dataframe.
I would like something like the following (not working) piece of code:
mydf[] <- replace(mydf[], mydf[] < 0, mydf[]*0.5)
You can simply do,
mydf[mydf<0] <- mydf[mydf<0] * 0.5
If you have values that are non-numeric, then you may want to apply this to only the numeric ones,
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1<0] <- mydf1[mydf1<0] * 0.5
mydf[ind] <- mydf1
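A quick check on a small, hypothetical mixed data frame:
mydf <- data.frame(a = c(-2, 4), b = c(6, -8), label = c("p", "q"),
                   stringsAsFactors = FALSE)
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1 < 0] <- mydf1[mydf1 < 0] * 0.5
mydf[ind] <- mydf1
mydf
#    a  b label
# 1 -1  6     p
# 2  4 -4     q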
You could try using lapply() on the entire data frame, making the replacements on each column in succession.
df[] <- lapply(df, function(x) {
  ifelse(x < 0, x * 0.5, x)
})
The lapply(), or list apply, function is intended to be used on lists, but data frames are a special type of list, so this works here; assigning back with df[] <- keeps the result a data frame rather than a plain list.
In replace(), the values argument should be the same length as the number of TRUE values in the index ('list') argument:
replace(mydf, mydf <0, mydf[mydf <0]*0.5)
Or another option is set from data.table, which would be very efficient
library(data.table)
for(j in seq_along(mydf)){
  i1 <- mydf[[j]] < 0
  set(mydf, i = which(i1), j = j, value = mydf[[j]][i1] * 0.5)
}
data
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))
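As a quick sanity check (not part of the original answers), the replace() one-liner and the direct assignment give identical results on that data:
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))
res1 <- replace(mydf, mydf < 0, mydf[mydf < 0] * 0.5)
res2 <- mydf
res2[res2 < 0] <- res2[res2 < 0] * 0.5
all.equal(res1, res2)  # TRUE: both halve every negative value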

Inserting outliers to a dataframe

I am trying to create a function to inject outliers into an existing data frame.
I started by creating a new data frame outs using the max and min values of the original data frame. This outs data frame will contain a certain amount of outlier data.
Later I want to inject the outlier values from the outs data frame into the original data frame.
What I want to get is a function that injects a certain amount of outliers into an original data frame.
I have several problems: for example, I do not know if I am using runif correctly to create a data frame of outliers, and second, I do not know how to inject the outliers into temp.
The code I've tried until now is:
addOutlier <- function (data, amount){
  maxi <- apply(data, 2, function(x) (mean(x) + (3 * (sd(x)))))
  mini <- apply(data, 2, function(x) (mean(x) - (3 * (sd(x)))))
  temp <- data
  amount2 <- ifelse(amount < 1, (prod(dim(data)) * amount), amount)
  outs <- runif(amount2, 2, min = mini, max = maxi) # outliers
  if (amount2 >= prod(dim(data))) stop("exceeded data size")
  for (i in 1:length(outs))
    temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- outs
  return (temp)
}
Please, any help to make this work will be deeply appreciated.
My understanding is that what you're trying to achieve is adding a set number of outliers to each column in your data frame. Alternatively, you seem to also be looking into adding a % of outliers to each column. I wrote down a solution only for the former case, but the latter should be pretty easy to implement if you really need it (see the sketch after the example below). Note how I broke things down into two functions, to (hopefully) help clarify what is going on. Hope this helps!
add.outlier.to.vector <- function(vector, amount) {
  cells.to.modify <- sample(1:length(vector), amount, replace=F)
  mean.val <- mean(vector)
  sd.val <- sd(vector)
  min.val <- mean.val - 3 * sd.val
  max.val <- mean.val + 3 * sd.val
  vector[cells.to.modify] <- runif(amount, min=min.val, max=max.val)
  return(vector)
}
add.outlier.to.data.frame <- function (temp, amount){
  for (i in 1:ncol(temp)) {
    temp[,i] <- add.outlier.to.vector(temp[,i], amount)
  }
  return (temp)
}
data <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(7, 8, 9, 10)
)
add.outlier.to.data.frame(data, 2)
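The percentage variant mentioned above could look something like this (a sketch, assuming the fraction applies per column; add.outlier.fraction.to.data.frame is a hypothetical name):
add.outlier.fraction.to.data.frame <- function(temp, fraction) {
  # fraction is the share of cells per column to overwrite, e.g. 0.25 for 25%
  n.per.column <- max(1, round(fraction * nrow(temp)))
  for (i in 1:ncol(temp)) {
    temp[, i] <- add.outlier.to.vector(temp[, i], n.per.column)
  }
  return(temp)
}
add.outlier.fraction.to.data.frame(data, 0.5)  # overwrite half the cells in each column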

How can I do a t.test on an entire data.frame and extract the p-values?

My dataset looks something like this:
a <- rnorm(2)
b <- rnorm(2)-3
x <- rnorm(13)
y <- rnorm(2)-1
z <- rnorm(2)-2
eg <- expand.grid(a,b,x,y,z)
treatment <- c(rep(1, 2), rep(0,3))
eg <- data.frame(t(eg))
row.names(eg) <- NULL
eg <- cbind(treatment, eg)
What I need to do is run t-tests on each column, comparing the treatment =1 group to the treatment=0 group. I'd like to then have a vector of p-values. I've tried (several versions of) doing this through a loop, but I continue to receive the same error message: "undefined columns selected." Here's my code currently:
p.values <- c(rep(NA, 208))
for (i in 2:209) {
  x <- data.frame(eg[eg$treatment==1][,i][1:2])
  y <- data.frame(eg[eg$treatment==0][,i][3:5])
  value <- t.test(x=x, y=y)['p.value']
  p.values[i] <- value
}
I added the data.frame() after reading someone mention that for loops only loop through dataframes, but it didn't change anything. I am sure there is an easier way to do this, perhaps by using something in the apply family? Does anyone have any suggestions? Thanks so much!
A couple of options, both using sapply:
sapply(
eg[-1], function(x) t.test(x[eg$treatment==1],x[eg$treatment==0])[["p.value"]]
)
Or looping over the names instead:
sapply(
names(eg[-1]),
function(x) t.test(as.formula(paste(x,"~ treatment")),data=eg)[["p.value"]]
)
Or even mapply:
mapply(function(x,y) t.test(x ~ y,data=cbind(x,y))[["p.value"]], eg[-1], eg[1])
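As a quick illustration (the actual values depend on the random draws above), the first option returns a named vector with one p-value per non-treatment column, which you can then inspect or store:
p.values <- sapply(
  eg[-1],
  function(x) t.test(x[eg$treatment == 1], x[eg$treatment == 0])[["p.value"]]
)
length(p.values)  # one p-value per column of eg, excluding treatment
head(p.values)    # named by column, e.g. X1, X2, ...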

How can I use a frame in apply?

I really like using the data frame $ syntax in R. However, if I try to do this with apply, it gives me an error because the input is a vector, not a data frame (which is correct). Is there a function similar to mapply which will let me keep using that syntax?
df = data.frame(x = 1:5, y = 1:5)
# This works, but is hard to read because you have to remember what's
# in column 1
apply(df, 1, function(row) row[1])
# I'd rather do this, but it gives me an error
apply(df, 1, function(row) row$x)
You can't use $ on an atomic vector, but I guess you want to use it for readability. You can use the [ subsetter instead.
Here is an example. Please provide a reproducible example next time; questions in R generally make no sense without data.
set.seed(1234)
gidd <- data.frame(region = sample(letters[1:6], 100, rep = T),
                   wbregion = sample(letters[1:6], 100, rep = T),
                   foodshare = rnorm(100, 0, 1),
                   consincPPP05 = runif(100, 0, 5),
                   stringsAsFactors = F)
apply(gidd,   ## applying over every row of gidd here
      1,
      function(row) {
        similarRows = gidd[gidd$wbregion == row['region'] &
                             gidd$consincPPP05 > .8 * as.numeric(row['consincPPP05']), ]
        return(mean(similarRows$foodshare))
      })
Note that with apply I need to convert back to numeric, since apply coerces each row of a mixed-type data frame to character.
You can also use plyr or data.table for cleaner syntax; for example:
apply(df, 1, function(row) row[1] * 2)
is equivalent to
ddply(df, 1, summarise, z = x * 2)
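Coming back to the original df, the more readable version of the first apply() example, using [ with a name rather than a numeric index, would be:
df <- data.frame(x = 1:5, y = 1:5)
apply(df, 1, function(row) row["x"] * 2)  # same result as row[1]*2, but self-documenting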
