I really like using the data frame syntax in R. However, if I try to do this with apply, it gives me an error because each row is passed in as a vector, not a data frame (which is correct). Is there a similar function to mapply that will let me keep using the data frame syntax?
df = data.frame(x = 1:5, y = 1:5)
# This works, but is hard to read because you have to remember what's
# in column 1
apply(df, 1, function(row) row[1])
# I'd rather do this, but it gives me an error
apply(df, 1, function(row) row$x)
You can't use $ on an atomic vector, but I guess you want to use it for readability. You can use the [ subsetter instead.
Here is an example. Please provide a reproducible example next time; R questions rarely make sense without data.
set.seed(1234)
gidd <- data.frame(region = sample(letters[1:6], 100, rep = TRUE),
                   wbregion = sample(letters[1:6], 100, rep = TRUE),
                   foodshare = rnorm(100, 0, 1),
                   consincPPP05 = runif(100, 0, 5),
                   stringsAsFactors = FALSE)
apply(gidd,   ## apply over every row of gidd
      1,
      function(row) {
        similarRows <- gidd[gidd$wbregion == row['region'] &
                              gidd$consincPPP05 > .8 * as.numeric(row['consincPPP05']), ]
        return(mean(similarRows$foodshare))
      })
Note that with apply I need to convert back to numeric, because apply first coerces the data frame to a matrix and each row arrives as a character vector.
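A quick way to check this on the gidd data above: the coerced matrix has character type, which is why as.numeric() is needed inside the function.
typeof(as.matrix(gidd))
## [1] "character"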
You can also use plyr or data.table for cleaner syntax. For example:
apply(df, 1, function(row) row[1] * 2)
is equivalent to
ddply(df, 1, summarise, z = x*2)
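If you prefer data.table, a rough equivalent (just a sketch on the same toy df; the new column name z is arbitrary):
library(data.table)
dt <- as.data.table(df)   # work on a data.table copy so df itself is untouched
dt[, z := x * 2]          # adds a new column z equal to x * 2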
I use ifelse to sanitise data in a real-world database that is subject to typing errors.
Let's say I want to sanitise a variable x which I know can't be above 100 in real-world situations, so I just want to turn everything above 100 into NA values that won't be included in the analysis.
So I would do:
df$x <- ifelse(df$x > 100, NA, df$x)
This turns all values above 100 into NA and keeps the others.
It feels quite cumbersome, though, and makes the code unreadable when I use the real variable names, which are quite long.
Is there any shorter way to do what I am trying to perform?
Thanks!
The simplest way I am aware of is the replacement function is.na<-.
is.na(df$x) <- df$x > 100
Explanation.
Function is.na<- is a generic function defined in file
src/library/base/R/is.R as
`is.na<-` <- function(x, value) UseMethod("is.na<-")
One method is defined in the file, the default method.
`is.na<-.default` <- function(x, value)
{
x[value] <- NA
x
}
This is what S3 method dispatch calls in the line of code above. An alternative way of calling it is the functional form:
`is.na<-`(df$x, df$x > 100)
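A quick demonstration on a small made-up data frame (values above 100 become NA):
df <- data.frame(x = c(5, 150, 42, 101))
is.na(df$x) <- df$x > 100
df$x
## [1]  5 NA 42 NA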
Use data.table
setDT(df)
df[x > 100, x := NA]
If the operation is to be applied to several columns:
column.names <- c("x")   # hypothetical: list the columns you want to clean
column.names <- names(df)[names(df) %in% column.names]
for (i.col in column.names) {
  set(df, i = which(df[[i.col]] > 100), j = i.col, value = NA)
}
Try this; it should help:
df <- data.frame('X'=c(1,2,3,4,NA,100,101,102))
df$X <- as.numeric(df$X)
df$X <- ifelse(is.na(df$X) | df$X > 100, NA, df$X)
You can also use the column index instead of the column name:
col <- which(names(df) == 'x')
df[[col]] <- df[[col]] * c(1, NA)[(df[[col]] > 100) + 1]
Or
df[[col]] <- with(df, replace(df[[col]], df[[col]] > 100, NA))
So here you use column name only once.
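If the c(1, NA)[...] indexing looks cryptic, here is a small illustration of the trick on a made-up vector:
v <- c(50, 200, 99)
(v > 100) + 1            # FALSE/TRUE + 1 gives index 1 or 2
## [1] 1 2 1
c(1, NA)[(v > 100) + 1]  # values over 100 pick up NA, the rest pick up 1
## [1]  1 NA  1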
How can I use lapply() to "loop" over a multi-column dataset and apply a function? Normally I would use rollapply(), but for reasons that aren't worth going into, the analysis in this case only works with lapply(). I know how to run a function over an expanding window, but how can lapply() be used with a sliding window? For example, here's a toy example where I manually change the range, using a function I'll call my_fun on a multi-column dataset (dat1):
set.seed(78)
dat1 <- as.data.frame(matrix(rnorm(1000), ncol = 20, nrow = 50))
my_fun <- function(x) {
  a <- apply(x, 1, mean)
}
test.1 <- my_fun(dat1[1:10])
test.2 <- my_fun(dat1[2:11])
test.3 <- my_fun(dat1[3:12])
Using lapply() for an expanding window works too, i.e., for ranges 1:10, 1:11, 1:12:
test.a <- lapply(seq(10, 12), function(x) my_fun(dat1[1:x]))
My question: is there any way to use lapply to replicate the sliding window analysis via the 3 manual examples above? I've tried several possibilities, using rep() and replicate(), for example, but so far no success. Any insight would be greatly appreciated.
You can do the same thing by sliding the start of the range instead of the end:
test.a <- lapply(seq(1, 3), function(x) my_fun(dat1[x:(x + 9)]))
In fact, it can be done with rollapply like this:
library(zoo)
res <- t(rollapply(t(dat1), 10, function(x) my_fun(t(x)), by.column = FALSE))
# verify that res[, i] equals test.i for i = 1,2,3
all.equal(res[, 1], test.1)
## [1] TRUE
all.equal(res[, 2], test.2)
## [1] TRUE
all.equal(res[, 3], test.3)
## [1] TRUE
I've got a huge data frame with many negative values in different columns that should be replaced by their original value * 0.5.
I've tried several R functions, but I can't seem to find a single one that works on the entire data frame.
I would like something like the following (not working) piece of code:
mydf[] <- replace(mydf[], mydf[] < 0, mydf[]*0.5)
You can simply do,
mydf[mydf<0] <- mydf[mydf<0] * 0.5
If you have values that are non-numeric, then you may want to apply this to only the numeric ones,
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1<0] <- mydf1[mydf1<0] * 0.5
mydf[ind] <- mydf1
You could try using lapply() on the entire data frame, making the replacements on each column in succession.
df[] <- lapply(df, function(x) {
  ifelse(x < 0, x * 0.5, x)
})
The lapply(), or list apply, function is intended to be used on lists, but a data frame is a special type of list, so this works here; assigning back with df[] <- keeps the result a data frame rather than a plain list.
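You can check that a data frame really is a list underneath:
is.list(mydf)
## [1] TRUE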
In replace, the values argument should have the same length as the number of TRUE values in the index vector:
replace(mydf, mydf < 0, mydf[mydf < 0] * 0.5)
Or another option is set from data.table, which would be very efficient
library(data.table)
for (j in seq_along(mydf)) {
  i1 <- mydf[[j]] < 0
  set(mydf, i = which(i1), j = j, value = mydf[[j]][i1] * 0.5)
}
data
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))
I'm trying to find the log return along a vector of prices, but I'm not sure how to refer to an index inside a function used with an apply function.
Here's what I'm using now:
set.seed(456)
df1 <- data.frame(id = 1:20, col1 = round( runif(20) * 100 ,0))
df1[,'logDiff'] <- NA
for (i in 2:20) {
  df1[i, 'logDiff'] <- log(df1[i, 'col1'] / df1[i - 1, 'col1'])
}
Any suggestions?
EDIT:
I have a bunch of columns to do this for and would like to use something like this:
colsToUse <- c('col1','col2','col3')
lagLogDf <- as.data.frame(lapply(df1[,colsToUse], lagLogFunction(x)))
As you want the difference between consecutive values of a vector, you can use the diff function:
df1$logDiff = c(NA, diff(log(df1$col1)))
Alternatively (for instance, if your operation were more complicated than cumulative differences), you could use head and tail to get the vector missing the first element and missing the last element, and work with them in a vectorized way:
df1$logDiff = c(NA, log(tail(df1$col1, -1) / head(df1$col1, -1)))
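For the multi-column case in the edit, a sketch along the same lines (the helper name matches the one in the question; colsToUse is restricted here to the single column present in the toy df1):
lagLogFunction <- function(x) c(NA, diff(log(x)))
colsToUse <- c('col1')
lagLogDf <- as.data.frame(lapply(df1[colsToUse], lagLogFunction))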
I can't understand what is wrong with this R code. I have several rows and columns containing either a measurement or NA, and I basically want to get the min and max value in each row, looking across the several columns:
require(plyr)
census <- read.csv("sps_census.csv")
info <- read.csv("sps_info.csv")
for (i in 1:nrow(census)) {
  trans <- census[i, c("dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8","dbh9")]
  index.1 <- which(trans != "NA")  # some NAs are in the data
  census$min.dbh <- min(trans[1, index.1])
  census$min.dbh.index <- min(index.1)
  census$max.dbh <- max(trans[1, index.1])
  census$max.dbh.index <- max(index.1)
}
In this line (and the other three similar lines):
census$min.dbh <- min(trans[1,index.1])
you are assigning an entire column to a single value, which is clearly not what you intend.
Perhaps you want something like this:
census$min.dbh[i] <- min(trans[1,index.1])
Note that you can use apply to do this sort of thing. It would be a lot easier for someone to write a working apply example if you supplied example data (i.e., made your question a reproducible example).
You can use apply:
index <- c"dbh1","dbh2","dbh3","dbh4","dbh5","dbh6","dbh7","dbh8", "dbh9") #or paste("dbh",1:9,sep="")
census$min.dbh <- apply(census[index], 1, min, na.rm=T)
census$min.db.index <- apply(census[index], 1, function(x){ min(which(!is.na(x))) })
census$max.dbh <- apply(census[index], 1, max, na.rm=T)
census$max.db.index <- apply(census[index], 1, function(x){ max(which(!is.na(x))) })
Note that I'm using is.na(x) instead of x != "NA".
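To see why on a made-up vector: a genuine NA is not the string "NA", so the comparison itself returns NA rather than FALSE.
x <- c(1.2, NA, 3.4)
x != "NA"    # TRUE NA TRUE -- the NA propagates through the comparison
!is.na(x)    # TRUE FALSE TRUE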