How do I sub sample data by group using ddply? - r

I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.
I wrote a function to subset a data frame as follows:
samp <- function(dataf)
{
  dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
}
Now I want to apply this function to each species in a larger data frame.
When I try something like
culled_data = ddply (larger_data, .(species), subset, samp)
I get this error:
Error in subset.data.frame(piece, ...) :
'subset' must evaluate to logical
Anyone got ideas on how to do this?

It looks like it should work once you remove , subset from your call, i.e. ddply(larger_data, .(species), samp).

Dirk's answer is of course correct, but I'm posting my own to add some explanation.
Why doesn't your call work?
First of all, your syntax is shorthand. It is equivalent to
ddply(larger_data, .(species), function(dfrm) subset(dfrm, samp))
so you can clearly see that you are passing a function (see class(samp)) as the second argument of subset. You could pass samp(dfrm) instead, but that won't work either, because samp returns a data.frame and subset needs a logical vector. samp(dfrm) would only work there if it returned a logical index.
How to use subset in this case?
Make subset work by feeding it a logical vector:
ddply (larger_data, .(species), subset, sample(seq_along(species)<=40))
This creates a logical vector with 40 TRUE values and shuffles it. (By the way, it also works when a species has fewer than 40 cases; it then returns all of them.)
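For reference, here is a minimal base R sketch of the same per-group sampling, without plyr. The example data and the n argument are made up for illustration. Note that the original samp stops with an error when a species has fewer than 40 rows, because sample cannot draw 40 values without replacement from a smaller set, so this version caps the size at the group size.

```r
# Draw up to n rows at random from one data frame
samp <- function(dataf, n = 40) {
  dataf[sample(nrow(dataf), size = min(n, nrow(dataf))), , drop = FALSE]
}

# Made-up example data: 100 rows of species "a", 25 rows of species "b"
set.seed(1)
larger_data <- data.frame(
  species = rep(c("a", "b"), times = c(100, 25)),
  x = runif(125),
  y = runif(125)
)

# split by species -> sample each piece -> bind the pieces back together
pieces <- lapply(split(larger_data, larger_data$species), samp)
culled_data <- do.call(rbind, pieces)

table(culled_data$species)
```

With plyr loaded, ddply(larger_data, .(species), samp) should give the same kind of result.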

Related

R: Check for finite values in DataFrame

I need to check whether a data frame is "empty" or not ("empty" in the sense that the data frame contains zero finite values. If there is a mix of finite and non-finite values, it should NOT be considered "empty").
Referring to How to check a data.frame for any non-finite, I came up with one line code to almost achieve this objective
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if data frame contains a single row.
For example, the above code would work fine for,
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA)) OR tmp <- data.frame(a=c(3,NA), b=c(4,NA))
But not for,
tmp <- data.frame(a=NA, b=NA)
because I think rowSums expects at least two rows
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up with a solution for my problem.
My question is, are there any clean ways (i.e. avoiding loops, and ideally a one-liner) to check whether any data frame is "empty"?
Thanks
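If you are checking all columns, then you can just do
all(!sapply(tmp, is.finite))
Here we are using all rather than the rowSums trick, so we don't have to worry about preserving matrices: for a one-row data frame, sapply returns a plain vector rather than a matrix, which is what makes rowSums fail.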
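The "no finite value anywhere" check can be wrapped in a small helper. This is only a sketch: is_empty_df is a hypothetical name, and it uses lapply plus unlist so the shape returned by sapply never matters.

```r
# Hypothetical helper: TRUE when the data frame holds no finite value at all
is_empty_df <- function(df) {
  # is.finite() per column, flattened into one logical vector
  !any(unlist(lapply(df, is.finite)))
}

is_empty_df(data.frame(a = c(NA, NA), b = c(NA, NA)))  # two rows, nothing finite
is_empty_df(data.frame(a = c(3, NA), b = c(4, NA)))    # mix of finite and NA
is_empty_df(data.frame(a = NA, b = NA))                # the single-row case
```

The first and third calls should report an "empty" data frame, the second should not.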

R apply to change date in Posixct?

I'm trying to replace the date in some POSIXct data I have in a df. Essentially I thought of using something like this:
as.POSIXct(sub("\\S+", "2018-07-02", x))
x being the value in the row of my original data frame. I thought that the most efficient way would be to use something like apply to iterate through the rows, something like:
apply(df$original.date,1,function(x) as.POSIXct(sub("\\S+", "2018-07-02", x)))
However, it doesn't seem to like it, giving me an error regarding positive length. I was wondering first of all whether my approach was sound, and if so how to fix it. Alternatively I'm all ears if there is a better approach.
Thank you.
The functions sub and as.POSIXct are vectorized, hence a single call
df$original.date <- as.POSIXct(sub("\\S+", "2018-07-02", df$original.date))
will replace the original.date column in the data frame.
As for your apply code:
> apply(df$original.date,1,function(x) as.POSIXct(sub("\\S+", "2018-07-02", x)))
Error in apply(df$original.date, 1, function(x) as.POSIXct(sub("\\S+", :
dim(X) must have a positive length
The problem is that apply expects a matrix or array, while you give it a column from a data frame, i.e. a vector. You could use lapply or sapply. But that is unnecessary since as shown above the entire column can be changed in one go.
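As a variation on the vectorized call, here is a sketch that swaps the date by formatting the clock time explicitly instead of using sub. The example data, the new.date column and the UTC time zone are assumptions made so the example is reproducible.

```r
# Made-up data; the time zone is fixed to UTC so results do not depend on the locale
df <- data.frame(
  original.date = as.POSIXct(c("2018-01-01 10:30:00", "2018-03-15 23:45:10"),
                             tz = "UTC")
)

# Keep the clock time, swap the date, and parse the whole column in one call
time_part <- format(df$original.date, "%H:%M:%S")
df$new.date <- as.POSIXct(paste("2018-07-02", time_part), tz = "UTC")

format(df$new.date, "%Y-%m-%d %H:%M:%S")
```

No apply, lapply or sapply is needed, because format, paste and as.POSIXct all work on whole vectors.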

how to make groups of variables from a data frame in R?

Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
my goal is to produce specific statistic based on these groupings and then compare which subset produces a better statistic. my problem is how can I produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
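To get every non-empty subset rather than only those of size 3, combn can be called once per subset size. The following is a sketch on made-up data; the statistic (the mean of all selected values) is only a placeholder for whatever you actually want to compare.

```r
varnames <- paste0("V", 1:8)

# combn() for every subset size m = 1..8, collected into one flat list
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)   # 2^8 - 1 = 255

# Made-up data and a placeholder statistic for each subset
set.seed(1)
yourdataframe <- as.data.frame(
  matrix(rnorm(100 * 8), ncol = 8, dimnames = list(NULL, varnames))
)
stats <- sapply(subsets, function(v) {
  mean(as.matrix(yourdataframe[, v, drop = FALSE]))
})
```

subsets[[i]] holds the column names of the i-th subset and stats[i] the matching statistic, so which.max(stats) or which.min(stats) points at the best subset.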
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
             c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
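One candidate for the "more easily generalized" way to build the 0/1 indicator matrix is expand.grid. This is a base R sketch; the names vars and indicator are made up, and the rows come out in a different order than in the hand-built x above, because the first variable varies fastest here.

```r
nn <- 8L
vars <- paste0("V", 1:nn)

# One row per subset, one 0/1 column per variable (2^nn rows in total)
indicator <- as.matrix(expand.grid(rep(list(0:1), nn)))

# Translate each row of 0/1 flags into the names of the included variables
incl <- lapply(seq_len(nrow(indicator)), function(i) vars[indicator[i, ] == 1])
```

As before, the first element of incl is the empty subset and needs to be skipped when computing statistics.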

Perform t-test using aggregate function in R

I'm having difficulty using the unpaired t-test and the aggregate function.
Example
dd<-data.frame(names=c("1st","1st","1st","1st","2nd","2nd","2nd","2nd"),a=c(11,12,13,14,2.1,2.2,2.3,2.4),b=c(3.1,3.2,3.3,3.4,3.1,3.2,3.3,3.4))
dd
# Compare all the values in the "a" column that match with "1st" against the values in the "b" column that match "1st".
# Then, do the same thing with those matching "2nd"
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value
# Also need to replace any errors from t.test that have too low variance with NA
# An example of the type of error I might run into would be if the "b" column was replaced with c(3,3,3,3,3,3,3,3).
For paired data, I found a work around.
# Create Paired data.
data_paired<-dd[,3]-dd[,2]
# Create new t-test so that it doesn't crash upon the first instance of an error.
my_t.test<-function(x){
A<-try(t.test(x), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
# Use aggregate with new t-test.
aggregate(data_paired, by=list(dd$name),FUN=my_t.test)
This aggregate works with a single column of input. However, I can't get it to function when I must have several columns go into the function.
Example:
my_t.test2<-function(x,y){
A<-try(t.test(x,y,paired=FALSE), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
aggregate(dd[,c(2,3)],by=list(dd$name),function(x,y) my_t.test2(dd[,3],dd[,2]))
I had thought that the aggregate function would only send the rows matching the value in the list to the function my_t.test2 and then move onto the next list element. However, the results produced indicate that it is performing a t-test on all values in the column like below. And then placing each of those values in the results.
t.test(dd[,3],dd[,2])$p.value
What am I missing? Is this an issue with the original my_t.test2, an issue with how to structure the aggregate function, or something else? The way I applied it doesn't seem to aggregate.
These are the results I want.
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value
To Note, this is a toy example and the actual data set will have well over 100,000 entries that need to be grouped by the value in the names column. Hence why I need the aggregate function.
Thanks for the help.
aggregate isn't the right function to use here because the summary function only works on one column at a time. It's not possible to get both the a and b values simultaneously with this method.
Another way you could approach the problem is to split the data, then apply the t-test to each of the subsets. Here's one implementation:
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]], x[["b"]])$p.value
)
Here I split dd into a list of subsets, one for each value of names. I use dd[-1] to drop the "names" column from the subsets so I just have a data.frame with two columns, one for a and one for b.
Then, for each subset in the list, I perform a t.test using the a and b columns and extract the p-value. The sapply wrapper will calculate this p-value for each subset and return a named vector of p-values, where the names of the entries correspond to the levels of dd$names:
1st 2nd
6.727462e-04 3.436403e-05
If you wanted to do paired t-test this way, you could do
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]] - x[["b"]])$p.value
)
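The question also asked for NA whenever t.test stops with an error. A sketch of how the split/sapply approach could handle that with tryCatch follows; safe_p is a hypothetical helper name.

```r
dd <- data.frame(
  names = rep(c("1st", "2nd"), each = 4),
  a = c(11, 12, 13, 14, 2.1, 2.2, 2.3, 2.4),
  b = c(3.1, 3.2, 3.3, 3.4, 3.1, 3.2, 3.3, 3.4)
)

# Return the p-value, or NA when t.test() stops with an error
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# One p-value per level of names
p_values <- sapply(split(dd[-1], dd$names),
                   function(g) safe_p(g[["a"]], g[["b"]]))

# When both samples are constant, t.test() fails and the helper yields NA
safe_p(c(3, 3, 3, 3), c(3, 3, 3, 3))
```

Note that a constant b column alone does not necessarily trigger the error in the unpaired test; it is raised when the pooled standard error is essentially zero.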
As @MrFlick said, aggregate is not the right function to do this. Here are some alternatives to the sapply approach, using the dplyr or data.table packages.
require(dplyr)
summarize(group_by(dd, names), t.test(a,b)$p.value)
require(data.table)
data.table(dd)[, t.test(a,b)$p.value, by=names]

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
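library(reshape2)  # melt() is assumed to come from reshape2
library(ggplot2)   # ggplot(), facet_grid(), ggsave()
plotdata<-function(l){
  # long format: one row per value of the chosen column l
  melted<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
  plot<-ggplot(melted,aes(x=X1,y=value))+geom_point()
  plot2<-plot+facet_grid(B ~ C)
  ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}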
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However I am new to writing for loops and can't find a way to loop in the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop the user specified columns
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric; for iterates over the elements of whatever vector you give it.
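A small sketch of the same idea with a check that the requested columns exist. describe_column is a hypothetical stand-in for plotdata that only builds the output file name, so the example runs without ggplot2; the data is made up.

```r
set.seed(1)
df <- data.frame(X1 = sample(100, 25), Y1 = sample(100, 25),
                 Y2 = sample(100, 25), Y3 = sample(100, 25))

# Hypothetical stand-in for plotdata(): returns the file name instead of plotting
describe_column <- function(l) {
  paste("X_vs_", l, "_faceted.jpeg", sep = "")
}

cols <- c("Y1", "Y3")

# Guard against typos in the requested column names before looping
stopifnot(all(cols %in% names(df)))

for (col in cols) {
  message("would write ", describe_column(col))
}

# lapply() does the same and keeps the results
files <- lapply(cols, describe_column)
```

Keeping the column names in a vector such as cols means the selection can be changed in one place.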
