Removing S4 class column in R (flowCore)

I work with the flowCore package from Bioconductor, which reads my data files into an S4 class format. You can type library(flowCore) and then data(GvHD) to load an example dataset. When you type GvHD you can see that this dataset is made up of 35 experiments, which can be accessed individually by typing, for example, GvHD[[1]].
Now I am trying to delete two columns, FSC-H and SSC-H, from all the experiments, but I have been unsuccessful.
I have tried myDataSet <- within(GvHD, rm("FSC-H","SSC-H")) but it doesn't work. I would greatly appreciate any help.

rm isn't meant for removing columns. The normal procedure is to assign NULL to that column:
for (i in 1:35) {
  GvHD[[i]][, c("FSC-H", "SSC-H")] <- NULL
}
This is the same as you would do for a data frame.
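For an ordinary data frame, that pattern looks like the following minimal sketch (toy data made up for illustration; as the accepted answer below explains, this does not carry over to flowFrames):
df <- data.frame("FSC-H" = 1:3, "SSC-H" = 4:6, other = 7:9, check.names = FALSE)
df[, c("FSC-H", "SSC-H")] <- NULL
names(df)  # "other"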

I posted my question on the relevant GitHub page for flowCore and the answer was provided by Jacob Wagner.
GvHD[[1]] is a flowFrame, not a simple data frame, which is why the NULL assignment doesn't work. The underlying representation is a matrix, which doesn't support dropping a column by assigning NULL either.
If you want to drop columns, here are some ways you could do that. Note for all of these I'm subsetting columns for the whole flowSet rather than looping through each flowFrame. But you could perform these operations on each flowFrame as well.
As Greg mentioned, choose the columns you want to keep:
data(GvHD)
all_cols <- colnames(GvHD)
keep_cols <- all_cols[!(all_cols %in% c("FSC-H", "SSC-H"))]
GvHD[,keep_cols]
Or you could just filter in the subset:
GvHD[,!colnames(GvHD) %in% c("FSC-H", "SSC-H")]
You could also grab the numerical indices you want to drop and then use negative subsetting.
# drop_idx <- c(1,2)
drop_idx <- which(colnames(GvHD) %in% c("FSC-H", "SSC-H"))
GvHD[,-drop_idx]
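If you would rather operate on each flowFrame instead of subsetting the whole flowSet at once, here is a sketch using flowCore's fsApply (assuming flowCore is loaded and the GvHD example data is attached):
library(flowCore)
data(GvHD)
keep_cols <- setdiff(colnames(GvHD), c("FSC-H", "SSC-H"))
GvHD_trimmed <- fsApply(GvHD, function(ff) ff[, keep_cols])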

Related

R - Unexpected behavior when typechecking columns of tidyverse data frame

I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[, col])) {
      df[, col] <- as.numeric(scale(df[, col]))
    }
  }
  df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the dataframe with a string the way I do above returns a one-column tibble instead of a vector, as other methods of indexing do. What I don't understand is why this behavior exists, why is.numeric() (or any type-check) does not work on a one-column tibble, and, in general, why there is this difference in the way the default and tidyverse dataframes are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make the behavior of this type of indexing the same as with a default dataframe.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour from regular data frames when using the [ extraction function, which can be a bit of a gotcha. Specifically, df[,"col"] on a tibble will return a one-column tibble, whereas on a regular data frame it will return a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
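To see the difference side by side, a minimal sketch (assuming the tibble package is installed):
library(tibble)
tb <- tibble(col = 1:3)
is.numeric(tb[, "col"])               # FALSE: a one-column tibble
is.numeric(tb[["col"]])               # TRUE: the underlying vector
is.numeric(tb[, "col", drop = TRUE])  # TRUE: coerced to the lowest dimension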

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because the name in a name=value argument must be a literal name, not an expression like the paste() call, even though a literal name is fine in the previous examples. Same fail for transform.
My first thought was assign but while I've used this successfully for creating formula-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to a formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans") which I guess I'll settle for for now, but would still be interested to find out if there is a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated @Marek's suggestion for [["newcol"]] instead of [, "newcol"]
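A minimal sketch of the dynamic case (toy data made up for illustration; note paste0(), since paste() with its default separator would give "magic 100 beans"):
data <- data.frame(bothdata = 1:5)
name <- paste0("magic", 100, "beans")  # "magic100beans"
data[[name]] <- data$bothdata - 1
names(data)  # "bothdata" "magic100beans"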
It may help you to know that data$col1 is the same as data[,"col1"], which is the same as data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work (note paste0() rather than paste(), which with its default separator would produce "magic 100 beans"):
name <- paste0("magic",100,"beans")
data[,name] <- bothdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- bothdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)

Recoding over multiple data frames in R

(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets that I have loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1", "A2", and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, so I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0]
  return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0] # you need to subset x$A2, otherwise it's too long
  return(x) # better to state explicitly what the return value is
})
EDIT (as per comment):
As basically always happens in R, functions do not mutate existing objects but return brand-new objects.
So, in this case, df1 and df2 are still the same, but lapply returns a list with the expected two new data.frames, i.e.:
resultList <- lapply(mylist,function(x){
  x$newVar=x$A1
  x$newVar[x$A3>0]=x$A2[x$A3>0]
  return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
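The same logic can also be collapsed into a single vectorized ifelse() call, a sketch equivalent to the loop above:
resultList <- lapply(mylist, function(x) {
  x$newVar <- ifelse(x$A3 > 0, x$A2, x$A1)
  x
})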

perform function on pairs of columns

I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200. I then convert this list into a dataframe that is more conducive to the eventual functions I want to run for each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
  kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
  print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now have the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has a higher precedence than + (see ?Syntax for the order in which mathematical operations are conducted in R). To get the desired two columns, use j:(j+1) instead.
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1,ncol(df3)-1,2)) {
  kud <- kernelUD(SpatialPoints(df3[,j:(j+1)]),kern="bivnorm")
  kernAr <- kernel.area(kud,unin=c("m"),unout=c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something, and as is mentioned in a comment below, kernelUD() is called before kernel.area().
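To combine the per-resample results into one object, here is a hedged sketch (kernel.area() on a kernelUD estimate returns a data frame of areas per home-range percentage level, so the results can be column-bound):
results <- list()
for (j in seq(1, ncol(df3)-1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  results[[length(results) + 1]] <- kernel.area(kud, unin = "m", unout = "km2")
}
resultDf <- do.call(cbind, results)  # one column per resampling event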

How to attach a simple data.frame to a SpatialPolygonsDataFrame in R?

I have (again) a problem with combining data frames in R. But this time, one is a SpatialPolygonsDataFrame (SPDF) and the other one is a usual data.frame (DF). The SPDF has around 1000 rows, the DF only 400. Both have a common column, QDGC.
Now, I tried
oo <- merge(SPDF,DF, by="QDGC", all=T)
but this only results in a normal data.frame, not a spatial polygon data frame any more.
I read somewhere else that this does not work, but I did not understand what to do in such a case (it has something to do with the ID columns that merge uses)
oooh such a hard question, I guess...
Thanks!
Jens
Let df = data frame, sp = spatial polygon object and by = name or column number of the common column. You can then merge the data frame into the sp object using the following line of code
sp@data = data.frame(sp@data, df[match(sp@data[,by], df[,by]),])
Here is how the code works. The match function inside aligns the columns so that order is preserved. So when we merge it with sp@data, order is correctly preserved. A quick check to see if the code has worked is to inspect the two columns corresponding to the common column and see if they are identical (the common column gets duplicated; it is easy to remove the copy, but I keep it as it is a good check)
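A toy illustration of the match() idiom on plain data frames (hypothetical data; the same pattern applies to sp@data):
sp_data <- data.frame(QDGC = c("a", "b", "c"), v1 = 1:3)
df      <- data.frame(QDGC = c("c", "a"), v2 = c(10, 20))
data.frame(sp_data, df[match(sp_data$QDGC, df$QDGC), ])
# rows of df are aligned to sp_data's order; unmatched keys become NA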
It is as easy as this:
require(sp) # the trick is that this package must be loaded!
oo <- merge(SPDF,DF, by="QDGC")
I've tested it myself, but it only works if you use merge from the sp package. This is the default when the sp package is loaded: the merge function is then overloaded, and sp::merge is used if the first argument is a spatial structure.
merge can produce a dataframe with more rows than the originals if there isn't a simple 1-1 mapping of the two dataframes, in which case it would have to copy all the geometry and create multiple polygons, which is probably not a good thing.
If you have a dataframe which has the same number of rows as a SpatialPointsDataFrame, then you can just directly replace the @data slot.
library(sp)
example(overlay) # to get the srdf object
srdf@data
spplot(srdf)
srdf@data=data.frame(x=runif(3),xx=rep(0,3))
spplot(srdf)
if you get the number of rows wrong:
srdf@data=data.frame(x=runif(2),xx=rep(0,2))
spplot(srdf)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 3, 2
Maybe the function joinCountryData2Map in the rworldmap package can give inspiration. (But I may be wrong, as I was last time.)
One more solution is to use append_data function from the tmaptools package. It is called with these arguments:
append_data(shp, data, key.shp = NULL, key.data = NULL,
ignore.duplicates = FALSE, ignore.na = FALSE,
fixed.order = is.null(key.data) && is.null(key.shp))
It's a bit unfortunate that it's called append, since I'd understand append more in the sense of rbind, and what we want here is something like join or merge.
Ignoring that fact, the function is really useful for making sure you get your joins correct and for checking whether some rows are present on only one side of the join. From the docs:
Under coverage (shape items that do not correspond to data records), over coverage (data records that do not correspond to shape items respectively) as well as the existence of duplicated key values are automatically checked and reported via console messages. With under_coverage and over_coverage the under and over coverage key values from the last append_data call can be retrieved.
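A hedged usage sketch (shp and dat are placeholders for your SPDF and data.frame; note that later tmaptools releases deprecate append_data in favour of ordinary joins):
library(tmaptools)
shp_joined <- append_data(shp, dat, key.shp = "QDGC", key.data = "QDGC")
under_coverage()  # key values in shp that matched no data row
over_coverage()   # data rows that matched no shape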
If it is two shapefiles that need to be merged into a single object, just use rbind().
When using rbind(), just make sure that both the arguments you use are spatial data frames of the same class. You can check this using class(). If an object is not one already, use st_as_sf() to convert it to an sf spatial data frame before you rbind them.
Note: You can also use this to append to NULLs, especially when you are using a result from a loop and you want to accumulate the results.
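A minimal sketch of the rbind approach (file paths are hypothetical; assumes the sf package):
library(sf)
a <- st_read("part1.shp")
b <- st_read("part2.shp")
combined <- rbind(a, b)  # both must be sf objects with matching columns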
