I work with data where most of the header names are very long strings. These are cryptic but contain important details that can't be discarded. Long column names are difficult to work with, for display reasons as well as programmatic ones. To help with this, I typically retain the original column names as Hmisc labels & rename the columns with uninformative names like V1, V2, V3, etc., or with some truncated (but still long & often not unique) version of the long name.
library(Hmisc)
myDF <- read.csv("someFile.csv")
myLabels <- colnames(myDF)
label(myDF, self=FALSE) <- myLabels
colnames(myDF) <- paste0("V", 1:ncol(myDF))
I can now work with the short V names & still look up the labels to get the original names. However, this is still less than satisfactory... the columns of myDF are now of class "labelled" and contain character vectors even though my data is numeric in nature. Converting to numeric, or even subsetting myDF, will cause the labels to be dropped. Does anyone have better suggestions? In particular I need to subset the data, & I also find indexing by number to be clumsy & error prone.
Because the data is large relative to RAM, I cannot keep copies of both numeric & "labelled" data.frames. I have also tried creating hash objects using the hash package:
library(hash)
myHash <- hash(colnames(myDF), label(myDF))
Or via lists:
nameList <- list()
for (name in colnames(myDF)) {
  nameList[[name]] <- label(myDF)[name]
}
But... I also find these unsatisfactory, mostly because they can fall out of sync with myDF after various manipulations & they are not accessible from the same object. Perhaps I just need to be more diligent.
Lastly, I thought that perhaps a solution would be a custom class that contains a data.frame & some other data structures to track the meaningless terse name, the verbose & non-unique nickname, & the true variable name. But this would require overloading all the indexing operators & is likely way over my head skill-wise.
So, any other proposed solutions? Any help appreciated.
You could take a relational database kind of approach here. Create a separate data.frame that expresses the associations between the abbreviated and long names.
library(Hmisc)
myDF <- read.csv("someFile.csv")
LongNames <- colnames(myDF)
colnames(myDF) <- paste0("V", 1:ncol(myDF))
ShortNames <- colnames(myDF)
NameTable <- cbind(LongNames, ShortNames)
Even if your data is later manipulated, the associations between short and long names for variables should remain unchanged. Of course, each time you create a new variable that requires a long name, you'll need to add a new row to the NameTable, but you'd need to put that long name somewhere anyway.
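For example, appending a row could look like this (the long name and "V11" below are purely hypothetical, just to show the shape of the update):
NameTable <- rbind(NameTable, c("Some long descriptive original name", "V11"))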
To retrieve the long name easily using the short name, you could define a function for that purpose.
L <- function(x){ NameTable[NameTable[, "ShortNames"] == x, "LongNames"] }
L("V3")  # gives the long name of V3
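The reverse lookup (short name from long name) is just the same idea with the columns swapped; a small sketch, with a made-up long name as the argument:
S <- function(x){ NameTable[NameTable[, "LongNames"] == x, "ShortNames"] }
S("whatever the original long header was")  # gives the short V name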
Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
Why does this code throw an error (dsts is a matrix)?
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one fails:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced to a data frame):
subset(as.data.frame(dsts),i==1)
This one does work, however, because x is already defined in the global environment:
subset(dsts, x > 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?
The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you need matrix arithmetic. If you are thinking of your columns as different attributes of your row observations, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix, but it seems like just the wrong choice.
A data.frame is stored as a named list while a matrix is stored as a dimensioned vector. A list can be used as an environment, which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data type. You cannot always easily convert between the two without a loss of information.
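A quick illustration of that information loss: converting a mixed-type data.frame to a matrix coerces every column to a common type (here, character).
mixed <- data.frame(num = 1:3, chr = c("a", "b", "c"))
as.matrix(mixed)  # every entry is now a character string; the numeric class is gone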
Things like $ only work with data.frames as well.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the subset() function.
dsts[ dsts[,"i"]==1, ]
This behavior has been a part of R for a very long time. Any change to it is likely to break existing code that relies on variables being evaluated in a certain context. I think the problem lies with whoever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame().
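For instance, the simulation from the question could be rebuilt with data.frame(), after which subset() behaves as expected (a sketch reusing the same k and x defined above):
dsts <- do.call(rbind, lapply(seq_along(k),
                              function(i) data.frame(x = x, distri = dchisq(x, k[i]), i = i)))
subset(dsts, i == 1)  # i is now evaluated inside the data.frame, so no error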
I have a data.table containing some data that I want to merge into another data.table. However, the type of the columns don't match.
For example:
library(data.table)
heading <- c("A","B","C")
temp <- data.table(matrix(NA,nrow=1,ncol=3))
setnames(temp,names(temp),heading)
data <- data.table(A=2,B="hi")
merge(temp,data,by=c("A","B"))
The above gives me the error
Error in bmerge(i <- shallow(i), x, leftcols, rightcols, io <- haskey(i), :
x.'B' is a character column being joined to i.'B' which is type 'logical'.
Character columns must join to factor or character columns.
This is the case for any other data type mismatches.
In this case, I don't really care what data type my temp data.table had originally; I want to match it to whatever type is in the data object. I also can't set the NAs to specific data types, since I cannot determine them (or I'd rather not, to keep it flexible) prior to the merge.
How can I make this work? Can I force a coercion?
For those who end up here using search engines, the answer to the question is:
rbindlist(list(temp,data), fill=TRUE, use.names=TRUE)
Full props to @N8TRO (see the comments).
I want to do some statistical analysis in R on a column with the same name but different lengths, originating from several data frames. I created a list:
my.list <- list(df1, df2, df3, df4)
Now, as some elements of the column of interest (say: my.col) contain the word "FAILED" instead of numbers, I replace them with 'NA':
for (i in 1:length(my.list)){
  for (j in 1:length(my.list[[i]]$my.col)){
    if (my.list[[i]]$my.col[j] %in% c("FAILED")) {
      my.list[[i]]$my.col[j] <- 'NA'
    }
  }
}
I am pretty sure that this is not the best solution for the problem, but at least it works. Although I have to say that it causes warnings that in another column (not my.col) there are invalid factor levels that have been replaced by NA. No idea why it actually considers columns other than my.col. Suggestions for improvement are highly appreciated.
Now, the remaining numbers contain a decimal comma instead of a point. Although I tried to eliminate this problem while importing the .csv file with dec=",", this does not work for columns that contain anything other than numbers (e.g. "FAILED"). So I have to replace the comma with a point, and this is what doesn't work for me. I tried:
for (i in 1:length(my.list)){
  as.numeric(gsub(",", ".", my.list[[i]]$my.col))
}
This doesn't give any errors, but it also doesn't change anything, although if I type in e.g.
as.numeric(gsub(",", ".", my.list[[4]]$my.col))
it does what I want to do for the 4th element of the list. From my point of view, both should be the same. What's the problem with this?
Btw, I prefer not to delete the other columns from the data frames because I might need them in future for other analysis.
You can do this efficiently using the plyr package.
Note that in the example, I use the built in iris data.
Instead of replacing "FAILED" with NA, I replaced values of "versicolor".
Instead of replacing a comma with a period, I replaced an s with a w.
my.list <- list(iris, iris)

library(plyr)
my.list <- llply(.data = my.list, function(x) {
  x$Species <- as.character(x$Species)
  x$Species[x$Species == "versicolor"] <- "NA"
  x$Species <- gsub(pattern = "s", replacement = "w", x = x$Species)
  x$Species <- as.factor(x$Species)
  return(x)
})
The as.character was added as an example of a way to circumvent problems with adding a new level to a factor. The as.factor ensures the column is returned as a factor with the new levels.
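To see why the as.character() step matters: assigning a value that isn't an existing level of a factor produces NA plus exactly the kind of "invalid factor level" warning mentioned in the question. A minimal illustration:
f <- factor(c("setosa", "versicolor"))
f[2] <- "NA"  # warning: invalid factor level, NA generated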
This will also give you the flexibility to convert from list to data.frame. Simply replace llply with ldply.
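A minimal sketch of that variant (same idea as above, but the list elements are row-bound into a single data.frame):
my.df <- ldply(.data = my.list,
               function(x) {
                 x$Species <- as.character(x$Species)
                 x$Species[x$Species == "versicolor"] <- "NA"
                 x$Species <- as.factor(x$Species)
                 return(x)
               })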
I want to rename some arbitrary columns of a large data frame, and I want to use the current column names, not the indexes. Column indexes might change if I add or remove columns from the data, so I figure using the existing column names is a more stable solution.
This is what I have now:
mydf = merge(df.1, df.2)
colnames(mydf)[which(colnames(mydf) == "MyName.1")] = "MyNewName"
Can I simplify this code, either the original merge() call or just the second line? "MyName.1" is actually the result of an xts merge of two different xts objects.
The trouble with changing column names of a data.frame is that, almost unbelievably, the entire data.frame is copied. Even when it's in .GlobalEnv and no other variable points to it.
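If you want to see this for yourself, tracemem() is one rough way to watch for copies (the exact output varies by R version):
dd <- data.frame(a = 1:5)
tracemem(dd)
names(dd)[1] <- "b"  # tracemem typically reports a copy here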
The data.table package has a setnames() function which changes column names by reference without copying the whole dataset. data.table is different in that it doesn't copy-on-write, which can be very important for large datasets. (You did say your data set was large.) Simply provide the old and the new names:
require(data.table)
setnames(DT,"MyName.1", "MyNewName")
# or more explicit:
setnames(DT, old = "MyName.1", new = "MyNewName")
?setnames
names(mydf)[names(mydf) == "MyName.1"] = "MyNewName" # 13 characters shorter.
Although, you may want to replace a vector of names eventually. In that case, use %in% instead of == and supply vectors of old and new names of equal length.
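A small sketch of that vectorized version (the second old/new pair is hypothetical). Note that with %in% the replacement follows the order in which the old names appear in names(mydf), so list them in that order (or use match() to be safe):
old <- c("MyName.1", "MyName.2")
new <- c("MyNewName", "MyOtherNewName")
names(mydf)[names(mydf) %in% old] <- new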
plyr has a rename function for just this purpose:
library(plyr)
mydf <- rename(mydf, c("MyName.1" = "MyNewName"))
names(mydf) <- sub("MyName\\.1", "MyNewName", names(mydf))
This would generalize better to a multiple-name-change strategy if you put a stem as a pattern to be replaced using gsub instead of sub.
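For example, assuming a hypothetical family of columns sharing the "MyName." stem, gsub() renames them all in one pass:
names(mydf) <- gsub("^MyName\\.", "MyNewName.", names(mydf))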
You can use the str_replace function of the stringr package:
names(mydf) <- str_replace(names(mydf), "MyName.1", "MyNewName")
My data frame (m*n) has a few hundred columns. I need to compare each column with all other columns (contingency tables), perform a chisq test, and save the results for each column in a different variable.
It's working for one column at a time, like:
s <- function(x) {
  a <- table(x, data[, 1])
  b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, given the multiple-testing problem), work with lists:
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
                     function(z) chisq.test(table(Data[, z])))
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
It's no harder than that. As shown in the last line, you can easily get all tests for a specific column, so there is no need to make a separate list for each column. That just takes longer and uses more space, but gives the same information. You can write a small convenience function to extract the data you need:
extract <- function(col, l){
  l[grep(col, names(l))]
}
extract("^y$",my.results)
This means you can even loop over the different column names of your data frame and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
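For instance, if only the test statistic and p-value matter, something like this keeps the list small (a sketch built on my.results from above):
my.pvals <- sapply(my.results, function(res) res$p.value)
my.stats <- sapply(my.results, function(res) res$statistic)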
Fundamentally, you have a few problems here:
1. You're relying heavily on global objects rather than function arguments, which makes the double usage of "data" confusing.
2. Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
3. You're not extracting the one value you need from chisq.test(), so your result gets returned as a whole list.
You didn't provide any example data, so here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over the various columns (since you've now avoided hard-coding the column) and store the results.
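A sketch of what that could look like (the cut() discretization is my addition, only because the simulated runif data is continuous and would otherwise give a degenerate contingency table):
s <- function(j, dat, ref = 1) {
  # contingency table of column j against the reference column; keep only the p-value
  chisq.test(table(cut(dat[, j], 2), cut(dat[, ref], 2)))$p.value
}
pvals <- sapply(2:ncol(mytable), s, dat = mytable)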