I want to do some statistical analysis in R on a column with the same name but different lengths originating from several data frames. I created a list:
my.list <- list(df1, df2, df3, df4)
Now, as some elements of the column of interest (say: my.col) contain the word "FAILED" instead of numbers, I replace it by 'NA':
for (i in 1:length(my.list)){
for (j in 1:length(my.list[[i]]$my.col)){
if (my.list[[i]]$my.col[j] %in% c("FAILED"))
{my.list[[i]]$my.col[j] <- 'NA'};
}
}
I am pretty sure that this is not the best solution for the problem, but at least it works. It does, however, cause warnings that in another column (not my.col) there are invalid factor levels that have been replaced by 'NA'. I have no idea why it considers columns other than my.col at all. Suggestions for improvement are highly appreciated.
Now, the remaining numbers contain a decimal comma instead of a point. Although I tried to eliminate this problem while importing the .csv file with dec=",", this does not work for columns that contain anything other than numbers (e.g. "FAILED"). So I have to replace the comma by the point, and this is what doesn't work for me. I tried:
for (i in 1:length(my.list)){
as.numeric(gsub(",", ".", my.list[[i]]$my.col))
}
This doesn't give any errors, but it also doesn't change anything, although if I type in e.g.
as.numeric(gsub(",", ".", my.list[[4]]$my.col))
it does what I want to do for the 4th element of the list. From my point of view, both should be the same. What's the problem with this?
Btw, I prefer not to delete the other columns from the data frames because I might need them in future for other analysis.
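One way to sidestep both replacements is to handle them at import time. The following is only a sketch: it assumes the files are semicolon-separated with decimal commas, and the inline text stands in for the real .csv file. Declaring "FAILED" as a missing-value marker through na.strings lets read.csv2 parse the column as numeric directly.

```r
# Made-up file contents standing in for one of the real .csv files
csv_text <- "id;my.col\n1;1,5\n2;FAILED\n3;2,25"

# read.csv2 defaults to sep = ";" and dec = ","; na.strings turns "FAILED" into NA
df <- read.csv2(text = csv_text, na.strings = c("NA", "FAILED"))

str(df$my.col)
```

With a real file, the text argument would be replaced by the file name.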
You can do this efficiently using the plyr package.
Note that in the example, I use the built in iris data.
Instead of replacing "FAILED" with NA, I replaced values of "versicolor".
Instead of replacing a comma with a period, I replaced an s with a w.
my.list <- list(iris, iris)
library(plyr)
my.list <- llply(.data = my.list,
                 function(x) {
                   x$Species <- as.character(x$Species)
                   x$Species[x$Species == "versicolor"] <- "NA"
                   x$Species <- gsub(pattern = "s",
                                     replacement = "w",
                                     x = x$Species)
                   x$Species <- as.factor(x$Species)
                   return(x)
                 })
The as.character call was added as an example of a way to circumvent problems with adding a level to a factor. The as.factor call ensures the column is returned as a factor with the new levels.
This will also give you the flexibility to convert from list to data.frame. Simply replace llply with ldply.
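For the original my.col problem, the same idea can be sketched in base R. The loop in the question computes the converted vector but never stores it, which is why nothing changes; assigning the result back is the missing step. The two small data frames below are hypothetical stand-ins for df1 to df4, and a real NA is used rather than the string 'NA'.

```r
# Hypothetical stand-ins for the real data frames; my.col is text because of "FAILED"
df1 <- data.frame(id = 1:3, my.col = c("1,5", "FAILED", "2,25"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:2, my.col = c("FAILED", "10,0"),
                  stringsAsFactors = FALSE)
my.list <- list(df1, df2)

my.list <- lapply(my.list, function(d) {
  x <- as.character(d$my.col)   # also covers the case where my.col is a factor
  x[x == "FAILED"] <- NA        # a real missing value, not the string "NA"
  d$my.col <- as.numeric(sub(",", ".", x, fixed = TRUE))
  d                             # return the whole data frame, other columns untouched
})
```

Only my.col is modified, so the other columns stay available for later analysis.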
Related
I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# The double brackets let you select a column by a name stored in a variable (here j)
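The same thing can be written without an explicit loop. This is a minimal sketch on a made-up myd (the column contents are assumptions, only the ati_ names come from the question): lapply runs unclass over the selected columns and the result is assigned back to those columns.

```r
# Made-up data; only the ati_* column names are taken from the question
myd <- data.frame(ati_1 = factor(c("low", "high", "low")),
                  ati_2 = factor(c("a", "b", "a")),
                  other = 1:3)

block <- c("ati_1", "ati_2")
myd[block] <- lapply(myd[block], unclass)  # replace each factor by its integer codes

str(myd)
```

Columns not listed in block are left as they were.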
Here are a few ways. We use CO2 which comes with R and has several factor columns. This unclasses those columns.
If you need some other criterion then
set ix to the names or positions or a logical vector defining those columns to be transformed in the base R solution
replace is.factor in the collapse solution with a vector of names or positions or a logical vector denoting the columns to convert
in the dplyr solution replace where(...) with the same names, positions or logical.
Code follows. In all of these the input is not overwritten, so you still have the input available unchanged if you want to rerun it from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
mutate(across(where(is.factor), unclass))
Depending on what you want, this might be sufficient; omit the as.data.frame if a matrix result is ok.
as.data.frame(data.matrix(CO2))
I have a list containing nine dataframes (called data), each of varying lengths and contents. Consistent across most of them, though, are columns containing information that I want to store in a separate dataframe for later use.
These columns are the following:
identifiers <- c("Organism Name", "Protein names", "Gene names", "Pathway", "Biological Process")
I want to iterate through each element of data to check if it contains the columns I'm interested in, then subset these columns as separate dataframes.
I first tried
lapply(data, '[', identifiers)
The problem with this is that not all of the dfs contain all of the identifiers listed above, so running this returns 'undefined columns selected'.
My next attempt was
lapply(data, function(x) if(identifiers %in% x) '[', identifiers)
which returned a list of 9 (corresponding to the 9 original dataframes) of class NULL. I think that this general method would work with proper execution, but I just can't figure it out.
Any help would be appreciated :)
Since identifiers is a vector of column names, some or all of which may be in each frame, we can do:
lapply(data, function(x) x[,intersect(names(x), identifiers),drop=FALSE])
with the understanding that some elements may have zero columns (if none are found).
Your use of if (identifiers %in% x) is not quite right for two reasons:
identifiers %in% x is looking for presence in the data, not in the names, it should be identifiers %in% names(x); and
if requires exactly one logical, but identifiers %in% names(x) is going to return a logical vector the same length as identifiers (i.e., not one). It needs to be summarized.
If it is true that if any of the columns are found, then you will always have all of them, then you can change my code above to be
lapply(data, function(x) if (all(identifiers %in% names(x))) x[,identifiers])
and frames without those columns will return NULL. My use above of intersect also works in this regard, the functional difference being in the case where a frame contains some but not all of them. Over to you which logic you prefer.
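To see the difference between the two behaviours, here is a small sketch on made-up frames (the contents of data are assumptions; only identifiers comes from the question). The first frame has some of the columns, the second has none.

```r
identifiers <- c("Organism Name", "Protein names", "Gene names",
                 "Pathway", "Biological Process")

# Hypothetical list of frames; check.names = FALSE keeps the spaces in the names
data <- list(
  a = data.frame(`Organism Name` = "E. coli", Pathway = "glycolysis",
                 score = 1, check.names = FALSE),
  b = data.frame(x = 1:2)
)

# Keep whichever identifier columns are present (possibly zero columns)
some <- lapply(data, function(x) x[, intersect(names(x), identifiers), drop = FALSE])

# Keep the columns only when every identifier is present, otherwise NULL
all_or_null <- lapply(data, function(x) {
  if (all(identifiers %in% names(x))) x[, identifiers, drop = FALSE]
})

names(some$a)
ncol(some$b)
```

If the NULL entries are unwanted, Filter(Negate(is.null), all_or_null) drops them.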
I came across a problem in my DataCamp exercise that basically asked "Remove the column names in this vector that are not factors." I know what they -wanted- me to do, and that was to simply do glimpse(df) and manually delete elements of the vector containing the column names, but that wasn't satisfying for me. I figured there was a simple way to store the column names of the dataframe that are factors into a vector. So, I tried two things that ended up working, but I worry they might be inefficient.
Example data Frame:
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))
My first solution was this:
vector1 <- names(select_if(df1, is.factor))
This worked, but select_if returns an entire tibble of a filtered dataframe and then gets the column names. Surely there's an easier way...
Next, I tried this:
vector2 <- colnames(df1)[sapply(df1,is.factor)]
This also worked, but I wanted to know if there's a quicker, more efficient way of filtering column names based on their type and then storing the results as a vector.
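Two base R variants of the second approach, offered as a sketch rather than a benchmark-backed recommendation. Like the sapply version, both only test each column with is.factor; vapply additionally guarantees that the result is a logical vector, and Filter hides the indexing. The names vector3 and vector4 are just illustrative.

```r
factorVar <- as.factor(LETTERS[1:10])
df1 <- data.frame(x = 1, y = 1:10, factorVar = sample(factorVar, 10))

# vapply is a stricter sapply: each result must be a single logical value
vector3 <- names(df1)[vapply(df1, is.factor, logical(1))]

# Filter keeps the columns for which is.factor returns TRUE
vector4 <- names(Filter(is.factor, df1))
```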
I work with data where most header names are very long strings. These are cryptic but contain important details that cannot be forgotten. Long column names are difficult to work with for various display reasons as well as programmatic ones. To help with this, I typically retain the original column names as Hmisc labels & rename the columns with uninformative names like V1, V2, V3... etc or with some truncated (but still long & often not unique) version of the long name.
library(Hmisc)
myDF <- read.csv("someFile.csv")
myLabels <- colnames(myDF)
label(myDF, self=FALSE) <- myLabels
colnames(myDF) <- paste0("V", 1:ncol(myDF))
I can now work with the short names V & still look up the labels to get the original names. However, this is still less than satisfactory... myDF is now composed of class "labelled" and contains character vectors although my data is numeric in nature. Converting to numeric or even subsetting myDF will cause the labels to be dropped. Does anyone have some better suggestions? In particular I need to subset data, & I also find indexing by number to be clumsy & error prone.
Due to large data relative to RAM, I cannot keep copies of both numeric & "labelled" data.frames. I have also tried creating hash objects using the hash package:
library(hash)
myHash <- hash(colnames(myDF), label(myDF))
Or via lists:
nameList <- list()
for(name in colnames(myDF)) {
nameList[[name]] <- label(myDF)[name]
}
But... I also find these unsatisfactory mostly because they can fall out of synch with myDF after various manipulations & they are not accessible from the same object. Perhaps I just need to be more diligent.
Lastly, I thought that perhaps a solution would be a custom class that contains a data.frame & some other data structures to know the very meaningless terse name, the verbose & non-unique nickname, & the true variable name. But this would require overloading all the indexing operators & is likely way over my head skill wise.
So, any other proposed solutions? Any help appreciated.
You could take a relational database kind of approach here. Create a separate data.frame that expresses the associations between the abbreviated and long names.
library(Hmisc)
myDF <- read.csv("someFile.csv")
LongNames <- colnames(myDF)
colnames(myDF) <- paste0("V", 1:ncol(myDF))
ShortNames <- colnames(myDF)
NameTable <- cbind(LongNames, ShortNames)
Even if your data is later manipulated, the associations between short and long names for variables should remain unchanged. Of course, each time you create a new variable that requires a long name, you'll need to add a new row to the NameTable, but you'd need to put that long name somewhere anyway.
To retrieve the long name easily using the short name, you could define a function for that purpose.
L = function(x){NameTable[which(ShortNames == x),1]}
L("V3") # gives long name of V3; the short name must be passed as a string
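A variation on the same lookup idea, sketched with a named character vector instead of a table. The data frame and its long headers are made up, and name_map is a hypothetical name. Because the lookup is keyed by short name, it still works after subsetting columns.

```r
# Made-up data frame standing in for the result of read.csv
myDF <- data.frame(a.very.long.cryptic.header.one = 1:3,
                   a.very.long.cryptic.header.two = 4:6,
                   a.very.long.cryptic.header.three = 7:9)

# Values are the long names, names are the short ones
name_map <- setNames(colnames(myDF), paste0("V", seq_len(ncol(myDF))))
colnames(myDF) <- names(name_map)

name_map[["V2"]]                 # long name of a single column

sub <- myDF[, c("V1", "V3")]     # subsetting does not disturb the lookup
unname(name_map[colnames(sub)])  # long names of the surviving columns
```

New columns still need an entry added to name_map by hand, as with NameTable.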
Here is what I've done so far. So, that's basically grabbing some tables off the internet using XML, putting them into a list of dataframes and then some mess trying (and failing) to format them in an efficient and consistent way.
I can't work out how to apply the same changes to all of the dataframes. I think I need to use llply, but I can't get it right. Overall I am trying to achieve:
Column names all legitimate R names using make.names, then use the str_replace_all towards the end of the file to strip all non-alpha characters so the names are the same
Next I want to remove all but the first four columns from all of the dataframes
Then I want to add a column with the title for each book. I guess I'll have to do this manually.
Finally, I want to do an rbind to join all of the dataframes together
What's really got me stumped is how to apply the same transformations to each dataframe in the list such as modifying their column names and cutting off rows. Is llply the right tool for the job? How do I use it?
So far the most I've been able to achieve is turning my list of dataframes into a list of vectors with the right names. I believe this is because when I tried using names() it returned the vector of correct names, rather than a dataframe with the correct names. This was my attempt:
tlist <- llply(tabs, function(x) as.data.frame(str_replace_all(make.names(names(x)), "[^[:alpha:]]", "")))
I don't think I'm a million miles away here, but I can't think how to get it to return the full df.
Use this instead:
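library(plyr)     # llply, ldply, rbind.fill
library(stringr)  # str_replace_all
f <- function(x)
{
  y <- x[, 1:4]  # keep only the first four columns
  names(y) <- str_replace_all(make.names(names(y)), "[^[:alpha:]]", "")
  y
}
result <- rbind.fill(llply(tabs, f))
EDIT: following @baptiste, this may be better: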
result <- ldply(tabs, f)
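For the remaining step in the question, adding a column with the book title, here is a sketch that swaps in base R (gsub, Map and do.call(rbind, ...)) for plyr and stringr. It assumes tabs is a named list whose names are the book titles; the two small tables and the helper name clean are made up for illustration.

```r
# Hypothetical scraped tables; the list names stand in for the book titles
tabs <- list(
  BookA = data.frame(`No.` = 1:2, `Title 1` = c("x", "y"), `Date!` = c("a", "b"),
                     Pages = 3:4, extra = 0, check.names = FALSE),
  BookB = data.frame(`No.` = 3L, `Title 1` = "z", `Date!` = "c",
                     Pages = 5L, extra = 0, junk = 1, check.names = FALSE)
)

clean <- function(x, book) {
  y <- x[, 1:4]                                              # first four columns only
  names(y) <- gsub("[^[:alpha:]]", "", make.names(names(y))) # strip non-alpha characters
  y$book <- book                                             # tag every row with its title
  y
}

# Map passes each table together with its name; rbind stacks the cleaned pieces
result <- do.call(rbind, Map(clean, tabs, names(tabs)))
```

Plain rbind needs identical column names in every piece, which the cleaning step is meant to guarantee; rbind.fill is more forgiving if it does not.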