I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is subsetrows <- which(is.na(mydata$reference)) but after that I'm stuck. I want something like levels(mydata[subsetrows,mydata$factor]) but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
filter(is.na(ref)) %>%
select(field) %>%
distinct()
Data:
d <- data.frame(
field = c("A", "B", "C", "A", "B", "C"),
ref = c(NA, "a", "b", NA, "c", NA)
)
I modified a suggestion in the comments by Marat to use the function unique that seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
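For completeness, the "drop unused levels" idea from the question can also be done in one line without copying anything out of the data frame by hand. The sketch below uses base R's droplevels(); the data frame is a made-up stand-in for mydata, with column names taken from the question.

```r
# Hypothetical data mirroring the question: a factor with levels A-Z
# and a reference column that is NA on some rows.
mydata <- data.frame(
  factor    = factor(c("P", "R", "Y", "A", "B", "P"), levels = LETTERS),
  reference = c(NA, NA, NA, 1, 2, NA)
)

# Subset the factor to the rows where reference is NA, then drop the
# levels that no longer occur in that subset.
na_levels <- levels(droplevels(mydata$factor[is.na(mydata$reference)]))
na_levels
```

Unlike unique(as.character(...)), which returns values in order of first appearance, this returns them in the factor's level order.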
I have created a data frame using rbind() to append two data frames with the same row names together. I am then trying to use the order() function to order the factor levels alphabetically. However, it is still treating the data frames as two separate objects, and ordering the first alphabetically, and then the second alphabetically separately.
Example:
df1 <- data.frame(site=c("A", "F", "C"))
df2 <- data.frame(site=c("B", "G", "D"))
new.df <- rbind(df1, df2)
new.df <- new.df[order(new.df$site),]
outcome:
site
A
C
F
B
D
G
I have looked at other methods of reordering data, for example using the arrange function from package dplyr, but have not had any success. Any suggestions of how to fix this?
Any help much appreciated.
Thanks
Avoid creation of factors by
df1 <- data.frame(site=c("A", "F", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(site=c("B", "G", "D"), stringsAsFactors = FALSE)
then the remaining stuff will work as expected.
I'm guessing you're not doing quite what you think you're doing there: the resulting new.df isn't a data frame any more, it's a factor. The result of order is to put it in the order of the levels of the factor (see levels(new.df$site)). So, if you really want to do it this way (i.e., keeping it as a factor rather than a character vector), you will need to reorder the levels first.
new.df$site <- factor(new.df$site, levels = sort(levels(new.df$site)))
new.df[order(new.df$site), ]
[1] A B C D F G
Levels: A B C D F G
But unless you really need it to be a factor from the start, I think you would be best advised to do what @Uwe Block suggests and, if necessary, turn it into a factor after you've used rbind and done the sorting.
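A minimal end-to-end sketch of the level-reordering approach follows. It sets stringsAsFactors = TRUE explicitly so the result does not depend on the R version's default, and it adds drop = FALSE (not in the original code) so that the sorted result stays a data frame; sorted.df is just an illustrative name.

```r
# Build the two data frames with factor columns, as in the question.
df1 <- data.frame(site = c("A", "F", "C"), stringsAsFactors = TRUE)
df2 <- data.frame(site = c("B", "G", "D"), stringsAsFactors = TRUE)
new.df <- rbind(df1, df2)

# Inspect the combined levels: they are not in alphabetical order.
levels(new.df$site)

# Reorder the levels alphabetically, then sort the rows.
# drop = FALSE keeps the one-column result as a data frame.
new.df$site <- factor(new.df$site, levels = sort(levels(new.df$site)))
sorted.df <- new.df[order(new.df$site), , drop = FALSE]
as.character(sorted.df$site)
```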
I want to change the class of multiple columns in an R data frame without either doing it one by one, or using a for loop (and noting this answer). I can do it with either of these methods, but they feel clumsy. Note that I don't necessarily want to change every column.
E.g. I have data frame mydf:
mydf <- data.frame("col1" = c(1, 2, 3),
"col2" = c("a", "b", "c"),
"col3" = c("a", "a", "b"), stringsAsFactors = FALSE)
I want to change columns two and three to class factor. (In reality I want to deal with many more than two columns...)
I can either do it column by column in my favourite way, e.g.:
mydf$col2 <- as.factor(mydf$col2)
mydf[, 3] <- as.factor(mydf[,3])
Or I could use a for loop:
for (i in 2:3) {
mydf[,i] <- as.factor(mydf[,i])
}
These work, but feel clunky and suboptimal.
Better ideas?
OK, I worked it out while writing the question, but figured it might as well go up in case it's of use to anyone in future:
mydf[,2:3] <- lapply(mydf[,2:3], as.factor)
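The same lapply idiom also works with column names or with a logical selector, which may be easier to maintain than numeric positions when there are many columns. The sketch below is a variation on the line above; cols, mydf2 and chr_cols are illustrative names.

```r
mydf <- data.frame(col1 = c(1, 2, 3),
                   col2 = c("a", "b", "c"),
                   col3 = c("a", "a", "b"),
                   stringsAsFactors = FALSE)

# Variant 1: select the columns to convert by name.
cols <- c("col2", "col3")
mydf[cols] <- lapply(mydf[cols], as.factor)

# Variant 2: select every character column, whatever it is called.
mydf2 <- data.frame(col1 = c(1, 2, 3),
                    col2 = c("a", "b", "c"),
                    stringsAsFactors = FALSE)
chr_cols <- vapply(mydf2, is.character, logical(1))
mydf2[chr_cols] <- lapply(mydf2[chr_cols], as.factor)

sapply(mydf, class)
```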
I have been grappling with the following problem for a while, as I need to load in, manipulate, and produce scores from new datasets as quickly as possible. I have defined a data dictionary containing a description of each variable class (e.g. numeric, factor, character, date) and, where applicable, a list of all possible factor levels:
DD <- data.frame(Var = c("a", "b", "c", "d"),
Class = c("Numeric", "Factor", "Factor", "Date"),
Levels = c(NA, "B1, B2, B3", "C1, C2", NA))
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01")
Ultimately, I intend to use model.matrix to produce a design matrix with a common set of indicator variables/ columns regardless of the actual factor levels observed in the particular dataset, so I can score up the data from a particular model.
I need to do these tasks as quickly as possible and, hence, I am trying to find a solution that avoids using lapply/ loops. Here is (a slightly convoluted version of) my existing solution for setting the factor levels, which is currently too slow for my requirements:
lapply(1:ncol(Data[,DD$Class=="Factor"]), function(i) {
factor( as.character( unlist( Data[,DD$Class=="Factor"][i])) ,
levels = unlist(strsplit(as.character(DD$Levels[DD$Class=="Factor"][i]), ", ")) )
})
Any suggestions for avoiding use of a loop here, if it is even possible, or any alternative solutions would be much appreciated!
Thanks!
Sorry that I don't have enough reputation to add this as a comment.
Can I ask:
1. What are the dimensions of your dataset?
2. What running time would be acceptable to you?
You could consider using Microsoft R Open (previously Revolution R Open), which optimises basic data manipulation.
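As a rough sketch for the data-dictionary question above: the level strings can be parsed once up front and then applied to all factor columns in a single assignment. This still iterates over columns (via Map), so it does not remove the loop entirely, but it avoids re-subsetting Data and DD on every iteration. The names is_fac, fac_vars, lev_list and mm are illustrative, and stringsAsFactors = FALSE is an added assumption so that the level strings stay character.

```r
DD <- data.frame(Var    = c("a", "b", "c", "d"),
                 Class  = c("Numeric", "Factor", "Factor", "Date"),
                 Levels = c(NA, "B1, B2, B3", "C1, C2", NA),
                 stringsAsFactors = FALSE)
Data <- data.frame(a = 5, b = "B1", c = "C2", d = "2015-05-01",
                   stringsAsFactors = FALSE)

# Parse the declared levels once, as a list named by variable.
is_fac   <- DD$Class == "Factor"
fac_vars <- DD$Var[is_fac]
lev_list <- setNames(strsplit(DD$Levels[is_fac], ", ", fixed = TRUE), fac_vars)

# Apply the full level sets to the matching columns in one assignment.
Data[fac_vars] <- Map(function(x, lv) factor(x, levels = lv),
                      Data[fac_vars], lev_list)

# The design matrix has a column for each declared non-baseline level,
# whether or not that level occurs in this particular dataset.
mm <- model.matrix(~ b + c, data = Data)
colnames(mm)
```

The work here scales with the number of factor columns rather than the number of rows, so for a wide dictionary the per-column cost of factor() itself is likely what remains.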
I have data in a data.frame and would like to aggregate some columns (using any general function), grouping by some others and keeping the remaining ones as they are (or even omitting them). The idea is similar to GROUP BY in SQL. As an example, let us assume we have
df <- data.frame(a=rnorm(4), b=rnorm(4), c=c("A", "B", "C", "A"))
and I want to sum (say) the values in column a and average (say) the values in column b, grouping by the symbols in column c. I am aware it is possible to achieve this using apply, cbind or similar, specifying the functions you want to use, but I was wondering whether there is a smarter (one-line) way to do so, especially using the aggregate function.
Sorry but I don't follow how dealing with more than one column complicates things.
library(data.table)
dt <- data.table(df)
dt[,.(sum_a = sum(a),mean_b= mean(b)),by = c]
like this?
mapply(Vectorize(function(x, y) aggregate(
df[, x], by=list(df[, 3]), FUN=y), SIMPLIFY = F),
1:2, c('sum', 'mean'))
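Since the question asks specifically about aggregate, here is one possible base-R sketch: call aggregate once per summary function with the formula interface and merge the results on the grouping column. The seed and the output names sum_a and mean_b are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
df <- data.frame(a = rnorm(4), b = rnorm(4), c = c("A", "B", "C", "A"))

# One aggregate() call per summary function; merge() joins the two
# results on their shared grouping column c.
res <- merge(aggregate(a ~ c, data = df, FUN = sum),
             aggregate(b ~ c, data = df, FUN = mean))
names(res) <- c("c", "sum_a", "mean_b")
res
```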
I am new to R. I've made a boxplot of my data but currently R is sorting the factors alphabetically. How do I maintain the original order of my data? This is my code:
boxplot(MS~Code,data=Input)
I have 40 variables that I wish to boxplot in the same order as the original data frame lists them. I've read that I may be able to set sort.names=FALSE to maintain the original order, but I don't understand where that piece of code would go.
Is there a way to redefine my Input before it goes into boxplot?
Thank you.
Re-create the factor with its levels in the order you want, as in line 3 below:
data(InsectSprays)
data <- InsectSprays
data$spray <- factor(data$spray, levels = c("B", "C", "D", "E", "F", "A"))
boxplot(count ~ spray, data = data, col = "lightgray")
The answer above is 98% of the way there.
set.seed(1)
# original order is E - A
Input <- data.frame(Code=rep(rev(LETTERS[1:5]),each=5),
MS=rnorm(25,sample(1:5,5)))
boxplot(MS~Code,data=Input) # plots alphabetically
Input$Code <- with(Input,factor(Code,levels=unique(Code)))
boxplot(MS~Code,data=Input) # plots in original order