I have a data frame with factor columns. Here is a tiny example:
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))
I can calculate the means of the numeric values that underlie the factor labels for each column:
mean(as.integer(dat$one))
[1] 1.5
But since there are very many columns in my data frame, I would like to avoid having to calculate all the individual means and would rather do something like:
colMeans(dat)
which doesn't work, since the columns are factors, or
colMeans(as.integer(dat))
which doesn't work either.
So how can I easily calculate the means of all factor columns, without a loop or individually calculating them all?
Do I really have to change the class of all columns?
data.matrix is pretty much designed for such a task. It also skips numeric and integer columns, if present, and hence reduces memory usage, though the conversion to a matrix can sometimes be an overhead. So as long as you don't have character columns, this should be pretty straightforward:
colMeans(data.matrix(dat))
# one two
# 1.5 1.5
We can use lapply
lapply(dat, function(x) mean(as.integer(x)))
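Or with dplyr (1.0.0 or later, where across() replaces the now-deprecated summarise_each() and funs())
library(dplyr)
dat %>%
  summarise(across(everything(), ~ mean(as.integer(.x))))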
For big datasets, it may be better to calculate the mean by each column separately as converting to matrix may also create memory issues.
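As a base R variant of the lapply approach above, vapply returns a named numeric vector instead of a list. This is only a sketch on the question's tiny example; `factor_means` is a name chosen here for illustration.

```r
# Tiny example from the question: two factor columns
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))

# as.integer() on a factor returns its underlying integer codes;
# vapply() also checks that each column yields exactly one numeric value
factor_means <- vapply(dat, function(x) mean(as.integer(x)), numeric(1))
factor_means
```

Because the columns are processed one at a time, no intermediate matrix of the whole data frame is built.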
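Write a simple function that uses a for loop to write all of the values into a vector.
dat <- data.frame(one = c(1:10), two = c(1:10))
# named col_means_loop so that it does not mask base::colMeans
col_means_loop <- function(tablename){
  colmean <- numeric(ncol(tablename))   # preallocate the result
  for(i in seq_len(ncol(tablename))){
    # as.integer() makes this work for factor columns as well
    colmean[i] <- mean(as.integer(tablename[,i]))
  }
  return(colmean)
}
col_means_loop(dat)
Hope this works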
You can also use the data.table package, which is faster than data.frame. If your data is big, e.g. millions of observations, then data.table can help to optimize run time.
Below is the code:
library(data.table)
dat <- data.table(one = factor(c("a", "b")), two = factor(c("c", "d")))
factorCols <- c("one", "two")
dat[, lapply(.SD, FUN=function(x) mean(as.integer(x))), .SDcols=factorCols]
I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <-data$sleep64 + data$wakeup64,data$total94<-data$sleep94+data$wakeup94.
Without a loop, I need to write this code 94 times. I hope someone could give me some tips on this. It doesn't have to be a loop, but an easier way to do this.
FYI, all of the variables are numeric and have about 30% missing values. The missing values are random and could be anywhere. A missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
              paste0(rep(c("sleep", "wakeup"), each = 94), 1:94))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: the original setnames(long_data, -1L, ...) form relied on negative
#  indices, which at the time of writing only worked in the development
#  version of data.table (https://github.com/Rdatatable/data.table);
#  naming the old columns explicitly works in released versions too
setnames(long_data, c("variable", "value1", "value2"), c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add it together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (not clear from the OP's post). In that case, we may need to convert the dataset to numeric class so that the '' will be automatically converted to NA
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
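The grep approach above assumes the sleep and wakeup columns appear in matching order. A minimal sketch that avoids this assumption builds the column names explicitly, so each total is paired by day number. The two-day data frame and the names `days`, `sleep_cols` and `wake_cols` are made up for illustration.

```r
# Hypothetical two-day example with some missing values
data <- data.frame(sleep1  = c(7, NA), sleep2  = c(6, 8),
                   wakeup1 = c(1, 2),  wakeup2 = c(NA, 1))

days       <- 1:2
sleep_cols <- paste0("sleep", days)
wake_cols  <- paste0("wakeup", days)

# Element-wise addition of two equally shaped data frames;
# a total is NA whenever either of its parts is NA
data[paste0("total", days)] <- data[sleep_cols] + data[wake_cols]
data
```

With 94 days, only `days <- 1:94` would change. Whether a missing part should give NA or be treated as 0 depends on the analysis.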
Having data in a data.frame, I would like to aggregate some columns (using any general function), grouping by some others and keeping the remaining ones as they are (or even omitting them), much like GROUP BY in SQL. As an example, let us assume we have
df <- data.frame(a=rnorm(4), b=rnorm(4), c=c("A", "B", "C", "A"))
and I want to sum (say) the values in column a and average (say) the values in column b, grouping by the symbols in column c. I am aware it is possible to achieve this using apply, cbind or similar, specifying the functions you want to use, but I was wondering if there is a smarter (one-line) way to do so, especially using the aggregate function.
Sorry but I don't follow how dealing with more than one column complicates things.
library(data.table)
dt <- data.table(df)
dt[,.(sum_a = sum(a),mean_b= mean(b)),by = c]
like this?
# Map pairs each column index with its summary function and returns a list of data frames
Map(function(x, y) aggregate(df[, x], by = list(df[, 3]), FUN = y),
    1:2, c('sum', 'mean'))
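Since the question asks about aggregate specifically, one possible base R sketch is to call aggregate once per summary function and merge the results on the grouping column. The seed and the names `sums`, `means` and `res` are assumptions made here for illustration.

```r
set.seed(1)  # assumed seed, only to make the example reproducible
df <- data.frame(a = rnorm(4), b = rnorm(4), c = c("A", "B", "C", "A"))

# One aggregate() call per summary function, using the formula interface
sums  <- aggregate(a ~ c, data = df, FUN = sum)
means <- aggregate(b ~ c, data = df, FUN = mean)

# Join the two summaries on the grouping column
res <- merge(sums, means, by = "c")
res
```

This gives one row per level of c, with the sum of a and the mean of b side by side.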
New at R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns, depending on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10),8 , replace=FALSE))))
I have two dataframes and I would like to do independent 2-group t-tests on the rows (i.e. t.test(y1, y2) where y1 is a row in dataframe1 and y2 is matching row in dataframe2)
What's the best way of accomplishing this?
EDIT:
I just found the format: dataframe1[i,] dataframe2[i,]. This will work in a loop. Is that the best solution?
The approach you outlined is reasonable, just make sure to preallocate your storage vector. I'd double check that you really want to compare the rows instead of the columns. Most datasets I work with have each row as a unit of observation, and the columns represent separate responses/columns of interest. Regardless, it's your data, so if that's what you need to do, here's an approach:
#Fake data
df1 <- data.frame(matrix(runif(100),10))
df2 <- data.frame(matrix(runif(100),10))
#Preallocate results
testresults <- vector("list", nrow(df1))
#For loop
for (j in seq_len(nrow(df1))){
  # unlist() turns each one-row data frame into a plain numeric vector
  testresults[[j]] <- t.test(unlist(df1[j,]), unlist(df2[j,]))
}
You now have a list that is as long as you have rows in df1. I would then recommend using lapply and sapply to easily extract things out of the list object.
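As a sketch of that extraction step, the following rebuilds the list of test results and pulls out the p-values and t statistics with sapply. The seed and the names `p_values` and `t_stats` are assumptions for illustration; `p.value` and `statistic` are components of the object returned by t.test.

```r
set.seed(42)  # assumed seed, only for reproducibility
df1 <- data.frame(matrix(runif(100), 10))
df2 <- data.frame(matrix(runif(100), 10))

# One t-test per pair of matching rows
testresults <- vector("list", nrow(df1))
for (j in seq_len(nrow(df1))) {
  testresults[[j]] <- t.test(unlist(df1[j, ]), unlist(df2[j, ]))
}

# Each element is an "htest" object; pick out single components
p_values <- sapply(testresults, function(res) res$p.value)
t_stats  <- sapply(testresults, function(res) unname(res$statistic))
head(data.frame(row = seq_along(p_values), t = t_stats, p = p_values))
```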
It would make more sense to have your data stored as columns.
You can transpose a data.frame by
df1_t <- as.data.frame(t(df1))
df2_t <- as.data.frame(t(df2))
Then you can use mapply to cycle through the two data.frames a column at a time
t.test_results <- mapply(t.test, x= df1_t, y = df2_t, SIMPLIFY = F)
Or you could use Map which is a simple wrapper for mapply with SIMPLIFY = F (Thus saving key strokes!)
t.test_results <- Map(t.test, x = df1_t, y = df2_t)
I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
If it were an "ordered factor" things would be different. Which is not to say I like "ordered factors", I don't, only to say that some relationships are defined for 'ordered factors' that are not defined for "factors". Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors which is alphabetical lexical order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or if you want to test for factors first and return as you expect then:
sapply( df, function(x) if("factor" %in% class(x) ) {
max(as.numeric(as.character(x)))
} else { max(x) } )
#Darrens comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
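The coercion described above can be seen on a small made-up data frame. This is only a sketch; the column names and values are hypothetical, and stringsAsFactors = FALSE is set explicitly so the example behaves the same across R versions.

```r
# Hypothetical mixed-type data frame
d <- data.frame(animal = c("ANT", "ZEBRA"), score = c(-99.5, 3),
                stringsAsFactors = FALSE)

# apply() works on as.matrix(d), which can hold only one type,
# so the numeric column is turned into strings as well
m <- as.matrix(d)
is.character(m)

# lapply() visits each column separately, so each keeps its own type
col_max <- lapply(d, max)
str(col_max)
```

Using lapply rather than sapply here also avoids collapsing the per-column results into a single character vector.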
In R versions before 4.0.0, the default behavior when you create a data frame is for character columns to be stored as factors (since R 4.0.0, stringsAsFactors defaults to FALSE). Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R is assuming that you've created an unordered factor.
In those older versions you can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
d[4,] <- NA
#Similar function to DWin's answer
fun <- function(x){
if(is.numeric(x)){max(x,na.rm = 1)}
else{max(as.character(x),na.rm=1)}
}
#Use colwise from plyr package
library(plyr)
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of the numerical columns and the frequency of the top levels of the factor columns.
The best way to do this is to avoid apply(), which coerces the entire data frame to a matrix, possibly losing information. (lapply and sapply work column by column and do not coerce the data frame itself.)
If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.
Less interesting answer: we can apply on each column with a for-loop (parse_guess is from readr):
for (i in seq_len(ncol(t))) { t[[i]] <- parse_guess(t[[i]]) }
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
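One idiom that does this in base R is assigning into `df[]`, which replaces the columns while keeping the data frame structure (the same pattern is used with lapply in another answer on this page). A minimal sketch, with a made-up data frame and type.convert standing in for whatever per-column function is needed:

```r
# Hypothetical data frame read in as all-character columns
df <- data.frame(x = c("1", "2", "3"), y = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# lapply() returns a list of converted columns; assigning to df[]
# puts them back into the existing data frame instead of replacing it
df[] <- lapply(df, function(col) type.convert(col, as.is = TRUE))
str(df)
```

Writing `df <- lapply(...)` without the empty brackets would instead leave a plain list.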
building on #ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
These days loops are just as fast, so this is more than sufficient:
df <- data.frame(x = c("1","2","3"), y = c("1","3","3"))
col_max <- numeric(ncol(df))
for (i in seq_along(df)) {
  # as.character() first, so factor columns give their labels rather than their codes
  col_max[i] <- max(as.numeric(as.character(df[[i]])))
}
A solution using retype() from hablar to coerce factors to character or numeric type depending on feasibility. I'd use dplyr for applying max to each column.
Code
library(dplyr)
library(hablar)
# retype() simplifies each column's type, e.g. always removes factors
d <- d %>% retype()
# Check max for each column
d %>% summarise_all(max)
Result
Note the new column types.
v1 v2 v3 v4
<dbl> <chr> <dbl> <chr>
1 0.974 j 1.09 J
Data
# Sample data borrowed from #joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df
my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))