I have read a lot of similar questions, but cannot get any of the code to work. I simply want to add a row to the bottom of my data frame (df) that has the min value for each column for columns 2:8. Below is my code (that works) to add a total line, but I want a min line.
df[(nrow(df)+ 1),(2:8)] <- colSums(df[,2:8], na.rm=TRUE)
I have tried to get colMins in package matrixStats to work, but can't for some reason. Any help would be appreciated!
In base R, you can use sapply() to create an atomic vector of column minimums and then rbind() to attach it to the original data. I added an NA for the first value, since we need something there to add it to the original data.
rbind(df, c(NA, sapply(df[2:8], min, na.rm = TRUE)))
If the data frame has exactly 8 columns, df[-1] can be used in place of df[2:8].
And for a speed increase we can use vapply() over sapply() because we know the result will be a single numeric value.
rbind(df, c(NA, vapply(df[-1], min, 1, na.rm = TRUE)))
Update: In response to your comment on the other answer - to get "MIN" in the first column and the minimum values in all the rest, we can adjust the call to a named list and do it all in one go. This way we don't mix the column classes (character and numeric) and end up with unexpected classes in the columns of the resulting data.
rbind(
df,
c(setNames(list("MIN"), names(df)[1]), lapply(df[-1], min, na.rm = TRUE))
)
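As a minimal sketch of why the named-list form helps, the snippet below runs the same call on a small made-up data frame. The name toy, its columns, and its values are hypothetical, and the first column is assumed to be character already.

```r
# Hypothetical toy data; 'toy' and its columns are illustrative only
toy <- data.frame(ID = c("a", "b", "c"),
                  x  = c(3, 1, 2),
                  y  = c(9L, 7L, 8L),
                  stringsAsFactors = FALSE)

# Build the new row as a named list so each column keeps its own type
min_row <- c(setNames(list("MIN"), names(toy)[1]),
             lapply(toy[-1], min, na.rm = TRUE))

with_min <- rbind(toy, min_row)

# One class per column: the numeric columns are not turned into character
sapply(with_min, class)
```

Passing a plain character vector such as c("MIN", ...) instead would mix the label and the numbers in one atomic vector before rbind() ever sees them.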
We can try colMins from library(matrixStats)
library(matrixStats)
rbind(df, c(NA,colMins(as.matrix(df[2:8]))))
Update
To replace NA with 'MIN',
rbind(df, c('MIN',as.list(colMins(as.matrix(df[2:8])))))
Or another approach would be to convert to matrix and use addmargins
addmargins(`row.names<-`(as.matrix(df[-1]), df$ID), 1, FUN=min)
data
set.seed(24)
df <- cbind(ID= 1:10,as.data.frame(matrix(sample(1:9, 7*10,
replace=TRUE), ncol=7)))
Related
I'm having trouble trying to standardize my data.
So, first things first, I create the dataframe object with my data, with my desired row names (and I remove the 1st column, as it is not needed).
EXPGli <-read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names=2)
EXPGli <- EXPGli[,-1]
EXPGli <- as.data.frame(EXPGli)
Then, I am supposed to convert all the columns to Z-score (each column = gene expression values; each row = sample) -> the idea here is to convert every gene expression data to a Z-score value of it for each cell
Z_score <- function(x) {(x-mean(x))/ sd(x)}
apply(EXPGli, 2, Z_score)
Which returns me [ reached 'max' / getOption("max.print") -- omitted 1143 rows ]
And now every cell of my df is NA.
Indeed, there are several NAs in the dataset, some full rows and even some columns.
I tried several approaches to remove NAs
EXPGli <- na.omit(EXPGli)
EXPGli %>% drop_na()
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])
na.exclude(EXPGli)
Yet apparently none of these work. Additionally, calling is.na(EXPGli)
returns FALSE for all fields.
I would like to understand what I am doing wrong here. It seems that the issue might be NAs not being recognized by R as NA, but I couldn't find a solution for this. Any input is very much appreciated, thanks in advance!
You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function, otherwise these calls would return NAs for any vector with NAs in it.
Z_score <- function(x) {(x-mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
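A small sketch of the difference, using a made-up vector rather than the expression data. The names z_naive, z_narm, and z_scale are hypothetical. Note that a column that is entirely NA, or one with zero variance, will still come out as NA or NaN even with na.rm = TRUE.

```r
# Hypothetical vector with one missing value
x <- c(1, 2, NA, 4)

z_naive <- function(x) (x - mean(x)) / sd(x)
z_narm  <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

z_naive(x)  # every element is NA, because mean(x) and sd(x) are both NA
z_narm(x)   # only the originally missing element stays NA

# Base R's scale() also centres and scales while skipping NAs
z_scale <- as.vector(scale(x))
```

For a whole data frame of numeric columns, scale(EXPGli) may therefore be a shorter route than apply() with a hand-written function.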
I need to know the meaning of each part of this command. It is used to find the mean of each row, but the data also contains a character-type column, which is why the code is written this way.
rowMeans(sapply(iris, function(x) as.numeric(as.character(x))), na.rm = T)
We can create an index to check the column type and then apply the rowMeans on the numeric columns
i1 <- sapply(iris, is.numeric)
rowMeans(iris[i1], na.rm = TRUE)
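To unpack the original one-liner, here is a step-by-step sketch on the built-in iris data. The names m, via_coercion, and via_index are hypothetical; suppressWarnings() is only there to silence the "NAs introduced by coercion" warning.

```r
# Step 1: as.character() turns each column (including the factor Species) into text
# Step 2: as.numeric() parses that text; non-numeric text such as "setosa" becomes NA
# Step 3: sapply() binds the resulting numeric vectors into a 150 x 5 matrix
m <- suppressWarnings(
  sapply(iris, function(x) as.numeric(as.character(x)))
)

# Step 4: rowMeans() averages each row; na.rm = TRUE skips the all-NA Species column
via_coercion <- rowMeans(m, na.rm = TRUE)

# Selecting the numeric columns up front gives the same row means
via_index <- rowMeans(iris[sapply(iris, is.numeric)], na.rm = TRUE)
```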
I have a data set with Air Quality Data. The Data Frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this Data Frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
if(is.na(data.frame[i,1]) == FALSE){
New.Vec <- c(x[i,1])
}
}
print(mean(New.Vec))
I expected the output to be the mean. Though the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that you want to fill the NA elements with the mean of the corresponding column. By default, na.aggregate uses FUN = mean.
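For comparison, a base R sketch of the same idea without zoo. This is not na.aggregate itself, just a hand-rolled equivalent under the same assumption that every column is numeric; the data values are made up.

```r
# Hypothetical all-numeric data frame with gaps
df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))

# Replace each NA with the mean of the non-missing values in its column
df1[] <- lapply(df1, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
```

Assigning into df1[] keeps df1 a data frame rather than replacing it with the list that lapply() returns.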
I can't see your data, but it is probably something like this. The vector needs to be initialized before the loop. It is usually better to avoid loops in R when you can.
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
if(!is.na(myDataFrame[i,1])){
New.Vec <- c(New.Vec, myDataFrame[i,1])
}
}
print(mean(New.Vec))
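If the exercise requires control structures, another option is to keep a running total and count instead of growing a vector. The sketch below uses the reproducible y from the question; total, n, and loop_mean are hypothetical names.

```r
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)

total <- 0  # running sum of non-missing values
n     <- 0  # how many non-missing values were seen

for (i in seq_len(nrow(x))) {
  if (!is.na(x[i, 1])) {
    total <- total + x[i, 1]
    n <- n + 1
  }
}

# Guard against a column with no usable values
loop_mean <- if (n > 0) total / n else NA_real_
print(loop_mean)  # 7.5, the same as mean(y, na.rm = TRUE)
```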
I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real dataset is bigger and a mix of mostly numerical, but some character variables, but I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I ask for rowSums and cbind it to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to implement duplicates.
Simply, I'm looking for When id, key, and num are duplicated - keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
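If the reordering is a concern, a variation on the same idea is to work with row indices and sort them back at the end. This is a sketch under the assumption that "unfilled" always means NA (so rowSums(!is.na(...)) stands in for the grepl step); completeness, ord, keep, and Deduped are hypothetical names.

```r
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

# Count non-missing cells per row (assumes "unfilled" always means NA)
completeness <- rowSums(!is.na(Original))

# Row indices sorted from least to most complete
ord <- order(completeness)

# Scanning from the most complete end, keep the first row seen for each key
keep <- ord[!duplicated(Original[ord, c("id", "key", "num")], fromLast = TRUE)]

# sort() restores the original row order
Deduped <- Original[sort(keep), ]
```

On the example data this keeps rows 1, 3, 4, 6, and 8, which matches the Finished data frame in the question apart from the row names.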
You can aggregate your data and select the row with max score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present),
v4 = v4[which.max(present)],
v5 = v5[which.max(present)]
)
I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:
apply(t,2,max,na.rm=1)
It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".
I then tried this:
sapply(t,max,na.rm=1)
but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.
BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.
If it were an "ordered factor" things would be different. Which is not to say I like "ordered factors", I don't, only to say that some relationships are defined for 'ordered factors' that are not defined for "factors". Factors are thought of as ordinary categorical variables. You are seeing the natural sort order of factors which is alphabetical lexical order for your locale. If you want to get an automatic coercion to "numeric" for every column, ... dates and factors and all, then try:
sapply(df, function(x) max(as.numeric(x)) ) # not generally a useful result
Or if you want to test for factors first and return as you expect then:
sapply( df, function(x) if("factor" %in% class(x) ) {
max(as.numeric(as.character(x)))
} else { max(x) } )
@Darren's comment does work better:
sapply(df, function(x) max(as.character(x)) )
max does succeed with character vectors.
The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
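A minimal sketch of that difference on a made-up two-column data frame. The names d, by_apply, and by_lapply are hypothetical.

```r
# Hypothetical mixed-type data frame
d <- data.frame(animal = c("ant", "zebra"),
                score  = c(-99.5, 3),
                stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix first, so every cell is character
by_apply <- apply(d, 2, max)
is.character(by_apply)

# lapply() visits each column as its own vector, so types are preserved
by_lapply <- lapply(d, max)
class(by_lapply$score)
```

The list returned by lapply() is what keeps the numeric maximum numeric; simplifying it to a single atomic vector would bring the coercion back.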
In R versions before 4.0.0, the default behavior when you create a data frame is for character columns to be stored as factors (since 4.0.0 the default is stringsAsFactors = FALSE). Unless you specify that it is an ordered factor, operations like max and min will be undefined, since R is assuming that you've created an unordered factor.
You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.
Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:
#Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
d[4,] <- NA
#Similar function to DWin's answer
fun <- function(x){
if(is.numeric(x)){max(x,na.rm = 1)}
else{max(as.character(x),na.rm=1)}
}
#Use colwise from plyr package
library(plyr)
colwise(fun)(d)
v1 v2 v3 v4
1 0.8478983 j 1.999435 J
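For readers who would rather not load plyr, the same idea can be sketched in base R with lapply() and as.data.frame(). The data below is a small deterministic stand-in with made-up values, and col_max and res are hypothetical names.

```r
# Deterministic stand-in for the test data above (values are made up)
d <- data.frame(v1 = c(0.2, 0.9, 0.5),
                v2 = factor(c("a", "c", "b")),
                v3 = c(-1, 0, 2))
d[2, ] <- NA

# Numeric columns use max() directly; everything else is compared as text
col_max <- function(x) {
  if (is.numeric(x)) max(x, na.rm = TRUE)
  else max(as.character(x), na.rm = TRUE)
}

# lapply() returns one value per column; as.data.frame() makes a one-row frame
res <- as.data.frame(lapply(d, col_max), stringsAsFactors = FALSE)
res
```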
If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of numeric columns, and the frequencies of the top levels of factor columns.
One way to do this is to avoid apply(), which coerces the entire data frame to a matrix and can lose type information; lapply() and sapply() work column by column instead.
If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:
t %>% mutate_all(as.numeric)
Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."
t %>% (colwise(as.numeric))
In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.
Less interesting answer: we can apply on each column with a for-loop:
library(readr)  # provides parse_guess()
for (i in seq_along(t)) { t[[i]] <- parse_guess(t[[i]]) }  # iterate over columns, not rows
I don't know of a good way of doing assignment with *apply while preserving data frame structure.
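One pattern that may fit here is assigning the result of lapply() into the data frame with empty brackets, which replaces the columns but keeps the data frame itself. The sketch below pairs it with type.convert() from base R; chr_df and its contents are hypothetical.

```r
# Hypothetical data frame read in as all-character
chr_df <- data.frame(a = c("1", "2", "3"),
                     b = c("x", "y", "z"),
                     c = c("1.5", "2.5", NA),
                     stringsAsFactors = FALSE)

# Assigning into chr_df[] replaces the columns but keeps the data frame structure
chr_df[] <- lapply(chr_df, type.convert, as.is = TRUE)

# Each column now has the type its contents suggest
sapply(chr_df, class)
```

as.is = TRUE keeps text columns as character instead of converting them to factors.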
Building on @ltamar's answer:
Use summary and munge the output into something useful!
library(tidyr)
library(dplyr)
df %>%
summary %>%
data.frame %>%
select(-Var1) %>%
separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
rename(column_name=Var2) %>%
mutate(value=as.numeric(value),
metric = trimws(metric,'both')
) %>%
filter(!is.na(value)) -> metrics
It's not pretty and it is certainly not fast but it gets the job done!
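If only the min and max are needed, a base R sketch that skips the summary() parsing might look like the following. It restricts itself to numeric columns, which is an assumption about what is wanted; num and metrics are hypothetical names.

```r
df <- head(mtcars)
df$string <- c("a", "b", "c", "d", "e", "f")

# Keep only the numeric columns, since min/max of text is a different question
num <- df[sapply(df, is.numeric)]

# One row per column, with the extremes as ordinary numeric columns
metrics <- data.frame(
  column_name = names(num),
  min = vapply(num, min, numeric(1), na.rm = TRUE),
  max = vapply(num, max, numeric(1), na.rm = TRUE),
  row.names = NULL
)
metrics
```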
These days loops are often fast enough, so something like this may be sufficient:
# 'dat' is an illustrative name; the data frame must be stored before assigning into it
dat <- data.frame(a = c("1","2","3"), b = c("1","3","3"))
for (i in seq_along(dat)) {
  dat[, i] <- max(as.numeric(dat[, i]))
}
A solution using retype() from hablar to coerce factors to character or numeric type depending on feasibility. I'd use dplyr for applying max to each column.
Code
library(dplyr)
library(hablar)
# retype() simplifies each column's type, e.g. always removes factors
d <- d %>% retype()
# Check max for each column
d %>% summarise_all(max)
Result
Note the new column types.
v1 v2 v3 v4
<dbl> <chr> <dbl> <chr>
1 0.974 j 1.09 J
Data
# Sample data borrowed from @joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10],
v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df
# min/max per column; because 'string' is character, unlist() returns character vectors
my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))