I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each line, for each of these columns, I have to identify if one of them is not NA. (basically any(!is.na(df[namecolumns])) for each line), to then do a subset for the ones that are TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but it's only for the first line.
I could easily do a for loop, which is my first reflex as a programmer and because it works for the first line, but I'm sure it's not the R way and that there is a way to do this with an "apply" (lapply, mapply, sapply, tapply or other), but I can't figure out which one and how.
Thank you.
try using apply over the first dimension (rows):
apply(df, 1 function(x) any(!is.na(x[namescolumns])))
The results will come back transposed, and so, you might want to wrap the whole statement inside of t(.)
You can use a combination of lapply and Reduce
has.na.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[name])))
to get a vector of whether or not there are NA values in any of the columns in colnames, which can in turn be used to subset the data.
df[has.any.na,]
For example. Given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
Related
Given the below dataframe
df <- data.frame(cbind(seq(1:4),rep(letters[seq(1:3)],4)))
X1 X2
1 a
2 b
3 c
4 a
1 b
2 c
3 a
4 b
1 c
2 a
3 b
4 c
I would like to summarize unique X2s by X1. For example,
1 a,b,c
2 b,c,a
3 c,a,b
4 a,b,c
I am very close. I use the following code:
'summary <- aggregate(df$X2, list(df$X1),FUN=unique)`
which produces
Group.1 X
1 1,2,3
2 2,3,1
3 3,1,2
4 1,2,3
(the index of the list). What is the most efficient way to get my desired result?
I am certain there is an easy solution and I've tried searching, but I must not be using the correct search terms. Thank you in advanced.
We can use toString to paste the elements
aggregate(X2~X1, unique(df), toString )
Or if we need to keep it as list
aggregate(X2~X1, transform(unique(df), X2 = as.character(X2)), list)
As the OP also mentioned the efficient approach
library(data.table)
unique(setDT(df))[, .(X2 = toString(X2)), by = X1]
Regarding the creation of data.frame, it is easier, compact and error-free way to do without using cbind with data.frame. The main reason is that cbind converts to a matrix and matrix can have only a single class. So, if there is a single character column or elements, all the elements are converted to character. With as.data.frame, by default the stringsAsFactors=TRUE, so the columns are converted to factor class.
df <- data.frame(X1= 1:4, X2= rep(letters[1:3],4), stringsAsFactors= FALSE)
The above code gets the intended output. Note that seq is not needed when we use :
I am trying to be lazier than ever with R and I was wondering to know if there is a chance to drop columns from a data.frame by using a condition.
For instance, let's say my data.frame has 50 columns.
I want to drop all the columns that share each other
mean(mydata$coli)... = mean(mydata$coln) = 0
How would you write this code in order to drop them all at once? Because I use to drop columns with
mydata2 <- subset(mydata, select = c(vari, ..., varn))
Obviously not interesting because of the need of manual data checking.
Thank you all!
Something similar as #akrun using lapply
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
mydata[lapply(mydata, mean)!=0]
# col2
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
We can use colMeans to get the mean of all the columns as a vector, convert that to a logical index (!=0) and subset the dataset.
mydata[colMeans(mydata)!=0]
Or use Filter with f as mean. If the mean of a column is 0, it will be coerced to FALSE and all others as TRUE to filter out the columns.
Filter(mean, mydata)
data
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
Part of a function I'm working on uses the following code to take a data frame and reorder its columns on the basis of the largest (absolute) value in each column.
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)])))
For the most part, this works fine, but with the dataset I'm working on, I occasionally get data that looks like this:
a <- rnorm(10,5,7); b <- rnorm(10,0,1); c <- rep(1,10)
dfm <- data.frame(A = a, B = b, C = c)
> dfm
A B C
1 0.6438373 -1.0487023 1
2 10.6882204 0.7665011 1
3 -16.9203506 -2.5047946 1
4 11.7160291 -0.1932127 1
5 13.0839793 0.2714989 1
6 11.4904625 0.5926858 1
7 -5.9559206 0.1195593 1
8 4.6305924 -0.2002087 1
9 -2.2235623 -0.2292297 1
10 8.4390810 1.1989515 1
When that happens, the above code returns a "non-numeric argument to mathematical function" error at the abs() step. (And if I get rid of the abs() step because I know, due to transformation, my data will be all positive, order() returns: "unimplemented type 'list' in 'orderVector1'".) This is because which() returns all the 1's in column C, which in turn makes apply() spit out a list, rather than a nice tidy vector.
My question is this: How can I make which() JUST return one value for column C in this case? Alternately, is there a better way to write this code to do what I want it to (reorder the columns of a matrix based on the largest value in each column, whether or not that largest value is duplicated) that won't have this problem?
If you want to select just the first element of the result, you can subset it with [1]:
ord <- order(abs(apply(dfm,2,function(x) x[which(abs(x) == max(abs(x)), arr.ind = TRUE)][1])))
To order the columns by their maximum element (in absolute value), you can do
dfm[order(apply(abs(dfm),2,max))]
Your code, with #CarlosCinelli's correction, should work fine, though.
I have a data.frame that can contains N columns (N defined at runtime), and I want to get the rows within the data frame that satisfy N-1 conditions, in other words I want to get only the rows with a specific value for the first N-1 columns.
For instance if I have a data frame with four columns (A,B,C,D) and five rows:
A B C D
1 2 3 4
9 9 9 9
1 2 9 5
4 3 2 1
1 2 3 8
I would get all the rows with A==1 & B==2 & C==3, i.e:
A B C D
1 2 3 4
1 2 3 8
But as said, the data frame can be composed of any amount of rows and columns (defined at runtime), and the values of the conditions may change.
I implemented this function (simplified):
getRows<-function(dataFrame, values) {
conditions=rep(TRUE, dim(dataFrame)[1])
for (k in 1:length(values)) {
conditions=conditions&(dataFrame[,k]==values[k])
}
return(dataFrame[conditions,])
}
Of course, this assumes the values in the values vector are sorted with respect to the columns order of the data frame, and that the length of the vector is N-1.
The function works but I've the feeling that it is not really efficient to create the vector of boolean, evaluate boolean expressions in this way and so on... especially if the data frame contains many data.
Another solution that I found is:
getRows<-function(dataFrame, values) {
tmp=dataFrame
for (k in 1:length(values)) {
tmp=tmp[tmp[,k]==values[k],]
}
return(tmp)
}
Basically this 'reduces' the data frame by filtering out all the rows that not satisfy each condition. But I think this is even worst, because it creates a new data frame object for each condition (ok always smaller, but anyway...).
So my question is: is there a method to do that more efficiently?
one possibility:
# if you are only checking for equalities
f <- function(df, values){
# values must be a list with the columns names of df as names and the conditions
# if you
y <- paste(names(values), unlist(values), sep="==", collapse=" & ")
return(df[eval(parse(text=y), envir=df),])
}
l <- as.vector(1:3, "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
# you can also use other conditions
f <- function(df, values){
# values must be a list with the columns names of df as names and the conditions
# if you
y <- paste(names(values), unlist(values), collapse=" & ")
return(df[eval(parse(text=y), envir=df),])
}
l <- as.vector(paste0(c("==", "<=", "=="), 1:3), "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
Sometimes matrices are quicker than data.frames to operate on, so something along the lines of:
mat <- t(as.matrix(df[-ncol(df)))
boolMat <- (mat==values) # if necessary use match to reorder values to match columns of df
ind <- colSums(boolMat)==nrow(boolMat)
df[ind,]
The idea being that values will get recycled along the columns of the matrix (which are the rows of the dataframe). colSums is meant to be quicker than an apply, so the final line should be somewhat optimised compared to apply(boolMat, 2, all).
The optimal solutions will depend on the size and proportions of the data; whether the entries are all integers; and maybe what proportion of matches you get in the data. So as #droopy mentions, you'll need to benchmark. My approach involves creating a copy of the data, so if your data is already approaching memory limits, then it might struggle - but maybe then you could generate your data in matrix rather than data.frame format to save the duplication.
I have done lot of googling but I didn't find satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is header. The first column is Group id (the data have 3 groups A, B, C) while other column are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and want to get column mean. Then I want to separate the data in 3 groups (according to Tag A,B,C) and want to calculate mean(column wise) for each group. Any help
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt,mean) # works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))