It's easy to grab one or more columns in ddply to process, but is there a way to grab the entire current row and pass it on to a function? Or to grab a set of columns determined at runtime?
Let me illustrate:
Given a dataframe like
df = data.frame(a=seq(1,20), b=seq(1,5), c= seq(5,1))
df
a b c
1 1 1 5
2 2 2 4
3 3 3 3
I could write a function to sum named columns along a row of a data frame like this:
selectiveSummer = function(row, colsToSum) {
  return(sum(row[, colsToSum]))
}
It works when I call it for a row like this:
> selectiveSummer(df[1,],c('a','c'))
[1] 6
So I'd like to wrap that in an anonymous function and use it in ddply to apply it to every row in the table, something like the example below
f = function(x) { selectiveSummer(x,c('a','c')) }
#this doesn't work!
ddply(df,.(a,b,c), transform, foo=f(row))
I'd like to find a solution where the set of columns to manipulate can be determined at runtime, so if there's some way just to splat that from ddply's args and pass it into a function that takes any number of args, that works too.
Edit: To be clear, the real application driving this isn't sum, but it made for an easier explanation.
You can only select single rows with ddply if rows can be identified uniquely by one or more variables. If there are identical rows, ddply will cycle over data frames of multiple rows even if you use all columns (like ddply(df, names(df), f)).
Why not use apply instead? Apply does iterate over individual rows.
apply(df, 1, function(x) f(as.data.frame(t(x))))
result:
[1] 6 6 6 6 6 11 11 11 11 11 16 16 16 16 16 21 21 21 21 21
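If the columns to sum are all numeric and only their names vary at runtime, a vectorized sketch (assuming cols holds the runtime-chosen column names) avoids the per-row conversion entirely:
cols <- c('a', 'c')            # determined at runtime
df$foo <- rowSums(df[, cols])  # row-wise sum over just those columns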
Simple...
df$id = 1:nrow(df)
ddply(df, c('id'), function(x) { ... })
OR
adply(df, 1, function(x) { ... })
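For example, a minimal sketch of the adply route with the column set held in a runtime variable (cols is just an illustrative name):
cols <- c('a', 'c')   # chosen at runtime
adply(df, 1, function(row) data.frame(foo = sum(row[, cols])))
Each row reaches the function as a one-row data frame, so the same selective indexing used in selectiveSummer works unchanged.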
Related
Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 10 20 30
I have tried to use as.data.frame(), which does not change the result for two or more columns. It does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
  df[, 1]
1      10
2      20
3      30
I don't understand why it behaves like this. In my mind it should not make a difference whether I extract one or two or ten columns. It should either always return a vector (or matrix) or always return a data.frame (with the correct names). What am I missing? Thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
Omit the comma:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
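To see the difference between the two indexing styles on the df above:
df["A"]                  # list-style: a one-column data frame
df[["A"]]                # the column itself, as a vector
df[, "A"]                # matrix-style: drops to a vector by default
df[, "A", drop = FALSE]  # matrix-style, dimensions kept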
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
  select(A)
# by index
df %>%
  select(1)
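If the columns to keep are only known at runtime as a character vector, tidyselect's all_of() helper can be combined with select (a sketch; cols is just an illustrative name):
cols <- c("A", "B")   # determined at runtime
df %>%
  select(all_of(cols))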
I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each row, I have to check whether at least one of these columns is not NA (basically any(!is.na(df[namescolumns])) for each row), and then subset to the rows for which that is TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but only for the first row.
I could easily do a for loop, which is my first reflex as a programmer and because it works for the first row, but I'm sure it's not the R way and that there is a way to do this with one of the apply functions (lapply, mapply, sapply, tapply, or another), but I can't figure out which one or how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Depending on what the function returns, the results may come back transposed, so you might want to wrap the whole statement in t().
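A small usage sketch (assuming namescolumns holds the relevant column names) that keeps the rows where at least one of those columns is not NA:
namescolumns <- c("a", "d")   # hypothetical column names
keep <- apply(df, 1, function(x) any(!is.na(x[namescolumns])))
df[keep, ]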
You can use a combination of lapply and Reduce:
no.na.in.cols <- Reduce(`&`, lapply(colnames, function(name) !is.na(df[name])))
to get a vector that is TRUE for the rows with no NA in any of the columns in colnames, which can in turn be used to subset the data:
df[no.na.in.cols, ]
For example. Given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
                 b = c(2,4,6,8,10,12,14),
                 c = c("one","two","three","four","five","six","seven"),
                 d = c("a",NA,"c","d","e","f","g"))
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
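As a side note, when the requirement is "no NA in any of the named columns" (which is what the example above computes), base R's complete.cases() gives the same subset:
df[complete.cases(df[colnames]), ]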
I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess return() returns nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has less rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row in the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", and otherwise returns NULL; then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get an answer specific to this kind of task, but also one for the more general case where the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size>100, but a more complex condition that is checked by a function, let's say boo(single_row_df)).
P.S.
What I have done so far in these cases is to use apply(df, MARGIN=1) and then do.call(rbind, ...), but I think it gives me some trouble when my dataframe only has a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply I did the following:
ranges.filter <- function(ranges, boo) {
  subset(x = ranges, subset = !any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the job, but I get a warning saying "numerical expression has 53 elements: only the first used".
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between any of the start and end pairs, you might construct a different function (note: when summing the result of boolean/logical vectors, TRUE values are converted to 1s and FALSE values to 0s):
test <- function(x)
  rowSums(mapply(function(start, end) x >= start & x <= end,
                 start = c(100, 250, 698, 1988),
                 end = c(200, 400, 1520, 2147))) == 0
subset(dfr, test(size))
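If the start/end pairs live in a ranges data frame as in the question, the same idea can be written row by row with mapply (a sketch, assuming ranges has start and end columns and boo is a logical vector long enough to cover every range); this also avoids the "only the first used" warning, which came from start:end being built from whole columns inside subset:
ranges.filter <- function(ranges, boo) {
  keep <- mapply(function(s, e) !any(boo[s:e]), ranges$start, ranges$end)
  ranges[keep, ]
}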
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, e.g. grepl("ave", name) & size > 50.