Suppose I have a data frame with columns c1, ..., cn, and a function f that takes in the columns of this data frame as arguments.
How can I apply f to each row of the data frame to get a new data frame?
For example,
x = data.frame(letter=c('a','b','c'), number=c(1,2,3))
# x is
# letter | number
# a | 1
# b | 2
# c | 3
f = function(letter, number) { paste(letter, number, sep='') }
# desired output is
# a1
# b2
# c3
How do I do this? I'm guessing it's something along the lines of {s,l,t}apply(x, f), but I can't figure it out.
As @greg points out, paste() can do this. I suspect your example is a simplification of a more general problem, though. After struggling with this in the past, as illustrated in this previous question, I ended up using the plyr package for this type of thing. plyr does a LOT more, but for these things it's easy:
> require(plyr)
> adply(x, 1, function(x) f(x$letter, x$number))
X1 V1
1 1 a1
2 2 b2
3 3 c3
You'll want to rename the output columns, I'm sure.
While I was typing this, @joshua showed an alternative method using ddply. The difference in my example is that adply treats the input data frame as an array, so it does not need the "group by" variable row that @joshua created. His approach is exactly how I was doing it until Hadley tipped me off to the adply() approach in the question linked above.
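A base R aside: if f were not vectorized over its arguments, the same row-wise call could also be written with mapply(), using the x and f from the question:
> mapply(f, x$letter, x$number)
which gives "a1" "b2" "c3".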
paste(x$letter, x$number, sep = "")
I think you were thinking of something like this, but note that the apply family of functions does not return data.frames. apply will also coerce your data.frame to a matrix before applying the function.
apply(x,1,function(x) paste(x,collapse=""))
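You can see that coercion by looking at the matrix apply() actually operates on (a quick illustration using the x from the question; note the number column has become character):
> as.matrix(x)
  letter number
1 "a"    "1"
2 "b"    "2"
3 "c"    "3"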
So you may be more interested in ddply from the plyr package.
> x$row <- 1:NROW(x)
> ddply(x, "row", function(df) paste(df[[1]],df[[2]],sep=""))
row V1
1 1 a1
2 2 b2
3 3 c3
Related
I am using R to analyse some data that is in long format. I have one column that is a grouping variable containing participant IDs and another column that contains their sex.
e.g.
ID SEX
1 M
1 M
2 F
2 F
2 M
I would like to check whether there are any IDs which do not have sex coded consistently e.g. ID=2 above. Is there a way to do this? I have been playing around with dplyr and the group_by function, but I am at a loss. Any help would be greatly appreciated.
In terms of output, I would probably like a vector of all unique ID values that have non-identical values in the SEX column.
Here's a base R solution using ave():
df[ave(df$SEX, df$ID, FUN = function(x) length(unique(x))) > 1, ]
ID SEX
3 2 F
4 2 F
5 2 M
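If you just want a vector of the offending IDs rather than the rows (as asked in the question), the same ave() test can be reused (this assumes SEX is stored as character, as above):
> unique(df$ID[ave(df$SEX, df$ID, FUN = function(x) length(unique(x))) > 1])
[1] 2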
You can try this.
require(plyr)
df <- data.frame(c(1,1,2,2,2), c('M','M','F','F','M'))
names(df) <- c('ID','SEX')
df2 <- ddply(df,.(ID), mutate, count = length(unique(SEX)))
unique(df2[df2$count > 1,][1])
Result:
ID
2
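Since the question mentions dplyr and group_by, a rough equivalent there (a sketch, assuming the same df as above) is:
> library(dplyr)
> df %>% group_by(ID) %>% filter(n_distinct(SEX) > 1) %>% pull(ID) %>% unique()
[1] 2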
Given the data frame below
df <- data.frame(cbind(seq(1:4),rep(letters[seq(1:3)],4)))
X1 X2
1 a
2 b
3 c
4 a
1 b
2 c
3 a
4 b
1 c
2 a
3 b
4 c
I would like to summarize unique X2s by X1. For example,
1 a,b,c
2 b,c,a
3 c,a,b
4 a,b,c
I am very close. I use the following code:
summary <- aggregate(df$X2, list(df$X1), FUN = unique)
which produces
Group.1 X
1 1,2,3
2 2,3,1
3 3,1,2
4 1,2,3
(the integer codes of the factor levels rather than the letters themselves). What is the most efficient way to get my desired result?
I am certain there is an easy solution and I've tried searching, but I must not be using the correct search terms. Thank you in advance.
We can use toString to paste the elements
aggregate(X2~X1, unique(df), toString )
Or if we need to keep it as a list
aggregate(X2~X1, transform(unique(df), X2 = as.character(X2)), list)
As the OP also asked about the most efficient approach, here is a data.table option:
library(data.table)
unique(setDT(df))[, .(X2 = toString(X2)), by = X1]
Regarding the creation of the data.frame: it is easier, more compact and less error-prone to build it without wrapping cbind inside data.frame. The main reason is that cbind converts everything to a matrix, and a matrix can hold only a single class, so if there is a single character column or element, all elements are converted to character. Then, with the default stringsAsFactors = TRUE, those character columns are converted to the factor class.
df <- data.frame(X1= 1:4, X2= rep(letters[1:3],4), stringsAsFactors= FALSE)
The above code creates the intended data. Note that seq() is not needed here; 1:4 already gives the sequence.
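For completeness, the same aggregation can also be written with dplyr (a sketch, assuming df is built as above):
library(dplyr)
df %>% distinct() %>% group_by(X1) %>% summarise(X2 = toString(X2))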
I am confused about which apply family function to use here.
I have a data frame mydf as
terms
A
B
C
I want to apply a custom function to each of the values and get the results in new columns, like below:
terms Value1 Value2 ResultChar
A 23 45 Good
B 12 34 Average
C 9 23 Poor
The custom function is something like myfunc("A"), which returns a vector like (23, 45, "Good").
Any help will be appreciated.
Looks like you want a data frame as output, since you have different data types across columns. So you need to define your myfunc to return a data frame.
Consider this toy example:
mydf <- data.frame(terms = letters[1:3], stringsAsFactors = FALSE)
myfunc <- function (u) data.frame(terms = u, one = u, two = paste0(u,u))
Here is one possibility using basic R features:
do.call(rbind, lapply(mydf$terms, myfunc))
# terms one two
#1 a a aa
#2 b b bb
#3 c c cc
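If you would rather keep your original mydf and just append the new columns, one option (same toy example; setdiff() just drops the duplicated terms column) is:
res <- do.call(rbind, lapply(mydf$terms, myfunc))
cbind(mydf, res[setdiff(names(res), names(mydf))])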
Or you can use adply from the plyr package:
library(plyr)
adply(mydf, 1, myfunc)
# terms terms.1 two
#1 a a aa
#2 b b bb
#3 c c cc
(>_<) it is my first time trying something other than base R on a data frame; I am not sure why adply returns these undesired column names here...
We can use rbindlist with lapply; it would be more efficient:
library(data.table)
rbindlist(lapply(mydf$terms, myfunc))
If needed, I can show benchmarks, but they are already shown here.
I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each row, for each of these columns, I have to identify whether at least one of them is not NA (basically any(!is.na(df[namescolumns])) for each row), and then subset to the rows where that is TRUE.
Actually, any(!is.na(df[1, ][namescolumns])) works well, but only for the first row.
I could easily write a for loop, which is my first reflex as a programmer, but I'm sure that's not the R way and that there is a way to do this with one of the apply functions (lapply, mapply, sapply, tapply or another); I just can't figure out which one and how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Since the function returns a single logical per row, the result is a plain logical vector. (If the function returned several values per row, the results would come back transposed and you might want to wrap the whole statement inside t().)
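To then do the subset the question asks for, you can use that logical vector directly (a small sketch, assuming namescolumns holds the names of the columns to check):
keep <- apply(df, 1, function(x) any(!is.na(x[namescolumns])))
df[keep, ]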
You can use a combination of lapply and Reduce
no.na.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[name])))
to get a vector that is TRUE for rows with no NA values in any of the columns in colnames, which can in turn be used to subset the data.
df[no.na.in.cols, ]
For example. Given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
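Note that Reduce with `&` keeps rows where all of the named columns are non-NA. If you instead want rows where at least one of them is non-NA, as in the question's any(!is.na(...)), swap & for |:
df[Reduce(`|`, lapply(colnames, function (name) !is.na(df[name]))), ]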
If I have a data frame which looks like this:
x y
13 a
14 b
15 c
15 c
14 b
and I wanted each group of equal rows to have a unique id, like this:
x y id
13 a 1
14 b 2
15 c 3
15 c 3
14 b 2
Is there any easy way of doing this?
Thanks
I have a bit of a concern with the paste0 approach. If your columns contained more complex data, you could end up with surprising results, e.g. imagine:
x y
ab c
a bc
One solution is to replace paste0(...) with paste(..., sep = "#"). Even then, you cannot come up with a sep general enough to work with any type of data, since there is always a non-zero probability that sep itself will appear in the data.
A more robust option is a split/transform/combine approach. You can certainly do it with the base package, but plyr makes it a bit easier:
library(plyr)
.idx <- 0L
ddply(df, colnames(df), transform, id = (.idx <<- .idx + 1L))
If this is too slow, I would recommend a data.table approach, as proposed here: data.table "key indices" or "group counter"
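A minimal sketch of that data.table idea, using the built-in .GRP group counter (assuming the columns are named x and y as in the question):
library(data.table)
setDT(df)[, id := .GRP, by = .(x, y)]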
This is the first thing I thought of:
Make a new variable which just combines the two columns by pasting their values into strings:
a<-paste0(z$x,z$y) #z is your data.frame
Then make this a factor and bind it to your data frame:
cbind(z,id=factor(a,labels=1:length(unique(a))))
EDIT: @flodel was concerned about using paste0; it's better to use ordinary paste (with a separator), or interaction():
a<-interaction(z,drop=TRUE)
cbind(z,id=factor(a,labels=1:length(unique(a))))
This assumes that you want x = "ab", y = "c" and x = "a", y = "bc" to be treated as different groups. If not, then use paste0.