R apply function - which one to use? - r

I am confused about which apply-family function to use here.
I have a data frame mydf:
terms
A
B
C
I want to apply a custom function to each of the values and get the results in new columns, like below:
terms Value1 Value2 ResultChar
A 23 45 Good
B 12 34 Average
C 9 23 Poor
The custom function is something like myfunc("A"), which returns a vector like (23, 45, Good).
Any help will be appreciated.

It looks like you want a data frame as output, since you have different data types across columns. So you need to define myfunc to return a data frame.
Consider this toy example:
mydf <- data.frame(terms = letters[1:3], stringsAsFactors = FALSE)
myfunc <- function (u) data.frame(terms = u, one = u, two = paste0(u,u))
Here is one possibility using basic R features:
do.call(rbind, lapply(mydf$terms, myfunc))
# terms one two
#1 a a aa
#2 b b bb
#3 c c cc
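To make the point about mixed column types concrete, here is a minimal sketch of the same pattern. myfunc2 and mydf2 are hypothetical stand-ins (the asker's real myfunc is not shown), and the values they produce are made up purely for illustration.

```r
# myfunc2 is a hypothetical stand-in for the asker's myfunc; its values are
# invented for illustration only.
myfunc2 <- function(term) {
  n <- nchar(term)
  data.frame(
    terms = term,
    Value1 = n,                                  # numeric column
    Value2 = n * 2,                              # numeric column
    ResultChar = if (n > 1) "Good" else "Poor",  # character column
    stringsAsFactors = FALSE
  )
}

mydf2 <- data.frame(terms = c("A", "BB", "CCC"), stringsAsFactors = FALSE)

# Build one one-row data frame per term, then stack them row-wise.
res <- do.call(rbind, lapply(mydf2$terms, myfunc2))

# Each column keeps its own type, which a plain vector return could not do.
sapply(res, class)
```

Returning a one-row data frame rather than a vector is what keeps the numbers numeric; a vector such as c(23, 45, "Good") would be coerced to character throughout.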
Or you can use adply from the plyr package:
library(plyr)
adply(mydf, 1, myfunc)
# terms terms.1 two
#1 a a aa
#2 b b bb
#3 c c cc
(>_<) This is my first time trying something other than base R for a data frame; I'm not sure why adply returns undesired column names here...

We can use rbindlist with lapply. It would be more efficient
library(data.table)
rbindlist(lapply(mydf$terms, myfunc))
If needed, I can show the benchmarks, but they are already shown here.

Related

How do you return the list of unique values in dataframe and not the index value of the list when aggregating in R?

Given the below dataframe
df <- data.frame(cbind(seq(1:4),rep(letters[seq(1:3)],4)))
X1 X2
1 a
2 b
3 c
4 a
1 b
2 c
3 a
4 b
1 c
2 a
3 b
4 c
I would like to summarize unique X2s by X1. For example,
1 a,b,c
2 b,c,a
3 c,a,b
4 a,b,c
I am very close. I use the following code:
summary <- aggregate(df$X2, list(df$X1), FUN=unique)
which produces
Group.1 X
1 1,2,3
2 2,3,1
3 3,1,2
4 1,2,3
(the index of the list). What is the most efficient way to get my desired result?
I am certain there is an easy solution and I've tried searching, but I must not be using the correct search terms. Thank you in advance.
We can use toString to paste the elements
aggregate(X2~X1, unique(df), toString )
Or if we need to keep it as list
aggregate(X2~X1, transform(unique(df), X2 = as.character(X2)), list)
Since the OP also asked for an efficient approach:
library(data.table)
unique(setDT(df))[, .(X2 = toString(X2)), by = X1]
Regarding the creation of the data.frame: it is easier, more compact, and less error-prone to build it without wrapping cbind in data.frame. The main reason is that cbind converts its arguments to a matrix, and a matrix can hold only a single class. So, if there is a single character column or element, all the elements are converted to character. On top of that, in R versions before 4.0.0, data.frame and as.data.frame default to stringsAsFactors=TRUE, so those character columns are then converted to factor class.
df <- data.frame(X1= 1:4, X2= rep(letters[1:3],4), stringsAsFactors= FALSE)
The above code gets the intended output. Note that seq is not needed when we use the : operator.
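The coercion described above can be checked directly. This is a small base R sketch; the names bad, good and res are only illustrative, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
# cbind builds a character matrix first, so the numbers become strings.
bad <- data.frame(cbind(1:4, rep(letters[1:3], 4)), stringsAsFactors = FALSE)

# Building the columns directly keeps X1 as integer and X2 as character.
good <- data.frame(X1 = 1:4, X2 = rep(letters[1:3], 4),
                   stringsAsFactors = FALSE)

sapply(bad, class)
sapply(good, class)

# With clean column types, toString collapses the letters for each X1.
res <- aggregate(X2 ~ X1, unique(good), toString)
res
```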

How to run a for-loop through a string vector of a data frame in R?

I'm trying to do something very simple: to run a loop through a vector of names and use those names in my code.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
My real dataset has 14 countries and 75 time periods. I would like to find a function which for example loops through the countries, then subsets them so I have the single datasets such as:
data_AT <- subset(Data, (Data$geo=="AT"))
data_BE <- subset(Data, (Data$geo=="BE"))
but with a loop and ideally with a solution I can apply to other functions as well :-)
In my mind, this should look something like this:
codes <- unique(Data$geo)
for (i in 1:length(codes))
{k <- codes[i]
data_(k) <- subset(Data, (Data$geo==k))}
However, subset doesn't work like this, and neither do other functions. I think my problem is that I don't know how to address the respective name which "k" has taken (e.g. "AT") as part of my code. If at all possible, I would very much appreciate an answer with a general solution of how I can run a function through a vector containing text and use each element of that vector in my code. Maybe in the direction of the apply functions? Though I'm not getting very far with that either...
Any help would be very much appreciated!
I'm using loops for similar purposes too. Maybe it's not the fastest way, but at least I understand it -- for example, when saving plots for different subsets.
There is no need to loop through the length of the vector; you can loop through the vector itself. To convert a string to a variable name, you can use assign.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
codes <- sort(unique(Data$geo))
for (k in codes) {
name<-paste("data", k, sep="_")
assign(name, subset(Data, (Data$geo==k)))
}
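Once the objects have been created with assign, they can be looked up by name again with get. The sketch below is a variation on the loop above: it writes into a separate environment (env, a name chosen here for illustration) instead of the global workspace, which is an assumption of this example rather than part of the original answer.

```r
Data <- data.frame(
  geo = rep(c("AT", "BE"), each = 3),
  time = rep(c("1990Q1", "1990Q2", "1990Q3"), 2),
  value = 1:6,
  stringsAsFactors = FALSE
)

# Hypothetical container so the generated objects do not clutter the workspace.
env <- new.env()

for (k in unique(Data$geo)) {
  name <- paste("data", k, sep = "_")
  # Plain logical indexing avoids any name lookup surprises inside subset().
  assign(name, Data[Data$geo == k, ], envir = env)
}

ls(env)                      # names of the created objects
get("data_AT", envir = env)  # retrieve one of them by its string name
```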
BTW, filter from the dplyr package can be much faster than subset!
In R, you would typically do this with a list of data.frames instead of several separate data.frames:
lst <- split(Data, Data$geo)
lst
#$AT
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
#
#$BE
# geo time value
#4 BE 1990Q1 4
#5 BE 1990Q2 5
#6 BE 1990Q3 6
Now you can access each element (which is a data.frame) by typing:
lst[["AT"]]
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
If you have a vector of country names for which you want to add +1 to the value column, you can do it like this:
cntrs <- c("BE", "AT")
lst[cntrs] <- lapply(lst[cntrs], function(x) {x$value <- x$value + 1; return(x)} )
#$BE
# geo time value
#4 BE 1990Q1 5
#5 BE 1990Q2 6
#6 BE 1990Q3 7
#
#$AT
# geo time value
#1 AT 1990Q1 2
#2 AT 1990Q2 3
#3 AT 1990Q3 4
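The same list can be used to run any function over every country and to put the pieces back together afterwards. A minimal base R sketch, with means and combined as illustrative names:

```r
Data <- data.frame(
  geo = rep(c("AT", "BE"), each = 3),
  time = rep(c("1990Q1", "1990Q2", "1990Q3"), 2),
  value = 1:6,
  stringsAsFactors = FALSE
)

lst <- split(Data, Data$geo)

# Apply a function to each per-country data frame; sapply returns a named vector.
means <- sapply(lst, function(d) mean(d$value))
means

# Stack the pieces back into a single data frame when needed.
combined <- do.call(rbind, lst)
nrow(combined)
```

Because the list elements are named after the countries, the names carry through to the result, so no string-to-variable conversion is needed.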
Edit: if you really want to stick with a for loop, I recommend not splitting the data into several separate data.frames but running the loop on the whole data set, for example like this:
cntrs <- "BE"
for(i in cntrs){
Data$value[Data$geo == i] <- Data$value[Data$geo == i] + 1
}

Assigning unique id to duplicated rows

If I have a data frame which looks like this:
x y
13 a
14 b
15 c
15 c
14 b
and I wanted each group of equal rows to have a unique id, like this:
x y id
13 a 1
14 b 2
15 c 3
15 c 3
14 b 2
Is there any easy way of doing this?
Thanks
I have a bit of a concern with the paste0 approach. If your columns contained more complex data, you could end up with surprising results, e.g. imagine:
x y
ab c
a bc
One solution is to replace paste0(...) with paste(..., sep = "#"). Even so, you cannot come up with a sep general enough that it will work with any type of data as there is always a non-zero probability that sep will be contained in some kind of data.
A more robust approach is to use a split/transform/combine approach. You can certainly do it with the base package but plyr makes it a bit easier:
library(plyr)
.idx <- 0L
ddply(df, colnames(df), transform, id = (.idx <<- .idx + 1L))
If this is too slow, I would recommend a data.table approach, as proposed here: data.table "key indices" or "group counter"
This is the first thing I thought of:
Make a new variable which just combines the two columns by pasting their values to strings:
a<-paste0(z$x,z$y) #z is your data.frame
Then make this a factor and combine it with your data frame:
cbind(z,id=factor(a,labels=1:length(unique(a))))
EDIT: @flodel was concerned about using paste0; it's better to use ordinary paste, or interaction:
a<-interaction(z,drop=TRUE)
cbind(z,id=factor(a,labels=1:length(unique(a))))
This is assuming that you want to separate x=ab, y=c, and x=a,y=bc. If not, then use paste0.
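The difference between the two approaches shows up on the problematic example from the other answer. This is a small base R sketch; z, z2 and the id variables are illustrative names, and the second data frame reproduces the rows from the question.

```r
# The tricky case: pasting without a separator merges two different rows.
z <- data.frame(x = c("ab", "a"), y = c("c", "bc"), stringsAsFactors = FALSE)
pasted <- paste0(z$x, z$y)                       # both rows give the same string
ids <- as.integer(interaction(z, drop = TRUE))   # rows stay distinct

# The data from the question: equal rows share an id.
z2 <- data.frame(x = c(13, 14, 15, 15, 14),
                 y = c("a", "b", "c", "c", "b"),
                 stringsAsFactors = FALSE)
z2$id <- as.integer(interaction(z2, drop = TRUE))
z2
```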

Apply multiple functions to column using tapply

Could someone please point out how we can apply multiple functions to the same column using tapply (or any other method, plyr, etc.) so that the results end up in distinct columns? For example, if I have a data frame with
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get the result as the sum of MoneySpent plus the number of occurrences.
I used a function like --
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split the result into columns; it gives something like:
Joe Joe 100, 5 # The sum=100, number of occurrences = 5, but it gets juxtaposed
Thanks in advance,
Raj
You can certainly do stuff like this using ddply from the plyr package:
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat,.(x),summarise,total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)
ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
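As a worked sketch of the tapply route, the version below uses only the three rows shown in the question (the elided rows are left out) and names the elements returned by f, so that the resulting matrix gets labelled columns. The names total and count are choices made for this example.

```r
df <- data.frame(User = c("Joe", "Ron", "Joe"),
                 MoneySpent = c(20, 10, 30),
                 stringsAsFactors = FALSE)

# Returning a named vector gives the output matrix its column names.
f <- function(x) c(total = sum(x), count = length(x))

# tapply yields one vector per user; rbind stacks them into a matrix
# with one row per user.
res <- do.call(rbind, tapply(df$MoneySpent, df$User, f))
res

# Convert to a data frame with the user as a regular column, if preferred.
out <- data.frame(User = rownames(res), res, row.names = NULL)
```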

mapping over the rows of a data frame

Suppose I have a data frame with columns c1, ..., cn, and a function f that takes in the columns of this data frame as arguments.
How can I apply f to each row of the data frame to get a new data frame?
For example,
x = data.frame(letter=c('a','b','c'), number=c(1,2,3))
# x is
# letter | number
# a | 1
# b | 2
# c | 3
f = function(letter, number) { paste(letter, number, sep='') }
# desired output is
# a1
# b2
# c3
How do I do this? I'm guessing it's something along the lines of {s,l,t}apply(x, f), but I can't figure it out.
As @greg points out, paste() can do this. I suspect your example is a simplification of a more general problem. After struggling with this in the past, as illustrated in this previous question, I ended up using the plyr package for this type of thing. plyr does a LOT more, but for these things it's easy:
> require(plyr)
> adply(x, 1, function(x) f(x$letter, x$number))
X1 V1
1 1 a1
2 2 b2
3 3 c3
You'll want to rename the output columns, I'm sure.
So while I was typing this, @joshua showed an alternative method using ddply. The difference in my example is that adply treats the input data frame as an array. adply does not use the "group by" variable row that @joshua created. How he did it is exactly how I was doing it until Hadley tipped me off to the adply() approach in the aforementioned question.
paste(x$letter, x$number, sep = "")
I think you were thinking of something like this, but note that the apply family of functions do not return data.frames. They will also attempt to coerce your data.frame to a matrix before applying the function.
apply(x,1,function(x) paste(x,collapse=""))
So you may be more interested in ddply from the plyr package.
> x$row <- 1:NROW(x)
> ddply(x, "row", function(df) paste(df[[1]],df[[2]],sep=""))
row V1
1 1 a1
2 2 b2
3 3 c3
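For a function that takes the columns as separate arguments, base R's mapply is another option that avoids the matrix coercion done by apply. This is a minimal sketch rather than a replacement for the plyr answers; out and result are illustrative names.

```r
x <- data.frame(letter = c("a", "b", "c"), number = c(1, 2, 3),
                stringsAsFactors = FALSE)
f <- function(letter, number) paste(letter, number, sep = "")

# mapply walks the columns in parallel, calling f once per row.
out <- mapply(f, x$letter, x$number, USE.NAMES = FALSE)
out

# Attach the result as a new column to get a data frame back.
result <- data.frame(x, pasted = out, stringsAsFactors = FALSE)
```

Each column is passed to f with its original type, so number arrives as a numeric value rather than as a string.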
