Apply multiple functions to column using tapply - r

Could someone please point to how we can apply multiple functions to the same column using tapply (or any other method: plyr, etc.) so that the results end up in distinct columns? For example, if I have a dataframe with
User MoneySpent
Joe 20
Ron 10
Joe 30
...
I want to get the result as the sum of MoneySpent plus the number of occurrences.
I used a function like --
f <- function(x) c(sum(x), length(x))
tapply(df$MoneySpent, df$User, f)
But this does not split the result into columns; it gives something like, say,
Joe Joe 100, 5 # The sum=100, number of occurrences = 5, but it gets juxtaposed
Thanks in advance,
Raj

You can certainly do stuff like this using ddply from the plyr package:
dat <- data.frame(x = rep(letters[1:3],3),y = 1:9)
ddply(dat,.(x),summarise,total = NROW(piece), count = sum(y))
x total count
1 a 3 12
2 b 3 15
3 c 3 18
You can keep listing more summary functions, beyond just two, if you like. Note I'm being a little tricky here in calling NROW on an internal variable in ddply called piece. You could have just done something like length(y) instead. (And probably should; referencing the internal variable piece isn't guaranteed to work in future versions, I think. Do as I say, not as I do and just use length().)

ddply() is conceptually the clearest, but sometimes it is useful to use tapply instead for speed reasons, in which case the following works:
do.call( rbind, tapply(df$MoneySpent, df$User, f) )
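To make the tapply route concrete, here is a minimal sketch using the question's data (column names assumed from the question). Naming the elements returned by f gives the combined matrix proper column names:

```r
# Example data from the question
df <- data.frame(User = c("Joe", "Ron", "Joe"),
                 MoneySpent = c(20, 10, 30))

# Name the vector elements so the combined result gets column names
f <- function(x) c(total = sum(x), count = length(x))

# tapply returns a list-like array of per-user vectors; rbind them
res <- do.call(rbind, tapply(df$MoneySpent, df$User, f))
res
#     total count
# Joe    50     2
# Ron    10     1
```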

How do I use dataframe names as inputs for var in for loop (R language)?

In R, I defined the following function:
race_ethn_tab <- function(x) {
x %>%
group_by(RAC1P) %>%
tally(wt = PWGTP) %>%
print(n = 15) }
The function simply generates a weighted tally for a given dataset, for example, race_ethn_tab(ca_pop_2000) generates a simple 9 x 2 table:
1 Race 1 22322824
2 Race 2 2144044
3 Race 3 228817
4 Race 4 1827
5 Race 5 98823
6 Race 6 3722624
7 Race 7 116176
8 Race 8 3183821
9 Race 9 1268095
I have to do this for several datasets (approx. 10), and it's easier for me to keep the dfs distinct rather than bind them and create a year variable. So, I am trying to use either a for loop or purrr::map() to iterate through my list of dfs.
Here is what I tried:
dfs_test <- as.list(as_tibble(ca_pop_2000),
as_tibble(ca_pop_2001),
as_tibble(ca_pop_2002),
as_tibble(ca_pop_2003),
as_tibble(ca_pop_2004))
# Attempt 1: Using for loop
for (i in dfs_test) {
race_ethn_tab(i)
}
# Attempt 2: Using purrr::map
race_ethn_outs <- map(dfs_test, race_ethn_tab)
Both attempts are telling me that group_by can't be applied to a factor object, but I can't figure out why the elements in dfs_test are being registered as factors given that I am forcing them into the tibble class. Would appreciate any tips based on my approach or alternative approaches that could make sense here.
This, from @RonakShah, was exactly what was needed:
Your code should work if you use list instead of as.list. See the output of as.list(as_tibble(mtcars), as_tibble(iris)) vs list(as_tibble(mtcars), as_tibble(iris)) – Ronak Shah
We can use mget to return a list of datasets, then loop over the list and apply the function
dfs_test <- mget(paste0("ca_pop_", 2000:2004))
It can also be made more general if we use ls
dfs_test <- mget(ls(pattern = '^ca_pop_\\d{4}$'))
map(dfs_test, race_ethn_tab)
This would make it easier if there are 100s of objects already created in the global environment instead of doing
list(ca_pop_2000, ca_pop_2001, .., ca_pop_2020)
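The as.list() vs list() distinction is easy to see on plain data frames: as.list() applied to a data frame splits it into its columns and silently ignores further arguments, while list() keeps each frame intact. A minimal sketch:

```r
d1 <- data.frame(a = 1:3)
d2 <- data.frame(a = 4:6)

bad  <- as.list(d1, d2)   # columns of d1 only; d2 is silently dropped
good <- list(d1, d2)      # two intact data frames

length(bad)                    # 1
length(good)                   # 2
sapply(good, is.data.frame)    # TRUE TRUE
```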

Assigning unique id to duplicated rows

If i have a data frame which looks like this:
x y
13 a
14 b
15 c
15 c
14 b
and I wanted each group of equal rows to have a unique id, like this:
x y id
13 a 1
14 b 2
15 c 3
15 c 3
14 b 2
Is there any easy way of doing this?
Thanks
I have a bit of a concern with the paste0 approach. If your columns contained more complex data, you could end up with surprising results, e.g. imagine:
x y
ab c
a bc
One solution is to replace paste0(...) with paste(..., sep = "#"). Even so, you cannot come up with a sep general enough that it will work with any type of data as there is always a non-zero probability that sep will be contained in some kind of data.
A more robust approach is to use a split/transform/combine approach. You can certainly do it with the base package but plyr makes it a bit easier:
library(plyr)
.idx <- 0L
ddply(df, colnames(df), transform, id = (.idx <<- .idx + 1L))
If this is too slow, I would recommend a data.table approach, as proposed here: data.table "key indices" or "group counter"
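For reference, a similar "group counter" can be sketched in base R with match() and unique() over a pasted key (using a sep argument to reduce collision risk, per the caveat above):

```r
df <- data.frame(x = c(13, 14, 15, 15, 14),
                 y = c("a", "b", "c", "c", "b"))

key   <- paste(df$x, df$y, sep = "#")  # separator guards against collisions
df$id <- match(key, unique(key))       # ids in order of first appearance
df$id
# [1] 1 2 3 3 2
```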
This is the first thing I thought:
Make a new variable which just combines the two columns by pasting their values to strings:
a<-paste0(z$x,z$y) #z is your data.frame
Then make this a factor and bind it to your dataframe:
cbind(z,id=factor(a,labels=1:length(unique(a))))
EDIT: #flodel was concerned about using paste0, it's better to use ordinary paste, or interaction:
a<-interaction(z,drop=TRUE)
cbind(z,id=factor(a,labels=1:length(unique(a))))
This assumes that you want to treat x = "ab", y = "c" and x = "a", y = "bc" as distinct rows. If not, then use paste0.

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
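If you would rather not depend on zoo, the carry-forward itself is short in base R; a minimal sketch (assumes the first element is non-missing):

```r
# Last observation carried forward, base R
locf <- function(x) {
  nonNA <- !is.na(x)
  x[nonNA][cumsum(nonNA)]   # running count of non-NAs indexes the last seen value
}

locf(c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
# [1] "FIRST_EVENT"  "FIRST_EVENT"  "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"
```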
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of step 1 is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two, and we want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[` and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.

Create and process several columns with loop in R

I'm quite new to R and I would like to learn how to write a Loop to create and process several columns.
I imported a table into R that contains data with 23 variables. For each of these variables I want to calculate the per capita value, multiply it by 1000, and write the result either into a new table or into the same table as the old data.
To do this for only one column, my operation looked like this:
agriculture <- cbind(agriculture, "Total_value_per_capita" = agriculture$Total/agriculture$Total.Population*1000)
Now I'm asking how to do this in a loop for all 23 variables so that I won't have to write 23 similar lines of code.
I think the solution might look quite similar to the code pasted in this thread:
loop to create several matrix in R (maybe using paste)
but I didn't get it working with my code.
So any suggestion would be very helpful.
I would always favor an appropriate *ply function over loops in R. In this case sapply could be your friend:
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
df.per.capita <- as.data.frame(
sapply(
df[ colnames(df) != "c" ], function(x){ x/df$c *1000 }
)
)
For more complicated cases, you should definitely have a look at the plyr package.
This can be done using sweep function. Using Beasterfield's data generation but setting the seed you can obtain the same results
set.seed(001)
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
per.capita <- sweep(df[,colnames(df) != "c"], 1, STATS=df$c, FUN='/')*1000
per.capita
a b
1 300.0000 300.0000
2 2000.0000 1000.0000
3 833.3333 1000.0000
4 7000.0000 10000.0000
5 222.2222 555.5556
6 1000.0000 875.0000
7 1285.7143 1142.8571
8 1200.0000 800.0000
9 3333.3333 333.3333
10 250.0000 2250.0000
Comparing with Beasterfield's results:
all.equal(df.per.capita, per.capita)
[1] TRUE
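For completeness, the loop the question asked about is also perfectly workable in base R (column names here are invented for illustration):

```r
df <- data.frame(a = c(10, 20), b = c(30, 40), c = c(2, 4))

# Add a per-capita column for every column except the population column c
for (col in setdiff(colnames(df), "c")) {
  df[[paste0(col, "_per_capita")]] <- df[[col]] / df$c * 1000
}

df
#    a  b c a_per_capita b_per_capita
# 1 10 30 2         5000        15000
# 2 20 40 4         5000        10000
```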

Learning to understand plyr, ddply

I've been attempting to understand what and how plyr works through trying different variables and functions and seeing what results. So I'm more looking for an explanation of how plyr works than specific fix it answers. I've read the documentation but my newbie brain is still not getting it.
Some data and names:
mydf<- data.frame(c("a","a","b","b","c","c"),c("e","e","e","e","e","e")
,c(1,2,3,10,20,30),
c(5,10,20,20,15,10))
colnames(mydf)<-c("Model", "Class","Length", "Speed")
mydf
Question 1: Summarise versus Transform Syntax
So if I Enter: ddply(mydf, .(Model), summarise, sum = Length+Length)
I get:
Model ..1
1 a 2
2 a 4
3 b 6
4 b 20
5 c 40
6 c 60
and if I enter: ddply(mydf, .(Model), summarise, Length+Length) I get the same result.
Now if use transform: ddply(mydf, .(Model), transform, sum = (Length+Length))
I get:
Model Class Length Speed sum
1 a e 1 5 2
2 a e 2 10 4
3 b e 3 20 6
4 b e 10 20 20
5 c e 20 15 40
6 c e 30 10 60
But if I state it like the first summarise :
ddply(mydf, .(Model), transform, (Length+Length))
Model Class Length Speed
1 a e 1 5
2 a e 2 10
3 b e 3 20
4 b e 10 20
5 c e 20 15
6 c e 30 10
So why does adding "sum =" make a difference?
Question 2: Why don't these work?
ddply(mydf, .(Model), sum, Length+Length) #Error in function (i) : object 'Length' not found
ddply(mydf, .(Model), length, mydf$Length) #Error in .fun(piece, ...) :
2 arguments passed to 'length' which requires 1
These examples are more to show that somewhere I'm fundamentally not understanding how to use plyr.
Any anwsers or explanations are appreciated.
I find that when I'm having trouble "visualizing" how any of the functional tools in R work, the easiest thing to do is drop into browser() on a single instance:
ddply(mydf, .(Model), function(x) browser() )
Then inspect x in real-time and it should all make sense. You can then test out your function on x, and if it works you're golden (barring other groupings being different than your first x).
The syntax is:
ddply(data.frame, variable(s), function, optional arguments)
where the function is expected to return a data.frame. In your situation,
summarise is a function that will transparently create a new data.frame, with the results of the expression that you provide as further arguments (...)
transform, a base R function, will transform the data.frames (first split by the variable(s)), adding new columns according to the expression(s) that you provide as further arguments. These need to be named, that's just the way transform works.
If you use other functions than subset, transform, mutate, with, within, or summarise, you'll need to make sure they return a data.frame (length and sum don't), or at the very least a vector of appropriate length for the output.
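The split/apply/combine that ddply performs can be sketched in base R, which makes it clear why the supplied function should return a data frame:

```r
mydf <- data.frame(Model  = c("a", "a", "b", "b", "c", "c"),
                   Length = c(1, 2, 3, 10, 20, 30))

pieces    <- split(mydf, mydf$Model)        # one data frame per Model
summaries <- lapply(pieces, function(d) {
  data.frame(Model = d$Model[1], n = nrow(d), total = sum(d$Length))
})
do.call(rbind, summaries)                   # recombine into one data frame
#   Model n total
# a     a 2     3
# b     b 2    13
# c     c 2    50
```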
The way I understand it, the ddply(..., .(...), summarise, ...) operations are designed to reduce the number of rows to match the number of distinct combinations of the .(...) grouping variables. So for your first example, this seemed natural:
ddply(mydf, .(Model), summarise, sL = sum(Length))
Model sL
1 a 3
2 b 13
3 c 50
OK. Seems to work for me (not a regular plyr user). The transform operations on the other hand I understand to be making new columns of the same length as the dataframe. That was what your first transform call accomplished. Your second one (a failure) was:
ddply(mydf, .(Model), transform, (Length+Length))
That one did not create a new name for the operation that was performed, so nothing new was assigned in the result. When you added sum=(Length+Length), there suddenly was a name available (and the sum function was not used). It's generally a bad idea to use the names of functions for column names.
On question two, I think that the .fun argument needs to be a plyr function or something that makes sense applied to a (split) dataframe as a whole, rather than any old function. There is no sum.data.frame method. But nrow or ncol do make sense. You can even get str to work in that position. The length function applied to a dataframe gives the number of columns:
ddply(mydf, .(Model), length ) # all 4's
