Given a time series containing data about cinemas, the "date" column is of interest. I would like to convert it into the format "YYYY/MM/DD". However, when I run my code:
CINEMA.TICKET$DATE <- as.Date(CINEMA.TICKET$date , format = "%y/%m/%d")
Two issues occur: first, the dates are shown on the far right of the table as, e.g., "0005-05-20"; second, many entries disappear entirely. Can someone explain what I am doing wrong and how to do it properly?
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day newdate DATE
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 5/5/2018 5 2 5 0005-05-20 2005-05-20
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 5/5/2018 5 2 5 0005-05-20 2005-05-20
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 5/5/2018 5 2 5 0005-05-20 2005-05-20
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 5/5/2018 5 2 5 0005-05-20 2005-05-20
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 5/5/2018 5 2 5 0005-05-20 2005-05-20
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 5/5/2018 5 2 5 0005-05-20 2005-05-20
As @Dave2e pointed out, you are looking for:
library(data.table)
setDT(CINEMA.TICKET)  # := assignment requires a data.table
CINEMA.TICKET[, date := as.Date(date, format = "%d/%m/%Y")]
This assumes the input format is day/month/year, e.g. "30/5/2018"; the question is not clear, since an example like "5/5/2018" could be either "%d/%m/%Y" or "%m/%d/%Y".
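A quick way to see the ambiguity (illustrative values only):
as.Date("5/5/2018", format = "%d/%m/%Y")   # "2018-05-05" -- parses either way
as.Date("30/5/2018", format = "%d/%m/%Y")  # "2018-05-30"
as.Date("30/5/2018", format = "%m/%d/%Y")  # NA -- there is no month 30, so the parse fails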
As for ordering columns, use:
setcolorder(CINEMA.TICKET, c("c", "b", "a"))
where "c", "b", "a" are the column names in their desired order.
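For example, a minimal sketch with hypothetical column names:
library(data.table)
dt <- data.table(a = 1:2, b = 3:4, c = 5:6)
setcolorder(dt, c("c", "b", "a"))  # reorders by reference
names(dt)
# [1] "c" "b" "a"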
lubridate probably does the trick
> lubridate::mdy("5/5/2018")
[1] "2018-05-05"
So you should use
library(lubridate)
library(tidyverse)
CINEMA.TICKET <- CINEMA.TICKET %>%
  mutate(DATE = mdy(date))
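One reason mdy() is convenient here is that it tolerates missing zero-padding, so no format string is needed (a quick check):
library(lubridate)
mdy(c("5/5/2018", "05/05/2018"))
# [1] "2018-05-05" "2018-05-05"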
Here is another option:
library(tidyverse)
output <- df %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))
Output
film_code cinema_code total_sales tickets_sold tickets_out show_time occu_perc ticket_price ticket_use capacity date month quarter day
1 1492 304 3900000 26 0 4 4.26 150000 26 610.3286 2018-05-05 5 2 5
2 1492 352 3360000 42 0 5 8.08 80000 42 519.8020 2018-05-05 5 2 5
3 1492 489 2560000 32 0 4 20.00 80000 32 160.0000 2018-05-05 5 2 5
4 1492 429 1200000 12 0 1 11.01 100000 12 108.9918 2018-05-05 5 2 5
5 1492 524 1200000 15 0 3 16.67 80000 15 89.9820 2018-05-05 5 2 5
6 1492 71 1050000 7 0 3 0.98 150000 7 714.2857 2018-05-05 5 2 5
To have date keep the Date class, you cannot have the forward slashes: Date values always print as "YYYY-MM-DD". You can change the format with format(), but the column will then no longer be classified as Date; it will be character again.
class(output$date)
# [1] "Date"
output2 <- df %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y")) %>%
  mutate(date = format(date, "%Y/%m/%d"))
class(output2$date)
# [1] "character"
Data
df <-
structure(
list(
film_code = c(1492L, 1492L, 1492L, 1492L, 1492L,
1492L),
cinema_code = c(304L, 352L, 489L, 429L, 524L, 71L),
total_sales = c(3900000L,
3360000L, 2560000L, 1200000L, 1200000L, 1050000L),
tickets_sold = c(26L,
42L, 32L, 12L, 15L, 7L),
tickets_out = c(0L, 0L, 0L, 0L, 0L,
0L),
show_time = c(4L, 5L, 4L, 1L, 3L, 3L),
occu_perc = c(4.26,
8.08, 20, 11.01, 16.67, 0.98),
ticket_price = c(150000L, 80000L,
80000L, 100000L, 80000L, 150000L),
ticket_use = c(26L, 42L, 32L,
12L, 15L, 7L),
capacity = c(610.3286, 519.802, 160, 108.9918,
89.982, 714.2857),
date = c("5/5/2018", "5/5/2018", "5/5/2018", "5/5/2018",
"5/5/2018", "5/5/2018"),
month = c(5L, 5L, 5L, 5L, 5L, 5L),
quarter = c(2L,
2L, 2L, 2L, 2L, 2L),
day = c(5L, 5L, 5L, 5L, 5L, 5L)
),
class = "data.frame",
row.names = c(NA,-6L)
)
I also have the total number of cancer patients (case_totals) and non-cancer patients (control_totals), which in this case are 100 and 1000 respectively.
Variant Cancer IBD AKI CKD CCF IHD
A1 0 5 4 0 0 4
A2 0 8 5 9 0 7
A3 20 9 6 7 0 3
B5 7 2 0 6 5 4
K7 9 1 8 4 2 5
L9 0 0 6 3 3 1
Desired outcome - two tables:
Table1:
Variant case_total not_seen_in_cases_total control_total not_seen_in_control_total
A1 0 100 13 987
A2 0 100 25 975
A3 20 80 25 975
B5 7 93 17 983
K7 9 91 20 980
L9 0 100 13 987
Table2:
case_total_in_gene not_seen_in_gene_cases control_total_in_gene control_total_not_in_gene
36 64 113 887
I will then run a Fisher's exact test across both tables to get a per-variant and a per-gene p-value, which I can do.
My issue is that I have multiple such datasets and in each the order of the columns of the input is different. At present I have been using:
ncol(dt) # get the total number of columns, as in reality the table is very large
which(colnames(dt) == 'Cancer') # get the index of the Cancer column
dt$control_total <- rowSums(dt[, 2:7]) - rowSums(dt[, 2]) # get a per-row control total
And then I subset dt and just add in the other columns using subtraction, e.g. dt$not_seen_in_control_total <- 1000 - dt$control_total.
This won't work with shifting column indices, and I want to run this across hundreds of files, ideally using commandArgs.
Ultimately, how do I reference a column that will always have the same name but will be in different positions, in a function like rowSums etc.?
Many thanks
You can select column names by position or pattern in names or by specifying range of columns. It depends on how your data is structured.
library(dplyr)
table1 <- df %>%
mutate(control_total = rowSums(select(., setdiff(2:ncol(.),
match('Cancer', names(.)))))) %>%
transmute(Variant, Cancer,
not_seen_in_cases_total = 100 - Cancer,
control_total,
not_seen_in_control_total = 1000 - control_total)
table1
# Variant Cancer not_seen_in_cases_total control_total not_seen_in_control_total
#1 A1 0 100 13 987
#2 A2 0 100 29 971
#3 A3 20 80 25 975
#4 B5 7 93 17 983
#5 K7 9 91 20 980
#6 L9 0 100 13 987
table2 <- table1 %>%
summarise(case_total_in_gene = sum(Cancer),
not_seen_in_gene_cases = 100 - case_total_in_gene,
control_total_in_gene = sum(control_total),
control_total_not_in_gene = 1000 - control_total_in_gene)
table2
# case_total_in_gene not_seen_in_gene_cases control_total_in_gene control_total_not_in_gene
#1 36 64 117 883
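For comparison, a base R sketch that also references columns purely by name, so shifting column positions across files don't matter (disease_cols is a hypothetical helper name):
# Sum every column except Variant and Cancer, whatever order they appear in
disease_cols <- setdiff(names(df), c("Variant", "Cancer"))
df$control_total <- rowSums(df[disease_cols])
df$not_seen_in_control_total <- 1000 - df$control_total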
data
df <- structure(list(Variant = c("A1", "A2", "A3", "B5", "K7", "L9"
), Cancer = c(0L, 0L, 20L, 7L, 9L, 0L), IBD = c(5L, 8L, 9L, 2L,
1L, 0L), AKI = c(4L, 5L, 6L, 0L, 8L, 6L), CKD = c(0L, 9L, 7L,
6L, 4L, 3L), CCF = c(0L, 0L, 0L, 5L, 2L, 3L), IHD = c(4L, 7L,
3L, 4L, 5L, 1L)), class = "data.frame", row.names = c(NA, -6L))
My dataframe is like this:
Device_id Group Nb_burst Date_time
24 1 3 2018-09-02 10:04:04
24 1 5 2018-09-02 10:08:00
55 2 3 2018-09-03 10:14:34
55 2 7 2018-09-03 10:02:29
16 3 2 2018-09-20 08:17:11
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
For rows with the same ID, the same Group number, and the same Date, I would like to know the time lag between two consecutive rows.
If the time lag is > 1 hour, then all right, keep them all.
If the time lag is < 1 hour, then keep only the row with the biggest Nb_burst.
Which means a dataframe like:
Device_id Group Nb_burst Date_time
24 1 5 2018-09-02 10:08:00
55 2 7 2018-09-03 10:02:29
16 3 71 2018-09-20 06:03:40
22 4 10 2018-10-02 11:33:55
22 4 14 2018-10-02 16:22:18
I tried:
Data$timelag <- c(NA, difftime(Data$Min_start.time[-1], Data$Min_start.time[-nrow(Data)], units="hours"))
But I don't know how to apply this test only when Date, ID, and Group are the same; probably a loop is needed.
My df has 1500 rows.
I hope someone can help me. Thank you!
I'm not sure why your group 3 is not duplicated, since its time difference is greater than one hour.
But you could create two indexing variables using ave: first, the order of Nb_burst within each grouping; second, the time differences within each grouping.
dat <- within(dat, {
score <- ave(Nb_burst, Device_id, Group, as.Date(Date_time),
FUN=order)
thrsh <- abs(ave(as.numeric(Date_time), Device_id, Group, as.Date(Date_time),
FUN=diff)/3600) > 1
})
Finally subset by rowSums.
dat[rowSums(dat[c("score", "thrsh")]) > 1,1:4]
# Device_id Group Nb_burst Date_time
# 2 24 1 5 2018-09-02 10:08:00
# 3 55 2 7 2018-09-03 10:14:34
# 5 16 3 2 2018-09-20 08:17:11
# 6 16 3 71 2018-09-20 06:03:40
# 7 22 4 10 2018-10-02 11:33:55
# 8 22 4 14 2018-10-02 16:22:18
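As a minimal illustration of how ave() applies a function within groups (toy values, not the question's data):
ave(c(3, 5, 7, 3), c(1, 1, 2, 2), FUN = max)  # each element replaced by its group max
# [1] 5 5 7 7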
Data
dat <- structure(list(Device_id = c(24L, 24L, 55L, 55L, 16L, 16L, 22L,
22L), Group = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Nb_burst = c(3L,
5L, 7L, 3L, 2L, 71L, 10L, 14L), Date_time = structure(c(1535875444,
1535875680, 1535962474, 1535961749, 1537424231, 1537416220, 1538472835,
1538490138), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-8L), class = "data.frame")
I'm trying to produce a correlation plot for my data, but I get an 'x must be numeric' error, and fixes for other cases have not worked for mine. Do I have to change the month to numeric as well, or is there a way of selecting only the numeric columns for my plot?
I tried converting everything to numeric, but it just changes back to factor automatically.
getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)
cp <- cor(myDF)
head(round(cp,2))
'data.frame': 12 obs. of 8 variables:
$ Month : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
$ Price.per.Seat : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
$ Paid.Seats : int 500 500 500 500 500 1000 1000 1000 1000 1000 ...
$ Active.Users : int 10 50 50 100 450 550 800 900 950 800 ...
$ Support.Cases : int 0 0 1 5 35 155 100 75 50 45 ...
$ Users.Trained : int 1 5 0 50 100 300 50 30 0 100 ...
$ Features.Used : int 5 5 5 5 8 9 9 10 15 15 ...
The results of dput(myDF) are as follows:
dput(myDF)
structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec",
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov",
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ",
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ",
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L,
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L,
950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L,
35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L,
5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L,
5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-12L))
You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:
myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
cp <- cor(myDF)
library(ggcorrplot)
ggcorrplot(cp)
You are trying to get a correlation between factor and numeric columns, which can't work (cor handles only numeric data, hence the error). You can do:
library(data.table)
ir <- data.table(iris) # since you didn't produce a reproducible example
ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]
What is in there:
cor(.SD) calculates the correlation matrix for a new data frame composed of a subset data.table (.SD, see ?data.table).
.SDcols establishes which columns go into that subset data.table: only those whose class is numeric.
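A base R equivalent of that column filter (a sketch, assuming the non-numeric columns should simply be dropped; num_cols is a hypothetical helper name):
num_cols <- sapply(myDF, is.numeric)  # logical vector marking numeric columns
cor(myDF[, num_cols])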
You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF[,2:8],as.numeric)
cor(newdf)
Edited:
If you want to use the month variable, install lubridate and use its month() function.
For example:
library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
The way to convert those months to Date class:
myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")
Could also have used zoo::as.yearmon. Either method allows you to apply as.numeric to get a validly time-scaled value. The other answers are adequate for single-year data, but because they incorrectly assume that the leading two digits are the day of the month rather than the year, they will fail to deliver valid answers on any multi-year dataset, and will not throw any warning about this.
with(myDF, cor(Active.Users, as.numeric(MonDt) ) )
[1] 0.8269705
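For the zoo route mentioned above, a hedged sketch (requires the zoo package and assumes the same "%y-%b" layout; MonDt2 is a hypothetical column name):
library(zoo)
# as.yearmon parses the year-month text; as.Date then maps it to the first of the month
myDF$MonDt2 <- as.Date(as.yearmon(as.character(myDF$Month), "%y-%b"))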
As one of the other answers illustrated, removing the $ and commas is needed before as.numeric will succeed on currency-formatted text. Again, this is also factor data, so as.numeric could have yielded erroneous answers, although in this simple example it would not. A safe method would be:
myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))
myDF
Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1 18-Jan 25000 50 500 10
2 18-Feb 25000 50 500 50
3 18-Mar 25000 50 500 50
4 18-Apr 25000 50 500 100
5 18-May 25000 50 500 450
6 18-Jun 40000 40 1000 550
7 18-Jul 40000 40 1000 800
8 18-Aug 40000 40 1000 900
9 18-Sep 40000 40 1000 950
10 18-Oct 40000 40 1000 800
11 18-Nov 40000 40 1000 700
12 18-Dec 40000 40 1000 600
Support.Cases Users.Trained Features.Used MonDt
1 0 1 5 2018-01-15
2 0 5 5 2018-02-15
3 1 0 5 2018-03-15
4 5 50 5 2018-04-15
5 35 100 8 2018-05-15
6 155 300 9 2018-06-15
7 100 50 9 2018-07-15
8 75 30 10 2018-08-15
9 50 0 15 2018-09-15
10 45 100 15 2018-10-15
11 10 50 15 2018-11-15
12 5 0 15 2018-12-15
This question gets an answer that allows multiple correlation coefficients to be calculated and the two way data associations plotted on one page:
How to add p values for correlation coefficients plotted using splom in lattice?
I have some data that look like this:
head(t)
sub trialnum block.x lat.x block.y lat.y diff
1 1 10 3 1355 5 1337 18
2 1 11 3 1324 5 1470 -146
3 1 12 3 1861 5 1690 171
4 1 13 3 3501 5 1473 2028
5 1 14 3 1566 5 1402 164
6 1 15 3 1380 5 1539 -159
What I would like to do is reformat the data in R such that the values of "trialnum" (there are 20 of them) become the new columns, "sub" provides the row values, and each cell holds the "diff" value. For example:
trialnum1 trialnum2 trialnum3...
sub
1
2
3
.
.
.
Any help would be much appreciated. Although the answer is probably simple, I've been struggling with this problem for some time.
Base package: we transpose the diff column with the function t(x), then create the desired column names. (The data frame is itself named t here, which shadows the transpose function for non-call uses; t(t[, 7]) still works because R looks for a function when t appears in call position.)
df <- data.frame(t(t[, 7]))
# Using the trialnum column
colnames(df) <- paste0(colnames(t[2]), t[, 2])
# or just the number of rows
colnames(df) <- paste0(colnames(t[2]), 1:nrow(t))
Output:
trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
1 18 -146 171 2028 164 -159
trialnum1 trialnum2 trialnum3 trialnum4 trialnum5 trialnum6
1 18 -146 171 2028 164 -159
With dplyr and tidyr, first get rid of the columns you don't want, then spread trialnum and diff.
library(dplyr)
library(tidyr)
t %>% select(-block.x:-lat.y) %>% # get rid of extra columns so t will collapse
mutate(trialnum = paste0('trialnum', trialnum)) %>% # fix values for column names
spread(trialnum, diff) # spread columns
# sub trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
# 1 1 18 -146 171 2028 164 -159
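In current tidyr, spread() has been superseded by pivot_wider(); an equivalent sketch:
library(dplyr)
library(tidyr)
t %>%
  select(sub, trialnum, diff) %>%
  pivot_wider(names_from = trialnum, values_from = diff, names_prefix = "trialnum")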
Data
t <- structure(list(sub = c(1L, 1L, 1L, 1L, 1L, 1L), trialnum = 10:15,
block.x = c(3L, 3L, 3L, 3L, 3L, 3L), lat.x = c(1355L, 1324L,
1861L, 3501L, 1566L, 1380L), block.y = c(5L, 5L, 5L, 5L,
5L, 5L), lat.y = c(1337L, 1470L, 1690L, 1473L, 1402L, 1539L
), diff = c(18L, -146L, 171L, 2028L, 164L, -159L)), .Names = c("sub",
"trialnum", "block.x", "lat.x", "block.y", "lat.y", "diff"), row.names = c(NA,
-6L), class = "data.frame")
I’m looking for a way to employ a lookup algorithm on a dataframe that, for a given element, examines corresponding variables within a range and returns the max such variable. The general gist is that I want the function to (1) look at a given element, (2) find all other elements of the same name, (3) among all elements of the same name, see if a corresponding variable is within +- X of any others, and (4) if so, return the max of those; if not, just return whatever that variable is.
A concrete example is with some time stamp data. Say I have orders for 2 businesses that are classified by date, hour, and minute. I want to look at daily orders, but the problem is that if orders come within 2 minutes of each other, they’re double-counted, so I only want to look at the max value in such cases.
*EDIT: I should say that if orders are logged consecutively within a couple of minutes of each other, we assume they're duplicates and only want the max value. So if 4 orders came in, each a minute apart, but there were no other orders within +2 minutes of the last or -2 minutes of the first, we can assume that group of 4 should only be counted once, and it is the max value that should be counted.
Here's some data:
data <- structure(list(date = structure(c(16090, 16090, 16090, 16090,
16090, 16090, 16090, 16090, 16090, 16090, 16090, 16090, 16091,
16091, 16091, 16091, 16091, 16091, 16091), class = "Date"), company = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = c("ABCo", "Zyco"), class = "factor"), hour = c(5L,
5L, 5L, 7L, 7L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 6L, 6L, 6L, 7L, 7L,
7L, 8L), minute = c(21L, 22L, 50L, 13L, 20L, 34L, 47L, 34L, 35L,
20L, 44L, 19L, 14L, 16L, 37L, 24L, 26L, 49L, 50L), orders = c(59L,
46L, 31L, 15L, 86L, 23L, 8L, 71L, 86L, 44L, 23L, 47L, 6L, 53L,
21L, 54L, 73L, 63L, 4L)), .Names = c("date", "company", "hour",
"minute", "orders"), row.names = c(NA, -19L), class = "data.frame")
What I care about here is: for each company, on a given date, within a given hour, if there are any entries that fall within +-2 minutes of each other, I want to take the max value of "orders". If a given entry has nothing within +-2 minutes of it, then just keep the "orders" value as given. (In this case, for the first two rows of "data", ABCo on 2014-01-20 at hour = 5: since minutes 21 and 22 are within +-2 of each other, we'd return the max value for orders, so 59. The third row, ABCo on 1-20 at hour = 5 and minute = 50, has no other minute within +-2 of it, so we'd just keep the value for orders, 31.)
A starting point to look at the data for minutes and orders in terms of company+date+hour could be to concatenate these 3 terms together and reorganize the data frame:
data$biztime <- do.call(paste, c(data[c("company", "date", "hour")], sep = "_"))
library(plyr)  # for ddply
data2 <- ddply(data, .(biztime, minute), summarise, orders = sum(orders))
But from here I'm lost. Is there any easy way to add another column to this dataframe using an ifelse statement or something else along these lines that does the sort of conditional operation above?
Add a column of datetime objects:
data <- transform(data,
datetime = strptime(sprintf("%s %s:%s", date, hour, minute),
format = "%Y-%m-%d %H:%M"))
Add a column of indices where two rows within two minutes of each other will share the same index:
library(plyr)  # for ddply
data <- ddply(data, .(company), transform, timegroup =
                cumsum(c(TRUE, diff(datetime, units = "mins") > 2)))
Finally, summarize:
ddply(data, .(company, timegroup), summarise,
orders = max(orders),
datetime = datetime[1])
# company timegroup orders datetime
# 1 ABCo 1 59 2014-01-20 05:21:00
# 2 ABCo 2 31 2014-01-20 05:50:00
# 3 ABCo 3 15 2014-01-20 07:13:00
# 4 ABCo 4 86 2014-01-20 07:20:00
# 5 ABCo 5 53 2014-01-21 06:14:00
# 6 ABCo 6 21 2014-01-21 06:37:00
# 7 ABCo 7 73 2014-01-21 07:24:00
# 8 ABCo 8 63 2014-01-21 07:49:00
# 9 ABCo 9 4 2014-01-21 08:50:00
# 10 Zyco 1 23 2014-01-20 05:34:00
# 11 Zyco 2 8 2014-01-20 05:47:00
# 12 Zyco 3 86 2014-01-20 06:34:00
# 13 Zyco 4 44 2014-01-20 07:20:00
# 14 Zyco 5 23 2014-01-20 07:44:00
# 15 Zyco 6 47 2014-01-20 08:19:00
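The cumsum(c(TRUE, diff(...) > 2)) idiom is the key step; on a toy sorted vector of minutes it behaves like this (illustrative values only):
x <- c(21, 22, 50, 51)
cumsum(c(TRUE, diff(x) > 2))  # a gap of more than 2 starts a new group
# [1] 1 1 2 2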
Unless I misunderstood something, perhaps this is helpful; probably slow, I guess.
data$gr = as.numeric(interaction(data$company, data$date, data$hour))
ff = function(mins, ords) {
unlist(lapply(mins, function(x) max(ords[abs(x - mins) <= 2])))
}
do.call(rbind,
lapply(split(data, data$gr),
function(x) transform(x, new_val = ff(x$minute, x$orders))))
# date company hour minute orders gr new_val
#1.1 2014-01-20 ABCo 5 21 59 1 59
#1.2 2014-01-20 ABCo 5 22 46 1 59
#1.3 2014-01-20 ABCo 5 50 31 1 31
#2.6 2014-01-20 Zyco 5 34 23 2 23
#2.7 2014-01-20 Zyco 5 47 8 2 8
#6.8 2014-01-20 Zyco 6 34 71 6 86
#6.9 2014-01-20 Zyco 6 35 86 6 86
#7.13 2014-01-21 ABCo 6 14 6 7 53
#7.14 2014-01-21 ABCo 6 16 53 7 53
#7.15 2014-01-21 ABCo 6 37 21 7 21
#9.4 2014-01-20 ABCo 7 13 15 9 15
#9.5 2014-01-20 ABCo 7 20 86 9 86
#10.10 2014-01-20 Zyco 7 20 44 10 44
#10.11 2014-01-20 Zyco 7 44 23 10 23
#11.16 2014-01-21 ABCo 7 24 54 11 73
#11.17 2014-01-21 ABCo 7 26 73 11 73
#11.18 2014-01-21 ABCo 7 49 63 11 63
#14 2014-01-20 Zyco 8 19 47 14 47
#15 2014-01-21 ABCo 8 50 4 15 4