I have this data frame:
> head(merged.tables)
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday StoreType
1 1 5 2015-07-31 5263 555 1 1 0 1 c
2 1 6 2013-01-12 4952 646 1 0 0 0 c
3 1 5 2014-01-03 4190 552 1 0 0 1 c
4 1 3 2014-12-03 6454 695 1 1 0 0 c
5 1 3 2013-11-13 3310 464 1 0 0 0 c
6 1 7 2013-10-27 0 0 0 0 0 0 c
Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
1 a 1270 9 2008 0
2 a 1270 9 2008 0
3 a 1270 9 2008 0
4 a 1270 9 2008 0
5 a 1270 9 2008 0
6 a 1270 9 2008 0
Promo2SinceWeek Promo2SinceYear PromoInterval
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
Then I want to extract a data frame showing the average of the Sales column when Open equals 1, grouped by StoreType.
I used this command because I think it's the fastest:
merged.tables[StateHoliday==1,mean(na.omit(Sales)),by=StoreType]
But I got this error:
Error in [.data.frame(merged.tables, StateHoliday == 0,
mean(na.omit(Sales)), : unused argument (by = StoreType)
I searched but didn't find an answer to this error. Thanks for your help!
I had the same error. The problem was solved when I realised my data was not in data.table format.
For example:
copy <- data.table(data)
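Once converted, data.table's [i, j, by] syntax works. A minimal sketch using the grouping described in the question (mean Sales where Open == 1, by StoreType):
library(data.table)
merged.tables <- as.data.table(merged.tables)
merged.tables[Open == 1, mean(Sales, na.rm = TRUE), by = StoreType]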
Overview
There are lots of ways of applying a function to a group of values in your data frame. I present two:
Using the dplyr package to arrange your data in a way that answers your question.
Using tapply(), which performs a function over a group of values.
Reproducible Example
For each store type, I want the average sales for those stores whose Open value is equal to 1.
I present the dplyr method first, followed by tapply.
Note: The following data frame only takes a few columns from those posted in the OP.
# install necessary package
install.packages( pkgs = "dplyr" )
# load necessary package
library( dplyr )
# create data frame (no seed is set, so your random Sales values will differ)
merged.tables <-
data.frame(
Store = c( 1, 1, 1, 2, 2, 2 )
, StoreType = rep( x = c( "s", "m", "l" ) , times = 2)
, Sales = round( x = runif( n = 6, min = 3000, max = 6000 ) , digits = 0 )
, Open = c( 1, 1, 0, 0, 1, 1 )
, stringsAsFactors = FALSE
)
# view the data
merged.tables
# Store StoreType Sales Open
# 1 1 s 4608 1
# 2 1 m 4017 1
# 3 1 l 4210 0
# 4 2 s 4833 0
# 5 2 m 3818 1
# 6 2 l 3090 1
# dplyr method
merged.tables %>%
group_by( StoreType ) %>%
filter( Open == 1 ) %>%
summarise( AverageSales = mean( x = Sales , na.rm = TRUE ) )
# A tibble: 3 x 2
# StoreType AverageSales
# <chr> <dbl>
# 1 l 3090
# 2 m 3918
# 3 s 4608
# tapply method
# create the condition
# that 'Open' must be equal to one
Open.equals.one <- which( merged.tables$Open == 1 )
# apply the condition to
# both X and INDEX
tapply( X = merged.tables$Sales[ Open.equals.one ]
, INDEX = merged.tables$StoreType[ Open.equals.one ]
, FUN = mean
, na.rm = TRUE # just in case your data does have NA values in the `Sales` column, this removes them from the calculation
)
# l m s
# 3090.0 3917.5 4608.0
# end of script #
Resources
Should you need more conditions later on, I encourage you to check out other relevant SO posts, such as How to combine multiple conditions to subset a data-frame using “OR”? and Why is [ better than subset?.
I would like to divide the values of a column according to a condition.
Something like the following:
Data$ColumnA[Data$ColumnA > 50 && Data$ColumnB > 0] <- Data$ColumnA / 25
The problem is that Data$ColumnA / 25 loses the "index" and performs the division starting from the first value in the column.
Thank you
I prefer using a data.table instead of a data.frame, not only for performance reasons but also because the syntax is more compact. (Note, too, that && is not vectorized; for element-wise conditions on columns you want &.)
library(data.table)
Data <- data.frame(ColumnA = seq(0, 175, by = 25),
ColumnB = c(0, 1))
Data
# ColumnA ColumnB
# 1 0 0
# 2 25 1
# 3 50 0
# 4 75 1
# 5 100 0
# 6 125 1
# 7 150 0
# 8 175 1
setDT(Data) # "convert" data.frame into a data.table
Data[ColumnA > 50 & ColumnB > 0, ColumnA := ColumnA / 25]
Data
# ColumnA ColumnB
# 1: 0 0
# 2: 25 1
# 3: 50 0
# 4: 3 1
# 5: 100 0
# 6: 5 1
# 7: 150 0
# 8: 7 1
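For comparison, a hedged base-R sketch of the fix: index both sides of the assignment so each selected value is divided in place, and use the vectorized & rather than &&.
# assuming Data is still a plain data.frame
idx <- Data$ColumnA > 50 & Data$ColumnB > 0
Data$ColumnA[idx] <- Data$ColumnA[idx] / 25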
I have a set of variables that contain data about whether or not a person has ever had certain health conditions. For example, "have you ever had a heart attack?"
If they say "yes" at observation 2, then the answer is still yes at observations 3 and 4. But, it is not necessarily yes at observation 1. The heart attack could have occurred between observation 1 and 2.
If they say "no" at observation 2, then the answer is no at observation 1. But, it is not necessarily no at observations 3 or 4.
Here is a reproducible example:
# tibble() comes from the tibble package (also attached by dplyr)
library(tibble)

df <- tibble(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA)
)
df
id obs mi_ever
1 1 1 NA
2 1 2 0
3 1 3 1
4 1 4 NA
5 2 1 NA
6 2 2 0
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 1
11 3 3 NA
12 3 4 NA
It's trivial to carry my 0's (No's) backward or carry my 1's (Yes's) forward using zoo::na.locf. However, I'm not sure how to carry 0's backward and 1's forward. Ideally, I'd like the following result:
id obs mi_ever mi_ever_2
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
I've checked out the following posts, but none seem to cover exactly what I'm asking here.
Carry last Factor observation forward and backward in group of rows in R
Forward and backward fill data frame in R
making a "dropdown" function in R
Any help is appreciated.
Basically I'm marking the items in sequence after the first 1 to become 1 and the ones before the last 0 to become 0.
# index of the first 1; positions after it should become 1
ever <- function(x) min(which(x == 1))
NA_1 <- function(x) seq_along(x) > ever(x)  # could have done in one function

# check to see if it's working
ave(df$mi_ever, df$id, FUN = function(x) { x[NA_1(x)] <- 1; x })
# [1] NA  0  1  1 NA  0 NA NA NA  1  1  1

# index of the last 0; positions before it should become 0
not_yet <- function(x) max(which(x == 0))
NA_0 <- function(x) seq_along(x) < not_yet(x)

# make a temporary version of the 1-modified column
temp1 <- ave(df$mi_ever, df$id, FUN = function(x) { x[NA_1(x)] <- 1; x })
# then make the final version; could have done it "in place" I suppose
df$ever2 <- ave(temp1, df$id, FUN = function(x) { x[NA_0(x)] <- 0; x })
df
# A tibble: 12 x 4
id obs mi_ever ever2
<int> <int> <dbl> <dbl>
1 1 1 NA 0
2 1 2 0 0
3 1 3 1 1
4 1 4 NA 1
5 2 1 NA 0
6 2 2 0 0
7 2 3 NA NA
8 2 4 NA NA
9 3 1 NA NA
10 3 2 1 1
11 3 3 NA 1
12 3 4 NA 1
If you need to suppress the warnings, that should be possible (e.g., by wrapping the calls in suppressWarnings()).
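Since the question mentions zoo::na.locf, here is a hedged sketch along those lines (my own construction, not part of the answer above): forward-fill only the 1's, backward-fill only the 0's, and combine them, with observed values and forward-filled 1's taking precedence.
library(zoo)
fill_ever <- function(x) {
  fwd <- na.locf(ifelse(x == 1, 1, NA), na.rm = FALSE)                   # carry 1's forward
  bwd <- na.locf(ifelse(x == 0, 0, NA), na.rm = FALSE, fromLast = TRUE)  # carry 0's backward
  ifelse(!is.na(x), x, ifelse(!is.na(fwd), fwd, bwd))
}
ave(df$mi_ever, df$id, FUN = fill_ever)
# [1]  0  0  1  1  0  0 NA NA NA  1  1  1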
I took the answer from @42- above (Thank you!), and tweaked it a little bit to further suit my needs. Specifically, I:
Took care of the warnings "no non-missing arguments to min; returning Inf" and "no non-missing arguments to max; returning -Inf".
Combined the separate functions into a single function (although the separate functions were extremely useful for learning).
Added an optional check_logic argument. When TRUE, the function will return 9's if a 0 comes after a 1. This represents a data error or logic flaw that warrants further investigation.
Added an example of using the function with data.table, and on multiple variables at once. This more accurately represents how I'm using the function in real life, and I thought it may be useful to others.
The function:
distribute_ever <- function(x, check_logic = TRUE, ...) {
if (check_logic) {
if (length(which(x == 1)) > 0 & length(which(x == 0)) > 0) {
if (min(which(x == 1)) < max(which(x == 0))) {
x <- 9 # Set x to 9 if zero comes after 1
}
}
}
ones <- which(x == 1) # Get indices for 1's
if (length(ones) > 0) { # Prevents warning
first_1_by_group <- min(which(x == 1)) # Index first 1 by group
x[seq_along(x) > first_1_by_group] <- 1 # Set x at subsequent indices to 1
}
zeros <- which(x == 0) # Get indices for 0's
if (length(zeros) > 0) { # Prevents warning
last_0_by_group <- max(which(x == 0)) # Index last 0 by group
x[seq_along(x) < last_0_by_group] <- 0 # Set x at previous indices to 0
}
x
}
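A quick sanity check on a single vector (id 1's responses from the earlier example):
distribute_ever(c(NA, 0, 1, NA))
# [1] 0 0 1 1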
A new reproducible example with multiple "ever" variables and some cases with 0 after 1:
dt <- data.table(
id = rep(1:3, each = 4),
obs = rep(1:4, times = 3),
mi_ever = c(NA, 0, 1, NA, NA, 0, NA, NA, NA, 1, NA, NA),
diab_ever = c(0, NA, NA, 1, 1, NA, NA, 0, 0, NA, NA, NA)
)
Iterate over multiple variables quickly using data.table (with by-group processing):
ever_vars <- c("mi_ever", "diab_ever")
dt[, paste0(ever_vars, "_2") := lapply(.SD, distribute_ever),
.SDcols = ever_vars,
by = id][]
Results:
id obs mi_ever diab_ever mi_ever_2 diab_ever_2
1: 1 1 NA 0 0 0
2: 1 2 0 NA 0 NA
3: 1 3 1 NA 1 NA
4: 1 4 NA 1 1 1
5: 2 1 NA 1 0 9
6: 2 2 0 NA 0 9
7: 2 3 NA NA NA 9
8: 2 4 NA 0 NA 9
9: 3 1 NA 0 NA 0
10: 3 2 1 NA 1 NA
11: 3 3 NA NA 1 NA
12: 3 4 NA NA 1 NA
For each input "ever" variable, we have:
Created a new variable with "_2" appended to the end of the input variable name. You could also edit "in place" as 42- pointed out, but I like being able to double check my data.
Zeroes are carried backward and ones are carried forward in time.
NA's after zeros and before ones (within id) are returned unchanged.
If there is a 0 (No, I've never had ...) after a 1 (Yes, I've had ...), as is the case with person 2's responses regarding diabetes, then the function returns 9's.
If we were to set check_logic to FALSE, then 1's would win out and replace 0's.
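For example, here is person 2's diabetes sequence with the check disabled (the leading 1 is carried forward over the trailing 0):
distribute_ever(c(1, NA, NA, 0), check_logic = FALSE)
# [1] 1 1 1 1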
I have a data.table with two columns (date and status), and now I want to insert new columns based on the original table.
Data rules:
the Status column contains only "0" and "1"
the Date column always increases by seconds :)
New variables:
group: numbers each group or cycle of the status. The order within a cycle is (0, 1): a cycle starts with status "0", and when the status becomes "0" again, the cycle is complete.
cycle_time: the cycle time for each group
group_0: the time spent in status 0 within a specific group
group_1: the time spent in status 1 within a specific group
For example, here is the code to generate a simple input:
dd <- data.table(
  date = c("2015-07-01 00:00:12", "2015-07-01 00:00:13", "2015-07-01 00:00:14",
           "2015-07-01 00:00:15", "2015-07-01 00:00:16", "2015-07-01 00:00:17",
           "2015-07-01 00:00:18", "2015-07-01 00:00:19", "2015-07-01 00:00:20",
           "2015-07-01 00:00:21", "2015-07-01 00:00:22", "2015-07-01 00:00:23",
           "2015-07-01 00:00:24", "2015-07-01 00:00:25"),
  status = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0)
)
The desired output adds these four new columns to the table.
I have actually already done this with some basic methods. The main idea is: if the current status is 0 and the next status is 1, then mark it as one group.
The idea works, but the calculation takes too long because of the many loops involved.
I suspect there is an easier solution for this case.
So a transition from 1 to 0 marks the boundary of a group. You can use cumsum and diff to get this working. For the x example in the answer from @zx8754:
data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
x group_id
1 0 1
2 0 1
3 0 1
4 1 1
5 1 1
6 0 2
7 0 2
8 1 2
9 0 3
For a more realistically sized example:
res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
user system elapsed
2.770 1.680 4.449
> head(res, 20)
status group_id
1 0 1
2 0 1
3 1 1
4 0 2
5 0 2
6 0 2
7 1 2
8 1 2
9 0 3
10 1 3
11 1 3
12 0 4
13 1 4
14 0 5
15 0 5
16 1 5
17 0 6
18 0 6
19 1 6
20 0 7
Around 4.5 seconds for 100 million records (10e7) is quite fast (although that depends on your definition of fast :)).
Benchmarking
set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))
microbenchmark::microbenchmark(
rleid = {
gr <- data.table::rleid(res$status)
x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
# removing "as.numeric(as.factor" helps, but still not as fast as cumsum
#x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
},
cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777 100 b
# cumsum 1.511811 1.559563 2.221273 1.826404 2.475402 6.88169 100 a
identical(x1, x2)
# [1] TRUE
Try this:
#dummy data
x <- c(0,0,0,1,1,0,0,1,0)
#get group id using rleid from data.table
gr <- data.table::rleid(x)
#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)
#result
cbind(x, gr)
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 3
# [7,] 0 3
# [8,] 1 3
# [9,] 0 5
#if we need to have group names sequential then
cbind(x, gr = as.numeric(as.factor(gr)))
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 2
# [7,] 0 2
# [8,] 1 2
# [9,] 0 3
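Building on the group ids above, here is a hedged sketch of the other columns the OP asked for. It assumes date is converted to POSIXct, that cycle_time means the elapsed seconds from a group's first to its last observation, and that (since observations are one second apart) the time spent in each status can be counted as rows:
library(data.table)
dd[, date := as.POSIXct(date)]
# group id: a 1-to-0 transition starts a new group
dd[, group := c(1, cumsum(diff(status) == -1) + 1)]
# elapsed seconds from first to last observation in the group
dd[, cycle_time := as.numeric(difftime(max(date), min(date), units = "secs")), by = group]
# seconds spent in each status, counting rows at 1-second spacing
dd[, group_0 := sum(status == 0), by = group]
dd[, group_1 := sum(status == 1), by = group]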
Hi, I want to identify and label the largest number in each group. Can someone tell me how to get this done in R (or maybe Excel would be easier)?
The following is example data; the original data contains only the left two columns, and I want to generate the third. In the third column, I want to label the largest value in each group as 1 (e.g., in group 1 the largest is 0.02874, so it's marked as 1) and the others as 0. Thank you!
x <- read.table(header=T, text="group value largest
1 0.02827 0
1 0.02703 0
1 0.02874 1
2 0.03255 0
2 0.10394 1
2 0.03417 0
3 0.13858 0
3 0.16084 0
3 0.99830 1
3 0.24563 0")
UPDATE: Thank you all for your help! They are all great solutions!
Finally, the base (no package required) approach:
is.largest <- function(x) as.integer(seq_along(x) == which.max(x))
x <- transform(x, largest = ave(value, group, FUN = is.largest))
Note that if I were you, I would remove the as.integer and just store a logical (TRUE/FALSE) vector.
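For instance, a sketch of one way to get that logical variant, comparing each value to its group maximum (note that ties would flag more than one row):
x$largest <- with(x, ave(value, group, FUN = max) == value)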
library(data.table)
x <- data.table(x)
y <- x[,list(value = max(value), maxindicator = TRUE), by = c('group')]
z <- merge(x,y, by = c('group','value'), all = TRUE)
Output
> z
group value largest maxindicator
1: 1 0.02703 0 NA
2: 1 0.02827 0 NA
3: 1 0.02874 1 TRUE
4: 2 0.03255 0 NA
5: 2 0.03417 0 NA
6: 2 0.10394 1 TRUE
7: 3 0.13858 0 NA
8: 3 0.16084 0 NA
9: 3 0.24563 0 NA
10: 3 0.99830 1 TRUE
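A more compact data.table alternative (a sketch; like the merge above, it marks ties as maxima):
x[, largest2 := as.integer(value == max(value)), by = group]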
Here is a solution with plyr:
library(plyr)

x$largest <- 0
x <- ddply(x, .(group), function(df) {
df$largest[which.max(df$value)] <- 1
df
})
And one with base R:
x$largest <- 0
l <- split(x, x$group)
l <- lapply(l, function(df) {
df$largest[which.max(df$value)] <- 1
df
})
x <- do.call(rbind, l)
Here's a less cool base approach:
FUN <- function(x) {y <- rep(0, length(x)); y[which.max(x)] <- 1; y}
x$largest <- unlist(tapply(x$value, x$group, FUN))
## group value largest
## 1 1 0.02827 0
## 2 1 0.02703 0
## 3 1 0.02874 1
## 4 2 0.03255 0
## 5 2 0.10394 1
## 6 2 0.03417 0
## 7 3 0.13858 0
## 8 3 0.16084 0
## 9 3 0.99830 1
## 10 3 0.24563 0
It was more difficult to do in base than I had anticipated.
I am new to R.
I would like to transform a binary matrix like this example:
" 1874 1875 1876 1877 1878 .... 2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
Since the column names are years, I would like to aggregate them into decades and obtain something like:
"1840-1849 1850-1859 1860-1869 .... 2000-2009
F 1 0 0 0 0 ... 0
E 1 1 0 0 0 ... 0
D 1 1 0 0 0 ... 0
C 1 1 0 0 0 ... 0
B 1 1 0 0 0 ... 0
A 1 1 0 0 0 ... 0"
I am used to Python and do not know how to do this transformation without writing loops!
Thanks, isabel
It is unclear what aggregation you want, but using the following dummy data
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
The following counts events in each 10-year period.
Get the years as a numeric variable
years <- as.numeric(names(df))
Next we need an indicator for the start of each decade
ind <- seq(from = signif(years[1], 3), to = signif(tail(years, 1), 3), by = 10)
We then apply over the indices of ind (1:(length(ind)-1)), select columns from df that are the current decade and count the 1s using rowSums.
tmp <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
}, inds = ind, data = df)
Next we cbind the resulting vectors into a data frame and fix-up the column names:
out <- do.call(cbind.data.frame, tmp)
names(out) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out
This gives:
> out
1870-1879 1880-1889 1890-1899
1 4 5 6
2 4 6 6
3 2 5 5
4 5 5 7
5 3 3 7
6 5 5 4
If you want simply a binary matrix with a 1 indicating at least 1 event happened in that decade, then you can use:
tmp2 <- lapply(seq_along(ind[-1]),
function(i, inds, data) {
as.numeric(rowSums(data[, names(data) %in% inds[i]:(inds[i+1]-1)]) > 0)
}, inds = ind, data = df)
out2 <- do.call(cbind.data.frame, tmp2)
names(out2) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")
out2
which gives:
> out2
1870-1879 1880-1889 1890-1899
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
If you want a different aggregation, then modify the function applied in the lapply call to use something other than rowSums.
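For instance, here is a sketch using rowMeans, which gives the proportion of years in each decade with an event:
tmp3 <- lapply(seq_along(ind[-1]),
               function(i, inds, data) {
                 rowMeans(data[, names(data) %in% inds[i]:(inds[i+1]-1)])
               }, inds = ind, data = df)
out3 <- do.call(cbind.data.frame, tmp3)
names(out3) <- paste(head(ind, -1), tail(ind, -1) - 1, sep = "-")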
This is another option, using modular arithmetic to aggregate the columns.
# setup, borrowed from #GavinSimpson
set.seed(42)
df <- data.frame(matrix(sample(0:1, 6*25, replace = TRUE), ncol = 25))
names(df) <- 1874 + 0:24
result <- do.call(cbind,
by(t(df), as.numeric(names(df)) %/% 10 * 10, colSums))
# add -xxx9 to column names, for each decade
dimnames(result)[[2]] <- paste(colnames(result), as.numeric(colnames(result)) + 9, sep='-')
# 1870-1879 1880-1889 1890-1899
# V1 4 5 6
# V2 4 6 6
# V3 2 5 5
# V4 5 5 7
# V5 3 3 7
# V6 5 5 4
If you wanted to aggregate with something other than sum, replace the call to colSums with something like function(cols) sapply(cols, f), where f is the aggregating function, e.g., max.
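For example, a sketch with max as f; for binary data this flags whether any event occurred in the decade:
result_max <- do.call(cbind,
                      by(t(df), as.numeric(names(df)) %/% 10 * 10,
                         function(cols) sapply(cols, max)))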