Divide value depending on the condition of the column in R

I would like to divide the value of a column according to a condition.
Something like the following:
Data$ColumnA [Data$ColumnA>50 && Data$ColumnB> 0] <- Data$ColumnA / 25
The problem is that Data$ColumnA / 25 on the right-hand side is not subset by the same condition, so it loses the "index" and the assignment recycles values starting from the first element of the column.
Thank you

I prefer using a data.table instead of a data.frame not only for performance reasons but also because the syntax is more compact:
library(data.table)
Data <- data.frame(ColumnA = seq(0, 175, by = 25),
                   ColumnB = c(0, 1))
Data
# ColumnA ColumnB
# 1 0 0
# 2 25 1
# 3 50 0
# 4 75 1
# 5 100 0
# 6 125 1
# 7 150 0
# 8 175 1
setDT(Data) # "convert" data.frame into a data.table
Data[ColumnA > 50 & ColumnB > 0, ColumnA := ColumnA / 25]
Data
# ColumnA ColumnB
# 1: 0 0
# 2: 25 1
# 3: 50 0
# 4: 3 1
# 5: 100 0
# 6: 5 1
# 7: 150 0
# 8: 7 1


Error in data.frame, unused argument

I have this dataframe :
> head(merged.tables)
Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday StoreType
1 1 5 2015-07-31 5263 555 1 1 0 1 c
2 1 6 2013-01-12 4952 646 1 0 0 0 c
3 1 5 2014-01-03 4190 552 1 0 0 1 c
4 1 3 2014-12-03 6454 695 1 1 0 0 c
5 1 3 2013-11-13 3310 464 1 0 0 0 c
6 1 7 2013-10-27 0 0 0 0 0 0 c
Assortment CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2
1 a 1270 9 2008 0
2 a 1270 9 2008 0
3 a 1270 9 2008 0
4 a 1270 9 2008 0
5 a 1270 9 2008 0
6 a 1270 9 2008 0
Promo2SinceWeek Promo2SinceYear PromoInterval
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
Then I want to extract a data frame showing the average of the Sales column when Open equals 1, grouped by StoreType.
I used this command because I think it's the fastest:
merged.tables[StateHoliday==1,mean(na.omit(Sales)),by=StoreType]
But I got this error:
Error in [.data.frame(merged.tables, StateHoliday == 0,
mean(na.omit(Sales)), : unused argument (by = StoreType)
I searched but didn't find an answer to this error. Thanks for your help!
I had the same error.
The problem was solved when I realised my data was not in data.table format: a plain data.frame's [ method has no by argument, which is why it is reported as unused.
example:
copy <- data.table(data)
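Applied to the question's data, a minimal sketch (assuming the columns shown in head(merged.tables) above):
library(data.table)
setDT(merged.tables) # convert the data.frame to a data.table in place
merged.tables[Open == 1, .(AverageSales = mean(Sales, na.rm = TRUE)), by = StoreType]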
Overview
There are lots of ways of applying a function to a group of values in your data frame. I present two:
Using the dplyr package to arrange your data in a way that answers your question.
Using tapply(), which performs a function over a group of values.
Reproducible Example
For each store type, I want the average sales for those stores whose Open value is equal to 1.
I present the dplyr method first, followed by tapply.
Note: The following data frame only takes a few columns from those posted in the OP. (Sales is drawn with runif() without a seed, so the values below will differ between runs.)
# install necessary package
install.packages( pkgs = "dplyr" )
# load necessary package
library( dplyr )
# create data frame
merged.tables <-
  data.frame(
    Store = c( 1, 1, 1, 2, 2, 2 )
    , StoreType = rep( x = c( "s", "m", "l" ), times = 2 )
    , Sales = round( x = runif( n = 6, min = 3000, max = 6000 ), digits = 0 )
    , Open = c( 1, 1, 0, 0, 1, 1 )
    , stringsAsFactors = FALSE
  )
# view the data
merged.tables
# Store StoreType Sales Open
# 1 1 s 4608 1
# 2 1 m 4017 1
# 3 1 l 4210 0
# 4 2 s 4833 0
# 5 2 m 3818 1
# 6 2 l 3090 1
# dplyr method
merged.tables %>%
  group_by( StoreType ) %>%
  filter( Open == 1 ) %>%
  summarise( AverageSales = mean( x = Sales, na.rm = TRUE ) )
# A tibble: 3 x 2
# StoreType AverageSales
# <chr> <dbl>
# 1 l 3090
# 2 m 3918
# 3 s 4608
# tapply method
# create the condition
# that 'Open' must be equal to one
Open.equals.one <- which( merged.tables$Open == 1 )
# apply the condition to
# both X and INDEX
tapply( X = merged.tables$Sales[ Open.equals.one ]
        , INDEX = merged.tables$StoreType[ Open.equals.one ]
        , FUN = mean
        , na.rm = TRUE # just in case your data does have NA values in the `Sales` column, this removes them from the calculation
)
# l m s
# 3090.0 3917.5 4608.0
# end of script #
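For comparison, a base R equivalent with aggregate() (not part of the original answer, just a sketch):
# filter to open stores, then average Sales by StoreType
aggregate(Sales ~ StoreType, data = merged.tables[merged.tables$Open == 1, ], FUN = mean)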
Resources
Should you need more conditions later on, I encourage you to check out other relevant SO posts, such as How to combine multiple conditions to subset a data-frame using “OR”? and Why is [ better than subset?.

R subsetting with dplyr

Troubles with R subsetting and arranging datasets.
I have a dataset that looks like this:
Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1
I would need to create a new dataset for each skill, with a row for each student and a column for each observation (Correct). Like this:
Skill: 10
Student Obs1 Obs2 Obs3
64525 1 1 NA
70363 0 1 1
Skill: 15
Student Obs1 Obs2
64525 0 NA
70363 0 1
Notice that the number of columns of each skill dataset can vary, depending on the number of observations for each student. Notice also that a value can be NA if there is no such observation in the dataset (a student can try the skill a different number of times than other students).
I think this might be a job for the dplyr package, but I am not sure.
I really appreciate the help of the community!!
Here's a possible data.table implementation
library(data.table) # V 1.10.0
res <- setDT(df)[, .(.(dcast(.SD, Student ~ rowid(Student)))), by = Skill]
Which will result in a data.table of data.tables
res
# Skill V1
# 1: 10 <data.table>
# 2: 15 <data.table>
Which could be segmented by the Skill column
res[Skill == 10, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
Or in order to see the whole column
res[, V1]
# [[1]]
# Student 1 2 3
# 1: 64525 1 1 NA
# 2: 70363 0 1 1
#
# [[2]]
# Student 1 2
# 1: 64525 0 NA
# 2: 70363 0 1
This will get the job done.
xy <- read.table(text = "Student Skill Correct
64525 10 1
64525 10 1
70363 10 0
70363 10 1
70363 10 1
64525 15 0
70363 15 0
70363 15 1", header = TRUE)
# first split by skill and work on each element
sapply(split(xy, xy$Skill), FUN = function(x) {
  # extract column Correct
  out <- sapply(split(x, x$Student), FUN = "[[", "Correct")
  # pad shorter vectors with NAs at the end
  out <- mapply(out, max(lengths(out)), FUN = function(m, a) {
    c(m, rep(NA, times = (a - length(m))))
  }, SIMPLIFY = FALSE)
  do.call(rbind, out)
})
$`10`
[,1] [,2] [,3]
64525 1 1 NA
70363 0 1 1
$`15`
[,1] [,2]
64525 0 NA
70363 0 1
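Since the question asks about dplyr, here is a minimal tidyverse sketch (it assumes tidyr >= 1.0.0 for pivot_wider()); it numbers each student's observations within a skill and then widens each piece:
library(dplyr)
library(tidyr)
lapply(split(xy, xy$Skill), function(d) {
  d %>%
    group_by(Student) %>%
    mutate(obs = paste0("Obs", row_number())) %>% # Obs1, Obs2, ... per student
    ungroup() %>%
    pivot_wider(id_cols = Student, names_from = obs, values_from = Correct)
})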

Group function with basic calculation

I have a data.table with two columns (date and status), and now I want to insert new columns based on the original table.
data rules:
the Status column contains only "0" and "1"
the Date column always increases by seconds :)
new variables:
group: numbers each group or cycle of the status, following the order (0, 1): a cycle starts with status '0', and when the status becomes '0' again, the cycle is completed.
cycle_time: calculate the cycle time for each group
group_0: calculate the time for the status 0 within a specific group
group_1: calculate the time for the status 1 within a specific group
For example, a simple input:
the code to generate the data:
dd <- data.table(date = c("2015-07-01 00:00:12", "2015-07-01 00:00:13","2015-07-01 00:00:14","2015-07-01 00:00:15", "2015-07-01 00:00:16", "2015-07-01 00:00:17","2015-07-01 00:00:18", "2015-07-01 00:00:19", "2015-07-01 00:00:20","2015-07-01 00:00:21", "2015-07-01 00:00:22", "2015-07-01 00:00:23","2015-07-01 00:00:24", "2015-07-01 00:00:25"), status = c(0,0,0,0,1,1,1,0,0,1,1,1,1,0))
the expected output would contain the four new columns described above.
I have actually solved this with some basic methods;
the main idea is: if the current status is 0 and the next status is 1, then mark it as one group.
The idea works, but the calculation time is too long, since there are so many loops.
I suppose there could be an easier solution for this case.
So a transition from 1 to 0 marks the boundary of a group. You can use cumsum and diff to get this working. For the x example in the answer of @zx8754:
data.frame(x, group_id = c(1, cumsum(diff(x) == -1) + 1))
x group_id
1 0 1
2 0 1
3 0 1
4 1 1
5 1 1
6 0 2
7 0 2
8 1 2
9 0 3
For a more realistically sized example:
res = data.frame(status = sample(c(0,1), 10e7, replace = TRUE))
system.time(res$group_id <- c(1, cumsum(diff(res$status) == -1) + 1))
user system elapsed
2.770 1.680 4.449
> head(res, 20)
status group_id
1 0 1
2 0 1
3 1 1
4 0 2
5 0 2
6 0 2
7 1 2
8 1 2
9 0 3
10 1 3
11 1 3
12 0 4
13 1 4
14 0 5
15 0 5
16 1 5
17 0 6
18 0 6
19 1 6
20 0 7
About 4.5 seconds for 100 million records (10e7) is quite fast (although that depends on your definition of fast :)).
Benchmarking
set.seed(1)
res = data.frame(status = sample(c(0,1), 10e4, replace = TRUE))
microbenchmark::microbenchmark(
  rleid = {
    gr <- data.table::rleid(res$status)
    x1 <- as.numeric(as.factor(ifelse(gr %% 2 == 0, gr - 1, gr)))
    # removing "as.numeric(as.factor" helps, but still not as fast as cumsum
    # x1 <- ifelse(gr %% 2 == 0, gr - 1, gr)
  },
  cumsum = { x2 <- c(1, cumsum(diff(res$status) == -1) + 1) }
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# rleid 118.161287 120.149619 122.673747 121.736122 123.271881 168.88777 100 b
# cumsum 1.511811 1.559563 2.221273 1.826404 2.475402 6.88169 100 a
identical(x1, x2)
# [1] TRUE
Try this:
#dummy data
x <- c(0,0,0,1,1,0,0,1,0)
#get group id using rleid from data.table
gr <- data.table::rleid(x)
#merge separated 0,1 groups
gr <- ifelse(gr %% 2 == 0, gr - 1, gr)
#result
cbind(x, gr)
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 3
# [7,] 0 3
# [8,] 1 3
# [9,] 0 5
#if we need to have group names sequential then
cbind(x, gr = as.numeric(as.factor(gr)))
# x gr
# [1,] 0 1
# [2,] 0 1
# [3,] 0 1
# [4,] 1 1
# [5,] 1 1
# [6,] 0 2
# [7,] 0 2
# [8,] 1 2
# [9,] 0 3
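Putting the pieces together for the original dd data, a sketch deriving all four requested columns; it assumes the question's one-second sampling, so counting rows within a group approximates seconds:
library(data.table)
dd[, date := as.POSIXct(date)]
dd[, group := c(1, cumsum(diff(status) == -1) + 1)] # group id via the cumsum trick
dd[, cycle_time := as.numeric(difftime(max(date), min(date), units = "secs")), by = group] # duration of each cycle
dd[, group_0 := sum(status == 0), by = group] # rows (~seconds) spent in status 0
dd[, group_1 := sum(status == 1), by = group] # rows (~seconds) spent in status 1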

Splitting one Column into Multiple in R and Giving a logical value if true

I am trying to split one column in a data frame into multiple columns, which hold the values from the original column as new column names. Then, if there was an occurrence of that respective value in the original column, give it a 1 in the new column, or 0 if there is no match. I realize this is not the best way to explain, so for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and I would like to expand it to wide format with 1s and 0s (or TRUE and FALSE), something such as:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr with the separate function, and reshape2 with the cast function, but I seem to be getting hung up on giving logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
         type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location) # FUN.VALUE needs one element per row of 'df'
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
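Since the question mentions trying tidyr, a minimal sketch with separate_rows() and pivot_wider() (both from newer tidyr versions than the original post likely had available):
library(dplyr)
library(tidyr)
df %>%
  mutate(Location = as.character(Location)) %>% # in case Location was read as a factor
  separate_rows(Location, sep = "/") %>% # one row per subject/value pair
  mutate(value = 1L) %>% # mark each occurrence
  pivot_wider(names_from = Location, values_from = value, values_fill = 0L)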

Reshaping wide dataset in interval format

I am working on a "wide" dataset, and now I would like to use a specific package (msSurv, for non-parametric multistate models) which requires data in interval form.
My current dataset is characterized by one row for each individual:
dat <- read.table(text = "
id cohort t0 s1 t1 s2 t2 s3 t3
1 2 0 1 50 2 70 4 100
2 1 0 2 15 3 100 0 0
", header=TRUE)
where cohort is a time-fixed covariate, and s1-s3 correspond to the values that a time-varying covariate s = 1,2,3,4 takes over time (they are the distinct states visited by the individual over time). Calendar time is defined by t1-t3, and ranges from 0 to 100 for each individual.
So, for instance, individual 1 stays in state = 1 up to calendar time = 50, then he stays in state = 2 up to time = 70, and finally he stays in state = 4 up to time 100.
What I would like to obtain is a dataset in "interval" form, that is:
id cohort t.start t.stop start.s end.s
1 2 0 50 1 2
1 2 50 70 2 4
1 2 70 100 4 4
2 1 0 15 2 3
2 1 15 100 3 3
I hope the example is sufficiently clear, otherwise please let me know and I will try to further clarify.
How would you automate this reshaping? Consider that I have a relatively large number of (simulated) individuals, around 1 million.
Thank you very much for any help.
I think I understand. Does this work?
require(data.table)
dt <- data.table(dat, key = c("id", "cohort"))
dt.out <- dt[, list(t.start = c(t0, t1, t2), t.stop = c(t1, t2, t3),
                    start.s = c(s1, s2, s3), end.s = c(s2, s3, s3)),
             by = c("id", "cohort")]
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 0
# 6: 2 1 100 0 0 0
If the output you show is indeed right and is what you require, then you can obtain it with two more lines (probably not the best way, but it should nevertheless be fast):
# remove rows where start.s and end.s are both 0
dt.out <- dt.out[, .SD[start.s > 0 | end.s > 0], by=1:nrow(dt.out)]
# replace end.s values with corresponding start.s values where end.s == 0
# it can be easily done with max(start.s, end.s) because end.s >= start.s ALWAYS
dt.out <- dt.out[, end.s := max(start.s, end.s), by=1:nrow(dt.out)]
dt.out[, nrow:=NULL]
> dt.out
# id cohort t.start t.stop start.s end.s
# 1: 1 2 0 50 1 2
# 2: 1 2 50 70 2 4
# 3: 1 2 70 100 4 4
# 4: 2 1 0 15 2 3
# 5: 2 1 15 100 3 3
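As an aside, the same two clean-up steps can be written without the row-wise by=1:nrow(dt.out) grouping, using vectorized operations (pmax() is the element-wise maximum):
# drop the all-zero rows, then fix end.s in one vectorized step
dt.out <- dt.out[start.s > 0 | end.s > 0]
dt.out[, end.s := pmax(start.s, end.s)]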
