Conditional sum grouped by date in R

The data I am working with is the number of people in a group. The columns in the dataset I'm concerned with are the date (column 1) and the number of people in a group (column 3 where there is a separate row for each group on a given day). I am looking for an output spreadsheet that gives me a column for a date, one for the sum of all the groups with one person in it on a day, and a column for the sum of all the people who are in groups larger than one on a day.
For example if this was my dataset:
Date People
10/18 1
10/18 3
10/18 1
10/18 8
10/20 1
10/20 4
10/20 2
My desired output would be:
Date p=1 p>1
10/18 2 11
10/20 1 6
My data frame is "DF" and a csv with the different dates is "times". I tried to use a for loop but the output was just zeros.
Here is what I tried:
ntimes = length(times$UniTimes)
for(i in 1:ntimes)
{
s<- sum(DF[which (DF[,3] > 1 & DF[,1]==i),3])
t<- sum(DF[which (DF[,3] < 2 & DF[,1]== i),3])
}
ndf<-data.frame(times,s,t)
write.csv(ndf,'groups_c.csv')
Thank you for your time and help!

You can use aggregate (DF is the data frame from the question):
aggregate(People ~ Date, DF, function(x) c("p=1" = sum(x[x==1]),
                                           "p>1" = sum(x[x>1])))
# Date People.p=1 People.p>1
#1 10/18 2 11
#2 10/20 1 6
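For a version you can run end to end, here is a minimal sketch on the sample data from the question. The names eq1, gt1, res and flat are only illustrative. Note that aggregate() stores the two sums in a single matrix column, so do.call(data.frame, ...) is used here to spread them into ordinary columns before something like write.csv().

```r
# Sample data from the question (column names Date and People are assumed)
DF <- data.frame(
  Date   = c("10/18", "10/18", "10/18", "10/18", "10/20", "10/20", "10/20"),
  People = c(1, 3, 1, 8, 1, 4, 2)
)

# One row per date: people in groups of one, and people in larger groups
res <- aggregate(People ~ Date, DF,
                 function(p) c(eq1 = sum(p == 1), gt1 = sum(p[p > 1])))

# res$People is a two-column matrix; flatten it into regular columns
flat <- do.call(data.frame, res)
print(flat)
```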

This should work, but without data to reproduce it's difficult to say:
library(dplyr)
DF %>%
  group_by(Date) %>%
  summarise(peq1 = sum(People == 1),
            pgeq1 = sum(People[People > 1]))

An option with data.table
library(data.table)
setDT(DF)[, .(peq1 = sum(People == 1), pgeq1 = sum(People[People >1])), .(Date)]
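As for why the loop in the question gave zeros, one likely reading is that DF[,1]==i compares the dates with the loop counter (1, 2, ...), so no row ever matches, and s and t are overwritten on each pass instead of being stored per date. A minimal corrected sketch of the loop follows; it assumes named columns Date and People rather than positions 1 and 3, and the names dates, ones and more are made up for illustration.

```r
# Sample data from the question (column names are assumed)
DF <- data.frame(
  Date   = c("10/18", "10/18", "10/18", "10/18", "10/20", "10/20", "10/20"),
  People = c(1, 3, 1, 8, 1, 4, 2)
)

dates <- unique(DF$Date)
ones  <- numeric(length(dates))  # people in groups of exactly one
more  <- numeric(length(dates))  # people in groups larger than one

for (i in seq_along(dates)) {
  on_day  <- DF$Date == dates[i]  # compare with the date itself, not with i
  ones[i] <- sum(DF$People[on_day & DF$People == 1])
  more[i] <- sum(DF$People[on_day & DF$People > 1])
}

ndf <- data.frame(Date = dates, ones, more)
print(ndf)
```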


Conditionally mutate dataframe based on multiple conditions R

I have seen some similar questions, but none of them was exactly the same as the thing I want to do - which is why I am asking.
I have a dataframe (dummy_data) which contains indices of some observations (obs) regarding given subjects (ID). The dataframe contains only the meaningful data (in other words: the desired conditions are met). The last column in this example data contains the total number of observations (total_obs).
ID <-c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
obs <- c(1,2,3,5,6,3,4,5,7,8,9,12,16,1,2,4,5,6,7,8,2,4,6,7,8,10,13,14,15,3,4,6,7,11)
total_obs <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
dummy_data <- data.frame(ID, obs, total_obs)
I would like to create a new column (interval) with 3 possible values: "start", "center", "end" based on following condition(s):
it should split total number of observations (total_obs) into 3 groups (based on indices - from 1st to the last - which is the value stored in the total_obs column) and assign the interval value according to the indices stored in obs column.
Here is the expected output:
ID <- c(rep("item_001",5),rep("item_452",8),rep("item_0001",7),rep("item_31",9),rep("item_007",5))
segment <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8, 2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_segments <- c(rep(6,5),rep(16,8),rep(9,7),rep(18,9),rep(11,5))
interval <- c("start","start","center","end","end","start","start","start","center","center","center","end","end","start","start","center","center","center","end","end","start","start","start","center","center","center","end","end","end", "start","start","center","center","end")
wanted_data <- data.frame(ID, segment, total_segments, interval)
I would like to use dplyr::ntile() with dplyr::mutate() and dplyr::case_when() but I could not make my code function properly. Any solutions?
You just need dplyr::mutate() and dplyr::case_when().
The following should give you something to work off of.
library(dplyr)
dummy_data %>%
  mutate(interval = case_when(obs < (total_obs/3)   ~ "start",
                              obs < 2*(total_obs/3) ~ "center",
                              TRUE                  ~ "end"))
# TRUE ~ "end" is the 'else' case when everything else is false
Which gives slightly different results.
I think more careful thought is needed about where the endpoints of each interval fall, but if you know what you want, a combination of <=, %/%, and ceiling() should give you the result you desire.
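As one possible choice of endpoints, here is a sketch in base R that happens to reproduce the expected interval column from the question. It assumes the cut points are ceiling(total_obs/3) and ceiling(2*total_obs/3) with inclusive comparisons; first_cut and second_cut are illustrative names, and the same two conditions could be dropped into case_when() instead of ifelse().

```r
# Data from the question
ID <- c(rep("item_001", 5), rep("item_452", 8), rep("item_0001", 7),
        rep("item_31", 9), rep("item_007", 5))
obs <- c(1,2,3,5,6, 3,4,5,7,8,9,12,16, 1,2,4,5,6,7,8,
         2,4,6,7,8,10,13,14,15, 3,4,6,7,11)
total_obs <- c(rep(6, 5), rep(16, 8), rep(9, 7), rep(18, 9), rep(11, 5))
dummy_data <- data.frame(ID, obs, total_obs)

# Assumed endpoints: the last index of the first and of the second third
first_cut  <- ceiling(dummy_data$total_obs / 3)
second_cut <- ceiling(2 * dummy_data$total_obs / 3)

dummy_data$interval <- ifelse(dummy_data$obs <= first_cut, "start",
                       ifelse(dummy_data$obs <= second_cut, "center", "end"))
print(dummy_data)
```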
First, because dummy_data$obs is identical to wanted_data$segment, and dummy_data$total_obs is identical to wanted_data$total_segments, you just need to rename these columns.
For the interval column, here is one approach to creating it:
1. Group the data by the total_segments column.
2. Create a column, say tile, and fill it with the results of ntile(segment, 3).
3. Create the interval column, and use case_when to fill it with category labels based on tile: "start" when tile is 1, "center" when it is 2, and "end" when it is 3.
4. Drop the tile column.
wanted_data <- dummy_data %>%
  rename(segment = obs, total_segments = total_obs) %>%
  group_by(total_segments) %>%
  mutate(tile = ntile(segment, 3)) %>%
  mutate(interval = case_when(tile == 1 ~ "start",
                              tile == 2 ~ "center",
                              tile == 3 ~ "end")) %>%
  select(-tile)
wanted_data
# A tibble: 34 × 4
# Groups: total_segments [5]
ID segment total_segments interval
<chr> <dbl> <dbl> <chr>
1 item_001 1 6 start
2 item_001 2 6 start
3 item_001 3 6 center
4 item_001 5 6 center
5 item_001 6 6 end
6 item_452 3 16 start
7 item_452 4 16 start
8 item_452 5 16 start
9 item_452 7 16 center
10 item_452 8 16 center
# … with 24 more rows
It's slightly different from the wanted_data$interval you showed because, per your comment, the division into categories should work just as dplyr::ntile() does.

r: using `for` and `if` to run function on numeric vars only

I have a four column dataframe with date, var1_share, var2_share, and total. I want to multiply each of the share metrics against the total only to create new variables containing the raw values for both var1 & var2. See below code (a bit verbose) to construct the dataframe that contains the share variables:
df<- data.frame(dt= seq.Date(from = as.Date('2019-01-01'),
to= as.Date('2019-01-10'), by= 'day'),
var1= round(runif(10, 3, 12), digits = 1),
var2= round(runif(10, 3, 12), digits = 1))
df$total<- apply(df[2:3], 1, sum)
ratio<- lapply(df[-1], function(x) x/df$total)
ratio<- data.frame(ratio)
df<- cbind.data.frame(df[1],ratio)
colnames(df)<- c('date', 'var1_share', 'var2_share', 'total')
df
The final dataframe should look like this:
> df
date var1_share var2_share total
1 2019-01-01 0.5862069 0.4137931 1
2 2019-01-02 0.6461538 0.3538462 1
3 2019-01-03 0.3591549 0.6408451 1
4 2019-01-04 0.7581699 0.2418301 1
5 2019-01-05 0.3989071 0.6010929 1
6 2019-01-06 0.5132743 0.4867257 1
7 2019-01-07 0.5230769 0.4769231 1
8 2019-01-08 0.4969325 0.5030675 1
9 2019-01-09 0.5034965 0.4965035 1
10 2019-01-10 0.3254438 0.6745562 1
I have nested an if statement within a for loop, hoping to return a new dataframe called share. I've used is.numeric so that the loop skips date when multiplying the share variables. However, when I run it, it returns only the date, not the desired result of date, the raw value of each variable (as separate columns), and the total column. See below code:
for (i in df){
share<- if(is.numeric(i)){
i * df$total
} else i
share<- data.frame(share)
return(share)
}
share
> share
share
1 2019-01-01
2 2019-01-02
3 2019-01-03
...
How do I adjust this function so that share returns a dataframe containing date, variable 1 and 2 raw variables, and total?
Note that multiplying a data.frame by a vector with * applies the multiplication column-wise: the vector is multiplied into column 1, then column 2, then column 3, and so on. As such, you can do this without any 'apply' call by simply multiplying the columns you want by the total column.
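A minimal sketch of that direct multiplication, on a small made-up data frame with the same column layout as in the question (the values, and the names share_cols, raw and share, are only for illustration):

```r
# Toy data with the same columns as the question's df (values are made up)
df <- data.frame(
  date       = as.Date("2019-01-01") + 0:2,
  var1_share = c(0.5, 0.25, 0.75),
  var2_share = c(0.5, 0.75, 0.25),
  total      = c(10, 20, 8)
)

share_cols <- c("var1_share", "var2_share")

# data.frame * vector multiplies every selected column by total, row by row
raw <- df[share_cols] * df$total
names(raw) <- c("var1", "var2")

# Put date, the raw values and total back together
share <- cbind(df["date"], raw, df["total"])
print(share)
```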
Or you could make a simple function to achieve the result. Below is such an example.
Multi_share <- function(x, total_col = "total"){
  # total given as a column name: multiply every other numeric column by it
  if(is.character(total_col))
    return(x[, sapply(x, is.numeric) & names(x) != total_col] * x[, total_col])
  # total given as a numeric vector with one value per row
  if(is.numeric(total_col) && NROW(total_col) == NROW(x))
    return(x[, sapply(x, is.numeric)] * total_col)
  stop("Total unrecognized. Must either be a 1 dimensional vector, a column matrix or a character specifying the total column in R.")
}
cbind(df, Multi_share(df))
One could change the names of the columns as well.
Maybe you want something like this?
share <-df[, sapply(df,is.numeric)]
share <-mapply(function(x) x*share$total, share[,names(share)!="total"])
The first line will give you back only numeric columns (so date is filtered).
The second one will multiply each column (except total) and total.

Efficiently iterate over rows to dynamically/sequentially populate variable going down rows

I am trying to dynamically populate a variable, which requires me to reference rows.
Given are 3 columns: time, group, and val.
I want to populate rows 3, 4, 7, and 8's val which are initially NA.
Here is my toy data:
df <- expand.grid(time = rep(c(1,2,3,4)), group = rep(c("A", "B")))
df$val <- c(50,40,NA,NA)
df
> df
time group val
1 1 A 50
2 2 A 40
3 3 A NA
4 4 A NA
5 1 B 50
6 2 B 40
7 3 B NA
8 4 B NA
I have two grouping variables (time and group) and, as example, I need to populate row 3 above by this set of rules:
1. Order by group and time (in ascending order)
2. For time = 3, the value of **val** is the arithmetic average of two previous rows;
(2a). i.e. the average of time 2 and time 1 values, so it will be 1/2 * (40+50) = 45.
3. For time = 4, the value of **val** is the arithmetic average of two previous rows;
(3a). i.e. the average of time 3 and time 2 values, so it will be 1/2 * (45+40) = 42.5.
And so on, going down to the last row of each group as defined by time and group variables.
I want to avoid using loops and referencing row index to achieve this, and prefer to stay within dplyr, as the rest of my scripts are in the dplyr ecosystem. Is there an efficient way to achieve this?
This isn't the cleanest solution, but it gets the job done:
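library(dplyr)
df2 = df %>%
  arrange(group, time) %>%
  group_by(group) %>%  # keeps lag() from reaching into the previous group
  mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val)) %>%
  mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val)) %>%
  ungroup()
Each mutate() pass fills only the first remaining NA in a group, so two passes cover the two missing rows here. Again, it's not pretty, but it seems to work. Hope that helps give you something to start from.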
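If a group can have more than two missing rows, repeating mutate() stops scaling. One alternative, sketched here in base R, is a small helper that walks down one group's values and fills each NA from the two values before it. It does use a loop inside the helper, which the question wanted to avoid, but the loop never touches row indices of the full data frame. fill_mean_prev2 is a hypothetical name; the same helper could also be called inside group_by(group) %>% mutate(val = fill_mean_prev2(val)).

```r
# Toy data from the question
df <- expand.grid(time = c(1, 2, 3, 4), group = c("A", "B"))
df$val <- c(50, 40, NA, NA)

# Hypothetical helper: fill each NA with the mean of the two preceding values
fill_mean_prev2 <- function(v) {
  for (i in seq_along(v)) {
    if (is.na(v[i]) && i > 2) v[i] <- mean(v[(i - 2):(i - 1)])
  }
  v
}

# Order by group and time, then apply the helper within each group
df <- df[order(df$group, df$time), ]
df$val <- ave(df$val, df$group, FUN = fill_mean_prev2)
print(df)
```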

Retrieving unique combinations [duplicate]

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.
Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, keep only one row wherever several rows share the same group identifiers (here that is only the group (id,day)=(1,1)), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web, which properly does that if I first produce a single group identifier :
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The package dplyr makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
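Another base R variant, offered as a sketch rather than as part of the answer above: sort by id, day and value, then keep the first row of every (id, day) pair with duplicated(). This is close in spirit to the Stata one-liner; the names sorted and first_rows are only illustrative.

```r
id    <- c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4)
day   <- c(1, 1, 2, 3, 1, 2, 2, 3, 1, 2, 3)
value <- c(12, 10, 15, 20, 40, 30, 22, 24, 11, 11, 12)
mydata <- data.frame(id, day, value)

# Sort so the lowest value comes first within each (id, day) group
sorted <- mydata[order(mydata$id, mydata$day, mydata$value), ]

# duplicated() flags every repeat of an (id, day) pair after its first row
first_rows <- sorted[!duplicated(sorted[c("id", "day")]), ]
print(first_rows)
```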
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() to the end, dplyr's grouping structure will still be present and might mess up some of your subsequent functions.

count frequency of rows based on a column value in R

I understand that this is quite a simple question, but I haven't been able to find an answer to this.
I have a data frame which gives you the id of a person and their hobby. Since a person may have many hobbies, the id field may be repeated in multiple rows, each with a different hobby. I have been trying to print out only those rows which belong to a person with more than one hobby. I was able to get the frequencies using table.
But how do I apply the condition to print only when the frequency is greater than one?
Secondly, is there a better way to find frequencies without using table?
This is my attempt with table without the filter for frequency greater than one
> id=c(1,2,2,3,2,4,3,1)
> hobby = c('play','swim','play','movies','golf','basketball','playstation','gameboy')
> df = data.frame(id, hobby)
> table(df$id)
1 2 3 4
2 3 2 1
Try using data.table; I find it more readable than using table():
library(data.table)
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies',
'golf','basketball','playstation','gameboy')
df = data.frame(id=id, hobby=hobby)
dt = as.data.table(df)
dt[,hobbies:=.N, by=id]
You will get, for your condition:
> dt[hobbies >1,]
id hobby hobbies
1: 1 play 2
2: 2 swim 3
3: 2 play 3
4: 3 movies 2
5: 2 golf 3
6: 3 playstation 2
7: 1 gameboy 2
This example assumes you are trying to filter df
id=c(1,2,2,3,2,4,3,1)
hobby = c('play','swim','play','movies','golf','basketball',
'playstation','gameboy')
df = data.frame(id, hobby)
table(df$id)
Get all those ids that have more than one hobby
tmp <- as.data.frame(table(df$id))
tmp <- tmp[tmp$Freq > 1,]
Using that information - select their IDs in df
df1 <- df[df$id %in% tmp$Var1,]
df1
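On the second part of the question (frequencies without table), one base R possibility is ave(), which returns the per-id count aligned with the original rows so it can be used directly as a filter. This is a sketch; n_hobbies and multi are illustrative names.

```r
id <- c(1, 2, 2, 3, 2, 4, 3, 1)
hobby <- c("play", "swim", "play", "movies", "golf", "basketball",
           "playstation", "gameboy")
df <- data.frame(id, hobby)

# For every row, the number of rows that share its id
n_hobbies <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only the rows of people with more than one hobby
multi <- df[n_hobbies > 1, ]
print(multi)
```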
