Say I have a dataset like this:
df <- data.frame(id = c(1, 1, 1, 2, 2),
classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
hours = c(3, 2, 4, 1, 2))
I.e.,
id classname hours
1 1 Welding 3
2 1 Welding 2
3 1 Auto 4
4 2 HVAC 1
5 2 Plumbing 2
I'm trying to figure out how to summarize the data in a way that gives me, for each id, a list of the classes they took as well as how many hours of each class. I would want these to be in a list so I can keep it one row per id. So, I would want it to return:
id class.list class.hours
1 1 Welding, Auto 5,4
2 2 HVAC, Plumbing 1,2
I was able to figure out how to get it to return the class.list.
library(dplyr)
classes <- df %>%
group_by(id) %>%
summarise(class.list = list(unique(as.character(classname))))
This gives me:
id class.list
1 1 Welding, Auto
2 2 HVAC, Plumbing
But I'm not sure how I could get it to sum the number of hours for each of those classes (class.hours).
Thanks for your help!
In base R, this can be accomplished with two calls to aggregate. The inner call sums the hours and the outer call "concatenates" the hours and the class names. In the outer call of aggregate, cbind is used to include both the hours and the class names in the output, and also to provide the desired variable names.
# convert class name to character variable
df$classname <- as.character(df$classname)
# aggregate
aggregate(cbind("class.hours"=hours, "class.list"=classname)~id,
data=aggregate(hours~id+classname, data=df, FUN=sum), toString)
id class.hours class.list
1 1 4, 5 Auto, Welding
2 2 1, 2 HVAC, Plumbing
In data.table, roughly the same output is produced with a chained statement.
setDT(df)[, .(hours=sum(hours)), by=.(id, classname)][, lapply(.SD, toString), by=id]
id classname hours
1: 1 Welding, Auto 5, 4
2: 2 HVAC, Plumbing 1, 2
The variable names could then be set using the data.table setnames function.
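As a minimal sketch of that renaming step (the object name res is just a placeholder I picked for illustration, and the data is the example from the question):

```r
library(data.table)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
                 hours = c(3, 2, 4, 1, 2))

# sum hours per id and class, then collapse each column to one string per id
res <- setDT(df)[, .(hours = sum(hours)), by = .(id, classname)][
  , lapply(.SD, toString), by = id]

# rename by reference to match the names asked for in the question
setnames(res, old = c("classname", "hours"), new = c("class.list", "class.hours"))
res
```

setnames changes the names in place, so no reassignment is needed.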
This is how you could do it using dplyr:
classes <- df %>%
group_by(id, classname) %>%
summarise(hours = sum(hours)) %>%
summarise(class.list = list(unique(as.character(classname))),
class.hours = list(hours))
The first summarise peels off the last grouping variable (classname), so the second summarise works per id. It is not necessary to use unique() anymore, but I kept it in there to match the part you already had.
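If you would rather have plain comma-separated strings than list columns, a variant along these lines might work. This is only a sketch: it assumes dplyr 1.0.0 or later (for the .groups argument) and swaps list() for toString().

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 classname = c("Welding", "Welding", "Auto", "HVAC", "Plumbing"),
                 hours = c(3, 2, 4, 1, 2))

classes <- df %>%
  group_by(id, classname) %>%
  # sum hours per id and class; keep the grouping by id for the next step
  summarise(hours = sum(hours), .groups = "drop_last") %>%
  # collapse classes and their summed hours into one string each per id
  summarise(class.list = toString(as.character(classname)),
            class.hours = toString(hours))
classes
```

Note that group_by() sorts the groups, so the classes come out alphabetically within each id (Auto before Welding) rather than in order of appearance.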
Assuming the following data set:
df <- data.frame(...1 = c(1, 2, 3),
...2 = c(1, 2, 3),
n_column = c(1, 1, 2))
I now want to rename all vars that start with "...". My real data sets could have different numbers of "..." vars. The information about how many such vars I have is in the n_column column, more precisely, it is the maximum of that column.
So I tried:
df %>%
rename_with(.cols = starts_with("..."),
.fn = paste0("new_name", 1:max(n_column)))
which gives an error:
# Error in paste0("new_name", 1:max(n_column)) :
# object 'n_column' not found
So I guess the problem is that the paste0 function does not look for the column I provide within the current data set. I'm not sure, though, how I could make it do so. Any ideas?
I know I could bypass the whole thing by just creating an external scalar that contains the max. of n_column, but ideally I'd like to do everything in one pipeline.
You don't need information from n_column, .cols will pass only those columns that satisfy the condition (starts_with("...")).
library(dplyr)
df %>% rename_with(~paste0("new_name", seq_along(.)), starts_with("..."))
# new_name1 new_name2 n_column
#1 1 1 1
#2 2 2 1
#3 3 3 2
This is also safer than using max(n_column): if the data in n_column gets corrupted, or the number of columns starting with ... changes, this will still work.
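To illustrate the point about the number of columns changing, here is a small sketch with a made-up data set (df3 is hypothetical and not from the question) that has three "..." columns. The function passed to rename_with receives the selected names as a character vector, so seq_along counts however many there are.

```r
library(dplyr)

# hypothetical data with three "..." columns instead of two
df3 <- setNames(data.frame(1:3, 1:3, 1:3, c(1, 1, 2)),
                c("...1", "...2", "...3", "n_column"))

renamed <- df3 %>%
  rename_with(~ paste0("new_name", seq_along(.x)), starts_with("..."))

names(renamed)
```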
A way to refer to column values in rename_with would be to use an anonymous function, so that you can use .$n_column.
df %>%
rename_with(function(x) paste0("new_name", 1:max(.$n_column)),
starts_with("..."))
I am assuming this is part of a longer chain, so you don't want to use max(df$n_column).
We can use str_c
library(dplyr)
library(stringr)
df %>%
rename_with(~str_c("new_name", seq_along(.)), starts_with("..."))
Or using base R
i1 <- startsWith(names(df), "...")
names(df)[i1] <- sub("...", "new_name", names(df)[i1], fixed = TRUE)
df
new_name1 new_name2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
A completely different approach would be
df %>% janitor::clean_names()
x1 x2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
I have a four-column dataframe with date, var1_share, var2_share, and total. I want to multiply each of the share metrics by the total to create new variables containing the raw values for both var1 and var2. See the code below (a bit verbose) to construct the dataframe that contains the share variables:
df<- data.frame(dt= seq.Date(from = as.Date('2019-01-01'),
to= as.Date('2019-01-10'), by= 'day'),
var1= round(runif(10, 3, 12), digits = 1),
var2= round(runif(10, 3, 12), digits = 1))
df$total<- apply(df[2:3], 1, sum)
ratio<- lapply(df[-1], function(x) x/df$total)
ratio<- data.frame(ratio)
df<- cbind.data.frame(df[1],ratio)
colnames(df)<- c('date', 'var1_share', 'var2_share', 'total')
df
The final dataframe should look like this:
> df
date var1_share var2_share total
1 2019-01-01 0.5862069 0.4137931 1
2 2019-01-02 0.6461538 0.3538462 1
3 2019-01-03 0.3591549 0.6408451 1
4 2019-01-04 0.7581699 0.2418301 1
5 2019-01-05 0.3989071 0.6010929 1
6 2019-01-06 0.5132743 0.4867257 1
7 2019-01-07 0.5230769 0.4769231 1
8 2019-01-08 0.4969325 0.5030675 1
9 2019-01-09 0.5034965 0.4965035 1
10 2019-01-10 0.3254438 0.6745562 1
I have nested an if statement within a for loop, hoping to return a new dataframe called share. I want it to skip date when multiplying the share variables, so I've incorporated is.numeric to ignore date. However, when I run it, it returns only the date, not the desired result of date, the raw value of each variable (as separate columns), and the total column. See the code below:
for (i in df){
share<- if(is.numeric(i)){
i * df$total
} else i
share<- data.frame(share)
return(share)
}
share
> share
share
1 2019-01-01
2 2019-01-02
3 2019-01-03
...
How do I adjust this function so that share returns a dataframe containing date, variable 1 and 2 raw variables, and total?
Note that multiplying a data.frame by a vector with * applies the multiplication column-wise: the vector is multiplied, element by element, into column 1, column 2, column 3 and so on. So you can do this without any 'apply' by simply using * on the total column and the columns you want to multiply.
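A minimal sketch of that direct multiplication. The three-row data set and the _raw column names are made up for illustration; they are not the asker's actual values.

```r
# hypothetical data in the same shape as the question's dataframe
df <- data.frame(date = as.Date("2019-01-01") + 0:2,
                 var1_share = c(0.5, 0.25, 0.4),
                 var2_share = c(0.5, 0.75, 0.6),
                 total = c(10, 8, 5))

share_cols <- c("var1_share", "var2_share")

# the total vector is multiplied into each selected column, row by row
raw <- df[share_cols] * df$total
names(raw) <- sub("_share$", "_raw", share_cols)

share <- cbind(df, raw)
share
```

Selecting the share columns by name sidesteps the date column entirely, so no is.numeric check is needed in this version.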
Or you could make a simple function to achieve the result. Below is such an example.
Multi_share <- function(x, total_col = "total"){
  # flag the numeric columns of x (this drops date)
  num_cols <- sapply(x, is.numeric)
  if(is.character(total_col))
    # leave out the total column itself, then multiply the rest by it
    return(x[, num_cols & names(x) != total_col, drop = FALSE] * x[, total_col])
  if(is.numeric(total_col) && NROW(total_col) == NROW(x))
    return(x[, num_cols, drop = FALSE] * total_col)
  stop("Total unrecognized. Must either be a 1 dimensional vector, a column matrix or a character specifying the total column in x.")
}
cbind(df, Multi_share(df))
One could change the names of the columns as well.
Maybe you want something like this?
share <-df[, sapply(df,is.numeric)]
share <-mapply(function(x) x*share$total, share[,names(share)!="total"])
The first line will give you back only the numeric columns (so date is filtered out).
The second one will multiply each column (except total) by total.
I am trying to dynamically populate a variable, which requires me to reference rows.
Given are 3 columns: time, group, and val.
I want to populate rows 3, 4, 7, and 8's val which are initially NA.
Here is my toy data:
df <- expand.grid(time = rep(c(1,2,3,4)), group = rep(c("A", "B")))
df$val <- c(50,40,NA,NA)
df
> df
time group val
1 1 A 50
2 2 A 40
3 3 A NA
4 4 A NA
5 1 B 50
6 2 B 40
7 3 B NA
8 4 B NA
I have two grouping variables (time and group) and, as example, I need to populate row 3 above by this set of rules:
1. Order by group and time (in ascending order)
2. For time = 3, the value of **val** is the arithmetic average of two previous rows;
(2a). i.e. the average of time 2 and time 1 values, so it will be 1/2 * (40+50) = 45.
3. For time = 4, the value of **val** is the arithmetic average of two previous rows;
(3a). i.e. the average of time 3 and time 2 values, so it will be 1/2 * (45+40) = 42.5.
And so on, going down to the last row of each group as defined by time and group variables.
I want to avoid using loops and referencing row index to achieve this, and prefer to stay within dplyr, as the rest of my scripts are in the dplyr ecosystem. Is there an efficient way to achieve this?
This isn't the cleanest solution, but it gets the job done:
df2 = df %>%
arrange(group, time) %>%
mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val)) %>%
mutate(val = if_else(is.na(val), (lag(val, n=1) + lag(val, n=2))/2.0, val))
Again, it's not pretty, but it seems to work. Hope that helps give you something to start from.
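If there can be more than two NA rows per group, repeating the mutate line once per missing row gets unwieldy. Below is a rough sketch of one way to generalize it. The helper fill_mean2 is a name I made up, and since each value depends on the one just filled in, it does hide a short loop inside the helper, although the pipeline itself stays in dplyr.

```r
library(dplyr)

# hypothetical helper: replace each NA with the mean of the two preceding values,
# working top to bottom so freshly filled values feed into later ones
fill_mean2 <- function(v) {
  for (i in seq_along(v)) {
    if (i > 2 && is.na(v[i])) v[i] <- (v[i - 1] + v[i - 2]) / 2
  }
  v
}

df <- expand.grid(time = 1:4, group = c("A", "B"))
df$val <- c(50, 40, NA, NA)

df2 <- df %>%
  arrange(group, time) %>%
  group_by(group) %>%          # so values never leak across groups
  mutate(val = fill_mean2(val)) %>%
  ungroup()
df2
```

This assumes the first two rows of every group are not NA, as in the toy data.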
So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.
Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, only one row should be kept for each combination of group identifiers that occurs more than once (here that applies only to (id,day)=(1,1)), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web, which properly does that if I first produce a single group identifier :
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The package dplyr makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
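Another base R variant, sketched here as an alternative to the rank approach: sort first, then drop every row whose (id, day) pair has already appeared. The names sorted and first_rows are just placeholders.

```r
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)

# sort by the grouping variables, then by value within each group
sorted <- mydata[order(mydata$id, mydata$day, mydata$value), ]

# keep only the first row of each (id, day) combination
first_rows <- sorted[!duplicated(sorted[c("id", "day")]), ]
first_rows
```

This is fairly close in spirit to the Stata one-liner: order() plays the role of bys ... (value) and !duplicated() the role of keep if _n == 1.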
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() at the end, dplyr's grouping structure will still be present and might interfere with some of your subsequent functions.
I am looking for patterns for manipulating data.table objects whose structure resembles that of dataframes created with melt from the reshape2 package. I am dealing with data tables with millions of rows. Performance is critical.
The generalized form of the question is whether there is a way to perform grouping based on a subset of values in a column and have the result of the grouping operation create one or more new columns.
A specific form of the question could be how to use data.table to accomplish the equivalent of what dcast does in the following:
input <- data.table(
id=c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
variable=c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
value=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
dcast(input,
id ~ variable, sum,
subset=.(variable %in% c('x', 'y')))
the output of which is
id x y
1 1 1 5
2 2 4 11
3 3 15 9
Quick untested answer: seems like you're looking for by-without-by, a.k.a. grouping-by-i :
setkey(input,variable)
input[c("x","y"),sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above is the same result but much faster than :
input[,sum(value),keyby=variable][c("x","y")]
The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
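A note for readers on current data.table versions: as far as I know, since v1.9.4 j is only evaluated once per row of i when by = .EACHI is given. Without it, a call like input[c("x","y"), sum(value)] returns a single overall sum. A minimal sketch of the per-group form, reusing the question's data (the column name total is my own choice):

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

setkey(input, variable)

# evaluate j once for each row of i ("x", then "y"), skipping "other" entirely
res <- input[c("x", "y"), .(total = sum(value)), by = .EACHI]
res
```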
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :
DT = setkey(input[ , c("variable","value")], variable)
DT[c("x","y"),sum(value)]
In future when secondary keys are implemented that would be easier :
set2key(input,variable) # add a secondary key
input[c("x","y"),sum(value),key=2] # syntax speculative
To group by id as well :
setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']
and including id in the key might be worth setkey's cost depending on your data :
setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']
If you combine a by-without-by with by, as above, then the by-without-by then operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable, again, in the by as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :
input[c("x","y"),sum(value),by=id]
> setkey(input, "id")
> input[ , list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 34
> input[ variable %in% c("x", "y"), list(sum(value)), by=id]
id V1
1: 1 6
2: 2 15
3: 3 24
The last one:
> input[ variable %in% c("x", "y"), list(sum(value)), by=list(id, variable)]
id variable V1
1: 1 x 1
2: 1 y 5
3: 2 x 4
4: 2 y 11
5: 3 x 15
6: 3 y 9
I'm not sure if this is the best way, but you can try:
input[, list(x = sum(value[variable == "x"]),
y = sum(value[variable == "y"])), by = "id"]
# id x y
# 1: 1 1 5
# 2: 2 4 11
# 3: 3 15 9
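For completeness, recent data.table versions also ship their own dcast method for data.tables, so the reshape2 call from the question can be approximated without leaving data.table. This is a sketch under that assumption; the subset is done in i beforehand rather than through a subset argument.

```r
library(data.table)

input <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  variable = c('x', 'y', 'y', 'x', 'y', 'y', 'x', 'x', 'y', 'other'),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# keep only the groups of interest, then cast to wide while summing duplicates
wide <- dcast(input[variable %in% c("x", "y")],
              id ~ variable,
              value.var = "value",
              fun.aggregate = sum)
wide
```

Unlike the explicit sum(value[variable == "x"]) version, this does not need one line per value of variable, which may matter if there are many of them.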