Repeat a vector on a dataframe - r

I want to add a column to my data.table based on a vector. However, my data.table has 20 rows and my vector has 7 values. I want the data.table to be repeated 7 times so that each 20-row copy gets one value from the vector. This might be simple, but I am not able to work out how to do it.
Some sample data -
library(data.table)
set.seed(9901)
#### create the sample variables for creating the data
group <- c(1:7)
brn <- sample(1:10,20,replace = T)
period <- c(101:120)
df1 <- data.table(brn, period)
So in this case I want to add a column group. The data.table would then have 140 rows: 20 rows for group 1, then 20 rows for group 2, and so on.

Apparently you want this:
df1[CJ(group, period), on = .(period)]
# brn period group
# 1: 3 101 1
# 2: 9 102 1
# 3: 9 103 1
# 4: 5 104 1
# 5: 5 105 1
# ---
#136: 9 116 7
#137: 7 117 7
#138: 10 118 7
#139: 2 119 7
#140: 7 120 7
CJ creates a data.table resulting from the cartesian join of the vectors passed to it. This data.table is then joined with df1 based on the column specified by on.
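The same cross-join-then-label idea can be sketched in base R (a hypothetical equivalent, not the answer's code): merge() with by = NULL returns the Cartesian product of its two inputs, so pairing df1 with a one-column group table yields the desired 140 rows.

```r
# Sample data mirroring the question: 20 rows of brn/period
set.seed(9901)
df1 <- data.frame(brn = sample(1:10, 20, replace = TRUE), period = 101:120)

# Cartesian product: every group value paired with every df1 row -> 140 rows
res <- merge(df1, data.frame(group = 1:7), by = NULL)
nrow(res)  # 140
```

Unlike the data.table CJ() join, merge(by = NULL) does not need a shared key column; it simply crosses all rows.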

I would (1) create one copy of the 20-row table per value in group, collecting the copies in a list, and
(2) bind them together:
AllLists <- lapply(group, function(x) cbind(group = x, df1))
do.call("rbind", AllLists)

A solution with data.table. Is that what you are looking for?
library(data.table)
df2 <- df1[rep(1:nrow(df1), times = 7)][, group := rep(group, each = 20)]
But Roland's solution in the comments is definitely more elegant.
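The answer above relies on the two forms of rep(), which behave quite differently; a tiny base-R illustration of the distinction:

```r
x <- 1:3
rep(x, times = 2)  # 1 2 3 1 2 3  (whole vector repeated end to end)
rep(x, each = 2)   # 1 1 2 2 3 3  (each element repeated in place)
```

In the answer, times = 7 stacks seven full copies of the row indices, while each = 20 stretches every group value over one 20-row block.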

Related

Assign a sequence of numbers to data frame rows

I am trying to assign a sequence of numbers to rows in a data frame based off of the row position. I have 330 rows and I want each set of nine rows to be named one through nine in a new ID column. For example, I want rows 1-9 labeled 1-9, rows 10-18 labeled 1-9, rows 19-27 labeled 1-9 and so on.
I have tried to use this code:
test <- temp.df %>% mutate(id = seq(from = 1, to = 330, along.width= 9))
test
but it just ends up creating a new column that labels the rows 1-330, as shown below.
Time Temperature ID
09:36:52 25.4 1
09:36:59 25.4 2
09:37:07 25.4 3
09:37:14 25.4 4
09:37:21 25.4 5
09:37:29 25.4 6
09:37:36 25.4 7
09:37:43 25.5 8
09:37:51 25.5 9
09:37:58 25.5 10
What is the best way to accomplish my goal?
I think if you could provide a snippet of the data.frame temp.df, it would be easier to help you out. (Note that along.width is not an argument of seq(); the real argument is along.with, so the misspelled one was silently ignored and you effectively ran seq(1, 330).) Maybe the following lines could help; it is not a very flexible solution, but it is based on the information you provided.
n_repeated <- 9 #block of ID
N_rows <- 330 #number of observations
df <- data.frame(id = rep(seq(1, n_repeated), length.out = N_rows))
head(df,n = 15)
# id
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# 10 1
# 11 2
# 12 3
# 13 4
# 14 5
# 15 6
[Edited]
Using mutate from dplyr, this line should do it:
test <- temp.df %>% mutate(id = rep(seq(1, 9), length.out = nrow(temp.df)))
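One detail worth checking: 330 is not an exact multiple of 9, so the final block is incomplete. rep() with length.out keeps the cyclic labels in step with the row count regardless; a quick base-R check:

```r
ids <- rep(1:9, length.out = 330)
length(ids)   # 330
330 %% 9      # 6 -> the final block holds only labels 1 through 6
tail(ids, 6)  # 1 2 3 4 5 6  (the incomplete final block)
```

Labels 1 through 6 therefore appear 37 times and labels 7 through 9 only 36 times.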

Extract the n highest value by group with data.table in R [duplicate]

This question already has answers here:
How to order data within subgroups in data.table R
(1 answer)
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 3 years ago.
Consider the following:
DT <- data.table(a = sample(1:2), b = sample(1:100, 10), d = rnorm(10))
How can I display the whole of DT containing only the rows with the 3 highest values of b for each level of a?
I found an approximate solution here:
> DT[order(-b), head(b, 3), a]
a V1
1: 2 99
2: 2 83
3: 2 75
4: 1 96
5: 1 71
6: 1 67
However, it will only display the columns a and b (the latter renamed to V1). I would like to obtain the same DT but with all the columns and the original column names. Please consider that in practice my DT has many columns; I'm aiming for an elegant solution that doesn't require listing all the column names.
We can order by b and then keep the first 3 rows of .SD for each group, which preserves all columns.
library(data.table)
DT[order(-b),head(.SD, 3),a]
# a b d
#1: 1 100 1.4647474
#2: 1 61 -1.1250266
#3: 1 51 0.9435628
#4: 2 82 0.3302404
#5: 2 72 -0.0219803
#6: 2 55 1.6865777

Return subset of rows based on aggregate of keyed rows [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 6 years ago.
I would like to subset a data.table in R within each group, based on an aggregate function computed over that group's rows. For example, for each key, return all rows whose value is greater than the mean of a field calculated only over the rows in that group. Example:
library(data.table)
dt <- data.table(Group = rep(c(1:5), each = 5), Detail = c(1:25))
setkey(dt, 'Group')
library(foreach)
library(dplyr)
ret <- foreach(grp = dt[, unique(Group)], .combine = bind_rows, .multicombine = T) %do%
  dt[Group == grp & Detail > dt[Group == grp, mean(Detail)], ]
# Group Detail
# 1: 1 4
# 2: 1 5
# 3: 2 9
# 4: 2 10
# 5: 3 14
# 6: 3 15
# 7: 4 19
# 8: 4 20
# 9: 5 24
#10: 5 25
The question is: is it possible to code the last two lines succinctly using data.table features? Sorry if this is a repeat; I am struggling to explain the exact goal in a way that Google/Stack Overflow can find.
Using the special symbol .SD works. I was not aware of it, thanks:
dt[, .SD[Detail > mean(Detail)], by = Group]
Also works, with some performance gains:
indx <- dt[, .I[Detail > mean(Detail)], by = Group]$V1 ; dt[indx]
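A base-R counterpart of the same group-wise filter uses ave(), which broadcasts each group's mean back to row level so an ordinary logical subset can be applied (a sketch, not from the answers above):

```r
df <- data.frame(Group = rep(1:5, each = 5), Detail = 1:25)

# ave() returns, for every row, the mean of Detail within that row's Group
keep <- df$Detail > ave(df$Detail, df$Group, FUN = mean)
df[keep, ]  # the same 10 rows (two per group) as the data.table solutions
```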

Aggregate/collapse data frame

Is there an "all-in-one" convenience function in R that can collapse/aggregate a data frame to resolve the many-to-many problem? The motivation is to reduce many-to-many relationships so that two or more tables can be joined on some primary key (a column with unique identifier values). To elucidate, consider a data frame like:
set.seed(1) # for reproducibility
df <- data.frame(id = sort(rep(seq(1,3),4)), # primary key
geo_loc = state.abb[sample(seq(1,length(state.name)), # state abbreviations
size=length(sort(rep(seq(1,3),4))),
replace = TRUE)],
revenue = c(sample(seq(0,50),size=3), sample(c(seq(101,200)),size=3),
sample(seq(201,300),size=4), sample(seq(301,1000),size=2)),
prod_id = sample(LETTERS[c(seq(1,4))],size=12, replace=TRUE),
quant = c(sample(seq(0,5),size=4), sample(c(seq(3,8)),size=4),
sample(seq(6,11),size=2), sample(seq(9,14),size=2))) ; df
id geo_loc revenue prod_id quant
1 1 MN 47 D 0
2 1 MA 29 B 3
3 1 SD 50 B 4
4 1 NM 174 A 1
5 2 NC 136 D 6
6 2 LA 143 B 5
7 2 IN 215 C 8
8 2 WY 202 A 4
9 3 NY 271 A 10
10 3 HI 211 C 9
11 3 CT 613 C 10
12 3 MS 748 A 14
Does a function already exist that will collapse this table such that there is only one row per unique id? It would have to convert the geo_loc and prod_id columns to k - 1 dummy columns (where k is the number of levels). It would also be nice if such a function could automatically bin the revenue into a number of blocks, perhaps based on quantiles.
Only aggregate when you have a proper grouping variable. It would be more logical to aggregate by prod_id, for example.
To perform these data tidying and aggregating operations I would personally recommend spread() and gather() from the tidyr package and summarise() and group_by() from the dplyr package.
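Assuming the goal is one row per id, the two pieces can also be sketched in base R: model.matrix() builds the dummy columns and aggregate() collapses the numeric fields. This is a rough sketch on a toy table, not a single all-in-one convenience function:

```r
df <- data.frame(id = rep(1:3, each = 2),
                 prod_id = c("A", "B", "B", "C", "A", "C"),
                 revenue = c(10, 20, 30, 40, 50, 60))

# One dummy column per prod_id level (dropping the intercept keeps all levels)
dat <- cbind(df["id"], revenue = df$revenue, model.matrix(~ prod_id - 1, df))

# Collapse to one row per id: sum the revenue and the dummy indicators
collapsed <- aggregate(. ~ id, data = dat, FUN = sum)
collapsed
```

For a k - 1 encoding as described in the question, keep the intercept in the formula (~ prod_id) so the first level becomes the reference.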

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R, and I need to know whether the last group of rows in the dataset has fewer than x observations.
For example:
If I use a dataset with 10 observations and 2 variables and I want to group by every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works only if the number of rows is an exact multiple of 3; every three rows then get tagged with consecutive group numbers.
A quick and dirty way to tackle the problem of not knowing how incomplete the final group is, is to simply check the remainder when nrow is divided by the group size: nrow(df) %% 3 # change the divisor to your group size
assuming your data is df you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
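The ceiling()/truncation trick above can equivalently be written with integer division, and nrow(df) %% 3 reports the size of a short final group:

```r
n <- 10  # e.g. 10 rows grouped in blocks of 3

# Zero-based index, integer-divided by the block size, shifted back to 1-based
newcol <- (seq_len(n) - 1) %/% 3 + 1
newcol   # 1 1 1 2 2 2 3 3 3 4
n %% 3   # 1 -> the last group holds only 1 row (0 would mean a complete group)
```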