I have a data frame (actually I prefer data.table) with columns holding multiple pairs of (x, y) coordinates and corresponding alpha values, something like the following:
> data.frame(x_1 = 1:5, y_1 = 6:10,
x_2 = 11:15, y_2 = 16:20,
x_3 = 21:25, y_3=26:30,
alpha = seq(0.2,1,0.2))
x_1 y_1 x_2 y_2 x_3 y_3 alpha
1 1 6 11 16 21 26 0.2
2 2 7 12 17 22 27 0.4
3 3 8 13 18 23 28 0.6
4 4 9 14 19 24 29 0.8
5 5 10 15 20 25 30 1.0
I need to organise it into a long format with an x column and a y column, where each row of coordinates from df is stacked into three pairs on top of one another; a column for alpha, duplicated for each pairing; and a column for the corresponding pair index, as follows:
x y alpha index
1 1 6 0.2 1
2 11 16 0.2 2
3 21 26 0.2 3
4 2 7 0.4 1
5 12 17 0.4 2
6 22 27 0.4 3
7 3 8 0.6 1
8 13 18 0.6 2
9 23 28 0.6 3
10 4 9 0.8 1
11 14 19 0.8 2
12 24 29 0.8 3
13 5 10 1.0 1
14 15 20 1.0 2
15 25 30 1.0 3
I have tried to use gather without much success: trying to melt by pairs of columns and then duplicating the alpha values caused me grief. I then resorted to a for loop through the rows of df, compiling (pre-allocated) vectors of x, y and alpha values with each iteration, but even with the pre-allocation this was horrendously slow compared to a similar operation in Python.
In practice I have about 20,000-40,000 rows, many more "constant" columns like alpha and something like 3-5 pair indices.
Apologies if there has been a similar question - I couldn't find one and really struggle wording questions about quite specific data manipulations. Any help is greatly appreciated!
gather has been superseded by pivot_longer. I think this gives you what you want.
df %>%
pivot_longer(
c(starts_with("x"), starts_with("y")),
names_pattern="(.)_(.)",
names_to=c(".value", "index")
)
# A tibble: 15 x 4
alpha index x y
<dbl> <chr> <int> <int>
1 0.2 1 1 6
2 0.2 2 11 16
3 0.2 3 21 26
4 0.4 1 2 7
5 0.4 2 12 17
6 0.4 3 22 27
7 0.6 1 3 8
8 0.6 2 13 18
9 0.6 3 23 28
10 0.8 1 4 9
11 0.8 2 14 19
12 0.8 3 24 29
13 1 1 5 10
14 1 2 15 20
15 1 3 25 30
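One caveat: names_pattern = "(.)_(.)" captures single characters only, so it would break with ten or more pairs. A more defensive pattern, as a sketch:
df %>%
  pivot_longer(
    -alpha,
    names_pattern = "([xy])_(\\d+)",  # capture the x/y prefix and a multi-digit index
    names_to = c(".value", "index")
  )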
Does this work as expected?
df %>%
pivot_longer(cols = -alpha, names_to = c("col", "index"), names_sep = "_") %>%
pivot_wider(names_from = col, values_from = value)
Output
# A tibble: 15 x 4
alpha index x y
<dbl> <chr> <int> <int>
1 0.2 1 1 6
2 0.2 2 11 16
3 0.2 3 21 26
4 0.4 1 2 7
5 0.4 2 12 17
6 0.4 3 22 27
7 0.6 1 3 8
8 0.6 2 13 18
9 0.6 3 23 28
10 0.8 1 4 9
11 0.8 2 14 19
12 0.8 3 24 29
13 1 1 5 10
14 1 2 15 20
15 1 3 25 30
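Note that index comes out as character here; with tidyr 1.1+, pivot_longer()'s names_transform argument can convert it to integer during the reshape, as a sketch:
df %>%
  pivot_longer(
    cols = -alpha,
    names_to = c("col", "index"),
    names_sep = "_",
    names_transform = list(index = as.integer)  # index becomes <int> instead of <chr>
  ) %>%
  pivot_wider(names_from = col, values_from = value)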
Here is another pivot_longer approach:
pivot_longer everything except alpha into a single value column named x
use the window function lead to pull the following y value alongside each x
remove every second row with filter
create the index column
library(dplyr)
library(tidyr)
df %>%
pivot_longer(c(-alpha, starts_with("x")),
names_to = "names.x",
values_to = "x"
) %>%
mutate(y = lead(x)) %>%
filter(row_number() %% 2 != 0) %>% ## Delete even-rows
select(-names.x) %>%
mutate(index = rep(1:3, length.out = n()))
alpha x y index
<dbl> <int> <int> <int>
1 0.2 1 6 1
2 0.2 11 16 2
3 0.2 21 26 3
4 0.4 2 7 1
5 0.4 12 17 2
6 0.4 22 27 3
7 0.6 3 8 1
8 0.6 13 18 2
9 0.6 23 28 3
10 0.8 4 9 1
11 0.8 14 19 2
12 0.8 24 29 3
13 1 5 10 1
14 1 15 20 2
15 1 25 30 3
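Since the question mentions a preference for data.table, melt() with patterns() handles the paired columns directly; a minimal sketch (rows come out grouped by index rather than interleaved, but the content is the same):
library(data.table)
melt(
  as.data.table(df),
  measure.vars = patterns(x = "^x_", y = "^y_"),  # each named pattern becomes a value column
  variable.name = "index"
)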
I know how to do basic stuff in R, but I am still a newbie. I am also probably asking a pretty redundant question (but I don't know how to phrase it for Google to find the right hits).
I have been getting hits like the below:
Assign value to group based on condition in column
R - Group by variable and then assign a unique ID
I want to assign subgroups into groups, and create a new column out of them.
I have data like the following:
dataframe:
ID SubID Values
1 15 0.5
1 15 0.2
2 13 0.1
2 13 0
1 14 0.3
1 14 0.3
2 10 0.2
2 10 1.6
6 31 0.7
6 31 1.0
new dataframe:
ID SubID Values groups
1 15 0.5 2
1 15 0.2 2
2 13 0.1 2
2 13 0 2
1 14 0.3 1
1 14 0.3 1
2 10 0.2 1
2 10 1.6 1
6 31 0.7 1
6 31 1.0 1
I have tried the following in R, but I am not getting the desired results:
newdataframe$groups <- dataframe %>% group_indices(,dataframe$ID, dataframe$SubID)
newdataframe<- dataframe %>% group_by(ID, SubID) %>% mutate(groups=group_indices(,dataframe$ID, dataframe$SubID))
I am not sure how to frame the question in R. I want to group by ID and SubID, number those subgroups within each ID, and reset the grouping count for each ID.
Any help would be really appreciated.
Here is an alternative approach which uses the rleid() function from the data.table package. rleid() generates a run-length type id column.
According to the expected result, the OP expects SubID to be numbered by order of value and not by order of appearance. Therefore, we need to call arrange().
library(dplyr)
df %>%
group_by(ID) %>%
arrange(SubID) %>%
mutate(groups = data.table::rleid(SubID))
ID SubID Values groups
<int> <int> <dbl> <int>
1 2 10 0.2 1
2 2 10 1.6 1
3 2 13 0.1 2
4 2 13 0 2
5 1 14 0.3 1
6 1 14 0.3 1
7 1 15 0.5 2
8 1 15 0.2 2
9 6 31 0.7 1
10 6 31 1 1
Note that the row order has changed.
BTW: With data.table, the code is less verbose and the original row order is maintained:
library(data.table)
setDT(df)[order(ID, SubID), groups := rleid(SubID), by = ID][]
ID SubID Values groups
1: 1 15 0.5 2
2: 1 15 0.2 2
3: 2 13 0.1 2
4: 2 13 0.0 2
5: 1 14 0.3 1
6: 1 14 0.3 1
7: 2 10 0.2 1
8: 2 10 1.6 1
9: 6 31 0.7 1
10: 6 31 1.0 1
There are multiple ways to do this. One way would be to group_by ID and create a unique number for each SubID by converting it to a factor and then to an integer.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(groups = as.integer(factor(SubID)))
# ID SubID Values groups
# <int> <int> <dbl> <int>
# 1 1 15 0.5 2
# 2 1 15 0.2 2
# 3 2 13 0.1 2
# 4 2 13 0 2
# 5 1 14 0.3 1
# 6 1 14 0.3 1
# 7 2 10 0.2 1
# 8 2 10 1.6 1
# 9 6 31 0.7 1
#10 6 31 1 1
In base R, we can use ave with similar logic
df$groups <- with(df, ave(SubID, ID, FUN = factor))
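Alternatively, dplyr's dense_rank() gives the same numbering by value while preserving the original row order; a sketch:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(groups = dense_rank(SubID)) %>%  # rank distinct SubID values within each ID
  ungroup()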
I have a dataframe that looks like this:
group1<-c(rep(1,12))
group2<-c(rep('Low',6), rep('High',6))
var <-c(1:6,1:6)
var1 <-c(2:13)
var2 <-c(20:31)
df1<-data.frame(group1,group2,var,var1,var2)
group1<-c(rep(2,12))
group2<-c(rep('Low',6), rep('High',6))
var <-c(1:6,1:6)
var1 <-c(2:13)
var2 <-c(20:31)
df2<-data.frame(group1,group2,var,var1,var2)
df<-rbind(df1,df2)
group1 group2 var var1 var2
1 1 Low 1 2 20
2 1 Low 2 3 21
3 1 Low 3 4 22
4 1 Low 4 5 23
5 1 Low 5 6 24
6 1 Low 6 7 25
7 1 High 1 8 26
8 1 High 2 9 27
9 1 High 3 10 28
10 1 High 4 11 29
11 1 High 5 12 30
12 1 High 6 13 31
13 2 Low 1 2 20
14 2 Low 2 3 21
15 2 Low 3 4 22
16 2 Low 4 5 23
17 2 Low 5 6 24
18 2 Low 6 7 25
19 2 High 1 8 26
20 2 High 2 9 27
21 2 High 3 10 28
22 2 High 4 11 29
23 2 High 5 12 30
24 2 High 6 13 31
I want to normalize my columns in the following way. For each combination of group1 and group2, I want to divide the var1 and var2 columns by their first element. This allows me to construct a common scale/index across the columns of interest. For example, for the combination group1 = 1 and group2 = Low, the relevant elements of var1 should be transformed into 2/2, 3/2, 4/2, 5/2, 6/2, 7/2; for the combination group1 = 1 and group2 = High they should be 8/8, 9/8, 10/8, 11/8, 12/8, 13/8, and so on.
I want to do the above transformations for both var1 and var2. The expected output should look like this:
group1 group2 var var1 var2 var1_tra var2_tra
1 1 Low 1 2 20 1.000 1.000000
2 1 Low 2 3 21 1.500 1.050000
3 1 Low 3 4 22 2.000 1.100000
4 1 Low 4 5 23 2.500 1.150000
5 1 Low 5 6 24 3.000 1.200000
6 1 Low 6 7 25 3.500 1.250000
7 1 High 1 8 26 1.000 1.000000
8 1 High 2 9 27 1.125 1.038462
9 1 High 3 10 28 1.250 1.076923
10 1 High 4 11 29 1.375 1.115385
11 1 High 5 12 30 1.500 1.153846
12 1 High 6 13 31 1.625 1.192308
13 2 Low 1 2 20 1.000 1.000000
14 2 Low 2 3 21 1.500 1.050000
15 2 Low 3 4 22 2.000 1.100000
16 2 Low 4 5 23 2.500 1.150000
17 2 Low 5 6 24 3.000 1.200000
18 2 Low 6 7 25 3.500 1.250000
19 2 High 1 8 26 1.000 1.000000
20 2 High 2 9 27 1.125 1.038462
21 2 High 3 10 28 1.250 1.076923
22 2 High 4 11 29 1.375 1.115385
23 2 High 5 12 30 1.500 1.153846
24 2 High 6 13 31 1.625 1.192308
NOTE: The numbers could be anything, usually positive real numbers, and because my data frame is really big, I cannot know in advance which element I will need to divide by in order to perform such transformations.
After grouping by 'group1' and 'group2', use mutate_at to divide the selected columns (var1 and var2) by their first value:
library(dplyr)
df %>%
group_by(group1, group2) %>%
mutate_at(vars(var1, var2), list(tra = ~ ./first(.)))
# A tibble: 24 x 7
# Groups: group1, group2 [4]
# group1 group2 var var1 var2 var1_tra var2_tra
# <dbl> <fct> <int> <int> <int> <dbl> <dbl>
# 1 1 Low 1 2 20 1 1
# 2 1 Low 2 3 21 1.5 1.05
# 3 1 Low 3 4 22 2 1.1
# 4 1 Low 4 5 23 2.5 1.15
# 5 1 Low 5 6 24 3 1.2
# 6 1 Low 6 7 25 3.5 1.25
# 7 1 High 1 8 26 1 1
# 8 1 High 2 9 27 1.12 1.04
# 9 1 High 3 10 28 1.25 1.08
#10 1 High 4 11 29 1.38 1.12
# … with 14 more rows
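As an aside, mutate_at() is superseded in recent dplyr (1.0+); an equivalent across() call, as a sketch:
df %>%
  group_by(group1, group2) %>%
  mutate(across(c(var1, var2), ~ .x / first(.x), .names = "{.col}_tra"))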
Or using data.table
nm1 <- c("var1", "var2")
nm2 <- paste0(nm1, "_tra")
library(data.table)
setDT(df)[, (nm2) := lapply(.SD, function(x) x/first(x)),
by = .(group1, group2), .SDcols = nm1]
Also, you can use sqldf like the following:
result <- sqldf('select df.*, (df.var1 + 0.0) / scale.s_var1 as var1_tra, (df.var2 + 0.0) / scale.s_var2 as var2_tra
from df join
(select group1, group2, min(var1) as s_var1, min(var2) as s_var2
from df
group by group1, group2) as scale
on df.group1 = scale.group1 AND df.group2 = scale.group2
')
In the above code, we first find the minimum value of var1 and var2 for each group using the following query:
select group1, group2, min(var1) as s_var1, min(var2) as s_var2
from df
group by group1, group2
We then use that as a nested query, joining it with the original data frame df on equality of group1 and group2. Note that min() stands in for the first element here, which works because the example data increase within each group; with arbitrary values you would need the row-wise first element instead.
set.seed(123)
df <- data.frame(x = sample(1:10, 20, replace = T), id = rep(1:2, each = 10))
For each id, I want to create a column which contains the sum of the previous 5 x values.
df %>% group_by(id) %>% mutate(roll.sum = c(x[1:4], zoo::rollapply(x, 5, sum)))
# Groups: id [2]
x id roll.sum
<int> <int> <int>
3 1 3
8 1 8
5 1 5
9 1 9
10 1 10
1 1 36
6 1 39
9 1 40
6 1 41
5 1 37
10 2 10
5 2 5
7 2 7
6 2 6
2 2 2
9 2 39
3 2 32
1 2 28
4 2 25
10 2 29
The 6th row should be 35 (3 + 8 + 5 + 9 + 10), the 7th row should be 33 (8 + 5 + 9 + 10 + 1) and so on.
However, the above code also includes the row itself in the calculation. How can I fix this?
library(zoo)
df %>% group_by(id) %>%
mutate(Sum_prev = rollapply(x, list(-(1:5)), sum, fill=NA, align = "right", partial=F))
# you can use rollapply(x, list((1:5)), sum, fill = NA, align = "left", partial = F)
# to sum the next 5 elements, skipping the current one
x id Sum_prev
1 3 1 NA
2 8 1 NA
3 5 1 NA
4 9 1 NA
5 10 1 NA
6 1 1 35
7 6 1 33
8 9 1 31
9 6 1 35
10 5 1 32
11 10 2 NA
12 5 2 NA
13 7 2 NA
14 6 2 NA
15 2 2 NA
16 9 2 30
17 3 2 29
18 1 2 27
19 4 2 21
20 10 2 19
There is the rollify function in the tibbletime package that you could use. You can read about it in this vignette: Rolling calculations in tibbletime.
library(tibbletime)
library(dplyr)
rolling_sum <- rollify(.f = sum, window = 5)
df %>%
  group_by(id) %>%
  mutate(roll.sum = lag(rolling_sum(x))) # added lag() here
# A tibble: 20 x 3
# Groups: id [2]
# x id roll.sum
# <int> <int> <int>
# 1 3 1 NA
# 2 8 1 NA
# 3 5 1 NA
# 4 9 1 NA
# 5 10 1 NA
# 6 1 1 35
# 7 6 1 33
# 8 9 1 31
# 9 6 1 35
#10 5 1 32
#11 10 2 NA
#12 5 2 NA
#13 7 2 NA
#14 6 2 NA
#15 2 2 NA
#16 9 2 30
#17 3 2 29
#18 1 2 27
#19 4 2 21
#20 10 2 19
If you want the NAs to be some other value, you can use, for example, if_else
df %>%
  group_by(id) %>%
  mutate(roll.sum = lag(rolling_sum(x))) %>%
  mutate(roll.sum = if_else(is.na(roll.sum), x, roll.sum))
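If data.table is an option, a sketch combining frollsum() and shift() computes the same previous-5 sum:
library(data.table)
# frollsum() sums the window ending at each row; shift() lags the result by one,
# so every row gets the sum of the five values before it (NA while the window fills)
setDT(df)[, Sum_prev := shift(frollsum(x, n = 5)), by = id][]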
I have a data frame with 3 columns, each containing a small number of values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column, which will contain for each row the product of the frequencies of each value within its column. So, for example, the first value of the column "Freq" would be the product of the frequency of 0 within column A, the frequency of 1 within column B and the frequency of 0.1 within column C.
How can I do this efficiently with dplyr/base R?
To emphasize: this is not the combined frequency of each whole row, but the product of the per-column frequencies.
An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column
With Map, a list is created using match, where each list item has the same length as the number of rows of df. Each list item is a vector of frequencies corresponding to the values in df.
With Reduce, the product of the vectors in the list m is calculated element-wise: the first values of the vectors in m are multiplied with each other, then the 2nd values, etc.
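For instance, the element-wise product on made-up vectors:
Reduce(`*`, list(c(1, 2), c(3, 4), c(5, 6)))
# [1] 15 48   (i.e. 1*3*5 and 2*4*6)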
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))
I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df, table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)
df %>%
  group_by(A) %>%
  mutate(An = n()) %>%
  group_by(B) %>%
  mutate(Bn = n()) %>%
  group_by(C) %>%
  mutate(Cn = n(), prod = An * Bn * Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
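With a recent dplyr, the repeated group_by()/mutate() pairs can be condensed with add_count(), which appends a frequency column without grouping the data; a sketch on the same example:
library(dplyr)
df %>%
  add_count(A, name = "An") %>%  # frequency of each A value
  add_count(B, name = "Bn") %>%
  add_count(C, name = "Cn") %>%
  mutate(prod = An * Bn * Cn)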
I have a data frame and I want to create new variables containing the sum of x and y for each ID and Group. If I aggregate normally, the dimensions of the data shrink; in my case I need to keep and repeat each row.
ID <- c(rep(1,3), rep(3, 5), rep(4,4))
Group <-c(1,1,2,1,1,1,2,2,1,1,1,2)
x <- c(1:12)
y<- c(12:23)
df <- data.frame(ID,Group,x,y)
ID Group x y
1 1 1 1 12
2 1 1 2 13
3 1 2 3 14
4 3 1 4 15
5 3 1 5 16
6 3 1 6 17
7 3 2 7 18
8 3 2 8 19
9 4 1 9 20
10 4 1 10 21
11 4 1 11 22
12 4 2 12 23
The output should have two more variables, sumx and sumy, grouped by (ID, Group):
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
Any ideas?
As short as:
df$sumx <- with(df, ave(x, ID, Group, FUN = sum))
df$sumy <- with(df, ave(y, ID, Group, FUN = sum))
We can use dplyr
library(dplyr)
df %>%
  group_by(ID, Group) %>%
  mutate_each(funs(sum)) %>% # mutate_each()/funs() are deprecated in current dplyr
  rename(sumx = x, sumy = y) %>%
  bind_cols(., df[c("x", "y")])
If there are only two columns to sum, then
df %>%
group_by(ID, Group) %>%
mutate(sumx = sum(x), sumy = sum(y))
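For completeness, a data.table sketch that adds both sums by reference:
library(data.table)
setDT(df)[, c("sumx", "sumy") := lapply(.SD, sum),
          by = .(ID, Group), .SDcols = c("x", "y")][]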
You can use the code below to get what you want if it is a single column; if you have more than one column, extend it accordingly:
library(dplyr)
data13 <- data12 %>%
group_by(Category) %>%
mutate(cum_Cat_GMR = cumsum(GrossMarginRs))
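If more than one column needs the same treatment, across() extends this pattern; a sketch where NetMarginRs is a hypothetical second column:
library(dplyr)
data13 <- data12 %>%
  group_by(Category) %>%
  mutate(across(c(GrossMarginRs, NetMarginRs),  # NetMarginRs is hypothetical
                cumsum, .names = "cum_Cat_{.col}"))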