Repeat data.frame N times with adding column - r

I have the following data frame and I want to repeat it N times
dc <- read.table(text = "from 1 2 3 4 5
1 0.01 0.02 0.03 0.04 0.05
2 0.06 0.07 0.08 0.09 0.10
3 0.11 0.12 0.13 0.14 0.15
4 0.16 0.17 0.18 0.19 0.20
5 0.21 0.22 0.23 0.24 0.25", header = TRUE)
n<-20
ddr <- NA
for(i in 1:n) {
ddr <- rbind(ddr, cbind(dc,i))
}
As a result, I would like to receive:
from X1 X2 X3 X4 X5 i
1 0.01 0.02 0.03 0.04 0.05 1
2 0.06 0.07 0.08 0.09 0.10 1
3 0.11 0.12 0.13 0.14 0.15 1
4 0.16 0.17 0.18 0.19 0.20 1
5 0.21 0.22 0.23 0.24 0.25 1
1 0.01 0.02 0.03 0.04 0.05 2
2 0.06 0.07 0.08 0.09 0.10 2
3 0.11 0.12 0.13 0.14 0.15 2
4 0.16 0.17 0.18 0.19 0.20 2
5 0.21 0.22 0.23 0.24 0.25 2
.............................
1 0.01 0.02 0.03 0.04 0.05 20
2 0.06 0.07 0.08 0.09 0.10 20
3 0.11 0.12 0.13 0.14 0.15 20
4 0.16 0.17 0.18 0.19 0.20 20
5 0.21 0.22 0.23 0.24 0.25 20
The data frame must be repeated N times, with the repeat number added as a new column.
Is there a proper solution (a simple function in R) for this? In my case the script does not work unless ddr is declared first (ddr <- NA). Thanks!

You can use rep() to replicate the row indexes, and also to create the repeat number column.
cbind(dc[rep(1:nrow(dc), n), ], i = rep(1:n, each = nrow(dc)))
Let's break it down:
dc[rep(1:nrow(dc), n), ] passes the replicated row indexes as the i (row) argument of [ for data frames, so the whole block of rows is returned n times
rep(1:n, each = nrow(dc)) repeats each value of the sequence 1:n nrow(dc) times
cbind(...) combines the two into a single data frame
As @HubertL points out in the comments, this can be further simplified to
cbind(dc, i = rep(1:n, each = nrow(dc)))
thanks to the magic of recycling. Please go give him a vote.
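As a quick check, the sketch below (run with a smaller n = 3, chosen only for illustration) compares the explicit and the recycled versions. It also shows a loop-free equivalent of the rbind loop from the question, which needs no ddr <- NA placeholder; that placeholder should otherwise end up as a leading row of NA values in the result. The object names explicit, recycled and stacked are made up for this example.

```r
# Sample data from the question
dc <- read.table(text = "from 1 2 3 4 5
1 0.01 0.02 0.03 0.04 0.05
2 0.06 0.07 0.08 0.09 0.10
3 0.11 0.12 0.13 0.14 0.15
4 0.16 0.17 0.18 0.19 0.20
5 0.21 0.22 0.23 0.24 0.25", header = TRUE)
n <- 3  # small n chosen only for illustration

# Explicit version: replicate the row indexes, then add the counter
explicit <- cbind(dc[rep(seq_len(nrow(dc)), n), ],
                  i = rep(seq_len(n), each = nrow(dc)))
rownames(explicit) <- NULL  # drop the "1.1", "2.1", ... row names

# Recycled version: cbind() recycles dc to the length of i
recycled <- cbind(dc, i = rep(seq_len(n), each = nrow(dc)))

# Loop-free equivalent of the rbind loop, no NA placeholder needed
stacked <- do.call(rbind, lapply(seq_len(n), function(i) cbind(dc, i = i)))

# Compare the three results (attributes such as row names are ignored)
all.equal(explicit, recycled, check.attributes = FALSE)
all.equal(recycled, stacked, check.attributes = FALSE)
```

Note that the recycling shortcut only works because the length of i is a whole multiple of nrow(dc).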

Here is another, arguably more intuitive way, about identical in speed to the other top answer:
n <- 3
data.frame(dc, i = rep(1:n, each = NROW(dc)))
Output (repeated 3x):
from X1 X2 X3 X4 X5 i
1 1 0.01 0.02 0.03 0.04 0.05 1
2 2 0.06 0.07 0.08 0.09 0.10 1
3 3 0.11 0.12 0.13 0.14 0.15 1
4 4 0.16 0.17 0.18 0.19 0.20 1
5 5 0.21 0.22 0.23 0.24 0.25 1
6 1 0.01 0.02 0.03 0.04 0.05 2
7 2 0.06 0.07 0.08 0.09 0.10 2
8 3 0.11 0.12 0.13 0.14 0.15 2
9 4 0.16 0.17 0.18 0.19 0.20 2
10 5 0.21 0.22 0.23 0.24 0.25 2
11 1 0.01 0.02 0.03 0.04 0.05 3
12 2 0.06 0.07 0.08 0.09 0.10 3
13 3 0.11 0.12 0.13 0.14 0.15 3
14 4 0.16 0.17 0.18 0.19 0.20 3
15 5 0.21 0.22 0.23 0.24 0.25 3
EDIT: Top Answer Speed Test
This test was scaled up to n=1e+05, iterations=100:
func1 <- function(){
data.frame(dc, i = rep(1:n, each = NROW(dc)))
}
func2 <- function(){
cbind(dc, i = rep(1:n, each = nrow(dc)))
}
func3 <- function(){
cbind(dc[rep(1:nrow(dc), n), ], i = rep(1:n, each = nrow(dc)))
}
microbenchmark::microbenchmark(
func1(),func2(),func3())
Unit: milliseconds
expr min lq mean median uq max neval cld
func1() 15.58709 21.69143 28.62695 22.01692 23.85648 117.9012 100 a
func2() 15.99023 21.59375 28.37328 22.18298 23.99953 136.1209 100 a
func3() 414.18741 436.51732 473.14571 453.26099 498.21576 666.8515 100 b

Related

Sort columns of a data.frame based on a list of prefixes

I have a data frame that looks like this:
Names S1_ATTCG S1_GTTA S9_TGCC S5_TGGA S21_GGCA
A 0.34 0.12 0.32 0.98 0.65
B 0.14 0.02 0.45 0.09 0.006
C 0.04 0.34 0.98 0.12 0.06
Is there a way to sort the columns so that the columns beginning with S1 and S5 (regex ^S1 and ^S5) appear before all the others?
The data frame has 53,000 columns and 12,000 rows.
A quick and dirty solution (the pattern is anchored so that prefixes such as S10 or S15 are not matched):
cbind(
  d[, grepl("^(S1|S5)_", names(d))],
  d[, !grepl("^(S1|S5)_", names(d))]
)
S1_ATTCG S1_GTTA S5_TGGA Names S9_TGCC S21_GGCA
1 0.34 0.12 0.98 A 0.32 0.650
2 0.14 0.02 0.09 B 0.45 0.006
3 0.04 0.34 0.12 C 0.98 0.060
With data:
d <- read.table(text = 'Names S1_ATTCG S1_GTTA S9_TGCC S5_TGGA S21_GGCA
A 0.34 0.12 0.32 0.98 0.65
B 0.14 0.02 0.45 0.09 0.006
C 0.04 0.34 0.98 0.12 0.06 ', header = T)
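A possible variation on the same idea, as a sketch rather than a definitive answer: build the column order from index vectors so that Names stays in front, and anchor the pattern with ^ and a trailing underscore. An unanchored pattern such as "S1|S5" would also match prefixes like S10 or S15, which seems likely to matter with 53,000 columns. The names wanted, is_id and reordered, and the three column names in the last line, are made up for this example.

```r
d <- read.table(text = "Names S1_ATTCG S1_GTTA S9_TGCC S5_TGGA S21_GGCA
A 0.34 0.12 0.32 0.98 0.65
B 0.14 0.02 0.45 0.09 0.006
C 0.04 0.34 0.98 0.12 0.06", header = TRUE)

# TRUE for columns whose prefix is exactly S1 or S5
wanted <- grepl("^(S1|S5)_", names(d))
is_id  <- names(d) == "Names"

# Identifier first, then the S1/S5 columns, then the rest in original order
reordered <- d[, c(which(is_id), which(wanted), which(!wanted & !is_id))]
names(reordered)

# Hypothetical column names: the anchored pattern skips the S15 prefix
grepl("^(S1|S5)_", c("S1_AAA", "S15_CCC", "S5_GGG"))
```

Indexing by position avoids copying the data twice through cbind(), which may help at this size.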
Good ol' dplyr can help too.
library(dplyr)
d %>%
  relocate(starts_with(c('S1_', 'S5_')), .after = Names)
Names S1_ATTCG S1_GTTA S5_TGGA S9_TGCC S21_GGCA
1 A 0.34 0.12 0.98 0.32 0.650
2 B 0.14 0.02 0.09 0.45 0.006
3 C 0.04 0.34 0.12 0.98 0.060
d <- read.table(text = 'Names S1_ATTCG S1_GTTA S9_TGCC S5_TGGA S21_GGCA
A 0.34 0.12 0.32 0.98 0.65
B 0.14 0.02 0.45 0.09 0.006
C 0.04 0.34 0.98 0.12 0.06 ', header = T)
d[c("Names", gtools::mixedsort(names(d)[-1]))]
#> Names S1_ATTCG S1_GTTA S5_TGGA S9_TGCC S21_GGCA
#> 1 A 0.34 0.12 0.98 0.32 0.650
#> 2 B 0.14 0.02 0.09 0.45 0.006
#> 3 C 0.04 0.34 0.12 0.98 0.060
Created on 2021-09-16 by the reprex package (v2.0.1)
or
d %>%
  relocate(all_of(gtools::mixedsort(names(d)[-1])), .after = Names)

igraph constraint components (c-size, c-density, c-hierarchy)

I would like to compute the components of Burt's constraint discussed here by Burt.
Igraph's constraint command computes Burt's constraint score:
rm(list=ls())
library(igraph)
g <- graph.formula( "A"--------"B":"E":"F":"EGO",
"B"--------"A":"D":"EGO",
"C"--------"EGO",
"D"--------"B":"EGO",
"E"--------"A":"EGO",
"F"--------"A":"EGO",
"EGO"-"A":"B":"C":"D":"E":"F")
coords <- layout_nicely(g)
V(g)$label <- V(g)$name
g$layout <- coords
plot(g)
constraint(g)
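constraint() only returns the overall constraint score of each vertex i:
$$C_i = \sum_{j \neq i} c_{ij}$$
with
$$c_{ij} = \left( p_{ij} + \sum_{q \neq i,j} p_{iq} p_{qj} \right)^2$$
and
$$p_{ij} = \frac{a_{ij} + a_{ji}}{\sum_{k \neq i} (a_{ik} + a_{ki})}$$
$a_{ij}$ is the strength (= weight) of connection between two vertices i and j.
$p_{ij}$ is the direct connection between vertices i and j (share of connections from i to j of all of i's connections).
$\sum_{q \neq i,j} p_{iq} p_{qj}$ is the sum of indirect connections between vertices i and j (connections to other vertices q that are both connected to i and j).
I want to work with the individual components c-size, c-density, and c-hierarchy.
Burt reshapes the constraint equation like this:
$$C_i = \sum_{j} p_{ij}^2 + 2 \sum_{j} p_{ij} \sum_{q} p_{iq} p_{qj} + \sum_{j} \left( \sum_{q} p_{iq} p_{qj} \right)^2$$
The first term is c-size.
The second term is c-density.
The third term is c-hierarchy.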
I want to compute the three components of constraint for each vertex of the network.
I could think of two solutions, both of which are beyond my capabilities.
Maybe there is a way to harness these values directly from igraph's constraint command.
Alternatively, one would have to compute these values manually.
For the example above, I have computed these values by hand using Excel:
node  degrees  constraint  c-size  c-density  c-hierarchy
A     4        0.60        0.25    0.23       0.12
B     3        0.64        0.33    0.24       0.07
C     1        1.00        1.00    0.00       0.00
D     2        0.78        0.50    0.25       0.03
E     2        0.73        0.50    0.21       0.02
F     2        0.73        0.50    0.21       0.02
EGO   6        0.40        0.17    0.16       0.07
                     direct influence  indirect influence   combined influence
FROM  TO   strength  of j              of j                 of j
A     B    1         0.25              0.25  0.17  0.04     0.09
A     E    1         0.25              0.25  0.17  0.04     0.09
A     F    1         0.25              0.25  0.17  0.04     0.09
A     EGO  1         0.25              0.25  1.33  0.33     0.34
B     A    1         0.33              0.33  0.17  0.06     0.15
B     D    1         0.33              0.33  0.17  0.06     0.15
B     EGO  1         0.33              0.33  0.75  0.25     0.34
C     EGO  1         1.00              1.00  0.00  0.00     1.00
D     B    1         0.50              0.50  0.17  0.08     0.34
D     EGO  1         0.50              0.50  0.33  0.17     0.44
E     A    1         0.50              0.50  0.17  0.08     0.34
E     EGO  1         0.50              0.50  0.25  0.13     0.39
F     A    1         0.50              0.50  0.17  0.08     0.34
F     EGO  1         0.50              0.50  0.25  0.13     0.39
EGO   A    1         0.17              0.17  1.33  0.22     0.15
EGO   B    1         0.17              0.17  0.75  0.13     0.09
EGO   C    1         0.17              0.17  0.00  0.00     0.03
EGO   D    1         0.17              0.17  0.33  0.06     0.05
EGO   E    1         0.17              0.17  0.25  0.04     0.04
EGO   F    1         0.17              0.17  0.25  0.04     0.04
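One way the manual route could look in base R, as a minimal sketch rather than a description of what igraph does internally. Write p_ij for the share of i's ties that go to j, and s_ij for the sum over q of p_iq * p_qj. Expanding the square in the sum of (p_ij + s_ij)^2 gives three sums: p_ij^2 (c-size), 2 * p_ij * s_ij (c-density) and s_ij^2 (c-hierarchy). The sketch assumes an unweighted, undirected network, and it sums only over the direct contacts j of each vertex, as the hand-computed tables do. Whether constraint() follows that restriction on every graph is not checked here. All object names are made up for this example.

```r
nodes <- c("A", "B", "C", "D", "E", "F", "EGO")
edges <- rbind(c("A", "B"), c("A", "E"), c("A", "F"), c("A", "EGO"),
               c("B", "D"), c("B", "EGO"), c("C", "EGO"),
               c("D", "EGO"), c("E", "EGO"), c("F", "EGO"))

# Symmetric 0/1 adjacency matrix of the example network
a <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
a[edges] <- 1            # a two-column character matrix indexes by dimnames
a[edges[, 2:1]] <- 1

z <- a + t(a)            # tie strength between i and j, both directions
p <- z / rowSums(z)      # p_ij: share of i's ties that go to j
indirect <- p %*% p      # s_ij: sum over q of p_iq * p_qj
contact <- z > 0         # assumption: sum only over direct contacts j of i

c_size      <- rowSums(p^2 * contact)
c_density   <- rowSums(2 * p * indirect * contact)
c_hierarchy <- rowSums(indirect^2 * contact)

round(data.frame(degrees = rowSums(a),
                 constraint = c_size + c_density + c_hierarchy,
                 c_size, c_density, c_hierarchy), 2)
```

With igraph loaded, the matrix a could presumably be taken from as_adjacency_matrix(g, sparse = FALSE) instead of being typed in by hand.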

Rolling calculation of beta (linear regression slope)

I have a dataframe
> df
date comp ret mret
1 1/1/75 A 0.07 0.06
2 1/2/75 A 0.04 0.05
3 1/3/75 A 0.01 0.01
4 1/4/75 A -0.05 -0.04
5 1/5/75 A 0.05 0.05
6 1/6/75 A 0.04 0.04
7 1/7/75 A 0.07 0.08
8 1/8/75 A 0.01 0.00
9 1/9/75 A -0.02 -0.01
10 1/10/75 A -0.03 -0.01
11 1/11/75 A 0.01 0.02
12 1/12/75 A 0.03 0.04
13 1/1/75 B 0.09 0.06
14 1/2/75 B 0.07 0.05
15 1/3/75 B 0.04 0.01
16 1/4/75 B -0.02 -0.04
17 1/5/75 B 0.06 0.05
18 1/6/75 B 0.08 0.04
19 1/7/75 B 0.10 0.08
20 1/8/75 B 0.02 0.00
21 1/9/75 B -0.01 -0.01
22 1/10/75 B 0.01 -0.01
23 1/11/75 B -0.01 0.02
24 1/12/75 B 0.07 0.04
I want to calculate beta based on CAPM which is the slope between ret and mret (y-variable = ret, x-variable = mret). This means that I need to do a linear regression to calculate this beta.
The twist is then that I want to calculate the rolling beta over the past 5 months and at least 3 months for each company. To break it down:
I need to make the first beta calculation at line number 3 since this has 3 months of data. At line 4 I want to use the past 4 months of data when calculating beta, at line 5 I want the past 5 months of data, at line 6 I want the past 5 months of data again etc.
I want to group the calculation by the variable 'comp', meaning that at line 13 everything resets and the first calculation starts at line 15 and then follows the method mentioned above.
The results should end up looking like this:
date comp ret mret beta
1 1/1/75 A 0.07 0.06 NA
2 1/2/75 A 0.04 0.05 NA
3 1/3/75 A 0.01 0.01 1.0714
4 1/4/75 A -0.05 -0.04 1.1129
5 1/5/75 A 0.05 0.05 1.1098
6 1/6/75 A 0.04 0.04 1.0578
7 1/7/75 A 0.07 0.08 1.0193
8 1/8/75 A 0.01 0.00 0.9839
9 1/9/75 A -0.02 -0.01 0.9307
10 1/10/75 A -0.03 -0.01 1.0161
11 1/11/75 A 0.01 0.02 0.9895
12 1/12/75 A 0.03 0.04 1.0106
13 1/1/75 B 0.09 0.06 NA
14 1/2/75 B 0.07 0.05 NA
15 1/3/75 B 0.04 0.01 0.9286
16 1/4/75 B -0.02 -0.04 1.0484
17 1/5/75 B 0.06 0.05 0.9913
18 1/6/75 B 0.08 0.04 0.9932
19 1/7/75 B 0.10 0.08 0.9807
20 1/8/75 B 0.02 0.00 1.0046
21 1/9/75 B -0.01 -0.01 1.1496
22 1/10/75 B 0.01 -0.01 1.1613
23 1/11/75 B -0.01 0.02 1.0559
24 1/12/75 B 0.07 0.04 1.0426
Is there a way to do this in R?
Using df from the Note at the end, create a slope function and use rollapplyr to run it on a moving window. partial = 3 tells it to use partial windows at the beginning of at least 3 rows.
library(dplyr)
library(zoo)
slope <- function(m) {
ret <- m[, 1]
mret <- m[, 2]
cov(ret, mret) / var(mret)
}
df %>%
group_by(comp) %>%
mutate(beta = rollapplyr(cbind(ret, mret), 5, slope, partial = 3, fill = NA,
by.column = FALSE)) %>%
ungroup()
giving:
# A tibble: 24 x 5
date comp ret mret beta
<chr> <chr> <dbl> <dbl> <dbl>
1 1/1/75 A 0.07 0.06 NA
2 1/2/75 A 0.04 0.05 NA
3 1/3/75 A 0.01 0.01 1.07
4 1/4/75 A -0.05 -0.04 1.11
5 1/5/75 A 0.05 0.05 1.11
6 1/6/75 A 0.04 0.04 1.06
7 1/7/75 A 0.07 0.08 1.02
8 1/8/75 A 0.01 0 0.984
9 1/9/75 A -0.02 -0.01 0.931
10 1/10/75 A -0.03 -0.01 1.02
# ... with 14 more rows
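For readers without zoo, the same window logic can be sketched in base R. The helper roll_beta and its arguments width and min_obs are illustrative names, not part of any package. It uses the last width rows, returns NA until min_obs rows are available, and computes the slope as cov(ret, mret) / var(mret), the same ratio that lm() reports as the regression coefficient.

```r
# Rolling OLS slope of ret on mret over the last `width` rows,
# NA until at least `min_obs` rows are available.
# roll_beta, width and min_obs are illustrative names, not zoo API.
roll_beta <- function(ret, mret, width = 5, min_obs = 3) {
  vapply(seq_along(ret), function(i) {
    idx <- max(1, i - width + 1):i
    if (length(idx) < min_obs) return(NA_real_)
    cov(ret[idx], mret[idx]) / var(mret[idx])
  }, numeric(1))
}

# First six months of company A from the question
ret  <- c(0.07, 0.04, 0.01, -0.05, 0.05, 0.04)
mret <- c(0.06, 0.05, 0.01, -0.04, 0.05, 0.04)
beta <- roll_beta(ret, mret)
round(beta, 4)

# The covariance/variance ratio equals the slope reported by lm()
slope_lm <- unname(coef(lm(ret[1:3] ~ mret[1:3]))[2])
all.equal(beta[3], slope_lm)
```

To get one series per company, the helper would have to be applied within each comp group, for example inside a group_by(comp) %>% mutate(...) call.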
Note
Input in reproducible form:
Lines <- "date comp ret mret
1 1/1/75 A 0.07 0.06
2 1/2/75 A 0.04 0.05
3 1/3/75 A 0.01 0.01
4 1/4/75 A -0.05 -0.04
5 1/5/75 A 0.05 0.05
6 1/6/75 A 0.04 0.04
7 1/7/75 A 0.07 0.08
8 1/8/75 A 0.01 0.00
9 1/9/75 A -0.02 -0.01
10 1/10/75 A -0.03 -0.01
11 1/11/75 A 0.01 0.02
12 1/12/75 A 0.03 0.04
13 1/1/75 B 0.09 0.06
14 1/2/75 B 0.07 0.05
15 1/3/75 B 0.04 0.01
16 1/4/75 B -0.02 -0.04
17 1/5/75 B 0.06 0.05
18 1/6/75 B 0.08 0.04
19 1/7/75 B 0.10 0.08
20 1/8/75 B 0.02 0.00
21 1/9/75 B -0.01 -0.01
22 1/10/75 B 0.01 -0.01
23 1/11/75 B -0.01 0.02
24 1/12/75 B 0.07 0.04"
df <- read.table(text = Lines)

How to create a new column in a data frame depending on multiple criteria from multiple columns from the same data frame

I have a data frame df1 with four variables. One refers to sunlight, the second one refers to the moon-phase light (light due to the moon's phase), the third one to the moon-position light (light from the moon depending on if it is in the sky or not) and the fourth refers to the clarity of the sky (opposite to cloudiness).
I call them SL, MPhL, MPL and SC respectively. I want to create a new column referred to "global light" that during the day depends only on SL and during the night depends on the other three columns ("MPhL", "MPL" and "SC"). What I want is that at night (when SL == 0), the light in a specific area is equal to the product of the columns "MPhL", "MPL" and "SC". If any of them is 0, then, the light at night would be 0 also.
Since I work with a matrix of hundreds of thousands of rows, what would be the best way to do it? As an example of what I have:
SL<- c(0.82,0.00,0.24,0.00,0.98,0.24,0.00,0.00)
MPhL<- c(0.95,0.85,0.65,0.35,0.15,0.00,0.87,0.74)
MPL<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
SC<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
df<-data.frame(SL,MPhL,MPL,SC)
df
SL MPhL MPL SC
1 0.82 0.95 0.00 0.00
2 0.00 0.85 0.50 0.50
3 0.24 0.65 0.10 0.10
4 0.00 0.35 0.89 0.89
5 0.98 0.15 0.33 0.33
6 0.24 0.00 0.58 0.58
7 0.00 0.87 0.00 0.00
8 0.00 0.74 0.46 0.46
What I would like to get is this:
df
SL MPhL MPL SC GL
1 0.82 0.95 0.00 0.00 0.82 # When "SL">0, GL= SL
2 0.00 0.85 0.50 0.50 0.21 # When "SL" is 0, GL = MPhL*MPL*SC
3 0.24 0.65 0.10 0.10 0.24
4 0.00 0.35 0.89 0.89 0.28
5 0.98 0.15 0.33 0.33 0.98
6 0.24 0.00 0.58 0.58 0.24
7 0.00 0.87 0.00 0.00 0.00
8 0.00 0.74 0.46 0.46 0.16
The simplest way would be to use the ifelse function:
df$GL <- with(df, ifelse(SL == 0, MPhL * MPL * SC, SL))
If you want to work in a more structured environment, I can recommend the dplyr package:
library(dplyr)
tibble(SL = SL, MPhL = MPhL, MPL = MPL, SC = SC) %>%
mutate(GL = if_else(SL == 0, MPhL * MPL * SC, SL))
# A tibble: 8 x 5
SL MPhL MPL SC GL
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.82 0.95 0.00 0.00 0.820000
2 0.00 0.85 0.50 0.50 0.212500
3 0.24 0.65 0.10 0.10 0.240000
4 0.00 0.35 0.89 0.89 0.277235
5 0.98 0.15 0.33 0.33 0.980000
6 0.24 0.00 0.58 0.58 0.240000
7 0.00 0.87 0.00 0.00 0.000000
8 0.00 0.74 0.46 0.46 0.156584
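A base R alternative, sketched under the assumption that night is coded as exactly SL == 0: start from the daytime value and overwrite only the night rows by logical indexing, so the product is evaluated for those rows alone. The helper vector night is a name made up for this example.

```r
SL   <- c(0.82, 0.00, 0.24, 0.00, 0.98, 0.24, 0.00, 0.00)
MPhL <- c(0.95, 0.85, 0.65, 0.35, 0.15, 0.00, 0.87, 0.74)
MPL  <- c(0.00, 0.50, 0.10, 0.89, 0.33, 0.58, 0.00, 0.46)
SC   <- c(0.00, 0.50, 0.10, 0.89, 0.33, 0.58, 0.00, 0.46)
df <- data.frame(SL, MPhL, MPL, SC)

# Start from the daytime value, then overwrite only the night rows
night <- df$SL == 0
df$GL <- df$SL
df$GL[night] <- with(df[night, ], MPhL * MPL * SC)
df
```

If any of the three night factors is 0, the product is 0 as well, so the "no light at night" case needs no separate branch.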

R :How to execute FOR loop for Kmeans

I have an input file with Format as below :
RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29
I have to execute Kmeans in R separately for the DF with Key=1, for the DF with Key=2, and so on.
Afterwards the final output CSV should look like this:
RN KEY MET1 MET2 MET3 MET4 CLST
1 1 0.11 0.41 0.91 0.17 1
2 1 0.94 0.02 0.17 0.84 1
3 1 0.56 0.64 0.46 0.77 2
4 1 0.57 0.23 0.81 0.09 2
5 2 0.82 0.67 0.39 0.63 1
6 2 0.99 0.90 0.34 0.84 2
7 2 0.83 0.01 0.70 0.29 2
I.e. the rows with Key=1 are to be treated as one DF, the rows with Key=2 as another DF, and so on.
Finally, the clustering output of each DF is to be combined with its Key column (Key itself must not take part in the clustering), and the per-key DFs are then combined into the final output.
In the above example :
DF1 is
KEY MET1 MET2 MET3 MET4
1 0.11 0.41 0.91 0.17
1 0.94 0.02 0.17 0.84
1 0.56 0.64 0.46 0.77
1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
2 0.82 0.67 0.39 0.63
2 0.99 0.90 0.34 0.84
2 0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudo code:
n<-Length(unique(Mydf$key))
for i=1 to n
{
#fetch partial df for each value of Key and run K means
dummydf<-subset(mydf,mydf$key=i
KmeansIns<-Kmeans(dummydf,2)
# combine with cluster result
dummydf<-data.frame(dummydf,KmeansIns$cluster)
# combine each smalldf into final Global DF
finaldf<-data.frame(finaldf,dummydf)
}Next i
#Now we have finaldf then it can be written to file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
# your clustering code here
})
where x is a subset of your DF for each KEY.
A solution using data.table:
library(data.table)
setDT(DF)[, CLST := kmeans(.SD, centers = 2)$cluster, by = KEY, .SDcols = 3:6]
DF
# RN KEY MET1 MET2 MET3 MET4 CLST
# 1: 1 1 0.11 0.41 0.91 0.17 2
# 2: 2 1 0.94 0.02 0.17 0.84 1
# 3: 3 1 0.56 0.64 0.46 0.70 1
# 4: 4 1 0.57 0.23 0.81 0.09 2
# 5: 5 2 0.82 0.67 0.39 0.63 2
# 6: 6 2 0.99 0.90 0.34 0.84 2
# 7: 7 2 0.83 0.01 0.70 0.29 1
#Read data
mdf <- read.table("mydat.txt", header=T)
#Convert to list based on KEY column
mls <- split(mdf, f=mdf$KEY)
#Define columns to use in clustering
myv <- paste0("MET", 1:4)
#Cluster each df item in list : modify kmeans() args as appropriate
kls <- lapply(X=mls, FUN=function(x){x$clust <- kmeans(x[, myv],
centers=2)$cluster ; return(x)})
#Make final "global" df
finaldf <- do.call(rbind, kls)
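For completeness, the explicit for loop from the question's pseudocode could look roughly like the sketch below. It assumes two clusters per key, as in the pseudocode, and uses the sample input from the question. The names met_cols and pieces, the seed, and the output file name clustered.csv are made up for this example. Because kmeans() starts from random centers, the cluster labels can differ between runs.

```r
DF <- read.table(text = "RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29", header = TRUE)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable
met_cols <- paste0("MET", 1:4)  # KEY and RN stay out of the clustering
pieces <- list()

for (k in unique(DF$KEY)) {
  # Fetch the partial data frame for this key and cluster its MET columns
  part <- DF[DF$KEY == k, ]
  part$CLST <- kmeans(part[, met_cols], centers = 2)$cluster
  pieces[[as.character(k)]] <- part
}

# Combine the per-key pieces into the final data frame and write it out
finaldf <- do.call(rbind, pieces)
rownames(finaldf) <- NULL
write.csv(finaldf, file.path(tempdir(), "clustered.csv"), row.names = FALSE)
```

Collecting the pieces in a list and binding them once at the end avoids growing finaldf inside the loop.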
