Create multiple variables in a loop - r

Is there a way to create multiple variables in a loop? For example, if I have a variable called 'test', among others, in my data frame, how can I create a series of new variables called, say, 'test1', 'test2', ... 'testn' that are defined as test^1, test^2, ... test^n?
As an example
mynum <- 1:10
myletters <- letters[1:10]
mydf <- data.frame(mynum, myletters)
mydf
mynum myletters
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
for (i in 1:5)
{paste0(var, i) <- mynum^i
}
But it errors out.
I am trying to create variables like var1, var2, var3 etc which are mynum^1, mynum^2, mynum^3 etc.
Best regards
Deepak

You can use lapply to create new columns and combine them using do.call + cbind.
n <- 1:5
mydf[paste0('var', n)] <- do.call(cbind, lapply(n, function(x) mydf$mynum^x))
mydf
# mynum myletters var1 var2 var3 var4 var5
#1 1 a 1 1 1 1 1
#2 2 b 2 4 8 16 32
#3 3 c 3 9 27 81 243
#4 4 d 4 16 64 256 1024
#5 5 e 5 25 125 625 3125
#6 6 f 6 36 216 1296 7776
#7 7 g 7 49 343 2401 16807
#8 8 h 8 64 512 4096 32768
#9 9 i 9 81 729 6561 59049
#10 10 j 10 100 1000 10000 100000
Or with purrr's map_dfc
mydf[paste0('var', n)] <- purrr::map_dfc(n, ~mydf$mynum^.x)
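Since every new column is just mynum raised to a different power, the whole block can also be built in one vectorised call. A minimal sketch using base R's outer, assuming the same mydf and the same var1 ... var5 naming as above:

```r
mynum <- 1:10
myletters <- letters[1:10]
mydf <- data.frame(mynum, myletters)

n <- 1:5
# outer() returns a 10 x 5 matrix whose [i, j] entry is mynum[i]^n[j]
powers <- outer(mydf$mynum, n, `^`)
# each matrix column becomes one new data frame column
mydf[paste0('var', n)] <- powers
head(mydf, 3)
```

This avoids the explicit lapply + cbind step, at the cost of building the full matrix in memory first.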

Try this. Because mydf already has two columns, the new variables have to go in positions 3 onward, which is why I use i+2 in the loop. Here is the code:
#Data
mynum <- 1:10
myletters <- letters[1:10]
mydf <- data.frame(mynum, myletters,stringsAsFactors = F)
The loop:
#Loop
for (i in 1:5)
{
mydf[,i+2] <- mydf[,'mynum']^i
names(mydf)[i+2] <- paste0('var',i)
}
Output:
mynum myletters var1 var2 var3 var4 var5
1 1 a 1 1 1 1 1
2 2 b 2 4 8 16 32
3 3 c 3 9 27 81 243
4 4 d 4 16 64 256 1024
5 5 e 5 25 125 625 3125
6 6 f 6 36 216 1296 7776
7 7 g 7 49 343 2401 16807
8 8 h 8 64 512 4096 32768
9 9 i 9 81 729 6561 59049
10 10 j 10 100 1000 10000 100000
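The loop in the question fails because R cannot assign to the result of paste0(); the left-hand side of <- has to be a name or an indexing expression. A sketch of a loop that builds the column name as a string and assigns by name with [[ ]], so no position arithmetic such as i+2 is needed (same example data assumed):

```r
mydf <- data.frame(mynum = 1:10, myletters = letters[1:10])

for (i in 1:5) {
  # build the column name as a string and assign the column by name
  mydf[[paste0('var', i)]] <- mydf$mynum^i
}
names(mydf)
```

Assigning by name keeps working if mydf gains or loses other columns, whereas the i+2 offset depends on there being exactly two columns beforehand.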

An option with map
library(dplyr)
library(purrr)
library(stringr) # provides str_replace
map_dfc(1:5, ~ mydf$mynum^.x) %>%
rename_all(~ str_replace(., '\\.+', 'var')) %>%
bind_cols(mydf, .)

Related

Creating Kronecker product in R

Let the following be the dataset:
What I need to do is to create new columns wherein I need to multiply all a columns with b columns and name the newly created column as
a1_b1, a1_b2........ a1_b4, a2_b1, a2_b2 as shown in the figure.
I am using R for data analysis. Even though I have stated only two columns by two columns, in reality, it is 1600 by 25. Hence the question.
This might be fast enough:
set.seed(42)
DF <- data.frame(a1 = sample(1:10),
a2 = sample(1:10),
b1 = sample(1:10),
b2 = sample(1:10))
a <- grep("a", names(DF))
b <- grep("b", names(DF))
combs <- expand.grid(a, b)
res <- do.call(mapply, c(list(FUN = \(...) do.call(`*`, DF[, c(...)])), combs))
colnames(res) <- paste(names(DF)[combs[[1]]], names(DF)[combs[[2]]], sep = "_")
cbind(DF, res)
# a1 a2 b1 b2 a1_b1 a2_b1 a1_b2 a2_b2
#1 1 8 9 3 9 72 3 24
#2 5 7 10 1 50 70 5 7
#3 10 4 3 2 30 12 20 8
#4 8 1 4 6 32 4 48 6
#5 2 5 5 10 10 25 20 50
#6 4 10 6 8 24 60 32 80
#7 6 2 1 4 6 2 24 8
#8 9 6 2 5 18 12 45 30
#9 7 9 8 7 56 72 49 63
#10 3 3 7 9 21 21 27 27
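With many columns it may be worth multiplying whole blocks at once instead of one column pair at a time. The following is only a sketch on small made-up data (the values and the helper names a_cols, b_cols and B are assumptions, not from the question): it takes the b columns as a matrix and multiplies that matrix by each a column in turn, which also yields the a1_b1, a1_b2, ... ordering asked for.

```r
# Small made-up data; column names follow the a*/b* pattern from the question
DF <- data.frame(a1 = 1:3, a2 = 4:6, b1 = c(10, 20, 30), b2 = c(2, 2, 2))
a_cols <- grep("^a", names(DF), value = TRUE)
b_cols <- grep("^b", names(DF), value = TRUE)
B <- as.matrix(DF[b_cols])

res <- do.call(cbind, lapply(a_cols, function(a) {
  out <- B * DF[[a]]   # the a column is recycled down each column of B
  colnames(out) <- paste(a, b_cols, sep = "_")
  out
}))
cbind(DF, res)
```

Whether this is actually faster on 1600 by 25 columns has not been measured here; it simply replaces one call per column pair with one call per a column.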
The operation in the question is the transpose of the Khatri-Rao product. We use the Matrix package, which comes with R, so it does not have to be installed. Using the input in the Note at the end, pick out the two portions, transpose them, apply KhatriRao and transpose back, giving a sparse matrix (class "dgCMatrix"). We can use as.matrix to convert to a dense matrix as shown, or as.data.frame(as.matrix(...)) to convert to a data.frame.
library(Matrix)
rownames(dat) <- 1:nrow(dat)
ix <- grep("a", colnames(dat))
as.matrix(t(KhatriRao(t(dat[, -ix]), t(dat[, ix]), make.dimnames = TRUE)))
giving:
a1:b1 a2:b1 a1:b2 a2:b2
1 101 838.3 108.3 898.89
2 204 1050.6 220.6 1136.09
3 309 1957.0 357.0 2261.00
4 416 1664.0 464.0 1856.00
5 525 1638.0 578.0 1803.36
6 749 2118.6 838.6 2372.04
Note
dat <- setNames(cbind(BOD, BOD + 100), c("a1", "a2", "b1", "b2"))
dat
giving
a1 a2 b1 b2
1 1 8.3 101 108.3
2 2 10.3 102 110.3
3 3 19.0 103 119.0
4 4 16.0 104 116.0
5 5 15.6 105 115.6
6 7 19.8 107 119.8

logical operator TRUE/FALSE in R

I wrote a simple function that produces all combinations of its input. Here the input is a sequence of 4 coordinate pairs (x, y), stored inside the function as a, b, c, and d.
intervals<-function(x1,y1,x2,y2,x3,y3,x4,y4){
a<-c(x1,y1)
b<-c(x2,y2)
c<-c(x3,y3)
d<-c(x4,y4)
union<-expand.grid(a,b,c,d)
union
}
intervals(2,10,3,90,6,50,82,7)
Var1 Var2 Var3 Var4
1 2 3 6 82
2 10 3 6 82
3 2 90 6 82
4 10 90 6 82
5 2 3 50 82
6 10 3 50 82
7 2 90 50 82
8 10 90 50 82
9 2 3 6 7
10 10 3 6 7
11 2 90 6 7
12 10 90 6 7
13 2 3 50 7
14 10 3 50 7
15 2 90 50 7
16 10 90 50 7
Now I want to find (max of x) and (min of y) for each row of the given output. E.g. row 2: we have 4 values (10, 3, 6, 82). Here (3,6,82) are from x (x2,x3,x4) and 10 is basically from y (y1). Thus max of x is 82, and the min of y is 10.
So what I want is two values from each row.
I do not actually know how to approach this kind of logical command. Any idea or suggestions?
You can pass x and y vector separately to the function. Use expand.grid to create all combinations of the vector and get max of x and min of y from each row.
intervals<-function(x, y){
tmp <- do.call(expand.grid, rbind.data.frame(x, y))
names(tmp) <- paste0('col', seq_along(tmp))
result <- t(apply(tmp, 1, function(p) {
suppressWarnings(c(max(p[p %in% x]), min(p[p %in% y])))
}))
result[is.infinite(result)] <- NA
result <- as.data.frame(result)
names(result) <- c('max_x', 'min_y')
result
}
intervals(c(2,3,6,82), c(10, 90, 50, 7))
# max_x min_y
#1 82 NA
#2 82 10
#3 82 90
#4 82 10
#5 82 50
#6 82 10
#7 82 50
#8 82 10
#9 6 7
#10 6 7
#11 6 7
#12 6 7
#13 3 7
#14 3 7
#15 2 7
#16 NA 7
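Note that the function above decides whether a value came from x or y with %in%, which could misclassify values if the same number appears in both vectors. A sketch of a position-based variant, assuming the same x/y inputs (the function name intervals_by_position and the pick_y matrix are made up for illustration): each row of pick_y records, per coordinate, whether the y value (TRUE) or the x value (FALSE) was chosen.

```r
intervals_by_position <- function(x, y) {
  # one TRUE/FALSE choice per coordinate: TRUE = take y, FALSE = take x
  pick_y <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(x))))
  # max over the x values that were picked; NA when the row uses only y values
  max_x <- apply(pick_y, 1, function(p) if (all(p)) NA else max(x[!p]))
  # min over the y values that were picked; NA when the row uses only x values
  min_y <- apply(pick_y, 1, function(p) if (!any(p)) NA else min(y[p]))
  data.frame(max_x, min_y)
}

res <- intervals_by_position(c(2, 3, 6, 82), c(10, 90, 50, 7))
res
```

The rows come out in the same order as the expand.grid output in the question, because the first coordinate varies fastest in both.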

dplyr:: create new column with order number of another column

First, apologies if this question was asked somewhere else, but I couldn't find an answer.
In R, I have a two-column data.frame with ID and Score values.
library(dplyr)
library(magrittr)
set.seed(1235) # for reproducible example
data.frame(ID = LETTERS[1:16],
Score = round(rnorm(n=16,mean = 1200, sd = 5 ), 0),
stringsAsFactors = F) -> tmp
head(tmp)
# ID Score
# 1 A 1203
# 2 B 1198
# 3 C 1197
# 4 D 1202
# 5 E 1200
# 6 F 1190
I want to create a new column called Position with numbers from 1 to nrow(tmp) corresponding to the decreasing order of the Score column.
I can do that in base R with:
tmp[order(tmp$Score, decreasing = T), "Position"] <- 1:nrow(tmp)
head(tmp[order(tmp$Position), ])
# ID Score Position
# 1 A 1211 1
# 8 H 1210 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 16 P 1202 6
But I was wondering if there's a more elegant way to do it that abides by tidyverse principles?
Like I tried this but it doesn't work and I can't understand why...
tmp %>%
mutate(Position = order(Score, decreasing = T)) %>%
arrange(Position) %>%
head()
# ID Score Position
# 1 A 1211 1
# 2 L 1200 2
# 3 C 1209 3
# 4 D 1205 4
# 5 E 1202 5
# 6 G 1188 6
Here the ordering clearly didn't work.
Thanks!
We can use row_number
library(dplyr)
tmp %>%
mutate(Position2 = row_number(-Score))
-output
# ID Score Position Position2
#1 A 1197 12 12
#2 B 1194 16 16
#3 C 1205 3 3
#4 D 1201 8 8
#5 E 1201 9 9
#6 F 1208 1 1
#7 G 1200 10 10
#8 H 1203 5 5
#9 I 1207 2 2
#10 J 1202 6 6
#11 K 1195 15 15
#12 L 1205 4 4
#13 M 1196 13 13
#14 N 1198 11 11
#15 O 1196 14 14
#16 P 1202 7 7
where 'Position' is the column created with order() by the OP's base R code.
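As for why mutate(Position = order(...)) did not work: order() returns the row indices that would sort the vector, not the rank of each element. The base R code in the question uses those indices on the left-hand side of the assignment, which is what turns them into positions. A small sketch with made-up scores to show the difference:

```r
score <- c(50, 90, 70, 80)

# indices of the elements in decreasing order: "element 2 comes first, then 4, ..."
ord <- order(score, decreasing = TRUE)

# position of each element in that ordering: applying order() twice gives ranks
pos <- order(ord)

# the same positions via rank(), breaking ties by first appearance
pos_rank <- rank(-score, ties.method = "first")

ord       # 2 4 3 1
pos       # 4 1 3 2
```

Here pos says that 50 is in 4th place and 90 in 1st place, which is the kind of value the Position column is meant to hold.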
Similar to your order logic we can arrange the data in decreasing order and create position column which goes from 1 to number of rows in the data.
library(dplyr)
tmp %>%
arrange(desc(Score)) %>%
mutate(position = 1:n())
# ID Score position
#1 F 1208 1
#2 I 1207 2
#3 C 1205 3
#4 L 1205 4
#5 H 1203 5
#6 J 1202 6
#7 P 1202 7
#8 D 1201 8
#9 E 1201 9
#10 G 1200 10
#11 N 1198 11
#12 A 1197 12
#13 M 1196 13
#14 O 1196 14
#15 K 1195 15
#16 B 1194 16

Rearranging order for a pair in R

I have a column with 10 random numbers. From that, I want to create a new column in which the values of every pair have switched places; see the example below for what I mean. How would you do that?
column newcolumn
1 5
5 1
7 6
6 7
25 67
67 25
-10 2
2 -10
-50 36
36 -50
Taking advantage of the fact that R will replicate smaller vectors when adding them to larger vectors, you can:
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- a$column[seq(nrow(a)) + c(1, -1)]
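The index arithmetic in that line can be inspected on its own. A short sketch on a six-element vector (values made up) showing how c(1, -1) is recycled so that odd positions point one step forward and even positions one step back:

```r
x <- c(1, 5, 7, 6, 25, 67)

# c(1, -1) is recycled to 1, -1, 1, -1, 1, -1 and added to 1:6
idx <- seq_along(x) + c(1, -1)
idx        # 2 1 4 3 6 5

swapped <- x[idx]
swapped    # 5 1 6 7 67 25
```

This relies on the vector having an even length; with an odd length the last index would point past the end and produce an NA.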
Something like this.
a <- data.frame(column=c(1,5,7,6,25,67,-10,2,50,36))
a$newColumn <- 0
a[seq(1,nrow(a),by=2),"newColumn"]<-a[seq(2,nrow(a),by=2),"column"]
a[seq(2,nrow(a),by=2),"newColumn"]<-a[seq(1,nrow(a),by=2),"column"]
# results
column newColumn
1 1 5
2 5 1
3 7 6
4 6 7
5 25 67
6 67 25
7 -10 2
8 2 -10
9 50 36
10 36 50
Here is a base R one-liner: We can cast column as 2 x nrow(df)/2 matrix, swap rows, and recast as vector.
df$newcolumn <- c(matrix(df$column, ncol = nrow(df) / 2)[c(2,1), ]);
# column newcolumn
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
Sample data
df <- read.table(text =
"column
1
5
7
6
25
67
-10
2
-50
36", header = T)
Another option would be to use ave and rev
transform(df, newCol = ave(x = df$column, rep(1:5, each = 2), FUN = rev))
# column newCol
#1 1 5
#2 5 1
#3 7 6
#4 6 7
#5 25 67
#6 67 25
#7 -10 2
#8 2 -10
#9 -50 36
#10 36 -50
The part rep(1:5, each = 2) creates a grouping variable ("pairs") for each of which we reverse the elements.
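The grouping variable does not have to be hard-coded to five pairs. A sketch that derives it from the length of the input, assuming an even number of elements (the vector x and the name pair_id are made up for illustration):

```r
x <- c(1, 5, 7, 6)

# 1 1 2 2: one group id per consecutive pair
pair_id <- rep(seq_len(length(x) / 2), each = 2)

# reverse the elements within each pair
swapped <- ave(x, pair_id, FUN = rev)
swapped    # 5 1 6 7
```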
Here's a compact way:
a$new_col <- c(matrix(a$column,2)[2:1,])
# column new_col
# 1 1 5
# 2 5 1
# 3 7 6
# 4 6 7
# 5 25 67
# 6 67 25
# 7 -10 2
# 8 2 -10
# 9 50 36
# 10 36 50
The idea is to write in a 2 row matrix, switch the rows, and unfold back in a vector.

R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement letters[1:20]
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have around 2 million rows, you should give a data.table solution a try:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
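The merge above still builds every id match before filtering. data.table also supports non-equi joins, which apply the interval conditions during the join itself. The following is a sketch on small made-up data, not a benchmarked solution: the values are invented, and the x. and i. prefixes are used in j to pick the columns of OBS and SAM explicitly.

```r
library(data.table)

# Small made-up data with the same column layout as in the question
OBS <- data.table(id = c(1, 1, 1, 2, 2),
                  time = c(1, 3, 8, 3, 5),
                  measurement = c(0.5, 1.5, 2.5, 3.5, 4.5))
SAM <- data.table(location = c("a", "e", "b"),
                  id = c(1, 1, 2),
                  time1 = c(2, 7, 2),
                  time2 = c(4, 9, 4))

# For each row of SAM, find rows of OBS with the same id and time1 < time < time2
res <- OBS[SAM,
           on = .(id, time > time1, time < time2),
           .(id, time = x.time, measurement = x.measurement, location = i.location),
           nomatch = NULL]
res
```

Left-hand names in on refer to OBS and right-hand names to SAM; nomatch = NULL drops intervals that contain no observation.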
