Say I have a data frame with m variables. How can I generate all polynomial terms of those variables up to degree n? For example, df is a data frame with 2 variables, a and b:
df <- data.frame(a=c(1,2), b=c(3,4))
I want to add all terms up to degree 3, which means adding these generated columns to df:
a^2, a*b, b^2, a^3, a^2*b, b^2*a, b^3
How can I do this?
Use polym:
df <- data.frame(a=c(1,2), b=c(3,4))
# a b
#1 1 3
#2 2 4
res <- do.call(polym, c(df, degree=3, raw=TRUE))
# 1.0 2.0 3.0 0.1 1.1 2.1 0.2 1.2 0.3
#[1,] 1 1 1 3 3 3 9 9 27
#[2,] 2 4 8 4 8 16 16 32 64
#attr(,"degree")
#[1] 1 2 3 1 2 3 2 3 3
Edit:
Here is a possibility to create the desired column names:
colnames(res) <- apply(
do.call(rbind,
strsplit(colnames(res), ".", fixed=TRUE)),
1,
function(x) paste(rep(names(df), as.integer(x)), collapse="")
)
# a aa aaa b ab aab bb abb bbb
#[1,] 1 1 1 3 3 3 9 9 27
#[2,] 2 4 8 4 8 16 16 32 64
#attr(,"degree")
#[1] 1 2 3 1 2 3 2 3 3
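The same exponent labels can also be turned into names that mirror the notation in the question (a^2*b rather than aab). This is only a sketch: term_name is a helper made up for illustration, and it assumes polym keeps labelling its columns with dot-separated exponents as in the output above.

```r
df <- data.frame(a = c(1, 2), b = c(3, 4))
res <- do.call(polym, c(df, degree = 3, raw = TRUE))

# One row per generated column, one exponent per original variable
expo <- do.call(rbind, strsplit(colnames(res), ".", fixed = TRUE))
storage.mode(expo) <- "integer"

# Hypothetical helper: c(2, 1) becomes "a^2*b", c(0, 1) becomes "b"
term_name <- function(e) {
  keep <- e > 0
  vars <- names(df)[keep]
  paste(ifelse(e[keep] == 1, vars, paste0(vars, "^", e[keep])), collapse = "*")
}

# Plain numeric matrix with readable column names
m <- matrix(as.vector(res), nrow = nrow(res),
            dimnames = list(NULL, apply(expo, 1, term_name)))
m
```

From there, cbind(df, m[, setdiff(colnames(m), names(df))]) should append only the new terms to df.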
I have a data set with 5 varieties (var) and 3 variables (x, y, z). I need to rank the varieties on each of the 3 variables. When ranks are tied, rank leaves a gap before the next rank, so I cannot get consecutive ranks. Here is my data:
x<-c(3,3,4,5,5)
y<-c(5,6,4,4,5)
z<-c(2,3,4,3,5)
df<-cbind(x,y,z)
rownames(df) <- paste0("G", 1:nrow(df))
df <- data.frame(var = row.names(df), df)
I tried the following code for my result
res <- sapply(df, rank,ties.method='min')
res
var x y z
[1,] 1 1 3 1
[2,] 2 1 5 2
[3,] 3 3 1 4
[4,] 4 4 1 2
[5,] 5 4 3 5
For x I got the ranks 1 1 3 4 4 instead of 1 1 2 3 3. The same thing happens for y and z.
My desired result is
>res
var x y z
[1,] 1 1 2 1
[2,] 2 1 3 2
[3,] 3 2 1 3
[4,] 4 3 1 2
[5,] 5 3 2 4
I will be grateful if anyone helps me.
Well, an easy way would be to convert to factor and then integer
df[] <- lapply(df, function(x) as.integer(factor(x)))
df
# var x y z
#G1 1 1 2 1
#G2 2 1 3 2
#G3 3 2 1 3
#G4 4 3 1 2
#G5 5 3 2 4
One dplyr possibility could be:
library(dplyr)
df %>%
  mutate_at(2:4, list(~ dense_rank(.)))
var x y z
1 G1 1 2 1
2 G2 1 3 2
3 G3 2 1 3
4 G4 3 1 2
5 G5 3 2 4
Or a base R possibility:
df[2:4] <- lapply(df[2:4], function(x) match(x, sort(unique(x))))
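The gap in the question comes from ties.method = "min", which gives tied values the lowest rank they jointly occupy and then skips ahead. The factor and match approaches instead number the distinct sorted values, which is what makes the ranks consecutive. A small base R comparison on the x column from the question:

```r
x <- c(3, 3, 4, 5, 5)

# Ties share the lowest rank, then the next rank is skipped
rank(x, ties.method = "min")
# [1] 1 1 3 4 4

# Position of each value among the sorted distinct values
match(x, sort(unique(x)))
# [1] 1 1 2 3 3

# Factor levels are the sorted distinct values, so the integer codes agree
as.integer(factor(x))
# [1] 1 1 2 3 3
```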
We can use data.table
library(data.table)
# frank() is data.table's own ranking function; dense_rank() belongs to dplyr
setDT(df)[, (2:4) := lapply(.SD, frank, ties.method = "dense"), .SDcols = 2:4]
df
# var x y z
#1: G1 1 2 1
#2: G2 1 3 2
#3: G3 2 1 3
#4: G4 3 1 2
#5: G5 3 2 4
I have a data frame of many companies (say 7) and many periods (say 2). I need to create a new column by dividing the companies within each period into a few groups (say 3). Since 7 is not exactly divisible by 3, I want to assign two rows to each of the first two groups and the one extra row to the last group. In the following table, the 'res' column is the expected result:
Company Period res
1 1 11
2 1 11
3 1 12
4 1 12
5 1 13
6 1 13
7 1 13
1 2 21
2 2 21
3 2 22
4 2 22
5 2 23
6 2 23
7 2 23
As I understand it, you want to divide each period into equal parts and put whatever is left over (if there is a remainder) into the last group. The following function does that:
f1 <- function(x, parts){
len1 <- length(x)
i1 <- len1 %% parts
v1 <- rep((len1 - i1)/parts, parts)
v1[length(v1)] <- v1[length(v1)] + i1
v2 <- rep(seq_along(v1), v1)
return(v2)
}
#Here are some trials,
f1(seq(7), 3)
#[1] 1 1 2 2 3 3 3
f1(seq(8), 3)
#[1] 1 1 2 2 3 3 3 3
f1(seq(9), 3)
#[1] 1 1 1 2 2 2 3 3 3
f1(seq(10), 3)
#[1] 1 1 1 2 2 2 3 3 3 3
Now you need to apply it in each group using the split-apply method (using data.table or dplyr will definitely speed up this process), i.e.
do.call(rbind,
lapply(split(df, df$Period), function(i) {
i$New_column <- paste0(i$Period, f1(i$Company, 3)); i}))
which gives,
Company Period New_column
1.1 1 1 11
1.2 2 1 11
1.3 3 1 12
1.4 4 1 12
1.5 5 1 13
1.6 6 1 13
1.7 7 1 13
2.8 1 2 21
2.9 2 2 21
2.10 3 2 22
2.11 4 2 22
2.12 5 2 23
2.13 6 2 23
2.14 7 2 23
NOTE: You can easily add a separator in paste0 to distinguish between 1_11 and 11_1
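To see why the separator matters: without one, different period and group combinations can collapse into the same label. A quick illustration with made-up period and group numbers:

```r
# Period 1, group 11 and period 11, group 1 become indistinguishable
paste0(1, 11)
# [1] "111"
paste0(11, 1)
# [1] "111"

# With a separator the two labels stay distinct
paste(1, 11, sep = "_")
# [1] "1_11"
paste(11, 1, sep = "_")
# [1] "11_1"
```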
Create a function of the number of companies (nc) and the number of groups (ng). For all but the last group (ng - 1), the length of each group is the quotient (nc %/% ng). For the last group, the length is the quotient plus the remainder (nc %% ng).
f <- function(nc, ng){
qu <- nc %/% ng
rep(1:ng, c(rep(qu, ng - 1), qu + nc %% ng))
}
Do this for each period:
d$res2 <- ave(d$Period, d$Period, FUN = function(x) paste0(x, "_", f(7, 3)))
d
# Company Period res res2
# 1 1 1 11 1_1
# 2 2 1 11 1_1
# 3 3 1 12 1_2
# 4 4 1 12 1_2
# 5 5 1 13 1_3
# 6 6 1 13 1_3
# 7 7 1 13 1_3
# 8 1 2 21 2_1
# 9 2 2 21 2_1
# 10 3 2 22 2_2
# 11 4 2 22 2_2
# 12 5 2 23 2_3
# 13 6 2 23 2_3
# 14 7 2 23 2_3
Here the number of companies is hard coded (7), but this could of course be calculated from your data.
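One way to avoid the hard-coded 7 is to take the group size from length(x) inside ave. This is a sketch under the assumption that d holds one row per company and period as in the question; d is rebuilt here only so the snippet runs on its own.

```r
d <- data.frame(Company = rep(1:7, 2), Period = rep(1:2, each = 7))

f <- function(nc, ng) {
  qu <- nc %/% ng
  rep(1:ng, c(rep(qu, ng - 1), qu + nc %% ng))
}

# length(x) is the number of rows in the current period
d$res2 <- ave(as.character(d$Period), d$Period,
              FUN = function(x) paste0(x, "_", f(length(x), 3)))
d
```

as.character() is used so that ave starts from a character vector, since the labels it returns are character.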
If the remainder doesn't have to be allocated to the last group, you may just use cut:
ave(d$Company, d$Period, FUN = function(x) cut(seq_along(x), 3))
Here's an example of what I'm trying to do:
df <- data.frame(
id = letters[1:5],
enum_start = c(1, 1, 1, 1, 1),
enum_end = c(1, 5, 3, 7, 2)
)
df2 <- df %>%
split(.$id) %>%
lapply(function(x) cbind(x, hello = seq(x$enum_start, x$enum_end, by = 1L))) %>%
bind_rows
df2
# id enum_start enum_end hello
# 1 a 1 1 1
# 2 b 1 5 1
# 3 b 1 5 2
# 4 b 1 5 3
# 5 b 1 5 4
# 6 b 1 5 5
# 7 c 1 3 1
# 8 c 1 3 2
# 9 c 1 3 3
# 10 d 1 7 1
# 11 d 1 7 2
# 12 d 1 7 3
# 13 d 1 7 4
# 14 d 1 7 5
# 15 d 1 7 6
# 16 d 1 7 7
# 17 e 1 2 1
# 18 e 1 2 2
Note that the starting and ending values for hello depend on the data and hence the number of rows for each id is dynamic. I'm looking for a solution that involves maybe expand from tidyr but am struggling.
Here's a dplyr/tidyr approach
library(dplyr)
library(tidyr)
group_by(df, id) %>%
  expand(enum_start, enum_end, hello = full_seq(enum_end:enum_start, 1))
Not sure if there's a tidyr-way without grouping the data (would be interesting to know)
Here is a base R method that produces the desired output.
dfNew <- within(df[rep(seq_len(nrow(df)), df$enum_end), ],
hello <- sequence(df$enum_end))
sequence takes a vector of lengths and, for each one, returns the natural numbers from 1 up to that length, restarting the count each time. It is used to produce the "hello" variable. within reduces typing and returns a modified data.frame. I fed it an expanded version of df where rows are repeated using rep and [. Note that this relies on enum_start being 1 in every row, as it is in the example.
dfNew
id enum_start enum_end hello
1 a 1 1 1
2 b 1 5 1
2.1 b 1 5 2
2.2 b 1 5 3
2.3 b 1 5 4
2.4 b 1 5 5
3 c 1 3 1
3.1 c 1 3 2
3.2 c 1 3 3
4 d 1 7 1
4.1 d 1 7 2
4.2 d 1 7 3
4.3 d 1 7 4
4.4 d 1 7 5
4.5 d 1 7 6
4.6 d 1 7 7
5 e 1 2 1
5.1 e 1 2 2
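If enum_start is not always 1, the same idea still works once the repeat counts and the offsets account for the start. A sketch with made-up start values (df here is example data, not the data from the question, and enum_end is assumed to be at least enum_start):

```r
df <- data.frame(
  id = letters[1:3],
  enum_start = c(1, 2, 4),
  enum_end = c(1, 5, 6)
)

# Number of rows each id expands to
n <- df$enum_end - df$enum_start + 1

# Repeat each row n times, then count up from that row's own start
dfNew <- df[rep(seq_len(nrow(df)), n), ]
dfNew$hello <- sequence(n) + rep(df$enum_start, n) - 1
dfNew$hello
# [1] 1 2 3 4 5 4 5 6
```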
Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears again in the (i+1)th position, the recurring row is removed?
The output should therefore look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for loop, but I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
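To unpack that one-liner: it compares A with a copy of itself shifted by one position and keeps the first row unconditionally. A sketch that spells out the intermediate steps (keep is a name chosen here for illustration):

```r
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A, B)

# TRUE where a value differs from the one just before it; row 1 is always kept
keep <- c(TRUE, data$A[-1] != data$A[-nrow(data)])
which(keep)
# [1]  1  2  5  6  7  8 10 11 12 13

data[keep, ]
```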
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(row_number() == 1 | A != lag(A))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the differences between consecutive elements (a lag of 1):
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
I have a dataset that looks like this:
a <- data.frame(rep(1,5),1:5,1:5)
b <- data.frame(rep(2,5),1:5,1:5)
colnames(a) <- c(1,2,3)
colnames(b) <- c(1,2,3)
c <- rbind(a,b)
1 2 3
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 2 1 1
7 2 2 2
8 2 3 3
9 2 4 4
10 2 5 5
but I want it to be restructured to this:
2_1 2_2 3_1 3_2
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
a <- data.frame(rep(1,5),1:5,1:5)
b <- data.frame(rep(2,5),1:5,1:5)
colnames(b) <- colnames(a) <- paste("a", c(1,2,3), sep='')
d <- rbind(a,b)
library(reshape)
recast(d, a2 ~ a1, measure.var="a3")
I changed your example slightly, since it had numbers as variable names. This is not recommended because it permits the following nonsense:
"1" <- 3
print(1)
[1] 1
print("1")
[1] "1"
print(`1`)
[1] 3
Need I say more?
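For readers without the reshape package, here is a base R sketch that aims at the layout in the question, spreading both a2 and a3 by a1. It assumes each value of a1 has the same number of rows in the same order; row is a helper column introduced here, not something from the original data.

```r
a <- data.frame(rep(1, 5), 1:5, 1:5)
b <- data.frame(rep(2, 5), 1:5, 1:5)
colnames(b) <- colnames(a) <- paste("a", c(1, 2, 3), sep = "")
d <- rbind(a, b)

# Running row number within each a1 group, used as the id
d$row <- ave(d$a1, d$a1, FUN = seq_along)

# One column per (variable, a1) pair, e.g. a2_1, a3_1, a2_2, a3_2
wide <- reshape(d, idvar = "row", timevar = "a1",
                direction = "wide", sep = "_")
wide
```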