How to create a function which loops through column index numbers in R? - r

Consider the following data frame (df):
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1"
a 2 6 0 x 2 19 12
a 4 2 7 x 3.5 7 11
b 1 9 4 x 7 1.5 4
b 7 5 11 x 8 12 5
I would like to create a new column which shows the sum of the product between two corresponding columns. To write less code I address the columns by their index number. Unfortunately I have no experience in writing functions, so I ended up doing this manually, which is extremely tedious and not very elegant.
Here a reproducible example of the data frame and what I have tried so far:
id <- c("a","a","b","b")
df <- data.frame(id)
df$a1 <- as.numeric((c(2,4,1,7)))
df$b1 <- as.numeric((c(6,2,9,5)))
df$c1 <- as.numeric((c(0,7,4,11)))
df$not_relevant <- c("x","x","x","x")
df$p_a1 <- as.numeric((c(2,3.5,7,8)))
df$p_b1 <- as.numeric((c(19,7,1.5,12)))
df$p_c1 <- as.numeric((c(12,11,4,5)))
require(dplyr)
df %>% mutate(total = .[[2]]*.[[6]] + .[[3]] *.[[7]]+ .[[4]] *.[[8]])
This leads to the desired result, but as I mentioned is not very efficient:
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1" "total"
a 2 6 0 x 2 19 12 118.0
a 4 2 7 x 3.5 7 11 105.0
b 1 9 4 x 7 1.5 4 36.5
b 7 5 11 x 8 12 5 171.0
The real data I am working with has much more columns, so I would be glad if someone could show me a way to pack this operation into a function which loops through the column index numbers and matches the correct columns to each other.

Column indices are not a good way to do this. (Not a good way in general...)
Here's a simple dplyr method that assumes the columns are in the correct corresponding order (that is, it will give the wrong result if the "x1", "x2", "x3" is in a different order than "p_x3", "p_x2", "p_x1"). You may also need to refine the selection criteria for your real data:
df$total = rowSums(select(df, starts_with("x")) * select(df, starts_with("p_")))
df
# id x1 x2 x3 not_relevant p_x1 p_x2 p_x3 total
# 1 a 2 6 0 x 2.0 19.0 12 118.0
# 2 a 4 2 7 x 3.5 7.0 11 105.0
# 3 b 1 9 4 x 7.0 1.5 4 36.5
# 4 b 7 5 11 x 8.0 12.0 5 171.0
The other good option would be to convert your data to a long format, where you have a single x column and a single p column, with an "index" column indicating the 1, 2, 3. Then the operation could be done by group, finally moving back to a wide format.

Related

Assign a sequence of numbers to data frame rows

I am trying to assign a sequence of numbers to rows in a data frame based off of the row position. I have 330 rows and I want each set of nine rows to be named one through nine in a new ID column. For example, I want rows 1-9 labeled 1-9, rows 10-18 labeled 1-9, rows 19-27 labeled 1-9 and so on.
I have tried to use this code:
test <- temp.df %>% mutate(id = seq(from = 1, to = 330, along.width= 9))
test
but it just ends up just creating a new column that labels the rows 1-330 as shown below.
Time Temperature ID
09:36:52 25.4 1
09:36:59 25.4 2
09:37:07 25.4 3
09:37:14 25.4 4
09:37:21 25.4 5
09:37:29 25.4 6
09:37:36 25.4 7
09:37:43 25.5 8
09:37:51 25.5 9
09:37:58 25.5 10
What is the best way to accomplish my goal?
I think if you could provide a snippet of the data.frame temp.df, it would be easier to help you out. Maybe the following line could help you by adding it to your data.frame, however, it is not a very flexible solution, but it is based on the information you provided.
n_repeated <- 9 #block of ID
N_rows <- 330 #number of observations
df <- data.frame(id = rep(seq(1,n_repeated ),N_rows ))
head(df,n = 15)
#> head(df,n = 15)
# id
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# 6 6
# 7 7
# 8 8
# 9 9
# 10 1
# 11 2
# 12 3
# 13 4
# 14 5
# 15 6
[Edited]
using mutate from dplyr this line should do it:
test <- temp.df %>% mutate(id = rep(seq(1,9), nrow(temp.df)))

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dyplr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works but only if i need to run this once on the last observation and only if I apply it to 1 group. I have 209 groups and many days that I would need to run this over. I am not sure how to put together the first for loop with the division function and I also am totally lost as to how to do this by group and move to the next last values. Any suggestions would be appreciated.
You can modify your division function to handle vector and return a dataframe with two columns divres and ind the latter is the row index that will be used to calculate obs_n as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of division function created above using data$Num as input
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with dataframe data1, use pipes and mutate from dplyr to lookup the obs_n value in Day using ind, select appropriate columns to generate the desired dataframe data2:
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA, NA, 1 2 2 3 4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0

multiple condition then creating new column

I have a dataset with two columns, I need to create a third column carries conditions on first one and second one.
set.seed(1)
x1=(sample(1:10, 100,replace=T))
y1=sample(seq(1,10,0.1),100,replace=T)
z=cbind(x1,y1)
unique(as.data.frame(z)$x1)
z%>%as.data.frame()%>%dplyr::filter(x1==3)
table(x1)
1 2 3 4 5 6 7 8 9 10
7 6 11 14 14 5 11 15 11 6
> z%>%as.data.frame()%>%dplyr::filter(x1==3)
x1 y1
1 3 6.9
2 3 9.5
3 3 10.0
4 3 5.6
5 3 4.1
6 3 2.5
7 3 5.3
8 3 9.5
9 3 5.5
10 3 8.9
11 3 1.2
for example when I filter x==3 then y1 values can be seen, I need to write 1 on 11th row, rest will be 0. I need to find a minimum in that column. My original dataset has 43545 rows but only 638 unique numbers like x1. table x1 shows that 1 repeated 7 times but in my dataset some have a frequency of 1 some have frequency of 100. I should use case_when but how can I check every y1 to find the smallest to put 1?
If I understand correctly, you are looking for the row with minimal y1 value for each value of x1
library(tidyverse)
z %>% as.data.frame() %>%
group_by(x1) %>%
arrange(y1) %>% # sort values by increasing order within each group
mutate(flag = ifelse(row_number()==1,1,0)) %>% # create flag for first row in group
ungroup()

Build a data frame with overlapping observations

Lets say I have a data frame with the following structure:
> DF <- data.frame(x=1:5, y=6:10)
> DF
x y
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I need to build a new data frame with overlapping observations from the first data frame to be used as an input for building the A matrix for the Rglpk optimization library. I would use n-length observation windows, so that if n=2 the resulting data frame would join rows 1&2, 2&3, 3&4, and so on. The length of the resulting data frame would be
(numberOfObservations-windowSize+1)*windowSize
The result for this example with windowSize=2 would be a structure like
x y
1 1 6
2 2 7
3 2 7
4 3 8
5 3 8
6 4 9
7 4 9
8 5 10
I could do a loop like
DFResult <- NULL
numBlocks <- nrow(DF)-windowSize+1
for (i in 1:numBlocks) {
DFResult <- rbind(DFResult, DF[i:(i+horizon-1), ])
}
But this seems vey inefficient, especially for very large data frames.
I also tried
rollapply(data=DF, width=windowSize, FUN=function(x) x, by.column=FALSE, by=1)
x y
[1,] 1 6
[2,] 2 7
[3,] 2 7
[4,] 3 8
where I was trying to repeat a block of rows without applying any aggregate function. This does not work since I am missing some rows
I am a bit stumped by this and have looked around for similar problems but could not find any. Does anyone have any better ideas?
We could do a vectorized approach
i1 <- seq_len(nrow(DF))
res <- DF[c(rbind(i1[-length(i1)], i1[-1])),]
row.names(res) <- NULL
res
# x y
#1 1 6
#2 2 7
#3 2 7
#4 3 8
#5 3 8
#6 4 9
#7 4 9
#8 5 10

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources