Finding different max between specified rows of different fields - r

I'm looking to find the max values of different columns based on specified rows of each column.
My actual data frame is 50K columns and 1K+ rows so I can't use a loop without greatly increasing run time.
Data Frame:
row V1 V2 V3 V4
1 5 2 4 5
2 3 5 1 6
3 7 3 2 6
4 2 5 3 10
5 6 9 1 2
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)
output:
c(7, 5, 2, 10)

You can try mapply (though I suspect it won't speed things up much if you have a very large number of columns):
> mapply(function(x, y, z) max(x[y:z]), df[-1], beg_row, end_row)
V1 V2 V3 V4
7 5 2 10
Data
df <- structure(list(row = 1:5, V1 = c(5L, 3L, 7L, 2L, 6L),
V2 = c(2L, 5L, 3L, 5L, 9L), V3 = c(4L, 1L, 2L, 3L, 1L),
V4 = c(5L, 6L, 6L, 10L, 2L)), class = "data.frame", row.names = c(NA, -5L))
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)

An option with dplyr (df1 here is the data frame shown as df above)
library(dplyr)
df1 %>%
summarise(across(-row, ~ {
i1 <- match(cur_column(), names(df1)[-1])
max(.x[beg_row[i1]:end_row[i1]])}))
V1 V2 V3 V4
1 7 5 2 10
Or another option is to set the values outside each column's range to NA and then use colMaxs
library(matrixStats)
colMaxs(as.matrix((NA^!(row(df1[-1]) >= beg_row[col(df1[-1])] &
row(df1[-1]) <= end_row[col(df1[-1])])) * df1[-1]), na.rm = TRUE)
[1] 7 5 2 10
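The NA^!(...) expression above relies on NA^0 being 1 and NA^1 being NA, so the multiplication blanks out every cell that lies outside its column's range. A minimal base R sketch of the same masking idea, without matrixStats, might look like this. The names m and outside are only illustrative, and apply() is swapped in for colMaxs(), so it will likely be slower on very wide data.

```r
# Example data from the question
df <- data.frame(row = 1:5,
                 V1 = c(5L, 3L, 7L, 2L, 6L),
                 V2 = c(2L, 5L, 3L, 5L, 9L),
                 V3 = c(4L, 1L, 2L, 3L, 1L),
                 V4 = c(5L, 6L, 6L, 10L, 2L))
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)

# Work on the value columns only, as a matrix
m <- as.matrix(df[-1])

# TRUE for every cell whose row index falls outside its column's range
outside <- row(m) < beg_row[col(m)] | row(m) > end_row[col(m)]

# Blank out those cells, then take the column-wise max of what is left
m[outside] <- NA
res <- apply(m, 2, max, na.rm = TRUE)
res
# V1 V2 V3 V4
#  7  5  2 10
```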

The fastest approach that I have found is to use data.table and a for loop. I have tested it with a dataframe of 2K rows and 50K columns.
library(data.table)
beg_row <- sample(1:50, 49999, replace = TRUE)
end_row <- sample(100:150, 49999, replace = TRUE)
df <- matrix(sample(1:50, 50000*2000, replace = TRUE), 2000, 50000)
df <- as.data.frame(df)
dt <- setDT(df)
vmax <- rep(0, ncol(dt)-1)
for (i in 2:ncol(dt)) {
vmax[i-1] <- max(dt[[i]][beg_row[i-1]:end_row[i-1]])
}
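The loop above pulls each column out with [[ and takes the max of a row slice. The same idea can be sketched with vapply on the small example from the question, which avoids preallocating vmax by hand. This is only an illustration in base R (cols is a made-up name), not a benchmarked replacement for the data.table loop.

```r
# Example data from the question
df <- data.frame(row = 1:5,
                 V1 = c(5L, 3L, 7L, 2L, 6L),
                 V2 = c(2L, 5L, 3L, 5L, 9L),
                 V3 = c(4L, 1L, 2L, 3L, 1L),
                 V4 = c(5L, 6L, 6L, 10L, 2L))
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)

# Drop the 'row' column so column j lines up with beg_row[j] / end_row[j]
cols <- df[-1]

# One max per column, each restricted to that column's row range
vmax <- vapply(seq_along(cols), function(j) {
  as.numeric(max(cols[[j]][beg_row[j]:end_row[j]]))
}, numeric(1))
vmax
# [1]  7  5  2 10
```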
Another possible solution, based on purrr::pmap_dbl:
library(purrr)
pmap_dbl(list(beg_row, end_row, 2:ncol(df)), ~ max(df[..1:..2, ..3]))
#> [1] 7 5 2 10

Related

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In excel, I could use a vlookup inside an iferror. Is there a similar combo of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the min and max we want, we can use them directly. For example, to get rows for V1 = 1 to 10:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like:
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
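For readers coming from Excel, match() is probably the closest base R analogue of vlookup, and replacing the resulting NA values plays the role of iferror. The following is a minimal sketch under that assumption, filling V1 = 1 to 10. The name full is made up for the example, and this is an alternative to complete(), not what the answer above uses.

```r
# Data from the question: V1 is missing 4 and 7 (and 9, 10 if we want ten rows)
df <- data.frame(V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
                 V2 = c(8L, 12L, 2L, -6L, 1L, 5L))

# The complete sequence of keys we want in the result
full <- data.frame(V1 = 1:10)

# Look up each key in df (like vlookup); unmatched keys give NA
full$V2 <- df$V2[match(full$V1, df$V1)]

# Replace the misses with 0 (like iferror)
full$V2[is.na(full$V2)] <- 0
full
```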

How to create a dataframe in R with a column calculation that references its own value in the prior row?

I am trying to use R to calculate sales as a function of inventory as a function of sales. See the data snapshot below. Is there any way to calculate this?
Group, Day and Build are independent variables
Sales = lag(Sales,1) * Build
I am given this data frame:
Group <- c("A","A","A","A","A","B","B","B","B","B")
Day <- c(1,2,3,4,5,1,2,3,4,5)
Build <- c(1.5,2,.3,.5,.6,1.2,.9,1.2,1.2,.4)
Sales <- c(50000,NA,NA,NA,NA,20000,NA,NA,NA,NA)
Trying to populate this data frame:
Group <- c("A","A","A","A","A","B","B","B","B","B")
Day <- c(1,2,3,4,5,1,2,3,4,5)
Build <- c(1.5,2,.3,.5,.6,1.2,.9,1.2,1.2,.4)
Sales <- c(50000,100000,30000,15000,9000,20000,18000,21600,25920,10368)
We can also do this with accumulate from purrr
library(dplyr)
library(purrr)
df1 %>%
group_by(Group) %>%
mutate(Sales = accumulate(Build[-1], ~ .y * .x, .init = first(Sales)))
# A tibble: 10 x 4
# Groups: Group [2]
# Group Day Build Sales
# <fct> <dbl> <dbl> <dbl>
# 1 A 1 1.5 50000
# 2 A 2 2 100000
# 3 A 3 0.3 30000
# 4 A 4 0.5 15000
# 5 A 5 0.6 9000
# 6 B 1 1.2 20000
# 7 B 2 0.9 18000
# 8 B 3 1.2 21600
# 9 B 4 1.2 25920
#10 B 5 0.4 10368
Or using base R with by and Reduce
df1$Sales <- do.call(c, by(df1[3:4], df1$Group, FUN =
function(dat) Reduce(function(x, y) x * y,
dat$Build[-1], init = dat$Sales[1], accumulate = TRUE)))
data
df1 <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), Day = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5), Build = c(1.5, 2, 0.3, 0.5, 0.6,
1.2, 0.9, 1.2, 1.2, 0.4), Sales = c(50000, NA, NA, NA, NA, 20000,
NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, -10L
))
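Because each Sales value is just the previous one multiplied by Build, the recurrence unrolls to the group's first Sales value times a running product of the later Build values. A base R sketch of that observation, using cumprod instead of accumulate or Reduce, could look like this (idx is an illustrative name, and the data frame is rebuilt with plain character groups for brevity):

```r
df1 <- data.frame(
  Group = rep(c("A", "B"), each = 5),
  Day   = rep(1:5, times = 2),
  Build = c(1.5, 2, 0.3, 0.5, 0.6, 1.2, 0.9, 1.2, 1.2, 0.4),
  Sales = c(50000, NA, NA, NA, NA, 20000, NA, NA, NA, NA)
)

# Row indices of each group, in their original order
idx <- split(seq_len(nrow(df1)), df1$Group)

for (i in idx) {
  # First Sales value of the group times the cumulative product of Build,
  # skipping the first Build value, which the recurrence never uses
  df1$Sales[i] <- df1$Sales[i[1]] * cumprod(c(1, df1$Build[i[-1]]))
}
df1$Sales
```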

Multiple uses of setdiff() on consecutive groups without for looping

I would like to apply setdiff between consecutive groups without a for loop, if possible with data.table or a function from the apply family.
Dataframe df :
id group
1 L1 1
2 L2 1
3 L1 2
4 L3 2
5 L4 2
6 L3 3
7 L5 3
8 L6 3
9 L1 4
10 L4 4
11 L2 5
I want to know how many new ids there are between consecutive groups. For example, if we compare groups 1 and 2, there are two new ids, L3 and L4, so it returns 2 (not with setdiff directly but with length()). If we compare groups 2 and 3, L5 and L6 are the new ids, so it returns 2, and so on.
Expected results :
new_id
2
2
2
1
Data :
structure(list(id = structure(c(1L, 2L, 1L, 3L, 4L, 3L, 5L, 6L,
1L, 4L, 2L), .Label = c("L1", "L2", "L3", "L4", "L5", "L6"), class = "factor"),
group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)), class = "data.frame", row.names = c(NA,
-11L), .Names = c("id", "group"))
Here is an option with mapply:
lst <- with(df, split(id, group))
mapply(function(x, y) length(setdiff(y, x)), head(lst, -1), tail(lst, -1))
#1 2 3 4
#2 2 2 1
Here is a data.table way with merge. Suppose the original data.frame is named dt:
library(data.table)
setDT(dt)
dt2 <- copy(dt)[, group := group + 1]
merge(
dt, dt2, by = 'group', allow.cartesian = TRUE
)[, .(n = length(setdiff(id.x, id.y))), by = group]
# group n
# 1: 2 2
# 2: 3 2
# 3: 4 2
# 4: 5 1
You could use Reduce to run a comparison function on pairwise elements in a list. For example
xx <- Reduce(function(a, b) {
# ids in the current group that were not in the previous one
x <- setdiff(b$id, a$id)
list(id = b$id, new = x, newcount = length(x))
}, split(df, df$group),
accumulate = TRUE)[-1]
Then you can get the counts of new elements out with
sapply(xx, '[[', "newcount")
and you can get the new values with
sapply(xx, '[[', "new")
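Another option is to split the data and compare each group with the one before it:
L = split(d, d$group) #Split data ('d') by group and create a list
#use sapply to access 'id' for each sub group in the list and count the setdiff against the previous group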
sapply(2:length(L), function(i)
setNames(length(setdiff(L[[i]][,1], L[[i-1]][,1])),
nm = paste(names(L)[i], names(L)[i-1], sep = "-")))
#2-1 3-2 4-3 5-4
# 2 2 2 1
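If the goal is to avoid any explicit iteration over groups, one possible sketch is to build a group-by-id presence matrix and compare each row with the row before it. This is an alternative to the setdiff-based answers above, and it assumes that "consecutive" means adjacent groups in sorted order. The names present and is_new are illustrative.

```r
df <- data.frame(
  id    = c("L1", "L2", "L1", "L3", "L4", "L3", "L5", "L6", "L1", "L4", "L2"),
  group = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5)
)

# Logical matrix: one row per group, one column per id, TRUE if the id occurs
present <- unclass(table(df$group, df$id)) > 0
n <- nrow(present)

# An id is new in group g if it is present in g but absent from group g - 1
is_new <- present[-1, , drop = FALSE] & !present[-n, , drop = FALSE]

# Count the new ids per group (names are the group labels 2..5)
new_id <- rowSums(is_new)
new_id
# 2 3 4 5
# 2 2 2 1
```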

Sort data based on conditions

I have a data frame (x) in R with 5 numeric columns. In addition, I am given the sorting order to follow, in the form of a vector, i.e.
1, 0, 2, 4, 3
dataset
v1 v2 v3 v4 v5
1 2 3 4 5
3 13 12 1 4
6 4 6 5 3
Expected result
v1 v2 v3 v4 v5
3 13 12 1 4
1 2 3 4 5
6 4 6 5 3
This vector defines the sorting order: the first column needs to be sorted first, then the 3rd column, then the 5th column, and then the 4th column. Manually it can be done as
x = x[order(x[[1]]), ]
x = x[order(x[[3]]), ]
x = x[order(x[[5]]), ]
x = x[order(x[[4]]), ]
rownames(x) = NULL
The problem is that this is easy for 5 columns but becomes complicated for hundreds of columns.
Any lead on this will be appreciated.
Thanks
We can do a match on the original vector and then use a for loop to get the output
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1!=0]
for(i in i1){
x <- x[order(x[[i]]), ]
}
x
# v1 v2 v3 v4 v5
# 2 3 13 12 1 4
# 1 1 2 3 4 5
# 3 6 4 6 5 3
data
x <- structure(list(v1 = c(1L, 3L, 6L), v2 = c(2L, 13L, 4L), v3 = c(3L,
12L, 6L), v4 = c(4L, 1L, 5L), v5 = c(5L, 4L, 3L)), .Names = c("v1",
"v2", "v3", "v4", "v5"), class = "data.frame", row.names = c(NA,
-3L))
vec <- c(1, 0, 2, 4, 3)
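Since order() is stable, sorting repeatedly by one column at a time ends up equivalent to a single sort in which the last column sorted is the primary key. Under that assumption, the loop could be collapsed into one order() call by passing the columns in reverse. This is a sketch only; keys and sorted are illustrative names.

```r
x <- data.frame(v1 = c(1L, 3L, 6L), v2 = c(2L, 13L, 4L), v3 = c(3L, 12L, 6L),
                v4 = c(4L, 1L, 5L), v5 = c(5L, 4L, 3L))
vec <- c(1, 0, 2, 4, 3)

# Column indices in the order they should be sorted (columns with 0 are skipped)
i1 <- match(seq_along(x), vec, nomatch = 0)
i1 <- i1[i1 != 0]

# Last-sorted column becomes the primary key, so reverse the sequence;
# unname() keeps column names from clashing with order()'s own arguments
keys <- unname(as.list(x[rev(i1)]))
sorted <- x[do.call(order, keys), ]
rownames(sorted) <- NULL
sorted
#   v1 v2 v3 v4 v5
# 1  3 13 12  1  4
# 2  1  2  3  4  5
# 3  6  4  6  5  3
```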

Manipulating a Data frame in R

I am new to R. I have a data frame
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
I have a big data file
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
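Note that melt() and stack() return the values column by column (A, B, C, A, B, C, ...), whereas the desired output in the question is grouped by the first column. A small base R sketch that reorders the stacked result accordingly is shown below. The name long is illustrative, and the sketch relies on order() keeping ties in their original order.

```r
mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
                   V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
                   V5 = c(6L, 6L, 5L))

# Stack the value columns: values come out column by column
st <- stack(mydf[-1])

# Repeat the labels once per stacked column so they line up with the values
long <- data.frame(V1 = rep(mydf$V1, times = ncol(mydf) - 1),
                   value = st$values)

# Group the rows by label; ties keep their column order (V2, V3, V4, V5)
long <- long[order(long$V1), ]
rownames(long) <- NULL
long
```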
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat each value in the first column (c - 1) times, where c is the number of columns (length(data[1,]))
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
If you could use a matrix you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3;
data <- matrix(1:9, ncol = n);
data <- t(t(as.vector(data)));
rownames(data) <- rep(letters[1:3], each = n);
If you want to keep the row names from your first data frame, this is of course also possible without libraries.
n <- 3;
data <- matrix(1:9, ncol=n);
names <- rownames(data);
data <- t(t(as.vector(data)))
rownames(data) <- rep(names, each = n)
