R: keep only the last submission for each id?

My data looks something like this:
DF<- data.frame( id=c("A1","A2","A3","A1"), submission=c(1,1,1,2))
What is the best way of keeping only the last submission for each id? That is:
DF<- data.frame( id=c("A2","A3","A1"), submission=c(1,1,2))
Thanks!

Here are a few options in base R:
DF[!duplicated(DF$id, fromLast=TRUE),]
# id submission
# 2 A2 1
# 3 A3 1
# 4 A1 2
do.call(rbind, by(DF, DF$id, FUN=tail, 1))
# id submission
# A1 A1 2
# A2 A2 1
# A3 A3 1
aggregate(submission ~ id, DF, tail, 1)
# id submission
# 1 A1 2
# 2 A2 1
# 3 A3 1
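All three options above take the last row per id, so they rely on the rows already being in submission order. If that is not guaranteed, one possible sketch is to sort first. This assumes "last" means the highest submission number; the names sorted and last_sub are only illustrative.

```r
# Example data where the highest submission for A1 is NOT the last row
DF <- data.frame(id = c("A1", "A2", "A3", "A1"), submission = c(2, 1, 1, 1))

# Sort by id, then submission, so the highest submission is the last row per id
sorted <- DF[order(DF$id, DF$submission), ]

# Keep the last row of each id
last_sub <- sorted[!duplicated(sorted$id, fromLast = TRUE), ]
last_sub
```

Without the sort, the same call would keep submission 1 for A1 here, because that row comes last in the data.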

Related

replace array values according to values in the first index in the first dimension

I have an array with a few dimensions. I want to replace values according to the values in the first index of the first dimension. In the example below, I want to change all values whose corresponding entry in a1 equals 2. If I change only one index:
set.seed(2)
arr <- array(data=sample(1:2, 18, replace = TRUE), dim=c(3,3,2), dimnames=list(paste0("a",1:3),paste0("b",1:3),paste0("c",1:2)))
# replace second index according to first index of dimension 1
arr[2,,][arr[1,,]==2] <- NA
The result is as expected:
> arr
, , c1
b1 b2 b3
a1 1 2 1
a2 1 NA 1
a3 2 2 1
, , c2
b1 b2 b3
a1 2 2 1
a2 NA NA 2
a3 1 1 2
But if I try to change all other indexes like this:
set.seed(2)
arr <- array(data=sample(1:2, 18, replace = TRUE), dim=c(3,3,2), dimnames=list(paste0("a",1:3),paste0("b",1:3),paste0("c",1:2)))
# replace 2nd & 3rd index according to first index of dimension 1
arr[2:3,,][arr[1,,]==2] <- NA
It doesn't work as I expect. Indexing in an array is difficult to understand. How do I do it correctly (naturally, without changing each index separately)? Thanks.
I expect the result to be:
> arr
, , c1
b1 b2 b3
a1 1 2 1
a2 1 NA 1
a3 2 NA 1
, , c2
b1 b2 b3
a1 2 2 1
a2 NA NA 2
a3 NA NA 2
You can use rep to get the right indices for subsetting.
arr[2:3,,][rep(arr[1,,]==2, each=2)] <- NA
arr
#, , c1
#
# b1 b2 b3
#a1 1 2 1
#a2 1 NA 1
#a3 2 NA 1
#
#, , c2
#
# b1 b2 b3
#a1 2 2 1
#a2 NA NA 2
#a3 NA NA 2
Or more generally.
i <- 2:dim(arr)[1]
arr[i,,][rep(arr[1,,]==2, each=length(i))] <- NA
Or (thanks to @jblood94 for this variant)
arr[-1,,][rep(arr[1,,]==2, each = nrow(arr) - 1)] <- NA
Or using a loop.
for(i in 2:nrow(arr)) arr[i,,][arr[1,,]==2] <- NA
It would be
arr[2:3,,][rep(arr[1,,]==2, each = 2)] <- NA
Or, more generally, to replace all rows based on the first row:
arr[-1,,][rep(arr[1,,]==2, each = nrow(arr) - 1)] <- NA
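As a rough check of why each= is needed: R stores arrays in column-major order, so in arr[-1,,] the values of rows 2 and 3 for the same (b, c) cell sit next to each other. Each entry of the mask arr[1,,]==2 therefore has to be repeated once per remaining row. The sketch below compares the rep() variant with the loop; by_rep and by_loop are just illustrative names.

```r
set.seed(2)
arr <- array(sample(1:2, 18, replace = TRUE), dim = c(3, 3, 2))

# Variant 1: repeat each mask entry once per remaining row
by_rep <- arr
by_rep[-1, , ][rep(by_rep[1, , ] == 2, each = nrow(by_rep) - 1)] <- NA

# Variant 2: apply the unrepeated mask to one row at a time
by_loop <- arr
for (i in 2:nrow(by_loop)) by_loop[i, , ][by_loop[1, , ] == 2] <- NA

identical(by_rep, by_loop)
```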

Use the levels of a dataframe column to add a new column with an incrementing number unique to each level

I'm trying to create a new column in a dataframe that contains an incrementing number based on the levels of a different column. That is, I want to rename the levels of a column so that each level has a unique, incrementing number.
df <- data.frame(y1 = c(100, 100, 100, 200, 200, 500, 500, 500),
y2 = c(6, 5, 4, 2, 5, 4, 3, 2))
df$y1 <- as.factor(df$y1)
levels(df$y1) ## [1] "100" "200" "500"
Expected output: a new y3 column with new level names based on the levels of y1. The "b" isn't necessary, I can add that on later.
y1 y2 y3
100 6 b1
100 5 b1
100 4 b1
200 2 b2
200 5 b2
500 4 b3
500 3 b3
500 2 b3
I've messed around with lapply and various for loops, but I don't really know what I'm doing here... stuff like this:
for (i in levels(df$y1)){
batchnum <- 1
if (i == df$y1){
df$y3 <- paste0("b", batchnum)
batchnum <- batchnum + 1
}
}
This just labels y3 with "b1" for each row, I guess because if is not vectorized or something?
## Warning messages:
1: In if (i == df$y1) { :
the condition has length > 1 and only the first element will be used
Using data.table:
library(data.table)
setDT(df)
df[, y3 := .GRP, by = y1]
df[, y3 := paste0("b", y3)] # you can change "b" with whatever you want
y1 y2 y3
1: 100 6 b1
2: 100 5 b1
3: 100 4 b1
4: 200 2 b2
5: 200 5 b2
6: 500 4 b3
7: 500 3 b3
8: 500 2 b3
The most direct and simple approach (taking advantage of the fact that as.numeric will generate numbers corresponding to the factor levels):
df$y3 <- paste0('b', as.numeric(df$y1))
If it's not clear why this works, look at the following code on its own:
as.numeric(df$y1)
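A small sketch of what that call returns, using a shortened copy of the column. The point is that as.numeric on a factor gives the position of each value in levels(), not the original number. The names codes, values and labels are only illustrative.

```r
y1 <- factor(c(100, 100, 200, 500))

codes  <- as.numeric(y1)                # position of each value in levels(y1)
values <- as.numeric(as.character(y1))  # the original numbers, via the level labels
labels <- paste0("b", codes)

codes
values
labels
```

Because the codes follow the order of levels(y1), the numbering is by sorted level, not by order of appearance in the data.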
A dplyr approach:
require(dplyr);
df %>% mutate(y3 = paste0("b", as.numeric(y1)));
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
Or you can also do:
df %>% mutate(y3 = paste0("b", cumsum(!duplicated(y1))));
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
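One caveat with cumsum(!duplicated(...)), shown as a hedged sketch: it only gives one number per level when equal values sit in consecutive rows, as they do in the question. If a value can reappear later, match() against the unique values is one possible alternative. The vector y is made-up example data.

```r
y <- c(500, 100, 500, 200)           # 500 appears again after 100

by_cumsum <- cumsum(!duplicated(y))  # counter only increases at first occurrences
by_match  <- match(y, unique(y))     # id = position of the value's first appearance

by_cumsum
by_match
```

Here the second 500 ends up with a different number than the first one under cumsum, while match gives both the same id.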
Here's one way:
x <- c(100,100,100,200,200,500,500,500)
paste0("b",rep(seq_along(table(x)),table(x)))
[1] "b1" "b1" "b1" "b2" "b2" "b3" "b3" "b3"
You can use the group_indices function from dplyr to create the new column:
library(dplyr)
df %>% mutate(y3 = paste0("b", group_indices(.,y1)))
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3

Dependents and Precedents in R

Need help in flagging number of Dependents and Precedents in R. My data frame contains some formulas (strings) and I want to add "col3" which should contain: 0 for A1, 1 for A2 (Because A2 is dependent on A1 - One dependency) and 2 for A3 (Because A3 is dependent on A2/A1).
col1 <- c('A1','A2','A3', 'A6','A4','A7')
col2 <- c('X1+Y1','A1+Y2', 'A4+Y3+A2', 'Y5+A1','A2+A1+A3','A2+A1')
df <- data.frame(col1, col2, stringsAsFactors=F)
My Output should look like:
col1 col2 col3
1 A1 Y1 0
2 A2 A1+Y2 1
3 A3 A4+Y3+A2 5
4 A6 Y5+A1 1
5 A4 A2+A1+Y3 3
6 A7 A2+A1 3
I have a data frame with 100+ rows of this format. Appreciate if you could help with this.
The code below produces the correct output.
col0 <- c('A1','A2','A3', 'A6','A4','A7')
col2 <- c('X1+Y1','A1+Y2', 'A1+Y3+A2', 'Y5+A2','A2+A1+A3','A2+A3')
df <- data.frame(col0, col2, stringsAsFactors=F)
library(tidyr)
library(dplyr)
df1 <- df %>%
separate(col2, into = as.character(c(1:4)),sep = "\\+") %>%
replace(is.na(.),"")
df1$OOE <- 0
for (i in 1:nrow(df1)) {
for (j in 2:ncol(df1)) {
for (k in 1:nrow(df1)) {
if (df1[i,j] == df1$col0[k]) df1$OOE[i]=df1$OOE[k]+df1$OOE[i]+1
}
}
}
col0 1 2 3 4 OOE
1 A1 X1 Y1 0
2 A2 A1 Y2 1
3 A3 A1 Y3 A2 3
4 A6 Y5 A2 2
5 A4 A2 A1 A3 7
6 A7 A2 A3 6
If AX can have a dependency on AY where Y>X, we need a tree-like structure to find the dependencies. I knew about the igraph package, but it seems too complex for the task. We just need some reference semantics, and after some research the data.tree package seems appropriate. Here is the code:
col1 <- c('A1','A2','A3', 'A6','A4','A7')
col2 <- c('X1+Y1','A1+Y2', 'A1+Y3+A2', 'Y5+A2','A2+A1+A3','A2+A3')
df <- data.frame(col1, col2, stringsAsFactors=F)
require(data.tree)
# Create the graph/forest based on the data
getForest <- function(data) {
  res <- new.env()
  for (i in seq_len(nrow(data))) {
    nname <- data$col1[i]
    # inherits = FALSE: only look inside res, not in enclosing environments
    if (!exists(nname, where = res, inherits = FALSE))
      assign(nname, Node$new(nname), pos = res)
    par <- get(nname, envir = res)
    # Add the children: every A<number> token found in the formula
    deps <- unlist(regmatches(data$col2[i], gregexpr("A\\d+", data$col2[i])))
    for (ch in deps) {
      if (!exists(ch, where = res, inherits = FALSE))
        assign(ch, Node$new(ch), pos = res)
      child <- get(ch, envir = res)
      par$AddChildNode(child)
    }
  }
  # Return the environment holding the nodes
  res
}
f <- getForest(df)
# Function to get the dependency level
getLevel <- function(node) {
  if (node$count == 0)
    return(0)
  else {
    # R is case-sensitive: the recursive call must be getLevel, not getlevel
    return(length(node$children) + sum(sapply(node$children, getLevel)))
  }
}
#Add dependency level to data frame
df$col3 <- sapply(df$col1, function(x) {getLevel(get(x,f))})
df
# col1 col2 col3
#1 A1 X1+Y1 0
#2 A2 A1+Y2 1
#3 A3 A1+Y3+A2 3
#4 A6 Y5+A2 2
#5 A4 A2+A1+A3 7
#6 A7 A2+A3 6
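For comparison, the same counting rule (number of direct A-dependencies plus the counts of those dependencies) can be sketched in base R with a named list instead of a tree. This is only an illustration under the assumption that there are no circular references, since a cycle would recurse forever. The names deps and count_deps are hypothetical helpers, not part of any package.

```r
col1 <- c("A1", "A2", "A3", "A6", "A4", "A7")
col2 <- c("X1+Y1", "A1+Y2", "A1+Y3+A2", "Y5+A2", "A2+A1+A3", "A2+A3")

# Direct dependencies: every A<number> token in each formula, keyed by name
deps <- setNames(regmatches(col2, gregexpr("A\\d+", col2)), col1)

# Count direct dependencies plus, recursively, the dependencies of each one
# (assumes the references contain no cycles)
count_deps <- function(name) {
  children <- deps[[name]]
  if (length(children) == 0) return(0)
  length(children) + sum(vapply(children, count_deps, numeric(1)))
}

col3 <- unname(vapply(col1, count_deps, numeric(1)))
data.frame(col1, col2, col3)
```

For the data used in this answer, it gives the same col3 values as the data.tree version.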

cbind with partially nested list

I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1
I haven't tested for more general cases, but this works for the OP example:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option (which is probably faster, but is more verbose):
rbindlist(lapply(nested_list$ABAB, as.data.table),
idcol = 'id')[, id := nested_list$id[id]]
This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
as.list(nested_list[[1]]),
lapply(unlist(nested_list[-1], recursive=FALSE),
function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes a list of two elements (each containing the A and B variables) extracted with unlist and recursive=FALSE. It returns a list of character matrices with the B variable filled in by recycling. A list of the individual id variables from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame, gives the columns the desired names, and returns a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results to a single data.frame.
The final Reduce(rbind, could be replaced by data.table's rbindlist if desired.
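A shorter base R sketch of the same Map-then-rbind idea, assuming the shorter elements of each sublist (here B) should simply be recycled against the longer ones. The names pieces and result are only illustrative.

```r
nested_list <- list(
  id   = c(1, 2),
  ABAB = list(list(A = c("A1", "A2", "A3"), B = "B1"),
              list(A = c("A1", "A2", "A3"), B = "B1"))
)

# One data.frame per id; as.data.frame() recycles B to the length of A,
# and data.frame() recycles the single id value across those rows
pieces <- Map(function(id, ab) {
  data.frame(id = id, as.data.frame(ab, stringsAsFactors = FALSE))
}, nested_list$id, nested_list$ABAB)

# Stack the per-id pieces
result <- do.call(rbind, pieces)
result
```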
Here's another hideous solution
max_length = max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
do.call(rbind, lapply(nested_list$ABAB, function(x)
do.call(cbind, lapply(x, function(y) {
if(length(y) < max_length) {
rep(y, max_length)
} else {
y
}
})))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1
And one more, also inelegant, but I'd gone too far by the time I saw the other answers.
restructure <- function(nested_l) {
ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x){
lapply(x, length)
}))))
temp = data.frame(rep(nested_l$id, each = ids),
sapply(1:length(nested_l$id), function(x){
out <-unlist(lapply(nested_l[[2]], function(y){
return(y[x])
}))
}))
names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
Joining the party:
library(tidyverse)
temp <- map(nested_list,~map(.x,~expand.grid(.x)))
df <- map_df(1:2,~cbind(temp$id[[.x]],temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1

expand.grid with separate variable for each column

I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))
I guess you could create your own helper function, something like this
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
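If the column names should be i1, i2, i3 as in the question rather than Var1, Var2, Var3, one possible variation is to name the list before passing it to expand.grid. This is a sketch; make_grid and choices are hypothetical names, and stringsAsFactors = FALSE is an assumption that plain character columns are wanted.

```r
make_grid <- function(n) {
  # One c("A<k>", "No") pair per column
  choices <- lapply(paste0("A", seq_len(n)), c, "No")
  # Naming the list elements sets the column names of the result
  names(choices) <- paste0("i", seq_len(n))
  expand.grid(choices, stringsAsFactors = FALSE)
}

grid <- make_grid(3)
grid
```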
Alternatively, you could also try data.table's CJ equivalent, which should be much more efficient than expand.grid for a big n
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
Another option is using Map with expand.grid
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)),'No')))
Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A', 'A',rep('No',4)), V2 = c('A','A','No','No','A','A','No','No'), V3 = c('A','No','A','No','A','No','A','No'), stringsAsFactors = FALSE)
to get the row and col indices of the elements we need to change:
rindex <- which(df != 'No') %% nrow(df)
cindex <- ceiling(which(df != 'No')/nrow(df))
the solution is basically a one-liner:
df[matrix(c(rindex,cindex),ncol=2)] <- paste0(df[matrix(c(rindex,cindex),ncol=2)],cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No
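One thing to watch with the modulo step above: a linear index that falls in the last row gives 0 under %% nrow(df). That does not happen in this example because the last row is all "No", but it would as soon as the last row contains a value to change. As a hedged alternative, which() with arr.ind = TRUE returns the row and column pairs directly. The small df below is made-up data chosen to hit that case.

```r
df <- data.frame(V1 = c("A", "No"), V2 = c("No", "A"), stringsAsFactors = FALSE)

# Two-column matrix of (row, col) positions of the cells to change
idx <- which(df != "No", arr.ind = TRUE)

# Append the column number to each selected cell
df[idx] <- paste0(df[idx], idx[, "col"])
df
```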
