Can you please suggest how to implement the following in R.
I have a table as given below.
ID object value
1 a 3
2 a 2
3 b 3
4 a 1
5 a 2
6 b 2
7 a 1
8 b 1
I would like to get the following table
ID object values
1 a 3, 2, 1
2 a 2, 1
4 a 1
5 a 2, 1
7 a 1
3 b 3, 2, 1
6 b 2,1
8 b 1
In other words, for each object each row value is appended with the next observed values till the value reaches 1.
Thanks a lot for helping.
Bikas
It is not altogether clear whether the data will always be ordered decreasing by value, or whether the values should be output in decreasing order.
In any event, I would use the data.table package. Assuming your table is a data.frame, df, I would do the following:
library(data.table)
setDT(df)
df[value >= 1][order(value, decreasing = TRUE),
   .(values = paste(value, collapse = ", ")),
   by = .(ID, value)]
What this is doing is:
initializing your data.frame as a data.table
keeping only rows with value >= 1
ordering the data by value, decreasing
grouping by ID and value
pasting the values together, separated by ", "
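For reference, here is a minimal sketch (not the code above) of a data.table approach that produces the table the question asks for, assuming each row's string should stop at the first value of 1 within its object group; the next answer below follows the same idea.
library(data.table)
dt <- data.table(ID = 1:8,
                 object = c("a", "a", "b", "a", "a", "b", "a", "b"),
                 value  = c(3, 2, 3, 1, 2, 2, 1, 1))
# for each row, paste the values from that row down to the first 1 in the group
dt[, values := sapply(seq_len(.N), function(i) {
     v <- value[i:.N]
     paste(v[seq_len(which(v == 1)[1])], collapse = ", ")
   }), by = object]
dt[order(object, ID)]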
Using a modified dataset with the 2nd row's value changed to 4:
# split by object, then for each row paste the values from that row
# down to the first value of 1 within the group
res <- unsplit(lapply(split(df, df$object), function(x) {
  x$value <- sapply(seq_len(nrow(x)), function(i) {
    i1 <- i:nrow(x)                           # positions from row i to the end of the group
    indx <- which(x$value[i1] == 1)[1]        # first occurrence of 1 from row i onwards
    paste(x$value[i1[seq(indx)]], collapse = ",")
  })
  x}),
  df$object)
res[order(res$object), ]
# ID object value
#1 1 a 3,4,1
#2 2 a 4,1
#4 4 a 1
#5 5 a 2,1
#7 7 a 1
#3 3 b 3,2,1
#6 6 b 2,1
#8 8 b 1
Also, using data.table
library(data.table)
setDT(df)[, N := 1:.N, by = object][,          # row counter within each object
  values := unlist(lapply(N, function(i) {
    val <- value[i:.N]                          # values from row i to the end of the group
    paste(val[1:which(val == 1)[1]], collapse = ",")
  })), keyby = object][, -(3:4), with = FALSE]  # drop the helper columns 'value' and 'N'
# ID object values
#1: 1 a 3,4,1
#2: 2 a 4,1
#3: 4 a 1
#4: 5 a 2,1
#5: 7 a 1
#6: 3 b 3,2,1
#7: 6 b 2,1
#8: 8 b 1
Update
If you need the sequence up to the minimum value, you could replace which(x$value[i1]==1)[1] with which(x$value[i1]==min(x$value))[1]. For example, using the first code as a function:
f1 <- function(dat){
  lst <- split(dat, dat$object)
  lst2 <- lapply(lst, function(x) {
    x$value <- sapply(seq_len(nrow(x)), function(i) {
      i1 <- i:nrow(x)
      indx <- which(x$value[i1] == min(x$value))[1]
      paste(x$value[i1[seq(indx)]], collapse = ",")
    })
    x})
  res <- unsplit(lst2, dat$object)
  res[order(res$object), ]
}
f1(df)
# ID object value
#1 1 a 3,4,1
#2 2 a 4,1
#4 4 a 1
#5 5 a 2,1
#7 7 a 1
#3 3 b 3,2,1
#6 6 b 2,1
#8 8 b 1
If I change all the 1 values to 2
df$value[df$value==1] <- 2
f1(df)
# ID object value
#1 1 a 3,4,2
#2 2 a 4,2
#4 4 a 2
#5 5 a 2
#7 7 a 2
#3 3 b 3,2
#6 6 b 2
#8 8 b 2
data
df <- structure(list(ID = 1:8, object = c("a", "a", "b", "a", "a",
"b", "a", "b"), value = c(3L, 4L, 3L, 1L, 2L, 2L, 1L, 1L)), .Names = c("ID",
"object", "value"), class = "data.frame", row.names = c(NA, -8L
))
Related
This question is a follow-up on my previous question. In this question, after my split.default() call below, I get a named list of data.frames called L.
Qs: I was wondering how I could condense each data.frame in L whose columns all consist of a constant number? (And how about if I know the names of the data.frames whose columns are constant numbers?)
My desired output is shown further below.
r <- list(
data.frame(study.name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
A = rep(1, 6),
B = rep(4, 6)),
data.frame(study.name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
G = c(1,1,1,2,NA,NA),
A = rep(3, 6),
B = rep(7, 6)))
DATA <- do.call(cbind, r)
nm1 <- Reduce(intersect, lapply(r, colnames))[-1]
L <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
Desired output:
# $A
# A A.1
# 1 1 3
# $B
# B B.1
# 1 4 7
# $X
# X X.1
# 1 2 1
# 2 2 NA
# 3 1 3
# 4 1 1
# 5 NA NA
# 6 NA NA
Assuming that the NA rows should be preserved, loop over the list and apply duplicated; also keep a row if all of its elements are NA:
lapply(L, function(x) x[(rowSums(is.na(x)) == ncol(x))|!duplicated(x),])
#$A
# A A.1
#1 1 3
#$B
# B B.1
#1 4 7
#$X
# X X.1
#1 2 1
#2 2 NA
#3 1 3
#4 1 1
#5 NA NA
#6 NA NA
If we also need a check for constant values:
is_constant <- function(x) length(unique(x)) == 1L
lapply(L, function(x) if(all(sapply(x, is_constant))) x[1,, drop = FALSE] else x)
#$A
# A A.1
#1 1 3
#$B
# B B.1
#1 4 7
#$X
# X X.1
#1 2 1
#2 2 NA
#3 1 3
#4 1 1
#5 NA NA
#6 NA NA
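To address the parenthetical part of the question, here is a minimal sketch (not part of the answer above) assuming the names of the constant data.frames are known in advance, taken here to be "A" and "B" for illustration: condense only those list elements and leave the rest untouched.
known <- c("A", "B")   # assumed names of the constant data.frames
L[known] <- lapply(L[known], function(x) x[1, , drop = FALSE])
L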
I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
I would like to extract the previous non-NA value (or the next one, if the NA is in the first row) for every row, and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want:
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which takes the "last observation carried forward". Since the top row is a special case, we do it twice, the second time in reverse. This gives us the nearest non-NA value in the column (up or down, depending).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
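As a side note, the same down-then-up fill is available via tidyr::fill; a sketch of an equivalent function, assuming tidyr (>= 1.0) is installed, and not part of the original answer:
library(tidyr)
myfunc3 <- function(x) {
  ind <- which(is.na(x), arr.ind = TRUE)
  ind <- ind[order(ind[, 1], ind[, 2]), , drop = FALSE]
  filled <- fill(x, everything(), .direction = "downup")  # fill down, then up
  as.matrix(filled)[ind]
}
myfunc3(dat2)
# expected to match myfunc2: [1] 3 3 2 4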
I need some help on creating a special kind of subtraction.
I have a data table x and I must subtract two columns, say a and b.
However, either column may not exist.
If a column does not exist, its value in the subtraction should be set to zero.
So far, I have approached this problem by trying to define a new subtraction operator, %-%
Thus, for example, if x = data.table(a = 5, b = 2), then a %-% b should be 3, whereas a %-% d should be 5.
I have tried to define this subtraction operator as shown below. However, for some reason, my subtraction always yields zero! Can anyone help me understand what I am doing wrong and how I may correct my code?
library(data.table)
x = data.table(a = floor(10 * runif(5)), b = floor(10 * runif(5)), c =floor(10 * runif(5)))
`%-%` <- function(e1,e2, DT = x){
ifelse(is.numeric(substitute(e1, DT)), e1 <- substitute(e1, DT), e1 <- 0)
ifelse(is.numeric(substitute(e2, DT)), e2 <- substitute(e2, DT), e2 <- 0)
return(e1 - e2)
}
x[, d := a %-% b]
x
x[, d := a %-% d]
x
Many thanks!
We can create a function with intersect to pass only the existing column names into .SDcols, then Reduce with - to subtract the corresponding rows of each column in .SD (Subset of Data.table):
f1 <- function(dat, .x, .y) intersect(names(dat), c(.x, .y))
x[, d := Reduce('-', .SD), .SDcols = f1(x, 'a', 'b')]
x[, e := Reduce(`-`, .SD), .SDcols = f1(x, 'a', 'f')]
x
# a b c d e
#1: 7 0 8 7 7
#2: 3 6 4 -3 3
#3: 9 9 8 0 9
#4: 3 6 2 -3 3
#5: 0 2 3 -2 0
Or, if we want to change the OP's function to take unquoted arguments, use enquo to convert each argument to a quosure and then convert it back to a string with quo_name. Create an intersection vector with the column names of the dataset, and use - in the Reduce:
library(dplyr)
`%-%` <- function(e1, e2, DT){
  e1 <- quo_name(enquo(e1))
  e2 <- quo_name(enquo(e2))
  nm1 <- intersect(names(DT), c(e1, e2))
  DT[, Reduce(`-`, .SD), .SDcols = nm1]
}
x[, d := `%-%`(a, b, .SD)]
x[, e := `%-%`(a, f, .SD)]
data
x <- structure(list(a = c(7L, 3L, 9L, 3L, 0L), b = c(0L, 6L, 9L, 6L,
2L), c = c(8L, 4L, 8L, 2L, 3L)), .Names = c("a", "b", "c"), row.names = c("1:",
"2:", "3:", "4:", "5:"), class = "data.frame")
setDT(x)
`%-%` = function(a, b){
  # grab the data.table from the calling DT[...] expression on the call stack
  DT = eval(sys.status()$sys.calls[[2]][[2]])
  a = substitute(a)
  b = substitute(b)
  stopifnot(is.name(a), is.name(b), is.data.table(DT))
  a = deparse(a)
  b = deparse(b)
  d = numeric(nrow(DT))                      # a column of zeros as the fallback
  a = if(!exists(a, DT)) d else get(a, DT)   # use zeros if the column does not exist
  b = if(!exists(b, DT)) d else get(b, DT)
  a - b
}
set.seed(5)
x = data.table(a = floor(10 * runif(5)), b = floor(10 * runif(5)), c =floor(10 * runif(5)))
x
a b c
1: 2 7 2
2: 6 5 4
3: 9 8 3
4: 2 9 5
5: 1 1 2
x[,a%-%b]
[1] -5 1 1 -7 0
x[,a%-%f] # f is just a column of zeros since it does not exist
[1] 2 6 9 2 1
Or you can just do:
x[,c("d","e","f"):=.(a%-%b,a%-%h,g%-%h)]
x
a b c d e f
1: 2 7 2 -5 2 0
2: 6 5 4 1 6 0
3: 9 8 3 1 9 0
4: 2 9 5 -7 2 0
5: 1 1 2 0 1 0
This function is written to work on a data.table only. For example:
setDF(x)[,a%-%b]
Error: is.data.table(DT) is not TRUE
setDT(x)[,a%-%b]
[1] -5 1 1 -7 0
EDIT: This answer gives the correct value with regard to the order. (Most of the other answers do not pass this test.)
setDT(x)[,a%-%b]#Column subtract another
[1] -5 1 1 -7 0
setDT(x)[,b%-%a]#Reversing the order
[1] 5 -1 -1 7 0
setDT(x)[,b%-%b]#Column Subtract itself
[1] 0 0 0 0 0
setDT(x)[,a%-%f]#Column subtract a non-existing column
[1] 2 6 9 2 1
setDT(x)[,f%-%a]#a non-existing column subtract an existing column
[1] -2 -6 -9 -2 -1
x[,g%-%f] #subtract two non-existing columns
[1] 0 0 0 0 0
IIUC, you can try this way. We use the exists function to check whether the column is available in the data:
# helper function
do_sub <- function(df, col1 = 'a', col2 = 'b')
{
  ans <- integer()
  if (exists(col1, df) && exists(col2, df)){
    ans <- append(ans, df[[col1]] - df[[col2]])
  } else if (exists(col1, df)){
    ans <- append(ans, df[[col1]] - 0)
  } else {
    ans <- append(ans, 0 - df[[col2]])
  }
  return(ans)
}
# compute new columns
df[, d := do_sub(.SD, col1 = 'a', col2 = 'b')]
df[, e := do_sub(.SD, col1 = 'a', col2 = 'f')]
print(df)
a b c d e
1: 7 0 8 7 7
2: 3 6 4 -3 3
3: 9 9 8 0 9
4: 3 6 2 -3 3
5: 0 2 3 -2 0
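A small hedged tweak, not part of the answer above: if neither column exists, the else branch returns an empty vector because df[[col2]] is NULL; assuming a column of zeros is the preferred fallback in that case, a shorter variant could be written as:
do_sub2 <- function(df, col1 = 'a', col2 = 'b') {
  # fall back to 0 for any column that is missing (assumed desired behaviour)
  v1 <- if (exists(col1, df)) df[[col1]] else 0
  v2 <- if (exists(col2, df)) df[[col2]] else 0
  v1 - v2
}
df[, g := do_sub2(.SD, col1 = 'h', col2 = 'i')]  # 'h' and 'i' are hypothetical missing columns; g becomes 0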
DF = structure(list(a = c(1L, 2L, 5L), b = c(2L, 3L, 3L), c = c(3L, 1L, 2L)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
a b c
1 2 3
2 3 1
5 3 2
How do I create additional columns, each including the names or indices of the columns of the row minimum, middle and maximum as follows?
a b c min middle max
1 2 3 a b c
2 3 1 c a b
5 3 2 c b a
One approach would be to loop through the rows with apply, returning the column names in the indicated order:
cbind(DF, t(apply(DF, 1, function(x) setNames(names(DF)[order(x)],
c("min", "middle", "max")))))
# a b c min middle max
# 1 1 2 3 a b c
# 2 2 3 1 c a b
# 3 5 3 2 c b a
This solution assumes you have exactly three columns (so the middle is the second largest). If that is not the case, you could generalize to any number of columns with the following modification:
cbind(DF, t(apply(DF, 1, function(x) {
ord <- order(x)
setNames(names(DF)[c(ord[1], ord[(length(x)+1)/2], tail(ord, 1))],
c("min", "middle", "max"))
})))
# a b c min middle max
# 1 1 2 3 a b c
# 2 2 3 1 c a b
# 3 5 3 2 c b a
As the OP mentioned data.table, here is one way with data.table. Convert the 'data.frame' to a 'data.table' (setDT(DF)); then, grouped by the sequence of rows, unlist the row values, order them, use that order as an index into the column names, and assign the three new columns (after converting to a list):
library(data.table)
setDT(DF)[, c('min', 'middle', 'max') :=
as.list(names(DF)[order(unlist(.SD))]) ,1:nrow(DF)][]
# a b c min middle max
#1: 1 2 3 a b c
#2: 2 3 1 c a b
#3: 5 3 2 c b a
I have a data.frame
orig.DF<-data.frame(V1=c("A", "B", "C"), V2=c(3,2,4))
and I have to expand it so that it takes the following form
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I tried tapply and ave but I can't get it to count 1:x and repeat V1 accordingly.
df <- data.frame(V1 = c("A", "B", "C"), V2 = c(3, 2, 4))
data.frame(x = rep(df$V1, df$V2),  # repeat each V1 value V2 times
           y = sequence(df$V2))    # 1:V2 for each row, concatenated
x y
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 C 2
8 C 3
9 C 4
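A possible tidyverse alternative, assuming tidyr is available (this is a sketch, not part of the answer above): uncount() repeats each row V2 times, and .id adds the within-row counter, so it should give the same V1 / 1:V2 pairs as above.
library(tidyr)
uncount(df, V2, .id = "y")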
Here is one approach:
do.call(
rbind,
apply(orig.DF, 1, function(row) expand.grid(row["V1"], 1:row["V2"]))
)
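Note that apply coerces the data frame to a character matrix here, so 1:row["V2"] relies on R coercing a string back to a number. A hedged variant that avoids that coercion (not from the original answer) could use Map instead:
# build one block per row, keeping V2 numeric, then bind the blocks together
do.call(rbind,
        Map(function(v1, v2) data.frame(V1 = v1, V2 = seq_len(v2)),
            orig.DF$V1, orig.DF$V2))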