Getting index of column of minimum value in a data.frame row - r

DF = structure(list(a = c(1L, 2L, 5L), b = c(2L, 3L, 3L), c = c(3L, 1L, 2L)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
a b c
1 2 3
2 3 1
5 3 2
How do I add columns giving the names (or indices) of the columns that hold each row's minimum, middle, and maximum value, as follows?
a b c min middle max
1 2 3 a b c
2 3 1 c a b
5 3 2 c b a

One approach would be to loop through the rows with apply, returning the column names in the indicated order:
cbind(DF, t(apply(DF, 1, function(x) setNames(names(DF)[order(x)],
c("min", "middle", "max")))))
# a b c min middle max
# 1 1 2 3 a b c
# 2 2 3 1 c a b
# 3 5 3 2 c b a
This solution assumes you have exactly three columns (so the middle is the second largest). If that is not the case, you could generalize to any number of columns with the following modification:
cbind(DF, t(apply(DF, 1, function(x) {
ord <- order(x)
setNames(names(DF)[c(ord[1], ord[(length(x)+1)/2], tail(ord, 1))],
c("min", "middle", "max"))
})))
# a b c min middle max
# 1 1 2 3 a b c
# 2 2 3 1 c a b
# 3 5 3 2 c b a
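The question also asks about indices rather than names. A minimal sketch, assuming the same DF as above: dropping the names(DF)[...] lookup leaves the column positions returned by order. The name idx is introduced here for illustration only.

```r
DF <- data.frame(a = c(1L, 2L, 5L), b = c(2L, 3L, 3L), c = c(3L, 1L, 2L))

# order() on each row gives the column positions from smallest to largest value;
# apply() returns one column per row, so transpose to get one row per row of DF.
idx <- t(apply(DF, 1, order))
colnames(idx) <- c("min", "middle", "max")

# Attach the positions to the original data.
res <- cbind(DF, idx)
res
```

With more than three columns, idx would need the same first/middle/last selection used in the generalized version above.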

Since the OP mentioned data.table, here is one way to do it with data.table. Convert the data.frame to a data.table with setDT(DF), group by row number, unlist each row, order the values, use that ordering to index the column names, and assign the result (converted to a list) to three new columns.
library(data.table)
setDT(DF)[, c('min', 'middle', 'max') :=
as.list(names(DF)[order(unlist(.SD))]) ,1:nrow(DF)][]
# a b c min middle max
#1: 1 2 3 a b c
#2: 2 3 1 c a b
#3: 5 3 2 c b a

Related

apply function removing 0 counts from table() output of ordered factors [duplicate]

This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
(1 answer)
Closed 7 months ago.
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: that a 0 is provided for any factor missing
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by including only the columns of my data.frame in which every level occurs at least once (i.e., I get the desired format once I drop any column that has a 0 count for some level):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply, which is faster, but we need to know the length of the result (here, the number of levels) in advance.
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
If we don't urgently need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
In case we know the number of levels beforehand, we may simplify this to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
Here comes the benchmark:
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(, 5L, dim(DAT)[2])
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
expr min lq mean median uq max neval cld
t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732 100 c
sapply(DAT, table) 9690.340 9822.2150 11721.6487 9934.2045 14128.6330 19107.070 100 c
vapply(DAT, table, numeric(5L)) 9630.185 9729.9155 11313.4803 9816.3260 14017.8180 22655.129 100 c
vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252 9890.5700 11309.0461 9976.4840 14110.4775 17906.082 100 c
sapply(DAT, tabulate, 5L) 725.613 742.7820 778.6458 785.3595 807.1935 916.700 100 a
sapply(DAT, tabulate, nlevels(unlist(dat))) 848.600 891.1135 936.7825 939.8245 967.2390 1114.601 100 a
for 3580.538 3846.5700 4059.3048 3922.1300 3981.4300 19752.024 100 b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
Use lapply and not apply, as explained in ZheyuanLi's linked answer and his comment.
Summary: The problem with apply is that it first coerces the data.frame to a character matrix; table then re-factors those strings, so unused levels are not preserved. lapply instead passes each column to table as the factor it is, and returns a list.
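The coercion can be seen directly. This is a small sketch with a one-column data frame made up for illustration (it mirrors X3 from the question); as.matrix stands in for the conversion that apply performs on a data.frame.

```r
dat <- data.frame(
  X3 = factor(c("a", "a", "b", "c", "d"), levels = letters[1:5], ordered = TRUE)
)

# apply() works on as.matrix(dat), which is a character matrix for factor columns.
m <- as.matrix(dat)
is.character(m)

# Counting the strings: table() only sees the values that occur, so "e" is gone.
counts_from_matrix <- table(m[, "X3"])

# Counting the factor column itself: all five levels are kept, "e" with count 0.
counts_from_factor <- table(dat$X3)

counts_from_matrix
counts_from_factor
```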
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Also thanks jay.sf for showing how vapply works.

Difference across columns in a data.table

Sorry if it might be a quite basic point, but I fail to find a convenient tool.
I have a (quite large) data table and want to difference across columns, that is
A B C D
9 N.A. 3 2
15 4 N.A. N.A.
N.A. N.A 2 3
I want to create a new column E that holds what is left of A after subtracting B, C, and D. For N.A.s in columns B, C and D I can assume zeros, but when there is an N.A. in A I have to ignore this observation. So the final result should be
A B C D E
9 N.A. 3 2 4
15 4 N.A. N.A. 11
I was removing all the rows in DT that are N.A. in A by
DT <- DT[!(DT$A=="N.A.")]
and then I tried
DT[, E:= lapply(.SD, diff), .SDcols = c("A", "B", "C", "D")].
but that fails because of the N.A.'s.
I don't want to manually change the N.A.s into 0 (because later on I might want to distinguish what was a real zero and what was what I imputed)- I'd like to do it inside a function. Does anybody have a good idea?
Here you go:
df$E[!is.na(df$A)] = 2*df$A[!is.na(df$A)] - rowSums(df[!is.na(df$A),], na.rm = T)
Example:
df = data.frame(A = c(19,25,NA,17),B = c(1,2,3,4), C = c(5,NA,NA,9), D = c(3,1,2,NA))
>df
A B C D
1 19 1 5 3
2 25 2 NA 1
3 NA 3 NA 2
4 17 4 9 NA
df$E[!is.na(df$A)] = 2*df$A[!is.na(df$A)] - rowSums(df[!is.na(df$A),], na.rm = T)
> df
A B C D E
1 19 1 5 3 10
2 25 2 NA 1 22
3 NA 3 NA 2 NA
4 17 4 9 NA 4
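The expression above relies on 2*A - (A + B + C + D) being equal to A - (B + C + D). A minimal sketch of the same calculation with the subtracted columns named explicitly, assuming they are called B, C and D as in the example:

```r
df <- data.frame(A = c(19, 25, NA, 17),
                 B = c(1, 2, 3, 4),
                 C = c(5, NA, NA, 9),
                 D = c(3, 1, 2, NA))

# NAs in B, C, D count as zero via na.rm = TRUE;
# an NA in A propagates, so E stays NA for that row.
df$E <- df$A - rowSums(df[c("B", "C", "D")], na.rm = TRUE)
df
```

Because only B, C and D are summed, running the line a second time gives the same result even though E now exists in df.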
I assume all columns are of type character.
require(data.table)
DT <- data.table(A = c("9", "15", "N.A."),
B = c("N.A.", "4", "N.A."),
C = c("3", "N.A.", "2"),
D = c("2", "N.A.", "3"))
DT <- DT[A != "N.A."]
Compute row number.
DT[, rownum := .I]
You will get warnings because N.A. can not be converted to type numeric.
DT[, E := as.numeric(A) - sum(as.numeric(B),
as.numeric(C),
as.numeric(D), na.rm = T), by = rownum]
DT
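The coercion warnings can be avoided by turning the "N.A." strings into real NA values before converting. The following is a base R sketch rather than data.table code, and the helper to_num and the object names are made up for illustration. The original strings are left untouched, so imputed zeros are never written into the data.

```r
raw <- data.frame(A = c("9", "15", "N.A."),
                  B = c("N.A.", "4", "N.A."),
                  C = c("3", "N.A.", "2"),
                  D = c("2", "N.A.", "3"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: map the placeholder to NA first, then convert (no warning).
to_num <- function(x) as.numeric(replace(x, x == "N.A.", NA))

# Numeric copy used only for the calculation.
num <- as.data.frame(lapply(raw, to_num))

# Drop rows where A is missing, then subtract B, C, D (missing treated as zero).
keep <- !is.na(num$A)
result <- raw[keep, ]
result$E <- num$A[keep] - rowSums(num[keep, c("B", "C", "D")], na.rm = TRUE)
result
```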

converting multiple rows into a single row based on specific conditions

Can you please suggest how to implement the following in R.
I have a table as given below.
ID object value
1 a 3
2 a 2
3 b 3
4 a 1
5 a 2
6 b 2
7 a 1
8 b 1
I would like to get the following table
ID object values
1 a 3, 2, 1
2 a 2, 1
4 a 1
5 a 2, 1
7 a 1
3 b 3, 2, 1
6 b 2,1
8 b 1
In other words, for each object each row value is appended with the next observed values till the value reaches 1.
Thanks a lot for helping.
Bikas
It is not altogether clear whether the data will always be ordered by decreasing value, or whether the values should be output in decreasing order.
In any event, I would use the data.table library. Assuming your table is a data.frame, df, I would do the following:
library(data.table)
setDT(df)
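df[value >= 1][order(value, decreasing = TRUE), .(values = paste(value, collapse = ", ")), by = object]
What this is doing is:
initializing your data.frame as a data.table
using only rows with value >= 1
ordering the data by decreasing value
grouping by object
pasting the values of each group together
Note that this returns one row per object; it does not build the per-row running sequences shown in the question.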
Using a modified dataset with 2nd row value as 4
res <- unsplit(lapply(split(df, df$object), function(x) {
x$value <- sapply(seq_len(nrow(x)), function(i) {
i1 <- i:nrow(x)
indx <- which(x$value[i1]==1)[1]
paste(x$value[i1[seq(indx)]], collapse=", ")
})
x}),
df$object)
res[order(res$object),]
# ID object value
#1 1 a 3, 4, 1
#2 2 a 4, 1
#4 4 a 1
#5 5 a 2, 1
#7 7 a 1
#3 3 b 3, 2, 1
#6 6 b 2, 1
#8 8 b 1
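The inner step may be easier to follow on a single vector. This is a sketch of the same logic for one object's values; the function name run_to_one is made up here, and it assumes that a 1 occurs at or after every position (otherwise which(...)[1] is NA and the call fails).

```r
run_to_one <- function(v) {
  vapply(seq_along(v), function(i) {
    rest <- v[i:length(v)]          # values from position i to the end
    stop_at <- which(rest == 1)[1]  # position of the first 1 in that tail
    paste(rest[seq_len(stop_at)], collapse = ", ")
  }, character(1))
}

# Values of object "a" in the modified dataset (IDs 1, 2, 4, 5, 7).
run_to_one(c(3, 4, 1, 2, 1))
```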
Also, using data.table
library(data.table)
setDT(df)[, N:=1:.N, by=object][,
values:=unlist(lapply(N, function(i) {
val <- value[i:.N]
paste(val[1:which(val==1)[1]], collapse=",")
})), keyby=object][,-(3:4), with=FALSE]
# ID object values
#1: 1 a 3,4,1
#2: 2 a 4,1
#3: 4 a 1
#4: 5 a 2,1
#5: 7 a 1
#6: 3 b 3,2,1
#7: 6 b 2,1
#8: 8 b 1
Update
If you need the sequence up to the minimum value, you could replace which(x$value[i1]==1)[1] with which(x$value[i1]==min(x$value))[1]. For example, using the first code as a function:
f1 <- function(dat){
lst <- split(dat, dat$object)
lst2 <- lapply(lst, function(x) {
x$value <- sapply(seq_len(nrow(x)), function(i) {
i1 <- i:nrow(x)
indx <- which(x$value[i1]== min(x$value))[1]
paste(x$value[i1[seq(indx)]], collapse=",")
})
x})
res <- unsplit(lst2, dat$object)
res[order(res$object),]
}
f1(df)
# ID object value
#1 1 a 3,4,1
#2 2 a 4,1
#4 4 a 1
#5 5 a 2,1
#7 7 a 1
#3 3 b 3,2,1
#6 6 b 2,1
#8 8 b 1
If I change all the 1 values to 2
df$value[df$value==1] <- 2
f1(df)
# ID object value
#1 1 a 3,4,2
#2 2 a 4,2
#4 4 a 2
#5 5 a 2
#7 7 a 2
#3 3 b 3,2
#6 6 b 2
#8 8 b 2
data
df <- structure(list(ID = 1:8, object = c("a", "a", "b", "a", "a",
"b", "a", "b"), value = c(3L, 4L, 3L, 1L, 2L, 2L, 1L, 1L)), .Names = c("ID",
"object", "value"), class = "data.frame", row.names = c(NA, -8L
))

cbind warnings : row names were found from a short variable and have been discarded

I have the line of code below for cbind, but I get a warning message every time.
Although the code still works as it should, is there any way to resolve the warning?
dateset = subset(all_data[,c("VAR1","VAR2","VAR3","VAR4","VAR5","RATE1","RATE2","RATE3")])
dateset = cbind(dateset[c(1,2,3,4,5)],stack(dateset[,-c(1,2,3,4,5)]))
Warnings :
Warning message:
In data.frame(..., check.names = FALSE) :
row names were found from a short variable and have been discarded
Thanks in advance!
I'm guessing your data.frame has row.names:
A <- data.frame(a = c("A", "B", "C"),
b = c(1, 2, 3),
c = c(4, 5, 6),
row.names=c("A", "B", "C"))
cbind(A[1], stack(A[-1]))
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
# Warning message:
# In data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
What's happening here is that a data.frame cannot, by default, have duplicated row.names. The first column has to be recycled to match the number of rows of the stacked columns, and since you don't tell R at any point to duplicate the row.names along with it, R just discards them.
Compare with a similar data.frame, but one without row.names:
B <- data.frame(a = c("A", "B", "C"),
b = c(1, 2, 3),
c = c(4, 5, 6))
cbind(B[1], stack(B[-1]))
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
Alternatively, you can set row.names = NULL in your cbind statement:
cbind(A[1], stack(A[-1]), row.names = NULL)
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
If your original row.names are important, you can also add them back in with:
cbind(rn = rownames(A), A[1], stack(A[-1]), row.names = NULL)
# rn a values ind
# 1 A A 1 b
# 2 B B 2 b
# 3 C C 3 b
# 4 A A 4 c
# 5 B B 5 c
# 6 C C 6 c
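Another option, sketched here under the assumption that the row names are not needed afterwards, is to clear them before combining. The copy A2 is a name introduced for illustration; after the reset it behaves like the data.frame B above.

```r
A <- data.frame(a = c("A", "B", "C"),
                b = c(1, 2, 3),
                c = c(4, 5, 6),
                row.names = c("A", "B", "C"))

# Work on a copy and reset it to automatic row names (1, 2, 3).
A2 <- A
rownames(A2) <- NULL

# With automatic row names there is nothing to discard, so no warning is raised.
res <- cbind(A2[1], stack(A2[-1]))
res
```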

I have multiple dataframes under one name and I need to create a new column in each one by combining two of the other columns? [duplicate]

I have several csv files all named with dates and for all of them I want to create a new column in each file that contains data from two other columns placed together. Then, I want to combine them into one big dataframe and choose only two of those columns to keep. Here's an example:
Say I have two dataframes:
a b c a b c
x 1 2 3 x 3 2 1
y 2 3 1 y 2 1 3
Then I want to create a new column d in each of them:
a b c d a b c d
x 1 2 3 13 x 3 2 1 31
y 2 3 1 21 y 2 1 3 23
Then I want to combine them like this:
a b c d
x 1 2 3 13
y 2 3 1 21
x 3 2 1 31
y 2 1 3 23
Then keep two of the columns a and d and delete the other two columns b and c:
a d
x 1 13
y 2 21
x 3 31
y 2 23
Here is my current code (It doesn't work when I try to combine two of the columns or when I try to only keep two of the columns):
f <- list.files(pattern="201\\d{5}\\.csv") # reading in all the files
mydata <- sapply(f, read.csv, simplify=FALSE) # assigning them to a dataframe
do.call(rbind,mydata) # combining all of those dataframes into one
mydata$Data <- paste(mydata$LAST_UPDATE_DT,mydata$px_last) # combining two of the columns into a new column named "Data"
c('X','Data') %in% names(mydata) # keeping two of the columns while deleting the rest
The object mydata is a list of data frames. You can change the data frames in the list with lapply:
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
file1 <- "a b
x 2 3"
file2 <- "a b
x 3 1"
mydata <- lapply(c(file1, file2), function(x) read.table(text = x, header =TRUE))
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
# [[1]]
# a b c
# x 2 3 23
#
# [[2]]
# a b c
# x 3 1 31
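The remaining steps (combining the list and keeping two columns) could look like the sketch below. The two data frames are made-up stand-ins for files read with read.csv, and d pastes a and c as in the question's example. Note that the result of do.call(rbind, ...) has to be assigned; in the question's code it is computed and then discarded.

```r
# Stand-in for the list produced by sapply(f, read.csv, simplify = FALSE).
mydata <- list(
  data.frame(a = c(1, 2), b = c(2, 3), c = c(3, 1)),
  data.frame(a = c(3, 2), b = c(2, 1), c = c(1, 3))
)

# Step 1: add column d to every data frame in the list.
mydata <- lapply(mydata, function(x) {
  x$d <- paste0(x$a, x$c)
  x
})

# Step 2: stack the list into one data frame and keep the result.
combined <- do.call(rbind, mydata)

# Step 3: keep only the columns of interest.
final <- combined[, c("a", "d")]
final
```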
You can use rbind(data1, data2)[, c(1, 4)] for that. I assume that you can already create column d in each data frame, which is a basic step.
data1<-structure(list(a = 1:2, b = 2:3, c = c(3L, 1L), d = c(13L, 21L
)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"), class = "data.frame")
> data1
a b c d
x 1 2 3 13
y 2 3 1 21
data2<-structure(list(a = c(3L, 2L), b = c(2L, 1L), c = c(1L, 3L), d = c(31L,
23L)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"
), class = "data.frame")
> data2
a b c d
x 3 2 1 31
y 2 1 3 23
data3<-rbind(data1,data2)
> data3
a b c d
x 1 2 3 13
y 2 3 1 21
x1 3 2 1 31
y1 2 1 3 23
finaldata<-data3[,c("a","d")]
> finaldata
a d
x 1 13
y 2 21
x1 3 31
y1 2 23
