Add new column describing unique values per ID - r

The data I have contain four fields: ID, x1 (numeric), x2 (numeric), and x3 (factor). Some IDs have multiple records, and some values of x3 are missing (NA). Here is a sample:
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,4,5,6,6)
x1 <- rnorm(16,0,1)
x2 <- rnorm(16,2,2)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c", "w", "w", "w", "y", NA, NA)
df <- data.frame(ID, x1, x2, x3)
I want to create a new field (let's call it unqind) flagging whether all values of x3 within each ID are identical.
For example, ID=1 has four observations of x3 ("a", "a", "a", NA): three "a"s and one NA, so the values are not all identical and unqind=0.
ID=2 has two observations of x3 (two "b"s), so unqind=1.
If all values of x3 for an ID are NA, then unqind=1.
After creating unqind, df looks like:
ID x1 x2 x3 unqind
1 0.9087691 4.4353865 a 0
1 0.3686852 2.5851186 a 0
1 -1.335171 1.18109 a 0
1 -0.1596629 0.593775 NA 0
2 0.4841148 0.1684549 b 1
2 0.1256352 4.2785666 b 1
3 -0.954508 3.1284599 c 0
3 0.3502183 2.4766285 c 0
3 -1.2365438 1.041901 a 0
3 0.9786498 -0.6517521 c 0
4 1.3426399 1.5733424 w 1
4 -0.3117586 -0.4648479 w 1
4 0.136769 -2.6124866 w 1
5 -1.3295984 6.2783164 y 1
6 -1.1989125 -1.7025381 NA 1
6 -0.8936165 2.3131387 NA 1

You could do this quite easily with the data.table package. uniqueN() is equivalent to length(unique(x)) but much faster. Group by ID and compare the result to 1.
library(data.table)
setDT(df)[, unqind := as.integer(uniqueN(x3) == 1L), by = ID]
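The NA rules above fall out of uniqueN()'s default behaviour: with na.rm = FALSE (the default), NA counts as one distinct value. A quick sketch:

```r
library(data.table)

uniqueN(c("a", "a", "a", NA))  # 2 -> "a" plus NA, so unqind = 0
uniqueN(c("b", "b"))           # 1 -> unqind = 1
uniqueN(c(NA, NA))             # 1 -> an all-NA group still yields unqind = 1
```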
Another option, using base R, is ave(). Note that ave() returns a character vector here (its first argument is character), so the comparison with 1L relies on implicit coercion.
df$unqind <- with(df, {
  as.integer(ave(as.character(x3), ID, FUN = function(x) length(unique(x))) == 1L)
})
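If that coercion feels fragile, an equivalent base R sketch computes the per-ID counts once with tapply() and maps them back by ID (the data here is a small stand-in, not the original sample):

```r
df <- data.frame(ID = c(1, 1, 2, 2, 3),
                 x3 = c("a", NA, "b", "b", "c"))

# one count of distinct x3 values per ID, named by ID...
n_unique <- tapply(as.character(df$x3), df$ID, function(x) length(unique(x)))
# ...then looked up row-by-row via the ID column
df$unqind <- as.integer(n_unique[as.character(df$ID)] == 1L)
df$unqind  # 0 0 1 1 1
```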


apply function removing 0 counts from table() output of ordered factors [duplicate]

This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
(1 answer)
Closed 7 months ago.
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice-looking table that counts how many of each level a:e appear in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: a 0 is reported for any level with no observations
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by only including columns of my data.frame that have at least 1 count for each factor that is similar between the columns. (i.e., I can get my desired formatting outcome by removing any column with a 0 count for any factor):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply(), which is faster, but we need to supply a template for the result: a numeric vector whose length is the number of levels.
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
If we don't need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
If we know the number of levels beforehand, we may simplify these to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
Here is the benchmark:
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(NA_integer_, 5L, ncol(DAT))
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
expr min lq mean median uq max neval cld
t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732 100 c
sapply(DAT, table) 9690.340 9822.2150 11721.6487 9934.2045 14128.6330 19107.070 100 c
vapply(DAT, table, numeric(5L)) 9630.185 9729.9155 11313.4803 9816.3260 14017.8180 22655.129 100 c
vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252 9890.5700 11309.0461 9976.4840 14110.4775 17906.082 100 c
sapply(DAT, tabulate, 5L) 725.613 742.7820 778.6458 785.3595 807.1935 916.700 100 a
sapply(DAT, tabulate, nlevels(unlist(dat))) 848.600 891.1135 936.7825 939.8245 967.2390 1114.601 100 a
for 3580.538 3846.5700 4059.3048 3922.1300 3981.4300 19752.024 100 b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
Use lapply and not apply, as explained in ZheyuanLi's linked answer and his comment.
Summary: the problem with apply is that it converts everything to character; table then re-factors those characters, so unused levels are not preserved. lapply, by contrast, returns a list and keeps each column as a factor.
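The coercion is easy to see directly (an illustrative check on the dat defined above, reconstructed here so the snippet is self-contained):

```r
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1, 1:4), X4 = c(2, 2:5))
dat  <- data.frame(lapply(dat0, factor, ordered = TRUE,
                          levels = 1:5, labels = letters[1:5]))

# apply() pushes the data frame through as.matrix(), producing a character
# matrix with no level information...
class(apply(dat, 2, identity)[, 1])    # "character"
# ...while lapply() hands each column to the function as a factor:
class(lapply(dat, identity)[[1]])      # "ordered" "factor"
```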
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Also thanks jay.sf for showing how vapply works.

Is it possible to return the values of reference columns to multiple columns in r

I am organizing a large dataset adapted to my research. Suppose that I have 9 observations (records) and 4 columns as follows:
z <- data.frame("fa" = c(1, NA, NA, 2, 1, 1, 2, 1, 1),
"fb" = c(2, 2, NA, 1, NA, NA, NA, 1, 2),
"initial_1" = c("A", "B", "B", "B", "A", "C", "D", "B", "A"),
"initial_2" = c("D", "C", "C", "A", "B", "A", "A", "D", "D"))
I want to create two new columns, fa_new and fb_new, according to the values of the first two columns, fa and fb, which are linked to the reference columns, initial_1 and initial_2, such that fa == # matches initial_#.
For example, as can be seen above, the first record of the column fa is 1, which is linked to "A" of initial_1. Thus, the first record of the new column fa_new will be "A". Likewise, the first record of fb is 2, which is linked to "D" of initial_2; thus, the first record of fb_new will be "D".
Accordingly, my expectation is:
fa_new fb_new
1 A D
2 NA C
3 NA NA
4 A B
5 A NA
6 C NA
7 A NA
8 B B
9 A D
Is this possible using R?
You can use lapply to do this for multiple columns :
cols <- 1:2
init_cols <- paste0('initial_', cols)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z[new_cols] <- lapply(z[cols], function(x) z[init_cols][cbind(inds, x)])
z
# fa fb initial_1 initial_2 fa_new fb_new
#1 1 2 A D A D
#2 NA 2 B C <NA> C
#3 NA NA B C <NA> <NA>
#4 2 1 B A A B
#5 1 NA A B A <NA>
#6 1 NA C A C <NA>
#7 2 NA D A A <NA>
#8 1 1 B D B B
#9 1 2 A D A D
The logic here is we create a matrix with cbind which has row/column number. The row number is inds (1:nrow(z)) whereas column number comes from fa/fb columns which is used to subset z dataframe.
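Matrix indexing with a two-column matrix is the key trick: each row of the index matrix picks out one (row, column) cell. A minimal illustration:

```r
m <- matrix(letters[1:6], nrow = 3)   # columns: a b c | d e f
m[cbind(c(1, 3), c(2, 1))]            # picks m[1, 2] and m[3, 1] -> "d" "c"
```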
Since the actual data frame is a labelled dataset, the following adaptation should work on the real data (where the reference columns are fuinitials_1 through fuinitials_94):
cols <- 1:2
init_cols <- paste0('fuinitials_', 1:94)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z1 <- data.frame(z)
z1[cols][z1[cols] < 1] <- NA
z1[new_cols] <- lapply(z1[cols], function(x) z1[init_cols][cbind(inds, x)])

Find the number of items row wise with an exception

I have a set of the following form:-
a <- data.frame(X1=c("A", "B", "C", "D", "0"),
X2=c("B", "A", "D", "E", "A"),
X3=c("0", "0", "B", "A", "0"),
X4=c("A", "0", "A", "0", "0")
)
# a
# X1 X2 X3 X4
# A B 0 A
# B A 0 0
# C D B A
# D E A 0
# 0 A 0 0
What I want to know if in each row how many items are there except "0" and save them in a new column. The expected output should be :-
# b
# 3
# 2
# 4
# 3
# 1
Duplicates should be counted separately, i.e., if a row consists of two "A"s, one "B", and a "0", it should return 3. Thanks in advance.
We could compare the dataframe with 0 and use rowSums to calculate number of entries except 0 in each row.
rowSums(a != 0)
#[1] 3 2 4 3 1
Although it is not needed here (since rowSums is straightforward), we can also use apply row-wise:
apply(a!= 0 , 1, sum)
If you have a single character in each cell of data frame a, then here is a base R option. Otherwise (if any cells contain multiple characters), please turn to the approach by @Ronak Shah.
a$b <- nchar(gsub("0","",do.call(paste0,a)))
such that
> a
X1 X2 X3 X4 b
1 A B 0 A 3
2 B A 0 0 2
3 C D B A 4
4 D E A 0 3
5 0 A 0 0 1
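The single-character assumption matters because this approach counts string length rather than cells (an illustrative case, not from the data above):

```r
# a row containing the multi-character cell "AB", pasted together as above
row_string <- paste0("AB", "0", "C")
nchar(gsub("0", "", row_string))  # 3, although the row holds only 2 non-"0" items
```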
We can use lengths with split
lengths(split(a[a!=0], row(a)[a != 0]))
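One caveat with the split() approach (an illustrative edge case, not present in the original data): split() only creates groups for rows that contribute at least one non-"0" entry, so a row of all "0"s is silently dropped rather than reported as 0:

```r
a2 <- data.frame(X1 = c("A", "0"), X2 = c("B", "0"))
lengths(split(a2[a2 != 0], row(a2)[a2 != 0]))
# only a group for row 1 appears; row 2 is absent from the result
```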

Order data frame by columns with different calling schemes

Say I have the following data frame:
df <- data.frame(x1 = c(2, 2, 2, 1),
x2 = c(3, 3, 2, 1),
let = c("B", "A", "A", "A"))
df
x1 x2 let
1 2 3 B
2 2 3 A
3 2 2 A
4 1 1 A
If I want to order df by x1, then x2 then let, I do this:
df2 <- df[with(df, order(x1, x2, let)), ]
df2
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
However, x1 and x2 have actually been saved as an id <- c("x1", "x2") vector earlier in the code, which I use for other purposes.
So my problem is that I want to reference id instead of x1 and x2 in my order function, but unfortunately anything like df[order(df[id], df$let), ] will result in an "argument lengths differ" error.
From what I can tell (and this has been addressed at another SO thread), the problem is that length(df[id]) == 2 and length(df$let) == 4.
I have been able to make it through this workaround:
df3 <- df[order(df[, id[1]], df[, id[2]], df[, "let"]), ]
df3
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
But it looks ugly and depends on knowing the size of id.
Is there a more elegant solution to sorting my data frame by id then let?
I would suggest using do.call(order, ...) and combining id and "let" with c():
id <- c("x1", "x2")
df[do.call(order, df[c(id, "let")]), ]
# x1 x2 let
# 4 1 1 A
# 3 2 2 A
# 2 2 3 A
# 1 2 3 B
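Since df[c(id, "let")] is just a list of columns, do.call() expands it into order()'s ... arguments; the call is equivalent to spelling the columns out, which a quick check confirms:

```r
df <- data.frame(x1 = c(2, 2, 2, 1),
                 x2 = c(3, 3, 2, 1),
                 let = c("B", "A", "A", "A"))
id <- c("x1", "x2")

identical(do.call(order, df[c(id, "let")]),
          order(df$x1, df$x2, df$let))  # TRUE
```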

Finding the Column Index for a Specific Value

I am having a brain cramp. Below is a toy dataset:
df <- data.frame(
id = 1:6,
v1 = c("a", "a", "c", NA, "g", "h"),
v2 = c("z", "y", "a", NA, "a", "g"),
stringsAsFactors=F)
I have a specific value that I want to find across a set of defined columns and I want to identify the position it is located in. The fields I am searching are characters and the trick is that the value I am looking for might not exist. In addition, null strings are also present in the dataset.
Assuming I knew how to do this, the variable position indicates the values I would like returned.
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
The general rule is that I want to find the position of value "a", and if it is not located or if v1 is missing, then I want 99 returned.
In this instance, I am searching across v1 and v2, but in reality, I have 10 different variables. It is also worth noting that the value I am searching for can only exist once across the 10 variables.
What is the best way to generate this recode?
Many thanks in advance.
Use match:
> df$position <- apply(df,1,function(x) match('a',x[-1], nomatch=99 ))
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
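Two properties of match() do the heavy lifting here: it returns only the index of the first hit (consistent with the at-most-one-occurrence rule above), and nomatch covers both "not found" and all-NA rows. Illustrative:

```r
match("a", c("z", "a", "a"), nomatch = 99)  # 2 -- first occurrence only
match("a", c(NA, NA), nomatch = 99)         # 99 -- no match among NAs
```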
Firstly, drop the first column:
df <- df[, -1]
Then, do something like this (disclaimer: I'm feeling terribly sleepy*):
( df$result <- unlist(lapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))) )
v1 v2 result
1 a z 1
2 a y 1
3 c a 2
4 <NA> <NA> 99
5 g a 2
6 h g 99
* sleepy = code is not vectorised
EDIT (slightly different solution, I still feel sleepy):
df$result <- rapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))
