The data I have contain four fields: ID, x1 (numeric), x2 (numeric), and x3 (factor). Some IDs have multiple records, and some values of x3 are missing (NA). Here is a sample:
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,4,5,6,6)
x1 <- rnorm(16,0,1)
x2 <- rnorm(16,2,2)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c", "w", "w", "w", "y", NA, NA)
df <- data.frame(ID, x1, x2, x3)
I want to create a new field (let's call it unqind) to check whether each ID has unique values of x3.
For example, ID=1 has four observations of x3 ("a", "a", "a", NA) ... three "a"'s and one NA. Therefore unqind=0.
ID=2 has two observations of x3 (2 "b"s)... therefore, unqind=1.
If all values of x3 for an ID are NA, then unqind=1.
After creating unqind, df looks like:
ID x1 x2 x3 unqind
1 0.9087691 4.4353865 a 0
1 0.3686852 2.5851186 a 0
1 -1.335171 1.18109 a 0
1 -0.1596629 0.593775 NA 0
2 0.4841148 0.1684549 b 1
2 0.1256352 4.2785666 b 1
3 -0.954508 3.1284599 c 0
3 0.3502183 2.4766285 c 0
3 -1.2365438 1.041901 a 0
3 0.9786498 -0.6517521 c 0
4 1.3426399 1.5733424 w 1
4 -0.3117586 -0.4648479 w 1
4 0.136769 -2.6124866 w 1
5 -1.3295984 6.2783164 y 1
6 -1.1989125 -1.7025381 NA 1
6 -0.8936165 2.3131387 NA 1
You could do this quite easily with the data.table package. uniqueN() is equivalent to length(unique(x)) but much faster. Group by ID and compare the result to 1.
library(data.table)
setDT(df)[, unqind := as.integer(uniqueN(x3) == 1L), by = ID]
Another option, using base R, could be with ave().
df$unqind <- with(df, {
as.integer(ave(as.character(x3), ID, FUN=function(x) length(unique(x))) == 1L)
})
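As a quick check of the grouping logic, here is a minimal base R sketch on the x3 values from the question. It relies on unique() keeping NA as a value of its own, which is why ID 1 comes out as non-unique and ID 6 as unique. The name n_distinct_by_id is only an illustrative helper, not something from the answers above.

```r
ID <- c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6)
x3 <- c("a", "a", "a", NA, "b", "b", "c", "c", "a", "c", "w", "w", "w", "y", NA, NA)

# Number of distinct x3 values per ID; unique() treats NA as one more value
n_distinct_by_id <- tapply(x3, ID, function(x) length(unique(x)))

# Map the per-ID counts back onto the rows and flag IDs with exactly one value
unqind <- as.integer(n_distinct_by_id[as.character(ID)] == 1L)
unqind
```

If NA should be ignored rather than counted, the helper function would need to drop missing values first, which changes the result for ID 1.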
This question already has an answer here:
R: Why am I not getting type or class "factor" after converting columns to factor?
Assume the following data.frame with columns of ordered factors:
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1,1:4), X4 = c(2,2:5))
dat <- data.frame(lapply(dat0, factor, ordered=TRUE, levels=1:5, labels=letters[1:5]))
I want to create a nice looking table that compiles how many a:e are in each column of dat (including any 0 counts). The function table() is an obvious choice.
My "clean" attempt at making this table does not work. See below:
The table() function works as expected (i.e., includes all 5 factor choices -- even if one or more has a 0 count) when applied to individual columns:
table(dat[,1])
a b c d e
1 1 1 1 1
table(dat[,3])
a b c d e
2 1 1 1 0
# note: that a 0 is provided for any factor missing
However, when I try to use an apply() function on the data.frame to include all column counts into one table, I get wonky resulting formatting:
apply(dat, 2, table)
$X1
a b c d e
1 1 1 1 1
$X2
a b c d e
1 1 1 1 1
$X3
a b c d
2 1 1 1
$X4
b c d e
2 1 1 1
I can demonstrate the cause of the issue by only including columns of my data.frame that have at least 1 count for each factor that is similar between the columns. (i.e., I can get my desired formatting outcome by removing any column with a 0 count for any factor):
apply(dat[1:2], 2, table) # only including columns of dat with all 5 letters (i.e., no 0 counts)
X1 X2
a 1 1
b 1 1
c 1 1
d 1 1
e 1 1
Question: Is there a simple workaround/solution here when using table() or am I going to have to find a different approach?
Note: I know I could simply cbind() the individual table results, but that's very tedious in my actual more complex data set.
We may use table in sapply.
sapply(dat, table)
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
Or vapply, which is faster, but we need to know the length of each result in advance (here, the number of factor levels).
vapply(dat, table, numeric(nlevels(unlist(dat))))
# X1 X2 X3 X4
# a 1 1 2 0
# b 1 1 1 2
# c 1 1 1 1
# d 1 1 1 1
# e 1 1 0 1
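To illustrate what vapply checks, here is a small sketch, assuming every column shares the same five levels as in the question. n_lev and bad are illustrative names only. The template numeric(n_lev) states the expected length of each result; a template of the wrong length makes vapply stop with an error rather than quietly return something else.

```r
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1, 1:4), X4 = c(2, 2:5))
dat <- data.frame(lapply(dat0, factor, ordered = TRUE, levels = 1:5, labels = letters[1:5]))

# Assumption: all columns have the same levels, so the first column is representative
n_lev <- nlevels(dat[[1]])

# Each table() result must have exactly n_lev entries
counts <- vapply(dat, table, numeric(n_lev))

# A template of length 1 does not match the length-5 results, so vapply errors
bad <- tryCatch(vapply(dat, table, numeric(1L)), error = function(e) "error")
```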
If we don't urgently need the row names, we may use tabulate.
sapply(dat, tabulate, nlevels(unlist(dat)))
# X1 X2 X3 X4
# [1,] 1 1 2 0
# [2,] 1 1 1 2
# [3,] 1 1 1 1
# [4,] 1 1 1 1
# [5,] 1 1 0 1
If we know the number of levels beforehand, we may simplify these to vapply(dat, table, numeric(5L)) and sapply(dat, tabulate, 5L), which also gives a gain in speed.
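If the row names are wanted after all, one possible sketch is to run tabulate and then attach the level labels afterwards. This assumes all columns share the same levels, and res is just an illustrative name.

```r
dat0 <- data.frame(X1 = 1:5, X2 = 1:5, X3 = c(1, 1:4), X4 = c(2, 2:5))
dat <- data.frame(lapply(dat0, factor, ordered = TRUE, levels = 1:5, labels = letters[1:5]))

# tabulate() counts the integer codes of each factor; nbins keeps the zero counts
res <- sapply(dat, tabulate, nbins = 5L)

# Assumption: every column has the same levels, so take the labels from the first one
rownames(res) <- levels(dat[[1]])
res
```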
Here comes the benchmark:
set.seed(42)
DAT <- dat[sample(nrow(dat),1e5, replace=TRUE), ]
r <- matrix(, 5L, dim(DAT)[2])
microbenchmark::microbenchmark(
t(data.frame(do.call(rbind,lapply(DAT, table)))),
sapply(DAT, table),
vapply(DAT, table, numeric(5L)),
vapply(DAT, table, numeric(nlevels(unlist(dat)))),
sapply(DAT, tabulate, 5L),
sapply(DAT, tabulate, nlevels(unlist(dat))),
`for`={for (j in seq_along(DAT)) r[, j] <- tabulate(DAT[, j], 5L)}
)
Unit: microseconds
expr min lq mean median uq max neval cld
t(data.frame(do.call(rbind, lapply(DAT, table)))) 9960.629 10101.4820 11662.6014 10221.6970 14459.0215 17422.732 100 c
sapply(DAT, table) 9690.340 9822.2150 11721.6487 9934.2045 14128.6330 19107.070 100 c
vapply(DAT, table, numeric(5L)) 9630.185 9729.9155 11313.4803 9816.3260 14017.8180 22655.129 100 c
vapply(DAT, table, numeric(nlevels(unlist(dat)))) 9753.252 9890.5700 11309.0461 9976.4840 14110.4775 17906.082 100 c
sapply(DAT, tabulate, 5L) 725.613 742.7820 778.6458 785.3595 807.1935 916.700 100 a
sapply(DAT, tabulate, nlevels(unlist(dat))) 848.600 891.1135 936.7825 939.8245 967.2390 1114.601 100 a
for 3580.538 3846.5700 4059.3048 3922.1300 3981.4300 19752.024 100 b
Data:
dat <- structure(list(X1 = structure(1:5, levels = c("a", "b", "c",
"d", "e"), class = c("ordered", "factor")), X2 = structure(1:5, levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor")), X3 = structure(c(1L,
1L, 2L, 3L, 4L), levels = c("a", "b", "c", "d", "e"), class = c("ordered",
"factor")), X4 = structure(c(2L, 2L, 3L, 4L, 5L), levels = c("a",
"b", "c", "d", "e"), class = c("ordered", "factor"))), class = "data.frame", row.names = c(NA,
-5L))
Solution:
Use lapply and not apply, as explained in ZheyuanLi's linked answer and comment.
Summary: The problem with apply is that it coerces the data frame to a character matrix, so the factor levels are lost; table then re-factors those characters, and unused levels are not preserved. lapply, by contrast, passes each column to table as a factor and returns a list.
Use a combination of data.frame, do.call, rbind, and t (transpose) to get the data into the desired data.frame format:
t(data.frame(do.call(rbind,lapply(dat, table))))
X1 X2 X3 X4
a 1 1 2 0
b 1 1 1 2
c 1 1 1 1
d 1 1 1 1
e 1 1 0 1
Or:
As ZheyuanLi pointed out, one can simply use sapply(dat, table).
Thanks also to jay.sf for showing how vapply works.
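The coercion mentioned in the summary can be seen directly. The sketch below uses a two-column version of dat and calls as.matrix() by hand, which is the conversion apply performs on a data frame before looping; m is an illustrative name.

```r
dat0 <- data.frame(X1 = 1:5, X3 = c(1, 1:4))
dat <- data.frame(lapply(dat0, factor, levels = 1:5, labels = letters[1:5]))

# The conversion apply() performs: factors become plain character strings
m <- as.matrix(dat)
is.character(m)

# On the character column, table() only sees the values that occur
table(m[, "X3"])

# On the factor column, table() still knows about the unused level "e"
table(dat$X3)
```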
I am organizing a large dataset adapted to my research. Suppose that I have 9 observations (records) and 4 columns as follows:
z <- data.frame("fa" = c(1, NA, NA, 2, 1, 1, 2, 1, 1),
"fb" = c(2, 2, NA, 1, NA, NA, NA, 1, 2),
"initial_1" = c("A", "B", "B", "B", "A", "C", "D", "B", "A"),
"initial_2" = c("D", "C", "C", "A", "B", "A", "A", "D", "D"))
I want to create two new columns, fa_new and fb_new, according to the values of the first two columns, fa and fb, which are linked to the reference columns, initial_1 and initial_2, such that fa == # matches initial_#.
For example, as can be seen above, the first record of the column fa is 1, which is linked to "A" of initial_1. Thus, the first record of the new column fa_new will be "A". Likewise, the first record of fb is 2, which is linked to "D" of initial_2; thus, the first record of fb_new will be "D".
Accordingly, my expectation is:
fa_new fb_new
1 A D
2 NA C
3 NA NA
4 A B
5 A NA
6 C NA
7 A NA
8 B B
9 A D
Is this possible using R?
You can use lapply to do this for multiple columns :
cols <- 1:2
init_cols <- paste0('initial_', cols)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z[new_cols] <- lapply(z[cols], function(x) z[init_cols][cbind(inds, x)])
z
# fa fb initial_1 initial_2 fa_new fb_new
#1 1 2 A D A D
#2 NA 2 B C <NA> C
#3 NA NA B C <NA> <NA>
#4 2 1 B A A B
#5 1 NA A B A <NA>
#6 1 NA C A C <NA>
#7 2 NA D A A <NA>
#8 1 1 B D B B
#9 1 2 A D A D
The logic here is that cbind builds a two-column index matrix of row and column numbers. The row numbers are inds (1:nrow(z)), and the column numbers come from the fa/fb columns; each (row, column) pair picks one cell from the initial_ columns of z. A row whose column number is NA yields NA.
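The same indexing can be tried on a plain matrix. This is only a toy sketch with made-up values (m, idx, and picked are illustrative names), showing that each row of the index matrix selects a single cell.

```r
# 3 x 2 character matrix: first column A, B, C; second column D, E, F
m <- matrix(c("A", "B", "C", "D", "E", "F"), nrow = 3)

# Each row of idx is one (row, column) pair; the NA stands for a missing fa/fb value
idx <- cbind(1:3, c(1, 2, NA))

# Picks m[1, 1] and m[2, 2]; the pair containing NA gives NA
picked <- m[idx]
picked
```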
The actual dataframe is labelled dataset; the following answer should work on the real data.
cols <- 1:2
init_cols <- paste0('fuinitials_', 1:94)
new_cols <- paste0(names(z)[cols], '_new')
inds <- 1:nrow(z)
z1 <- data.frame(z)
z1[cols][z1[cols] < 1] <- NA
z1[new_cols] <- lapply(z1[cols], function(x) z1[init_cols][cbind(inds, x)])
I have a set of the following form:-
a <- data.frame(X1=c("A", "B", "C", "D", "0"),
X2=c("B", "A", "D", "E", "A"),
X3=c("0", "0", "B", "A", "0"),
X4=c("A", "0", "A", "0", "0")
)
# a
# X1 X2 X3 X4
# A B 0 A
# B A 0 0
# C D B A
# D E A 0
# 0 A 0 0
What I want to know is how many items other than "0" there are in each row, and to save that count in a new column. The expected output should be :-
# b
# 3
# 2
# 4
# 3
# 1
Duplicates should be counted as different, i.e., if a row consists of 2 "A", 1 "B" and a "0", it should return 3. Thanks in advance.
We could compare the dataframe with 0 and use rowSums to calculate number of entries except 0 in each row.
rowSums(a != 0)
#[1] 3 2 4 3 1
Although it is not needed here (since applying rowSums is straightforward), we can also use apply row-wise:
apply(a!= 0 , 1, sum)
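Since the question asks for the counts in a new column, here is a minimal sketch of that last step, using the sample data from the question and the column name b from the expected output. The comparison is written against the string "0" because the cells are character values.

```r
a <- data.frame(X1 = c("A", "B", "C", "D", "0"),
                X2 = c("B", "A", "D", "E", "A"),
                X3 = c("0", "0", "B", "A", "0"),
                X4 = c("A", "0", "A", "0", "0"))

# a != "0" is a logical matrix; TRUE counts as 1 when summed across each row
# (only the four original columns exist at this point, so b is not counted)
a$b <- rowSums(a != "0")
a
```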
If each cell of data frame a holds a single character, here is a base R option. Otherwise (if some cells hold multiple characters), please use the approach by @Ronak Shah.
a$b <- nchar(gsub("0","",do.call(paste0,a)))
such that
> a
X1 X2 X3 X4 b
1 A B 0 A 3
2 B A 0 0 2
3 C D B A 4
4 D E A 0 3
5 0 A 0 0 1
We can use lengths with split
lengths(split(a[a!=0], row(a)[a != 0]))
Say I have the following data frame:
df <- data.frame(x1 = c(2, 2, 2, 1),
x2 = c(3, 3, 2, 1),
let = c("B", "A", "A", "A"))
df
x1 x2 let
1 2 3 B
2 2 3 A
3 2 2 A
4 1 1 A
If I want to order df by x1, then x2 then let, I do this:
df2 <- df[with(df, order(x1, x2, let)), ]
df2
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
However, x1 and x2 have actually been saved as an id <- c("x1", "x2") vector earlier in the code, which I use for other purposes.
So my problem is that I want to reference id instead of x1 and x2 in my order function, but unfortunately anything like df[order(df[id], df$let), ] will result in an argument lengths differ error.
From what I can tell (and this has been addressed at another SO thread), the problem is that length(df[id]) == 2 and length(df$let) == 4.
I have been able to make it through this workaround:
df3 <- df[order(df[, id[1]], df[, id[2]], df[, "let"]), ]
df3
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
But it looks ugly and depends on knowing the size of id.
Is there a more elegant solution to sorting my data frame by id then let?
I would suggest using do.call(order, ...) and combining id and "let" with c():
id <- c("x1", "x2")
df[do.call(order, df[c(id, "let")]), ]
# x1 x2 let
# 4 1 1 A
# 3 2 2 A
# 2 2 3 A
# 1 2 3 B
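To show what do.call is doing here, the sketch below compares it with the hand-written order() call on the same data. It also strips the column names before the call; that is an optional precaution I am adding, not part of the answer above, so that a column that happened to be called decreasing or method could not be taken for one of order's own arguments.

```r
df <- data.frame(x1 = c(2, 2, 2, 1),
                 x2 = c(3, 3, 2, 1),
                 let = c("B", "A", "A", "A"))
id <- c("x1", "x2")

# do.call() passes each list element as a separate argument to order()
keys <- unname(as.list(df[c(id, "let")]))
ord <- do.call(order, keys)

# The same call written out by hand
explicit <- order(df$x1, df$x2, df$let)

identical(ord, explicit)
df[ord, ]
```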
I am having a brain cramp. Below is a toy dataset:
df <- data.frame(
id = 1:6,
v1 = c("a", "a", "c", NA, "g", "h"),
v2 = c("z", "y", "a", NA, "a", "g"),
stringsAsFactors=FALSE)
I have a specific value that I want to find across a set of defined columns and I want to identify the position it is located in. The fields I am searching are characters and the trick is that the value I am looking for might not exist. In addition, null strings are also present in the dataset.
Assuming I knew how to do this, the variable position indicates the values I would like returned.
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
The general rule is that I want to find the position of value "a", and if it is not located or if v1 is missing, then I want 99 returned.
In this instance, I am searching across v1 and v2, but in reality, I have 10 different variables. It is also worth noting that the value I am searching for can only exist once across the 10 variables.
What is the best way to generate this recode?
Many thanks in advance.
Use match:
> df$position <- apply(df,1,function(x) match('a',x[-1], nomatch=99 ))
> df
id v1 v2 position
1 1 a z 1
2 2 a y 1
3 3 c a 2
4 4 <NA> <NA> 99
5 5 g a 2
6 6 h g 99
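A variation on the same idea, in case dropping the first element by position feels fragile: select the search columns by name before calling apply. This is only a sketch; search_cols is an illustrative name, and with the real data it would list all ten variables.

```r
df <- data.frame(
  id = 1:6,
  v1 = c("a", "a", "c", NA, "g", "h"),
  v2 = c("z", "y", "a", NA, "a", "g"),
  stringsAsFactors = FALSE)

# Assumption: these are the columns to search, in the order positions are reported
search_cols <- c("v1", "v2")

# For each row, match() returns the position of the first exact "a", or 99 if none
df$position <- apply(df[search_cols], 1, match, x = "a", nomatch = 99L)
df
```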
Firstly, drop the first column:
df <- df[, -1]
Then, do something like this (disclaimer: I'm feeling terribly sleepy*):
( df$result <- unlist(lapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))) )
v1 v2 result
1 a z 1
2 a y 1
3 c a 2
4 <NA> <NA> 99
5 g a 2
6 h g 99
* sleepy = code is not vectorised
EDIT (slightly different solution, I still feel sleepy):
df$result <- rapply(apply(df, 1, grep, pattern = "a"), function(x) ifelse(length(x) == 0, 99, x))
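One difference from the match approach is worth keeping in mind: grep does pattern matching, so "a" also hits longer strings that merely contain an a. The toy values below are made up to show this; anchoring the pattern as "^a$" is one way to ask for an exact match.

```r
# Hypothetical row in which one value only contains the letter "a"
row_values <- c(v1 = "banana", v2 = "a")

# Pattern match: both elements contain "a"
grep_hits <- grep("a", row_values)

# Exact match: only the second element equals "a"
exact_hit <- match("a", row_values)

# Anchored pattern: behaves like the exact match here
anchored_hits <- grep("^a$", row_values)
```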