I would like to compute an id variable based on the unique combination of two (or more) variables. Consider the simple example below:
# Example dataframe
mydf <- data.frame(var1 = LETTERS[c(1, 2, 1)], var2 = LETTERS[c(2, 1, 3)])
mydf
# var1 var2
# A B
# B A
# A C
Here, rows 1 and 2 should have the same id because AB and BA represent a combination of the same elements. Row 3, however, has a different id since the AC combination appears only once.
# Desired output
cbind(mydf, cid = c(1, 1, 2))
# var1 var2 cid
# A B 1
# B A 1
# A C 2
Any suggestion?
We can sort the values within each row, flag the first occurrence of each sorted row with !duplicated, and take the cumsum of that logical vector
cbind(mydf, cid = cumsum(!duplicated(t(apply(mydf, 1, sort)))))
You could benefit from factor type in base R for that:
mydf$cid <- as.numeric(factor(apply(mydf,1,function(x) paste0(sort(x), collapse = ""))))
This disregards the order in which equivalent rows appear in the data frame. The cumsum approach breaks if, for example, rows 2 and 3 are switched in your data frame.
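As a minimal sketch of an id that does not depend on row order, one could build a key from the sorted values of each row and number the keys by first appearance with match. The names key and cid_match, and the "|" separator, are illustrative assumptions; the separator is assumed not to occur in the data.

```r
# Example data with rows 2 and 3 switched, where the cumsum approach fails
mydf <- data.frame(var1 = c("A", "A", "B"), var2 = c("B", "C", "A"))

# Build an order-independent key for each row (assumes "|" never occurs in the values)
key <- apply(mydf, 1, function(x) paste(sort(x), collapse = "|"))

# Number each distinct key by the position of its first appearance
mydf$cid_match <- match(key, unique(key))
mydf$cid_match
```

Unlike the factor approach, which numbers the combinations in the sorted order of the factor levels, this numbers them in order of first appearance.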
I would like to expand a dataframe based on all pairwise combinations of one variable while keeping the associated value of a second variable. For example:
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2)
I would like to return:
Var1 Var2 Var3 Var4
a a 1 1
b a 2 1
a b 1 2
b b 2 2
I can use expand.grid(df$V1, df$V1) to get all of the pairs, but I'm not sure how to include the second variable without having its values expanded also.
If we need to expand each column separately, then we can do this with Map where the arguments are two 'df' objects
do.call(cbind, Map(expand.grid, df, df))
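Another way to look at this, as a rough sketch rather than a definitive solution, is to expand the row indices instead of the values, and then use those indices to pull both columns. The names idx and res are illustrative; the output column names follow the desired output in the question.

```r
V1 <- letters[1:2]
V2 <- 1:2
df <- data.frame(V1, V2)

# All pairs of row indices; the first index varies fastest
idx <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))

# Look up both variables through the same index pairs,
# so each V2 value stays attached to its V1 value
res <- data.frame(Var1 = df$V1[idx$i], Var2 = df$V1[idx$j],
                  Var3 = df$V2[idx$i], Var4 = df$V2[idx$j])
res
```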
Consider the following replicable data frame:
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0,0,1,1,0,0,1,1,1,0,0,0,0,0,1)
data <- as.data.frame(cbind(col1, col2))
Now data is a 15x2 data frame. I want to count how many zeros there are in col2, but only for the rows where col1 is a. I use table():
table <- table(data$col2[data$col1=="a"])
table[names(table)==0]
This works just fine and result is 3.
But my real data has 100,000 observations with 12 different values of such col1 so I want to make a function so I don't have to type the above lines of code 12 times.
countzero <- function(row){
table <- table(data$col2[data$col1=="row"])
result <- table[names(table)==0]
return(result)
}
I expected that when I run countzero(row = a) it will return 3 as well but instead it returns 0, and also 0 for b and c.
For my real data, it returns
numeric(0)
and I have no idea why.
Could anyone help me out, please?
EDIT: To all the answers showing me how to count in total how many zeros for each value of col1, it works all fine, but my purpose is to build a function that returns only the count of one specific col1 value, e.g. just the a's, because that count will be used later to compute other stuff (the percent of 0's in all a's, e.g.)
1) aggregate Try aggregate:
aggregate(col2 == 0 ~ col1, data, sum)
giving:
col1 col2 == 0
1 a 3
2 b 2
3 c 4
2) table or try table (omit the [,1] if you want the counts of 1's too):
table(data)[, 1]
giving:
a b c
3 2 4
We can use data.table which would be efficient
library(data.table)
setDT(data)[col2==0, .N, col1]
# col1 N
#1: a 3
#2: b 2
#3: c 4
Or with dplyr
library(dplyr)
data %>%
filter(col2==0) %>%
count(col1)
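For the function asked about in the EDIT, note that the original countzero compares col1 with the literal string "row" rather than with its argument, so no rows ever match. A minimal sketch of a corrected version follows; the parameter names value and df, and the helper propzero, are illustrative assumptions.

```r
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1)
data <- data.frame(col1, col2)

# Count the zeros in col2 among the rows where col1 equals `value`
countzero <- function(value, df = data) {
  sum(df$col1 == value & df$col2 == 0)
}

# Share of zeros among the rows where col1 equals `value`
propzero <- function(value, df = data) {
  mean(df$col2[df$col1 == value] == 0)
}

countzero("a")
propzero("a")
```

The value has to be passed as a string, countzero("a"), not as the bare name a.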
I want to build a matrix or data frame by choosing the names of the columns where the element in the data frame does not contain an NA. For example, suppose I have:
zz <- data.frame(a = c(1, NA, 3, 5),
b = c(NA, 5, 4, NA),
c = c(5, 6, NA, 8))
which gives:
a b c
1 1 NA 5
2 NA 5 6
3 3 4 NA
4 5 NA 8
I want to recognize each NA and build a new matrix or df that looks like:
a c
b c
a b
a c
There will be the same number of NAs in each row of the input matrix/df. I can't seem to get the right code to do this. Suggestions appreciated!
library(dplyr)
library(tidyr)
zz %>%
mutate(k = row_number()) %>%
gather(column, value, a, b, c) %>%
filter(!is.na(value)) %>%
group_by(k) %>%
summarise(temp_var = paste(column, collapse = " ")) %>%
separate(temp_var, into = c("var1", "var2"))
# A tibble: 4 × 3
k var1 var2
* <int> <chr> <chr>
1 1 a c
2 2 b c
3 3 a b
4 4 a c
Here's a possible vectorized base R approach
indx <- which(!is.na(zz), arr.ind = TRUE)
matrix(names(zz)[indx[order(indx[, "row"]), "col"]], ncol = 2, byrow = TRUE)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "c"
#[3,] "a" "b"
#[4,] "a" "c"
This finds non-NA indices, sorts by rows order and then subsets the names of your zz data set according to the sorted index. You can wrap it into as.data.frame if you prefer it over a matrix.
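The steps described above can be sketched one at a time, with the result wrapped into a data frame. The names ord, m and out, and the column names var1 and var2, are illustrative assumptions.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# Step 1: row/column positions of all non-NA cells (in column-major order)
indx <- which(!is.na(zz), arr.ind = TRUE)

# Step 2: reorder the positions so that cells of the same row are adjacent
ord <- indx[order(indx[, "row"]), , drop = FALSE]

# Step 3: map column positions to names and fill a two-column matrix row by row
m <- matrix(names(zz)[ord[, "col"]], ncol = 2, byrow = TRUE)

# Optional: convert to a data frame with illustrative column names
out <- setNames(as.data.frame(m), c("var1", "var2"))
out
```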
EDIT: transpose the data frame once before processing, so there is no need to transpose twice inside the loop as in the first version.
cols <- names(zz)
for (column in cols) {
zz[[column]] <- ifelse(is.na(zz[[column]]), NA, column)
}
t_zz <- t(zz)
cols <- vector("list", length = ncol(t_zz))
for (i in 1:ncol(t_zz)) {
cols[[i]] <- na.omit(t_zz[, i])
}
new_dt <- as.data.frame(t(do.call("cbind", cols)))
The tricky part here is that your goal actually changes the data frame structure, so the result of "remove the NA in each row" has to be built row by row as a new data frame, since the values in each row can come from different columns of the original data frame.
zz[1, ] is a one-row data frame; use t to convert it into a vector so we can use na.omit, then transpose back to a row.
I used two for loops, but for loops are not necessarily bad in R. The first one is vectorized within each column. The second one needs to be done row by row anyway.
EDIT: growing objects performs very badly in R. I knew I could use rbindlist from data.table, which takes a list of data frames, but the OP doesn't want new packages. My first attempt just used rbind, which could not take a list as input. Later I found that an alternative is to use do.call. It's still slower than rbindlist, though.
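To illustrate the point about not growing objects, here is a minimal sketch that collects the per-row results in a list and binds them once with do.call. It is a compact variant, not the code above, and the names rows and new_df are illustrative.

```r
zz <- data.frame(a = c(1, NA, 3, 5),
                 b = c(NA, 5, 4, NA),
                 c = c(5, 6, NA, 8))

# One list element per row: the names of the columns that are not NA in that row
rows <- lapply(seq_len(nrow(zz)), function(i) names(zz)[!is.na(zz[i, ])])

# Bind all elements in a single call instead of calling rbind inside a loop
new_df <- as.data.frame(do.call(rbind, rows))
new_df
```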
I have a data frame, df2, containing observations grouped by an ID factor, which I would like to subset. I have used another function to identify which row within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level given in ID, not within the whole data frame df2. I'm looking for a way to select the row for each ID according to that index (that is, its row number within each factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns the results in a list, so they must be bound together with rbind at the end. Each subset of the data (one per value of ID) is reduced to the row given by pos, and the "pos" column is dropped using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
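For comparison, a base R sketch without merging: split df2 by ID, then pick the requested row from each group. It assumes every ID in df also occurs in df2; the names groups and picked are illustrative.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 pos = c(1, 3, 2))
df2 <- data.frame(ID = c(rep("A", 5), rep("B", 5), rep("C", 5)),
                  obs = c(1:15))

# One data frame per ID, arranged in the order in which the IDs appear in df
groups <- split(df2, df2$ID)[as.character(df$ID)]

# Take row number `p` within each group, keeping the result a data frame
picked <- Map(function(d, p) d[p, , drop = FALSE], groups, df$pos)

df3 <- do.call(rbind, picked)
rownames(df3) <- NULL
df3
```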
I have a data.frame with 2 columns. I want the script to return the value of observations if I provide the value ID. The values in ID are unique.
ID = c("A","B","C","D")
observations = c(3,4,3,2)
d = data.frame(ID, observations)
ID observations
1 A 3
2 B 4
3 C 3
4 D 2
I'd like to access the data frame in a way that it returns me the value of the column observations if I provide the respective ID for the row. (Keep in mind that every ID occurs only in one row).
So for example if I provide the ID = A, it returns 3.
Likewise, if ID == B, it returns 4.
Another option using dplyr
require(dplyr)
ID = c("A","B","C","D")
observations = c(3,4,3,2)
d = data.frame(ID, observations)
d %>%
filter(ID == "D") %>%
select(observations)
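In base R, a minimal sketch of the same lookup could use logical indexing, which returns the bare value rather than a one-column data frame. The helper get_obs and the named vector lookup are hypothetical names introduced here for illustration.

```r
ID <- c("A", "B", "C", "D")
observations <- c(3, 4, 3, 2)
d <- data.frame(ID, observations)

# Return the observations value for a given ID (IDs are assumed unique)
get_obs <- function(id, data = d) {
  data$observations[data$ID == id]
}

get_obs("A")
get_obs("B")

# Alternative: a named vector gives lookups by name
lookup <- setNames(d$observations, as.character(d$ID))
lookup[["D"]]
```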