I have a data frame in the following format
  1 2 a b c
1 a b 0 0 0
2 b   0 0 0
3 c   0 0 0
I want to fill columns a through c with a 1/0 (TRUE/FALSE) flag that says whether the column name appears in column 1 or 2:
  1 2 a b c
1 a b 1 1 0
2 b   0 1 0
3 c   0 0 1
I have a dataset of about 530,000 records, 4 description columns, and 95 output columns, so a nested for loop is impractical. I tried code in the following format, but it was too slow:
for (i in 3:5) {
  for (j in 1:3) {
    for (k in 1:2) {
      if (df[j, k] == colnames(df)[i]) df[j, i] <- 1
    }
  }
}
Is there an easier, more efficient way to achieve the same output?
Thanks in advance!
One option is mtabulate from qdapTools
library(qdapTools)
df1[-(1:2)] <- mtabulate(as.data.frame(t(df1[1:2])))[-3]
df1
# 1 2 a b c
#1 a b 1 1 0
#2 b 0 1 0
#3 c 0 0 1
Or we melt the dataset after converting to matrix, use table to get the frequencies, and assign the output to the columns that are numeric.
library(reshape2)
df1[-(1:2)] <- table(melt(as.matrix(df1[1:2]))[-2])[,-1]
Or we can 'paste' the first two columns and use cSplit_e to get the binary format.
library(splitstackshape)
cbind(df1[1:2], cSplit_e(as.data.table(do.call(paste, df1[1:2])),
'V1', ' ', type='character', fill=0, drop=TRUE))
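A base R sketch of the same idea, without extra packages. This is not one of the approaches above, just an illustration: it assumes the description columns are the first two columns and that every remaining column is an output column named after a value to look for. The names desc and targets are made up for the example.

```r
# Sample data matching the question (check.names = FALSE keeps the names "1" and "2")
df1 <- data.frame(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
                  a = 0L, b = 0L, c = 0L,
                  check.names = FALSE, stringsAsFactors = FALSE)

desc    <- as.matrix(df1[1:2])   # description columns as a character matrix
targets <- names(df1)[-(1:2)]    # output columns: "a", "b", "c"

# For each output column, flag rows where any description column equals its name
df1[targets] <- lapply(targets, function(nm) as.integer(rowSums(desc == nm) > 0))
df1
```

Each output column costs one vectorised comparison over the description matrix, so the only explicit loop runs over the output columns (95 in the question) rather than over the rows.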
data
df1 <- structure(list(`1` = c("a", "b", "c"), `2` = c("b", "", ""),
a = c(0L, 0L, 0L), b = c(0L, 0L, 0L), c = c(0L, 0L, 0L)), .Names = c("1",
"2", "a", "b", "c"), class = "data.frame", row.names = c("1",
"2", "3"))
I have two data frames that I want to merge so that, for the columns they share, the values of one data frame are added to the existing values of the other, matched on the No column.
For example:
df1
No A B C D
1  1 0 1 0
2  0 1 2 1
3  0 0 1 0
df2
No A B E F
1  1 0 1 1
2  0 1 2 1
3  2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1  2 0 1 0 1 1
2  0 2 2 1 2 1
3  2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
left_join(df2, by = "No") %>%
mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(No) %>%
summarise(across(c(A, B, C, D, E, `F`), ~ sum(.x, na.rm = TRUE)),
  .groups = "drop")
If a particular column doesn't exist in one data frame (here C and D are missing from df2, and E and F from df1), bind_rows() pads it with NA. Passing na.rm = TRUE to sum() means those NA values are treated like zeros. The call is written as a lambda because dplyr 1.1.0 and later deprecate passing extra arguments such as na.rm through across().
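The NA padding can be seen step by step in base R. The following is only a sketch: pad() is a hypothetical helper written for this example, and rowsum() is used here in place of the dplyr grouping.

```r
df1 <- data.frame(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
                  C = c(1L, 2L, 1L), D = c(0L, 1L, 0L))
df2 <- data.frame(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
                  E = c(1L, 2L, 1L), F = c(1L, 1L, 0L))

# pad() is a hypothetical helper: add missing columns as NA, then fix the column order
pad <- function(d, cols) {
  d[setdiff(cols, names(d))] <- NA_integer_
  d[cols]
}

cols    <- union(names(df1), names(df2))
stacked <- rbind(pad(df1, cols), pad(df2, cols))  # 6 rows, NA where a column was absent

# Sum every column within each No; na.rm = TRUE drops the padded NA values
totals <- rowsum(stacked[-1], group = stacked$No, na.rm = TRUE)
totals
```

The row names of totals are the No values, so totals["1", ] holds the sums for No 1.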
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (R 4.1.0 or later, for the native pipe and lambda syntax). Collect the data frames in a list ('lst1'), then find the union of their column names ('nm1'). Loop over the list and, in each element, create zero-filled columns for the names it lacks (found with setdiff), rbind the elements, and use aggregate to get the sum grouped by 'No'.
lst1 <- mget(ls(pattern= '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
{\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
{\(x) do.call(rbind, x)}() |>
{\(dat) aggregate(.~ No, data = dat, FUN = sum, na.rm = TRUE,
na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0
I want to loop through a large data frame, counting how many values in the first column are > 0 and removing the rows that were counted, then moving on to column 2, counting the values > 0 among the remaining rows, removing those rows, and so on.
the data frame
taxonomy A B C
1 cat 0 2 0
2 dog 5 1 0
3 horse 3 0 0
4 mouse 0 0 4
5 frog 0 2 4
6 lion 0 0 2
can be generated with
DF1 = structure(list(taxonomy = c("cat", "dog","horse","mouse","frog", "lion"),
A = c(0L, 5L, 3L, 0L, 0L, 0L), B = c(2L, 1L, 0L, 0L, 2L, 0L), C = c(0L, 0L, 0L, 4L, 4L, 2L)),
.Names = c("taxonomy", "A", "B", "C"),
row.names = c(NA, -6L), class = "data.frame")
and I expect the outcome to be
A B C
count 2 2 2
I wrote this loop, but it does not remove the rows as it goes
res <- data.frame(DF1[1,], row.names = c('count'))
for(n in 1:ncol(DF1)) {
res[colnames(DF1)[n]] <- sum(DF1[n])
DF1[!DF1[n]==1]
}
it gives this incorrect result
A B C
count 2 3 3
You could do ...
DF = DF1[, -1]
cond = DF != 0
p = max.col(cond, ties="first")
fp = factor(p, levels = seq_along(DF), labels = names(DF))
table(fp)
# A B C
# 2 2 2
To account for rows that are all zeros, I think this works:
fp[rowSums(cond) == 0] <- NA
We can update the dataset in each run. Create a temporary dataset without the 'taxonomy' column ('tmp'). Initiate a named vector ('n'), loop through the columns of 'tmp', get a logical index based on whether the column is greater than 0 ('i1'), get the sum of TRUE values, update the 'n' for the corresponding column, then update the 'tmp' by removing those rows using 'i1' as row index
tmp <- DF1[-1]
n <- setNames(numeric(ncol(tmp)), names(tmp))
for(i in seq_len(ncol(tmp))) {
i1 <- tmp[[i]] > 0
n[i] <- sum(i1)
tmp <- tmp[!i1, ]}
n
# A B C
# 2 2 2
It can also be done with Reduce
sapply(Reduce(function(x, y) y[!x] > 0, DF1[3:4],
init = DF1[,2] > 0, accumulate = TRUE ), sum)
#[1] 2 2 2
Or using accumulate from purrr
library(purrr)
accumulate(DF1[3:4], ~ .y[!.x] > 0, .init = DF1[[2]] > 0) %>%
map_int(sum)
#[1] 2 2 2
This is easy with Reduce and sapply:
> first <- Reduce(function(a,b) b[a==0], df[-1], accumulate=TRUE)
> first
[[1]]
[1] 0 5 3 0 0 0
[[2]]
[1] 2 0 2 0
[[3]]
[1] 0 4 2
> then <- sapply(setNames(first, names(df[-1])), function(x) length(x[x>0]))
> then
A B C
2 2 2
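One caveat about the Reduce and accumulate variants above: from the third column on, they index a full-length column with a shorter logical vector, which R recycles. The printed third element, 0 4 2, has three values although only two rows (mouse and lion) remain at that point; the counts just happen to come out right for this sample. A sketch that keeps every vector at full length, with the names pos, seen, cum and counts made up for the example:

```r
DF1 <- data.frame(taxonomy = c("cat", "dog", "horse", "mouse", "frog", "lion"),
                  A = c(0L, 5L, 3L, 0L, 0L, 0L),
                  B = c(2L, 1L, 0L, 0L, 2L, 0L),
                  C = c(0L, 0L, 0L, 4L, 4L, 2L))

pos  <- lapply(DF1[-1], function(col) col > 0)   # per column: is the value positive?
seen <- Reduce(`|`, pos, accumulate = TRUE)      # seen[[k]]: positive in any of the first k columns
cum  <- vapply(seen, sum, integer(1))            # rows removed after processing column k

# Rows newly counted at each column = increase in the cumulative total
counts <- setNames(diff(c(0L, cum)), names(DF1)[-1])
counts
```

A row is counted at the first column where it is positive, which is the same rule as counting and then removing rows column by column.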
I have a dataset in R in long format. Each ID does not appear the same number of times (i.e. one ID might be one row, another might appear 79 rows).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable such that, if any of the rows for a given ID have V2 == 1, then 1 repeats for every row of that ID
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R we can use any - and ave for the grouping.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(V3 = max(V2))
If you want to check that V2 is exactly 1:
df %>%
group_by(ID) %>%
mutate(V3 = as.numeric(any(V2 == 1)))
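Another base R option is rowsum. The group labels are taken from the row names of the result, so this also works when the IDs are not consecutive integers starting at 1:
df$V3 <- with(df, {
  rs <- rowsum(V2, ID)   # one row per ID, named by ID
  +(ID %in% rownames(rs)[rs > 0])
})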
I have data set as follows:
A B C
R1 1 0 1
R2 0 1 0
R3 0 0 0
I want to add another column in data set named index such that it gives column names for each row where the column value is greater than zero. The result I want is as follows:
A B C Index
R1 1 0 1 A,C
R2 0 1 0 B
R3 0 0 0 NA
Here is one approach using base R:
Use apply to go over the rows, find the elements that are equal to one, and paste together the corresponding column names:
df$Index <- apply(df, 1, function(x) paste(colnames(df)[which(x == 1)], collapse = ", "))
df$Index <- creates a new column called Index that holds the result of the operation
apply - applies a function over the rows and/or columns of a matrix or data frame
1 - specifies that the function is applied to rows (2 means columns)
function(x) - an unnamed function, defined right after; x corresponds to each row
which(x == 1) - the positions of the elements of a row that are equal to 1 (x == 1 on its own gives TRUE/FALSE)
colnames(df) - the names of the columns of the data frame
colnames(df)[which(x == 1)] - subsets the column names at the positions returned by which(x == 1)
paste with collapse = ", " - collapses a character vector (in this case the vector of column names obtained above) into one string in which the elements are separated by ", "
Now replace empty entries with NA:
df$Index[df$Index == ""] <- NA_character_
Here is what the output looks like:
#output
sample A B C Index
1 R1 1 0 1 A, C
2 R2 0 1 0 B
3 R3 0 0 0 <NA>
data:
df <- structure(list(sample = structure(1:3, .Label = c("R1", "R2",
"R3"), class = "factor"), A = c(1L, 0L, 0L), B = c(0L, 1L, 0L
), C = c(1L, 0L, 0L)), .Names = c("sample", "A", "B", "C"), class = "data.frame", row.names = c(NA,
-3L))
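To make the steps in that explanation concrete, here is a sketch that walks through a single row by hand. It assumes a version of the data with R1 to R3 as row names and only the numeric columns; x, idx, nms and x3 are names made up for the example.

```r
df <- data.frame(A = c(1L, 0L, 0L), B = c(0L, 1L, 0L), C = c(1L, 0L, 0L),
                 row.names = c("R1", "R2", "R3"))

x   <- unlist(df["R1", ])      # one row as a named vector: A = 1, B = 0, C = 1
idx <- which(x == 1)           # positions of the matches: A = 1, C = 3
nms <- colnames(df)[idx]       # "A" "C"
paste(nms, collapse = ", ")    # "A, C"

# A row with no matches yields an empty string, hence the NA replacement step
x3 <- unlist(df["R3", ])
paste(colnames(df)[which(x3 == 1)], collapse = ", ")   # ""
```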
Slightly different flavored apply() solution:
df$index <- apply(df, 1, function(x) ifelse(any(x), toString(names(df)[x == 1]), NA))
A B C index
R1 1 0 1 A, C
R2 0 1 0 B
R3 0 0 0 <NA>
data:
df <- structure(
list(
A = c(1L, 0L, 0L),
B = c(0L, 1L, 0L),
C = c(1L, 0L, 0L)
),
row.names = paste0('R', 1:3),
class = "data.frame"
)
I am very new to R, and I sincerely appreciate your help.
The following is part of my data:
subjectID A B C D E F G H I J
S001 1 1 1 1 1 0 0
S002 1 1 1 0 0 0 0
I want to sum the rows from A to J, and so the data will look like this:
subjectID A B C D E F G H I J TOTAL
S001 1 1 1 1 1 0 0 5
S002 1 1 1 0 0 0 0 3
Thank you so much! I would like to sum only where variables A to J == 1.
As suggested, I post here my answers.
This is with apply. df[-1] excludes the first column (which is not numeric), and x[x == 1] subsets x (a single row, because of the 1 passed to apply) to only the values equal to 1.
df$TOTAL <- apply(df[-1], 1, function(x) sum(x[x == 1], na.rm = TRUE))
Another way in base R that is easier to code (and, I bet, much faster) is:
df$TOTAL <- rowSums(df[-1] == 1, na.rm = TRUE)
Both give this result:
df
subjectID A B C D E F G H I J TOTAL
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3
Data
df <- structure(list(subjectID = structure(1:2, .Label = c("S001",
"S002"), class = "factor"), A = c(1L, 1L), B = c(1L, 1L), C = c(1L,
1L), D = c(1L, 0L), E = c(1L, 0L), F = c(0L, 0L), G = c(0L, 0L
), H = c(NA, NA), I = c(NA, NA), J = c(NA, NA)), .Names = c("subjectID",
"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"), class = "data.frame", row.names = c(NA,
-2L))
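Counting the cells that equal 1 is not the same as adding up the values once a column can hold something other than 0 or 1. A small sketch of the difference, using a made-up data frame scores (the value 2 in column B is an assumption added for illustration and does not appear in the data above):

```r
scores <- data.frame(subjectID = c("S001", "S002"),
                     A = c(1L, 1L), B = c(2L, 0L), H = c(NA, NA))

is_one     <- scores[-1] == 1                             # logical matrix; NA where the value is missing
count_ones <- rowSums(is_one, na.rm = TRUE)               # counts the cells equal to 1
sum_values <- rowSums(scores[c("A", "B")], na.rm = TRUE)  # adds the values themselves

count_ones
sum_values
```

For S001 the count is 1 (only A equals 1) while the sum is 3, so the two approaches agree only on strictly 0/1 data.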
Another similar option to the one posted by SabDeM but using sapply to sum only numeric columns
df$Total <- rowSums(df[ ,sapply(df, is.numeric)])
Output:
subjectID A B C D E F G H I J Total
1 S001 1 1 1 1 1 0 0 NA NA NA 5
2 S002 1 1 1 0 0 0 0 NA NA NA 3