Sum row on duplicated element in a column [R] [duplicate]

I have a dataframe such as:
Names COLA COLB COLC
sp1_A 1 0 1
sp1_A 1 0 0
sp1_B 0 1 1
sp2_A 0 0 1
sp2_A 0 1 1
sp2_A 0 0 1
For each value of Names, I would like to sum the rows column by column.
I should then get:
Names COLA COLB COLC
sp1_A 2 0 1
sp1_B 0 1 1
sp2_A 0 1 3
Here is the dput output of the dataframe:
structure(list(Names = c("sp1_A", "sp1_A", "sp1_B", "sp2_A",
"sp2_A", "sp2_A"), COLA = c(1L, 1L, 0L, 0L, 0L, 0L), COLB = c(0L,
0L, 1L, 0L, 1L, 0L), COLC = c(1, 0, 1, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-6L))
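A minimal base R sketch of one way to get these grouped sums, assuming the data above is stored in an object called `df` (the name is an assumption; the question does not assign the dput output to anything). It uses `aggregate()` with a formula, and `rowsum()` as an alternative:

```r
# Data from the question, assigned to `df` (the name is an assumption)
df <- data.frame(
  Names = c("sp1_A", "sp1_A", "sp1_B", "sp2_A", "sp2_A", "sp2_A"),
  COLA  = c(1L, 1L, 0L, 0L, 0L, 0L),
  COLB  = c(0L, 0L, 1L, 0L, 1L, 0L),
  COLC  = c(1, 0, 1, 1, 1, 1)
)

# Sum every non-grouping column within each value of Names
res <- aggregate(. ~ Names, data = df, FUN = sum)
res
#   Names COLA COLB COLC
# 1 sp1_A    2    0    1
# 2 sp1_B    0    1    1
# 3 sp2_A    0    1    3

# rowsum() computes the same sums, with the group labels as row names
rowsum(df[-1], group = df$Names)
```

`aggregate()` keeps Names as a regular column, which matches the layout asked for; `rowsum()` moves it into the row names instead.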

How to group a numeric variable in r? [duplicate]

I have data with a numeric variable A. I want to bin A into groups to get something like B.
data <- structure(list(A = c(0, 0, 0, 0, 1, 2, 9, 15, 30, 100, 0.2, 0.003,
95, 18), B = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 10L, 1L, 1L,
10L, 2L)), class = "data.frame", row.names = c(NA, -14L))
Are you trying to create B from A? It looks like you want something like
data$A %/% 10
[1] 0 0 0 0 0 0 0 1 3 10 0 0 9 1
or
(data$A %/% 10)+1
[1] 1 1 1 1 1 1 1 2 4 11 1 1 10 2
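Note that the second variant gives 11 for A = 100, whereas B is 10 there. A hedged sketch of two ways to reproduce B exactly, assuming the intended rule is bins of width 10 that are closed on the left, with everything from 90 upward falling in the tenth bin (that rule is inferred from the sample data, not stated in the question):

```r
data <- data.frame(
  A = c(0, 0, 0, 0, 1, 2, 9, 15, 30, 100, 0.2, 0.003, 95, 18),
  B = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 4L, 10L, 1L, 1L, 10L, 2L)
)

# Left-closed bins [0,10), [10,20), ..., [80,90), [90,Inf); the open-ended
# top bin is an assumption inferred from B being 10 for both 95 and 100
breaks <- c(seq(0, 90, by = 10), Inf)
data$B_cut <- cut(data$A, breaks = breaks, right = FALSE, labels = FALSE)

# Same result by capping the integer-division answer at 10
data$B_cap <- pmin(data$A %/% 10 + 1, 10)

all(data$B_cut == data$B)  # TRUE
all(data$B_cap == data$B)  # TRUE
```

`cut()` is the more flexible choice if the bins are ever unequal in width; the `%/%` version only works for evenly spaced bins.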

Subset a matrix in R

I have a matrix of bundles and I would like to subset it based on the column sum (a budget) and the first value. If the first value in a column is 0 and I could add a value (2 in the code below) to that column and still be under the budget, I would like to drop the column.
For example, if my budget is 10 (column sum) and my matrix looks like this:
  col1 col2 col3 col4
1    2    2    0    0
2    3    3    3    3
3    0    0    2    0
4    4    0    4    0
I would like the end matrix to look like this because the 0 in col4 row 1 could be included and the column sum would be under 10:
  col1 col2 col3
1    2    2    0
2    3    3    3
3    0    0    2
4    4    0    4
My code is currently:
for (i in 1:ncol(df)) {
  if (df[1,i]==0) {
    df<-df[,which(colSums(df)+2>10)]
  }
}
The code is not working because it also removes column 2. I don't think it is considering the if statement when subsetting the matrix.
Thanks.
Similar to the solution by @akrun, but I think the following subsetting approach is already enough
> df[,head(df,1)!=0 | colSums(df)+2>10]
  col1 col2 col3
1    2    2    0
2    3    3    3
3    0    0    2
4    4    0    4
DATA
df <- structure(list(col1 = c(2L, 3L, 0L, 4L), col2 = c(2L, 3L, 0L,
0L), col3 = c(0L, 3L, 2L, 4L), col4 = c(0L, 3L, 0L, 0L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
One option is to build the condition from colSums and the value in the first row, then use it to subset the columns. colSums is vectorized, so this should be more efficient than a loop
bids <- 2
df1[which(!(df1[1,] == 0 & (colSums(df1) + bids) < 10))]
# col1 col2 col3
#1 2 2 0
#2 3 3 3
#3 0 0 2
#4 4 0 4
Or use a for loop, iterating in reverse so that removing a column does not shift the indices still to be visited
for(i in rev(seq_along(df1))) if(df1[1, i] == 0 && sum(df1[[i]]) + bids < 10) df1[[i]] <- NULL
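The column test used in the answers above can also be wrapped in a small function so the budget and the added value become parameters. This is only a sketch: `drop_affordable_cols`, `budget` and `bid` are hypothetical names, and it follows the strict `< budget` comparison from the vectorized answer.

```r
# Hypothetical helper (not from the original answers): drop every column
# whose first value is 0 and whose sum plus `bid` stays under `budget`
drop_affordable_cols <- function(x, budget = 10, bid = 2) {
  first_is_zero <- unlist(x[1, ]) == 0      # works for data frames and matrices
  under_budget  <- colSums(x) + bid < budget
  x[, !(first_is_zero & under_budget), drop = FALSE]
}

df1 <- data.frame(
  col1 = c(2L, 3L, 0L, 4L), col2 = c(2L, 3L, 0L, 0L),
  col3 = c(0L, 3L, 2L, 4L), col4 = c(0L, 3L, 0L, 0L)
)

res <- drop_affordable_cols(df1)
names(res)
# [1] "col1" "col2" "col3"
```

Because both conditions are computed once on the untouched input, no column is tested against an already-modified object, which is what went wrong in the loop from the question.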
data
df1 <- structure(list(col1 = c(2L, 3L, 0L, 4L), col2 = c(2L, 3L, 0L,
0L), col3 = c(0L, 3L, 2L, 4L), col4 = c(0L, 3L, 0L, 0L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))

How can I make a column from some indicators?

I have 4 indicator columns and an ID column. How can I make a new column based on these indicators? The data looks like this:
ID. col1. col2. col3. col4.
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
In each row exactly one of the columns is 1 and the others are 0. The new column is 1 if col1 is 1, 2 if col2 is 1, 3 if col3 is 1 and 4 if col4 is 1.
So the output is
ID. col.
1 1
2 4
3 3
An option is max.col
cbind(df1[1], `col.` = max.col(df1[-1], "first"))
# ID. col.
#1 1 1
#2 2 4
#3 3 3
If a row contains no 1s, add a logical condition so that the row returns NA
df1[2, 5] <- 0
cbind(df1[1], `col.` = max.col(df1[-1], "first") * NA^ !rowSums(df1[-1] == 1))
data
df1 <- structure(list(ID. = 1:3, col1. = c(1L, 0L, 0L), col2. = c(0L,
0L, 0L), col3. = c(0L, 0L, 1L), col4. = c(0L, 1L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))
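The `NA^` trick above is compact but easy to misread. A more explicit sketch of the same idea, using `ifelse()` instead (the intermediate names `has_one` and `idx` are made up for illustration):

```r
df1 <- data.frame(
  ID.   = 1:3,
  col1. = c(1L, 0L, 0L), col2. = c(0L, 0L, 0L),
  col3. = c(0L, 0L, 1L), col4. = c(0L, 1L, 0L)
)
df1[2, 5] <- 0   # make row 2 all zeros, as in the answer above

# TRUE for rows that contain at least one 1
has_one <- rowSums(df1[-1] == 1) > 0

# Position of the first maximum in each row; meaningless for all-zero rows
idx <- max.col(df1[-1], ties.method = "first")

res <- cbind(df1[1], `col.` = ifelse(has_one, idx, NA_integer_))
res
#   ID. col.
# 1   1    1
# 2   2   NA
# 3   3    3
```

Without the guard, `max.col()` would report column 1 for the all-zero row, since every entry ties for the maximum.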

Find overlapping ranges in a dataframe and assign them values

This is a simpler version of a question I asked earlier, which nobody has answered yet.
I have a huge input file (a representative sample of which is shown below as input):
> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it by following a rule (described below) so that I get an output like:
> output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1
Rule:
Take every element of the dataframe (the first in this case is chr1:200-400) and see if it overlaps with any other value in the dataframe. For each column, write 1 if the element exists in or overlaps a value from that column, and 0 if not.
For example, take the first element of the input, input[1,1], which is chr1:200-400. As it exists in column 1, we write 1 below CT1. Next we check whether this range overlaps with any range in the other columns of the input. It overlaps only with the first value (chr1:250-450) of the second column (CT2), so we write 1 below CT2 as well. As there is no overlap with any of the values in CT3, we write 0 below CT3 in the output dataframe.
Here is the dput of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
A possible solution using the data.table-package:
# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)
# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
                                      , by = variable][]
# create aggregation function; needed in the last reshape step
f <- function(x) as.integer(length(x) > 0)
# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
   ][, dcast(.SD, value ~ i.variable, fun = f)]
which gives:
           value CT1 CT2 CT3
1:  chr1:200-400   1   1   0
2:  chr1:250-450   1   1   1
3:  chr1:400-800   0   1   1
4:  chr1:700-870   1   0   1
5:  chr1:800-970   1   0   1
6:  chr2:200-500   1   1   0
7:  chr2:300-700   1   1   0
8: chr2:600-1000   1   1   1
9: chr2:700-1400   0   1   1
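For readers who want to see the overlap rule spelled out without data.table, here is a base R sketch of the same logic. It is an illustration under assumptions, not the answerer's method: it compares all pairs with `outer()`, so it is quadratic in the number of ranges and unlikely to suit the "huge input file" mentioned in the question. As in the join above, ranges that only touch at an endpoint are not counted as overlapping.

```r
# Input from the question, as plain character columns
input <- data.frame(
  CT1 = c("chr1:200-400", "chr1:800-970", "chr2:300-700"),
  CT2 = c("chr1:250-450", "chr2:200-500", "chr2:600-1000"),
  CT3 = c("chr1:400-800", "chr1:700-870", "chr2:700-1400"),
  stringsAsFactors = FALSE
)

# Long format: one entry per range, remembering its source column
value    <- unlist(input, use.names = FALSE)
variable <- rep(names(input), each = nrow(input))

# Split "chr:low-high" into its three parts
parts <- strsplit(value, "[:-]")
chr   <- vapply(parts, `[`, character(1), 1)
low   <- as.integer(vapply(parts, `[`, character(1), 2))
high  <- as.integer(vapply(parts, `[`, character(1), 3))

# ov[i, j] is TRUE when range i and range j share a chromosome and
# overlap strictly (a range always overlaps itself)
idx <- seq_along(value)
ov  <- outer(idx, idx, function(i, j) {
  chr[i] == chr[j] & low[i] < high[j] & high[i] > low[j]
})

# For each column, flag ranges that overlap at least one range from it
res <- sapply(names(input), function(v) {
  as.integer(rowSums(ov[, variable == v, drop = FALSE]) > 0)
})
rownames(res) <- value
res
```

The rows come out in the order CT1, CT2, CT3 of the input, which is the row order of the output table in the question.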

How to count the number of combinations of boolean data in R

What is the best way to determine a factor or create a new category field based on a number of boolean fields? In this example, I need to count the number of unique combinations of medications.
> MultPsychMeds
   ID OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE
1   A          1           1          0           0
2   B          1           0          1           0
3   C          1           0          1           0
4   D          1           0          1           0
5   E          1           0          0           1
6   F          1           0          0           1
7   G          1           0          0           1
8   H          1           0          0           1
9   I          0           1          1           0
10  J          0           1          1           0
Perhaps another way to state it is that I need to pivot or cross tabulate the pairs. The final results need to look something like:
Combination Count
OLANZAPINE/HALOPERIDOL 1
OLANZAPINE/QUETIAPINE 3
OLANZAPINE/RISPERIDONE 4
HALOPERIDOL/QUETIAPINE 2
This data frame can be replicated in R with:
MultPsychMeds <- structure(list(ID = structure(1:10, .Label = c("A", "B", "C",
"D", "E", "F", "G", "H", "I", "J"), class = "factor"), OLANZAPINE = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), HALOPERIDOL = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), QUETIAPINE = c(0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 1L, 1L), RISPERIDONE = c(0L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 0L)), .Names = c("ID", "OLANZAPINE", "HALOPERIDOL",
"QUETIAPINE", "RISPERIDONE"), class = "data.frame", row.names = c(NA,
-10L))
Here's one approach using the reshape and plyr packages:
library(reshape)
library(plyr)
#Melt into long format
dat.m <- melt(MultPsychMeds, id.vars = "ID")
#Group at the ID level and paste the drugs together with "/"
out <- ddply(dat.m, "ID", summarize, combos = paste(variable[value == 1], collapse = "/"))
#Calculate a table
with(out, count(combos))
                       x freq
1 HALOPERIDOL/QUETIAPINE    2
2 OLANZAPINE/HALOPERIDOL    1
3  OLANZAPINE/QUETIAPINE    3
4 OLANZAPINE/RISPERIDONE    4
Just for fun, a base R solution (that can be turned into a oneliner :-) ):
data.frame(table(apply(MultPsychMeds[,-1], 1, function(currow){
  wc<-which(currow==1)
  paste(colnames(MultPsychMeds)[wc+1], collapse="/")
})))
Another way could be:
subset(
  as.data.frame(
    with(MultPsychMeds, table(OLANZAPINE, HALOPERIDOL, QUETIAPINE, RISPERIDONE)),
    responseName="count"
  ),
  count>0
)
which gives
   OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE count
4           1           1          0           0     1
6           1           0          1           0     3
7           0           1          1           0     2
10          1           0          0           1     4
It's not exactly the format you want, but it is fast and simple.
There is a shorthand in the plyr package:
require(plyr)
count(MultPsychMeds, c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE"))
# OLANZAPINE HALOPERIDOL QUETIAPINE RISPERIDONE freq
# 1 0 1 1 0 2
# 2 1 0 0 1 4
# 3 1 0 1 0 3
# 4 1 1 0 0 1
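The wide count tables above can be collapsed into the Combination/Count layout from the question with a little base R. A minimal sketch, assuming the four medication columns are the only indicator columns (the helper column `n` and the vector `meds` are names introduced here for illustration):

```r
MultPsychMeds <- data.frame(
  ID          = LETTERS[1:10],
  OLANZAPINE  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L),
  HALOPERIDOL = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L),
  QUETIAPINE  = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L),
  RISPERIDONE = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L)
)
meds <- c("OLANZAPINE", "HALOPERIDOL", "QUETIAPINE", "RISPERIDONE")

# Count rows per distinct 0/1 pattern by summing a column of ones
counts <- aggregate(n ~ ., data = cbind(MultPsychMeds[meds], n = 1L), FUN = sum)

# Turn each 0/1 pattern into a label such as "OLANZAPINE/QUETIAPINE"
counts$Combination <- apply(counts[meds] == 1, 1, function(r) {
  paste(meds[r], collapse = "/")
})

counts[c("Combination", "n")]
```

The labels follow the column order in `meds`, so the same pair always gets the same name regardless of row order in the data.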
