I am trying to create a summary table and having a mental hang up. Essentially, what I think I want is a summaryBy statement getting colSums for the subsets for ALL columns except the factor to summarize on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was plyr's ddply:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the returned values represent, but they are not the per-cluster colSums.
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
foo %>%
  group_by(Cluster) %>%
  summarise(across(everything(), sum))  # older dplyr: summarise_each(funs(sum))
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1
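As a side note, the ddply attempt in the question appears to return identical rows because its anonymous function sums GO_matrix_MF[dataColumns] (the whole data frame) rather than its argument x (the rows of one cluster). The following is a minimal base R sketch of that difference, using the modified data above without row names. Here split()/sapply() stand in for ddply, and the names data_cols, wrong, right and sums are only illustrative.

```r
# Modified example data from above (row names dropped for brevity)
foo <- data.frame(
  Cluster    = c(10L, 3L, 1L, 10L),
  GO.0003677 = c(11L, 0L, 1L, 5L),
  GO.0003700 = c(0L, 0L, 1L, 0L),
  GO.0046872 = c(0L, 9L, 0L, 0L),
  GO.0008270 = c(0L, 0L, 0L, 0L),
  GO.0043565 = c(0L, 0L, 0L, 0L),
  GO.0005524 = c(1L, 0L, 0L, 0L)
)
data_cols <- setdiff(names(foo), "Cluster")

# Mistake: the function ignores x, so every group gets the grand totals
wrong <- t(sapply(split(foo, foo$Cluster),
                  function(x) colSums(foo[data_cols])))

# Fix: sum only the rows handed to the function for this group
right <- t(sapply(split(foo, foo$Cluster),
                  function(x) colSums(x[data_cols])))

# Base R shortcut: column sums by group, one row per Cluster value
sums <- rowsum(foo[data_cols], group = foo$Cluster)
```

With plyr, the equivalent change would presumably be to call colSums(x[dataColumns]) inside the function passed to ddply.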
I have a dataset covering several diseases, where 0 indicates not having the disease and 1 indicates having it.
To illustrate with an example: I am interested in Disease A and whether the people in the dataset have this disease on its own or as a consequence of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but not any of the known diseases that can cause it.
"Secondary" means that they have disease A and at least one disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this result:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
Using data.table:
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][
  DiseaseA == 1,
  Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"),
  .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
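For comparison, here is a minimal base R sketch of the same logic, assuming df is a plain data.frame (not yet converted with setDT). It also suggests why the attempt in the question did not work: sum(names(DF) %in% SecondaryCauses) counts matching column names, which is the same number for every row, whereas rowSums() over the selected columns looks at each row's values. The helper name has_cause is only illustrative.

```r
df <- data.frame(
  ID       = 1:5,
  DiseaseA = c(0L, 1L, 1L, 1L, 0L),
  DiseaseB = c(1L, 0L, 0L, 0L, 0L),
  DiseaseC = c(0L, 0L, 1L, 1L, 0L),
  DiseaseD = c(0L, 0L, 1L, 1L, 0L),
  DiseaseE = c(0L, 1L, 0L, 1L, 0L)
)
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# Select the columns by name, then test row by row whether any of them is 1
has_cause <- rowSums(df[SecondaryCauses]) > 0

# Nested ifelse: first check disease A, then the possible causes
df$Type <- ifelse(df$DiseaseA == 0, "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))
df$Type
```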
I'm new to R programming and hope someone could help me with the situation below:
I have a data frame shown in the picture (Original Dataframe). For each [ID], I would like to return the first record (ordered by the [Date] column) that has a value >= 1 in any of the four columns (A, B, C, or D), plus all the records after it; the desired result should look like the Output Dataframe shown in the picture. Basically, remove all the records highlighted in yellow. I would greatly appreciate it if you could provide the R code to achieve this.
structure(list(ID = c(101L, 101L, 101L, 101L, 101L, 101L, 103L,
103L, 103L, 103L), Date = c(43338L, 43306L, 43232L, 43268L, 43183L,
43144L, 43310L, 43246L, 43264L, 43209L), A = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L), B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L), C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), D = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("ID", "Date",
"A", "B", "C", "D"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Here is a solution, starting from the data with Date stored as dd.mm.yyyy strings:
ID Date A B C D
1 101 26.08.2018 0 0 0 0
2 101 25.07.2018 0 2 0 0
3 101 12.05.2018 0 0 1 0
4 101 17.06.2018 0 0 0 0
5 101 24.03.2018 0 0 0 0
6 101 13.02.2018 0 0 0 0
7 103 29.07.2018 0 0 0 0
8 103 26.05.2018 1 1 0 0
9 103 13.06.2018 0 0 0 0
10 103 19.04.2018 0 0 0 0
data$Check <- rowSums(data[3:6])   # total of A, B, C and D per row
data$Date <- as.Date(data$Date, "%d.%m.%Y")
data <- data[order(data$ID, data$Date), ]
id <- unique(data$ID)
for (i in seq_along(id)) {
  data_sample <- data[data$ID == id[i], ]
  # keep everything from the first row with a non-zero value onwards
  data_sample <- data_sample[min(which(data_sample$Check > 0)):nrow(data_sample), ]
  if (i == 1) {
    final <- data_sample
  } else {
    final <- rbind(final, data_sample)
  }
}
final <- final[, -7]               # drop the helper column Check
ID Date A B C D
3 101 2018-05-12 0 0 1 0
4 101 2018-06-17 0 0 0 0
2 101 2018-07-25 0 2 0 0
1 101 2018-08-26 0 0 0 0
8 103 2018-05-26 1 1 0 0
9 103 2018-06-13 0 0 0 0
7 103 2018-07-29 0 0 0 0
Here's a tidyverse solution. The filter condition deserves some explanation:
First, we sort by ID and Date and group_by ID.
Then, for each ID (since we're grouped by ID), we apply the filter condition:
Test, for each row, whether any of the variables are > 0.
Get the row numbers of all rows (in the group) where this is the case.
Find the lowest one (since rows are sorted by Date, this will be the earliest).
Get the value of Date for that row.
Then keep the rows where Date is >= this value.
Since we're still grouping by ID, all these calculations will happen separately for each group:
library(dplyr)
df %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  filter(Date >= Date[min(which(A > 0 | B > 0 | C > 0 | D > 0))])
# A tibble: 7 x 6
# Groups: ID [2]
ID Date A B C D
<int> <int> <int> <int> <int> <int>
1 101 43232 0 0 1 0
2 101 43268 0 0 0 0
3 101 43306 0 2 0 0
4 101 43338 0 0 0 0
5 103 43246 1 1 0 0
6 103 43264 0 0 0 0
7 103 43310 0 0 0 0
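The same steps can also be sketched in base R, which may help to see what happens inside each group. This is only an illustration under the assumption that the data is a plain data.frame with the values from the question. It uses a running count of "hit" rows per ID instead of comparing dates, and the names value_cols, hit, keep and result are made up for this sketch.

```r
df <- data.frame(
  ID   = c(101L, 101L, 101L, 101L, 101L, 101L, 103L, 103L, 103L, 103L),
  Date = c(43338L, 43306L, 43232L, 43268L, 43183L, 43144L,
           43310L, 43246L, 43264L, 43209L),
  A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
  D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
)
value_cols <- c("A", "B", "C", "D")

# Sort by ID and Date so "first" means "earliest"
df <- df[order(df$ID, df$Date), ]

# TRUE for rows where any of A, B, C, D is >= 1
hit <- rowSums(df[value_cols] >= 1) > 0

# Running count of hits within each ID; rows before the first hit have count 0
keep <- ave(as.integer(hit), df$ID, FUN = cumsum) > 0

result <- df[keep, ]
result
```

Because the count stays at 0 for an ID that never has a value >= 1, such an ID is dropped entirely.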
I have a dataframe like this :
G2_ref G10_ref G12_ref G2_alt G10_alt G12_alt
20011953 3 6 0 5 1 5
12677336 0 0 0 1 3 6
20076754 0 3 0 12 16 8
2089670 0 4 0 1 11 9
9456633 0 2 0 3 10 0
468487 0 0 0 0 0 0
And I'm trying to sort the columns to have finally this column order :
G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
I tried : df[,order(colnames(df))]
But I had this order :
G10_alt G10_ref G12_alt G12_ref G2_alt G2_ref
If anyone has an idea, that would be great.
One option would be to extract the numeric part and the suffix at the end of each column name, and then order on both:
df[order(as.numeric(gsub("\\D+", "", names(df))),
factor(sub(".*_", "", names(df)), levels = c('ref', 'alt')))]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
#20011953 3 5 6 1 0 5
#12677336 0 1 0 3 0 6
#20076754 0 12 3 16 0 8
#2089670 0 1 4 11 0 9
#9456633 0 3 2 10 0 0
#468487 0 0 0 0 0 0
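To see why this works, it can help to look at the two sort keys separately. The sketch below uses only the column names from the question; the variable names nms, num, suffix and ord are illustrative. A plain order() on the names sorts them as text, which puts "G10" before "G2"; the numeric key avoids that, and the factor levels force "ref" before "alt" within each number.

```r
nms <- c("G2_ref", "G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt")

# Key 1: the digits in each name, compared as numbers (2 < 10 < 12)
num <- as.numeric(gsub("\\D+", "", nms))

# Key 2: the part after the underscore, with "ref" ranked before "alt"
suffix <- factor(sub(".*_", "", nms), levels = c("ref", "alt"))

# order() sorts by the first key and breaks ties with the second
ord <- order(num, suffix)
ord
nms[ord]
```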
data
df <- structure(list(G2_ref = c(3L, 0L, 0L, 0L, 0L, 0L), G10_ref = c(6L,
0L, 3L, 4L, 2L, 0L), G12_ref = c(0L, 0L, 0L, 0L, 0L, 0L), G2_alt = c(5L,
1L, 12L, 1L, 3L, 0L), G10_alt = c(1L, 3L, 16L, 11L, 10L, 0L),
G12_alt = c(5L, 6L, 8L, 9L, 0L, 0L)), .Names = c("G2_ref",
"G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt"),
class = "data.frame", row.names = c("20011953",
"12677336", "20076754", "2089670", "9456633", "468487"))
I am guessing your data is from genetics and has the pretty standard layout: first the columns with ref alleles for all variants, followed by the columns with alt alleles for all variants.
That means we can simply alternate column indices from the two halves of the data frame (called df1 here), i.e. build the index c(1, 4, 2, 5, 3, 6) and then subset:
ix <- c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))
ix
# [1] 1 4 2 5 3 6
df1[, ix]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
# 20011953 3 5 6 1 0 5
# 12677336 0 1 0 3 0 6
# 20076754 0 12 3 16 0 8
# 2089670 0 1 4 11 0 9
# 9456633 0 3 2 10 0 0
# 468487 0 0 0 0 0 0
# or all in one line
df1[, c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))]
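The interleaving trick relies on how R flattens a matrix. As a small sketch with a hypothetical count of 6 columns (the names n_cols, first_half, second_half, m and ix are illustrative): rbind() stacks the two index vectors as rows, and c() then reads the matrix column by column, which alternates between the two halves.

```r
n_cols <- 6                                # hypothetical number of columns
first_half  <- seq(1, n_cols / 2)          # 1 2 3 (ref columns)
second_half <- seq(n_cols / 2 + 1, n_cols) # 4 5 6 (alt columns)

# Two rows: first half on top, second half below
m <- rbind(first_half, second_half)

# c() flattens column-major: top of column 1, bottom of column 1, and so on
ix <- c(m)
ix
```

This assumes an even number of columns and that the ref and alt halves list the variants in the same order.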
An easy solution using dplyr:
library(dplyr)
df <- df %>%
select(G2_ref, G2_alt, G10_ref, G10_alt, G12_ref, G12_alt)
Perhaps this is less (complicated) code than @akrun's answer, but it is only really suitable when you want to order a small number of columns.
I have a dataset(nm) as shown below:
nm
2_V2O 10_Kutti 14_DD 15_TT 16_DD 19_V2O 20_Kutti
0 1 1 0 0 1 0
1 1 1 1 1 0 0
0 1 0 1 0 0 1
0 1 1 0 1 0 0
Now I want to split this into multiple new datasets, grouped by the suffix of their column names. Each dataset should also be named after that suffix, as shown below:
Kutti
10_Kutti 20_Kutti
1 0
1 0
1 1
1 0
V2O
2_V2O 19_V2O
0 1
1 0
0 0
0 0
DD
14_DD 16_DD
1 0
1 1
0 0
1 1
TT
15_TT
0
1
1
0
I know this can be done using the "select" function in dplyr, but I need a dynamic approach that builds this automatically for any dataset.
We can split by a substring of the column names of 'nm'. Remove the prefix of each column name up to the _ with sub, and use the result to split 'nm'.
lst <- split.default(nm, sub(".*_", "", names(nm)))
lst
#$DD
# 14_DD 16_DD
#1 1 0
#2 1 1
#3 0 0
#4 1 1
#$Kutti
# 10_Kutti 20_Kutti
#1 1 0
#2 1 0
#3 1 1
#4 1 0
#$TT
# 15_TT
#1 0
#2 1
#3 1
#4 0
#$V2O
# 2_V2O 19_V2O
#1 0 1
#2 1 0
#3 0 0
#4 0 0
It is better to keep the data.frames in a list. If you insist on individual data.frame objects in the global environment (not recommended), use list2env:
list2env(lst, envir = .GlobalEnv)
Now, just call
DD
data
nm <- structure(list(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L,
1L, 1L, 1L), `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L,
1L, 0L), `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L,
0L), `20_Kutti` = c(0L, 0L, 1L, 0L)), .Names = c("2_V2O", "10_Kutti",
"14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti"), class = "data.frame",
row.names = c(NA, -4L))
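As a small illustration of why keeping the pieces in a list is convenient, the sketch below rebuilds 'nm' and 'lst' and then works on all groups at once, without creating separate objects. The names n_cols_per_group and totals are only illustrative.

```r
nm <- data.frame(
  `2_V2O`    = c(0L, 1L, 0L, 0L),
  `10_Kutti` = c(1L, 1L, 1L, 1L),
  `14_DD`    = c(1L, 1L, 0L, 1L),
  `15_TT`    = c(0L, 1L, 1L, 0L),
  `16_DD`    = c(0L, 1L, 0L, 1L),
  `19_V2O`   = c(1L, 0L, 0L, 0L),
  `20_Kutti` = c(0L, 0L, 1L, 0L),
  check.names = FALSE   # keep the names that start with digits as they are
)

# Group columns by the part of the name after the underscore
lst <- split.default(nm, sub(".*_", "", names(nm)))

# One group can be pulled out by name
lst[["Kutti"]]

# Or the same operation can be applied to every group
n_cols_per_group <- sapply(lst, ncol)
totals <- lapply(lst, colSums)
```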
I'm really a beginner in R so, sorry if my code shocks you guys.
My data resembles something like this:
a b c d e f g h i j
t1 0 0 0 0 3 0 0 0 0 0
t2 0 0 0 0 0 6 0 0 0 0
t3 0 0 0 0 0 0 0 0 0 8
t4 0 0 0 0 0 0 0 0 9 0
For each row, I'd like to find the column with the maximum value and then take the columns from 3 before to 3 after it.
I wrote the following script to perform exactly that:
M<-c(1)
for (row in 1: length(D[,1])) {
max<-which.max(D[row,])
D<-D[,c(max-3,max-2,max-1,max,max+1,max+2,max+3)]
M<- cbind(M,D)
}
M<-M[,-1]
It would work, except when the maximum value is in a column near the beginning or end of a row (like rows t3 and t4 in the example above). In that case I'd like to have the 7 columns closest to the column with the maximum value, like this:
t1 0 0 0 3 0 0 0
t2 0 0 0 6 0 0 0
t3 0 0 0 0 0 0 8
t4 0 0 0 0 0 9 0
Help would be really appreciated!
dput() version of example data:
structure(list(a = c(0L, 0L, 0L, 0L), b = c(0L, 0L, 0L, 0L),
c = c(0L, 0L, 0L, 0L), d = c(0L, 0L, 0L, 0L), e = c(3L, 0L,
0L, 0L), f = c(0L, 6L, 0L, 0L), g = c(0L, 0L, 0L, 0L), h = c(0L,
0L, 0L, 0L), i = c(0L, 0L, 0L, 9L), j = c(0L, 0L, 8L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), class = "data.frame",
row.names = c("t1", "t2", "t3", "t4"))
This should work nicely:
t(apply(D,
        MARGIN = 1,
        FUN = function(X) {
          n <- which.max(X)
          i <- seq(min(max(1, n - 3), ncol(D) - 6), length.out = 7)
          X[i]
        }))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# t1 0 0 0 3 0 0 0
# t2 0 0 0 6 0 0 0
# t3 0 0 0 0 0 0 8
# t4 0 0 0 0 0 9 0
To test that the key column-selecting bit works as you'd like it to, you can try the following:
n <- 2
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 10
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 6
seq(min(max(1, n-3), ncol(D)-6), len=7)
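The three checks above can be wrapped in a small helper to make the clamping explicit. This is just a sketch: the function name window_cols and its arguments are hypothetical, the 10 columns match the example data, and it assumes the data has at least as many columns as the window is wide.

```r
# Column indices of a fixed-width window around position n, shifted inwards
# when it would run past the first or last column.
window_cols <- function(n, n_cols, half_width = 3) {
  width <- 2 * half_width + 1
  # max(1, ...) stops the window at the left edge;
  # min(..., n_cols - width + 1) stops it at the right edge
  start <- min(max(1, n - half_width), n_cols - width + 1)
  seq(start, length.out = width)
}

window_cols(2, 10)   # near the left edge:  1 2 3 4 5 6 7
window_cols(10, 10)  # near the right edge: 4 5 6 7 8 9 10
window_cols(6, 10)   # in the middle:       3 4 5 6 7 8 9
```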