I want to take all possible samples of n=3 rows from a data frame. I can obtain all those row combinations using:
Combinations <- combn(dim(df)[1], 3)
Combinations[1:10]
[1] 1 2 3 1 2 4 1 2 5 1
I need to take each set of 3 numbers from Combinations and use them as the row indices of the df (i.e. the first sample is rows 1 2 3, the second is 1 2 4, and so on).
I also need to 'store' each subset to perform functions on it (obtain the sum and the mean of the columns and save the row.names). I am using a list for that purpose, which would store all samples of the original data frame.
Here's an example of a function using sample() that does what I need, but not for every combination.
take_sample <- function(dataframe) {
  df2 <- dataframe[sample(nrow(dataframe), 3), ]
  lista <- list(filas = row.names(df2),
                suma = sum(dist(df2)),
                coord_mean = apply(df2, 2, mean))
  return(lista)
}
Here's the df
structure(list(x = c(4817.5, 4814.5, 4817.5, 8515, 8518, 8543.5,
8549.5, 7236.5, 7242.5, 8024.5, 8017, 5549, 5547, 8857, 8861.5,
8132.5, 8136.5, 9349, 9343, 5914), y = c(4424, 4431.5, 4432,
4482.5, 4484, 4574.5, 4576, 4758, 4759.5, 4772, 4772.5, 4807.5,
4809.5, 4936, 4940, 5177, 5175, 5192.5, 5192.5, 5236)), .Names = c("x",
"y"), row.names = c(1175L, 1176L, 1177L, 1180L, 1181L, 1185L,
1187L, 1203L, 1204L, 1206L, 1207L, 1215L, 1217L, 1227L, 1229L,
1249L, 1250L, 1252L, 1253L, 1257L), class = "data.frame")
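For what it's worth, here is a sketch of one way to run the same summaries over every combination rather than over a random sample (my own suggestion, reusing the logic of take_sample on the df above):
# Each column of Combinations holds one set of 3 row indices
Combinations <- combn(nrow(df), 3)

# Apply the same summaries as take_sample to every combination;
# the result is a list with one element per 3-row sample
all_samples <- lapply(seq_len(ncol(Combinations)), function(j) {
  df2 <- df[Combinations[, j], ]
  list(filas = row.names(df2),
       suma = sum(dist(df2)),
       coord_mean = colMeans(df2))
})

length(all_samples)  # choose(20, 3) = 1140 samples for the 20-row df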
My data looks like this:
Nr. Type 1 Type 2 Type 1 Type 2
1 400 600 100 800
2 500 400 900 300
3 200 200 400 700
4 300 600 800 300
and I want to create boxplots of Type 1 and Type 2.
pivot_longer() gives me Type 1 and Type 1.1, which is not what I need.
Maybe someone can help me.
It turns out your issue was not with pivot_longer() but with the subsetting of your original data.frame using [. There is no direct way to stop [ or base::subset() from making the column names unique (that is where Type 1.1 comes from), so you need to do something else to subset your data and avoid losing the duplicated column names. This is discussed in this question, so borrowing from one of the answers, you can use:
library(tidyverse)
# data with extra column to be removed
d <- structure(list(Nr. = 1:4, `Type 1` = c(400L, 500L, 200L, 300L), x = 1:4, `Type 2` = c(600L,
400L, 200L, 600L), `Type 1` = c(100L, 900L, 400L, 800L), `Type 2` = c(800L,
300L, 700L, 300L)), row.names = c(NA, -4L), class = "data.frame")
# remove extra column without changing names then pivot
data.frame(as.list(d)[-3], check.names = FALSE) %>%
  pivot_longer(-Nr.) %>%
  ggplot(aes(name, value)) +
  geom_boxplot()
Created on 2022-02-22 by the reprex package (v2.0.1)
I have a data table that I would like to filter based on multiple conditions, looking back at all previous rows. If an ID appears in the New_ID.1 column of an earlier row, the later row where it appears in the New_ID column should be removed. For example, I would remove New_ID 581 in row 3 because 581 is the New_ID.1 of row 1. However, I don't want to remove row 6 (New_ID 551), since row 3 (where 551 appears as New_ID.1) is removed first. Essentially, I think I need to loop through the rows, create a new filtered table for each row, and repeat the process?
orig_df<- structure(list(New_ID = c(557L, 588L, 581L, 580L, 591L, 551L,
300L, 112L), New_ID.1 = c(581L, 591L, 551L, 300L, 112L, 584L,
416L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5268.69104753613, 5296.72042763528, 5271.94917463488, 5258.66546295312,
5286.99982045171, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 903663.550206032, 866828.860223065, 819525.350044447,
903720.790105847, 866881.654186025, 819585.173276271), Y.x = c(1027706.41509243,
1026880.34660449, 1024367.77412815, 1023962.99139374, 1023448.02293581,
1019099.39402149, 1018666.53407908, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 903720.790105847, 866881.654186025, 819585.173276271,
903801.327345876, 866919.184271939, 819630.672367509), Y.y = c(1024367.77412815,
1023448.02293581, 1019099.39402149, 1018666.53407908, 1018176.41319296,
1013841.34531459, 1013379.66746509, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5268.38010666054, 5296.45731465891, 5271.60974284587,
5258.04870690871, 5286.86661398865, 5277.62302497006), X_filter = c(58.609821557533,
9.62174241337925, 57.2398998149438, 52.7939629601315, 59.8232318238588,
80.5372400298947, 37.5300859131385, 45.4990912381327), row.number = 1:8), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
The end result would retain rows 1, 2, 4, 6 and 8 from the original data:
output_table<-structure(list(New_ID = c(557L, 588L, 580L, 551L, 112L), New_ID.1 = c(581L,
591L, 300L, 584L, 115L), distance = c(3339.15537217173, 3432.33715484179,
5296.72042763528, 5258.66546295312, 5277.81914818968), X.x = c(903604.940384474,
819515.728302034, 866828.860223065, 903720.790105847, 819585.173276271
), Y.x = c(1027706.41509243, 1026880.34660449, 1023962.99139374,
1019099.39402149, 1018176.41319296), X.y = c(903663.550206032,
819525.350044447, 866881.654186025, 903801.327345876, 819630.672367509
), Y.y = c(1024367.77412815, 1023448.02293581, 1018666.53407908,
1013841.34531459, 1012898.79016799), Y_filter = c(3338.64096427278,
3432.32366867992, 5296.45731465891, 5258.04870690871, 5277.62302497006
), X_filter = c(58.609821557533, 9.62174241337925, 52.7939629601315,
80.5372400298947, 45.4990912381327), row.number = c(1L, 2L, 4L,
6L, 8L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
Below is a simpler problem that might be of help.
Original data
A|B
C|D
B|E
E|F
Updated data table
A|B
C|D
E|F
I think looping through the rows and saving the IDs that you have already encountered should be enough:
orig_df <- as.data.frame(orig_df)

included_rows <- rep(FALSE, nrow(orig_df))
seen_ids <- c()

for (i in 1:nrow(orig_df)) {
  # Skip row if we have seen either ID already
  if (orig_df[i, 'New_ID'] %in% seen_ids) next
  if (orig_df[i, 'New_ID.1'] %in% seen_ids) next
  # If both IDs are new, we save them as seen and include the entry
  seen_ids <- c(seen_ids, orig_df[i, 'New_ID'], orig_df[i, 'New_ID.1'])
  included_rows[i] <- TRUE
}

filtered_df <- orig_df[included_rows, ]
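A quick sanity check against the expected output posted above (assuming both filtered_df and output_table are in the workspace):
# For this data the kept row numbers match the expected 1, 2, 4, 6 and 8
identical(filtered_df$row.number, output_table$row.number)  # TRUE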
I have this dataset:
> head(meltCalcium)
Time Cell Intensity
1 1 IntDen1 306852.5
2 2 IntDen1 302892.2
3 3 IntDen1 298258.6
4 4 IntDen1 300769.9
5 5 IntDen1 301971.8
6 6 IntDen1 302585.6
> tail(meltCalcium)
Time Cell Intensity
32531 659 IntDen49 47788.16
32532 660 IntDen49 47560.32
32533 661 IntDen49 47738.24
32534 662 IntDen49 48968.96
32535 663 IntDen49 48796.16
32536 664 IntDen49 48156.80
I have 49 cells and the time reaches 664 for each of them. In this case time is not important, as I'd like to get the normalized Intensity for each cell (so (Intensity - min)/(max - min)), and possibly add it as a new column to the data frame.
I tried
> meltCalcium$normalized <- with(meltCalcium, (Intensity - min(Intensity))/diff(range(Intensity)))
but in this way the max and the min are calculated using the Intensity over all Cells. How can I do it for each cell separately?
Thanks!
Apply the formula by group:
library(dplyr)
result <- meltCalcium %>%
  group_by(Cell) %>%
  mutate(normalized = (Intensity - min(Intensity)) / diff(range(Intensity)))
Base R solution:
normalise_vec_min_max <- function(num_vec) {
  minnv <- min(num_vec, na.rm = TRUE)
  maxnv <- max(num_vec, na.rm = TRUE)
  return((num_vec - minnv) / (maxnv - minnv))
}
with(meltCalcium, ave(Intensity, Cell, FUN = normalise_vec_min_max))
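As with the dplyr version, the result lines up with the original rows, so (a usage sketch) it can be assigned straight back as a new column:
# Assign the per-cell normalised values back to the data frame
meltCalcium$normalized <- with(meltCalcium,
                               ave(Intensity, Cell, FUN = normalise_vec_min_max))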
Data:
meltCalcium <- structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 659L, 660L, 661L,
662L, 663L, 664L), Cell = c("IntDen1", "IntDen1", "IntDen1",
"IntDen1", "IntDen1", "IntDen1", "IntDen49", "IntDen49", "IntDen49",
"IntDen49", "IntDen49", "IntDen49"), Intensity = c(306852.5,
302892.2, 298258.6, 300769.9, 301971.8, 302585.6, 47788.16, 47560.32,
47738.24, 48968.96, 48796.16, 48156.8)), row.names = c(NA, -12L
), class = "data.frame")
When I use the heatmap() function to make a heatmap of my dataset, I get an error. I tried:
df1$family <- substr(as.character(df1$gene_id), 1, nchar(as.character(df1$gene_id)) - 2)
df01 <- df1$family
df01m <- as.matrix(df01)
heatmap(df01m)
I get this error:
Error in heatmap(df01m): 'x' must be a numeric matrix
Traceback:
1. heatmap(df01m)
2. stop("'x' must be a numeric matrix")
The dataset is big, so I cut some of it:
structure(list(gene_id = structure(6:11, .Label = c("__alignment_not_unique",
"__ambiguous", "__no_feature", "__not_aligned", "__too_low_aQual",
"ENSG00000000005", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460",
"ENSG00000000938", "ENSG00000000971", "ENSG00000001036", "ENSG00000001084",
"ENSG00000001167", "ENSG00000001460", "ENSG00000001461", "ENSG00000001497",
"ENSG00000001561", "ENSG00000001617", "ENSG00000001626", "ENSG00000001629",
"ENSG00000001630", "ENSG00000001631", "ENSG00000002016", "ENSG00000002079",
"ENSG00000002330", "ENSG00000002549", "ENSG00000002586", "ENSG00000002587",
"ENSG00000002726", "ENSG00000002745", "ENSG00000002746", "ENSG00000002822",
"ENSG00000002834", "ENSG00000002919", "ENSG00000002933", "ENSG00000003056",
"ENSG00000003096", "ENSG00000003137", "ENSG00000003147", "ENSG00000003249",
"ENSG00000003393", "ENSG00000003400", "ENSG00000003402", "ENSG00000003436",
"ENSG00000003509", "ENSG00000003756", "ENSG00000003987", "ENSG00000003989",
"ENSG00000004059", "ENSG00000004139", "ENSG00000004142", "ENSG00000004399",
"ENSG00000285989", "ENSG00000285990", "ENSG00000285991", "ENSG00000285992",
"ENSG00000285993", "ENSG00000285994"), class = "factor"), expr = c(6L,
754L, 447L, 426L, 5L, 1L)), row.names = c(NA, 6L), class = "data.frame")
head of the data set:
gene_id expr
<fct> <int>
1 ENSG00000000005 6
2 ENSG00000000419 754
3 ENSG00000000457 447
4 ENSG00000000460 426
5 ENSG00000000938 5
6 ENSG00000000971 1
The error shows that heatmap() needs a numeric matrix, but substr() returns a character vector. So we can convert the substring vector to numeric before building the matrix:
df01m <- as.matrix(as.numeric(df01))
Another issue is that heatmap() requires a matrix with at least 2 rows and 2 columns. Here as.matrix() converts the vector to a single-column matrix, so it may not work.
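For illustration, here is a minimal sketch of building a numeric matrix from this data; it assumes (my assumption, not stated in the question) that you want to plot the expr values with the gene IDs as row labels:
# Numeric matrix of expression values, gene IDs kept as row labels
df01m <- matrix(df1$expr, ncol = 1,
                dimnames = list(as.character(df1$gene_id), "expr"))
is.numeric(df01m)  # TRUE, so the "'x' must be a numeric matrix" error goes away
# Note: heatmap() will still stop on a single-column matrix because it needs
# at least 2 rows and 2 columns, so a real heatmap needs more numeric columns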
I am facing a problem I do not understand. It's a follow-up on answers suggested here and here.
I have two identically structured datasets. One I created as a reproducible example for which the code works, and one which is real for which the code does not work. After staring at it for hours I cannot find what is causing the error.
The following gives an example that works:
library(data.table)

df <- data.table(cbind(rep(seq(1, 25), each = 4)), cbind(rep(seq(1, 40), length.out = 100)))
colnames(df) <- c("a", "b")  # ignore warning
setkey(df, a, b)
This is just to create a reproducible example. When I apply the (slightly adjusted) code suggested in the SO posts mentioned above, I get what I am looking for: a sparse matrix that indicates when two elements in column b occur together for values of column a.
library(Matrix)

s <- sparseMatrix(
  df$a,
  df$b,
  dimnames = list(unique(df$a), unique(df$b)),
  x = 1
)

v <- t(s) %*% s
Now I am doing, in my eyes, exactly the same on my real dataset, which is much longer.
A sample dput of it looks like this:
test <- dput(dk[1:50,])
structure(list(pid = c(204L, 204L, 207L, 254L, 254L, 258L, 258L,
258L, 258L, 258L, 265L, 265L, 269L, 269L, 269L, 269L, 1520L,
1520L, 1520L, 1520L, 1532L, 1532L, 1534L, 1534L, 1534L, 1534L,
1539L, 1539L, 1543L, 1543L, 1546L, 1546L, 1546L, 1546L, 1546L,
1546L, 1546L, 1549L, 1549L, 1549L, 1559L, 1559L, 1559L, 1559L,
1559L, 1559L, 1559L, 1561L, 1561L, 1561L), cid = c(11023L, 11787L,
14232L, 14470L, 14480L, 1290L, 1637L, 4452L, 13964L, 14590L,
17814L, 23453L, 6658L, 10952L, 17259L, 27549L, 11034L, 22748L,
23345L, 23347L, 10487L, 11162L, 15570L, 15629L, 17983L, 17999L,
17531L, 22497L, 14425L, 14521L, 11495L, 24948L, 24962L, 24969L,
24972L, 24973L, 30627L, 17886L, 18428L, 23972L, 13890L, 13936L,
14432L, 21230L, 21271L, 21384L, 21437L, 341L, 354L, 6302L)), .Names = c("pid",
"cid"), sorted = c("pid", "cid"), class = c("data.table", "data.frame"
), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000100788>)
Then when running the same formula, I get an error
s <- sparseMatrix(test$pid,test$cid,dimnames = list(unique(test$pid), unique(test$cid)),x = 1)
The error (which occurs in the test dataset as well) reads as follows:
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[[1]]) must match Dim[1]
The problem disappears when I remove the dimnames, but I really need these dimnames to make sense of the results. I'm sure I'm missing out on something obvious. Can someone please tell me what it is?
The error occurs because sparseMatrix() uses the 'pid' and 'cid' values directly as row/column indices, so the matrix gets max(pid) rows and max(cid) columns, which does not match the lengths of unique(pid) and unique(cid) supplied as dimnames. To fix it, we can convert the 'pid' and 'cid' columns to factor and coerce back to numeric, or use match() with the unique values of each column, to get consecutive row/column indices; with these, creating the sparseMatrix works.
test1 <- test[, lapply(.SD, function(x)
  as.numeric(factor(x, levels = unique(x))))]
Or we can use match():
test1 <- test[, lapply(.SD, function(x) match(x, unique(x)))]
s1 <- sparseMatrix(test1$pid, test1$cid,
                   dimnames = list(unique(test$pid), unique(test$cid)),
                   x = 1)
dim(s1)
#[1] 15 50
s1[1:3, 1:3]
#3 x 3 sparse Matrix of class "dgCMatrix"
# 11023 11787 14232
#204 1 1 .
#207 . . 1
#254 . . .
head(test)
# pid cid
#1: 204 11023
#2: 204 11787
#3: 207 14232
#4: 254 14470
#5: 254 14480
#6: 258 1290
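To get the co-occurrence matrix you were originally after, the same cross-product as in your working example should carry over (a usage sketch):
# Co-occurrence of 'cid' values within 'pid', as in the original example
v1 <- t(s1) %*% s1
dim(v1)
#[1] 50 50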
EDIT:
If we want this for the full row/column indices specified in 'test', we need to make the dimnames the same length as the maximum of 'pid' and 'cid':
rnm <- seq(max(test$pid))
cnm <- seq(max(test$cid))
s2 <- sparseMatrix(test$pid, test$cid, dimnames=list(rnm, cnm))
dim(s2)
#[1] 1561 30627
s2[1:3, 1:3]
#3 x 3 sparse Matrix of class "ngCMatrix"
# 1 2 3
#1 . . .
#2 . . .
#3 . . .