I have this dataset:
> head(meltCalcium)
Time Cell Intensity
1 1 IntDen1 306852.5
2 2 IntDen1 302892.2
3 3 IntDen1 298258.6
4 4 IntDen1 300769.9
5 5 IntDen1 301971.8
6 6 IntDen1 302585.6
> tail(meltCalcium)
Time Cell Intensity
32531 659 IntDen49 47788.16
32532 660 IntDen49 47560.32
32533 661 IntDen49 47738.24
32534 662 IntDen49 48968.96
32535 663 IntDen49 48796.16
32536 664 IntDen49 48156.80
I have 49 cells, and the time reaches 664 for each one of them. In this case time is not important, as I'd like to get the normalized Intensity for each cell, that is (Intensity - min)/(max - min), and possibly add it as a new column to the data frame.
I tried
> meltCalcium$normalized <- with(meltCalcium, (Intensity - min(Intensity))/diff(range(Intensity)))
but this way the max and the min are calculated from the Intensity over all cells. How can I do it for each cell separately?
Thanks!
Apply the formula by group:
library(dplyr)
result <- meltCalcium %>%
group_by(Cell) %>%
mutate(normalized = (Intensity-min(Intensity))/diff(range(Intensity)))
Base R solution:
normalise_vec_min_max <- function(num_vec){
minnv <- min(num_vec, na.rm = TRUE)
maxnv <- max(num_vec, na.rm = TRUE)
return((num_vec - minnv) / (maxnv - minnv))
}
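# ave() applies the function within each Cell and returns a vector in the original row order
meltCalcium$normalized <- with(meltCalcium, ave(Intensity, Cell, FUN = normalise_vec_min_max))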
Data:
meltCalcium <- structure(list(Time = c(1L, 2L, 3L, 4L, 5L, 6L, 659L, 660L, 661L,
662L, 663L, 664L), Cell = c("IntDen1", "IntDen1", "IntDen1",
"IntDen1", "IntDen1", "IntDen1", "IntDen49", "IntDen49", "IntDen49",
"IntDen49", "IntDen49", "IntDen49"), Intensity = c(306852.5,
302892.2, 298258.6, 300769.9, 301971.8, 302585.6, 47788.16, 47560.32,
47738.24, 48968.96, 48796.16, 48156.8)), row.names = c(NA, -12L
), class = "data.frame")
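A quick way to check either approach is to confirm that every cell's normalized values run from 0 to 1. The following is a minimal base R sketch on the toy data above; the helper name min_max is only an illustrative assumption, not part of either answer.

```r
# Toy data from above: two cells with very different intensity ranges
meltCalcium <- data.frame(
  Time = c(1:6, 659:664),
  Cell = rep(c("IntDen1", "IntDen49"), each = 6),
  Intensity = c(306852.5, 302892.2, 298258.6, 300769.9, 301971.8, 302585.6,
                47788.16, 47560.32, 47738.24, 48968.96, 48796.16, 48156.8)
)

# Hypothetical helper: min-max scaling of one vector
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
}

# Apply the helper within each Cell; ave() keeps the original row order
meltCalcium$normalized <- ave(meltCalcium$Intensity, meltCalcium$Cell, FUN = min_max)

# Range of the normalized values per cell; each should be 0 and 1
per_cell_range <- tapply(meltCalcium$normalized, meltCalcium$Cell, range)
print(per_cell_range)
```

Note that a cell whose Intensity is constant has max - min equal to 0, so the division yields NaN for that cell; that case would need separate handling.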
When I use the heatmap function to make a heatmap of my dataset, I get an error. I tried:
df1$family <- substr(as.character(df1$gene_id), 1, nchar(as.character(df1$gene_id))-2)
df01<-df1$family
df01m<-as.matrix(df01)
heatmap(df01m)
I get this error:
Error in heatmap(df01m): 'x' must be a numeric matrix
Traceback:
1. heatmap(df01m)
2. stop("'x' must be a numeric matrix")
The dataset is big, so I cut some of it:
structure(list(gene_id = structure(6:11, .Label = c("__alignment_not_unique",
"__ambiguous", "__no_feature", "__not_aligned", "__too_low_aQual",
"ENSG00000000005", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460",
"ENSG00000000938", "ENSG00000000971", "ENSG00000001036", "ENSG00000001084",
"ENSG00000001167", "ENSG00000001460", "ENSG00000001461", "ENSG00000001497",
"ENSG00000001561", "ENSG00000001617", "ENSG00000001626", "ENSG00000001629",
"ENSG00000001630", "ENSG00000001631", "ENSG00000002016", "ENSG00000002079",
"ENSG00000002330", "ENSG00000002549", "ENSG00000002586", "ENSG00000002587",
"ENSG00000002726", "ENSG00000002745", "ENSG00000002746", "ENSG00000002822",
"ENSG00000002834", "ENSG00000002919", "ENSG00000002933", "ENSG00000003056",
"ENSG00000003096", "ENSG00000003137", "ENSG00000003147", "ENSG00000003249",
"ENSG00000003393", "ENSG00000003400", "ENSG00000003402", "ENSG00000003436",
"ENSG00000003509", "ENSG00000003756", "ENSG00000003987", "ENSG00000003989",
"ENSG00000004059", "ENSG00000004139", "ENSG00000004142", "ENSG00000004399",
"ENSG00000285989", "ENSG00000285990", "ENSG00000285991", "ENSG00000285992",
"ENSG00000285993", "ENSG00000285994"), class = "factor"), expr = c(6L,
754L, 447L, 426L, 5L, 1L)), row.names = c(NA, 6L), class = "data.frame")
head of the data set:
gene_id expr
<fct> <int>
1 ENSG00000000005 6
2 ENSG00000000419 754
3 ENSG00000000457 447
4 ENSG00000000460 426
5 ENSG00000000938 5
6 ENSG00000000971 1
The error shows that heatmap needs a numeric matrix. The substr function returns a character string, so we can convert the substring vector to numeric:
df01m <- as.matrix(as.numeric(df01))
Another issue is that heatmap requires a matrix with at least 2 rows and 2 columns. Here, as.matrix converts the vector to a single-column matrix, so it may not work.
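Both points can be checked on the sample data. The sketch below is only an illustration under stated assumptions: the expr_b column is hypothetical (invented here so the matrix has two columns) and is not part of the original dataset. It also shows that as.numeric() only gives usable numbers when the strings are numerals; for identifiers such as these it returns NA, so the numeric matrix is built from the expression values and the IDs are kept as row names.

```r
# Sample rows from the question
df1 <- data.frame(
  gene_id = c("ENSG00000000005", "ENSG00000000419", "ENSG00000000457",
              "ENSG00000000460", "ENSG00000000938", "ENSG00000000971"),
  expr = c(6L, 754L, 447L, 426L, 5L, 1L)
)

# Identifiers are not numerals, so converting them yields NA
id_as_number <- suppressWarnings(as.numeric(df1$gene_id))

# Hypothetical second sample, added only so the matrix has 2 columns
df1$expr_b <- rev(df1$expr)

# Numeric matrix: expression values as cells, gene IDs as row names
m <- as.matrix(df1[, c("expr", "expr_b")])
rownames(m) <- df1$gene_id

heatmap(m)
```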
I have a data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (by number of letters), find all the shorter names that match it, and assign them to one group,
then move on to the next longest name and assign its matches to another group,
until no names are left.
First I calculate the length of each name:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest:
df2[which.max(df2$chr),]
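For reference, the same lengths can be computed without the join by calling nchar() on the labels directly. This is a minimal sketch; the labels vector simply repeats the values from the data above (including their stray leading and trailing spaces), in data frame order.

```r
# Labels from the data above, in row order, whitespace as given
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# nchar() is vectorised; as.character() would be needed first for a factor
chr <- nchar(labels)

# The longest label
longest <- labels[which.max(chr)]
print(longest)
```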
Now I need to see which other strings share their letters with this longest string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
that is, all possible prefixes, with the letters in the same order (from left to right); for example, it should be Afgh and cannot be fAhg.
So we have only two other strings that are similar to this one:
Afghanestan
Afghanestankabol
This is because a string must match exactly, with not even one letter different from the longest string, to be assigned to the same group.
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not completely belong to another name (it only partially shares a name with one or another). To join a group, it would have to be part of a bigger name.
I tried the approach from "Find similar strings and reconcile them within one dataframe", which did not help me at all.
I found something else which might help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
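If exact prefix matching is wanted rather than a fuzzy distance, the rule described in the question can also be sketched in base R. This is a hedged illustration, not the method above: it assumes that case and surrounding whitespace should be ignored (as the desired output suggests), and it groups each label with the shortest label that is a prefix of it.

```r
# Labels in the order of the data frame, whitespace as given
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# Assumption: compare case-insensitively and ignore stray spaces
clean <- tolower(trimws(labels))

# For each label, find the shortest label that is a prefix of it
# (every label is a prefix of itself, so there is always a candidate)
root <- vapply(clean, function(s) {
  candidates <- clean[startsWith(s, clean)]
  candidates[which.min(nchar(candidates))]
}, character(1), USE.NAMES = FALSE)

# Number the groups in order of first appearance
group <- match(root, unique(root))
print(data.frame(label = trimws(labels), group = group))
```

Here indiaAfghanestan ends up alone because no other label is a prefix of it, and it is not a prefix of any other label.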
I am trying to write a function that merges based on two columns both found in two dataframes. One of the columns is an identifier string and the other is a date.
The first df ("model") includes identifiers, starting dates, and some other relevant info.
The second df ("futurevalues") is a melted df that includes the identifier, multiple months for each identifier, and the relevant value for each identifier-month pair.
I would like to merge values for each identifier based on a certain period of time in the future. So for instance, for Identifier= Mary and starting month="2005-01-31" in "model" I would like to pull in the relevant value for the next month and 11 more months after (so 12 data points for Mary for months starting month+1:starting month+12).
I can merge my dfs by the two columns to get the as-of date value (see below), but this isn't what I need.
testmerge=merge(model,futurevalues,by=c("month","identifier"),all=TRUE)
To solve this, I am trying to use the lubridate date functions. For instance, the function below will allow me to enter a month (and then lapply across the df maybe) to get the values for each of the starting months (which vary across the df, meaning it's not a standard time period across the entire thing).
monthiterate=function (x) {
x %m+% months(1:12)
}
Thanks a lot for your help.
EDIT: adding toy data (first one is model, second one is futurevalues)
structure(list(month = structure(c(12814, 12814, 12814, 12814,
12814, 12814, 12814, 12814, 12814, 12814), class = "Date"), identifier = structure(c(1L,
3L, 2L, 4L, 5L, 7L, 8L, 6L, 9L, 10L), .Label = c("AB1", "AC5",
"BB9", "C99", "D81", "GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"),
value = c(0.831876072999969, 0.218494398256579, 0.550872926656984,
1.81882711231324, -0.245597705276932, -0.964277509916354,
-1.84714556574606, -0.916239506529079, -0.475649743547525,
-0.227721186387637)), .Names = c("month", "identifier", "value"
), class = "data.frame", row.names = c(NA, 10L))
structure(list(identifier = structure(c(1L, 3L, 2L, 4L, 5L, 7L,
8L, 6L, 9L, 10L), .Label = c("AB1", "AC5", "BB9", "C99", "D81",
"GG8", "Q11", "R45", "ZA1", "ZZ9"), class = "factor"), month = structure(c(12814,
13238, 12814, 12814, 12964, 12903, 12903, 12842, 13148, 13148
), class = "Date"), futurereturns = c(-0.503033205660682, 1.22446988772542,
-0.825490985851348, 1.03902417581908, 0.172595565260429, 0.894967582911769,
-0.242324006922964, 0.415520398113024, -0.734437328639625, 2.64184935856802
)), .Names = c("identifier", "month", "futurereturns"), class = "data.frame", row.names
= c(NA, 10L))
You need to create a table of all the combinations of ID and month that you want. Starting with a table of each ID and their starting month:
library(lubridate)
set.seed(1834)
# 3 people, each with a different starting month
x <- data.frame(id = sample(LETTERS, 3)
, month = ymd("2005-01-01") + months(sample(0:11, 3)) - days(1))
> x
id month
1 D 2005-03-31
2 R 2005-07-31
3 Y 2005-02-28
Now add rows for the following two months, per ID. I use dplyr for this kind of thing.
library(dplyr)
y <- x %>%
rowwise %>%
do(data.frame(id = .$id
, month = seq(.$month + days(1)
, by = "1 month"
, length.out = 3) - days(1)))
> y
Source: local data frame [9 x 2]
Groups: <by row>
id month
1 D 2005-03-31
2 D 2005-04-30
3 D 2005-05-31
4 R 2005-07-31
5 R 2005-08-31
6 R 2005-09-30
7 Y 2005-02-28
8 Y 2005-03-31
9 Y 2005-04-30
Now you can use merge() (or left_join() from dplyr) to retrieve the rows you want from the full dataset.
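Putting the steps together, the following is a minimal base R sketch of the whole idea for a 12-month window. All data in it are hypothetical stand-ins for "model" and "futurevalues", the helper name next_month_ends is an assumption made for illustration, and it relies on every date being a month end, as in the question.

```r
# Hypothetical stand-in for "model": one starting month-end per identifier
model <- data.frame(identifier = c("AB1", "BB9"),
                    month = as.Date(c("2005-01-31", "2005-03-31")))

# Hypothetical helper: the n month-ends that follow a month-end date d.
# Step to the first of the next month, advance by whole months, step back a day.
next_month_ends <- function(d, n = 12) {
  seq(d + 1, by = "month", length.out = n + 1)[-1] - 1
}

# One row per identifier and wanted future month
wanted <- do.call(rbind, lapply(seq_len(nrow(model)), function(i) {
  data.frame(identifier = model$identifier[i],
             start = model$month[i],
             month = next_month_ends(model$month[i]))
}))

# Hypothetical stand-in for "futurevalues": 30 month-ends per identifier
all_months <- seq(as.Date("2005-01-01"), by = "month", length.out = 30) - 1
futurevalues <- data.frame(
  identifier = rep(c("AB1", "BB9"), each = length(all_months)),
  month = rep(all_months, times = 2)
)
futurevalues$futurereturns <- seq_len(nrow(futurevalues))

# Keep only the identifier-month pairs inside each 12-month window
result <- merge(wanted, futurevalues, by = c("identifier", "month"), all.x = TRUE)
print(head(result))
```

Using all.x = TRUE keeps a row for every wanted month, so a month missing from futurevalues shows up as NA instead of silently disappearing.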
I have a 2 dimensional data set (matrix/data frame) that looks like this
779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00
The 779, 482, 859 and 1156 are the values that I want to draw on the x-axis.
The rest of the values in each column are the values that correspond to each x.
Now I want to plot the entire data set, so that I have a graph with the following points:
(779, 56916), (779, 41784), ...
(482, 78968), (482, 64440), ... and so on
The way I did it so far is like this (it gives me the plot I am looking for)
plot(colnames(resultsSummary),resultsSummary[1,],ylim=c(0,80000),pch=6)
points(colnames(resultsSummary),resultsSummary[2,],pch=3)
points(colnames(resultsSummary),resultsSummary[3,])
and so on, plotting row by row.
I am sure there is a better way to do it, but I don't know how. Any suggestions?
DF <- read.table(text=" 779 482 859 1156
maxs 56916.00 78968.00 51156.00 44827.01
Means+Stdv 41784.70 64440.83 38319.10 42767.14
Mean_Cost 31863.18 44407.40 29365.78 38711.29
Means_Stdv 21941.66 24373.97 20412.45 34655.43
mins 21088.00 13768.00 24132.00 31452.00",
header=TRUE, check.names=FALSE)
m <- as.matrix(DF)
# t(m) has one column per original row, so use one plotting symbol per row of m
matplot(as.integer(colnames(m)),
t(m), pch=seq_len(nrow(m)))
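To make the symbols readable, a legend can be added with the row names. The sketch below rebuilds the same matrix so that it is self-contained; sorting the x values and the choice of colours and legend position are illustrative assumptions, not part of the answer above.

```r
# Same values as above, rows are the summary statistics
m <- matrix(c(56916.00, 78968.00, 51156.00, 44827.01,
              41784.70, 64440.83, 38319.10, 42767.14,
              31863.18, 44407.40, 29365.78, 38711.29,
              21941.66, 24373.97, 20412.45, 34655.43,
              21088.00, 13768.00, 24132.00, 31452.00),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("maxs", "Means+Stdv", "Mean_Cost", "Means_Stdv", "mins"),
                            c("779", "482", "859", "1156")))

x <- as.integer(colnames(m))
ord <- order(x)            # sort x so any connecting lines run left to right
series <- seq_len(nrow(m)) # one symbol and colour per row of m

# After transposing, each column of t(m) is one series
matplot(x[ord], t(m)[ord, ], type = "b", lty = 1,
        pch = series, col = series, xlab = "x", ylab = "value")
legend("topright", legend = rownames(m), pch = series, col = series)
```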
The following also works:
ddf = structure(list(var = structure(c(1L, 4L, 2L, 3L, 5L), .Label = c("maxs",
"Mean_Cost", "Means_Stdv", "Means+Stdv", "mins"), class = "factor"),
X779 = c(56916, 41784.7, 31863.18, 21941.66, 21088), X482 = c(78968,
64440.83, 44407.4, 24373.97, 13768), X859 = c(51156, 38319.1,
29365.78, 20412.45, 24132), X1156 = c(44827.01, 42767.14,
38711.29, 34655.43, 31452)), .Names = c("var", "X779", "X482",
"X859", "X1156"), class = "data.frame", row.names = c(NA, -5L
))
ddf
var X779 X482 X859 X1156
1 maxs 56916.00 78968.00 51156.00 44827.01
2 Means+Stdv 41784.70 64440.83 38319.10 42767.14
3 Mean_Cost 31863.18 44407.40 29365.78 38711.29
4 Means_Stdv 21941.66 24373.97 20412.45 34655.43
5 mins 21088.00 13768.00 24132.00 31452.00
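library(reshape2)  # melt()
library(ggplot2)
# add the x values as a sixth row; sub() keeps all digits (substr(., 2, 4) would cut 1156 to 115)
ddf[6,2:5]=as.numeric(sub("^X", "", names(ddf)[2:5]))
# transpose only the numeric columns so the values stay numeric; columns become X1..X6
ddf2 = data.frame(t(ddf[,-1]))
mm = melt(ddf2, id='X6')
ggplot(mm)+geom_point(aes(x=X6, y=value, color=variable))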
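The long format that the ggplot call needs can also be built in base R without transposing. This is a minimal sketch under the assumption that the data is held as a numeric matrix with the x values as column names; the column names x, variable and value are chosen here for illustration.

```r
# Same values as above, rows are the summary statistics
m <- matrix(c(56916.00, 78968.00, 51156.00, 44827.01,
              41784.70, 64440.83, 38319.10, 42767.14,
              31863.18, 44407.40, 29365.78, 38711.29,
              21941.66, 24373.97, 20412.45, 34655.43,
              21088.00, 13768.00, 24132.00, 31452.00),
            nrow = 5, byrow = TRUE,
            dimnames = list(c("maxs", "Means+Stdv", "Mean_Cost", "Means_Stdv", "mins"),
                            c("779", "482", "859", "1156")))

# as.vector() reads the matrix column by column, so x repeats once per row
# and the row names cycle once per column
long <- data.frame(
  x = rep(as.integer(colnames(m)), each = nrow(m)),
  variable = rep(rownames(m), times = ncol(m)),
  value = as.vector(m)
)

# One point per (x, value) pair, one symbol per statistic
plot(long$x, long$value, pch = as.integer(factor(long$variable)),
     xlab = "x", ylab = "value")
```

The resulting data frame has numeric x and value columns, so it could equally be passed to ggplot with aes(x = x, y = value, color = variable).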
I have the following table, ordered and grouped by first, second and Name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that corresponds to each specific sample. Several groups don't have areas for all the samples, so if a sample isn't detected it should return NA. Ideally, the final output should have one column for each sample.
I have tried the ifelse function to create one column for each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However, this doesn't take the group distribution into account. I want to do this, but within each group (a group has equal values for the first, second and Name columns): if Sample is PO1:1, return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the column Sample so that its levels are legal names, or else they might cause errors:
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min range of myData$second
min.second <- 0 #
sprd <- 0.15 # with spread for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ This line will work when only one value per sample
# unlist(ifelse(res==0, NA, res))
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?
> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598
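If a column for a sample that never occurs (such as PO3:1) is wanted, its level has to be declared explicitly. The base R sketch below is one hedged way to do that with tapply(); the samples vector (including PO3:1) is an assumption taken from the desired output in the question, groups are defined by exact equality of first, second and Name, and sum is used only as a placeholder for combining multiple areas per cell.

```r
# Data from the question
myData <- data.frame(
  first  = c(120L, 120L, 126L, 126L, 126L, 132L, 132L),
  second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
  Name   = c("Pentanone", "Pentanone", "Heptene", "Heptene",
             "Methylamine", "Benzene", "Ethene._trichloro-"),
  Area   = c(699468L, 153744L, 32913L, 4948619L, 83528L, 536339L, 105598L),
  Sample = c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1")
)

# Assumption: the full set of samples, including one never observed
samples <- c("PO1:1", "PO2:1", "PO3:1", "PO4:1")

# One key per group of identical first/second/Name values
key <- interaction(myData$first, myData$second, myData$Name,
                   sep = "|", drop = TRUE)

# Rows are groups, columns are samples; empty cells come back as NA
wide <- tapply(myData$Area,
               list(key, factor(myData$Sample, levels = samples)),
               FUN = sum)
print(wide)
```

Because the second argument is a factor with all four levels, the PO3:1 column is present and entirely NA.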