I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (by number of letters), see how many shorter, similar names there are, and assign them all to one group,
then move on to the next longest name and assign its matches to another group,
until no names are left.
First I calculate the length of each label:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I need to see which other strings have letters matching this longest string. For
Afghanestankabolindia
the possibilities are
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
All such combinations are possible, but the order of the letters must stay the same (from left to right); for example, it can be Afgh but cannot be fAhg.
So we have only two other strings that are similar to this one:
Afghanestan
Afghanestankabol
This is because a string must match exactly, without even one differing letter (other than being shorter than the longest string), to be assigned to the same group.
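To make the rule concrete, here is a minimal sketch of the prefix test for the longest label. It assumes the labels are trimmed with trimws() first and that "similar" means "is a leading substring of the longest string"; the labels vector is a hand-typed subset of the data above, and startsWith() needs R 3.3.0 or later.

```r
# Hand-typed subset of df$label (an assumption for illustration), whitespace trimmed
labels <- trimws(c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
                   "indiaAfghanestan ", "USA"))

# The longest label is the reference string for the group
longest <- labels[which.max(nchar(labels))]

# TRUE where a label is a leading substring of the longest label
is_prefix <- startsWith(longest, labels)

labels[is_prefix]
# "Afghanestan" "Afghanestankabol" "Afghanestankabolindia"
```

This only covers one pass for one reference string; repeating it on the remaining labels would be one way to build the next group.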
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not completely belong to another name (it only partially shares a name with one or another). To join a group it should be part of a bigger name.
I tried the approach from "Find similar strings and reconcile them within one dataframe", which did not help me at all.
I found something else which may help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but I still don't know how to assign them to groups.
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
I have a problem regarding results from an aggregate function in R. My aim is to select certain bird species from a data set and calculate the density
of observed individuals over the surveyed area. To that end, I took a subset of the main data file, then aggregated over area, calculating the
mean, and the number of individuals (represented by length of vector). Then I wanted to use the calculated mean area and number of individuals to
calculate density. That didn't work. The code I used is given below:
> head(data)
positionmonth positionyear quadrant Species Code sum_areainkm2
1 5 2014 1 Bar-tailed Godwit 5340 155.6562
2 5 2014 1 Bar-tailed Godwit 5340 155.6562
3 5 2014 1 Bar-tailed Godwit 5340 155.6562
4 5 2014 1 Bar-tailed Godwit 5340 155.6562
5 5 2014 1 Gannet 710 155.6562
6 5 2014 1 Bar-tailed Godwit 5340 155.6562
sub.gannet<-subset(data, Species == "Gannet")
sub.gannet<-data.frame(sub.gannet)
x<-sub.gannet
aggr.gannet<-aggregate(sub.gannet$sum_areainkm2, by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant, sub.gannet$Species, sub.gannet$Code), FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet)<-c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet<-data.frame(aggr.gannet)
> aggr.gannet
positionyear positionmonth quadrant species code x.observed_area x.NoInd
1 2014 5 4 Gannet 710 79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
Arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
R doesn't seem to view the results from the function (observed_area and NoInd) as numeric values in their own right. That was already apparent, when I couldn't give them a name each, but had to call them "x".
How can I calculate density under these circumstances? Or is there another way to aggregate with multiple functions over the same variable that will result in a usable output?
It's a quirk of aggregate with multiple aggregations that the results are stored as a matrix inside the single column for the aggregated variable, rather than as separate columns.
The easiest way to get rid of this is to go through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
It will still use x as the name. The way I discovered to fix this is to use the formula interface to aggregate, so your aggregate would look more like
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking through it (I haven't taken the subset here, to illustrate the aggregation by species):
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
Which results in what you want:
> df.agg
positionyear positionmonth quadrant Species Code
1 2014 5 1 Gannet 710
2 2014 5 1 Bar-tailed Godwit 5340
sum_areainkm2.observed_area sum_areainkm2.NoInd
1 155.6562 1
2 155.6562 5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
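With the columns flattened, the density from the question is a plain column division. A minimal sketch, assuming density means number of individuals divided by mean observed area, and using a cut-down copy of the example data (grouped by Species only, to keep it short):

```r
# Cut-down copy of the example data: five godwits and one gannet, same area
df <- data.frame(
  Species       = c(rep("Bar-tailed Godwit", 4), "Gannet", "Bar-tailed Godwit"),
  sum_areainkm2 = rep(155.6562, 6)
)

# Aggregate with two statistics, then flatten the matrix column
df.agg <- as.data.frame(as.list(aggregate(
  sum_areainkm2 ~ Species,
  data = df,
  FUN = function(x) c(observed_area = mean(x), NoInd = length(x)))))

# The flattened columns are ordinary numeric vectors now
df.agg$density <- df.agg$sum_areainkm2.NoInd / df.agg$sum_areainkm2.observed_area
df.agg
```

Without the as.list step, sum_areainkm2.NoInd would not exist as a column and the division would return numeric(0), which is the symptom in the question.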
Obligatory note here, that dplyr and data.table are powerful libraries that allow doing this sort of aggregation very simply and efficiently.
dplyr
dplyr has some strange syntax (the %>% operator), but ends up being quite readable and allows chaining more complex operations.
> require(dplyr)
> df %>%
group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
summarise(observed_area=mean(sum_areainkm2), NoInd = n())
data.table
data.table has a more compact syntax and may be faster with large datasets.
library(data.table)
dt <- as.data.table(df)  # convert the example data frame to a data.table
dt[,
.(observed_area=mean(sum_areainkm2), NoInd=.N),
by=.(positionyear, positionmonth, quadrant, Species, Code)]
I have 2 dataframes sharing the same row IDs but with different columns.
Here is an example:
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure with chrom and coord combined on the x axis (so 3 points) and, for each x value, 2 boxplots side by side corresponding to the two dataframes.
What is the best way of doing this? Should I merge the two dataframes somehow to get only one, and loop over the boxplot rendering by 3 columns?
Any idea how this can be done?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframes in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape to melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what I'd do:
library(reshape2)  # for melt()
library(ggplot2)   # for the plot below
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from, and rbind them together. And finally I do a bit of messy work to make a factor out of the chrom/coord pairs, sorted in the correct order.
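The factor step matters because pasted labels otherwise sort alphabetically, which puts chromosome 10 before chromosome 2. A small base R sketch of the same idea, using made-up chrom/coord values (not the question's data) so the difference is visible:

```r
# Hypothetical positions chosen so that alphabetical and numeric order differ
dd <- data.frame(chrom = c(10L, 2L, 10L, 2L),
                 coord = c(500L, 900L, 40L, 900L))

# Unique chrom/coord pairs, sorted numerically by chrom and then coord
ord <- unique(dd[order(dd$chrom, dd$coord), c("chrom", "coord")])
lev <- paste(ord$chrom, ord$coord, sep = ":")

# Use the numerically sorted labels as the factor levels
dd$pos <- factor(paste(dd$chrom, dd$coord, sep = ":"), levels = lev)

levels(dd$pos)
# "2:900" "10:40" "10:500"
```

A plain factor() call on the pasted labels would instead give the levels "10:40", "10:500", "2:900", and the boxplots would appear in that order on the x axis.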
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like
Ok so I have a csv file similar to this structure
hashID,value,flag
98fafd, 35, 1
fh56w2, 25, 0
ggjeas, 55, 1
adfh5d, 45, 0
Basically what I want to do is get the median of the value column but only include rows where flag==1 in the calculation.
Is this even possible in R? I've searched around and haven't found anything like this.
You can also do this in a quick one-liner with a boolean array for an index to the data frame:
# read the data from a csv file
newdata <- read.csv("file.csv")
# this will give you a vector of boolean values of length nrow(newdata)
newdata$flag==1
# and this line uses the above vector to retrieve only those elements of
# newdata$value for which the row contains a flag value of 1
median(newdata$value[newdata$flag==1])
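To try this without a file on disk, here is a minimal sketch that reads the four rows from the question through the text argument of read.csv. The inline string stands in for file.csv, and strip.white = TRUE is an assumption to deal with the spaces after the commas.

```r
# The question's rows, supplied inline instead of from a file
csv_text <- "hashID,value,flag
98fafd, 35, 1
fh56w2, 25, 0
ggjeas, 55, 1
adfh5d, 45, 0"

newdata <- read.csv(text = csv_text, strip.white = TRUE)

# Logical index: TRUE for rows whose flag is 1
keep <- newdata$flag == 1

# Median of value over the flagged rows only (35 and 55)
med <- median(newdata$value[keep])
med
# 45
```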
Here is one possibility:
Read your data set using the following command:
newdata <- read.csv("stackoverflow questions/mediancol.csv")
# I assume you have the data in csv format
# Showing the data I used for the computation
newdata <- structure(list(hashID = structure(c(1L, 3L, 4L, 2L), .Label = c("98fafd",
"adfh5d", "fh56w2", "ggjeas"), class = "factor"), value = c(35L,
25L, 55L, 45L), flag = c(1L, 0L, 1L, 0L)), .Names = c("hashID",
"value", "flag"), class = "data.frame", row.names = c(NA, -4L
))
> newdata
hashID value flag
1 98fafd 35 1
2 fh56w2 25 0
3 ggjeas 55 1
4 adfh5d 45 0
# Subset the data when flag =1
newdata1 <- subset(newdata,flag==1)
# Look at the summary of the data
> summary(newdata1)
hashID value flag
98fafd:1 Min. :35 Min. :1
adfh5d:0 1st Qu.:40 1st Qu.:1
fh56w2:0 Median :45 Median :1
ggjeas:1 Mean :45 Mean :1
3rd Qu.:50 3rd Qu.:1
Max. :55 Max. :1
# Only look at the median
median(newdata1$value)
[1] 45
I have the following table, grouped by first, second and Name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that corresponds to each specific sample. Several groups don't have areas from all the samples, so if a sample isn't detected it should return NA. Ideally, the final output should have one column for each sample.
I have tried the ifelse function to create one column for each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, NA)
However, this doesn't take the group distribution into account. I want to do this within each group (a group has equal values in the first, second and Name columns): if Sample is PO1:1, return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the column Sample to make its levels legal names, or else it might cause errors
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min range of myData$second
min.second <- 0 #
sprd <- 0.15 # width (spread) for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s][1]) # <~~ This line will work when there is at most one value per sample; [1] yields NA when the sample is absent
# res
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?
> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598
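If the missing sample should still get its own all-NA column, one option is to declare the full set of levels yourself and cross-tabulate with tapply, which keeps unused factor levels. A minimal sketch, assuming the complete sample set is PO1:1 to PO4:1 (as the question's desired output implies) and that a group is an exact match on first, second and Name; the key column and the "|" separator are just illustrative choices:

```r
# The question's data, typed out as a plain data frame
myData <- data.frame(
  first  = c(120L, 120L, 126L, 126L, 126L, 132L, 132L),
  second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
  Name   = c("Pentanone", "Pentanone", "Heptene", "Heptene",
             "Methylamine", "Benzene", "Ethene._trichloro-"),
  Area   = c(699468L, 153744L, 32913L, 4948619L, 83528L, 536339L, 105598L),
  Sample = c("PO4:1", "PO2:1", "PO4:1", "PO4:1", "PO4:1", "PO1:1", "PO1:1")
)

# Tell R explicitly which samples exist, including the unobserved PO3:1 (assumed)
myData$Sample <- factor(myData$Sample,
                        levels = c("PO1:1", "PO2:1", "PO3:1", "PO4:1"))

# One key per group; the separator is arbitrary
key <- paste(myData$first, myData$second, myData$Name, sep = "|")

# Rows are groups, columns are samples; empty cells come back as NA.
# function(a) a[1] keeps the first Area if a cell ever holds several.
wide <- tapply(myData$Area, list(group = key, sample = myData$Sample),
               function(a) a[1])
wide
```

The result is a matrix with six group rows and four sample columns, and the PO3:1 column is NA throughout.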