This is what the sample looks like:
PC1 PC2 PC3 PC4 clusterNum
1 -3.0278979 -0.9414093 -2.0593369 -0.92992822 6
2 -1.5343149 2.5393680 -0.6645160 -0.42415503 1
3 -3.1827899 0.4878230 -2.1716015 0.87140142 1
4 -2.0630451 -0.6765663 -2.0103567 -1.20913031 6
5 -2.5608251 0.3093504 -1.8429190 -0.08175088 1
6 -2.3229565 2.1314606 -1.0680616 0.53312488 1
7 -1.8015610 -0.4233978 -0.7954366 -0.74790714 6
62378 -2.5379848 -1.3008801 -1.3621545 0.93952670 6
62379 0.5763662 -0.5990910 -0.2045754 0.32887753 5
62380 1.0751095 -0.9948755 0.4209824 0.89306204 5
data <- structure(list(PC1 = c(-3.02789789907534, -1.53431493608036,-3.18278992851587, -2.06304508820853, -2.56082511958789, -2.32295654380193,-1.80156103002696, -2.53798478044841, 0.57636622461764, 1.07510945315635), PC2 = c(-0.94140934359441, 2.53936804189767, 0.487822997171811,-0.676566283079183, 0.309350374661524, 2.13146057296978, -0.423397780929157,-1.30088008176366, -0.599090979848925, -0.994875508747934), PC3 = c(-2.05933693083859,-0.664515950436883, -2.17160152842666, -2.01035669961785, -1.84291903624489,-1.06806160129806, -0.795436603544969, -1.36215450269855, -0.204575393904516,0.420982419847553), PC4 = c(-0.929928223454337, -0.424155026745399,0.871401419380821, -1.20913030836257, -0.0817508821137412, 0.533124880557676,-0.747907142699851, 0.939526696339997, 0.328877528585212, 0.893062041850707), clusterNum = c(6L, 1L, 1L, 6L, 1L, 1L, 6L, 6L, 5L, 5L)), row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 62378L, 62379L, 62380L), class = "data.frame")
I'm learning to make 3D plots in R with the rgl package. I used this code to plot my data:
plot3d(data$PC1, data$PC2, data$PC3, col=data$clusterNum)
and here is my output:
My question is how to add a legend based on my clusterNum column to this graph.
Thank you in advance for any help.
Use rgl::legend3d(). It accepts practically all the arguments of graphics::legend(), e.g. x and y coordinates for the legend position and pch= for the point character (see ?pch for other shapes). To get the legend= elements, sort the unique values of your cluster variable. For the point colors, use the same trick you used in the plot.
library(rgl)
with(data, plot3d(PC1, PC2, PC3, col=clusterNum)) ## use `with` to get nicer labs
k <- sort(unique(data$clusterNum))
legend3d(x=.1, y=.95, legend=k, pch=18, col=k, title='Cluster', horiz=TRUE)
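The reason col=k lines up with the plotted colors can be checked without opening an rgl window. The sketch below assumes that integer color values are looked up in the current palette(), as in base graphics, and that every cluster number is no larger than the palette length; the names cluster and legend_cols are only illustrative.

```r
# Cluster numbers from the sample data in the question
cluster <- c(6L, 1L, 1L, 6L, 1L, 1L, 6L, 6L, 5L, 5L)

# One legend entry per distinct cluster, in ascending order
k <- sort(unique(cluster))

# Integer colors index into the current palette (assumption: max(k) <= length(palette()))
legend_cols <- palette()[k]

# Each legend label paired with the color its points receive
print(data.frame(cluster = k, colour = legend_cols))
```

If there are more clusters than palette entries, passing an explicit color vector such as rainbow(length(k)) to both the plot and the legend is one way to keep them in sync.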
I have a dataset that was recorded by observation (each observation has its own row of data). I am looking to combine/condense these rows by the plant they were found on, which is currently a character variable. All other columns are numerical values.
EX:
This is the raw data
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|1|5|10|
|Asclepias viridiflora|15|1|3|
|Viola unknown|0|0|4|
|Zizia aurea|0|2|6|
|Zizia aurea|3|6|3|
|Asclepias viridiflora|8|2|17|
and I want:
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|4|13|19|
|Asclepias viridiflora|23|3|20|
|Viola unknown|0|0|4|
I am currently pulling this data from a CSV already in table form. I have been attempting to create a new table/data frame with one entry for each plant species, and blanks/0s for each other variable, which I can then cbind to the original. This, however, has been clunky at best, and I am having trouble figuring out how to have each row check itself. I am open to any approach, let me know what you think!
Thanks :D
We can use the formula method of aggregate from base R. On the rhs of the ~, specify the grouping variable, and on the lhs, use . to denote the rest of the variables. Specify FUN as sum and it will compute the column-wise sum by group.
aggregate(. ~ Sci_Name, df1, sum)
-output
Sci_Name Honeybee_count Other_bee_Obsevrved Stem_count
1 Asclepias viridiflora 23 3 20
2 Viola unknown 0 0 4
3 Zizia aurea 4 13 19
data
df1 <- structure(list(Sci_Name = c("Zizia aurea", "Asclepias viridiflora",
"Viola unknown", "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"
), Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L), Other_bee_Obsevrved = c(5L,
1L, 0L, 2L, 6L, 2L), Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)),
class = "data.frame", row.names = c(NA,
-6L))
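As a cross-check on the aggregate() call, the same group sums can be sketched with base R's rowsum(), which adds up rows that share a group label. This is an alternative shown for illustration, not the answer's method; obs and totals are illustrative names, and the data are the six rows from the question.

```r
# Raw observations from the question, one row per observation
obs <- data.frame(
  Sci_Name = c("Zizia aurea", "Asclepias viridiflora", "Viola unknown",
               "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"),
  Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L),
  Other_bee_Obsevrved = c(5L, 1L, 0L, 2L, 6L, 2L),
  Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)
)

# Sum every numeric column within each plant name;
# the plant names become the row names of the result
totals <- rowsum(obs[-1], group = obs$Sci_Name)
print(totals)
```

Unlike aggregate(), rowsum() returns the grouping values as row names rather than as a column, so aggregate() is the better fit when Sci_Name must stay a regular column.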
I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (by number of letters), find the shorter names that match it, and assign them all to one group,
then move on to the next longest name and assign its matches to another group,
until no names are left.
First I calculate the length of each name:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I need to see which other strings share their letters with this long string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
and so on for all possible prefixes, but the order of the letters must stay the same (from left to right); for example, it can be Afgh but not fAhg.
So only two other strings are similar to this one:
Afghanestan
Afghanestankabol
This is because they must match exactly, without a single differing letter (apart from the extra letters of the longest string), to be assigned to the same group.
The desired output is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it is not completely contained in another name (it only partially matches one name or another); it would have to be part of a bigger name.
I tried the approach from "Find similar strings and reconcile them within one dataframe", which did not help me at all.
I found something else which may help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
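The clustering above relies on a distance threshold (h = 0.3) that happens to suit this data. If the rule is meant to be strictly "one name is a prefix of another", a base R sketch along the following lines may be closer to the stated intent. It is a different technique from the Jaro-Winkler clustering: each name is assigned to the shortest name in the set that is a prefix of it (called root here, a hypothetical name), after trimming whitespace and lower-casing as in the answer.

```r
# Labels in the row order of the question's data frame
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# Normalise: strip surrounding spaces and ignore case
clean <- tolower(trimws(labels))

# For each name, find the shortest name in the set that is a prefix of it.
# Every name is a prefix of itself, so there is always at least one candidate.
root <- vapply(clean, function(s) {
  candidates <- clean[startsWith(s, clean)]
  candidates[which.min(nchar(candidates))]
}, character(1), USE.NAMES = FALSE)

# Names sharing a root share a group; groups are numbered by first appearance
group <- match(root, unique(root))
print(data.frame(label = trimws(labels), group = group))
```

On this data the sketch puts indiaAfghanestan in its own group because no other name is a prefix of it, and it groups Holandnorway with holandindia through their shared prefix holand.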
I have a problem regarding results from an aggregate function in R. My aim is to select certain bird species from a data set and calculate the density
of observed individuals over the surveyed area. To that end, I took a subset of the main data file, then aggregated over area, calculating the
mean, and the number of individuals (represented by length of vector). Then I wanted to use the calculated mean area and number of individuals to
calculate density. That didn't work. The code I used is given below:
> head(data)
positionmonth positionyear quadrant Species Code sum_areainkm2
1 5 2014 1 Bar-tailed Godwit 5340 155.6562
2 5 2014 1 Bar-tailed Godwit 5340 155.6562
3 5 2014 1 Bar-tailed Godwit 5340 155.6562
4 5 2014 1 Bar-tailed Godwit 5340 155.6562
5 5 2014 1 Gannet 710 155.6562
6 5 2014 1 Bar-tailed Godwit 5340 155.6562
sub.gannet<-subset(data, species == "Gannet")
sub.gannet<-data.frame(sub.gannet)
x<-sub.gannet
aggr.gannet<-aggregate(sub.gannet$sum_areainkm2, by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant, sub.gannet$Species, sub.gannet$Code), FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet)<-c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet<-data.frame(aggr.gannet)
> aggr.gannet
positionyear positionmonth quadrant species code x.observed_area x.NoInd
1 2014 5 4 Gannet 710 79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
Arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
R doesn't seem to view the results from the function (observed_area and NoInd) as numeric values in their own right. That was already apparent, when I couldn't give them a name each, but had to call them "x".
How can I calculate density under these circumstances? Or is there another way to aggregate with multiple functions over the same variable that will result in a usable output?
It's a quirk of aggregate that when FUN returns several values, the results are stored as a matrix inside the single column for the aggregated variable. That is why aggr.gannet$x.observed_area is NULL: the data frame has one column x that holds a two-column matrix.
The easiest way to get rid of this is to go through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
It will still use x as the name. The way I discovered to fix this is to use the formula interface to aggregate, so your aggregate call would look more like:
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking it through (here I haven't taken the subset to illustrate the aggregation by species)
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
Which results in what you want:
> df.agg
positionyear positionmonth quadrant Species Code
1 2014 5 1 Gannet 710
2 2014 5 1 Bar-tailed Godwit 5340
sum_areainkm2.observed_area sum_areainkm2.NoInd
1 155.6562 1
2 155.6562 5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
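To make the quirk concrete and to finish the density calculation the question asks for, here is a minimal sketch on made-up toy data (the values and the shortened column names Species and area are assumptions for illustration only). It uses do.call(data.frame, ...) as an alternative way to flatten the matrix column.

```r
# Toy survey data: two godwit records and one gannet record (made-up values)
toy <- data.frame(
  Species = c("Godwit", "Godwit", "Gannet"),
  area    = c(10, 10, 20)
)

agg <- aggregate(area ~ Species, data = toy,
                 FUN = function(x) c(observed_area = mean(x), NoInd = length(x)))

# The two statistics live in ONE column that holds a matrix
print(is.matrix(agg$area))
print(names(agg))

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
print(names(flat))

# Density = number of individuals per unit of observed area
flat$density <- flat$area.NoInd / flat$area.observed_area
print(flat)
```

After flattening, the columns are plain numeric vectors, so the division and any later cbind behave as expected.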
Obligatory note here, that dplyr and data.table are powerful libraries that allow doing this sort of aggregation very simply and efficiently.
dplyr
dplyr has an unusual syntax (the %>% operator), but it ends up being quite readable and allows chaining more complex operations.
library(dplyr)
df %>%
group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
summarise(observed_area=mean(sum_areainkm2), NoInd = n())
data.table
data.table has a more compact syntax and may be faster with large datasets.
library(data.table)
dt <- as.data.table(df)  # convert the example data frame from above
dt[,
.(observed_area=mean(sum_areainkm2), NoInd=.N),
by=.(positionyear, positionmonth, quadrant, Species, Code)]
I have a csv file like
id,date,event
1,01-01-2014,E1
1,01-02-2014,E2
2,01-03-2014,E1
2,01-04-2014,E1
2,01-05-2014,E2
I would like to plot events on a time scale using R. For example, the x axis would be the date and the y axis would indicate which event happened on a particular date. There would be one graph per id, so the data set above would produce 2 graphs.
This is a little different from a time series (I think). Is there any way to accomplish this in R?
Thanks
Try:
ddf = structure(list(id = c(1L, 1L, 2L, 2L, 2L), date = structure(1:5, .Label = c("01-01-2014",
"01-02-2014", "01-03-2014", "01-04-2014", "01-05-2014"), class = "factor"),
event = structure(c(1L, 2L, 1L, 1L, 2L), .Label = c("E1",
"E2"), class = "factor")), .Names = c("id", "date", "event"
), class = "data.frame", row.names = c(NA, -5L))
ddf$date2 = as.Date(ddf$date, format="%m-%d-%Y")
ddf
id date event date2
1 1 01-01-2014 E1 2014-01-01
2 1 01-02-2014 E2 2014-01-02
3 2 01-03-2014 E1 2014-01-03
4 2 01-04-2014 E1 2014-01-04
5 2 01-05-2014 E2 2014-01-05
library(ggplot2)
ggplot(data=ddf, aes(x=date2, y=event, group=factor(id), color=factor(id)))+
geom_line()+
geom_point()+
facet_grid(id~.)
Edit: The code is fairly self-explanatory. The date goes on the x-axis and the event on the y-axis. For clarity, each ID is plotted in its own panel (using facet_grid), although they can also be kept in a single graph, as seen in the graph below, generated by dropping the facet_grid call from the code above:
Here there may be some ambiguity where the lines overlap.
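One assumption in the code above is worth checking against the real data: the format string "%m-%d-%Y" reads 01-02-2014 as January 2. The question does not say whether the file uses month-day or day-month order, so the short sketch below simply shows how the two readings differ.

```r
x <- "01-02-2014"

# Month first, as assumed in the answer
us_style <- as.Date(x, format = "%m-%d-%Y")

# Day first, if the file uses day-month-year order instead
eu_style <- as.Date(x, format = "%d-%m-%Y")

print(format(us_style, "%d %B %Y"))
print(format(eu_style, "%d %B %Y"))
```

Only the format string needs to change; the rest of the plotting code stays the same.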
I am trying to compare two questions (columns Q1_b and Q2_b) and barplot them next to each other (in the same barplot). The answer options are 1-6. The problem is that no one answered with 4 for Q1_b, so the barplot skips to displaying 5 where 4 should be for Q1_b, next to the percentage of people who answered 4 for Q2_b. How can I make sure R doesn't do this and automatically enters a 0% column if there weren't any answers for a specific option?
alldataset<-structure(list(Q1_b = c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L),
Q2_b = c(1L, 2L, 2L, 5L, 4L, 3L, 6L, 1L)),
.Names = c("Q1_b", "Q2_b"),
class = "data.frame",
row.names = c(NA, -8L))
Qb<-table(alldataset$Q2_b)
Qf<-table(alldataset$Q1_b)
nrowFUP<-NROW(alldataset$Q1_b)
nrowBL<-NROW(alldataset$Q2_b)
options(digits=6)
newbl <- transform(as.data.frame(table(alldataset$Q2_b)),
percentage_column=Freq/nrowBL*100)
newfup <- transform(as.data.frame(table(alldataset$Q1_b)),
percentage_column=Freq/nrowFUP*100)
matrixQ1<-cbind(newbl$percentage_column, newfup$percentage_column)
matrixQ1dataframe<-data.frame(matrixQ1)
rmatrixQ1<-as.vector(t(matrixQ1dataframe))
roundedrmatrix<-round(rmatrixQ1, digits=0)
barplotmatrix<-matrix(roundedrmatrix)
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b<-barplot(matrix(roundedrmatrix, nr=2),
beside=T, xlab="",
ylab="Percentage",
cex.lab=0.9,
main="Comparison",
cex.main=0.9, ylim=c(0,70),
col=c("black","yellow"),
names.arg=c(1:6),
legend=c("Q2_b","Q1_b"),
args.legend=list(x="bottomleft",
cex=0.8,
inset=c(0.4,-0.4)))
text(x=b, y=roundedrmatrix,labels=roundedrmatrix, pos=3, cex=0.8)
R also warns me this will happen by displaying:
Warning message:
In cbind(newbl$percentage_column, newfup$percentage_column) :
number of rows of result is not a multiple of vector length (arg 2)
I have been trying for ages to sort this out but I am not getting anywhere. Can anyone help?
The problem is that you never told R that your vectors represent categorical responses with potential values of 1-6, so it does not know to include the 0 counts (you would not want it to include a 0 for 7, 8, 1 million, etc.).
Try replacing your first two lines with:
Qb<-table(factor(alldataset$Q2_b, levels=1:6))
Qf<-table(factor(alldataset$Q1_b, levels=1:6))
or run something like:
alldataset$Q1_b <- factor(alldataset$Q1_b, levels=1:6)
alldataset$Q2_b <- factor(alldataset$Q2_b, levels=1:6)
before the table commands.
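The effect of declaring the levels is easy to see on the Q1_b answers from the question. This is a small sketch; counts_raw, counts_full and pct are illustrative names.

```r
# Q1_b answers from the question: nobody chose option 4
q1 <- c(6L, 1L, 5L, 3L, 5L, 6L, 6L, 2L)

# Without levels, table() only knows the values it has seen
counts_raw <- table(q1)
print(counts_raw)

# With explicit levels, the unused option 4 appears with a count of 0
counts_full <- table(factor(q1, levels = 1:6))
print(counts_full)

# Percentages per option, now always six entries long
pct <- round(100 * prop.table(counts_full))
print(pct)
```

Because both questions then yield six entries, combining them with cbind no longer recycles the shorter vector.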
You need to tell table to use all values from one to six with table(factor(x, seq.int(6))).
Here is an improved version of your code:
dat <- t(round(sapply(rev(alldataset),
function(x) table(factor(x, seq.int(6)))) /
nrow(alldataset) * 100))
par(mar=c(7.5,4,3,2), mgp=c(2,.7,0), tck=-.01, las=1, xpd=TRUE)
b <- barplot(dat, beside=T,xlab="", ylab="Percentage", cex.lab=0.9,
main="Comparison", cex.main=0.9, ylim=c(0,70),
             col=c("black","yellow"), names.arg=c(1:6), legend=rownames(dat),
args.legend=list(x="bottomleft", cex=0.8, inset=c(0.4,-0.4)))
text(x=b, y=dat,labels=dat, pos=3, cex=0.8)