I have two lists of different sizes. One list (named trees) is composed of phylogenetic trees (class phylo) and the second list (named data_values) is composed of numeric values.
The tip names of each phylogenetic tree in the list trees match the names of elements in the list of values, but data_values contains more elements than any single tree has tips.
library(phytools)
library(ape)
#original tree:
tree_original = rtree(12, tip.label = paste0("species", LETTERS[1:12]))
##list of trees:
nodes = 14:23
trees = lapply(nodes, extract.clade, phy = tree_original)
names(trees) <- paste0("", 14:23)
data_values <- list()
for (i in 1:17) { data_values[[paste0('species', LETTERS[i])]] <- round(rnorm(10, 5, 4), 1) }
I would like to match both lists (trees and data_values) using the species names as an index, so that I get one data frame per tree (see the example below). I can do this for each tree individually, but my real list of species is much bigger than this example, so I would like to know whether I can apply the operation below to the whole list of trees at once instead of running it tree by tree, like this:
library(plyr) # provides llply() and ldply()
tree14 = data_values[match(trees$`14`$tip.label, names(data_values))]
tree14 = llply(tree14, function(x) sapply(x, as.numeric))
tree14_df = ldply(tree14, .fun=identity) # I will need each result as a data.frame
.id 1 2 3 4 5 6 7 8 9 10
1 speciesE -0.5 3.4 2.0 5.3 3.7 8.2 3.5 -2.0 3.1 10.2
2 speciesL 6.8 4.3 7.1 5.5 4.9 2.5 0.3 -3.8 4.1 6.4
3 speciesA 2.5 2.5 9.6 10.6 2.2 7.1 4.1 4.4 6.0 6.7
4 speciesI -3.5 7.2 6.8 2.8 7.5 8.9 13.4 13.1 1.8 5.5
5 speciesC 4.3 2.2 10.0 7.4 4.4 8.3 -0.7 3.6 9.2 6.3
6 speciesH 6.3 6.1 2.2 4.6 7.4 7.3 2.9 0.6 3.0 5.2
7 speciesB 8.3 1.7 -0.1 4.5 9.4 -0.2 7.5 1.4 -0.3 4.6
8 speciesD 6.2 5.8 6.6 1.1 5.4 11.1 -1.1 0.0 7.9 0.4
9 speciesG 3.5 2.8 1.4 11.6 -2.8 11.0 3.5 2.8 3.1 4.8
10 speciesK 0.9 4.9 5.4 2.7 -0.7 5.1 18.3 4.9 2.5 -0.7
tree15 = data_values[match(trees$`15`$tip.label, names(data_values))]
tree15 = llply(tree15, function(x) sapply(x, as.numeric))
tree15_df = ldply(tree15, .fun=identity)
.id 1 2 3 4 5 6 7 8 9 10
1 speciesE -0.5 3.4 2.0 5.3 3.7 8.2 3.5 -2.0 3.1 10.2
2 speciesL 6.8 4.3 7.1 5.5 4.9 2.5 0.3 -3.8 4.1 6.4
3 speciesA 2.5 2.5 9.6 10.6 2.2 7.1 4.1 4.4 6.0 6.7
4 speciesI -3.5 7.2 6.8 2.8 7.5 8.9 13.4 13.1 1.8 5.5
5 speciesC 4.3 2.2 10.0 7.4 4.4 8.3 -0.7 3.6 9.2 6.3
6 speciesH 6.3 6.1 2.2 4.6 7.4 7.3 2.9 0.6 3.0 5.2
7 speciesB 8.3 1.7 -0.1 4.5 9.4 -0.2 7.5 1.4 -0.3 4.6
... this operation goes until tree23
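The per-tree steps above can be wrapped in a function and applied to every tree with lapply. The following is only a sketch in base R, not a tested answer for the real data: match_tree is a hypothetical helper name, and the two small lists stand in for the real trees and data_values (a phylo object is a list with a tip.label element, which is the only part used here).

```r
# Stand-ins for the real objects (assumptions for illustration only)
trees <- list(`14` = list(tip.label = c("speciesA", "speciesC")),
              `15` = list(tip.label = "speciesB"))
set.seed(1)
data_values <- lapply(1:4, function(i) round(rnorm(10, 5, 4), 1))
names(data_values) <- paste0("species", LETTERS[1:4])

# Hypothetical helper: one data frame per tree, one row per tip
match_tree <- function(tree, values) {
  vals <- values[match(tree$tip.label, names(values))]
  mat <- do.call(rbind, vals)  # one row per species
  data.frame(.id = names(vals), mat, row.names = NULL,
             stringsAsFactors = FALSE)
}

# Apply to the whole list; the result keeps the names "14", "15", ...
tree_dfs <- lapply(trees, match_tree, values = data_values)
```

Each element of tree_dfs is a data.frame, so tree_dfs[["14"]] plays the role of tree14_df above.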
I have a dataset as follows:
Apr May Jun Jul Aug Sep Oct Nov b
1.0 9.0 4.0 5.3 6.4 3.4 2.5 4.3 2
5.0 6.0 9.0 2.3 5.8 2.3 6.5 5.2 3
8.0 4.0 6.0 0.7 5.2 1.2 2.2 6.1 4
2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 7
3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2 8
4.4 4.1 5.1 6.1 7.1 8.1 9.1 6.8 6
5.6 5.0 3.2 4.2 5.2 1.2 2.2 3.2 5
6.8 5.9 8.9 2.3 3.3 5.7 4.7 3.7 5
8.0 6.8 9.8 4.8 5.8 6.8 7.8 8.8 5
9.2 7.7 7.7 2.8 3.8 4.8 5.8 6.8 6
I want to add a column sum with data$sum=rowSums(data[data$b:8]), so that each row is summed from the column given by b through column 8. But I get the warning `numerical expression has 2124 elements: only the first used`. Please let me know a better method.
Here's a solution based on your comments:
data$sum <- NA # important to create the column before the for loop
for (rowIdx in 1:nrow(data)) {
startCol <- data[rowIdx, "b"]
data[rowIdx, "sum"] <- sum(data[rowIdx, startCol:8])
}
You need to use a for loop / apply statement to achieve this because you cannot specify a different starting column for each row using the [ subset operator.
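For comparison, the same per-row logic can be written with vapply instead of an explicit loop. This is a sketch on a three-row toy data frame (the column values and last_col are assumptions for illustration), not the original data:

```r
# Toy data: b gives the starting column for each row
data <- data.frame(Apr = c(1, 5, 8), May = c(9, 6, 4), Jun = c(4, 9, 6),
                   b = c(1, 2, 3))
last_col <- 3  # last column to include in the sum

# One sum per row, each with its own starting column
data$sum <- vapply(seq_len(nrow(data)), function(i) {
  sum(data[i, data$b[i]:last_col])
}, numeric(1))
```

The loop still runs once per row; vapply only makes the result type explicit and avoids pre-creating the column.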
Two things can happen when you use [] without a comma depending on your data structure:
If data is a matrix it will treat the entire matrix as a single vector, where each column occurs one after another. For example, data[1:15] will return the 10 values in the "Apr" column then the first 5 values in the "May" column.
If data is a data.frame it will use the indices to look up columns. That is data[1:5] is the same as data[,1:5]. The reason for this is that a data.frame is really a list() underneath the hood, where each column is an element of the list().
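The two cases can be checked on a small example. This sketch uses a made-up 10 x 2 matrix (m and d are illustrative names) rather than the dataset from the question:

```r
# A 10 x 2 matrix: values 1..10 in "Apr", 11..20 in "May"
m <- matrix(1:20, nrow = 10, dimnames = list(NULL, c("Apr", "May")))
from_matrix <- m[1:15]  # all of "Apr", then the first 5 values of "May"

# The same data as a data.frame: single-index [ selects columns
d <- as.data.frame(m)
same_cols <- identical(d[1:2], d[, 1:2])  # both select columns 1 and 2
```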
I have the following dataset:
> iris
X5.1 X3.3 X1.7 X0.5 X.1
1 6.1 3.0 4.6 1.4 1
2 4.8 3.1 1.6 0.2 -1
3 5.0 3.4 1.5 0.2 -1
4 4.5 2.3 1.3 0.3 -1
5 5.4 3.4 1.7 0.2 -1
6 5.1 2.5 3.0 1.1 1
7 5.5 2.6 4.4 1.2 1
8 4.8 3.4 1.9 0.2 -1
9 6.5 2.8 4.6 1.5 1
10 5.4 3.0 4.5 1.5 1
11 5.8 4.0 1.2 0.2 -1
12 5.0 3.3 1.4 0.2 -1
13 7.0 3.2 4.7 1.4 1
14 5.0 3.4 1.6 0.4 -1
15 4.7 3.2 1.6 0.2 -1
16 5.0 2.3 3.3 1.0 1
17 4.4 3.0 1.3 0.2 -1
18 5.0 3.0 1.6 0.2 -1
19 4.9 3.0 1.4 0.2 -1
Now I want to create a matrix called "train.x" that stores 10 rows and 4 columns from the given dataset. How would I do that? My attempt so far is
train.x<-matrix(iris[1:70,1:4])
but it doesn't work. Any help would be appreciated, thanks!
Use this code:
as.matrix(iris[1:10, 1:4])
# X5.1 X3.3 X1.7 X0.5
#1 6.1 3.0 4.6 1.4
#2 4.8 3.1 1.6 0.2
#3 5.0 3.4 1.5 0.2
#4 4.5 2.3 1.3 0.3
#5 5.4 3.4 1.7 0.2
#6 5.1 2.5 3.0 1.1
#7 5.5 2.6 4.4 1.2
#8 4.8 3.4 1.9 0.2
#9 6.5 2.8 4.6 1.5
#10 5.4 3.0 4.5 1.5
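To see why matrix() is the wrong tool here, compare it with as.matrix() on a small stand-in data frame (d is a made-up example, not the dataset above). matrix() treats a data frame as a list of columns, whereas as.matrix() converts the cells themselves:

```r
# Stand-in data frame with two numeric columns
d <- data.frame(X5.1 = c(6.1, 4.8, 5.0), X3.3 = c(3.0, 3.1, 3.4))

bad  <- matrix(d[1:2, ])     # list-mode matrix: one cell per column
good <- as.matrix(d[1:2, ])  # numeric matrix with 2 rows and 2 columns
```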
I'm trying to shift the data in a data frame. For each tree, I want to move the first value that is not equal to 0 into height1.
The example data looks as follows:
Tree <- c(1:10)
height0 <- c(0,0,0,0,0,0,0,0,0,0)
height1 <- c(1.5,2.0,0.0,1.2,1.3,0.9,0.0,0.0,1.8,0.0)
height2 <- c(2.4,2.2,1.1,1.9,1.4,1.7,0.0,0.0,2.7,0.0)
height3 <- c(3.1,2.9,2.1,2.6,2.2,2.4,0.0,0.6,3.6,0.0)
height4 <- c(3.8,3.4,2.9,3.0,2.9,3.1,0.0,1.1,4.1,0.0)
height5 <- c(4.2,3.7,3.6,3.7,3.5,3.8,0.7,1.9,4.6,0.0)
height6 <- c(4.4,4.1,4.1,4.2,4.0,4.5,1.6,2.6,4.9,1.2)
height7 <- c(4.7,4.4,4.3,4.6,4.2,4.9,2.2,3.0,5.1,2.0)
df <- data.frame(Tree, height0, height1, height2, height3, height4, height5, height6, height7)
So the data frame df looks as follows:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 0.0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.0 0.0 0.0 0.0 0.7 1.6 2.2
8 8 0 0.0 0.0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 0.0 0.0 0.0 0.0 0.0 1.2 2.0
I'm trying to move each tree's first non-zero height value to height1. Not all the trees germinated at the same time, and I only want to compare growth speed without getting false results caused by differences in germination.
So afterwards my data should look as follows:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.7 1.6 2.2
8 8 0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 1.2 2.0
Is there a way to do this?
I have over 3000 trees, each measured 40 times, and doing it manually is going to take too long.
Thank you
One option would be to loop through the rows (apply with MARGIN = 1), extract the non-zero elements, pad the rest with NA (using `length<-`), transpose the output and assign it back.
df[-(1:2)] <- t(apply(df[-(1:2)], 1, function(x) `length<-`(x[x!=0], ncol(df)-2)))
df
# Tree height0 height1 height2 height3 height4 height5 height6 height7
#1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
#2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
#3 3 0 1.1 2.1 2.9 3.6 4.1 4.3 NA
#4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
#5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
#6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
#7 7 0 0.7 1.6 2.2 NA NA NA NA
#8 8 0 0.6 1.1 1.9 2.6 3.0 NA NA
#9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
#10 10 0 1.2 2.0 NA NA NA NA NA
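The padding step may be easier to follow on a single row. A minimal sketch with a made-up vector x standing in for one tree's height1 onwards:

```r
# One tree's heights, with two leading zeros before germination
x <- c(0, 0, 1.1, 2.1, 2.9)

# Drop the zeros, then stretch the result back to the original length;
# `length<-` fills the new trailing positions with NA
shifted <- `length<-`(x[x != 0], length(x))
```

Note that x != 0 removes every zero in the row, not only the leading ones, so this assumes a height never returns to exactly 0 after germination.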
Given two dataframes whose names overlap partially, foo and bar:
foo <- iris[1:10,-c(4,5)]
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
# 7 4.6 3.4 1.4
# 8 5.0 3.4 1.5
# 9 4.4 2.9 1.4
# 10 4.9 3.1 1.5
bar <- iris[3:13,-c(3,5)]
bar[1:8, ] <- bar[1:8, ] * 2
# Sepal.Length Sepal.Width Petal.Width
# 3 9.4 6.4 0.4
# 4 9.2 6.2 0.4
# 5 10.0 7.2 0.4
# 6 10.8 7.8 0.8
# 7 9.2 6.8 0.6
# 8 10.0 6.8 0.4
# 9 8.8 5.8 0.4
# 10 9.8 6.2 0.2
# 11 5.4 3.7 0.2
# 12 4.8 3.4 0.2
# 13 4.8 3.0 0.1
How can I merge the dataframes such that both rows and columns are padded for missing cases, while prioritising the results of one dataframe for overlapping elements? In this example, it is the overlapping results in bar that I wish to prioritise.
merge(..., by = "row.names", all = TRUE) is close, in that it retains all 13 rows, and returns missing values as NA:
foobar <- merge(foo, bar, by = "row.names", all = TRUE)
# Row.names Sepal.Length.x Sepal.Width.x Petal.Length Sepal.Length.y Sepal.Width.y Petal.Width
# 1 1 5.1 3.5 1.4 NA NA NA
# 2 10 4.9 3.1 1.5 9.8 6.2 0.2
# 3 11 NA NA NA 5.4 3.7 0.2
# 4 12 NA NA NA 4.8 3.4 0.2
# 5 13 NA NA NA 4.8 3.0 0.1
# 6 2 4.9 3.0 1.4 NA NA NA
# 7 3 4.7 3.2 1.3 9.4 6.4 0.4
# 8 4 4.6 3.1 1.5 9.2 6.2 0.4
# 9 5 5.0 3.6 1.4 10.0 7.2 0.4
# 10 6 5.4 3.9 1.7 10.8 7.8 0.8
# 11 7 4.6 3.4 1.4 9.2 6.8 0.6
# 12 8 5.0 3.4 1.5 10.0 6.8 0.4
# 13 9 4.4 2.9 1.4 8.8 5.8 0.4
However, it creates a distinct column for each column in the constituent dataframes, regardless of the fact that they share names.
The desired output would be as such:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 5.1 3.5 1.4 NA # unique to foo
# 2 4.9 3.0 1.4 NA # unique to foo
# 3 9.4 6.4 1.3 0.4 # overlap, retained from bar
# 4 9.2 6.2 1.5 0.4 #
# 5 10.0 7.2 1.4 0.4 # .
# 6 10.8 7.8 1.7 0.8 # .
# 7 9.2 6.8 1.4 0.6 # .
# 8 10.0 6.8 1.5 0.4 #
# 9 8.8 5.8 1.4 0.4 #
# 10 9.8 6.2 1.5 0.2 # overlap, retained from bar
# 11 5.4 3.7 NA 0.2 # unique to bar
# 12 4.8 3.4 NA 0.2 # unique to bar
# 13 4.8 3.0 NA 0.1 # unique to bar
My intuition is to subset the data into two disjoint sets, and the set of intersecting elements in bar, then merge these, but I'm sure there is a more elegant solution!
(Edited)
The package plyr is awesome for this sort of thing. Just do:
library(plyr)
foo$ID <- row.names(foo)
bar$ID <- row.names(bar)
foobar <- join(foo, bar, type = "full", by = "ID")
Joining by row.names didn't work, as Flodl noted in the comments, so that's why I made a new column "ID".
I see the glowing recommendation for plyr::join but do not see how it is much different than what the base merge offers:
merge(foo, bar, by=c("Sepal.Length", "Sepal.Width"), all=TRUE)
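For reference, here is a base R sketch of one way to reach the layout shown in the desired output, assuming the row names identify the cases and all columns are numeric. It is not taken from either answer above: it builds an empty frame over the union of rows and columns, fills it from foo, then overwrites with bar so that bar wins wherever the two overlap.

```r
foo <- iris[1:10, -c(4, 5)]
bar <- iris[3:13, -c(3, 5)]
bar[1:8, ] <- bar[1:8, ] * 2

# Union of row names and column names, in order of first appearance
rows <- union(rownames(foo), rownames(bar))
cols <- union(names(foo), names(bar))

# Empty (all-NA) frame covering every row and column
foobar <- as.data.frame(matrix(NA_real_, nrow = length(rows),
                               ncol = length(cols),
                               dimnames = list(rows, cols)))

# Fill from foo first, then bar, so bar takes priority on overlaps
foobar[rownames(foo), names(foo)] <- foo
foobar[rownames(bar), names(bar)] <- bar
```

Swapping the last two assignments would prioritise foo instead.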