Moving data in dataframe - r

I'm trying to shift the data in a data frame. For each tree, I want the first value not equal to 0 to end up in height1.
The example data looks as follows:
Tree <- c(1:10)
height0 <- c(0,0,0,0,0,0,0,0,0,0)
height1 <- c(1.5,2.0,0.0,1.2,1.3,0.9,0.0,0.0,1.8,0.0)
height2 <- c(2.4,2.2,1.1,1.9,1.4,1.7,0.0,0.0,2.7,0.0)
height3 <- c(3.1,2.9,2.1,2.6,2.2,2.4,0.0,0.6,3.6,0.0)
height4 <- c(3.8,3.4,2.9,3.0,2.9,3.1,0.0,1.1,4.1,0.0)
height5 <- c(4.2,3.7,3.6,3.7,3.5,3.8,0.7,1.9,4.6,0.0)
height6 <- c(4.4,4.1,4.1,4.2,4.0,4.5,1.6,2.6,4.9,1.2)
height7 <- c(4.7,4.4,4.3,4.6,4.2,4.9,2.2,3.0,5.1,2.0)
df <- data.frame(Tree, height0, height1, height2, height3, height4, height5, height6, height7)
So the data frame df looks as follows:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 0.0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.0 0.0 0.0 0.0 0.7 1.6 2.2
8 8 0 0.0 0.0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 0.0 0.0 0.0 0.0 0.0 1.2 2.0
I'm trying to move all the first height values to height1, as not all the trees germinated at the same time and I only want to compare growth speed without getting false results due to germination differences.
So my data should look like this afterwards:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.7 1.6 2.2
8 8 0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 1.2 2.0
Is there a way to do this?
I have over 3000 trees, each measured 40 times, and doing it manually is going to take too long.
Thank you

One option would be to loop through the rows (apply with MARGIN = 1), extract the non-zero elements, pad the rest with NA using `length<-`, transpose the output and assign it back.
df[-(1:2)] <- t(apply(df[-(1:2)], 1, function(x) `length<-`(x[x!=0], ncol(df)-2)))
df
# Tree height0 height1 height2 height3 height4 height5 height6 height7
#1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
#2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
#3 3 0 1.1 2.1 2.9 3.6 4.1 4.3 NA
#4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
#5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
#6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
#7 7 0 0.7 1.6 2.2 NA NA NA NA
#8 8 0 0.6 1.1 1.9 2.6 3.0 NA NA
#9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
#10 10 0 1.2 2.0 NA NA NA NA NA
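
The padding step can be looked at on a single row. The sketch below is a variant, not the code above: it assumes that a zero recorded after germination should stay in place, so it drops only the leading zeros (x != 0 would remove every zero in the row). The helper name shift_left is made up for this illustration.

```r
# Hypothetical helper: drop leading zeros only, then pad with NA on the right
shift_left <- function(x) {
  kept <- x[cumsum(x != 0) > 0]     # TRUE from the first non-zero value onwards
  `length<-`(kept, length(x))       # extending a vector fills the tail with NA
}

row <- c(0, 0, 1.1, 0, 2.1)
shifted <- shift_left(row)
shifted
# [1] 1.1 0.0 2.1  NA  NA
```

Swapping this function into the apply call, in place of the anonymous function, would give the same result as above for the example data, since none of its rows has a zero after the first non-zero height.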

Related

Matching elements of two lists of different sizes by their names

I have two lists of different sizes. One list (named trees) is composed of phylogenetic trees (class phylo) and the second list (named data_values) is composed of numeric values.
The tip names of each phylogenetic tree in the list trees match the names of the elements in the list of values, but data_values contains more elements than any tree has tips.
library(phytools)
library(ape)
library(plyr) # provides llply() and ldply() used below
#original tree:
tree_original = rtree(12, tip.label = paste0("species", LETTERS[1:12]))
##list of trees:
nodes = 14:23
trees = lapply(nodes, extract.clade, phy = tree_original)
names(trees) <- paste0("", 14:23)
data_values <- list()
for (i in 1:17) { data_values[[paste0('species', LETTERS[i])]] <- round(rnorm(10, 5, 4), 1) }
I would like to match both lists (trees and data_values) using species as an index, to get a data frame for each tree (see example below). I can do this for each tree individually but, as my list of species is much bigger than this example, I would like to know if I can do it for the whole list of trees at once rather than tree by tree, like this:
tree14 = data_values[match(trees$`14`$tip.label, names(data_values))]
tree14 = llply(tree14, function(x) sapply(x, as.numeric))
tree14_df = ldply(tree14, .fun=identity) # I will need each result as a data.frame
.id 1 2 3 4 5 6 7 8 9 10
1 speciesE -0.5 3.4 2.0 5.3 3.7 8.2 3.5 -2.0 3.1 10.2
2 speciesL 6.8 4.3 7.1 5.5 4.9 2.5 0.3 -3.8 4.1 6.4
3 speciesA 2.5 2.5 9.6 10.6 2.2 7.1 4.1 4.4 6.0 6.7
4 speciesI -3.5 7.2 6.8 2.8 7.5 8.9 13.4 13.1 1.8 5.5
5 speciesC 4.3 2.2 10.0 7.4 4.4 8.3 -0.7 3.6 9.2 6.3
6 speciesH 6.3 6.1 2.2 4.6 7.4 7.3 2.9 0.6 3.0 5.2
7 speciesB 8.3 1.7 -0.1 4.5 9.4 -0.2 7.5 1.4 -0.3 4.6
8 speciesD 6.2 5.8 6.6 1.1 5.4 11.1 -1.1 0.0 7.9 0.4
9 speciesG 3.5 2.8 1.4 11.6 -2.8 11.0 3.5 2.8 3.1 4.8
10 speciesK 0.9 4.9 5.4 2.7 -0.7 5.1 18.3 4.9 2.5 -0.7
tree15 = data_values[match(trees$`15`$tip.label, names(data_values))]
tree15 = llply(tree15, function(x) sapply(x, as.numeric))
tree15_df = ldply(tree15, .fun=identity)
.id 1 2 3 4 5 6 7 8 9 10
1 speciesE -0.5 3.4 2.0 5.3 3.7 8.2 3.5 -2.0 3.1 10.2
2 speciesL 6.8 4.3 7.1 5.5 4.9 2.5 0.3 -3.8 4.1 6.4
3 speciesA 2.5 2.5 9.6 10.6 2.2 7.1 4.1 4.4 6.0 6.7
4 speciesI -3.5 7.2 6.8 2.8 7.5 8.9 13.4 13.1 1.8 5.5
5 speciesC 4.3 2.2 10.0 7.4 4.4 8.3 -0.7 3.6 9.2 6.3
6 speciesH 6.3 6.1 2.2 4.6 7.4 7.3 2.9 0.6 3.0 5.2
7 speciesB 8.3 1.7 -0.1 4.5 9.4 -0.2 7.5 1.4 -0.3 4.6
... this operation goes until tree23
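
One possible way to run this for the whole list at once is lapply over trees, sketched here in base R. To keep the example runnable without ape, the trees are replaced by hypothetical stand-in lists that only carry a tip.label element, and data_values holds made-up numbers; with real phylo objects the same function should apply, since it only reads tip.label.

```r
# Hypothetical stand-ins for the phylo objects: only $tip.label is used
trees <- list(
  `14` = list(tip.label = c("speciesA", "speciesC")),
  `15` = list(tip.label = c("speciesB"))
)

# Made-up values; more species than any single tree has tips
data_values <- list(
  speciesA = c(1.0, 2.0, 3.0),
  speciesB = c(4.0, 5.0, 6.0),
  speciesC = c(7.0, 8.0, 9.0),
  speciesD = c(0.5, 0.6, 0.7)
)

# One data frame per tree, returned as a named list
tree_dfs <- lapply(trees, function(tr) {
  vals <- data_values[match(tr$tip.label, names(data_values))]
  mat  <- do.call(rbind, vals)          # one row per species
  out  <- data.frame(.id = rownames(mat), mat, check.names = FALSE)
  rownames(out) <- NULL
  out
})

tree_dfs[["14"]]   # data frame for tree 14
```

The result is a list with the same names as trees, so tree_dfs[["15"]] and so on replace the separate tree15_df objects.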

Subset rows in dataframe with at least two of the multiple conditions

This has already been answered at this link (subset by at least two out of multiple conditions); however, I have an additional question. The following is my dataframe (df):
a b c d1 d2 e z
3.2 0.6 5.8 143.7 95.0 2.9 2
3.3 1.3 5.3 137.3 73.3 1.0 1
2.8 1.3 5.6 135.3 79.3 1.8 2
2.9 1.4 5.3 137.7 82.0 1.9 2
4.7 1.8 5.5 143.0 86.5 1.5 1
3.2 1.4 5.8 125.3 79.0 1.5 2
2.6 1.8 5.8 137.3 79.0 1.0 1
3.4 1.4 5.1 132.0 72.3 1.0 1
3.5 1.8 5.0 130.7 75.7 2.0 2
2.1 1.2 4.6 108.3 70.7 1.5 2
3.8 1.7 5.1 133.5 79.8 1.8 2
3.3 1.3 5.1 121.7 79.7 1.5 2
5.2 1.5 5.2 144.7 88.3 1.5 2
4.8 1.2 5.3 127.7 78.0 1.8 2
2.8 0.6 5.4 116.7 61.7 2.0 2
3.7 1.4 4.7 101.0 63.3 1.6 2
2.9 1.4 5.0 121.3 76.3 1.5 2
2.2 1.5 5.3 144.3 83.7 1.6 2
4.4 0.8 5.1 140.0 84.7 1.4 2
5.0 2.4 5.5 124.3 83.0 1.6 2
1.9 0.9 5.4 143.0 79.7 1.1 1
4.5 1.7 5.8 143.7 91.7 0.9 1
3.3 0.7 5.1 127.3 69.3 2.2 2
3.4 1.3 5.6 161.0 87.7 1.7 2
4.5 1.8 6.1 139.7 75.3 1.2 1
3.9 0.8 5.2 99.3 61.0 1.2 2
2.6 2.4 4.8 127.0 79.3 1.8 2
3.4 0.9 5.3 130.0 79.0 1.0 1
2.7 0.4 4.8 135.0 83.7 1.0 2
2.9 1.9 4.7 132.7 90.3 1.5 2
3.9 1.1 6.5 126.3 68.0 1.3 2
3.1 0.9 5.9 152.0 98.3 1.3 1
4.6 1.7 6.0 144.0 96.3 1.5 1
4.1 4.8 5.1 132.7 70.3 0.8 1
5.9 1.2 5.6 130.3 79.0 1.4 2
3.9 2.9 5.3 128.0 76.3 0.7 1
3.2 1.3 5.9 151.7 88.7 1.4 2
3.7 4.0 6.4 133.0 82.7 1.2 2
3.1 1.4 6.6 124.7 76.0 1.0 1
2.9 0.6 5.4 121.0 74.0 2.1 2
3.4 4.1 5.1 137.3 69.0 0.8 1
3.4 2.7 4.9 136.3 78.3 1.4 1
4.0 0.9 4.8 123.0 71.0 2.1 2
2.5 0.8 4.5 175.3 107.8 1.7 1
5.0 2.2 5.2 151.7 78.7 1.3 1
3.9 6.4 5.6 128.7 85.3 0.6 1
3.4 1.5 5.7 131.0 81.0 1.5 1
3.7 0.9 5.3 104.7 67.0 0.9 2
2.3 1.8 5.8 126.3 78.7 1.0 1
5.0 1.3 5.5 134.7 85.7 1.2 1
3.2 1.9 6.1 130.7 77.7 0.9 2
3.8 1.8 5.8 123.0 75.0 1.4 1
3.6 2.1 5.0 135.3 87.0 1.3 1
3.7 3.5 6.0 145.8 80.3 1.4 1
3.2 0.6 4.7 114.0 71.0 1.9 2
3.9 1.5 5.3 129.7 87.0 1.2 1
4.3 1.4 4.9 105.0 67.7 1.2 2
4.2 2.7 6.3 122.0 76.7 1.2 2
4.8 2.9 5.6 131.0 76.3 1.1 1
2.5 2.2 5.4 115.3 70.7 1.3 1
2.5 1.4 5.1 148.3 93.3 2.4 2
3.7 0.8 4.7 117.3 77.7 1.2 2
4.0 2.7 6.2 127.3 79.3 1.1 2
2.6 1.2 5.6 155.3 109.7 1.5 1
3.3 2.1 5.1 118.7 72.3 1.4 2
4.2 0.8 5.4 126.0 73.7 2.0 2
4.0 1.6 5.3 153.0 86.7 1.4 2
3.8 1.2 6.7 154.3 84.0 1.6 2
3.2 1.8 5.4 168.7 87.7 1.2 1
3.2 1.3 5.2 135.0 74.3 1.2 1
3.5 1.2 5.9 138.3 75.3 1.4 1
3.6 1.4 5.1 126.7 81.0 1.1 2
3.3 1.7 6.4 152.3 87.7 1.5 1
2.6 0.7 5.6 134.3 74.7 2.2 2
4.1 1.8 5.8 154.8 83.0 1.7 2
2.5 1.0 4.6 147.7 93.0 1.2 1
4.0 1.7 5.9 132.3 80.7 1.3 1
3.2 1.5 6.1 144.3 85.0 1.3 1
2.8 1.6 4.7 115.3 81.0 1.4 2
3.4 1.0 6.0 130.8 80.3 1.2 1
2.9 1.3 5.5 132.7 82.3 1.5 2
4.0 1.9 5.9 114.0 67.7 1.7 2
4.1 1.3 5.3 129.7 77.0 1.4 2
1.9 1.1 6.1 124.3 58.0 1.5 2
3.0 1.2 5.0 129.3 81.7 1.6 2
4.1 0.9 5.0 129.7 80.3 1.5 1
3.2 2.8 5.5 127.3 72.8 1.0 1
3.2 1.0 4.6 135.7 80.0 2.8 2
3.0 1.7 5.7 154.3 88.3 1.4 2
3.2 3.1 6.2 129.3 76.7 1.2 1
I want to subset this in such a way that at least 2 of the following 5 conditions are met:
a >= 4.11
b >= 2.26
c >= 5.6
d1 <= 140 and/or d2 <= 90 (considering both these variables d1 and/or d2 as one condition)
e <= 1.03 mmol/L (when z == 1) and e <= 1.29 mmol/L (when z == 2)
I understand how to add the first 3 in the following code, but can anyone help me add the last 2 conditions as well?
df_new <- df[rowSums(cbind(df$a >= 4.11, df$b >= 2.26, df$c >= 5.6)) > 1,]
Thanks in advance.
With filter from the dplyr package you can subset on the number of conditions met. Each comparison returns a logical vector, so adding them up counts how many conditions hold for each row. Separating the conditions with commas would instead require all of them at once.
library(dplyr)
df_new <- df %>%
  filter(
    (a >= 4.11) +
    (b >= 2.26) +
    (c >= 5.6) +
    (d1 <= 140 | d2 <= 90) +                 # d1 and/or d2 count as one condition
    (e <= ifelse(z == 1, 1.03, 1.29)) >= 2   # threshold for e depends on z
  )
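
A base R version that extends the rowSums call from the question is sketched below. It assumes that "d1 and/or d2" means a logical OR counted as a single condition, and that the threshold for e is 1.03 when z == 1 and 1.29 otherwise. The four rows are copied from the question's data to keep the example short.

```r
# Four rows taken from the question's data frame
df <- data.frame(
  a  = c(3.2, 4.7, 5.0, 3.9),
  b  = c(0.6, 1.8, 2.4, 6.4),
  c  = c(5.8, 5.5, 5.5, 5.6),
  d1 = c(143.7, 143.0, 124.3, 128.7),
  d2 = c(95.0, 86.5, 83.0, 85.3),
  e  = c(2.9, 1.5, 1.6, 0.6),
  z  = c(2, 1, 2, 1)
)

# One logical column per condition
conds <- cbind(
  df$a >= 4.11,
  df$b >= 2.26,
  df$c >= 5.6,
  df$d1 <= 140 | df$d2 <= 90,             # d1 and/or d2 as one condition
  df$e <= ifelse(df$z == 1, 1.03, 1.29)   # z-dependent threshold for e
)

# Keep rows where at least two conditions are TRUE
df_new <- df[rowSums(conds) >= 2, ]
df_new
```

Here the first row meets only the condition on c and is dropped; the other three rows meet two or more conditions and are kept.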

Create matrix from dataframe in R

I have the following dataset:
> iris
X5.1 X3.3 X1.7 X0.5 X.1
1 6.1 3.0 4.6 1.4 1
2 4.8 3.1 1.6 0.2 -1
3 5.0 3.4 1.5 0.2 -1
4 4.5 2.3 1.3 0.3 -1
5 5.4 3.4 1.7 0.2 -1
6 5.1 2.5 3.0 1.1 1
7 5.5 2.6 4.4 1.2 1
8 4.8 3.4 1.9 0.2 -1
9 6.5 2.8 4.6 1.5 1
10 5.4 3.0 4.5 1.5 1
11 5.8 4.0 1.2 0.2 -1
12 5.0 3.3 1.4 0.2 -1
13 7.0 3.2 4.7 1.4 1
14 5.0 3.4 1.6 0.4 -1
15 4.7 3.2 1.6 0.2 -1
16 5.0 2.3 3.3 1.0 1
17 4.4 3.0 1.3 0.2 -1
18 5.0 3.0 1.6 0.2 -1
19 4.9 3.0 1.4 0.2 -1
Now, I want to create a matrix called "train.x" that stores 10 rows and 4 columns from the given dataset. How would I do that? My solution so far is
train.x<-matrix(iris[1:70,1:4])
but it doesn't work. Any help would be appreciated, thanks!
Use as.matrix() on the subset and assign the result:
train.x <- as.matrix(iris[1:10, 1:4])
train.x
# X5.1 X3.3 X1.7 X0.5
#1 6.1 3.0 4.6 1.4
#2 4.8 3.1 1.6 0.2
#3 5.0 3.4 1.5 0.2
#4 4.5 2.3 1.3 0.3
#5 5.4 3.4 1.7 0.2
#6 5.1 2.5 3.0 1.1
#7 5.5 2.6 4.4 1.2
#8 4.8 3.4 1.9 0.2
#9 6.5 2.8 4.6 1.5
#10 5.4 3.0 4.5 1.5
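
The difference between the two calls can be checked on a small example. The data frame dat below is a made-up stand-in for the question's dataset. As far as I know, matrix() treats a data frame as a list of columns, so it does not produce a numeric matrix, while as.matrix() converts the cells themselves.

```r
# Hypothetical data frame standing in for the question's dataset
dat <- data.frame(
  x1    = c(6.1, 4.8, 5.0),
  x2    = c(3.0, 3.1, 3.4),
  label = c(1, -1, -1)
)

# as.matrix() keeps the 2 x 2 layout and the numeric type
train.x <- as.matrix(dat[1:2, 1:2])
dim(train.x)
is.numeric(train.x)

# matrix() on the same subset: inspect what comes back instead
not_converted <- matrix(dat[1:2, 1:2])
typeof(not_converted)
dim(not_converted)
```

Note also that the question's attempt selects rows 1:70, while the stated goal is 10 rows.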

Merge two dataframes containing duplicate elements

Given two dataframes whose names overlap partially, foo and bar:
foo <- iris[1:10,-c(4,5)]
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
# 7 4.6 3.4 1.4
# 8 5.0 3.4 1.5
# 9 4.4 2.9 1.4
# 10 4.9 3.1 1.5
bar <- iris[3:13,-c(3,5)]
bar[1:8, ] <- bar[1:8, ] * 2
# Sepal.Length Sepal.Width Petal.Width
# 3 9.4 6.4 0.4
# 4 9.2 6.2 0.4
# 5 10.0 7.2 0.4
# 6 10.8 7.8 0.8
# 7 9.2 6.8 0.6
# 8 10.0 6.8 0.4
# 9 8.8 5.8 0.4
# 10 9.8 6.2 0.2
# 11 5.4 3.7 0.2
# 12 4.8 3.4 0.2
# 13 4.8 3.0 0.1
How can I merge the dataframes such that both rows and columns are padded for missing cases, while prioritising the results of one dataframe for overlapping elements? In this example, it is the overlapping results in bar that I wish to prioritise.
merge(..., by = "row.names", all = TRUE) is close, in that it retains all 13 rows, and returns missing values as NA:
foobar <- merge(foo, bar, by = "row.names", all = TRUE)
# Row.names Sepal.Length.x Sepal.Width.x Petal.Length Sepal.Length.y Sepal.Width.y Petal.Width
# 1 1 5.1 3.5 1.4 NA NA NA
# 2 10 4.9 3.1 1.5 9.8 6.2 0.2
# 3 11 NA NA NA 5.4 3.7 0.2
# 4 12 NA NA NA 4.8 3.4 0.2
# 5 13 NA NA NA 4.8 3.0 0.1
# 6 2 4.9 3.0 1.4 NA NA NA
# 7 3 4.7 3.2 1.3 9.4 6.4 0.4
# 8 4 4.6 3.1 1.5 9.2 6.2 0.4
# 9 5 5.0 3.6 1.4 10.0 7.2 0.4
# 10 6 5.4 3.9 1.7 10.8 7.8 0.8
# 11 7 4.6 3.4 1.4 9.2 6.8 0.6
# 12 8 5.0 3.4 1.5 10.0 6.8 0.4
# 13 9 4.4 2.9 1.4 8.8 5.8 0.4
However, it creates a distinct column for each column in the constituent dataframes, regardless of the fact that they share names.
The desired output would be as such:
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1 5.1 3.5 1.4 NA # unique to foo
# 2 4.9 3.0 1.4 NA # unique to foo
# 3 9.4 6.4 1.3 0.4 # overlap, retained from bar
# 4 9.2 6.2 1.5 0.4 #
# 5 10.0 7.2 1.4 0.4 # .
# 6 10.8 7.8 1.7 0.8 # .
# 7 9.2 6.8 1.4 0.6 # .
# 8 10.0 6.8 1.5 0.4 #
# 9 8.8 5.8 1.4 0.4 #
# 10 9.8 6.2 1.5 0.2 # overlap, retained from bar
# 11 5.4 3.7 NA 0.2 # unique to bar
# 12 4.8 3.4 NA 0.2 # unique to bar
# 13 4.8 3.0 NA 0.1 # unique to bar
My intuition is to subset the data into two disjoint sets, and the set of intersecting elements in bar, then merge these, but I'm sure there is a more elegant solution!
(Edited)
The package plyr is awesome for this sort of thing. Just do:
library(plyr)
foo$ID <- row.names(foo)
bar$ID <- row.names(bar)
foobar <- join(foo, bar, type = "full", by = "ID")
Joining by row.names didn't work, as Flodl noted in the comments, so that's why I made a new column "ID".
I see the glowing recommendation for plyr::join but do not see how it is much different than what the base merge offers:
merge(foo, bar, by=c("Sepal.Length", "Sepal.Width"), all=TRUE)
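
One way to get exactly the layout shown in the desired output, sketched in base R under the assumption that all columns are numeric (as they are here): build an NA-filled matrix over the union of row and column names, fill in foo first, then overwrite with bar so that bar wins wherever the two overlap.

```r
foo <- iris[1:10, -c(4, 5)]
bar <- iris[3:13, -c(3, 5)]
bar[1:8, ] <- bar[1:8, ] * 2

# Union of row names and column names, in order of first appearance
rows <- union(rownames(foo), rownames(bar))
cols <- union(names(foo), names(bar))

# Start with NA everywhere, so cells missing from both inputs stay NA
out <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
              dimnames = list(rows, cols))

# Fill from foo first, then bar; the later assignment takes priority
out[rownames(foo), names(foo)] <- as.matrix(foo)
out[rownames(bar), names(bar)] <- as.matrix(bar)

foobar <- as.data.frame(out)
foobar
```

Swapping the order of the two assignments would prioritise foo instead.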

R: add column with slope coefficient for values over time

I have a dataframe which has values over time. The colnames reflect the time in milliseconds. I would like to add an additional column with the slope coefficient of a line of best fit for each token.
Token 0ms 20ms 40ms 60ms 80ms
1 2.5 3.7 4.8 5.2 6.3
2 3.6 4.9 5.2 6.1 7.8
3 1.1 3.2 4.6 7.8 9.1
4 4.5 3.3 2.1 1.9 NA
5 2.1 3.5 3.7 NA NA
Some rows have NAs, as not all tokens are active for the same amount of time.
d <- read.table(text=
"Token 0ms 20ms 40ms 60ms 80ms
1 2.5 3.7 4.8 5.2 6.3
2 3.6 4.9 5.2 6.1 7.8
3 1.1 3.2 4.6 7.8 9.1
4 4.5 3.3 2.1 1.9 NA
5 2.1 3.5 3.7 NA NA",
header=TRUE,check.names=FALSE)
slopes <- apply(as.matrix(d[,-1]),1,
function(y) {
fit <- lm(y~t,
data=data.frame(y,
t=seq(0,length=length(y),by=20)))
coef(fit)[2]
})
data.frame(d,slopes,check.names=FALSE)
## Token 0ms 20ms 40ms 60ms 80ms slopes
## 1 1 2.5 3.7 4.8 5.2 6.3 0.0455
## 2 2 3.6 4.9 5.2 6.1 7.8 0.0480
## 3 3 1.1 3.2 4.6 7.8 9.1 0.1030
## 4 4 4.5 3.3 2.1 1.9 NA -0.0450
## 5 5 2.1 3.5 3.7 NA NA 0.0400
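
For many rows, fitting lm once per row can be slow. A possible shortcut is the closed-form least-squares slope, cov(t, y) / var(t), computed on the non-missing points of each row. The sketch below assumes the same 20 ms spacing as above and re-enters the example values as a plain matrix; the function name slope_one is made up for this illustration.

```r
# Example values from the question, one row per token
y <- rbind(
  c(2.5, 3.7, 4.8, 5.2, 6.3),
  c(3.6, 4.9, 5.2, 6.1, 7.8),
  c(1.1, 3.2, 4.6, 7.8, 9.1),
  c(4.5, 3.3, 2.1, 1.9, NA),
  c(2.1, 3.5, 3.7, NA, NA)
)
t <- seq(0, by = 20, length.out = ncol(y))   # 0, 20, 40, 60, 80 ms

# Hypothetical helper: least-squares slope of one row, ignoring NA points
slope_one <- function(v) {
  ok <- !is.na(v)
  cov(t[ok], v[ok]) / var(t[ok])
}

slopes <- apply(y, 1, slope_one)
round(slopes, 4)
# [1]  0.0455  0.0480  0.1030 -0.0450  0.0400
```

These match the coefficients from lm above, since lm also drops the NA observations by default.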
