R CRAN: package 'diagram' link width

I am trying to show a transition matrix. In addition to the values of the matrix itself, I have another matrix that gives me the 'confidence' for the score in the first matrix. I want to draw the diagram such that the line width reflects the 'confidence' matrix, even though the values shown are from the transition matrix. For example:
library(diagram)
set.seed(1234)
# transition matrix
mat1 <- matrix(round(abs(rnorm(16)),2),4,4)
rownames(mat1) <- colnames(mat1) <- letters[1:4]
# confidence score matrix
mat2 <- matrix(c(rep(1,9),rep(2,2),rep(3,5)),4,4)
rownames(mat2) <- colnames(mat2) <- letters[1:4]
plotmat(mat1,box.size = 0.04,lwd=mat2)
> mat1
a b c d
a 1.21 0.43 0.56 0.78
b 0.28 0.51 0.89 0.06
c 1.08 0.57 0.48 0.96
d 2.35 0.55 1.00 0.11
> mat2
a b c d
a 1 1 1 3
b 1 1 2 3
c 1 1 2 3
d 1 1 3 3
The line widths look ok (e.g. 'd' to 'a', 'd' to 'b','c' to 'b'), except where the link is from a node, back to itself (e.g. 'd' to 'd', or 'c' to 'c'). In these self-looping cases, the line width does not appear to work.
Is there something else that I need to do?
thanks!
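One thing that may be worth trying: as far as I recall, plotmat() draws self-loops using a separate self.lwd argument rather than lwd, which would explain why the diagonal of mat2 has no effect. The sketch below passes the diagonal of the confidence matrix to that argument. That self.lwd accepts one width per node is an assumption here, so please check ?plotmat for your installed version of diagram; the name self_widths is just illustrative.

```r
# Confidence matrix from the question
mat2 <- matrix(c(rep(1, 9), rep(2, 2), rep(3, 5)), 4, 4)
rownames(mat2) <- colnames(mat2) <- letters[1:4]

# Widths for the self-loops: the diagonal of the confidence matrix
self_widths <- unname(diag(mat2))

# Only attempt the plot if the diagram package is installed
if (requireNamespace("diagram", quietly = TRUE)) {
  set.seed(1234)
  mat1 <- matrix(round(abs(rnorm(16)), 2), 4, 4)
  rownames(mat1) <- colnames(mat1) <- letters[1:4]
  # Assumption: self.lwd takes one value per node (see ?plotmat)
  diagram::plotmat(mat1, box.size = 0.04, lwd = mat2, self.lwd = self_widths)
}
```

If self.lwd turns out to accept only a single value in your version, a single width for all self-loops would at least confirm whether this argument is the one in play.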

Related

How to use mutate result as input to calc another column in R dplyr

I'd like to calculate data for two new columns in a data.frame where the results are based on the value of the previous row. However, the previous row also needs to be calculated, which means that there is a dependency between the two columns (the input for one calculation is based on the output of another calculation). I could do it with a for loop, but maybe that's not the right way.
This is a sample for this case:
df <- data.frame(A=c(0.91,0.98,1,1.1), B=c(0.81, 1.11, 0.83, 0.92), C=c(0.09,0.06,0.09,0.08))
df$D <- NA
df$E <- NA
df[1,]$D <- 0.0
I've been trying it through dplyr::mutate.
library(dplyr)
df %>%
  mutate(D = ifelse(lag(A) < 1, lag(E), lag(E) - lag(E) * lag(A)),
         E = B - (B - D) * exp(-C))
This is how the output should be:
> df
A B C D E
1 0.91 0.81 0.09 0.00000000 0.06971574
2 0.98 1.11 0.06 0.06971574 0.13029718
3 1.00 0.83 0.09 0.13029718 0.19051977
4 1.10 0.92 0.08 0.00000000 0.07073296
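Because each row's D depends on the previous row's E, and each E depends on the same row's D, a vectorised mutate() cannot see values that have not been computed yet. A minimal sketch of the row-by-row recurrence with a plain for loop, assuming D starts at 0 in the first row as in the question (the vectors D and E are just working copies):

```r
df <- data.frame(A = c(0.91, 0.98, 1, 1.1),
                 B = c(0.81, 1.11, 0.83, 0.92),
                 C = c(0.09, 0.06, 0.09, 0.08))

n <- nrow(df)
D <- numeric(n)  # D[1] stays 0, the starting value from the question
E <- numeric(n)

for (i in seq_len(n)) {
  if (i > 1) {
    # D depends on the previous row's A and E
    D[i] <- if (df$A[i - 1] < 1) E[i - 1] else E[i - 1] - E[i - 1] * df$A[i - 1]
  }
  # E depends on the current row's B, C and the D just computed
  E[i] <- df$B[i] - (df$B[i] - D[i]) * exp(-df$C[i])
}

df$D <- D
df$E <- E
df
```

A loop is a reasonable choice here because the calculation is inherently sequential; the same recurrence could also be expressed with Reduce() or purrr::accumulate() if a functional style is preferred.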

Find the nth largest values in the top row and omit the rest of the columns in R

I am trying to subset a data frame so that it keeps only those columns whose value in the first row is among the n largest.
For example, let's assume I want to keep only the columns whose value in row 1 is one of the 2 largest.
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
such that a and d are removed, because 0.1 and NA are not among the 2 largest values in row 1. Here 0.6 and 0.12 (columns b and c) are larger than 0.1 and NA (columns a and d).
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the returning data.frame won't be sorted.
get_nth <- function(df, n) {
  df[] <- lapply(df, as.numeric) # edit
  cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
  cols <- cols[seq(n)]
  df <- df[names(df) %in% cols]
  return(df)
}
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top n columns:
> r <- rank(-dat1[1,], na.last=TRUE)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
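The rank-based idea can also be written against a plain numeric vector, which avoids relying on how sort() or rank() treat a one-row data frame. This is a minimal sketch under that assumption; the helper name keep_top_n is hypothetical, and the original column order is preserved.

```r
dat1 <- data.frame(a = c(0.1, 0.2, 0.3, 0.4, 0.5),
                   b = c(0.6, 0.7, 0.8, 0.9, 0.10),
                   c = c(0.12, 0.13, 0.14, 0.15, 0.16),
                   d = c(NA, NA, NA, NA, 0.5))

# Hypothetical helper: keep columns whose first-row value is among the n largest
keep_top_n <- function(df, n) {
  first_row <- unlist(df[1, ])                  # named numeric vector
  r <- rank(-first_row, na.last = "keep")       # rank 1 = largest; NA stays NA
  df[, which(r <= n), drop = FALSE]             # which() drops the NA ranks
}

res <- keep_top_n(dat1, 2)
res
```

Using which() means columns with NA in the first row are simply excluded, and drop = FALSE keeps the result a data frame even when n is 1.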

replacing values in dataframe with another dataframe r

I have a dataframe of values that represent fold changes as such:
> df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
A B C
1 1.74 1.50 1.10
2 -1.30 0.90 3.01
3 3.10 0.71 1.40
And a dataframe of pvalues as such that matches rows and columns identically:
> df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
A B C
1 0.02 NA 0.01
2 0.01 0.01 0.01
3 0.80 0.06 0.03
What I want is to modify df1 so that it retains only the values whose corresponding p-value in df2 is < .05, and replaces the others with NA. Note there are also NAs in df2.
> desired <- data.frame(A=c(1.74,-1.3,NA), B=c(NA,.9,NA), C=c(1.1,3.01,1.4))
> desired
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
I first tried to use vector syntax on these dataframes and that didn't work. Then I tried a for loop over the columns and that also failed.
I don't think I understand how to index each i,j position and then replace values in df1 with NA based on a logical test on df2.
Or perhaps there is a better way to do this in R.
You can try this:
df1[!(df2 < 0.05) | is.na(df2)] <- NA
Out:
> df1
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40
ifelse and as.matrix seem to do the trick.
df1 <- data.frame(A=c(1.74,-1.3,3.1), B=c(1.5,.9,.71), C=c(1.1,3.01,1.4))
df2 <- data.frame(A=c(.02,.01,.8), B=c(NA,.01,.06), C=c(.01,.01,.03))
x1 <- as.matrix(df1)
x2 <- as.matrix(df2)
as.data.frame( ifelse( x2 >= 0.05 | is.na(x2), NA, x1) )
Result
A B C
1 1.74 NA 1.10
2 -1.30 0.9 3.01
3 NA NA 1.40

Create factor labels in a DF using a sequence of numbers

I have a data.frame containing numerics. I want to create a new column within that data.frame that will house factor labels using (letters[]). I want these factor labels to be built from a sequence of numbers that I have, and can change every time.
For example, my original DF has 1 column x containing numerics, I then have a sequence of numbers (3,7,9). So I need the new FLABEL column to populate according to the number sequence, i.e. first 3 lines are a, next 4 lines b and so on.
x FLABEL
0.23 a
0.21 a
0.19 a
0.27 b
0.25 b
0.22 b
0.15 b
0.09 c
0.32 c
0.19 d
0.17 d
I'm struggling with how to do this. I'm assuming some form of for loop is needed, given that my number sequence can vary in length every time I run it, so I could be populating letters a and b, or many more.
Based on the comment by @scoa, I suggest the following modified approach:
series <- c(3, 7, 9)
series <- c(series, nrow(DF)) # This ensures that the sequence extends to the last row of DF
series2 <- c(series[1], diff(series))
DF$FLABEL <- rep(letters[seq_along(series2)], series2)
#> DF
# x FLABEL
#1 0.23 a
#2 0.21 a
#3 0.19 a
#4 0.27 b
#5 0.25 b
#6 0.22 b
#7 0.15 b
#8 0.09 c
#9 0.32 c
#10 0.19 d
#11 0.17 d
By using diff() the length of each sequence is calculated based on the index numbers in the input vector series. In this case, the index values 3, 7, 9 are converted into the number of repetitions of subsequent letters up to the last row of the data frame and stored in series2: 3, 4, 2, 2.
data
text <- "x FLABEL
0.23 x
0.21 x
0.19 x
0.27 x
0.25 x
0.22 x
0.15 x
0.09 x
0.32 x
0.19 x
0.17 x"
DF <- read.table(text = text, header = TRUE)
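As an alternative to converting the break points into run lengths with diff(), the group of each row can be looked up directly with findInterval(). This is only a sketch of the same idea, rebuilding the example data inline; the variable name group is illustrative.

```r
DF <- data.frame(x = c(0.23, 0.21, 0.19, 0.27, 0.25, 0.22,
                       0.15, 0.09, 0.32, 0.19, 0.17))
series <- c(3, 7, 9)

# For each row index, count the break points strictly below it.
# left.open = TRUE puts row 3 in the first group, row 7 in the second, and so on.
group <- findInterval(seq_len(nrow(DF)), series, left.open = TRUE) + 1

DF$FLABEL <- letters[group]
DF
```

This does not need the last row appended to series, and wrapping the result in factor() gives a factor column if that is required.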

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but it is empty, so nothing is displayed.
The data has a lot of rows like this. How do I delete a row when z is empty?
I mean deleting rows such as the second row.
The output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I gather that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably considering only the first two columns), and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
Those are the rows that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
Some example data:
df = data.frame(x = runif(100),
                y = runif(100),
                z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a vector of logicals (TRUE, FALSE):
df[df$z != "",]
And as @AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
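If the blank entries may contain any amount of whitespace, listing each variant in %in% gets tedious. A small sketch that trims whitespace first and keeps only non-empty strings; the example data and the name keep are made up for illustration:

```r
df <- data.frame(x = c(1, 1, 2),
                 y = c(2, 2, 3),
                 z = c("a", "", "  "))

# as.character() covers the case where z is a factor;
# trimws() strips surrounding whitespace, nzchar() is FALSE for ""
keep <- nzchar(trimws(as.character(df$z)))

res <- df[keep, ]
res
```

Note that NA values in z are not treated as empty by this check, so they would need a separate !is.na(df$z) condition.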
