Matrix to data frame with row/column numbers - r

I have a 10x10 matrix in R, called run_off. I would like to convert this matrix to a data frame that contains the entries of the matrix (the order doesn't really matter, although I'd prefer it to be filled by row) as well as the row and column numbers of the entries as separate columns. For instance, element run_off[2,3] should produce a row in the data frame with three columns: the first containing the element itself, the second containing 2 and the third containing 3.
This is what I have so far:
run_off <- matrix(data = c(45630, 23350, 2924, 1798, 2007, 1204, 1298, 563, 777, 621,
53025, 26466, 2829, 1748, 732, 1424, 399, 537, 340, NA,
67318, 42333, -1854, 3178, 3045, 3281, 2909, 2613, NA, NA,
93489, 37473, 7431, 6648, 4207, 5762, 1890, NA, NA, NA,
80517, 33061, 6863, 4328, 4003, 2350, NA, NA, NA, NA,
68690, 33931, 5645, 6178, 3479, NA, NA, NA, NA, NA,
63091, 32198, 8938, 6879, NA, NA, NA, NA, NA, NA,
64430, 32491, 8414, NA, NA, NA, NA, NA, NA, NA,
68548, 35366, NA, NA, NA, NA, NA, NA, NA, NA,
76013, NA, NA, NA, NA, NA, NA, NA, NA, NA)
, nrow = 10, ncol = 10, byrow = TRUE)
df <- data.frame()
for (i in 1:nrow(run_off)) {
  for (k in 1:ncol(run_off)) {
    claim <- run_off[i, k]
    acc_year <- i
    dev_year <- k
    df[???, "claims"] <- claim       # Problem here
    df[???, "acc_year"] <- acc_year  # and here
    df[???, "dev_year"] <- dev_year  # and here
  }
}
dev_year refers to the column number of the matrix entry and acc_year to the row number. My problem is that I don't know the proper index to use for the data frame.
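For reference, the loop in the question only needs a single running row counter as the data-frame index (this version keeps the NA cells; the answer below shows a vectorised approach that drops them). A minimal sketch:
df <- data.frame(claims = numeric(0), acc_year = integer(0), dev_year = integer(0))
row <- 0
for (i in 1:nrow(run_off)) {
  for (k in 1:ncol(run_off)) {
    row <- row + 1  # one data-frame row per matrix cell, filled by row
    df[row, "claims"] <- run_off[i, k]
    df[row, "acc_year"] <- i
    df[row, "dev_year"] <- k
  }
}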

I am assuming you are not interested in the NA elements? You can use which() with the arr.ind = TRUE argument to return a two-column matrix of array indices for each non-NA value, and cbind this to the values:
# Get array indices
ind <- which( ! is.na(run_off) , arr.ind = TRUE )
# cbind indices to values
out <- cbind( run_off[ ! is.na( run_off ) ] , ind )
head( as.data.frame( out ) )
# V1 row col
#1 45630 1 1
#2 53025 2 1
#3 67318 3 1
#4 93489 4 1
#5 80517 5 1
#6 68690 6 1
Use t() on the matrix first if you want to fill by row, e.g. which( ! is.na( t( run_off ) ) , arr.ind = TRUE ) (and when you cbind it).
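Putting that together, a by-row sketch that also uses the question's column names (claims, acc_year, dev_year):
## which() walks t(run_off) column by column, i.e. the original matrix row by row
ind_t <- which(!is.na(t(run_off)), arr.ind = TRUE)
df <- data.frame(claims = t(run_off)[!is.na(t(run_off))],
                 acc_year = ind_t[, "col"],  # column of t(run_off) = row of run_off
                 dev_year = ind_t[, "row"])  # row of t(run_off) = column of run_off
head(df)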


R: Pearson correlation in a loop, prevent stopping when an error occurs and output NAs

I want to run Pearson correlations of each row of a matrix (dat) vs a vector (v1), as part of a loop, and output the correlation coefficients and associated p-values in a table. Here is an example for random data (data pasted at the end):
result_table <- data.frame(matrix(ncol = 2, nrow = nrow(dat)))
colnames(result_table) <- c("correlation_coefficient", "pvalue")
for (i in 1:nrow(dat)) {
  print(i)
  corr <- cor.test(as.numeric(dat[i, ]), v1, na.action = "na.omit")
  result_table[i, 1] <- corr$estimate
  result_table[i, 2] <- corr$p.value
}
When cor.test() removes missing data, sometimes there are not enough observations remaining and the loop stops with an error (for example at row 11). I would like the loop to continue running, just leaving the values in the result table as NAs. I think the result table should then look like this:
> result_table
correlation_coefficient pvalue
1 0.68422642 0.04206591
2 -0.15895586 0.70694013
3 -0.37005028 0.53982309
4 0.08448970 0.89255250
5 0.86860091 0.05603661
6 0.19544883 0.75274040
7 -0.94695380 0.01454887
8 -0.03817885 0.94275955
9 -0.15214122 0.77354897
10 -0.22997890 0.70978386
11 NA NA
12 NA NA
13 -0.27769887 0.59415930
14 -0.09768153 0.81800885
15 -0.20986632 0.61790214
16 -0.40474976 0.31990456
17 -0.00605937 0.98863896
18 0.02176976 0.95919460
19 -0.14755097 0.72733118
20 -0.25830856 0.50216600
I would also like the errors to keep being printed.
Here is the data:
> dput(v1)
c(-0.840396, 0.4746047, -1.101857, 0.5164767, 1.2203134, -0.9758888,
-0.3657913, -0.6272523, -0.5853803, 1.7367901)
> dput(dat)
structure(list(s1 = c(-0.52411895, 0.14709633, 0.05433954, 0.7504406,
-0.59971988, -0.59679685, -0.12571854, 0.73289705, -0.71668771,
-0.04813957, -0.67849896, -0.11947141, -0.26371884, -1.34137162,
2.60928064, -1.23397547, 0.51811222, -4.10759883, -0.70127093,
7.51914575), s2 = c(0.21446623, -0.27281487, NA, NA, NA, NA,
NA, NA, -0.62468391, NA, NA, NA, -3.84387999, 0.64010069, NA,
NA, NA, NA, NA, NA), s3 = c(0.3461212, 0.279062, NA, NA, NA,
-0.4737744, 0.6313365, -2.8472641, 1.2647846, 2.2524449, -0.7913039,
-0.752590307, -3.535815266, 1.692385187, 3.55789764, -1.694910854,
-3.624517121, -4.963855198, 2.395998161, 5.35680032), s4 = c(0.3579742,
0.3522745, -1.1720907, 0.4223402, 0.146605, -0.3175295, -1.383926807,
-0.688551166, NA, NA, NA, NA, NA, 0.703612974, 1.79890268, -2.625404608,
-3.235884921, -2.845474098, 0.058650461, 1.83900702), s5 = c(1.698104376,
NA, NA, NA, NA, NA, -1.488000007, -0.739488766, 0.276012387,
0.49344994, NA, NA, -1.417434166, -0.644962513, 0.04010434, -3.388182254,
2.900252493, -1.493417096, -2.852256003, -0.98871696), s6 = c(0.3419271,
0.2482013, -1.2230283, 0.270752, -0.6653978, -1.1357202, NA,
NA, NA, NA, NA, NA, NA, NA, -1.0288213, -1.17817328, 6.1682455,
1.02759131, -3.80372867, -2.6249692), s7 = c(0.3957243, 0.8758406,
NA, NA, NA, NA, NA, 0.60196247, -1.28631859, -0.5754757, NA,
NA, NA, NA, NA, NA, NA, NA, NA, -2.6303001), s8 = c(-0.26409595,
1.2643281, 0.05687957, -0.09459169, -0.7875279, NA, NA, NA, NA,
NA, NA, NA, 2.42442997, -0.00445559, -1.0341522, 2.47315322,
0.1190265, 5.82533417, 0.82239131, -0.8279679), s9 = c(0.237123,
-0.5004619, 0.4447322, -0.2155249, -0.2331443, 1.3438071, -0.3817672,
1.9228182, 0.305661, -0.01348, NA, NA, 3.4009042, 0.8268469,
0.2061843, -1.1228663, -0.1443778, 4.8789902, 1.3480328, 0.4258486
), s10 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
0.5211859, 0.2196643, -1.2333367, 0.1186947, 1.478086, 0.5211859,
0.2196643)), .Names = c("s1", "s2", "s3", "s4", "s5", "s6", "s7",
"s8", "s9", "s10"), class = "data.frame", row.names = c(NA, -20L
))
A solution with tryCatch could be
for (i in 1:nrow(dat)) {
  print(i)
  corr <- tryCatch(cor.test(as.numeric(dat[i, ]), v1, na.action = "na.omit"),
                   error = function(e) return(NA))
  if (length(corr) == 1) {
    result_table[i, 1] <- NA
    result_table[i, 2] <- NA
  } else {
    result_table[i, 1] <- corr$estimate
    result_table[i, 2] <- corr$p.value
  }
}
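A quick illustration of why the length(corr) == 1 check works (toy vectors, not the question's data): a successful call returns an "htest" object with several components, while the error handler returns a single NA.
corr_ok <- tryCatch(cor.test(1:10, rnorm(10)), error = function(e) NA)
corr_bad <- tryCatch(cor.test(c(1, NA, 3), c(NA, 2, NA)), error = function(e) NA)
length(corr_ok)   # > 1: a full "htest" list
length(corr_bad)  # 1: just the NA returned by the error handler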
Here is a solution with tryCatch(), replacing the for loop with:
for (i in 1:nrow(dat)) {
  tryCatch({
    print(i)
    corr <- cor.test(as.numeric(dat[i, ]), v1, na.action = "na.omit") # Correlation miRNA activity vs CNVs for that gene
    result_table[i, 1] <- corr$estimate
    result_table[i, 2] <- corr$p.value
  }, error = function(e) {cat("ERROR :", conditionMessage(e), "\n")})
}

find strings in data.frame to fill in new column

I used dplyr on my data to create a subset of data like this:
dd <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
`1` = c("eg", NA, NA, "eg", "eg", NA, NA, NA, NA, "eg", NA),
`2` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, "eg", NA),
`3` = c(NA, NA, NA, "sk", "lk", NA, NA, NA, NA, NA, NA),
`4` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`5` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA),
`6` = c(NA, NA, NA, "lk", "lk", NA, NA, NA, NA, NA, NA))
I now want to check every column except ID for certain strings. In this example I want to create one column with a 1 for every ID whose row contains "eg" and a 0 for the rest, and likewise one more column which tells me whether there is either an "sk" or an "lk" in the other columns. After that, the old columns except ID can be removed from the data.frame.
The difficult part for me is doing this with a dynamic number of columns, as my dplyr subset will return different numbers of columns depending on the specific case, but I need to check every column that is created in every case. I wanted to use unite first to put all strings together, but then I have the same problem: how can I unite all columns except the first ID one?
If this can be solved within dplyr it would be perfect but any working solution is appreciated.
The result should look like this:
result <- data.frame(ID = c(700689L, 712607L, 712946L, 735907L, 735908L, 735910L, 735911L, 735912L, 735913L, 746929L, 747540L),
with_eg = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0),
with_sk_or_lk = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0))
From your description, you want one column to check for "eg" and another column to check for both "lk" and "sk". If this is the case, then the following base R method will work.
dfNew <- cbind(id = dd[1],
               eg = pmin(rowSums(dd[-1] == "eg", na.rm = TRUE), 1),
               other = pmin(rowSums(dd[-1] == "sk" | dd[-1] == "lk", na.rm = TRUE), 1))
Here, the presence of "eg" is checked across the entire data.frame (except the ID column), which returns a logical matrix. rowSums adds the TRUE values across the rows, with na.rm = TRUE removing the NAs, and pmin takes the element-wise minimum of the rowSums output and 1, so that counts of 2 or more are replaced by 1 while the 0s are preserved.
The same logic is applied to the construction of the "other" variable, except that the presence of either "lk" or "sk" is checked in the initial logical matrix. Finally, cbind returns a three-column data.frame with the desired values.
This returns
dfNew
ID eg other
1 700689 1 0
2 712607 0 0
3 712946 0 0
4 735907 1 1
5 735908 1 1
6 735910 0 0
7 735911 0 0
8 735912 0 0
9 735913 0 0
10 746929 1 0
11 747540 0 0
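To see the intermediate steps described above, one can run (with the dd from the question):
dd[-1] == "eg"                                 # logical matrix, NA where the cell is NA
rowSums(dd[-1] == "eg", na.rm = TRUE)          # number of "eg" hits per row
pmin(rowSums(dd[-1] == "eg", na.rm = TRUE), 1) # capped at 1 to give the 0/1 indicator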
Here is an admittedly hacky dplyr/purrr solution. Given that your IDs don't seem like they'll ever equal 'eg', 'sk', or 'lk', I haven't included anything to exclude the ID column from the search.
library(dplyr)
library(purrr)
dd %>%
  split(.$ID) %>%
  map_df(~ data_frame(
    ID = .x$ID,
    eg = ifelse(any(.x == 'eg', na.rm = TRUE), 1, 0),
    other = ifelse(any(.x == 'lk' | .x == 'sk', na.rm = TRUE), 1, 0)
  ))
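On newer dplyr versions, a more direct variant might look like this sketch (assuming a dplyr version that provides if_any(), i.e. 1.0.4 or later):
library(dplyr)
dd %>%
  mutate(with_eg = as.integer(if_any(-ID, ~ .x %in% "eg")),             # TRUE if any non-ID column is "eg"
         with_sk_or_lk = as.integer(if_any(-ID, ~ .x %in% c("sk", "lk")))) %>%
  select(ID, with_eg, with_sk_or_lk)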

how to paste an array to rows which contain a certain value in a certain column in R

I would like to paste values of a certain data.frame row into other rows that have a certain value in a certain column; not a whole row, however, just a couple of its values. Exactly, it looks like this:
z <- c(NA, NA, 3,4,2,3,5)
x <- c(NA, NA, 2,5,5,3,3)
a <- c("Hank", NA, NA, NA, NA, NA, NA)
b <- c("Hank", NA, NA, NA, NA, NA, NA)
c <- c(NA, NA, NA, NA, NA, NA, NA)
d <- c("Bobby", NA, NA, NA, NA, NA, NA)
df <- as.data.frame(rbind( a, b, c, d, z, x))
Now, I would like to pass df["z", 3:7] to columns 3:7 of the rows which have V1 == "Hank", and pass df["x", 3:7] where V1 == "Bobby".
Does anybody have a hint for me? I guess it should be a function with sapply or something like that. Maybe dplyr could give a solution? Thanks for any advice!
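One possible approach (a sketch using base R indexing, not from the original thread): look up the rows whose V1 matches each name and copy columns 3:7 across.
hank <- which(df$V1 %in% "Hank")
bobby <- which(df$V1 %in% "Bobby")
df[hank, 3:7] <- df[rep("z", length(hank)), 3:7]   # copy row "z" into every "Hank" row
df[bobby, 3:7] <- df[rep("x", length(bobby)), 3:7] # copy row "x" into every "Bobby" row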

Identify data blocks

I have a vector containing either negative values or NA, and a threshold:
threshold <- -1
example <- c(NA, NA, -0.108, NA, NA, NA, NA, NA, -0.601, -0.889, -1.178, -1.089, -1.401, -1.178, -0.959, -1.085, -1.483, -0.891, -0.817, -0.095, -1.305, NA, NA, NA, NA, -0.981, -0.457, -0.003, -0.358, NA, NA)
I want to identify all the data blocks with at least one value lower than the threshold and to replace by NA all the other blocks. With my example vector, I want this result:
result <- c(NA, NA, NA, NA, NA, NA, NA, NA, -0.601, -0.889, -1.178, -1.089, -1.401, -1.178, -0.959, -1.085, -1.483, -0.891, -0.817, -0.095, -1.305, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
So the first available value (-0.108) is the first block, but it is higher than -1, so it turns into NA. The second block is kept as is because it contains at least one value lower than -1. The third block becomes NA values because none of its 4 available values was lower than the threshold.
My first idea was to identify where were the values lower than the threshold:
val <- which(example < threshold)
But then I don't know how to say "keep all the values around this position which are not NA" because it is always a different number of values...
Try
library(data.table)#v >= 1.9.5 (devel version - install from GitHub).
#library(devtools)
#install_github("Rdatatable/data.table", build_vignettes = FALSE)
as.data.table(example)[, res:=(NA | (min(example)< -1))*example, by=rleid(is.na(example))][, res]
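Here rleid(is.na(example)) gives every run of NA / non-NA values its own id, so each block becomes its own by group, and the min() condition then keeps or blanks the whole block at once. A small illustration of rleid():
library(data.table)
rleid(is.na(c(NA, NA, -0.1, NA, -2, -3)))
# [1] 1 1 2 3 4 4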
Another way, with the suggestion of OlliJ:
example <- c(NA, NA, -0.108, NA, NA, NA, NA, NA, -0.601, -0.889, -1.178, -1.089, -1.401, -1.178, -0.959, -1.085, -1.483, -0.891, -0.817, -0.095, NA, NA, NA, NA, -0.981, -0.457, -0.003, -0.358, NA, NA)
test <- !is.na(example)
len <- rle(test)$lengths
val <- rle(test)$values
## Matrix with the beginning and the end of each non-NA group
## (this assumes the vector starts with NA, as it does here; otherwise
## which(val) - 1 would contain a zero index)
ind <- matrix(NA, nrow = length(which(val)), ncol = 2)
ind[, 1] <- cumsum(len)[which(val) - 1] + 1
ind[, 2] <- cumsum(len)[val]
result <- rep(NA, length(example))
## Fill result block by block; a for loop is used here because assigning to
## result inside an apply() function would only change a local copy
for (i in seq_len(nrow(ind))) {
  block <- ind[i, 1]:ind[i, 2]
  if (any(example[block] < -1)) {
    result[block] <- example[block]
  }
}
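A more compact base R sketch of the same run-length idea (using the example and threshold objects from the question):
## one group id per run of NA / non-NA values
grp <- with(rle(is.na(example)), rep(seq_along(values), lengths))
## 1 for groups containing at least one value below the threshold, 0 otherwise
keep <- ave(example, grp, FUN = function(v) any(v < threshold, na.rm = TRUE))
result <- ifelse(keep == 1, example, NA)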

How to update values in a for-loop?

I have a for-loop that initializes 3 vectors (launch_2012, amount, and one_week_bf) and creates a data frame. Then it predicts a single week of data, inserts it into the vectors (amount and one_week_bf), and recreates the data frame; this process is looped 8 times. However, I can't seem to get the data frame to update with the new amounts. Would anyone be able to assist, please?
for (i in 1:8) {
  launch_2012 <- c(rep('bf', 5), 'launch', rep('af', 7))
  amount <- c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA)
  one_week_bf <- c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA)
  newdata <- data.frame(amount = amount, one_week_bf = one_week_bf, launch = launch_2012, week = week)
  predicted <- predict(model0a, newdata)
  amount[i+5] <- predicted[i+5]
  one_week_bf[i+6] <- predicted[i+5]
  View(newdata)
}
It's difficult to be sure since your example is not reproducible, but note that predict.lm(...) by default has na.action = na.pass, which means that any rows in newdata that have any NA values by default generate NA for the prediction. Since your first pass of newdata has NA in rows 6-13, predicted will have NA in those same elements. This means that amount and one_week_bf will have NA in those elements, which in turn generates the same newdata each time.
None of this should be in a for loop.
x <- data.frame("launch_2012" = c(rep('bf', 5), 'launch', rep('af', 7)),
                "amount" = c(7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA, NA),
                "one_week_bf" = c(NA, 7946, 6641, 5975, 5378, 5217, NA, NA, NA, NA, NA, NA, NA))
x$new_amount <- # the replacement from your predict vector
x$new_one_week_bf <- # the replacement from your predict vector
Note that I have no idea what model0a does, so I have just indicated that the new columns should be whatever vector results from your predict() call. This will add the new data as new columns.
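If the week-by-week feedback loop is really what is wanted, a minimal sketch of an alternative fix is to keep the loop but move the initialisation out of it, so the updated values survive into the next iteration (this assumes model0a and week exist as in the question, and that model0a predicts amount from the other columns):
launch_2012 <- c(rep('bf', 5), 'launch', rep('af', 7))
amount <- c(7946, 6641, 5975, 5378, 5217, rep(NA, 8))
one_week_bf <- c(NA, 7946, 6641, 5975, 5378, 5217, rep(NA, 7))
for (i in 1:8) {
  # rebuild newdata from the current vectors, which now carry last week's prediction
  newdata <- data.frame(amount = amount, one_week_bf = one_week_bf, launch = launch_2012, week = week)
  predicted <- predict(model0a, newdata)
  amount[i+5] <- predicted[i+5]
  one_week_bf[i+6] <- predicted[i+5]
}
amount  # the original five weeks followed by the eight predicted values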
