To find difference between unique levels - r

I have a column containing repeated levels. I want to find the gap (the difference between the levels), that is, a counter that restarts each time the level changes.
I have this data:
x=c(0,0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4,4)
The result for this should be:
1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6

This is not very clear code, but it gets the job done:
res = ifelse(x == 0, 0, ifelse(c(0, x[-length(x)]) != 0, 0, NA))
res[is.na(res)] = with(rle(x == 0), lengths[values])
res
# [1] 0 0 0 0 4 0 0 0 0 4 0 0 2 0
This is perhaps better:
res2 = x
res2[x != 0] = diff(c(0, which(x != 0))) - 1
res2
# [1] 0 0 0 0 4 0 0 0 0 4 0 0 2 0

Not a definitive answer, but here's an approach using rle:
x=c(0,0,0,0,1,0,0,0,0,2,0,0,3,4)
y <- rle(x)
y
# Run Length Encoding
# lengths: int [1:7] 4 1 4 1 2 1 1
# values : num [1:7] 0 1 0 2 0 3 4
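The rle output above can be taken one step further. The following is only a sketch, assuming the goal is the one implied by the outputs of the earlier answers (the number of zeros immediately preceding each non-zero value); the names run_end and zero_run are made up for illustration.

```r
x <- c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 3, 4)
y <- rle(x)

# Position in x where each run ends
run_end <- cumsum(y$lengths)

# Runs of zeros that are followed by at least one more element
zero_run <- which(y$values == 0 & run_end < length(x))

# Write the length of each zero run at the position right after it
res <- numeric(length(x))
res[run_end[zero_run] + 1] <- y$lengths[zero_run]
res
# [1] 0 0 0 0 4 0 0 0 0 4 0 0 2 0
```

A trailing run of zeros is skipped on purpose, since there is no following element to carry its count.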

We can use ave with a grouping variable built from cumsum and diff, which increases by one each time the value changes, and then number the elements within each group with seq_along:
ave(x, c(0, cumsum(diff(x) != 0)), FUN = seq_along)
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6
For the given example, as suggested by @markus, this also works:
ave(x, x, FUN = seq_along)
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6
but what if the input is
x=c(0,0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4,4,0,0)
using
ave(x, x, FUN = seq_along) #gives
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6 6 7
whereas
ave(x, c(0, cumsum(diff(x) != 0)), FUN = seq_along) #gives
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6 1 2
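To see why the two calls differ, it can help to look at the grouping variable itself. This is a small sketch on a shortened, made-up input in which the level 0 comes back at the end; run_id, by_run and by_value are illustrative names.

```r
x <- c(0, 0, 1, 1, 1, 0, 0)

# diff(x) != 0 marks every position where the value changes;
# cumsum turns those marks into a run number
run_id <- c(0, cumsum(diff(x) != 0))
run_id
# [1] 0 0 1 1 1 2 2

# Grouping by run restarts the counter when 0 reappears
by_run <- ave(x, run_id, FUN = seq_along)
by_run
# [1] 1 2 1 2 3 1 2

# Grouping by value keeps counting across both runs of 0
by_value <- ave(x, x, FUN = seq_along)
by_value
# [1] 1 2 1 2 3 3 4
```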

We can use rleid from data.table:
library(data.table)
ave(x, rleid(x), FUN = seq_along)
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6
Or convert to a data.table and group by rleid(x) (grouping by x alone would merge separate runs of the same value):
data.table(x)[, seq_len(.N), by = rleid(x)]$V1
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6
Or after using rle, extract the lengths and apply sequence
sequence(rle(x)$lengths)
#[1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6
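If data.table is not available, a run id can be sketched in base R from the same rle result. This is an approximation of what rleid provides for a single vector, not its implementation; run_id and counter are illustrative names, and the input is the longer vector with trailing zeros.

```r
x <- c(0,0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4,4,0,0)
r <- rle(x)

# Base R stand-in for a run id: one integer per run, repeated over the run
run_id <- rep(seq_along(r$lengths), r$lengths)

# Counter within each run
counter <- sequence(r$lengths)
counter
# [1] 1 2 3 4 5 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4 5 6 1 2
```

The trailing zeros form their own run (run 6), so the counter restarts at 1 for them.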

Related

Looping through specified columns in a Matrix and replacing their values by subtracting the value from 4

I am new(ish) to R and I am still unsure about loops.
I have a large matrix in R whose columns hold values from 0 to 4, and I would like to invert these values for specified columns.
For a single column I would use:
b[, "AX1"] <- 4 - b[, "AX1"]
Here b is a matrix extracted from a larger list object and AX1 is a column of that matrix.
I would then put the changed matrix back into its list with:
DF1$geno[[1]]$data <- b
How would I loop this code over about 30 of the matrix's 5000 columns, given a list of the 30 column names (AX1, AX10, AX42, ...) to be inverted?
The simplest way to do it (assuming the transformation is always x = 4 - x) is to extend your approach to a vector of column names:
# Create an example dataset
set.seed(68859457)
(
dat <- matrix(
data = sample(x = 0:4, size = 100, replace = TRUE),
nrow = 10,
dimnames = list(1:10, paste('AX', 1:10, sep = ''))
)
)
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 3 2 2 3 1 0 4
# 2 4 3 4 4 0 1 3 1 3 4
# 3 3 0 3 4 2 2 4 1 2 1
# 4 2 2 0 2 4 2 2 1 1 0
# 5 4 4 4 3 3 1 0 3 2 2
# 6 2 1 1 0 3 3 4 4 1 0
# 7 2 3 1 3 3 1 0 1 0 4
# 8 2 2 1 1 0 3 1 3 2 1
# 9 3 1 4 1 2 1 0 0 4 1
# 10 4 3 2 4 1 0 2 0 3 2
# Create a list of columns you want to modify
set.seed(68859457)
(
cols_to_invert <- sort(sample(x = colnames(dat), size = 5))
)
# [1] "AX3" "AX4" "AX5" "AX6" "AX9"
# Use the list of columns created above to modify matrix in place
dat[, cols_to_invert] <- 4 - dat[, cols_to_invert]
# See the result
dat
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 1 2 2 3 1 4 4
# 2 4 3 0 0 4 3 3 1 1 4
# 3 3 0 1 0 2 2 4 1 2 1
# 4 2 2 4 2 0 2 2 1 3 0
# 5 4 4 0 1 1 3 0 3 2 2
# 6 2 1 3 4 1 1 4 4 3 0
# 7 2 3 3 1 1 3 0 1 4 4
# 8 2 2 3 3 4 1 1 3 2 1
# 9 3 1 0 3 2 3 0 0 0 1
# 10 4 3 2 0 3 4 2 0 1 2
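It is difficult to tell without knowing the exact structure of the data, but based on your explanation and attempt, maybe this will help. Note that the result is assigned back into the selected columns only, so the remaining columns are kept:
cols <- c('AX1', 'AX10', 'AX42')
DF1$geno <- lapply(DF1$geno, function(x) {
  x$data[, cols] <- 4 - x$data[, cols]
  x
})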
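Since the real structure of DF1 is not shown, here is a self-contained sketch on a made-up stand-in: a list whose geno element holds items that each carry a matrix in data. The helper make_geno, the matrix sizes and the chosen column names are all assumptions for illustration.

```r
set.seed(1)

# Hypothetical stand-in for one element of DF1$geno
make_geno <- function() {
  list(data = matrix(sample(0:4, 20, replace = TRUE), nrow = 4,
                     dimnames = list(NULL, paste0("AX", 1:5))))
}
DF1 <- list(geno = list(make_geno(), make_geno()))

cols <- c("AX1", "AX4")
before <- DF1$geno[[1]]$data

# Invert only the named columns of every matrix, keeping all other columns
DF1$geno <- lapply(DF1$geno, function(g) {
  g$data[, cols] <- 4 - g$data[, cols]
  g
})

after <- DF1$geno[[1]]$data
```

Comparing before and after shows that only AX1 and AX4 changed and that the matrix kept its full width.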

Remove/replace values in a list using the previous number in R [duplicate]

I am trying to replace all the zeros in a vector with the previous non-zero number.
The vector looks like this:
x <- c(3,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,3,0,0,0,1,0,2,0)
I already tried replace with first(x):
replace(x, x == 0, first(x))
[1] 3 3 3 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 3 3 1 3 2 3
But that replaces every zero with the first value of the vector (3), so the 2's and 1's are ignored.
I also tried:
replace(x, x == 0, x)
[1] 3 3 0 0 0 1 0 1 0 0 0 2 0 0 2 0 3 0 0 0 1 3 2 0
You can use approx after replacing all zeros with NA:
approx(replace(x, x == 0, NA),
xout = 1:length(x), method = "constant", f = 0, rule = 2)$y
# [1] 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 1 1 2 2
Alternatively, group each non-zero value with the zeros that follow it and repeat the first value of each group:
fill = function(x){
ave(x, cumsum(x != 0), FUN = function(y) y[pmax(1, cumsum(y != 0))])
}
fill(x)
# [1] 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 1 1 2 2
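The same idea can also be written with plain indexing. This is just a sketch: idx, nonzero and filled are illustrative names, and leading zeros (which have no previous value) are assumed to stay zero.

```r
x <- c(3,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,3,0,0,0,1,0,2,0)

# For every position, how many non-zero values have been seen so far
idx <- cumsum(x != 0)

# The non-zero values in order of appearance
nonzero <- x[x != 0]

# Look up the latest non-zero value; positions before the first one are left as is
filled <- x
filled[idx > 0] <- nonzero[idx[idx > 0]]
filled
# [1] 3 3 3 3 3 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 1 1 2 2
```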

rewrite a base R function with dplyr - utilizing filter instead of []

makeparts <- function(x, n) {
x <- unique(c(0, x))
x <- x[x >= 0 & x < n]
x <- x[order(x)]
x <- rep(c(seq_along(x)), diff(c(x, n)))
x
}
makeparts(c(20, 30, 58), 100)
How would I rewrite this function using dplyr? I am pretty good in the tidyverse but never learned base R. I don't even know what that function is doing above. If I see it in tidyverse syntax I can understand the function (probably). Which is my ultimate goal.
All the tidyverse verbs make sense, but this [, x] [[df]] stuff doesn't.
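Here's a reformatted version using more tidyverse-like code (keep comes from purrr, the pipe from magrittr; x and n are set to the values from the example call):
library(magrittr)
library(purrr)
x <- c(20, 30, 58)
n <- 100
x %>%
  unique %>%
  keep(~.>=0 & .<n) %>%
  sort %>%
  c(0,.,n) %>%
  diff %>%
  list(lengths = ., values = seq_along(.)) %>%
  inverse.rle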
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
# [31] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4
# [61] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [91] 4 4 4 4 4 4 4 4 4 4
x looks like it is a vector. The first step uses unique, which is the counterpart of distinct in the tidyverse. The next line uses the [ operator, which indexes a vector or matrix. The value inside [ ] should (for all intents and purposes) evaluate to a vector of TRUE/FALSE values or of numbers. Here it plays the role of filter in the tidyverse. The next line uses order on x, which corresponds to arrange.
The last step does two things. First, it concatenates x and n, which gives c(0, 20, 30, 58, 100), and runs diff on the result: diff subtracts the first element from the second, the second from the third, and so on. This gives c(20, 10, 28, 42) because (20-0) = 20, (30-20) = 10, and so forth. In the tidyverse this could be achieved with the lag function. Second, it repeats the values from seq_along(x), which in this example are 1, 2, 3, and 4, using those differences as the repeat counts.
The rep function does not have a direct tidyverse equivalent. As was mentioned in the comments above, the function cannot be converted directly to tidyverse verbs because those operate on data frames and you have a vector. I agree that you should learn base R. You can only get so far with the tidyverse.
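The walkthrough above can be followed step by step with the intermediate values printed. This is a sketch of the same base R logic with more descriptive, made-up variable names (breaks, widths, labels, parts).

```r
x <- c(20, 30, 58)
n <- 100

# Start points of the parts: 0 plus the unique cut points, sorted and kept inside [0, n)
breaks <- sort(unique(c(0, x)))
breaks <- breaks[breaks >= 0 & breaks < n]
breaks
# [1]  0 20 30 58

# Width of each part: distance to the next start point, with n closing the last part
widths <- diff(c(breaks, n))
widths
# [1] 20 10 28 42

# One label per part, repeated as many times as the part is wide
labels <- seq_along(breaks)
parts <- rep(labels, widths)
table(parts)
```

The result has n elements in total, and table(parts) shows the counts 20, 10, 28 and 42 for labels 1 to 4.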
UPDATE:
Adding a tidyverse version of this code by request.
makeparts <- function(x, n) {
x <- unique(c(0, x))
x <- x[x >= 0 & x < n]
x <- x[order(x)]
x <- rep(c(seq_along(x)), diff(c(x, n)))
x
}
library(dplyr)
makeparts_tidyverse <- function(x, n) {
  # tibble() replaces the deprecated data_frame()
  df <- tibble(x = c(0, x)) %>%
    distinct() %>%
    filter(x >= 0 & x < n) %>%
    arrange(x) %>%
    bind_rows(tibble(x = n)) %>%
    mutate(lag_x = lag(x)) %>%
    mutate(y = x - lag_x) %>%
    filter(!is.na(y))
  rep(seq_along(df$x), df$y)
}
> makeparts(c(20, 30, 58), 100)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[21] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[41] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[81] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
> makeparts_tidyverse(c(20, 30, 58), 100)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[21] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[41] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4
[61] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[81] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
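The two steps can also be wrapped in a small helper. This is a sketch only: expand_rows is a made-up name, and it assumes the columns are called time and y as in the question.

```r
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

expand_rows <- function(df) {
  # Repeat each row df$time times
  out <- df[rep(seq_len(nrow(df)), df$time), ]
  # Count 1..n within each original row
  out$time <- sequence(df$time)
  # Keep y only on the last copy of each original row
  last_row <- cumsum(df$time)
  out$y[-last_row] <- 0
  rownames(out) <- NULL
  out
}

res <- expand_rows(foo)
head(res, 5)
```

Resetting the row names avoids the 1.1, 2.1 style labels produced by repeating rows.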
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
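And, for the bonus question (using seq_len(.N - 1L) rather than 1:(.N-1), so that an id with a single row is left untouched):
fooDT[, list(time = sequence(time)),
      by = list(id, y)][, y := {y[seq_len(.N - 1L)] <- 0; y},
                        by = id][]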
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
With dplyr (and magrittr for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
  group_by(id) %>%
  mutate(time = row_number(),
         # keep y only on the last row of each id; earlier rows become 0
         y = ifelse(time == n(), y, 0)) %>%
  ungroup()
Hope it helps

Two dimensional heatmap with R

I have an input file of this form:
0.35217720 1 201 1
0.26413283 1 209 1
1.1665874 1 210 1
...
0.30815500 2 194 1
0.15407741 2 196 1
0.15407741 2 197 1
0.33016610 2 205 1
...
where the first column is a scalar value, the second is the x coordinate on a discrete lattice, the third is the y coordinate, and the last one is a time-like discrete component.
I would like to make a two-dimensional heatmap of the scalar values at a fixed time. How can I do this? Edit: I don't know how to make image() use the second and third columns as x, y coordinates.
Example file:
7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
Reshape your data to a matrix and then use heatmap(). This worked on R version 2.10.1 (2009-12-14):
txt <- textConnection("7.62939453 1 1 1
1.3153768 1 2 1
7.5560522 1 3 1
4.5865011 1 4 1
5.3276706 1 5 1
2.1895909 2 1 1
0.47044516 2 2 1
6.7886448 2 3 1
6.7929626 2 4 1
9.3469286 2 5 1
3.8350201 3 1 1
5.1941633 3 2 1
8.3096523 3 3 1
0.34571886 3 4 1
0.53461552 3 5 1
5.2970004 4 1 1
6.7114925 4 2 1
7.69805908 4 3 1
3.8341546 4 4 1
0.66842079 4 5 1
4.1748595 5 1 1
6.8677258 5 2 1
5.8897662 5 3 1
9.3043633 5 4 1
8.4616680 5 5 1
")
df <- read.table(txt)
close(txt)
names(df) <- c("value", "x", "y", "t")
require(reshape)
dfc <- cast(df[ ,-4], x ~ y)
heatmap(as.matrix(dfc))
## Some copy/pasteable fake data for you (dput() works nicely for pasteable real data)
your_matrix <- cbind(runif(25, 0, 10), rep(1:5, each = 5), rep(1:5, 5), rep(1, 25))
heatmap_matrix <- matrix(your_matrix[, 1], nrow = 5)
## alternatively, if your_matrix isn't in order
## (The reshape method in EDi's answer is a nicer alternative)
for (i in 1:nrow(your_matrix)) {
  # rows are indexed by x (column 2), columns by y (column 3)
  heatmap_matrix[your_matrix[i, 2], your_matrix[i, 3]] <- your_matrix[i, 1]
}
heatmap(heatmap_matrix) # one option
image(z = heatmap_matrix) # another option
require(gplots)
heatmap.2(heatmap_matrix) # this has fancier preferences
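For the image() part of the question, here is a base R sketch that selects one time step and uses the x and y columns as matrix indices. It assumes the coordinates are positive integers starting at 1; the column names, fixed_t, slice and z are illustrative, and only four rows of the example file are used.

```r
df <- data.frame(
  value = c(7.62939453, 1.3153768, 2.1895909, 0.47044516),
  x     = c(1, 1, 2, 2),
  y     = c(1, 2, 1, 2),
  t     = 1
)

# Keep a single time step
fixed_t <- 1
slice <- df[df$t == fixed_t, ]

# Empty lattice; cells without data stay NA
z <- matrix(NA_real_, nrow = max(slice$x), ncol = max(slice$y))

# A two-column index matrix addresses z[x, y] for every row at once
z[cbind(slice$x, slice$y)] <- slice$value

# image() puts rows of z along the x axis and columns along the y axis
image(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z,
      xlab = "x", ylab = "y")
```

Unlike heatmap(), image() does not reorder rows and columns by clustering, so the lattice positions are preserved.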

Resources