I would like to convert a character sequence into a numeric sequence.
My variable is called labCancer and is made like this:
labCancer
[1] M M M M M M M M M M M M M M M M M M M B B B M M M M M M M M M M M M M M M B
I would like to have:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0
I tried using
labCancer_2 <- labCancer
for (i in 1:569) {
if (labCancer[i] == "M") {
labCancer_2[i] <- 1
} else {
labCancer_2[i] <- 2
} }
but it doesn't work.
Andrea
The only reason I can think of that would cause that loop to not work is failure to initialize labCancer_2. So you would want to do this prior to starting your loop:
labCancer_2 <- numeric(length(labCancer))
If you want to assign to an object element by element in a loop, you need to initialize that object first, or it needs to otherwise exist in some manner.
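One more thing worth checking, as a guess based on how labCancer prints (M and B without quotes): it may be a factor. In that case a plain copy is a factor too, and assigning 1 into it produces NA with an "invalid factor level" warning. A minimal sketch on made-up data (lab, lab_copy and out are hypothetical names):

```r
# Hypothetical stand-in for labCancer, assumed here to be a factor
lab <- factor(c("M", "M", "B", "M"))

# Copying a factor gives another factor; 1 is not one of its levels,
# so the assignment yields NA (R warns about an invalid factor level)
lab_copy <- lab
suppressWarnings(lab_copy[1] <- 1)
is.na(lab_copy[1])

# Initializing a numeric vector first avoids the problem
out <- numeric(length(lab))
for (i in seq_along(lab)) {
  out[i] <- if (lab[i] == "M") 1 else 0
}
out
```

Using seq_along(lab) instead of a hard-coded 1:569 also keeps the loop correct if the length of the data changes.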
However, there is a better way that needs no initialization and is the approach many would consider idiomatic in R:
labCancer_2 <- ifelse(labCancer == "M", 1, 0)
This takes advantage of R's vectorization.
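To illustrate the vectorized approach on a small made-up vector (the data here are an assumption, not the real labCancer), the comparison is applied to every element at once, and the logical result can also be converted directly:

```r
# Small hypothetical stand-in for labCancer
labCancer <- factor(c("M", "M", "B", "M", "B"))

# ifelse() evaluates the test for the whole vector in one call
v1 <- ifelse(labCancer == "M", 1, 0)

# Equivalent shortcut: TRUE/FALSE convert to 1/0
v2 <- as.integer(labCancer == "M")

v1
v2
```

Both versions work the same way whether labCancer is a character vector or a factor.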
Depending on what you are using the data for, as long as you only have two values, you can do this:
labCancer_2 <- ifelse(labCancer == "M", 1, 0)
If you have multiple values or you want to keep the letters around for reference or graphing, you can make the vector a factor:
labCancer_2 <- factor(labCancer, levels = c("B", "M"))
However, a factor's integer codes start at 1, so your vector would be
2 2 2 2 ... 1 1 1 ...
rather than
1 1 1 1 ... 0 0 0...
One solution would be to convert your vector to a factor, and then to an integer. Each unique value of your original vector then gets its own integer:
> x <- c("m", "b", "m", "b")
> x
[1] "m" "b" "m" "b"
> as.factor(x)
[1] m b m b
Levels: b m
> as.integer(as.factor(x))
[1] 2 1 2 1
> c(0, 1)[as.numeric(as.factor(x))]
[1] 1 0 1 0
Using the trick in the last line one can easily change the numbers to match 0 and 1.
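A variation on the same lookup idea, sketched under the assumption that you know the labels in advance: index a named vector by the labels themselves, so the mapping does not depend on the alphabetical order of the factor levels. The name codes is made up for this example.

```r
x <- c("m", "b", "m", "b")

# Named lookup table: label -> number (hypothetical mapping)
codes <- c(b = 0, m = 1)

# as.character() makes this safe if x happens to be a factor
res <- unname(codes[as.character(x)])
res
```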
So I have this vector:
a = sample(0:3, size=30, replace = T)
[1] 0 1 3 3 0 1 1 1 3 3 2 1 1 3 0 2 1 1 2 0 1 1 3 2 2 3 0 1 3 2
What I want to have is a list of vectors with all the elements that are separated by n 0s. So in this case, with n = 0 (there can't be any 0 between the consecutive values), this would give:
res = c([1,3,3], [1,1,1,3,3,2,1,1,3], [2,1,1,2]....)
However, I would like to control the n parameter flexibly, so that if I set it to 2, for example, something like this:
b = c(1,2,0,3,0,0,4)
would still result in a result like this
res = c([1,2,3],[4])
I tried a lot of approaches with while loops in for-loops while trying to count the number of 0s. But I just could not achieve it.
Update
I tried to post the question in a more real-world setting here:
Flexibly calculate column based on consecutive counts in another column in R
Thank you all for the help. I just don't seem to manage to put your help into practice with my limited knowledge.
Here is a base R option using rle + split for general cases, i.e., the values in b are not limited to 0 to 3.
with(
rle(with(rle(b == 0), rep(values & lengths == n, lengths))),
Map(
function(x) x[x != 0],
unname(split(b, cut(seq_along(b), c(0, cumsum(lengths))))[!values])
)
)
which gives (assuming n=2)
[[1]]
[1] 1 2 3
[[2]]
[1] 4
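For readability, the same rle idea can be wrapped in a small function. This is only a sketch under one assumption that differs from the code above: a run of zeros counts as a separator when it is at least n long (>= n), not exactly n. The function name split_on_zero_runs is made up.

```r
split_on_zero_runs <- function(x, n) {
  runs <- rle(x == 0)
  # A run is a separator if it consists of zeros and is at least n long
  is_sep <- runs$values & runs$lengths >= n
  # Expand run-level information back to element level
  sep_elem <- rep(is_sep, runs$lengths)
  group <- rep(cumsum(is_sep), runs$lengths)
  # Drop separator elements and any remaining (shorter) zero runs
  keep <- !sep_elem & x != 0
  unname(split(x[keep], group[keep]))
}

b <- c(1, 2, 0, 3, 0, 0, 4)
res2 <- split_on_zero_runs(b, 2)
res1 <- split_on_zero_runs(b, 1)
res2
```

With n = 2 the single zero is ignored and only the double zero splits the vector; with n = 1 every zero splits it.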
If your values are within the range 0 to 9, you can try the code below
lapply(
unlist(strsplit(paste0(b, collapse = ""), strrep(0, n))),
function(x) {
as.numeric(
unlist(strsplit(gsub("0", "", x), ""))
)
}
)
which also gives
[[1]]
[1] 1 2 3
[[2]]
[1] 4
I also wanted to post a possibly useful solution using the function SplitAt from DescTools (the %>% pipe comes from magrittr):
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)])
where a is your initial vector. It gives you a list where every entry contains the numbers between zeros.
If you then add another SplitAt with empty chars, you can create sublist after sublist and split it into as many sublists as you want, e.g.:
n <- 4
SplitAt(a, which(a==0)) %>% lapply(., function(x) x[which(x != 0)]) %>% SplitAt(., n)
gives you:
set.seed(1)
a <- sample(0:3, size=30, replace = T)
a
[1] 0 3 2 0 1 0 2 2 1 1 2 2 0 0 0 1 1 1 1 2 0 2 0 0 0 0 1 0 0 1
a2 <- paste(a, collapse = "") # Collapses into a single string, making it easier to handle patterns.
a3 <- unlist(strsplit(a2, "0")) # Change to whatever pattern you want, like "00".
a3 <- a3[a3 != ""] # Remove empty elements
a3 <- as.numeric(a3) # Turn back to numeric
a3
[1] 32 1 221122 11112 2 1 1
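Note that the result above fuses each group into one number (32 rather than 3 and 2). If separate elements are wanted, one possible tweak is to split each piece into single characters before converting. This sketch only works for single-digit values (0 to 9), and the names parts and res are made up:

```r
a <- c(0, 3, 2, 0, 1, 0, 0, 2, 2)

# Collapse to one string and split on the zero pattern
parts <- unlist(strsplit(paste(a, collapse = ""), "0"))
parts <- parts[parts != ""]   # drop empty pieces

# Split every piece into single digits, then convert back to numeric
res <- lapply(strsplit(parts, ""), as.numeric)
res
```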
I have a list of two series that start out the same length. After executing the following code, the second series has one fewer element than the first. Is there a general way of removing the final element of only the series containing n+1 elements, so that all the series in my list have n elements? What if my list contains a combination of series with n, n+1 and n+2 elements? Below is a minimal reproducible example.
#test
library('urca')
tseries <- list("t1" = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1), "t2" = c(1,2,3,4,5,6,7,8,9,10,9,8,7,8,9));
# apply stationarity test to the list of series
adf <- lapply(tseries, function(x) tseries::adf.test(x)$p.value)
adf
# index only series that need differencing
not_stationary <- tseries[which(adf > 0.05)]
stationary <- tseries[which(adf < 0.05)]
not_stationary <- lapply(not_stationary, diff);
# verify
adf <- lapply(not_stationary, function(x) tseries::adf.test(x)$p.value)
adf
now_stationary <- not_stationary
#combine stationary and now_stationary
tseries_diff <- c(stationary, now_stationary)
tseries_diff
#$t1
#[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
#$t2
#[1] 1 1 1 1 1 1 1 1 1 -1 -1 -1 1 1
So to summarise, I would like to remove the final element, 1, from t1, but using code that can be applied to a list of series of lengths n and n+1 (and n+2 would be useful).
Thanks!
You can find the minimum length and simply get the series up to that point, i.e.
new_series_list <- lapply(tseries_diff, function(i)i[seq(min(lengths(tseries_diff)))])
so the lengths are now the same
lengths(new_series_list)
#t1 t2
#14 14
This will work for series of any length. It will trim the longer series to match the shortest one.
Edited for a list instead of a vector:
If you are dealing with a list, you want to make all of the series the length of the shortest:
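The same trimming can also be written with head, and, if it matters for your case, with tail. Since diff(x)[i] is x[i + 1] - x[i], a differenced series lines up with observations 2..n of the original, so dropping from the start may be the better alignment. A sketch on made-up data (series, n_min and the trimmed_* names are hypothetical):

```r
series <- list(t1 = c(10, 11, 12, 13, 14),
               t2 = c(1, 2, 3, 4),
               t3 = c(5, 6, 7))

n_min <- min(lengths(series))

# Keep the first n_min elements (drops extras from the end)
trimmed_end <- lapply(series, head, n = n_min)

# Keep the last n_min elements (drops extras from the start)
trimmed_start <- lapply(series, tail, n = n_min)

lengths(trimmed_end)
```

Either version handles any mix of n, n+1 and n+2 lengths, because everything is cut to the shortest series.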
(I modify the example to avoid using a library)
#test
mylist <- c(1,1,1,1,1)
mylongerlist <- c(1,1,1,1,1,1,1)
length(mylist)
# [1] 5
length(mylongerlist)
# [1] 7
#combine
tseries_diff <- list("t1" = mylist, "t2" = mylongerlist)
tseries_diff
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1 1 1
# on the fly approach to truncate
lapply(tseries_diff, function(x) { length(x) <- min(lengths(tseries_diff)); x })
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
And a function
# As a reusable function for clear code
reduceToShortestLength <- function(toCut) {
# takes a list and cuts the tail off of any series longer than the shortest
lapply(toCut, function(x) { length(x) <- min(lengths(toCut)); x })
}
reduceToShortestLength(tseries_diff)
# $t1
# [1] 1 1 1 1 1
#
# $t2
# [1] 1 1 1 1 1
Original below (in case anyone thinks vector like I did at first)
I think you are asking how to truncate a vector to the shortest length. The head function does this well in base R.
The on-the-fly approach:
> mylist <- c(1,1,1,1,1)
> mylongerlist <- c(1,1,1,1,1,1,1)
> length(mylist)
[1] 5
> length(mylongerlist)
[1] 7
> x <- head(mylongerlist, length(mylist))
> length(x)
[1] 5
A function can be written like so:
> reduceToShorterLength<- function(toshorten, template) { head(toshorten, length(template))}
> x <- reduceToShorterLength(mylongerlist, mylist)
> length(x)
[1] 5
I created two nested for loops to complete the following:
Iterating through each column that is not the first column:
Iterate through each row i that is NOT the last row (the last row is denoted j)
Compare the value in i to the value in j.
If i is NA, i = NA.
If i >= j, i = 0.
If i < j, i = 1.
Store the results of all iterations across all columns and rows in a df.
The code below creates some test data but produces a Value "out" that is NULL (empty). Any recommendations?
# Create df
a <- rnorm(5)
b <- c(rnorm(3),NA,rnorm(1))
c <- rnorm(5)
df <- data.frame(a,b,c)
rows <- nrow(df) # rows
cols <- ncol(df) # cols
out <- for (c in 2:cols){
for (r in 1:(rows - 1)){
ifelse(
is.na(df[r,c]),
NA,
df[r, c] <- df[r, c] < df[rows, c])
}
}
There's no need for looping at all. Use a vectorised function like sweep to compare, via >, every row except the last (df[-nrow(df),]) against the last row (df[nrow(df),]):
df
# a b c
#1 -0.2739735 0.5095727 0.30664838
#2 0.7613023 -0.1509454 -0.08818313
#3 -0.4781940 1.5760307 0.46769601
#4 1.1754130 NA 0.33394212
#5 0.5448537 1.0493805 -0.10528847
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`)
# a b c
#1 FALSE FALSE TRUE
#2 TRUE FALSE TRUE
#3 FALSE TRUE TRUE
#4 TRUE NA TRUE
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`) + 0
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
Here is another option. We can replicate the last row to make the dimensions of both datasets equal and then do the > to get a logical index, which can be coerced to binary by wrapping with +.
+(df[-nrow(df),] > df[nrow(df),][col(df[-nrow(df),])])
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
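Note that the question asks for 1 where the value is smaller than the last row (i < j) and 0 otherwise, which is the opposite of the > comparison used above. A sketch of that coding on made-up data, converting to a matrix first (an assumption that all columns are numeric):

```r
df <- data.frame(a = c(1, 5, 3), b = c(2, NA, 4), c = c(9, 1, 5))
m <- as.matrix(df)

last <- m[nrow(m), ]                      # last row, one value per column
rest <- m[-nrow(m), , drop = FALSE]       # all other rows

# 1 where the value is below the last row, 0 otherwise, NA stays NA
out <- sweep(rest, 2, last, FUN = "<") + 0
out
```

To skip the first column, as described in the question, the same call can be applied to m[, -1, drop = FALSE].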
I have a dataframe containing (surprise) data. I have one column which I wish to populate on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need 'apply', or one of its close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function (ar1, ar2, ar3){
if(ar1==1 && ar2==1 && ar3==1){
return(1)
} else if(ar1==0 && ar2==0 && ar3==0){
return(0)
}
return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply would be along the dataframe itself (i.e. apply(df, 1, someFunc)); it's more idiomatic to access columns by name using the $ operator. So if I have a dataframe named df with a column named a, access a with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
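An alternative that avoids the explicit index, offered as a sketch: mapply walks several vectors in parallel and passes one element of each to the function, so the original three-argument getCode can be used unchanged.

```r
getCode <- function(ar1, ar2, ar3) {
  if (ar1 == 1 && ar2 == 1 && ar3 == 1) {
    return(1)
  } else if (ar1 == 0 && ar2 == 0 && ar3 == 0) {
    return(0)
  }
  return(2)
}

df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))

# getCode is called once per row with the matching elements of a, b and c
df$x <- mapply(getCode, df$a, df$b, df$c)
df
```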
As @devmacrile mentioned above, I would just modify the function to accept a vector with 3 elements as input and use it within an apply command, as you mentioned.
#Example function
getCode <- function (x){
ifelse(x[1]==1 & x[2]==1 & x[3]==1,
1,
ifelse(x[1]==0 & x[2]==0 & x[3]==0,
0,
2)) }
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0
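For this particular rule (all ones, all zeros, or anything else), a fully vectorized version without apply is also possible. This is just a sketch; all_one and all_zero are made-up names:

```r
df <- data.frame(a = c(1, 1, 0), b = c(1, 0, 0), c = c(1, 1, 0))
vals <- as.matrix(df[, c("a", "b", "c")])

# One logical value per row: are all three columns 1 (or all 0)?
all_one  <- rowSums(vals == 1) == ncol(vals)
all_zero <- rowSums(vals == 0) == ncol(vals)

df$x <- ifelse(all_one, 1, ifelse(all_zero, 0, 2))
df
```

On large data frames this tends to be faster than calling a function once per row.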
I have two distance matrices, but either of them can have items missing, and they can be out of order. For example:
matrix #1 (missing item c)
a b d
a 0 2 3
b 2 0 4
d 3 4 0
matrix #2 (missing item b, and items out of order)
d c a
d 0 1 2
c 1 0 1
a 2 1 0
I want to find the difference between the matrices, while assuming that any missing items are 0. So, my resulting matrix should be:
a b c d
a 0 2 1 1
b 2 0 0 4
c 1 0 0 1
d 1 4 1 0
What's the best way to go about this? Should I be sorting both matrices and then filling in missing columns/rows so that I can then just abs(m1-m2), or is there a way to use row/column headings to have them automatically "match up" when subtracting?
These matrices are 5000x5000 or so, and I'll have about 1000 to do pairwise comparisons on, so I'd rather take a hit on preprocessing the data if that will make each computation significantly faster.
Any hints or suggestions are welcome. I'm usually a non-R programmer, so the iterative solution that I would normally come up with would take forever. I'm hoping for the "R way" of doing things that will be significantly faster.
We create a names index ('Un1') which is the union of names of the first ('m1') and second ('m2') matrix. Two new 0 matrices ('m1N', 'm2N') are created by specifying the dimensions and dim names based on 'Un1'. By row/column indexing, we change the 0 values in these matrices to the values in 'm1', 'm2', subtract and get the absolute.
Un1 <- sort(union(colnames(m1), colnames(m2)))
m1N <- matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1))
m2N <- m1N
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2
abs(m1N-m2N)
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
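For anyone who wants to run this end to end, here is a self-contained sketch that builds the two example matrices from the question and wraps the zero-padding step in a helper (pad_to is a made-up name, not part of any package):

```r
# The two example matrices from the question
m1 <- matrix(c(0, 2, 3,
               2, 0, 4,
               3, 4, 0), nrow = 3,
             dimnames = list(c("a", "b", "d"), c("a", "b", "d")))
m2 <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0), nrow = 3,
             dimnames = list(c("d", "c", "a"), c("d", "c", "a")))

# Embed m in a zero matrix whose rows/columns are the full item set
pad_to <- function(m, nms) {
  out <- matrix(0, nrow = length(nms), ncol = length(nms),
                dimnames = list(nms, nms))
  out[rownames(m), colnames(m)] <- m
  out
}

nms <- sort(union(rownames(m1), rownames(m2)))
res <- abs(pad_to(m1, nms) - pad_to(m2, nms))
res
```

Because the assignment is done by row and column names, the order of items in the input matrices does not matter.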
Update
If we have several matrices with object names m followed by numbers, we can place them in a list. We get the object names using ls and the values in a list with mget. Loop through the list with lapply to get the column names, use union as f in Reduce, sort to get the unique elements.
lst <- mget(ls(pattern='m\\d+')) #change the pattern accordingly
Un1 <- sort(Reduce(union, lapply(lst, colnames)))
We can create another list with matrix of 0s.
lst1 <- lapply(seq_along(lst), function(i)
matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1)))
We can change the corresponding elements of 'lst1' using the row/column index of corresponding matrices of 'lst' using Map.
lst2 <- Map(function(x,y) {x[rownames(y), colnames(y)] <- y; x}, lst1, lst)
If we need pairwise difference, combn may be an option
lst3 <- combn(seq_along(lst2),2, FUN=function(x)
list(abs(lst2[[x[1]]]-lst2[[x[2]]])))
names(lst3) <- combn(seq_along(lst2), 2, FUN=paste, collapse='_')
Another approach using match (the beginning is similar to @akrun's):
func = function(cols, m)
{
res = `dimnames<-`(m[match(cols,rownames(m)), match(cols,colnames(m))],
list(cols, cols))
ifelse(is.na(res), 0, res)
}
cols = sort(union(colnames(m1), colnames(m2)))
abs(func(cols,m1) - func(cols,m2))
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0