interaction, number of groups - r

From a vector a, I'm looking for a function (quick to compute) that returns a vector of numbers ranging from 1 to the number of levels in a, indicating which values of a are equal.
I know how to do this with a for loop, but it is a bit slow to run.
a <- c(11,14,11,22,14,22)
nlevels(as.factor(a)) == 3
Solution
b <- c(1,2,1,3,2,3)
meaning that in positions 1 and 3 (where b equals 1) the values in a are equal,
in positions 2 and 5 (where b equals 2) the values in a are equal,
etc...
Thank you

You can use as.numeric() on a factor to get this:
a <- c(11,14,11,22,14,22)
as.numeric(factor(a))
# [1] 1 2 1 3 2 3
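Note that factor() numbers the levels in sorted order. If you want the numbers to follow the order of first appearance instead, you can set the levels explicitly (shown here on a vector where the two orders actually differ):
a2 <- c(99, 99, 22, 22, 44, 22, 99)
as.integer(factor(a2))                      # sorted level order
# [1] 3 3 1 1 2 1 3
as.integer(factor(a2, levels = unique(a2))) # order of first appearance
# [1] 1 1 2 2 3 2 1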

Here is one function that's quickly made:
numberfun <- function(x) {
  y <- unique(x)
  match(x, y)
}
a <- c(11,14,11,22,14,22)
numberfun(a)
#[1] 1 2 1 3 2 3
a <- c(99,99,22,22,44,22,99)
numberfun(a)
#[1] 1 1 2 2 3 2 1

Related

Sort vector into repeating sequence when sequential values are missing R

I would like to take a vector such as this:
x <- c(1,1,1,2,2,2,2,3,3)
and sort this vector into a repeating sequence maintaining the hierarchical order of 1, 2, 3 when values are absent.
return: c(1,2,3,1,2,3,1,2,2)
We can create the order based on the sequence of 'x'
x[order(ave(x, x, FUN = seq_along))]
#[1] 1 2 3 1 2 3 1 2 2
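To see why this works: ave(x, x, FUN = seq_along) computes a within-group running index, and ordering by that index picks one element of each group at a time:
ave(x, x, FUN = seq_along)
#[1] 1 2 3 1 2 3 4 1 2
order(ave(x, x, FUN = seq_along))
#[1] 1 4 8 2 5 9 3 6 7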
Or with rowid from data.table
library(data.table)
x[order(rowid(x))]
#[1] 1 2 3 1 2 3 1 2 2

Assigning vector elements a value associated with preceding matching value [duplicate]

I have a vector of alternating TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
Therefore, my desired output using the example dat above (which has 4 TRUE values) is:
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following (which works), but I know there must be a simpler solution!!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
             FUN = function(x)
               which(whichF > whichT[x] & whichF < whichT[(x+1)])
)
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
  lapply(1:length(whichT),
         function(x) l1[[x]] <- rep(x, length(l1[[x]]))
  )
)
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution because my real data set is 181k rows long!
Base R solutions are preferred, but any solution works.
cumsum(dat) will do what you want. When used in arithmetic, TRUE is converted to 1 and FALSE to 0, so taking the cumulative sum adds 1 every time a TRUE appears and adds nothing for a FALSE, which gives exactly the numbering you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
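The TRUE/FALSE to 1/0 coercion this relies on can be seen directly:
as.integer(c(TRUE, FALSE, FALSE, TRUE))
# [1] 1 0 0 1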
Instead of doing the indexing, this can be done easily with cumsum from base R. Here, TRUE/FALSE gets coerced to 1/0, and when we take the cumulative sum, the running count increases by 1 wherever there is a TRUE.
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way; however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4

How to compare and select the minimum of two features in R?

Assume I have the following dataset:
dt<-data.frame(X=sample(5),Y=sample(5))
Now, I need to compare these two features and select the smaller one.
X Y
1 4 3
2 5 2
3 2 4
4 3 5
5 1 1
Then the expected answer would be
3
2
2
3
1
I know
min(dt[1,])
could be helpful, but it only gives me a single value.
Use pmin, which is the vectorized version of min:
pmin(dt$X,dt$Y)
Like so:
> dt<-data.frame(X=sample(5),Y=sample(5))
> dt
X Y
1 3 2
2 4 3
3 1 5
4 2 4
5 5 1
> pmin(dt$X,dt$Y)
[1] 2 3 1 2 1
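If you need the row-wise minimum over more than two columns, one way to generalize (a small sketch beyond what the question asks for) is to pass every column of the data frame to pmin at once:
> do.call(pmin, dt)   # a data.frame is a list of columns, so each becomes an argument
[1] 2 3 1 2 1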
low <- apply(dt[, c("X","Y")], 1, min)
is another implementation (use max instead of min for the row-wise maximum).
An integer(0) (length-0) result happens when one of X or Y has length 0:
For min or max, a length-one vector. For pmin or pmax, a vector of length the longest of the input vectors, or length zero if one of the inputs had zero length.
(from documentation)
max(which(1:3 == 5),10) works but pmax(which(1:3 == 5),10) gives integer(0)
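Spelled out as runnable code:
v <- which(1:3 == 5)   # integer(0): no element of 1:3 equals 5
max(v, 10)
# [1] 10
pmax(v, 10)
# a zero-length result, because one of the inputs has zero length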

Replace some component value in a vector with some other value

In R, in a vector (i.e. a one-dimensional array), I would like to change components with value 3 to value 1, and components with value 4 to value 2. How should I do that? Thanks!
The idiomatic R way is to use [<-, in the form
x[index] <- result
If you are dealing with integers, factors, or character variables, then == will work reliably for the indexing:
x <- rep(1:5,3)
x[x==3] <- 1
x[x==4] <- 2
x
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
The car package has a useful function recode (which is a wrapper for [<-) that lets you combine all the recoding in a single call,
e.g.
library(car)
x <- rep(1:5,3)
xr <- recode(x, '3=1; 4=2')
x
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
xr
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
Thanks to @joran for mentioning mapvalues from the plyr package, another wrapper for [<-
x <- rep(1:5,3)
mapvalues(x, from = c(3, 4), to = c(1, 2))
plyr::revalue is a wrapper around mapvalues specifically for factor or character variables.
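If you want to stay in base R without the car or plyr wrappers, a small sketch using match() (the from/to names are just illustrative) achieves the same multi-value recode:
x <- rep(1:5, 3)
from <- c(3, 4)
to <- c(1, 2)
idx <- match(x, from)                    # which recode rule applies to each element (NA = none)
x[!is.na(idx)] <- to[idx[!is.na(idx)]]   # overwrite only the matched elements
x
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5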

Create a group number for each consecutive sequence

I have the data.frame below. I want to add a column 'g' that classifies my data according to consecutive sequences in column h_no. That is, the first sequence of h_no 1, 2, 3, 4 is group 1, the second series of h_no (1 to 7) is group 2, and so on, as indicated in the last column 'g'.
h_no h_freq h_freqsq g
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3
You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.
Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.
my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector
The data.frame method for $ treats x as a list
my.dataframe$new.col <- a.vector
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix
my.dataframe[ , "new.col"] <- a.vector
If you don't specify whether you're working with columns or rows, the data.frame method assumes you mean columns.
For your example, this should work:
# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))
# find where 1 appears, i.e. where each new sequence starts
from <- which(your.df$no == 1)
to <- c((from-1)[-1], nrow(your.df)) # up to which point the sequence runs
# generate a sequence (len) and based on its length, repeat a consecutive number len times
get.seq <- mapply(from, to, 1:length(from), FUN = function(x, y, z) {
  len <- length(seq(from = x[1], to = y[1]))
  return(rep(z, times = len))
})
# when we unlist, we get a vector
your.df$group <- unlist(get.seq)
# and append it to your original data.frame. since this is
# designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)
no h_freq h_freqsq group
1 1 0.40998238 0.06463876 1
2 2 0.98086928 0.33093795 1
3 3 0.28908651 0.74077119 1
4 4 0.10476768 0.56784786 1
5 1 0.75478995 0.60479945 2
6 2 0.26974011 0.95231761 2
7 3 0.53676266 0.74370154 2
8 4 0.99784066 0.37499294 2
9 5 0.89771767 0.83467805 2
10 6 0.05363139 0.32066178 2
11 7 0.71741529 0.84572717 2
12 1 0.10654430 0.32917711 3
13 2 0.41971959 0.87155514 3
14 3 0.32432646 0.65789294 3
15 4 0.77896780 0.27599187 3
16 5 0.06100008 0.55399326 3
Easily: suppose your data frame is A.
b <- A[,1]
b <- b==1
b <- cumsum(b)
Then b contains the group column you want.
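Spelled out on the h_no values from the question (rebuilding just that column), this approach is essentially one line:
h_no <- c(1:4, 1:7, 1:5)
cumsum(h_no == 1)
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3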
If I understand the question correctly, you want to detect when h_no doesn't increase and then increment the class. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)
Working
We only care about the h_no column for the moment, so we can extract that from the data frame:
> h_no <- data$h_no
We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the diff function which gives us the vector of differences:
> d.h_no <- diff(h_no)
> d.h_no
[1] 1 1 1 -3 1 1 1 1 1 1 -6 1 1 1
Once we have that, it is a simple matter to find the ones that are non-positive:
> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE
In R, TRUE and FALSE are basically the same as 1 and 0, so if we get the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.
> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
But, there are two problems: the numbers are one too small; and, we are missing the first element (there should be four in the first class).
The first problem is simply solved: 1+cumsum(nonpos). And the second just requires adding a 1 to the front of the vector, since the first element is always in class 1:
> classes <- c(1, 1 + cumsum(nonpos))
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
Now, we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the class heading):
> data_w_classes <- cbind(data, class=classes)
And data_w_classes now contains the result.
Final result
We can compress the lines together and wrap it all up into a function to make it easier to use:
classify <- function(data) {
cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}
Or, since it makes sense for the class to be a factor:
classify <- function(data) {
cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}
You can use either function like this:
> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
(This method of solving the problem is good because it avoids explicit iteration, which is generally recommended in R, and avoids generating lots of intermediate vectors and lists. Also, it's kinda neat how it can be written on one line :) )
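A quick check of classify() against the h_no values from the question (rebuilding just that column, since the frequencies don't affect the grouping; this uses the first, non-factor definition):
> data <- data.frame(h_no = c(1:4, 1:7, 1:5))
> classify(data)$class
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3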
In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.
# Note that I use a global variable here;
# normally not advisable, but I liked the
# use here to make the code shorter
index <- 0
new_column <- sapply(df$h_no, function(x) {
  if (x == 1) index <<- index + 1   # <<- updates the global counter, not a local copy
  return(index)
})
The function iterates over the values in h_no and returns the category that the current value belongs to. Whenever a value of 1 is detected, we increase the global variable index and continue.
An approach based on identifying the number of groups (x in mapply) and the length of each group (y in mapply):
mytb<-read.table(text="h_no h_freq h_freqsq group
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group <- NULL
positionsof1s <- grep(1, mytb$h_no)
mytb$newgroup <- unlist(mapply(
  function(x, y) rep(x, y),              # repeat group id x, y times
  x = 1:length(positionsof1s),           # group ids: 1 to the number of groups (g1:g3)
  y = c(diff(positionsof1s),             # lengths of all groups except the last (4, 7)
        nrow(mytb) - (positionsof1s[length(positionsof1s)] - 1))  # length of the last group
))
mytb
I believe that using "cbind" is the simplest way to add a column to a data frame in R. Below is an example:
myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
newCol= seq(2,20,2)
myDf = cbind(myDf,newCol)
The data.table function rleid is handy for things like this. We subtract the sequence 1:nrow(data) to transform consecutive sequences to constants, and then use rleid to create the group IDs:
data$g = data.table::rleid(data$h_no - 1:nrow(data))
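If you'd rather avoid the data.table dependency, a rough base R translation of the same idea (my own sketch, not part of the original answer) is:
v <- data$h_no - seq_along(data$h_no)   # constant within each consecutive run
data$g <- cumsum(c(TRUE, diff(v) != 0)) # start a new group id whenever the constant changes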
Data.frame[,'h_new_column'] <- as.integer(cut(Data.frame[,'h_no'], breaks=c(1, 4, 7), include.lowest=TRUE))
