Operations on data frame with a variable containing vectors - r

When I create a dataframe for which one variable contains a vector of integers, like
id <- 1:5
meas <- list(NA,c(1,2),c(1),c(1,2,3),c(1,2,3,4))
myDf <- data.frame(cbind(id,meas))
I can easily copy the vector into another variable or check if it contains a NA
myDf$copyMeas <- myDf$meas
myDf$naMeas <- is.na(myDf$meas)
but when I want to get the length of the vectors I get the number of observations in the data frame
myDf$lengthMeas <- length(myDf$meas)
  id       meas   copyMeas naMeas lengthMeas
1  1         NA         NA   TRUE          5
2  2       1, 2       1, 2  FALSE          5
3  3          1          1  FALSE          5
4  4    1, 2, 3    1, 2, 3  FALSE          5
5  5 1, 2, 3, 4 1, 2, 3, 4  FALSE          5
Why does this happen? What should I use to store the length of each vector in another variable?

Because that column is a list. If you ask for the length of a list, you'll get how many elements it has. You seem to want the length of each element:
sapply(myDf$meas,length)
[1] 1 2 1 3 4

This does the trick:
sapply(myDf$meas, length)
[1] 1 2 1 3 4
length is not vectorized; it returns the length of the object you pass to it. With sapply, length is applied to each entry in myDf$meas separately.

Have a look at
str(myDf)
and you will see that myDf$meas is still a list. Accordingly, the result of length(myDf$meas) is the length of this list, which is 5.
You are looking for
myDf$lengthMeas <- sapply(myDf$meas, length)
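As a minimal sketch of the same idea, base R also provides lengths() (available since R 3.2.0), which returns the length of every list element directly. The data frame below is rebuilt by assigning the list as a column; that construction is an assumption made for illustration and is not the cbind call used in the question.

```r
# Rebuild the example with meas stored as a list column
# (assumption: the list is assigned directly instead of going through cbind).
myDf <- data.frame(id = 1:5)
myDf$meas <- list(NA, c(1, 2), c(1), c(1, 2, 3), c(1, 2, 3, 4))

# length() sees the column as one list with five elements.
length(myDf$meas)

# lengths() reports the length of every element; note that NA has length 1.
myDf$lengthMeas <- lengths(myDf$meas)

# vapply() is a type-checked alternative to sapply().
lengthCheck <- vapply(myDf$meas, length, integer(1))
```

Both lengths() and vapply() return a plain integer vector here, so either can be stored as an ordinary column.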

Related

How to repeat the indices of a vector based on the values of that same vector?

Given a random integer vector below:
z <- c(3, 2, 4, 2, 1)
I'd like to create a new vector that contains each of z's indices, repeated the number of times given by the corresponding element of z. In this case, the desired result is:
[1] 1 1 1 2 2 3 3 3 3 4 4 5
There must be a simple way to do this.
You can use rep and seq to repeat the indices of a vector based on the values of that same vector. seq to get the indices and rep to repeat them.
rep(seq(z), z)
# [1] 1 1 1 2 2 3 3 3 3 4 4 5
Start with all the indices of the vector z, which are given by:
1:length(z)
Then these elements should be repeated. The number of times these numbers should be repeated is specified by the values of z. This can be done using a combination of the lapply or sapply function and the rep function:
unlist(lapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
unlist(sapply(X = 1:length(z), FUN = function(x) rep(x = x, times = z[x])))
[1] 1 1 1 2 2 3 3 3 3 4 4 5
Both alternatives give the same result.
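A small sketch of why seq_along may be a safer choice than seq or 1:length(z) for getting the indices: when the input has a single element, seq expands that value into a sequence instead of returning the index 1. The vector z1 is a hypothetical input added for illustration.

```r
z <- c(3, 2, 4, 2, 1)

# seq_along() always returns the indices 1..length(z).
idx <- rep(seq_along(z), times = z)

# Hypothetical edge case: a vector with a single element.
z1 <- 4
safe  <- rep(seq_along(z1), times = z1)  # index 1 repeated four times
risky <- rep(seq(z1), times = z1)        # seq(4) is 1:4, so 1:4 is repeated four times
```

For the five-element z from the question, all of the variants shown give the same result; the difference only appears for inputs of length one or zero.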

Sort vector into repeating sequence when sequential values are missing R

I would like to take a vector such as this:
x <- c(1,1,1,2,2,2,2,3,3)
and sort this vector into a repeating sequence maintaining the hierarchical order of 1, 2, 3 when values are absent.
return: c(1,2,3,1,2,3,1,2,2)
We can create the order based on the sequence of 'x'
x[order(ave(x, x, FUN = seq_along))]
#[1] 1 2 3 1 2 3 1 2 2
Or with rowid from data.table
library(data.table)
x[order(rowid(x))]
#[1] 1 2 3 1 2 3 1 2 2
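To make the base R one-liner easier to follow, here is a sketch that splits it into its intermediate steps. The variable names occurrence and idx are introduced only for this illustration.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)

# Number each value by its occurrence within its own group:
# the first 1, the second 1, ..., the first 2, the second 2, ...
occurrence <- ave(x, x, FUN = seq_along)

# order() leaves ties in their original order, so all first occurrences
# come first (1, 2, 3), then all second occurrences, and so on.
idx <- order(occurrence)
result <- x[idx]
```

Because x is already sorted, the ties within each occurrence number keep the 1, 2, 3 ordering; for unsorted input, sorting x first would be needed to get the same pattern.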

R checking if the same numbers occur in multiple rows of a data frame

I have a data frame, nearest_neighbour, which lists the nearest neighbours of a point. So for point 1, the 1st nearest neighbour is point 2, the second nearest neighbour is point 3, and so on.
What is the quickest way to loop through this and check if 4 points all share the same nearest neighbours?
Eg. Point 1's three nearest neighbours are 2, 3 and 4. Point 2's nearest neighbours are 1, 3 and 4 etc.
  which.1 which.2 which.3
1       2       3       4
2       1       4       3
3       1       4       2
4       3       1       2
5       2       4       6
6       7       5       2
I can do it easily with if statements for just two neighbours:
count <- 0
for (j in 1:length(nearest_neighbour[[1]])) {
  if (nearest_neighbour[[1]][nearest_neighbour[[1]][j]] == j) {
    count <- count + 1
  }
}
However, this method seems silly for more than two neighbours, as it ends up needing a lot of if statements.
Here is a base R method using factor and apply
groups <- factor(apply(cbind(df, seq_len(nrow(df))), 1,
function(i) paste(sort(i), collapse="_")))
groups
1 2 3 4 5 6
1_2_3_4 1_2_3_4 1_2_3_4 1_2_3_4 2_4_5_6 2_5_6_7
Levels: 1_2_3_4 2_4_5_6 2_5_6_7
The inner function sorts a vector and collapses the result into a string separated by underscores. This function is applied to each row of a modified version of the data frame where the current row number (element ID) is added.
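As a sketch of how the grouping above could be used to answer the original question, the keys can be tabulated to find sets of four points that all share the same neighbours. The data frame is retyped from the question under the name df used in the answer; tab, closed_keys and members are names introduced here for illustration.

```r
# Neighbour table from the question (named df, as in the answer above).
df <- data.frame(which.1 = c(2L, 1L, 1L, 3L, 2L, 7L),
                 which.2 = c(3L, 4L, 4L, 1L, 4L, 5L),
                 which.3 = c(4L, 3L, 2L, 2L, 6L, 2L))

# Key each row by the sorted set of the point itself plus its three neighbours.
groups <- factor(apply(cbind(df, seq_len(nrow(df))), 1,
                       function(i) paste(sort(i), collapse = "_")))

# A key shared by exactly four points marks four points that are
# each other's three nearest neighbours.
tab <- table(groups)
closed_keys <- names(tab)[as.vector(tab) == 4]

# Row numbers (point IDs) belonging to each such key.
members <- split(seq_len(nrow(df)), groups)[closed_keys]
```

With the six rows from the question, only the key 1_2_3_4 is shared by four points, namely points 1 to 4.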
Here is also a base R solution, but with a different approach:
dd <- t(apply(df, 1, function(x) table(factor(x, levels=1:max(df)))))
colSums(dd) >= 4
1 2 3 4 5 6 7
FALSE TRUE FALSE TRUE FALSE FALSE FALSE
So points 2 and 4 each appear at least 4 times.

Comparisons in row or column operations in R

I would like to perform an operation across a column of a data frame wherein the output is dependent on a comparison between two values.
My data frame dat is arranged like this:
region value1
a 0
a 0
a 6
a 7
a 3
a 0
a 4
b 5
b 1
b 0
I want to create a vector of factor values based on integers. The factor value should increment every time the region value changes or every time value1 is 0. So in this case the vector I want would be equivalent to c(1, 2, 2, 2, 2, 3, 3, 4, 4, 5).
I have code to make a factor vector that increments ONLY when value1 is 0:
fac <- as.factor(cumsum(dat[,2]==0))
and I have C-style code that gets roughly the vector I want, but it runs extremely slowly on my overall data and is just plain ugly:
p <- 1
facint <- 1
for (i in 2:length(dat[,2])) {
  facint <- c(facint, p)
  if (dat[i, 2] == 0 || dat[i, 1] != dat[i-1, 1])
    p <- p + 1
}
fac <- as.factor(facint)
So how can I accomplish an operation such as this when operating on every row in R-style programming?
Try
cumsum(dat[,2]==0|c(FALSE,dat$region[-1]!=dat$region[-nrow(dat)]))
# [1] 1 2 2 2 2 3 3 4 4 5
Or
cumsum(!duplicated(dat[,1]) | dat[,2]==0)
#[1] 1 2 2 2 2 3 3 4 4 5

How many times does a run of 1s occur in a vector

I have a problem.
I have a vector that consists of 0s and 1s, for example (011011111011100001111). In R, I need to count how many times runs of two 1s, three 1s, four 1s, and so on appear in the vector. In this example vector, 11 appears once, 111 once, 1111 once, and 11111 once.
Thanks a lot, Peter
I'm assuming you have an actual vector like c(0, 1, 1, 0...).
Here is a solution using table and rle. I've also provided some longer sample data to make it a bit more interesting.
set.seed(1)
myvec <- sample(c(0, 1), 100, replace = TRUE)
temp <- rle(myvec)
table(temp$lengths[temp$values == 1])
#
# 1 2 3 4 6
# 15 8 1 2 1
If, indeed, you are dealing with a crazy-long character string of ones and zeroes, just use strsplit and follow the same logic as above.
myvec <- "00110111100010101101101000001001001110101111110011010000011010001001"
myvec <- as.numeric(strsplit(myvec, "")[[1]])
Here, I've converted to numeric, but that's just so you can use the same code as earlier. You can use rle on a character vector too.
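As a short sketch, the same steps can be applied to the example vector from the question, entered here as a string. The names bits, runs, one_runs and run_counts are introduced only for this illustration.

```r
# Example vector from the question, given as a string of digits.
bits <- "011011111011100001111"
vec <- as.numeric(strsplit(bits, "")[[1]])

# rle() splits the vector into runs: lengths[i] copies of values[i].
runs <- rle(vec)

# Keep only the runs of 1s and count how often each run length occurs.
one_runs <- runs$lengths[runs$values == 1]
run_counts <- table(one_runs)
```

For this input the runs of 1s have lengths 2, 5, 3 and 4, so each of those lengths is counted once, which matches the counts stated in the question.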
rle is your friend:
vec <-c(0,1,1,0,1,1,1,1,1,0,1,1,1,0,0,0,0,1,1,1,1)
res <-data.frame(table(rle(vec)))
res[res$values==1,]
lengths values Freq
6 1 1 0
7 2 1 1
8 3 1 1
9 4 1 1
10 5 1 1
