subset dataframe based on conditions in vector - r

I have two dataframes
#df1
type <- c("A", "B", "C")
day_start <- c(5,8,4)
day_end <- c(12,10,11)
df1 <- cbind.data.frame(type, day_start, day_end)
df1
type day_start day_end
1 A 5 12
2 B 8 10
3 C 4 11
#df2
value <- 1:10
day <- 4:13
df2 <- cbind.data.frame(day, value)
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
10 13 10
I would like to subset df2 such that each level of factor "type" in df1 gets its own dataframe, only including the rows/days between day_start and day_end of this factor level.
Desired outcome for "A" would be..
list_of_dataframes$df_A
day value
1 5 2
2 6 3
3 7 4
4 8 5
5 9 6
6 10 7
7 11 8
8 12 9
I found this question on SO with the answer suggesting to use mapply(), however, I just cannot figure out how I have to adapt the code given there to fit my data and desired outcome.. Can someone help me out?

The following solution assumes that you have all integer values for days, but if that assumption is plausible, it's an easy one-liner:
> apply(df1, 1, function(x) df2[df2$day %in% x[2]:x[3],])
[[1]]
day value
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
[[2]]
day value
5 8 5
6 9 6
7 10 7
[[3]]
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
You can use setNames to name the dataframes in the list:
setNames(apply(df1, 1, function(x) df2[df2$day %in% x[2]:x[3],]),df1[,1])
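Note that apply() first converts df1 to a character matrix (because of the type column), so the one-liner relies on `:` converting the strings back to numbers. A minimal sketch that avoids that coercion by iterating over the two numeric columns directly with Map() might look like this; the names `subsets` and the `df_` prefix are only illustrative, chosen to match the desired output in the question:

```r
# Rebuild the example data from the question
df1 <- data.frame(type = c("A", "B", "C"),
                  day_start = c(5, 8, 4),
                  day_end = c(12, 10, 11),
                  stringsAsFactors = FALSE)
df2 <- data.frame(day = 4:13, value = 1:10)

# One subset of df2 per row of df1, using a range comparison
# (this also works when days are not integers)
subsets <- Map(function(lo, hi) df2[df2$day >= lo & df2$day <= hi, ],
               df1$day_start, df1$day_end)

# Name the list elements df_A, df_B, df_C
names(subsets) <- paste0("df_", df1$type)

subsets$df_A
```

The original row names of df2 are kept in each subset; use rownames(x) <- NULL on an element if you want them renumbered from 1.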

Yes, you can use mapply:
Define a function that will do what you want:
fun <- function(x,y) df2[df2$day >= x & df2$day <= y,]
Then use mapply to apply this function with every element of day_start and day_end:
final.output <- mapply(fun,df1$day_start, df1$day_end, SIMPLIFY=FALSE)
This will give you a list with the outputs you want:
final.output
[[1]]
day value
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
[[2]]
day value
5 8 5
6 9 6
7 10 7
[[3]]
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
You can name each data.frame of the list with setNames:
final.output <- setNames(final.output,df1$type)
Or you can also put an attribute type on the data.frames of the list:
fun <- function(x,y, type){
df <- df2[df2$day >= x & df2$day <= y,]
attr(df, "type") <- as.character(type)
df
}
Then each data.frame of final.output will have an attribute so you know which type it is:
final.output <- mapply(fun,df1$day_start, df1$day_end,df1$type, SIMPLIFY=FALSE)
# check which type the first data.frame is
attr(final.output[[1]], "type")
[1] "A"
Finally, if you do not want a list with the 3 data.frames you can create a function that assigns the 3 data.frames to the global environment:
fun <- function(x,y, type){
df <- df2[df2$day >= x & df2$day <= y,]
name <- as.character(type)
assign(name, df, pos=.GlobalEnv)
}
mapply(fun,df1$day_start, df1$day_end, type=df1$type, SIMPLIFY=FALSE)
This will create 3 separate data.frames in the global environment named A, B and C.
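If writing into the global environment feels too intrusive, one possible variation (a sketch, not part of the original answer) is to unpack the named list into a dedicated environment with list2env(). The name `results` is just an assumption for illustration:

```r
# Example data from the question
df1 <- data.frame(type = c("A", "B", "C"),
                  day_start = c(5, 8, 4),
                  day_end = c(12, 10, 11))
df2 <- data.frame(day = 4:13, value = 1:10)

# Same subsetting function as in the answer above
fun <- function(x, y) df2[df2$day >= x & df2$day <= y, ]

# Named list of subsets, one per type
final.output <- setNames(
  mapply(fun, df1$day_start, df1$day_end, SIMPLIFY = FALSE),
  as.character(df1$type)
)

# Copy every list element into its own environment instead of .GlobalEnv
results <- new.env()
list2env(final.output, envir = results)

ls(results)                 # names of the objects that were created
get("B", envir = results)   # retrieve one of them
```

Passing envir = .GlobalEnv instead would reproduce the behaviour of the assign() version.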

Related

How to find closest match from list in R

I have a list of numbers and would like to find which is the next highest compared to each number in a data.frame. I have:
list <- c(3,6,9,12)
X <- c(1:10)
df <- data.frame(X)
And I would like to add a variable to df being the next highest number in the list. i.e:
X Y
1 3
2 3
3 3
4 6
5 6
6 6
7 9
8 9
9 9
10 12
I've tried:
df$Y <- which.min(abs(list-df$X))
but that gives an error message and would just get the closest value from the list, not the next above.
Another approach is to use findInterval:
df$Y <- list[findInterval(X, list, left.open=TRUE) + 1]
> df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
You could do this...
df$Y <- sapply(df$X, function(x) min(list[list>=x]))
df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
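The two answers differ when a value is larger than every entry in the list: min() on an empty selection returns Inf with a warning, while the findInterval() version returns NA. A small sketch wrapping the findInterval() approach in a helper (the names `next_above` and `breaks` are hypothetical; `breaks` is used instead of `list` to avoid masking the base function) could be:

```r
# Return, for each x, the smallest break that is >= x (NA if there is none)
next_above <- function(x, breaks) {
  breaks <- sort(breaks)
  # left.open = TRUE makes the intervals (b[i], b[i+1]], so a value equal
  # to a break maps to that break itself
  breaks[findInterval(x, breaks, left.open = TRUE) + 1]
}

breaks <- c(3, 6, 9, 12)
df <- data.frame(X = 1:14)
df$Y <- next_above(df$X, breaks)
df
```

Here the rows with X equal to 13 and 14 get NA in Y, because no entry of `breaks` is at or above them.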

Replace sequence of identical values of length > 2

I have a sensor that measures a variable, and when there is no connection it always returns the last value seen instead of NA. So in my vector I would like to replace these identical values by an imputed value (for example with na.approx).
set.seed(3)
vec <- round(runif(20)*10)
#### [1] 2 8 4 3 6 6 1 3 6 6 5 5 5 6 9 8 1 7 9 3
But I want only the sequences bigger than 2 (3 or more identical numbers) because 2 identical numbers can appear naturally. (in previous example the sequence to tag would be 5 5 5)
I tried to do it with diff to tag my identical points (c(0, diff(vec) == 0)) but I don't know how to deal with the length == 2 condition...
EDIT
my expected output could be like this:
#### [1] 2 8 4 3 6 6 1 3 6 6 5 NA NA 6 9 8 1 7 9 3
(The second identical value of a sequence of 3 or more is very probably a wrong value too)
Thanks
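You can use the lag function from dplyr (base R's stats::lag does not shift the values of a plain vector). Note that this flags only the third and later values of each run:
library(dplyr)
set.seed(3)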
> vec <- round(runif(20)*10)
>
> vec
[1] 2 8 4 3 6 6 1 3 6 6 5 5 5 6 9 8 1 7 9 3
>
> vec[vec == lag(vec) & vec == lag(vec,2)] <- NA
>
> vec
[1] 2 8 4 3 6 6 1 3 6 6 5 5 NA 6 9 8 1 7 9 3
>
you can use rle to get the indices of the positions where NA should be assigned.
vec[with(data = rle(vec),
expr = unlist(sapply(which(lengths > 2), function(i)
(sum(lengths[1:i]) - (lengths[i] - 2)):sum(lengths[1:i]))))] = NA
vec
#[1] 2 8 4 3 6 6 1 3 6 6 5 NA NA 6 9 8 1 7 9 3
As a function:
foo = function(X, length){
replace(x = X,
list = with(data = rle(X),
expr = unlist(sapply(which(lengths > length), function(i)
(sum(lengths[1:i]) - (lengths[i] - length)):sum(lengths[1:i])))),
values = NA)
}
foo(X = vec, length = 2)
#[1] 2 8 4 3 6 6 1 3 6 6 5 NA NA 6 9 8 1 7 9 3
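The same rle() idea can be written without computing index ranges by hand. The sketch below expands the run lengths back to one entry per element and uses sequence() to get each element's position within its run; `flag_stuck`, `min_run` and `keep` are hypothetical names, and the vector is typed in directly rather than regenerated with set.seed():

```r
# Replace values inside runs of at least `min_run` identical values by NA,
# keeping the first `keep` values of each such run
flag_stuck <- function(x, min_run = 3, keep = 1) {
  r <- rle(x)
  run_len <- rep(r$lengths, r$lengths)  # length of the run each element is in
  pos     <- sequence(r$lengths)        # position of each element in its run
  x[run_len >= min_run & pos > keep] <- NA
  x
}

vec <- c(2, 8, 4, 3, 6, 6, 1, 3, 6, 6, 5, 5, 5, 6, 9, 8, 1, 7, 9, 3)
flag_stuck(vec)
```

For this vector only the second and third 5 are replaced, which matches the expected output in the question. The result could then be passed to an interpolation function such as na.approx from the zoo package, as the question suggests.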

repeat sequences from vector

Say I have a vector like so:
vector <- 1:9
#$ [1] 1 2 3 4 5 6 7 8 9
I now want to repeat every i to i+x sequence n times, like so for x=3, and n=2:
#$ [1] 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9
I'm accomplishing this like so:
index <- NULL
x <- 3
n <- 2
for (i in 1:(length(vector)/x)) {
index <- c(index, rep(c(1:x + (i-1)*x), n))
}
#$ [1] 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9
This works just fine, but I have a hunch there's got to be a better way (especially since usually, a for loop is not the answer).
Ps.: the use case for this is actually repeating rows in a dataframe, but just getting the index vector would be fine.
You can try to first split the vector, then use rep and unlist:
x <- 3 # this is the length of each subset sequence from i to i+x (see above)
n <- 2 # this is how many times you want to repeat each subset sequence
unlist(lapply(split(vector, rep(1:(length(vector)/x), each = x)), rep, n), use.names = FALSE)
# [1] 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9
Or, you can try creating a matrix and converting it to a vector:
c(do.call(rbind, replicate(n, matrix(vector, nrow = x), simplify = FALSE)))
# [1] 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9
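Since the question mentions that the real goal is an index vector for repeating data frame rows, here is a sketch that builds the index with rep() arithmetic alone. It assumes length(vector) is a multiple of x; the names `k`, `index` and the toy data frame `df` are illustrative only:

```r
vector <- 1:9
x <- 3   # length of each block
n <- 2   # number of times each block is repeated

k <- length(vector) %/% x   # number of blocks

# Within-block positions 1..x, repeated n times per block,
# plus the offset of the block each position belongs to
index <- rep(seq_len(x), times = n * k) +
  rep((seq_len(k) - 1) * x, each = n * x)

index

# Use the index to repeat rows of a data frame in the same pattern
df <- data.frame(id = vector, letter = letters[vector])
df[index, ]
```

With x = 3 and n = 2 the index comes out as 1 2 3 1 2 3 4 5 6 4 5 6 7 8 9 7 8 9, the same sequence as in the question.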

3-Dimension Array in R

Suppose in one data frame I have, (they are strings)
data1<-data.frame(c("number1","number2"),c("dog,cat","pigeon,leopard"))
and in another data frame I have
data2<-data.frame(c("pigeon","leopard","dog","cat"),
c("5 6 7 8","10 11 12 13","1 2 3 4","5 6 7 8"))
data2:
pigeon 5 6 7 8
leopard 10 11 12 13
dog 1 2 3 4
cat 5 6 7 8
My expected output is a 3-d matrix which would give me:
i=number1/number2
j=the strings corresponding to i
k=the values from the 2nd data frame.
That is I will have, if i select number1,
dog 1 2 3 4
cat 5 6 7 8
It seems that you just want an extra column in data2 with "number1" and "number2" in the correct places and not really a 3d array.
data2 <- data.frame(j = c("pigeon","leopard","dog","cat"),
k = c("5 6 7 8","10 11 12 13","1 2 3 4","5 6 7 8"),
i = c("number2", "number2", "number1", "number1"))
Then you can choose everything for "number1" using
data2[data2$i == "number1", ]
If you don't like to have the i column in the result you can do:
data2[data2$i == "number1", ][c("j", "k")]
## j k
## 3 dog 1 2 3 4
## 4 cat 5 6 7 8
I'm not sure I understand your question, but if you want to use the names from data1 to select rows in data2, you could do
lapply(seq_along(data1[, 1]), function(i) data2[data2[, 1] %in% strsplit(as.character(data1[i, 2]), ",")[[1]],])
which will result in a list of data frames
# [[1]]
# c..pigeon....leopard....dog....cat.. c..5.6.7.8....10.11.12.13....1.2.3.4....5.6.7.8..
# 3 dog 1 2 3 4
# 4 cat 5 6 7 8
#
# [[2]]
# c..pigeon....leopard....dog....cat.. c..5.6.7.8....10.11.12.13....1.2.3.4....5.6.7.8..
# 1 pigeon 5 6 7 8
# 2 leopard 10 11 12 13
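Both answers can be combined so that the "number1"/"number2" labels do not have to be typed by hand. The following is only a sketch: it assumes the columns are given the names i, j and k (the original data frames have no explicit column names), and `lookup`, `combined` and `by_i` are hypothetical helper names:

```r
# Example data with explicit column names (an assumption for readability)
data1 <- data.frame(i = c("number1", "number2"),
                    j = c("dog,cat", "pigeon,leopard"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(j = c("pigeon", "leopard", "dog", "cat"),
                    k = c("5 6 7 8", "10 11 12 13", "1 2 3 4", "5 6 7 8"),
                    stringsAsFactors = FALSE)

# Split the comma-separated strings: one row per (i, j) pair
parts  <- strsplit(data1$j, ",", fixed = TRUE)
lookup <- data.frame(i = rep(data1$i, lengths(parts)),
                     j = unlist(parts),
                     stringsAsFactors = FALSE)

# Attach the values from data2, then split into one data frame per i
combined <- merge(lookup, data2, by = "j")
by_i <- split(combined[c("j", "k")], combined$i)

by_i$number1
```

Selecting by_i$number1 then gives the dog and cat rows; note that merge() sorts by the key column, so the row order may differ from data2.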

repeatedly applying ave for computing group means in a data frame

The following code separately produces the group means of x and y according to group. Suppose that I have a number of variables for which I need to repeat the same operation.
How would you suggest proceeding in order to obtain the same result with a single command? (I suppose it is necessary to use tapply, but I am not really sure about it.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following @Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
# x y
# [1,] 3 4
# [2,] 3 4
# [3,] 3 4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
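For completeness, the original ave() approach can also be applied to several columns in one statement while staying in a data frame, which sidesteps the renaming step. This is a minimal base R sketch; `vars` is a hypothetical vector listing the columns to average:

```r
dat <- data.frame(group = rep(1:2, each = 3),
                  x = seq(1, 11, by = 2),
                  y = seq(2, 12, by = 2))

# Columns for which group means are wanted
vars <- c("x", "y")

# Apply ave() to each selected column and store the results
# in new columns prefixed with "m_"
dat[paste0("m_", vars)] <- lapply(dat[vars], function(col) ave(col, dat$group))

dat
```

Because the new column names are built with paste0(), this works for variable names of any length.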

Resources