R - subset data if conditions - r

How can I subset data with logical conditions?
Assume I have the data below. I would like to subset the data set so that it first keeps all animals that have an FCR record, and then also includes all other animals in the same pens as those animals.
animal Feed Litter Pen
1 0.2 5 3
2 NA 5 3
3 0.2 5 3
4 0.2 6 4
5 0.3 5 4
6 0.3 4 4
7 0.3 5 3
8 0.3 5 3
9 NA 5 5
10 NA 3 5
11 NA 3 3
12 NA 3 5
13 0.4 7 3
14 0.4 7 3
15 NA 7 5

I'm assuming that "FCR record" (in your question) relates to "Feed". Then, if I understand the question correctly, you can do this:
split(df[complete.cases(df),], df[complete.cases(df), 4])
# $`3`
# animal Feed Litter Pen
# 1 1 0.2 5 3
# 3 3 0.2 5 3
# 7 7 0.3 5 3
# 8 8 0.3 5 3
# 13 13 0.4 7 3
# 14 14 0.4 7 3
#
# $`4`
# animal Feed Litter Pen
# 4 4 0.2 6 4
# 5 5 0.3 5 4
# 6 6 0.3 4 4
In the above, complete.cases drops every incomplete observation (a row with NA in any column). If you only need to check a specific variable, you can use something like df[!is.na(df$Feed), ] instead of complete.cases. Then split creates a list of data.frames, one per Pen.

# all animals with Feed data
df[!is.na(df$Feed), ]
# all animals from pens with at least one animal with feed data in the pen
df[ave(!is.na(df$Feed), df$Pen, FUN = any), ]
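To make the snippets above runnable, here is a minimal sketch that rebuilds the example data as df (the name the answers use) and compares two ways of selecting whole pens. The %in% variant and the helper names pens_with_feed, same_pen and via_ave are my own illustration, not part of the original answers.

```r
# Rebuild the example data from the question
df <- data.frame(
  animal = 1:15,
  Feed   = c(0.2, NA, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3, NA, NA, NA, NA, 0.4, 0.4, NA),
  Litter = c(5, 5, 5, 6, 5, 4, 5, 5, 5, 3, 3, 3, 7, 7, 7),
  Pen    = c(3, 3, 3, 4, 4, 4, 3, 3, 5, 5, 3, 5, 3, 3, 5)
)

# Pens that contain at least one animal with a Feed record
pens_with_feed <- unique(df$Pen[!is.na(df$Feed)])

# Keep every animal housed in one of those pens, with or without its own record
same_pen <- df[df$Pen %in% pens_with_feed, ]

# The ave() approach from the answer should select the same rows
via_ave <- df[ave(!is.na(df$Feed), df$Pen, FUN = any), ]
```

With this data, pen 5 has no Feed records at all, so its animals are dropped, while animals 2 and 11 stay because they share pen 3 with recorded animals.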

Related

Combine incremental sequence with a fixed columns in a dataframe [duplicate]

I have a dataframe:
data.frame(x=c(1,2,3), y=c(4,5,6))
x y
1 1 4
2 2 5
3 3 6
For each row, I want to repeat x and y for each element within a given sequence, where the sequence is:
E=seq(0,0.2,by=0.1)
So when combined this would give:
x y E
1 1 4 0
2 1 4 0.1
3 1 4 0.2
4 2 5 0
5 2 5 0.1
6 2 5 0.2
7 3 6 0
8 3 6 0.1
9 3 6 0.2
I can not seem to achieve this with expand.grid - seems to give me all possible combinations. Am I after a cartesian product?
library(data.table)
dt <- data.table(x=c(1,2,3), y=c(4,5,6))
dt[,.(E=seq(0,0.2,by=0.1)),by=.(x,y)]
#> x y E
#> 1: 1 4 0.0
#> 2: 1 4 0.1
#> 3: 1 4 0.2
#> 4: 2 5 0.0
#> 5: 2 5 0.1
#> 6: 2 5 0.2
#> 7: 3 6 0.0
#> 8: 3 6 0.1
#> 9: 3 6 0.2
Created on 2020-05-01 by the reprex package (v0.3.0)
Yes, you are looking for a Cartesian product, but base expand.grid treats each column as a separate factor, so it cannot keep the rows of a data frame together.
You can use tidyr functions here :
tidyr::expand_grid(df, E)
# A tibble: 9 x 3
# x y E
# <dbl> <dbl> <dbl>
#1 1 4 0
#2 1 4 0.1
#3 1 4 0.2
#4 2 5 0
#5 2 5 0.1
#6 2 5 0.2
#7 3 6 0
#8 3 6 0.1
#9 3 6 0.2
Or with crossing
tidyr::crossing(df, E)
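If you would rather avoid extra packages, the same result can be sketched in base R by repeating row indices. This is an alternative I'm adding for illustration, assuming df and E are defined as in the question; idx and out are hypothetical names.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
E  <- seq(0, 0.2, by = 0.1)

# Repeat each row index once per element of E ...
idx <- rep(seq_len(nrow(df)), each = length(E))

# ... and recycle E alongside the repeated rows
out <- cbind(df[idx, ], E = rep(E, times = nrow(df)))
rownames(out) <- NULL
```

Using each for the rows and times for E keeps the rows of df together, so x and y never get crossed with one another.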

Dividing data and putting into two boxplot

> sleep
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
I have this data set and I'm supposed to divide it by the effect that group has on different people and put it into two boxplots. As you can see, group 1 and group 2 are stored in the same column (group), so I don't know how to divide the data into group 1 and group 2. Can you help me with this?
You don't need to divide the data to put it into a boxplot:
boxplot(extra~group,data=sleep)
You can explore the different options available by using ?boxplot.
Some people like to use the ggplot2 package:
library(ggplot2)
ggplot(sleep,aes(x=group,y=extra,group=group))+geom_boxplot()
Others prefer lattice:
library(lattice)
bwplot(group~extra,data=sleep)
This is a good dataset to use ggplot2 with.
library(ggplot2)
ggplot(sleep, aes(x=factor(group), y=extra)) + geom_boxplot()
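If you really do want the two groups as separate vectors (for example to inspect them before plotting), a minimal sketch with split follows. The axis labels and the names groups and medians are illustrative choices of mine.

```r
# `sleep` ships with base R (datasets package)
groups <- split(sleep$extra, sleep$group)

# boxplot() also accepts a list and draws one box per element
boxplot(groups, names = c("group 1", "group 2"), ylab = "extra")

# Per-group summaries, if the numbers are needed as well
medians <- sapply(groups, median)
```

The formula interface shown in the answers does this splitting for you, so the explicit split is only worth it when the separate vectors are needed elsewhere.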

How do I merge two data frames in R but keep all missing values?

I need to combine two data frames that have different lengths, and keep all the "missing values". The problem is that there are not really missing values, but rather just fewer rows for some values than for others.
Example:
df1 looks like this:
Shrub value period
1 0.5 1
2 0.6 1
3 0.7 1
4 0.8 1
5 0.9 1
10 0.9 1
1 0.4 2
5 0.4 2
6 0.5 2
7 0.3 2
2 0.4 3
3 0.1 3
8 0.5 3
9 0.2 3
df2 looks like this:
Shrub x y
1 5 8
2 6 7
3 3 2
4 1 2
5 4 6
6 5 9
7 9 4
8 2 1
9 4 3
10 3 6
and I want the combined data frame to look like:
Shrub x y value period
1 5 8 0.5 1
2 6 7 0.6 1
3 3 2 0.7 1
4 1 2 0.8 1
5 4 6 0.9 1
6 5 9 NA 1
7 9 4 NA 1
8 2 1 NA 1
9 4 3 NA 1
10 3 6 0.9 1
1 5 8 0.4 2
2 6 7 NA 2
3 3 2 NA 2
4 1 2 NA 2
5 4 6 0.4 2
6 5 9 0.5 2
7 9 4 0.3 2
8 2 1 NA 2
9 4 3 NA 2
10 3 6 NA 2
1 5 8 NA 3
2 6 7 0.4 3
3 3 2 0.1 3
4 1 2 NA 3
5 4 6 NA 3
6 5 9 NA 3
7 9 4 NA 3
8 2 1 0.5 3
9 4 3 0.2 3
10 3 6 NA 3
I have tried the merge command using all = TRUE, but this does not give me what I want. I haven't been able to find this anywhere, so any help is appreciated!
This is a situation where complete from package tidyr is useful (this is in tidyr_0.3.0, which is currently available on GitHub). You can use this function to expand df1 to include all period/Shrub combinations, filling the other variables in with NA by default. Once you do that you can simply join the two datasets together - I'll use inner_join from dplyr.
library(dplyr)
library(tidyr)
First, using complete on df1, showing the first 10 lines of output:
complete(df1, period, Shrub)
Source: local data frame [30 x 3]
period Shrub value
1 1 1 0.5
2 1 2 0.6
3 1 3 0.7
4 1 4 0.8
5 1 5 0.9
6 1 6 NA
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 0.9
.. ... ... ...
Then all you need to do is join this expanded dataset with df2:
complete(df1, period, Shrub) %>%
inner_join(., df2)
Source: local data frame [30 x 5]
period Shrub value x y
1 1 1 0.5 5 8
2 1 2 0.6 6 7
3 1 3 0.7 3 2
4 1 4 0.8 1 2
5 1 5 0.9 4 6
6 1 6 NA 5 9
7 1 7 NA 9 4
8 1 8 NA 2 1
9 1 9 NA 4 3
10 1 10 0.9 3 6
.. ... ... ... . .
Start by repeating the rows of df2 to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:
tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame
The result:
> head(out)
Shrub period x y value
1 1 1 5 8 0.5
2 1 2 5 8 0.4
3 1 3 5 8 NA
4 2 1 6 7 0.6
5 2 2 6 7 NA
6 2 3 6 7 0.4
> str(out)
'data.frame': 30 obs. of 5 variables:
$ Shrub : int 1 1 1 2 2 2 3 3 3 4 ...
$ period: int 1 2 3 1 2 3 1 2 3 1 ...
$ x : int 5 5 5 6 6 6 3 3 3 1 ...
$ y : int 8 8 8 7 7 7 2 2 2 2 ...
$ value : num 0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...
You can use dplyr. This works by taking each period in a separate data frame and merging with all = TRUE to keep every Shrub, then putting it all back together. The cbind(df2, ...) part adds the period to the rows that would otherwise be missing it, so we don't get extra NAs:
library(dplyr)
df1 %>% group_by(period) %>%
do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))
Shrub period value x y
1 1 1 0.5 5 8
2 2 1 0.6 6 7
3 3 1 0.7 3 2
4 4 1 0.8 1 2
5 5 1 0.9 4 6
6 6 1 NA 5 9
7 7 1 NA 9 4
8 8 1 NA 2 1
9 9 1 NA 4 3
10 10 1 0.9 3 6
11 1 2 0.4 5 8
12 2 2 NA 6 7
13 3 2 NA 3 2
14 4 2 NA 1 2
15 5 2 0.4 4 6
16 6 2 0.5 5 9
17 7 2 0.3 9 4
18 8 2 NA 2 1
19 9 2 NA 4 3
20 10 2 NA 3 6
21 1 3 NA 5 8
22 2 3 0.4 6 7
23 3 3 0.1 3 2
24 4 3 NA 1 2
25 5 3 NA 4 6
26 6 3 NA 5 9
27 7 3 NA 9 4
28 8 3 0.5 2 1
29 9 3 0.2 4 3
30 10 3 NA 3 6
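For a self-contained variant, here is a base R sketch that builds the full Shrub/period grid with expand.grid and then merges twice. It is my own illustration of the same idea as the answers above, with df1 and df2 retyped from the question; grid and out are hypothetical names.

```r
df1 <- data.frame(
  Shrub  = c(1, 2, 3, 4, 5, 10, 1, 5, 6, 7, 2, 3, 8, 9),
  value  = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.4, 0.4, 0.5, 0.3, 0.4, 0.1, 0.5, 0.2),
  period = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
)
df2 <- data.frame(
  Shrub = 1:10,
  x = c(5, 6, 3, 1, 4, 5, 9, 2, 4, 3),
  y = c(8, 7, 2, 2, 6, 9, 4, 1, 3, 6)
)

# Every Shrub/period combination, whether or not df1 has a value for it
grid <- expand.grid(Shrub = df2$Shrub, period = unique(df1$period))

# Attach the coordinates, then left-join the measured values
out <- merge(grid, df2, by = "Shrub")
out <- merge(out, df1, by = c("Shrub", "period"), all.x = TRUE)

# Order rows and columns as in the desired output
out <- out[order(out$period, out$Shrub), c("Shrub", "x", "y", "value", "period")]
rownames(out) <- NULL
```

Building the grid first is what makes the missing combinations appear as NA: merge with all = TRUE on df1 and df2 alone has no way of knowing which periods a shrub is absent from.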

Replicate each row of data.frame and specify the number of replications for each row?

I am programming in R and I got the following problem:
I have a data frame jb that is quite long. Here's a simple version of it, together with the expanded result I want:
jb:
a b frequency
5 3 2
5 7 1
9 1 40
12 4 5
12 5 13
...
jb.expanded:
a b
5 3
5 3
5 7
9 1
9 1
...
I want to replicate the rows and the frequency of the replication is the column frequency. That means, the first row is replicated two times, the second row is replicated 1 time and so on. I already solved that problem with the code
jb.expanded <- jb[rep(row.names(jb), jb$freqency), 1:2]
Now here is the problem:
Whenever any number in the frequency column is greater than 10, the number of replicated rows is wrong. For example:
Frequency: 43 --> 14 rows
40 --> 13 rows
13 --> 11 rows
14 --> 12 rows
Can you help me? I have no idea how to fix that, I also cannot find anything on the internet.
Thanks for your help!
Update
Upon revisiting this question, I have a feeling that @Codoremifa was correct in their assumption that your "frequency" column might be a factor.
Here's an example if that were the case. It won't match your actual data since I don't know what other levels are in your dataset.
mydf$F2 <- factor(as.character(mydf$frequency))
## expandRows(mydf, "F2")
mydf[rep(rownames(mydf), mydf$F2), ]
# a b frequency F2
# 1 5 3 2 2
# 1.1 5 3 2 2
# 1.2 5 3 2 2
# 2 5 7 1 1
# 3 9 1 40 40
# 3.1 9 1 40 40
# 3.2 9 1 40 40
# 3.3 9 1 40 40
# 4 12 4 5 5
# 4.1 12 4 5 5
# 4.2 12 4 5 5
# 4.3 12 4 5 5
# 4.4 12 4 5 5
# 5 12 5 13 13
# 5.1 12 5 13 13
Hmmm. That doesn't look like 61 rows to me. Why not? Because rep uses the numeric values underlying the factor, which is quite different in this case from the displayed value:
as.numeric(mydf$F2)
# [1] 3 1 4 5 2
To properly convert it, you would need:
as.numeric(as.character(mydf$F2))
# [1] 2 1 40 5 13
Original answer
A while ago I wrote a function that is a bit more of a generalization of @Simono101's answer. The function looks like this:
expandRows <- function(dataset, count, count.is.col = TRUE) {
  if (!isTRUE(count.is.col)) {
    if (length(count) == 1) {
      dataset[rep(rownames(dataset), each = count), ]
    } else {
      if (length(count) != nrow(dataset)) {
        stop("Expand vector does not match number of rows in data.frame")
      }
      dataset[rep(rownames(dataset), count), ]
    }
  } else {
    dataset[rep(rownames(dataset), dataset[[count]]),
            setdiff(names(dataset), names(dataset[count]))]
  }
}
For your purposes, you could just use expandRows(mydf, "frequency")
head(expandRows(mydf, "frequency"))
# a b
# 1 5 3
# 1.1 5 3
# 2 5 7
# 3 9 1
# 3.1 9 1
# 3.2 9 1
Other options are to repeat each row the same number of times:
expandRows(mydf, 2, count.is.col=FALSE)
# a b frequency
# 1 5 3 2
# 1.1 5 3 2
# 2 5 7 1
# 2.1 5 7 1
# 3 9 1 40
# 3.1 9 1 40
# 4 12 4 5
# 4.1 12 4 5
# 5 12 5 13
# 5.1 12 5 13
Or to specify a vector of how many times to repeat each row.
expandRows(mydf, c(1, 2, 1, 0, 2), count.is.col=FALSE)
# a b frequency
# 1 5 3 2
# 2 5 7 1
# 2.1 5 7 1
# 3 9 1 40
# 5 12 5 13
# 5.1 12 5 13
Note the required count.is.col = FALSE argument in those last two options.
Nearly. You want to pass [ a vector of row indices, not row.names. Try this...
jb[ rep( seq_len( nrow(jb) ) , times = jb$frequency ) , ]
rep( seq_len( nrow(jb) ) , times = jb$frequency )
# [1] 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [39] 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5
This might be more of a comment, but seeing that all the other answers suggest new options: if you correct the spelling of jb$freqency when creating jb.expanded and convert jb$frequency to an integer (going through as.character if it is a factor), then the construction in your question also works.
The reason I suspect jb$frequency is a factor is that the incorrect counts are neatly ordered as 11, 12, 13, 14.
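Putting those two fixes together, a minimal sketch might look like the following. The toy jb stores frequency as a factor on purpose, which is an assumption about the asker's data rather than something the question confirms.

```r
# Toy version of jb; frequency is deliberately a factor (assumed, see above)
jb <- data.frame(
  a = c(5, 5, 9, 12, 12),
  b = c(3, 7, 1, 4, 5),
  frequency = factor(c(2, 1, 40, 5, 13))
)

# Convert the factor labels (not the internal level codes) to integers
if (is.factor(jb$frequency)) {
  jb$frequency <- as.integer(as.character(jb$frequency))
}

# Same construction as in the question, with the column name spelled correctly
jb.expanded <- jb[rep(row.names(jb), jb$frequency), 1:2]

# Sanity check: one row per counted occurrence
stopifnot(nrow(jb.expanded) == sum(jb$frequency))
```

The final stopifnot is a cheap guard worth keeping in real code, since a mismatch between the row count and the sum of the frequencies is exactly the symptom described in the question.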
