Rearrange and aggregate in R

I have the following data
> site<-c("A","A","A","B","B","C")
> sample<-c("N","N","N","W","W","S")
> effort<-c(2,2,2,1,1,3)
> y<-c(1,0,1,1,0,1)
> df<-data.frame(site,sample,effort,y)
> df
site sample effort y
1 A N 2 1
2 A N 2 0
3 A N 2 1
4 B W 1 1
5 B W 1 0
6 C S 3 1
And I would like to rearrange it to get the minimum "effort" and the sum of "y" per site and sample.
To end up with the following
site sample effort y
1 A N 2 2
2 B W 1 1
3 C S 3 1
I have tried the following code:
tr <- aggregate(. ~ site + sample, data = df, FUN = function(x) c(m = min(x), n = length(x)))
> tr
site sample effort.m effort.n y.m y.n
1 A N 2 3 0 3
2 C S 3 1 1 1
3 B W 1 2 0 2
This is nearly what I am looking for, but is there a better way to do this, and how should I deal with zeros in the data?

An answer using the powerful dplyr package:
library(dplyr)
df %>%
  group_by(site, sample) %>%
  summarise(mineff = min(effort),
            y = sum(y))
  site sample mineff y
1 A N 2 2
2 B W 1 1
3 C S 3 1

Using plyr
require(plyr)
ddply(df, c("site", "sample"), summarize,
min_eff = min(effort), sum_y = sum(y))
site sample min_eff sum_y
1 A N 2 2
2 B W 1 1
3 C S 3 1
In your example, there's a one-to-one correspondence between site and sample. This approach will work for every pairwise distinct combination. As to
How should I deal with zeros in the data?
How do you want to deal with them? What are your concerns?
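For completeness, the same result can be had in base R by aggregating each statistic separately and merging on the grouping keys (a sketch):
# base R sketch: one aggregate() per summary statistic, merged on site/sample
min_eff <- aggregate(effort ~ site + sample, data = df, FUN = min)
sum_y   <- aggregate(y ~ site + sample, data = df, FUN = sum)
merge(min_eff, sum_y)
#   site sample effort y
# 1    A      N      2 2
# 2    B      W      1 1
# 3    C      S      3 1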

Related

Adding an index column representing a repetition of a dataframe in R

I have a dataframe in R that I'd like to repeat several times, and I want to add in a new variable to index those repetitions. The best I've come up with is using mutate + rbind over and over, and I feel like there has to be an efficient dataframe method I could be using here.
Here's an example: df <- data.frame(x = 1:3, y = letters[1:3]) gives us the dataframe
  x y
1 1 a
2 2 b
3 3 c
I'd like to repeat that say 3 times, with an index that looks like this:
  x y index
1 1 a     1
2 2 b     1
3 3 c     1
4 1 a     2
5 2 b     2
6 3 c     2
7 1 a     3
8 2 b     3
9 3 c     3
Using the rep function, I can get the first two columns, but not the index column. The best I've come up with so far (using dplyr) is:
df2 <- df %>%
  mutate(index = 1) %>%
  rbind(df %>% mutate(index = 2)) %>%
  rbind(df %>% mutate(index = 3))
This obviously doesn't work if I need to repeat my dataframe more than a handful of times. It feels like the kind of thing that should be easy to do using dataframe methods, but I haven't been able to find anything.
Grateful for any tips!
You can use this code for as many repetitions as you would like; you just have to set the n argument.
The replicate function takes two main arguments: we first specify the number of times we would like to reproduce our data set via n, then we supply our data set as the expr argument. The result is a list whose elements are copies of our data set.
After that we pass it along to the imap function from the purrr package to assign a unique id to each copy. .x represents each element of our list (here, a data frame) and .y is the position of that element, which corresponds to the repetition number. So, for example, the id column of the first copy is filled with 1 because .y equals 1 there, and so on.
library(dplyr)
library(purrr)
replicate(3, df, simplify = FALSE) %>%
  imap_dfr(~ .x %>% mutate(id = .y))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
In base R you can use the following code:
do.call(rbind,
        mapply(function(x, z) {
          x$id <- z
          x
        }, replicate(3, df, simplify = FALSE), 1:3, SIMPLIFY = FALSE))
x y id
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
You can use rerun to repeat the dataframe n times and add an index column with bind_rows:
library(dplyr)
library(purrr)
n <- 3
df <- data.frame(x = 1:3, y = letters[1:3])
bind_rows(rerun(n, df), .id = 'index')
# index x y
#1 1 1 a
#2 1 2 b
#3 1 3 c
#4 2 1 a
#5 2 2 b
#6 2 3 c
#7 3 1 a
#8 3 2 b
#9 3 3 c
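Note that rerun() has since been deprecated in purrr (1.0.0 and later); repeating the data frame with rep(list(df), n) gives the same result (a sketch):
# rep(list(df), n) replaces the deprecated rerun(n, df)
bind_rows(rep(list(df), n), .id = 'index')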
In base R, we can repeat the row index n times and attach the index with transform:
transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
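One caveat worth noting: the repeated row index leaves duplicated row names such as 1.1 and 1.2; they can be reset afterwards:
# reset the 1.1, 1.2, ... row names produced by the repeated index
out <- transform(df[rep(1:nrow(df), n), ], index = rep(1:n, each = nrow(df)))
rownames(out) <- NULL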
One more way, using purrr's map_dfr:
n <- 3
map_dfr(seq_len(n), ~ df %>% mutate(index = .x))
x y index
1 1 a 1
2 2 b 1
3 3 c 1
4 1 a 2
5 2 b 2
6 3 c 2
7 1 a 3
8 2 b 3
9 3 c 3
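One more base R trick worth knowing (a sketch): merge() with no common columns returns the Cartesian product, although the column order puts index first and the row order may differ from the outputs above:
# cross join: every index value paired with every row of df
merge(data.frame(index = seq_len(n)), df)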

Pool items in database until minimum sample size reached and find all permutations in R

This is an example.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
df
item n
1 a 3
2 b 2
3 c 2
4 d 1
5 e 1
The items need to be grouped so that each group has a total sample size of at least 4.
This would be one solution, following the sort order of df:
item n cluster
1 a 3 1
2 b 2 1
3 c 2 2
4 d 1 2
5 e 1 2
How to get all possible unique solutions?
Further, the code should also not allow any clusters to have a sample size less than 4.
Below, we have a brute force approach using the package partitions. The idea is that we find every partition of the rows of df. We then sum each group and check to see that the requirement has been met.
df <- data.frame(item=letters[1:5], n=c(3,2,2,1,1))
minSize <- 4
funGetClusters <- function(df, minSize) {
  allParts <- partitions::listParts(nrow(df))
  goodInd <- which(sapply(allParts, function(p) {
    all(sapply(p, function(x) sum(df$n[x])) >= minSize)
  }))
  allParts[goodInd]
}
clusterBreakdown <- funGetClusters(df, minSize)
allDfs <- lapply(clusterBreakdown, function(p) {
  copyDf <- df
  copyDf$cluster <- 1L
  clustInd <- 2L
  for (i in p[-1]) {
    copyDf$cluster[i] <- clustInd
    clustInd <- clustInd + 1L  # advance the label for each additional group
  }
  copyDf
})
Here is the output:
allDfs
[[1]]
item n cluster
1 a 3 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 1
[[2]]
item n cluster
1 a 3 1
2 b 2 2
3 c 2 2
4 d 1 1
5 e 1 1
[[3]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 2
5 e 1 1
[[4]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
[[5]]
item n cluster
1 a 3 2
2 b 2 1
3 c 2 2
4 d 1 1
5 e 1 1
[[6]]
item n cluster
1 a 3 2
2 b 2 2
3 c 2 1
4 d 1 1
5 e 1 1
It should be noted that there is a combinatorial explosion as the number of rows increases. For example, with just 10 rows we would have to test 115975 different partitions.
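That 115975 is the 10th Bell number, which counts the set partitions of n elements. A quick base R sanity check of the claim, via the Bell triangle (a sketch):
# Bell numbers via the Bell triangle recurrence
bell <- function(n) {
  row <- 1
  for (i in seq_len(n - 1)) {
    nxt <- numeric(i + 1)
    nxt[1] <- row[i]  # each row starts with the last entry of the previous row
    for (j in seq_len(i)) nxt[j + 1] <- nxt[j] + row[j]
    row <- nxt
  }
  row[length(row)]
}
bell(10)
# [1] 115975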
As @chinsoon comments, RcppAlgos could be a good choice for an acceptable solution for larger cases. Disclaimer: I am the author. I have answered similar questions with much larger inputs and have had good success.
Allocating tasks to parallel workers so that expected cost is roughly equal
Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?
@AllanCameron also has a great answer and a nice methodology for attacking this problem. You should give that a read as well.
Lastly, the following vignette by Robin K. S. Hankin (author of the partitions package) and Luke J. West is not only a great read, but very applicable to problems like the one presented here.
Set Partitions in R

Filtering observations using multivariate column conditions

I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and on which direction to move next.
I have one reference data frame; it contains four columns with integer values and one ID column.
df <- matrix(ncol = 5, nrow = 10)
colnames(df) <- c("A", "B", "C", "D", "ID")
for (i in 1:10) {
  df[i, 1:4] <- sample(1:5, 4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS, length.out = 10), sep = '')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is another reason I'm asking about optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If there is an exact match across a whole row between the reference table and the manual input, extract that row (or rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never with fewer than two variables (columns A and B).
So with my limited knowledge I've written this:
# subtract the manual input from the reference
df <- df %>%
  dplyr::mutate(Adiff = A - df.man$A) %>%
  dplyr::mutate(Bdiff = B - df.man$B) %>%
  dplyr::mutate(Cdiff = C - df.man$C) %>%
  dplyr::mutate(Ddiff = D - df.man$D)
# check how many columns in each row have zero difference and filter accordingly
ifelse(nrow(df %>% filter(Adiff == 0 & Bdiff == 0 & Cdiff == 0 & Ddiff == 0)) != 0,
       df0 <- df %>% filter(Adiff == 0 & Bdiff == 0 & Cdiff == 0 & Ddiff == 0),
       ifelse(nrow(df %>% filter(Adiff == 0 & Bdiff == 0 & Cdiff == 0)) != 0,
              df0 <- df %>% filter(Adiff == 0 & Bdiff == 0 & Cdiff == 0),
              ifelse(nrow(df %>% filter(Adiff == 0 & Bdiff == 0)) != 0,
                     df0 <- df %>% filter(Adiff == 0 & Bdiff == 0),
                     "less than two exact matches")))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions or packages I'm missing?
Second question: I want to complicate the condition.
Imagine we have reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be the following:
If there is an exact match across a whole row between the reference table and the manual input, extract that row (or rows) from the reference and show it; if not, reduce the number of matching columns from right to left until there is a match, but never with fewer than two variables (columns A and B).
From the rows with only two variable matches, keep those that are within ±1 in the columns to the right. So from the example above, I should end up with cases G and I from the reference table.
To keep going the way I did above, I would do the following:
ifelse(nrow(df0 %>% filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1))) > 0,
       df01 <- df0 %>% filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
       ifelse(nrow(df0 %>% filter(Cdiff %in% (-1:1))) > 0,
              df01 <- df0 %>% filter(Cdiff %in% (-1:1)),
              "NA"))
It will be about 11 columns at the end, but I assume it doesn't matter so much.
Keeping in mind this objective - how would you suggest to proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
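The code that builds df was elided in the original; one hypothetical construction consistent with the description (random values in 1:5, letters as row names) would be:
# hypothetical construction (the original was not shown); the values printed
# above are not guaranteed to match this particular draw
set.seed(2)
df <- matrix(sample(1:5, 40, replace = TRUE), nrow = 10,
             dimnames = list(LETTERS[1:10], c("A", "B", "C", "D")))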
And for demonstration, you could use a vector for the manual input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and the reference (all 4 columns), you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you would calculate differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
The diffs that begin with a run of zeros are your best matches (the same as checking columns from right to left iteratively). The column of the first non-zero element in each row marks where the match breaks:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E, which was first non-zero in the 4th column (only the last column did not match). You can extract the rows with 4 in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if you want all rows within ±1 of the closest match when only 2 columns match exactly, you can check the position of the best match (it should be 3) and then compare the absolute differences against the vector c(0, 0, 1), meaning two exact matches with the 3rd column off by ±1:
# Example vec.man with only 2 matches (re-run the sweep and apply steps above)
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = TRUE) == 3) {
  vec.alt <- c(0, 0, 1)
  df[apply(df.diff[, 1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize to a different number of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm = TRUE)
vec.alt <- c(rep(0, n.cols - 1), 1)
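Pulling the steps together, a minimal helper might look like this (a sketch; the function name and the NA handling are illustrative, not from the original answer):
best_match <- function(df, vec) {
  df.diff <- sweep(df, 2, vec)                                  # column-wise differences
  first_bad <- apply(df.diff, 1, function(x) which(x != 0)[1])  # first mismatching column
  first_bad[is.na(first_bad)] <- ncol(df) + 1                   # complete matches rank highest
  df[first_bad == max(first_bad), , drop = FALSE]
}
best_match(df, c(3, 1, 5, 3))  # returns row E for the matrix above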

Count based on multiple grouping variables in R

I'm just trying to get a count of occurrences of 'stop' in variable A for each of the 3 grouping variables (B, C, D).
A B C D
start 1 1 1
start 1 1 1
start 2 1 2
start 2 1 2
stop 1 2 1
stop 1 2 1
stop 2 2 1
Any help would be appreciated; please ask if anything needs clarification.
I would convert to a data.table:
library(data.table)
DT <- as.data.table(df)
DT[A == 'stop', lapply(.SD, sum), .SDcols = c('B', 'C', 'D')]
B C D
1: 4 6 3
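If instead you want the number of 'stop' rows per combination of B, C and D (the reading the aggregate answer below takes), data.table handles that too (a sketch):
# count 'stop' rows per B/C/D combination rather than summing the columns
DT[, .(n_stop = sum(A == 'stop')), by = .(B, C, D)]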
If you were working in the hadleyverse, you could do this using reshape2 and dplyr. First you would use reshape2 to melt the data so that each of B, C, D has its own row. Then you can group_by and tally as usual.
library(reshape2)
library(dplyr)
melt(df) %>%
  filter(A == "stop") %>%
  group_by(variable, value) %>%
  tally()
# variable value n
# 1 B 1 2
# 2 B 2 1
# 3 C 2 3
# 4 D 1 3
Do you mean each combination of B, C and D? If so, here is a base R solution:
df <- read.table(text = "A B C D
start 1 1 1
start 1 1 1
start 2 1 2
start 2 1 2
stop 1 2 1
stop 1 2 1
stop 2 2 1", header = TRUE)
num.stops <- aggregate((A == "stop") ~ B + C + D, df, FUN = sum)
# B C D (A == "stop")
# 1 1 1 1 0
# 2 1 2 1 2
# 3 2 2 1 1
# 4 2 1 2 0
library(dplyr)
# summarise_each()/funs() are deprecated in current dplyr;
# summarise(across(-1, sum)) is the modern equivalent
df %>% filter(A == 'stop') %>% summarise_each(funs(sum), -1)
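For readers on current dplyr, a sketch of the per-combination count (matching the base R aggregate answer above) would be:
library(dplyr)
# count 'stop' occurrences per B/C/D combination
df %>%
  group_by(B, C, D) %>%
  summarise(n_stop = sum(A == 'stop'), .groups = 'drop')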

R: counting string variables in each row of a dataframe

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe where, ideally, the headers would be the string values from the original dataframe (a, b, c, d) and the contents of each row would be the number of occurrences of the respective value in that row of the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table:
table(cbind(id = 1:nrow(df),
            stack(lapply(df, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
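A more current tidyverse route (a sketch; pivot_longer/pivot_wider post-date these answers) would be:
library(dplyr)
library(tidyr)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, values_to = 'letter') %>%
  count(id, letter) %>%
  pivot_wider(names_from = letter, values_from = n,
              values_fill = 0, names_sort = TRUE)  # alphabetized headers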
