Calculate product of frequencies in each column - r

I have a data frame with 3 columns, each containing a small number of values:
> df
# A tibble: 364 x 3
A B C
<dbl> <dbl> <dbl>
0. 1. 0.100
0. 1. 0.200
0. 1. 0.300
0. 1. 0.500
0. 2. 0.100
0. 2. 0.200
0. 2. 0.300
0. 2. 0.600
0. 3. 0.100
0. 3. 0.200
# ... with 354 more rows
> apply(df, 2, table)
$`A`
0 1 2 3 4 5 6 7 8 9 10
34 37 31 32 27 39 29 28 37 39 31
$B
1 2 3 4 5 6 7 8 9 10 11
38 28 38 37 32 34 29 33 30 35 30
$C
0.1 0.2 0.3 0.4 0.5 0.6
62 65 65 56 60 56
I would like to create a fourth column that contains, for each row, the product of the frequencies of that row's values within their respective columns. For example, the first value of the new column "Freq" would be the product of the frequency of 0 within column A, the frequency of 1 within column B and the frequency of 0.1 within column C.
How can I do this efficiently with dplyr/base R?
To emphasize: this is not the joint frequency of each complete row, but the product of the single-column frequencies.

An efficient approach using a combination of lapply, Map & Reduce from base R:
l <- lapply(df, table)
m <- Map(function(x,y) unname(y[match(x, names(y))]), df, l)
df$D <- Reduce(`*`, m)
which gives:
> head(df, 15)
A B C D
1 3 5 0.4 57344
2 5 6 0.5 79560
3 0 4 0.1 77996
4 2 6 0.1 65348
5 5 11 0.6 65520
6 3 8 0.5 63360
7 6 6 0.2 64090
8 1 9 0.4 62160
9 10 2 0.2 56420
10 5 2 0.2 70980
11 4 11 0.3 52650
12 7 6 0.5 57120
13 10 1 0.2 76570
14 7 10 0.5 58800
15 8 10 0.3 84175
What this does:
lapply(df, table) creates a list of frequency tables, one for each column.
With Map and match, a second list m is created in which each list item has the same length as the number of rows of df. Each list item is a vector with the frequencies corresponding to the values in that column of df.
With Reduce the product of the vectors in the list m is calculated element-wise: the first values of each vector in m are multiplied with each other, then the second values, etc.
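To make the three steps concrete, here is a minimal sketch on a made-up three-row data frame (toy and its values are illustrative assumptions, not the asker's data). It uses as.vector instead of unname to drop the table attributes, which is a small variation on the code above.

```r
# Hypothetical toy data, small enough to check by hand
toy <- data.frame(A = c(0, 0, 1), B = c(1, 2, 2))

# Step 1: one frequency table per column
l <- lapply(toy, table)

# Step 2: look up each value's frequency in its own column's table
m <- Map(function(x, y) as.vector(y[match(x, names(y))]), toy, l)

# Step 3: multiply the frequency vectors element-wise
toy$D <- Reduce(`*`, m)
toy
```

Row 2, for example, gets 2 * 2 = 4, because 0 occurs twice in A and 2 occurs twice in B.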
The same approach in tidyverse:
library(dplyr)
library(purrr)
df %>%
mutate(D = map(df, table) %>%
map2(df, ., function(x,y) unname(y[match(x, names(y))])) %>%
reduce(`*`))
Used data:
set.seed(2018)
df <- data.frame(A = sample(rep(0:10, c(34,37,31,32,27,39,29,28,37,39,31)), 364),
B = sample(rep(1:11, c(38,28,38,37,32,34,29,33,30,35,30)), 364),
C = sample(rep(seq(0.1,0.6,0.1), c(62,65,65,56,60,56)), 364))

I will use the following small example:
df
A B C
1 3 5 0.4
2 5 6 0.5
3 0 4 0.1
4 2 6 0.1
5 5 11 0.6
6 3 8 0.5
7 6 6 0.2
8 1 9 0.4
9 10 2 0.2
10 5 2 0.2
sapply(df, table)
$A
0 1 2 3 5 6 10
1 1 1 2 3 1 1
$B
2 4 5 6 8 9 11
2 1 1 3 1 1 1
$C
0.1 0.2 0.4 0.5 0.6
2 3 2 2 1
library(tidyverse)
df%>%
group_by(A)%>%
mutate(An=n())%>%
group_by(B)%>%
mutate(Bn=n())%>%
group_by(C)%>%
mutate(Cn=n(),prod=An*Bn*Cn)
A B C An Bn Cn prod
<int> <int> <dbl> <int> <int> <int> <int>
1 3 5 0.400 2 1 2 4
2 5 6 0.500 3 3 2 18
3 0 4 0.100 1 1 2 2
4 2 6 0.100 1 3 2 6
5 5 11 0.600 3 1 1 3
6 3 8 0.500 2 1 2 4
7 6 6 0.200 1 3 3 9
8 1 9 0.400 1 1 2 2
9 10 2 0.200 1 2 3 6
10 5 2 0.200 3 2 3 18
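The same per-column counts can also be obtained without grouping. The following is only a base R sketch using ave(); the data frame small is the ten-row example above retyped by hand, and the names small and counts are my own.

```r
# The ten-row example from above, retyped (the name `small` is an assumption)
small <- data.frame(
  A = c(3, 5, 0, 2, 5, 3, 6, 1, 10, 5),
  B = c(5, 6, 4, 6, 11, 8, 6, 9, 2, 2),
  C = c(0.4, 0.5, 0.1, 0.1, 0.6, 0.5, 0.2, 0.4, 0.2, 0.2)
)

# For each column, replace every value by the number of times it occurs
counts <- lapply(small, function(x) ave(seq_along(x), x, FUN = length))

# Multiply the per-column counts row by row
small$prod <- Reduce(`*`, counts)
small
```

Because ave() returns a vector aligned with the original rows, no join or regrouping is needed afterwards.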

Related

Tall format of data frame with multiple (x,y) pairs

I have a data frame (actually I prefer data.table) with columns of multiple pairs of (x,y) coordinates and corresponding values alpha, something like follows:
> data.frame(x_1 = 1:5, y_1 = 6:10,
x_2 = 11:15, y_2 = 16:20,
x_3 = 21:25, y_3=26:30,
alpha = seq(0.2,1,0.2))
x_1 y_1 x_2 y_2 x_3 y_3 alpha
1 1 6 11 16 21 26 0.2
2 2 7 12 17 22 27 0.4
3 3 8 13 18 23 28 0.6
4 4 9 14 19 24 29 0.8
5 5 10 15 20 25 30 1.0
I need to organise it into a long format with an x and a y column, where the coordinates of each row of df are stacked as three pairs on top of one another, a column for alpha (duplicated for each pair), and a column for the corresponding pair index, as follows:
x y alpha index
1 1 6 0.2 1
2 11 16 0.2 2
3 21 26 0.2 3
4 2 7 0.4 1
5 12 17 0.4 2
6 22 27 0.4 3
7 3 8 0.6 1
8 13 18 0.6 2
9 23 28 0.6 3
10 4 9 0.8 1
11 14 19 0.8 2
12 24 29 0.8 3
13 5 10 1.0 1
14 15 20 1.0 2
15 25 30 1.0 3
I have tried to use gather without much success: melting by pairs of columns and then duplicating the alpha values caused me grief. I then resorted to a for loop over the rows of df, filling (pre-allocated) vectors of x, y and alpha values with each iteration, but even with the pre-allocation this was horrendously slow compared to a similar operation in Python.
In practice I have about 20,000-40,000 rows, many more "constant" columns like alpha and something like 3-5 pair indices.
Apologies if there has been a similar question - I couldn't find one and really struggle wording questions about quite specific data manipulations. Any help is greatly appreciated!
gather has been superseded by pivot_longer. I think this gives you what you want.
df %>%
pivot_longer(
c(starts_with("x"), starts_with("y")),
names_pattern="(.)_(.)",
names_to=c(".value", "index")
)
# A tibble: 15 x 4
alpha index x y
<dbl> <chr> <int> <int>
1 0.2 1 1 6
2 0.2 2 11 16
3 0.2 3 21 26
4 0.4 1 2 7
5 0.4 2 12 17
6 0.4 3 22 27
7 0.6 1 3 8
8 0.6 2 13 18
9 0.6 3 23 28
10 0.8 1 4 9
11 0.8 2 14 19
12 0.8 3 24 29
13 1 1 5 10
14 1 2 15 20
15 1 3 25 30
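If you would rather avoid extra packages, a similar result can be sketched with reshape() from base R. This is an alternative I am adding for comparison, not part of the answer above; the helper column id is created by reshape() itself and only used for sorting.

```r
df <- data.frame(x_1 = 1:5, y_1 = 6:10,
                 x_2 = 11:15, y_2 = 16:20,
                 x_3 = 21:25, y_3 = 26:30,
                 alpha = seq(0.2, 1, 0.2))

long <- reshape(df,
                direction = "long",
                varying = list(c("x_1", "x_2", "x_3"),   # columns stacked into x
                               c("y_1", "y_2", "y_3")),  # columns stacked into y
                v.names = c("x", "y"),
                timevar = "index",
                times = 1:3,
                idvar = "id")  # "id" does not exist in df, so reshape() adds it

# reshape() stacks all index-1 rows first; reorder to the requested layout
long <- long[order(long$id, long$index), c("x", "y", "alpha", "index")]
rownames(long) <- NULL
long
```

Here index comes out as an integer rather than a character column, and any additional constant columns such as alpha are carried along automatically.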
Does this work as expected?
df %>%
pivot_longer(cols = -alpha, names_to = c("col", "index"), names_sep = "_") %>%
pivot_wider(names_from = col, values_from = value)
Output
# A tibble: 15 x 4
alpha index x y
<dbl> <chr> <int> <int>
1 0.2 1 1 6
2 0.2 2 11 16
3 0.2 3 21 26
4 0.4 1 2 7
5 0.4 2 12 17
6 0.4 3 22 27
7 0.6 1 3 8
8 0.6 2 13 18
9 0.6 3 23 28
10 0.8 1 4 9
11 0.8 2 14 19
12 0.8 3 24 29
13 1 1 5 10
14 1 2 15 20
15 1 3 25 30
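Here is another pivot_longer approach:
Use pivot_longer on all columns except alpha (the starts_with("x") selector is redundant here), which puts the x and y values into one column in the order x_1, y_1, x_2, y_2, ...
Use the window function lead to copy each following y value next to its x value
Remove every second row with filter
Create the index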
library(dplyr)
library(tidyr)
df %>%
pivot_longer(c(-alpha, starts_with("x")),
names_to = "names.x",
values_to = "x"
) %>%
mutate(y = lead(x)) %>%
filter(row_number() %% 2 != 0) %>% ## Delete even-rows
select(-names.x) %>%
mutate(index = rep(1:3, length.out = n()))
alpha x y index
<dbl> <int> <int> <int>
1 0.2 1 6 1
2 0.2 11 16 2
3 0.2 21 26 3
4 0.4 2 7 1
5 0.4 12 17 2
6 0.4 22 27 3
7 0.6 3 8 1
8 0.6 13 18 2
9 0.6 23 28 3
10 0.8 4 9 1
11 0.8 14 19 2
12 0.8 24 29 3
13 1 5 10 1
14 1 15 20 2
15 1 25 30 3

How to define rows numbering depending on a group and a value in group's first rows?

A dataframe DD has some missing rows. Based on the values in the 'ID_raw' column I have duplicated rows in order to replace the missing ones. Now I have to number the rows in such a way that the first value in each group (column 'File') equals the value in the same row of the column 'ID_raw'. This will be a key for joining the dataframe with another one. Below is a dummy example of the DD dataframe:
DD<-data.frame(ID_raw=c(1,5,7,8,5,7,9,13,3,6),Val=c(1,2,8,15,54,23,88,77,32,2),File=c("A","A","A","A","B","B","B","B","C","C"))
ID_raw Val File
1 1 1 A
2 5 2 A
3 7 8 A
4 8 15 A
5 5 54 B
6 7 23 B
7 9 88 B
8 13 77 B
9 3 32 C
10 6 2 C
So far I've successfully duplicated the rows. However, I have a problem numbering the rows in such a way that they start from the same value as the value in the ID_raw column for each group ('File').
DD$ID_diff <- 0
DD$ID_diff[seq_len(nrow(DD) - 1)] <- as.integer(diff(DD$ID_raw, 1)) # values which tell how many times a row has to be duplicated
DD$ID_diff <- pmax(DD$ID_diff, 0) # replace the values < 0 (they occur at the last row of each 'File' group) with 0
DD <- DD[rep(seq_len(nrow(DD)), DD$ID_diff), ] # row duplication
Based on the code above I receive this output:
ID_raw Val File ID_diff
1 1 1 A 4
1.1 1 1 A 4
1.2 1 1 A 4
1.3 1 1 A 4
2 5 2 A 2
2.1 5 2 A 2
3 7 8 A 1
5 5 54 B 2
5.1 5 54 B 2
6 7 23 B 2
6.1 7 23 B 2
7 9 88 B 4
7.1 9 88 B 4
7.2 9 88 B 4
7.3 9 88 B 4
9 3 32 C 3
9.1 3 32 C 3
9.2 3 32 C 3
I would like to receive this:
ID_raw Val File ID_diff ID_new
1 1 1 A 4 1
1.1 1 1 A 4 2
1.2 1 1 A 4 3
1.3 1 1 A 4 4
2 5 2 A 2 5
2.1 5 2 A 2 6
3 7 8 A 1 7
5 5 54 B 2 5
5.1 5 54 B 2 6
6 7 23 B 2 7
6.1 7 23 B 2 8
7 9 88 B 4 9
7.1 9 88 B 4 10
7.2 9 88 B 4 11
7.3 9 88 B 4 12
9 3 32 C 3 3
9.1 3 32 C 3 4
9.2 3 32 C 3 5
This is one option using dplyr, based on the output of your code (the expanded DD):
library(dplyr)
DD %>%
  group_by(File) %>%
  mutate(ID_new = seq(1, n()) + first(ID_raw) - 1)
# A tibble: 18 x 5
# Groups: File [3]
ID_raw Val File ID_diff ID_new
<int> <int> <fct> <int> <dbl>
1 1 1 A 4 1
2 1 1 A 4 2
3 1 1 A 4 3
4 1 1 A 4 4
5 5 2 A 2 5
6 5 2 A 2 6
7 7 8 A 1 7
8 5 54 B 2 5
9 5 54 B 2 6
10 7 23 B 2 7
11 7 23 B 2 8
12 9 88 B 4 9
13 9 88 B 4 10
14 9 88 B 4 11
15 9 88 B 4 12
16 3 32 C 3 3
17 3 32 C 3 4
18 3 32 C 3 5
We can do this in one chain from the beginning: instead of creating 'ID_diff' with sapply, use diff directly on 'ID_raw', expand the rows with uncount, and then, grouped by 'File', create the sequence column.
library(tidyverse)
DD %>%
mutate(ID_diff = pmax(c(diff(ID_raw), 0), 0)) %>%
uncount(ID_diff, .remove = FALSE) %>%
group_by(File) %>%
mutate(ID_new = seq(first(ID_raw), length.out = n(), by = 1))
# A tibble: 18 x 5
# Groups: File [3]
# ID_raw Val File ID_diff ID_new
# <dbl> <dbl> <fct> <dbl> <dbl>
# 1 1 1 A 4 1
# 2 1 1 A 4 2
# 3 1 1 A 4 3
# 4 1 1 A 4 4
# 5 5 2 A 2 5
# 6 5 2 A 2 6
# 7 7 8 A 1 7
# 8 5 54 B 2 5
# 9 5 54 B 2 6
#10 7 23 B 2 7
#11 7 23 B 2 8
#12 9 88 B 4 9
#13 9 88 B 4 10
#14 9 88 B 4 11
#15 9 88 B 4 12
#16 3 32 C 3 3
#17 3 32 C 3 4
#18 3 32 C 3 5
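For completeness, the same idea can be sketched in base R only. This is an illustrative variant, not one of the answers above; the names reps and expanded are my own, and like the original code it assumes that the difference in ID_raw across a File boundary is never positive.

```r
DD <- data.frame(ID_raw = c(1, 5, 7, 8, 5, 7, 9, 13, 3, 6),
                 Val = c(1, 2, 8, 15, 54, 23, 88, 77, 32, 2),
                 File = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"))

# How often each row is repeated: the gap to the next ID, floored at 0
reps <- pmax(c(diff(DD$ID_raw), 0), 0)

# Duplicate the rows accordingly
expanded <- DD[rep(seq_len(nrow(DD)), reps), ]

# Within each File, count upwards from the first ID_raw of that group
expanded$ID_new <- ave(expanded$ID_raw, expanded$File,
                       FUN = function(v) v[1] + seq_along(v) - 1)
expanded
```

ave() applies the function per File group and puts the results back in the original row order, so no explicit split and recombine step is needed.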

Group_by then Filter and compute in R

This is my data set
You can get the data from this link (if you can't, please let me know):
https://www.dropbox.com/s/1n9hpyhcniaghh5/table.csv?dl=0
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
I want to filter the data by group with multiple conditions. The conditions are:
1. the number of rows for each type (0, 1) of the TYPE variable within a group must be bigger than 1
2. the number of rows for each type must be equal
(For example: the number of rows for type 1 equals the number of rows for type 0 in each group)
After many attempts I ended up with this code and this output:
table %>% group_by(label,date,tau,type) %>% filter(n()>1) %>% filter(length(type==1)==length(type==0))
# A tibble: 16 x 7
# Groups: label, date, tau, type [7]
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
1 A 1 2 1 0.75 7 16
2 A 1 2 1 0.39 6 14
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
7 A 3 1 0 0.38 8 15
8 A 3 1 0 0.02 5 16
9 A 3 1 0 0.71 8 13
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
14 B 2 1 1 0.04 5 15
15 B 2 1 1 0.75 5 13
16 B 2 1 1 0.52 9 13
I am confused by the output of this code. It removes the data that doesn't meet condition 1, but the data that doesn't meet condition 2 is still there.
The result I want looks like this:
LABEL DATE TAU TYPE x y z
<fctr> <int> <int> <int> <dbl> <int> <int>
3 A 2 3 0 0.65 5 14
4 A 2 3 1 0.55 7 19
5 A 2 3 1 0.69 5 19
6 A 2 3 0 0.66 7 19
10 B 1 2 1 0.25 9 18
11 B 1 2 0 0.06 8 20
12 B 1 2 1 0.60 8 20
13 B 1 2 0 0.56 6 13
And if I want to compute a value for each row with the function below, how can I code that? Can I just use mutate()?
f(x,y,z) = 2 * x + y - z / 3 if TYPE == 1
f(x,y,z) = 4 * x - y / 2 + z / 3 if TYPE == 0
I hope someone can help me, and I appreciate your help! If you need any other information, just let me know.
# example dataset
df = read.table(text = "
LABEL DATE TAU TYPE x y z
1 A 1 2 1 0.75 7 16
2 A 1 2 0 0.41 5 18
3 A 1 2 1 0.39 6 14
4 A 2 3 0 0.65 5 14
5 A 2 3 1 0.55 7 19
6 A 2 3 1 0.69 5 19
7 A 2 3 0 0.66 7 19
8 A 3 1 0 0.38 8 15
9 A 3 1 0 0.02 5 16
10 A 3 1 0 0.71 8 13
11 B 1 2 1 0.25 9 18
12 B 1 2 0 0.06 8 20
13 B 1 2 1 0.60 8 20
14 B 1 2 0 0.56 6 13
15 B 1 3 1 0.50 8 19
16 B 1 3 0 0.04 8 16
17 B 2 1 1 0.04 5 15
18 B 2 1 1 0.75 5 13
19 B 2 1 0 0.44 8 18
20 B 2 1 1 0.52 9 13
", header=T, stringsAsFactors=F)
library(dplyr)
library(tidyr)
# function to use for each row
# (assumes that type can be only 1 or 0)
f = function(t,x,y,z) { ifelse(t == 1,
2 * x + y - z / 3,
4 * x - y / 2 + z / 3) }
df %>%
count(LABEL, DATE, TAU, TYPE) %>% # count rows for each group (based on those combinations)
filter(n > 1) %>% # keep groups with multiple rows
mutate(TYPE = paste0("TYPE_",TYPE)) %>% # update variable
spread(TYPE, n, fill = 0) %>% # reshape data
filter(TYPE_0 == TYPE_1) %>% # keep groups with equal number of rows for type 0 and 1
select(LABEL, DATE, TAU) %>% # keep variables/groups of interest
inner_join(df, by=c("LABEL", "DATE", "TAU")) %>% # join back info
mutate(f_value = f(TYPE,x,y,z)) # apply function
# # A tibble: 8 x 8
# LABEL DATE TAU TYPE x y z f_value
# <chr> <int> <int> <int> <dbl> <int> <int> <dbl>
# 1 A 2 3 0 0.65 5 14 4.76666667
# 2 A 2 3 1 0.55 7 19 1.76666667
# 3 A 2 3 1 0.69 5 19 0.04666667
# 4 A 2 3 0 0.66 7 19 5.47333333
# 5 B 1 2 1 0.25 9 18 3.50000000
# 6 B 1 2 0 0.06 8 20 2.90666667
# 7 B 1 2 1 0.60 8 20 2.53333333
# 8 B 1 2 0 0.56 6 13 3.57333333
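As a side note on why the attempt in the question keeps too many rows: length(type == 1) returns the length of the logical vector, which is simply the group size, so both sides of the comparison are always equal. Grouping by type as well also means that each group contains only one type. Counting with sum() within LABEL/DATE/TAU groups avoids both problems. Below is a base R sketch of that idea using ave(); only the grouping columns are retyped, and the names n1, n0 and keep are my own. With dplyr the same two sum() conditions could go inside filter() after group_by(LABEL, DATE, TAU).

```r
# Grouping columns of the 20-row example, retyped by hand
df <- data.frame(
  LABEL = rep(c("A", "B"), each = 10),
  DATE  = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  TAU   = c(2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2, 2, 3, 3, 1, 1, 1, 1),
  TYPE  = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1)
)

# Number of TYPE == 1 and TYPE == 0 rows within each LABEL/DATE/TAU group
n1 <- ave(as.integer(df$TYPE == 1), df$LABEL, df$DATE, df$TAU, FUN = sum)
n0 <- ave(as.integer(df$TYPE == 0), df$LABEL, df$DATE, df$TAU, FUN = sum)

# Condition 1: more than one row per type; condition 2: equal counts
keep <- n1 > 1 & n1 == n0
df[keep, ]
```

The rows kept are 4 to 7 and 11 to 14, the same eight rows as in the desired output.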

How do I merge two dataframes in R but keep all missing values?

I need to combine two data frames that have different lengths and keep all the "missing values". The problem is that there are not really missing values, but rather just fewer of one value than another.
Example:
df1 looks like this:
Shrub value period
1 0.5 1
2 0.6 1
3 0.7 1
4 0.8 1
5 0.9 1
10 0.9 1
1 0.4 2
5 0.4 2
6 0.5 2
7 0.3 2
2 0.4 3
3 0.1 3
8 0.5 3
9 0.2 3
df2 looks like this:
Shrub x y
1 5 8
2 6 7
3 3 2
4 1 2
5 4 6
6 5 9
7 9 4
8 2 1
9 4 3
10 3 6
and I want the combined dataframe to look like:
Shrub x y value period
1 5 8 0.5 1
2 6 7 0.6 1
3 3 2 0.7 1
4 1 2 0.8 1
5 4 6 0.9 1
6 5 9 NA 1
7 9 4 NA 1
8 2 1 NA 1
9 4 3 NA 1
10 3 6 0.9 1
1 5 8 0.4 2
2 6 7 NA 2
3 3 2 NA 2
4 1 2 NA 2
5 4 6 0.4 2
6 5 9 0.5 2
7 9 4 0.3 2
8 2 1 NA 2
9 4 3 NA 2
10 3 6 NA 2
1 5 8 NA 3
2 6 7 0.4 3
3 3 2 0.1 3
4 1 2 NA 3
5 4 6 NA 3
6 5 9 NA 3
7 9 4 NA 3
8 2 1 0.5 3
9 4 3 0.2 3
10 3 6 NA 3
I have tried the merge command using all = TRUE, but this does not give me what I want. I haven't been able to find this anywhere, so any help is appreciated!
This is a situation where complete from package tidyr is useful (this is in tidyr_0.3.0, which is currently available on GitHub). You can use this function to expand df1 to include all period/Shrub combinations, filling the other variables in with NA by default. Once you do that you can simply join the two datasets together; I'll use inner_join from dplyr.
library(dplyr)
library(tidyr)
First, using complete on df1, showing the first 10 lines of output:
complete(df1, period, Shrub)
Source: local data frame [30 x 3]
period Shrub value
1 1 1 0.5
2 1 2 0.6
3 1 3 0.7
4 1 4 0.8
5 1 5 0.9
6 1 6 NA
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 0.9
.. ... ... ...
Then all you need to do is join this expanded dataset with df2:
complete(df1, period, Shrub) %>%
inner_join(., df2)
Source: local data frame [30 x 5]
period Shrub value x y
1 1 1 0.5 5 8
2 1 2 0.6 6 7
3 1 3 0.7 3 2
4 1 4 0.8 1 2
5 1 5 0.9 4 6
6 1 6 NA 5 9
7 1 7 NA 9 4
8 1 8 NA 2 1
9 1 9 NA 4 3
10 1 10 0.9 3 6
.. ... ... ... . .
Start by repeating the rows of df2 to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:
tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame
The result:
> head(out)
Shrub period x y value
1 1 1 5 8 0.5
2 1 2 5 8 0.4
3 1 3 5 8 NA
4 2 1 6 7 0.6
5 2 2 6 7 NA
6 2 3 6 7 0.4
> str(out)
'data.frame': 30 obs. of 5 variables:
$ Shrub : int 1 1 1 2 2 2 3 3 3 4 ...
$ period: int 1 2 3 1 2 3 1 2 3 1 ...
$ x : int 5 5 5 6 6 6 3 3 3 1 ...
$ y : int 8 8 8 7 7 7 2 2 2 2 ...
$ value : num 0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...
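A related base R sketch builds the full set of Shrub/period combinations with expand.grid() and then left-joins the observed values. This is an illustrative variant rather than one of the answers here; df1 and df2 are retyped from the question, and grid and out are my own names.

```r
df1 <- data.frame(
  Shrub  = c(1, 2, 3, 4, 5, 10, 1, 5, 6, 7, 2, 3, 8, 9),
  value  = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.4, 0.4, 0.5, 0.3,
             0.4, 0.1, 0.5, 0.2),
  period = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
)
df2 <- data.frame(
  Shrub = 1:10,
  x = c(5, 6, 3, 1, 4, 5, 9, 2, 4, 3),
  y = c(8, 7, 2, 2, 6, 9, 4, 1, 3, 6)
)

# Every Shrub/period combination that should appear in the result
grid <- expand.grid(Shrub = df2$Shrub, period = unique(df1$period))

# Attach the coordinates, then left-join the observed values (NA where absent)
out <- merge(grid, df2, by = "Shrub")
out <- merge(out, df1, by = c("Shrub", "period"), all.x = TRUE)

# Sort and arrange the columns as in the desired output
out <- out[order(out$period, out$Shrub), c("Shrub", "x", "y", "value", "period")]
head(out, 10)
```

Using all.x = TRUE instead of all = TRUE keeps exactly the 30 rows of the grid, with value set to NA for the 16 combinations that do not occur in df1.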
You can use dplyr. This works by taking each period as a separate data frame, merging with all = TRUE to keep all values, and then putting it all back together. The cbind(df2, ...) part adds the period to the otherwise missing rows so we don't get extra NAs:
library(dplyr)
df1 %>% group_by(period) %>%
do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))
Shrub period value x y
1 1 1 0.5 5 8
2 2 1 0.6 6 7
3 3 1 0.7 3 2
4 4 1 0.8 1 2
5 5 1 0.9 4 6
6 6 1 NA 5 9
7 7 1 NA 9 4
8 8 1 NA 2 1
9 9 1 NA 4 3
10 10 1 0.9 3 6
11 1 2 0.4 5 8
12 2 2 NA 6 7
13 3 2 NA 3 2
14 4 2 NA 1 2
15 5 2 0.4 4 6
16 6 2 0.5 5 9
17 7 2 0.3 9 4
18 8 2 NA 2 1
19 9 2 NA 4 3
20 10 2 NA 3 6
21 1 3 NA 5 8
22 2 3 0.4 6 7
23 3 3 0.1 3 2
24 4 3 NA 1 2
25 5 3 NA 4 6
26 6 3 NA 5 9
27 7 3 NA 9 4
28 8 3 0.5 2 1
29 9 3 0.2 4 3
30 10 3 NA 3 6

Selecting Rows which contain daily max value in R

So I want to subset my data frame to select rows with a daily maximum value.
Site Year Day Time Cover Size TempChange
ST1 2011 97 0.0 Closed small 0.97
ST1 2011 97 0.5 Closed small 1.02
ST1 2011 97 1.0 Closed small 1.10
A section of the data frame is shown above. I would like to select only the rows that have the maximum value of the variable TempChange for each Day. I want to do this because I am interested in specific variables (not shown) for these particular times.
AMENDED EXAMPLE AND REQUIRED OUTPUT
Site Day Temp Row
a 10 0.2 1
a 10 0.3 2
a 11 0.5 3
a 11 0.4 4
b 10 0.1 5
b 10 0.8 6
b 11 0.7 7
b 11 0.6 8
c 10 0.2 9
c 10 0.3 10
c 11 0.5 11
c 11 0.8 12
REQUIRED OUTPUT
Site Day Temp Row
a 10 0.3 2
a 11 0.5 3
b 10 0.8 6
b 11 0.7 7
c 10 0.3 10
c 11 0.8 12
Hope that makes it clearer.
After faffing with raw data frame code, I realised plyr could do this in one:
> df
Day V Z
1 97 0.26575207 1
2 97 0.09443351 2
3 97 0.88097858 3
4 98 0.62241515 4
5 98 0.61985937 5
6 99 0.06956219 6
7 100 0.86638108 7
8 100 0.08382254 8
> ddply(df,~Day,function(x){x[which.max(x$V),]})
Day V Z
1 97 0.88097858 3
2 98 0.62241515 4
3 99 0.06956219 6
4 100 0.86638108 7
To get the rows with the max values for unique combinations of more than one column, just add the variable to the formula. For your modified example, it's then:
> df
Site Day Temp Row
1 a 10 0.2 1
2 a 10 0.3 2
3 a 11 0.5 3
4 a 11 0.4 4
5 b 10 0.1 5
6 b 10 0.8 6
7 b 11 0.7 7
8 b 11 0.6 8
9 c 10 0.2 9
10 c 10 0.3 10
11 c 11 0.5 11
12 c 11 0.8 12
> ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
Site Day Temp Row
1 a 10 0.3 2
2 b 10 0.8 6
3 c 10 0.3 10
4 a 11 0.5 3
5 b 11 0.7 7
6 c 11 0.8 12
Note this isn't in the same order as your original dataframe, but you can fix that.
> dmax = ddply(df,~Day+Site,function(x){x[which.max(x$Temp),]})
> dmax[order(dmax$Row),]
Site Day Temp Row
1 a 10 0.3 2
4 a 11 0.5 3
2 b 10 0.8 6
5 b 11 0.7 7
3 c 10 0.3 10
6 c 11 0.8 12
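If plyr is not available, a comparable selection can be sketched in base R with ave(), which also keeps the original row order. This is an alternative I am adding, not part of the answer above; the name is_max is my own. Note one difference: which.max returns only the first maximum per group, whereas this comparison keeps all tied rows.

```r
# The amended example from the question, retyped
df <- data.frame(
  Site = rep(c("a", "b", "c"), each = 4),
  Day  = rep(c(10, 10, 11, 11), times = 3),
  Temp = c(0.2, 0.3, 0.5, 0.4, 0.1, 0.8, 0.7, 0.6, 0.2, 0.3, 0.5, 0.8),
  Row  = 1:12
)

# TRUE where a row's Temp equals the maximum of its Site/Day group
is_max <- df$Temp == ave(df$Temp, df$Site, df$Day, FUN = max)
df[is_max, ]
```

The selected rows are 2, 3, 6, 7, 10 and 12, already in the order of the original data frame.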

Resources