Quickest way to find the maximum value from one column with multiple duplicates in others?

In reality I have a very large data frame. One column contains an ID and another contains a value associated with that ID. However, each ID occurs multiple times with differing values, and I wish to record the maximum value for each ID while discarding the rest. Here is a replicable example using the quakes dataset in R:
data <- as.data.frame(quakes)
##Create output matrix
output <- matrix(,length(unique(data[,5])),2)
colnames(output) <- c("Station ID", "Max Mag")
##Grab unique station IDs
uni <- unique(data[,5])
##Go through each station ID and record the maximum magnitude
for (i in 1:dim(output)[1])
{
sub.data <- data[which(data[,5]==uni[i]),]
##Put station ID in column 1
output[i,1] <- uni[i]
##Put biggest magnitude in column 2
output[i,2] <- max(sub.data[,4])
}
Since my real data frames have hundreds of thousands of rows, this is a slow process. Is there a quicker way to do this?
Any help much appreciated!

library(plyr)
ddply(data, "stations", function(data){data[which.max(data$mag),]})
lat long depth mag stations
1 -27.21 182.43 55 4.6 10
2 -27.60 182.40 61 4.6 11
3 -16.24 168.02 53 4.7 12
4 -27.38 181.70 80 4.8 13
-----
You can also use:
> data2 <- data[order(data$mag,decreasing=T),]
> data2[!duplicated(data2$stations),]
lat long depth mag stations
152 -15.56 167.62 127 6.4 122
15 -20.70 169.92 139 6.1 94
17 -13.64 165.96 50 6.0 83
870 -12.23 167.02 242 6.0 132
1000 -21.59 170.56 165 6.0 119
558 -22.91 183.95 64 5.9 118
109 -22.55 185.90 42 5.7 76
151 -23.34 184.50 56 5.7 106
176 -32.22 180.20 216 5.7 90
275 -22.13 180.38 577 5.7 104
Also :
> library(data.table)
> data <- data.table(data)
> data[,.SD[which.max(mag)],by=stations]
stations lat long depth mag
1: 41 -23.46 180.11 539 5.0
2: 15 -13.40 166.90 228 4.8
3: 43 -26.00 184.10 42 5.4
4: 19 -19.70 186.20 47 4.8
5: 11 -27.60 182.40 61 4.6
---
98: 77 -21.19 181.58 490 5.0
99: 132 -12.23 167.02 242 6.0
100: 115 -17.85 181.44 589 5.6
101: 121 -20.25 184.75 107 5.6
102: 110 -19.33 186.16 44 5.4
data.table is generally the faster option for large datasets.

You could try tapply, too:
tapply(data$mag, data$stations, FUN=max)
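For completeness, here is a minimal base R sketch of the same idea with no extra packages, assuming the built-in quakes data used in the question. aggregate() returns one row per station with its maximum magnitude, while the split()/which.max() variant keeps the whole row of the strongest quake per station. The object names max_by_station and top_rows are only illustrative.

```r
# One row per station with its maximum magnitude.
max_by_station <- aggregate(mag ~ stations, data = quakes, FUN = max)
head(max_by_station)

# Keep the entire row of the largest quake at each station.
# which.max() returns the first maximum if there are ties.
top_rows <- do.call(
  rbind,
  lapply(split(quakes, quakes$stations), function(g) g[which.max(g$mag), ])
)
head(top_rows)
```

Both versions avoid the explicit loop and the repeated subsetting from the question, which is where most of the time goes on large inputs.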

You can try the 'dplyr' package as well, which is much faster and easier to use than 'plyr'. It works as what Hadley called "a grammar of data manipulation", chaining operations together with %>% (early dplyr versions used %.%, which has since been removed), like so:
library(dplyr)
df <- as.data.frame(quakes)
df %>%
group_by(stations) %>%
summarise(Max = max(mag)) %>%
arrange(desc(Max)) %>%
head(5)
Source: local data frame [5 x 2]
stations Max
1 122 6.4
2 94 6.1
3 132 6.0
4 119 6.0
5 83 6.0

Related

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <-as.numeric(c("47.2", "40"))
dds_max <-as.numeric(c("122", "105"))
dds_min.max <-as.data.frame(cbind(fs,dds_min, dds_max))
And this is what I did....
dss_levels <-dds_min.max %>%
group_by(fs) %>%
mutate(dds=seq(dds_min,dds_max,length.out=100))
I intended to create a new variable (dds), that has to be 100 length and start and end at different values depending on "fs". My expectation was to end with another dataframe (dss_levels) with two columns (fs and dds), 200 values on it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I made the sequence length 5 for illustrative purposes; you can change it to 100.
library(dplyr)
library(purrr)
library(tidyr)
dds_min.max %>%
mutate(dds= map2(dds_min, dds_max, seq, length.out = 5)) %>%
unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <-c(47.2, 40)
dds_max <-c(122, 105)
dds_min.max <-data.frame(fs,dds_min, dds_max)
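The warning about cbind above can be made concrete with a minimal base R sketch. It first shows that cbind() on mixed vectors produces a character matrix (which is why seq() failed in the question), then builds the intended 200-row result with one 100-point sequence per group. The name dds_levels is just an illustrative choice, and the lapply()/rbind() approach is one option among several.

```r
fs      <- c("early", "late")
dds_min <- c(47.2, 40)
dds_max <- c(122, 105)

# cbind() on mixed vectors coerces everything to character.
is.character(cbind(fs, dds_min, dds_max))  # TRUE

# data.frame() keeps the numeric columns numeric.
dds_min.max <- data.frame(fs, dds_min, dds_max)

# One 100-point sequence per row, stacked into a long data frame.
dds_levels <- do.call(rbind, lapply(seq_len(nrow(dds_min.max)), function(i) {
  data.frame(
    fs  = dds_min.max$fs[i],
    dds = seq(dds_min.max$dds_min[i], dds_min.max$dds_max[i], length.out = 100)
  )
}))

dim(dds_levels)  # 200 rows, 2 columns
```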

Subsetting - R prints data in reverse order- [R 3.2.2, Win10 Pro, 64-bit]

Aim: To retrieve the last two entries of the data. (I am aware of the tail function and of direct indexing.)
Code:
> tdata <- read.csv("hw1_data.csv")
> temp <- tdata[(nrow(tdata)-1):nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> temp <- tdata[nrow(tdata)-1:nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
151 14 191 14.3 75 9 28
150 NA 145 13.2 77 9 27
149 30 193 6.9 70 9 26
148 14 20 16.6 63 9 25
147 7 49 10.3 69 9 24
.
.
.
While taking a subset with the extract operator, I used the nrow() function to get the total number of rows in the data, subtracted one from it (one less than the total), and used the sequence operator (:) to count up to nrow(tdata), i.e. the total number of rows.
When I use parentheses, the logic works fine, but when I skip the parentheses the output is the whole data frame in reverse order.
I can tell that precedence rules are at play, but I am unable to work out the exact logic. I am new to R, so any formal explanation would be valuable.
As suspected correctly in the post, the observed behavior is in fact a matter of operator precedence.
A complete list of the operator syntax and precedence rules in R can be obtained by typing
help(Syntax)
in the console.
In this context, R programmers sometimes refer to a well-known and rather witty quote which encourages the use of parentheses:
library(fortunes)
fortune(138)
nrow(tdata) = 153
So the first line you run is:
temp <- tdata[(nrow(tdata)-1):nrow(tdata),]
This executes as tdata[152:153,]
Second line:
temp <- tdata[nrow(tdata)-1:nrow(tdata),]
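This executes as tdata[153-1:153,]. Since : binds more tightly than binary -, the index is 153 - (1:153), i.e. 152, 151, ..., 0.
So it returns the following:
tdata[152,]
tdata[151,]
...
tdata[0,]
The final index 0 selects no row, so the result is rows 152 down to 1 in reverse order.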
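The precedence rule can be checked without the CSV file. The following is a small sketch that uses a plain integer n in place of nrow(tdata), with 153 taken from the answer above; the variable names are illustrative.

```r
n <- 153L

# `:` has higher precedence than binary `-`, so these two differ.
with_parens    <- (n - 1L):n   # 152 153
without_parens <- n - 1L:n     # n - (1:n), i.e. 152, 151, ..., 1, 0

head(without_parens)           # 152 151 150 149 148 147
length(without_parens)         # 153 values, the last one being 0

# A zero index selects nothing, so it is dropped silently.
x <- c(10, 20, 30)
x[c(2, 1, 0)]                  # 20 10
```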

R - Data Frame is a list of columns?

Question
Is a data frame in R a list of columns (a list being, in my understanding, a sequence of objects)?
What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
Any reference to related design document or article of data structure design would be appreciated.
I am just used to row-as-a-unit/record and would like to know why it is column oriented. Or if I misunderstood something, kindly suggest.
Background
I had thought a dataframe was a sequence of row, such as (Ozone, Solar.R, Wind, Temp, Month, Day).
> c ## data frame created from read.csv()
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
> typeof(c)
[1] "list"
However when lapply() is applied against c to show each list element, it was a column.
> lapply(c, function(arg){ return(arg) })
$Ozone
[1] 41 36 12 18 23 19
$Solar.R
[1] 190 118 149 313 299 99
$Wind
[1] 7.4 8.0 12.6 11.5 8.6 13.8
$Temp
[1] 67 72 74 62 65 59
$Month
[1] 5 5 5 5 5 5
$Day
[1] 1 2 3 4 7 8
Whereas what I had expected was
[1] 41 190 7.4 67 5 1
[1] 36 118 8.0 72 5 2
…
1) Is a data frame in R a list of columns?
Yes.
df <- data.frame(a=c("the", "quick"), b=c("brown", "fox"), c=1:2, stringsAsFactors=FALSE)
is.list(df) # -> TRUE
attr(df, "names") # -> [1] "a" "b" "c"
df[[1]][2] # -> "quick"
2) What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
A data.frame is a list of column vectors.
is.atomic(df[[1]]) # -> TRUE
mode(df[[1]]) # -> [1] "character"
mode(df[[3]]) # -> [1] "numeric"
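Vectors can only store one kind of object, so each column can be a single typed vector. A row mixes types, so a "row-oriented" data.frame would have to be composed of lists, one per row. Now imagine what the performance of an operation like
df[[1]][20000]
would be in such a list-based data frame: instead of indexing once into one atomic vector, R would have to go to the 20000th row list and pull a single element out of it, and any operation on a whole column would have to visit every row list in turn. Keeping each column as one atomic vector is what makes vectorised column operations cheap.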
3) Any reference to related design document or article of data structure design would be appreciated.
http://adv-r.had.co.nz/Data-structures.html#data-frames
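A short sketch makes the column orientation visible. It reuses the small df from the answer (with stringsAsFactors = FALSE so that the text columns stay character on any R version) and shows that rows only exist as something assembled on demand; the rows object is purely illustrative.

```r
df <- data.frame(a = c("the", "quick"), b = c("brown", "fox"), c = 1:2,
                 stringsAsFactors = FALSE)

length(df)          # 3: the list has one element per column
sapply(df, class)   # each column keeps its own type

# Rows have no storage of their own; a row is assembled on demand.
rows <- lapply(seq_len(nrow(df)), function(i) df[i, ])
length(rows)        # 2
rows[[2]]$a         # "quick"
```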

R equivalent of Stata's for-loop over local macro list of stubnames

I'm a Stata user that's transitioning to R and there's one Stata crutch that I find hard to give up. This is because I don't know how to do the equivalent with R's "apply" functions.
In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.
For a simple example, imagine that I have the following dataset:
study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3
and so on...
I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.
The final dataset should look like this:
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
1 6 50 40 30 20.5 19.8 17.4 50 20.5
1 7 50 40 30 20.5 19.8 17.4 40 19.8
1 8 50 40 30 20.5 19.8 17.4 30 17.4
2 6 60 55 44 25.1 25.2 25.3 60 25.1
2 7 60 55 44 25.1 25.2 25.3 55 25.2
2 8 60 55 44 25.1 25.2 25.3 44 25.3
and so on...
To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.
My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.
In Stata, I could generate a local macro list of stubnames:
local stub varX varY
And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.
foreach i of local stub {
display "`i'"
gen `i'=.
replace `i'=`i'06 if year==6
replace `i'=`i'07 if year==7
replace `i'=`i'08 if year==8
}
The last section is the part that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06" and then returns the value of the variable varX06. Additionally, when I write "`i'", Stata returns the string "varX" and not the string "`i'".
How do I do these things with R?
I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.
I apologize if this question is elementary. If it's been answered before, please direct me to the response.
Thanks in advance,
Tara
Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:
# create sample dataset
set.seed(1) # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)), #categorical variable
varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))
# you start here...
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
print(head(df),digits=4)
# year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
# 1 6 -0.6265 0.8937 -0.3411 -0.70757 1.1350 0.3412 -0.6265 -0.70757
# 2 6 0.1836 -1.0473 1.5024 1.97157 1.1119 1.3162 0.1836 1.97157
# 3 6 -0.8356 1.9713 0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4 6 1.5953 -0.3836 0.5422 -0.01402 0.2107 -1.2056 1.5953 -0.01402
# 5 6 0.3295 1.6541 -0.1367 -1.12346 0.0694 1.5676 0.3295 -1.12346
# 6 6 -0.8205 1.5122 -1.1367 -1.34413 -1.6626 0.2253 -0.8205 -1.34413
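For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function for each year, and unlist(...) converts the returned list into a vector. Note that the values come back grouped by year, so this relies on df being sorted by year, as it is in this sample.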
Maybe a more transparent way:
sub <- c("varX", "varY")
for (i in sub) {
df[[i]] <- NA
df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}
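The stub loop above can also be written without one line per year. The following is a minimal sketch, assuming the six-row dataset from the question and two-digit year suffixes; it uses matrix indexing so that each row picks the column matching its own year, and it does not depend on the rows being sorted. The names d, years, cols and pick are illustrative.

```r
d <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

years <- sort(unique(d$year))                    # 6 7 8

for (stub in c("varX", "varY")) {
  # Build the column names for this stub, e.g. "varX06" "varX07" "varX08".
  cols <- paste0(stub, sprintf("%02d", years))
  # Two-column index: (row number, position of that row's year among the columns).
  pick <- cbind(seq_len(nrow(d)), match(d$year, years))
  d[[stub]] <- as.matrix(d[cols])[pick]
}

d[, c("study_id", "year", "varX", "varY")]
```

The loop variable plays the role of the Stata local macro: paste0() builds the variable name as a string and d[[...]] or d[cols] looks the column up by that name.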
This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):
> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1 1 6 50 40 30 20.5 19.8 17.4 20.5 50
6.4 2 6 60 55 44 25.1 25.2 25.3 25.1 60
7.2 1 7 50 40 30 20.5 19.8 17.4 19.8 40
7.5 2 7 60 55 44 25.1 25.2 25.3 25.2 55
8.3 1 8 50 40 30 20.5 19.8 17.4 17.4 30
8.6 2 8 60 55 44 25.1 25.2 25.3 25.3 44
Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
A direct translation of your Stata code, however, would be something like the following:
# the nested ifelse() calls are already vectorised over rows, so no loop is needed
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))
Here's another option.
Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.
# indexing matrix based on the 'year' column
col_select_mat <-
t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat])
This gives the selected values, ordered by year rather than by the original row order (so sort your_df by year before you cbind them to it):
varX varY
[1,] 50 20.5
[2,] 60 25.1
[3,] 40 19.8
[4,] 55 25.2
[5,] 30 17.4
[6,] 44 25.3
OP's dataset:
your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3')
Benchmarking: the three posted solutions can be wrapped in functions and timed with microbenchmark. The functions have to be called inside microbenchmark() (note the parentheses); passing the bare names only times the lookup of each name, which takes a few tens of nanoseconds regardless of what the function does.
df <- your_df
d <- your_df
arvi1000 <- function() {
col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
# make selections from col groups by stub name
cbind(your_df,
sapply(c('varX', 'varY'),
function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}
jlhoward <- function() {
years <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}
Thomas <- function() {
do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}
library(microbenchmark)
microbenchmark(arvi1000(), jlhoward(), Thomas())
The timings originally posted here (a median of about 40 nanoseconds for all three) came from a call without the parentheses, so they do not support ranking the solutions; rerun the call above on your own data to compare them.

How to find the highest value of a column in a data frame in R?

I have the following data frame which I called ozone:
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
I would like to extract the highest value from ozone, Solar.R, Wind...
Also, if possible how would I sort Solar.R or any column of this data frame in descending order
I tried
max(ozone, na.rm=T)
which gives me the highest value in the dataset.
I have also tried
max(subset(ozone,Ozone))
but got "'subset' must be logical".
I can set an object to hold the subset of each column, by the following commands
ozone <- subset(ozone, Ozone >0)
max(ozone,na.rm=T)
but it gives the same value of 334, which is the max value of the data frame, not the column.
Any help would be great, thanks.
Similar to colMeans, colSums, etc, you could write a column maximum function, colMax, and a column sort function, colSort.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
colSort <- function(data, ...) sapply(data, sort, ...)
I use ... in the second function in hopes of sparking your intrigue.
Get your data:
dat <- read.table(h=T, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")
Use colMax function on sample data:
colMax(dat)
# Ozone Solar.R Wind Temp Month Day
# 41.0 313.0 20.1 74.0 5.0 9.0
To do the sorting on a single column,
sort(dat$Solar.R, decreasing = TRUE)
# [1] 313 299 190 149 118 99 19
and over all columns use our colSort function,
colSort(dat, decreasing = TRUE) ## compare with '...' above
To get the max of any column you want something like:
max(ozone$Ozone, na.rm = TRUE)
To get the max of all columns, you want:
apply(ozone, 2, function(x) max(x, na.rm = TRUE))
And to sort:
ozone[order(ozone$Solar.R),]
Or to sort the other direction:
ozone[rev(order(ozone$Solar.R)),]
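To see why max() on the whole data frame differs from max() on one column, here is a small base R sketch using the nine sample rows from the question (stored in a hypothetical object named dat). It also sorts in descending order with order(decreasing = TRUE), which keeps the NA rows at the end, unlike rev(order(...)), which moves them to the top.

```r
dat <- read.table(header = TRUE, text = "Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9")

max(dat$Ozone, na.rm = TRUE)      # 41: maximum of one column
max(dat, na.rm = TRUE)            # 313: maximum over every cell of the frame
sapply(dat, max, na.rm = TRUE)    # one maximum per column

# Sort rows by Solar.R, largest first; rows with NA stay at the bottom.
sorted <- dat[order(dat$Solar.R, decreasing = TRUE), ]
sorted$Solar.R                    # 313 299 190 149 118 99 19 NA NA
```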
Here's a dplyr solution:
library(dplyr)
# find max for each column
summarise_each(ozone, funs(max(., na.rm=TRUE)))
# sort by Solar.R, descending
arrange(ozone, desc(Solar.R))
UPDATE: summarise_each() has been deprecated in favour of a more featureful family of functions: mutate_all(), mutate_at(), mutate_if(), summarise_all(), summarise_at(), summarise_if()
Here is how you could do:
# find max for each column
ozone %>%
summarise_if(is.numeric, funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
or
ozone %>%
summarise_at(vars(1:6), funs(max(., na.rm=TRUE)))%>%
arrange(Ozone)
In response to finding the max value for each column, you could try using the apply() function:
> apply(ozone, MARGIN = 2, function(x) max(x, na.rm=TRUE))
Ozone Solar.R Wind Temp Month Day
41.0 313.0 20.1 74.0 5.0 9.0
Another way would be to use ?pmax
do.call('pmax', c(as.data.frame(t(ozone)),na.rm=TRUE))
#[1] 41.0 313.0 20.1 74.0 5.0 9.0
There is a package, matrixStats, that provides functions for column and row summaries (see the package vignette), but you have to convert your data.frame into a matrix first.
Then you run: colMaxs(as.matrix(ozone), na.rm = TRUE)
max(may$Ozone, na.rm = TRUE)
Without $Ozone, max() is computed over the whole data frame rather than one column; this is covered in the swirl package.
I'm studying this course on Coursera too ~
Assuming that your data is in a data.frame called maxinozone and Ozone is its first column, you can do this
max(maxinozone[, 1], na.rm = TRUE)
max(ozone$Ozone, na.rm = TRUE) should do the trick. Remember to include the na.rm = TRUE or else R will return NA.
Try this solution:
Oz <- subset(data, Month == 5, select = Ozone) # select Ozone values for May (i.e. Month == 5)
summary(Oz) # gives characteristics of the table (one column, Ozone), including max, min ...
