How to look up field name from an aggregate result - r

I am trying to look up the index or name of a data frame column based on the maximum of the aggregate values of that data frame, for example:
df <- data.frame(
id = 1:6,
v1 = c(3, 20, 34, 23, 23, 56),
v2 = c(1, 3, 4, 10, 30, 40),
v3 = c(20, 35, 60, 60, 70, 80))
id v1 v2 v3
1 1 3 1 20
2 2 20 3 35
3 3 34 4 60
4 4 23 10 60
5 5 23 30 70
6 6 56 40 80
> colSums(as.data.frame(df[[1]]))
df[[1]]
21
> colSums(as.data.frame(df[[2]]))
df[[2]]
159
> colSums(as.data.frame(df[[3]]))
df[[3]]
88
So, for example, the maximum result using colSums is 159, and I'm trying to figure out how to return 'df[[2]]'

First, you can simply run colSums directly on your data.frame
> colSums(df)
id v1 v2 v3
21 159 88 325
Subsetting is easy too
> df[which.max(colSums(df))]
v3
1 20
2 35
3 60
4 60
5 70
6 80
Or, if you just want the index, as implied in your first line:
> which.max(colSums(df))
v3
4
Also note that if you expect there might be more than one column with the same maximum sum, and you want to return all of them, you can use which(colSums(df) == max(colSums(df))) instead of which.max, which only returns the first occurrence.
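If what you are after is the column name rather than the position, a minimal sketch along the same lines follows. The variable names (sums, best_name, tied, best_subset) are made up for illustration, and restricting to the first three columns is only an assumption based on the df[[1]] to df[[3]] calls in the question.

```r
df <- data.frame(
  id = 1:6,
  v1 = c(3, 20, 34, 23, 23, 56),
  v2 = c(1, 3, 4, 10, 30, 40),
  v3 = c(20, 35, 60, 60, 70, 80))

sums <- colSums(df)

# name of the (first) column with the largest sum
best_name <- names(which.max(sums))

# names of all columns tied for the largest sum
tied <- names(sums)[sums == max(sums)]

# same lookup restricted to the first three columns, as in the question
best_subset <- names(which.max(colSums(df[1:3])))
```

Here best_name is "v3" because v3 is included in the sums, while best_subset is "v1", the column with sum 159.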

Related

Calculate mean of specific row pattern

I have a dataframe like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
I want to calculate the means of the value-column and specific rows.
The pattern of the rows is pretty complicated:
Rows of MeanA1: 1, 5, 9
Rows of MeanA2: 2, 6, 10
Rows of MeanA3: 3, 7, 11
Rows of MeanA4: 4, 8, 12
Rows of MeanB1: 13, 17, 21
Rows of MeanB2: 14, 18, 22
Rows of MeanB3: 15, 19, 23
Rows of MeanB4: 16, 20, 24
Rows of MeanC1: 25, 29, 33
Rows of MeanC2: 26, 30, 34
Rows of MeanC3: 27, 31, 35
Rows of MeanC4: 28, 32, 36
Rows of MeanD1: 37, 41, 45
Rows of MeanD2: 38, 42, 46
Rows of MeanD3: 39, 43, 47
Rows of MeanD4: 40, 44, 48
As you see, it's starting at 4 different points (1, 13, 25, 37), then always +4, and for the following 4 means it's just stepping 1 more row down.
I would like to have an output of all these means in one list.
Any ideas? NOTE: In this example the mean is of course always the middle number, but my real df is different.
Not quite sure about the output format you require, but the following code calculates what you want.
calc_mean1 <- function(x) mean(test$value[seq(x, by = 4, length.out = 3)])
calc_mean2 <- function(x){sapply(x:(x+3), calc_mean1)}
output <- lapply(seq(1, 37, 12), calc_mean2)
names(output) <- paste0('Mean', LETTERS[seq_along(output)]) # remove this line if more than 26 groups.
output
## $MeanA
## [1] 5 6 7 8
## $MeanB
## [1] 17 18 19 20
## $MeanC
## [1] 29 30 31 32
## $MeanD
## [1] 41 42 43 44
An idea via base R is to create a grouping variable for every 4 rows, split the data every 12 rows (nrow(test) / 4) and aggregate to find the mean, i.e.
test$new = rep(1:4, nrow(test)%/%4)
lapply(split(test, rep(1:4, each = nrow(test) %/% 4)), function(i)
aggregate(value ~ new, i, mean))
# $`1`
# new value
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
# $`2`
# new value
# 1 1 17
# 2 2 18
# 3 3 19
# 4 4 20
# $`3`
# new value
# 1 1 29
# 2 2 30
# 3 3 31
# 4 4 32
# $`4`
# new value
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
And yet another way.
fun <- function(DF, col, step = 4){
run <- nrow(DF)/step^2
res <- lapply(seq_len(step), function(inc){
inx <- seq_len(run*step) + (inc - 1)*run*step
dftmp <- DF[inx, ]
tapply(dftmp[[col]], rep(seq_len(step), run), mean, na.rm = TRUE)
})
names(res) <- sprintf("Mean%s", LETTERS[seq_len(step)])
res
}
fun(test, 2, 4)
#$MeanA
#1 2 3 4
#5 6 7 8
#
#$MeanB
# 1 2 3 4
#17 18 19 20
#
#$MeanC
# 1 2 3 4
#29 30 31 32
#
#$MeanD
# 1 2 3 4
#41 42 43 44
Since you said you wanted a long list of the means, I assumed it could also be a vector where you just have all these values. You would get that like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
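# first row of each of the 16 groups: 1:4, 13:16, 25:28, 37:40
starts <- as.vector(outer(1:4, seq(0, nrow(test) - 12, by = 12), "+"))
meanVector <- NULL
for (i in starts) {
x <- c(test$value[i], test$value[i+4], test$value[i+8])
m <- mean(x)
meanVector <- c(meanVector, m)
}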
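Another way to get all 16 means as one vector without a loop is to reshape the value column into a 3-dimensional array. This is only a sketch: it assumes the row pattern is exactly as described in the question (4 offsets, 3 rows per mean, 4 blocks of 12 rows), and the names arr, m and means are made up for illustration.

```r
test <- data.frame(name = paste0("AB", 1:48), value = 1:48)

# dimensions: offset within a block (4) x rows averaged (3) x block A-D (4)
# arr[i, j, k] holds value[i + 4 * (j - 1) + 12 * (k - 1)]
arr <- array(test$value, dim = c(4, 3, 4))

# average over the middle dimension: 4 x 4 matrix, rows = offset, columns = block
m <- apply(arr, c(1, 3), mean)

# flatten column by column: MeanA1..MeanA4, MeanB1..MeanB4, ...
means <- as.vector(m)
names(means) <- as.vector(outer(1:4, LETTERS[1:4],
                                function(i, l) paste0("Mean", l, i)))
means
```

With the example data, the first four entries (MeanA1 to MeanA4) are 5, 6, 7 and 8.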

Compare vector to a dataframe

I have a dataframe that looks something like -
test A B C
28 67 4 23
45 82 43 56
34 8 24 42
I need to compare test to the other three columns: for each row, I just need the number of elements in the other columns that are less than the corresponding element in the test column.
So the desired output is -
test A B C result
28 67 4 23 2
45 82 43 56 1
34 8 24 42 2
When I tried -
comp_vec = "test"
name_vec = c("A", "B", "C")
rowSums(df[, comp_vec] > df[, name_vec])
I get the error -
Error in Ops.data.frame(df[, comp_vec], df[, name_vec]) :
‘>’ only defined for equally-sized data frames
I am looking for a way without replicating test to match size of dataframe.
You can use sapply to compare the df$test column against each of the other three columns. That returns a TRUE/FALSE matrix that you can pass to rowSums and store as your result column.
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8), B = c(4, 43, 24), C = c(23, 56, 42))
df$result <- rowSums(sapply(df[,2:4], function(x) df$test > x))
> df
test A B C result
1 28 67 4 23 2
2 45 82 43 56 1
3 34 8 24 42 2
I noticed your expected result has 82 for the second row of A, whereas it's 5 in your starting example.
df$result <- apply(df, 1, function(x) sum(x < x[1]))
Use apply, specify 1 to indicate by row. x < x[1] will give a vector of TRUE/FALSE if the value at each position in the row is smaller than the first column's value. Use sum to give the number of TRUE values.
# test A B C result
# 1 28 67 4 23 2
# 2 45 82 43 56 1
# 3 34 8 24 42 2
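Since the question asks for a way that does not replicate test, here is a minimal sketch that reuses comp_vec and name_vec from the question. It relies on the fact that comparing a data frame with a plain vector of length nrow(df) recycles the vector down each column; the name below is made up for illustration.

```r
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8),
                 B = c(4, 43, 24), C = c(23, 56, 42))
comp_vec <- "test"
name_vec <- c("A", "B", "C")

# df[[comp_vec]] is a plain vector, not a one-column data frame,
# so the comparison is done column by column against the same vector
below <- df[name_vec] < df[[comp_vec]]

# count the TRUE values in each row
df$result <- rowSums(below)
df
```

For the example data this gives 2, 1, 2 in the result column.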

Divide vector into groups according to the difference between two neighbouring numbers

My dummy input vector looks like this:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
What I want: Add a group factor to each number. The group is assigned according to the difference between neighbouring numbers.
Example:
Difference (absolute) between 10 and 20 is 10, hence they belong to same group.
Difference between 30 and 20 is 10 - they belong to same group.
Difference between 30 and 70 is 40 - they belong to different groups.
Given maximal difference 20 wanted result is:
x group
10 1
20 1
30 1
70 4
80 4
90 4
130 7
190 8
200 8
My code:
library(data.table)
library(foreach)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
x <- data.table(x, group = 1)
y <- nrow(x)
maxGap <- 20
g <- 1
groups <-
foreach(i = 2:y, .combine = rbind) %do% {
if (x[i, x] - x[i - 1, x] < maxGap) {
g
} else {
g <- i
g
}
}
x[2:y]$group <- as.vector(groups)
My question
The given code works, but is too slow with large data (number of rows > 10 million). Is there a simpler and quicker solution (not using a loop)?
library(IRanges)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
# If the distance between two integers is larger than 30,
# then they would be in two groups. Otherwise, they would
# be in the same group.
ther <- 15 # half of the distance of 30 mentioned above
df.1 <- data.frame(val=x, left=x-ther, right=x+ther)
df.ir <- IRanges(df.1$left, df.1$right)
df.ir.re <- findOverlaps(df.ir, reduce(df.ir))
df.1$group <- subjectHits(df.ir.re)
df.1
# val left right group
# 1 10 -5 25 1
# 2 20 5 35 1
# 3 30 15 45 1
# 4 70 55 85 2
# 5 80 65 95 2
# 6 90 75 105 2
# 7 130 115 145 3
# 8 190 175 205 4
# 9 200 185 215 4
An implementation which uses the rleid and shift functions of data.table:
library(data.table)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
DT <- data.table(x)
DT[, grp := rleid(cumsum(x - shift(x,1L,0) > 20))]
which gives:
> DT
x grp
1: 10 1
2: 20 1
3: 30 1
4: 70 2
5: 80 2
6: 90 2
7: 130 3
8: 190 4
9: 200 4
Explanation: With x - shift(x,1L,0) you calculate the difference with the previous observation of x. By comparing it to 20 (i.e.: the > 20 part) and wrapping that in cumsum and rleid a runlength id is created.
In response to @Roland's comments: you can leave the rleid-part out if you set the fill parameter in shift to -Inf:
DT[, grp := cumsum((x - shift(x, 1L, -Inf)) > 20)]
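To see what each step of the explanation above produces, the same logic can be traced in base R. This is only a sketch: it uses diff instead of shift, and the names max_gap, step, breaks and grp are made up for illustration.

```r
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
max_gap <- 20

# difference with the previous element; the first element has no predecessor,
# so it gets 0 and never starts a new group by itself
step <- c(0, diff(x))      # 0 10 10 40 10 10 40 60 10

# TRUE wherever the gap to the previous element exceeds the threshold
breaks <- step > max_gap   # FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE

# every TRUE increases the running count, which serves as the group id
grp <- cumsum(breaks) + 1L # 1 1 1 2 2 2 3 4 4
```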
test <- c(TRUE, diff(x) > 20) #test the differences
res <- factor(cumsum(test)) #groups
#[1] 1 1 1 2 2 2 3 4 4
#Levels: 1 2 3 4
levels(res) <- which(test) #fix levels
res
#[1] 1 1 1 4 4 4 7 8 8
#Levels: 1 4 7 8

How to subtract a column by row?

I want to do an easy subtract in R, but I don't know how to solve it. I would like to know if I have to do a loop or if there is a function.
I have a column with numeric variables, and I would like to subtract n by n-1.
Time_Day Diff
10 10
15 5
45 30
60 15
Thus, I would like to find the variable "Diff".
You can also try it with the dplyr package. Note that the first row gets NA, because lag has no previous value there:
library(dplyr)
mutate(df, dif=Time_Day-lag(Time_Day))
# Time_Day Diff dif
# 1 10 10 NA
# 2 15 5 5
# 3 45 30 30
# 4 60 15 15
Does this do what you need?
Here we save the column as a variable:
c <- c(10, 15, 45, 60)
Now we add a 0 to the beginning and then cut off the last element:
cm1 <- c(0, c)[1:length(c)]
Now we subtract the two:
dif <- c - cm1
If we print that out, we get what you're looking for:
dif # 10 5 30 15
With diff :
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df$Diff <- c(df$Time_Day[1], diff(df$Time_Day))
df
## Time_Day Diff
##1 10 10
##2 15 5
##3 45 30
##4 60 15
It works fine in dplyr too :
library("dplyr")
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df %>% mutate(Diff = c(Time_Day[1], diff(Time_Day)))
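A slightly shorter base R variant, assuming (as the desired Diff column suggests) that the series should be treated as starting from 0: prepend a 0 and let diff do the rest.

```r
df <- data.frame(Time_Day = c(10, 15, 45, 60))

# diff() on a vector with a leading 0 returns one difference per original row,
# and the first one is simply the first value itself
df$Diff <- diff(c(0, df$Time_Day))
df
```

This gives 10, 5, 30, 15, the same as the versions above.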

Sample to have an equal number of each sex within groups in R

First things first. Here are my data:
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)
The data should look like this:
lat long sex score
1 12 -14 M 2
2 12 -14 M 6
3 58 139 M 3
4 58 139 M 6
5 58 139 F 5
6 58 139 M 4
7 58 139 M 3
8 45 -68 F 2
9 45 -68 M 3
10 45 -68 M 9
11 45 -68 M 9
12 45 -68 F 8
13 45 1 M 6
14 64 1 F 5
15 64 1 M 6
16 64 1 F 7
17 64 1 F 5
18 64 1 F 7
19 64 1 F 5
20 64 1 M 1
I am at my wits' end trying to figure this one out. The variables are latitude, longitude, sex and score. I would like to have an equal number of males and females within each location (i.e. with the same longitude and latitude). For instance, the second location (rows 3 to 7) has only one female. This female should be retained and one male from the remaining individuals should also be retained (by random sampling, perhaps). Some locations have only information about one sex, e.g. the first location (rows 1 and 2) has only data on males. The rows from this location should be dropped (since there are no females). If all goes according to plan, the final dataset should look something like this:
lat2 long2 sex2 score2
1 58 139 F 5
2 58 139 M 4
3 45 -68 F 2
4 45 -68 M 3
5 45 -68 M 9
6 45 -68 F 8
7 64 1 M 6
8 64 1 F 5
9 64 1 F 7
10 64 1 M 1
Any help would be appreciated.
Here's a solution with lapply:
data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
# 'split' splits the sequence of row numbers (indices) along the unique
# combinations of 'lat' and 'long'
# 'lapply' applies the following function to all sub-sequences
function(x) {
# which of the indices are for males:
male <- which(data[x, "sex"] == "M")
# which of the indices are for females:
female <- which(data[x, "sex"] == "F")
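# number of individuals to keep per sex:
n <- min(length(male), length(female))
# sample from the indices of males (sample.int on positions, because
# sample(male, n) would draw from 1:male when 'male' is a single number):
s_male <- male[sample.int(length(male), n)]
# sample from the indices of females:
s_female <- female[sample.int(length(female), n)]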
# combine both sampled indices:
x[c(s_male, s_female)]
})), ]
# The function 'lapply' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.
The result:
lat long sex score
9 45 -68 M 3
11 45 -68 M 9
12 45 -68 F 8
8 45 -68 F 2
7 58 139 M 3
5 58 139 F 5
20 64 1 M 1
15 64 1 M 6
19 64 1 F 5
16 64 1 F 7
Below is a quick way to go about it, which involves creating a temporary column of the lat-long combination. We split the DF according to this column, count the M/F in each split, sample appropriately, then re-combine.
# First, we call the dataframe something other than "data" ;)
mydf <- data.frame(lat, long, sex, score)
# create a new data frame with a temporary column, which concatenates the lat & long.
mydf.new <- data.frame(mydf, latlong=paste(mydf$lat, mydf$long, sep=","))
# Split the data frame according to the lat-long location
mydf.splat <- split(mydf.new, mydf.new$latlong)
# eg, taking a look at one of our tables:
mydf.splat[[4]]
sampled <-
lapply(mydf.splat, function(tabl) {
Ms <- sum(tabl$sex=="M")
Fs <- sum(tabl$sex=="F")
if(Fs == 0 || Ms ==0) # If either is zero, we drop that location
return(NULL)
if(Fs == Ms) # If they are both equal, no need to sample.
return(tabl)
# If number of Females less than Males, return all Females
# and sample from males in amount equal to Females
if (Fs < Ms)
return(tabl[c(which(tabl$sex=="F"), sample(which(tabl$sex=="M"), Fs)), ])
if (Ms < Fs) # same as previous, but for Males < Females
return(tabl[c(which(tabl$sex=="M"), sample(which(tabl$sex=="F"), Ms)), ])
stop("hmmm... something went wrong.") ## We should never hit this line, but just in case.
})
# Flatten into a single table
mydf.new <- do.call(rbind, sampled)
# Clean up
row.names(mydf.new) <- NULL # remove the row names that were added
mydf.new$latlong <- NULL # remove the temporary column that we added
RESULTS
mydf.new
# lat long sex score
# 1 45 -68 F 2
# 2 45 -68 F 8
# 3 45 -68 M 9
# 4 45 -68 M 3
# 5 58 139 F 5
# 6 58 139 M 3
# 7 64 1 M 6
# 8 64 1 M 1
# 9 64 1 F 7
# 10 64 1 F 5
This returns the values as list elements:
spl <- split(data, interaction(data$lat, data$long) ,drop=TRUE)
# interaction creates all the two way pairs from those two vectors
# drop is needed to eliminate the dataframes with no representation
res <- lapply(spl, function(x) { # First find the number of each gender to select
N <- min(sum(x$sex=="M"), sum(x$sex=="F")) # a sex absent from the group counts as 0; then sample each sex separately
rbind( x[ x$sex=="M" & row.names(x) %in% sample(row.names(x[x$sex=="M",] ), N) , ],
# One (or both) of these will be "sampling" all of that sex.
x[ x$sex=="F" & row.names(x) %in% sample(row.names(x[x$sex=="F", ]), N) , ] )
} )
res
#------------
$`45.-68`
lat long sex score
9 45 -68 M 3
11 45 -68 M 9
8 45 -68 F 2
12 45 -68 F 8
$`12.-14` # So there were no women in this group and zero could be matched
[1] lat long sex score
<0 rows> (or 0-length row.names)
$`45.1`
[1] lat long sex score
<0 rows> (or 0-length row.names)
$`64.1`
lat long sex score
15 64 1 M 6
20 64 1 M 1
16 64 1 F 7
17 64 1 F 5
$`58.139`
lat long sex score
7 58 139 M 3
5 58 139 F 5
... but if you wanted it as a dataframe you can just use do.call(rbind, res):
> do.call(rbind, res)
lat long sex score
45.-68.10 45 -68 M 9
45.-68.11 45 -68 M 9
45.-68.8 45 -68 F 2
45.-68.12 45 -68 F 8
64.1.15 64 1 M 6
64.1.20 64 1 M 1
64.1.17 64 1 F 5
64.1.18 64 1 F 7
58.139.6 58 139 M 4
58.139.5 58 139 F 5
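For reference, the split / sample / recombine idea shared by the answers above can be condensed into one small helper. This is only a sketch under the question's assumptions (sex is coded as "M" and "F"); the names dat, balance_sexes, groups and balanced are made up for illustration, and set.seed is there only to make the random draw repeatable.

```r
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
dat <- data.frame(lat, long, sex, score)

# keep an equal number of males and females from one location
balance_sexes <- function(d) {
  m <- which(d$sex == "M")
  f <- which(d$sex == "F")
  n <- min(length(m), length(f))
  if (n == 0) return(d[0, , drop = FALSE])  # one sex missing: drop the location
  # sample positions with sample.int so a single index is never expanded to 1:index
  keep <- c(m[sample.int(length(m), n)], f[sample.int(length(f), n)])
  d[sort(keep), , drop = FALSE]
}

set.seed(1)
groups <- split(dat, list(dat$lat, dat$long), drop = TRUE)
balanced <- do.call(rbind, lapply(groups, balance_sexes))
row.names(balanced) <- NULL

# check: the counts of F and M per location should be equal
table(paste(balanced$lat, balanced$long), balanced$sex)
```

Which individuals are kept depends on the seed, but with this data the result always has 10 rows: one pair from (58, 139) and two pairs each from (45, -68) and (64, 1).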

Resources