Calculate median based on counts of value - r

I have the following data frame:
numbers Freq
1 4 2
2 5 1
3 23 2
4 34 2
I know how to calculate the median in R when every frequency is 1 (median(`numbers`, na.rm = TRUE)), but how do I do it when the frequencies differ from 1?

We can create a logical index with !=, use it to subset the 'numbers' column, and take the median of what remains:
with(df1, median(numbers[Freq != 1], na.rm = TRUE))
#[1] 23
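If the question is read instead as asking for the median of the data with each value counted Freq times, one possible sketch is to expand the values with rep() before calling median(). This is an assumption about the intent, not something the question states outright; the name `expanded` is only illustrative.

```r
# Same data as in the question
df1 <- data.frame(numbers = c(4L, 5L, 23L, 34L), Freq = c(2L, 1L, 2L, 2L))

# Repeat each value as many times as its frequency says
expanded <- with(df1, rep(numbers, times = Freq))
expanded
#[1]  4  4  5 23 23 34 34

# Median of the expanded vector
median(expanded, na.rm = TRUE)
#[1] 23
```

For this particular data the result happens to equal the one obtained by subsetting on Freq != 1, but the two approaches answer different questions and will generally differ.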
data
df1 <- structure(list(numbers = c(4L, 5L, 23L, 34L), Freq = c(2L, 1L,
2L, 2L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))

Here is another option using tapply along with subset
> with(subset(df,Freq != 1),tapply(numbers,Freq,median,na.rm = TRUE))
2
23
Data
> dput(df)
structure(list(numbers = c(4L, 5L, 23L, 34L), Freq = c(2L, 1L,
2L, 2L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))

Related

How to retrieve row names from each list levels after applying max.col

Good afternoon,
Assume we have the following list:
L1=list(y1 = structure(c(2L, 2L, 5L, 3L, 3L, 4L), .Dim = 2:3, .Dimnames = structure(list(
c("PURPLE", "YELLOW"), c("1", "2", "3")), .Names = c("",
"")), class = "table"), y2 = structure(c(3L, 1L, 5L, 3L, 2L,
5L), .Dim = 2:3, .Dimnames = structure(list(c("LARGE", "SMALL"
), c("1", "2", "3")), .Names = c("", "")), class = "table"),
y3 = structure(c(2L, 2L, 3L, 5L, 3L, 4L), .Dim = 2:3, .Dimnames = structure(list(
c("DIP", "STRETCH"), c("1", "2", "3")), .Names = c("",
"")), class = "table"), y4 = structure(c(3L, 1L, 4L, 4L,
4L, 3L), .Dim = 2:3, .Dimnames = structure(list(c("ADULT",
"CHILD"), c("1", "2", "3")), .Names = c("", "")), class = "table"),
y5 = structure(c(3L, 1L, 5L, 3L, 4L, 3L), .Dim = 2:3, .Dimnames = structure(list(
c("FALSE", "TRUE"), c("1", "2", "3")), .Names = c("",
"")), class = "table"))
#output
> L1
$y1
1 2 3
PURPLE 2 5 3
YELLOW 2 3 4
$y2
1 2 3
LARGE 3 5 2
SMALL 1 3 5
$y3
1 2 3
DIP 2 3 3
STRETCH 2 5 4
$y4
1 2 3
ADULT 3 4 4
CHILD 1 4 3
$y5
1 2 3
FALSE 3 5 4
TRUE 1 3 3
For each variable $y, I want the most frequent modality in each column. For each column, I tried to retrieve the row with the maximum frequency as follows:
lapply(L1, function(x) max.col(t(x)))
$y1
[1] 2 1 2
$y2
[1] 1 1 2
$y3
[1] 1 2 2
$y4
[1] 1 1 1
$y5
[1] 1 1 1
Now I need to get the associated modalities, for example:
$y1
[1] YELLOW PURPLE YELLOW
$y2
[1] LARGE LARGE SMALL
...
I tried, without success: lapply(L1, function(x) rownames(max.col(t(x))))
Thank you for your help!
max.col() returns a plain integer vector of indices, which has no row names, so calling rownames() on it gives NULL. To get the names, use the indices to subset rownames(x) instead:
lapply(L1, function(x) rownames(x)[max.col(t(x))])
#$y1
#[1] "PURPLE" "PURPLE" "YELLOW"
#
#$y2
#[1] "LARGE" "LARGE" "SMALL"
#
#$y3
#[1] "DIP" "STRETCH" "STRETCH"
#
#$y4
#[1] "ADULT" "ADULT" "ADULT"
#
#$y5
#[1] "FALSE" "FALSE" "FALSE"
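One detail worth keeping in mind: max.col() breaks ties at random by default, which is why the question shows YELLOW for the first column of y1 while the answer shows PURPLE (both rows have a count of 2 there). A minimal sketch, using a plain matrix as a stand-in for the y1 table, that makes the result reproducible with ties.method = "first":

```r
# Stand-in for L1$y1 (same counts and dimnames as in the question)
y1 <- matrix(c(2L, 2L, 5L, 3L, 3L, 4L), nrow = 2,
             dimnames = list(c("PURPLE", "YELLOW"), c("1", "2", "3")))

# Index of the largest count in each column; on ties take the first row
idx <- max.col(t(y1), ties.method = "first")
idx
#[1] 1 1 2

# The index vector itself carries no row names
rownames(idx)
#NULL

# Subsetting the row names with the indices gives the labels
labels <- rownames(y1)[idx]
labels
#[1] "PURPLE" "PURPLE" "YELLOW"
```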

Using lapply to group list of data frames by column

I have a list that contains multiple data frames. I would like to group the data by category (A) and sum the frequencies (B) using lapply.
The data is df_list
df_list
$`df.1`
A B
1 Apples 2
2 Pears 5
3 Apples 6
4 Pears 1
5 Apples 3
$`df.2`
A B
1 Oranges 2
2 Pineapples 5
3 Oranges 6
4 Pineapples 1
5 Oranges 3
The desired outcome df_list_2 looks like this:
df_list_2
$`df.1`
A B
1 Apples 11
2 Pears 6
$`df.2`
A B
1 Oranges 11
2 Pineapples 6
I have tried the following code based on lapply:
df_list_2<-df_list[, lapply(B, sum), by = A]
However, I get an error saying that A was not found.
Either I am misusing lapply in this case, or my understanding of how it should work is flawed.
Any help is much appreciated.
You need to call aggregate inside lapply:
lapply(df_list, function(x) aggregate(B~A, x, sum))
#[[1]]
# A B
#1 Apples 11
#2 Pears 6
#[[2]]
# A B
#1 Oranges 11
#2 Pineapples 6
Using map from purrr and dplyr it would be
library(dplyr)
purrr::map(df_list, ~.x %>% group_by(A) %>% summarise(B = sum(B)))
data
df_list <- list(structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("Apples", "Pears"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5")),
structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("Oranges",
"Pineapples"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5")))
I fear you might not have a clear idea of lapply nor the extract operator ([). Remember lapply(list, function) applies the specified function you give it to each element of the list you give it. Extract gives you the element you specify:
x <- c('a', 'b', 'c')
x[2]
## "b"
I would imagine that somewhere in your R workspace you have an object named B, which is why you didn't get an error along the lines of
## Error in lapply(B, sum) : object 'B' not found
Conversely if you had (accidentally or intentionally) defined both A and B you would see the error
## Error in df_list[, lapply(B, sum), by = A] : incorrect number of dimensions
because that's not at all how to use [; remember, you just pass indexes or booleans to [ along with the occasional optional argument, and by is not one of those.
So without further ado, here's how I would do this (in base R):
# make some data
a <- c(1, 2, 1, 2, 1)
b <- c(2, 5, 6, 1, 3)
df_list <- list(df.1 = data.frame(A = c('Apples', 'Pears')[a], B = b),
df.2 = data.frame(A = c('Oranges', 'Pineapples')[a], B = b))
# simplify it
df_list_2 <- lapply(df_list, function(x) {
aggregate(list(B = x$B), list(A = x$A), sum)
})
# the desired result
df_list_2
## $df.1
## A B
## 1 Apples 11
## 2 Pears 6
##
## $df.2
## A B
## 1 Oranges 11
## 2 Pineapples 6
You can take advantage of the fact that a data.frame is just a list and shorten up your code like this:
df_list_2 <- lapply(df_list, function(x) {
aggregate(x['B'], x['A'], sum)
})
but the first way of writing it should make it clearer what we're doing.
The data.table syntax in the OP's post can be changed to
library(data.table)
lapply(df_list, function(x) as.data.table(x)[, .(B = sum(B)), by = A])
#$df.1
# A B
#1: Apples 11
#2: Pears 6
#$df.2
# A B
#1: Oranges 11
#2: Pineapples 6
data
df_list <- list(df.1 = structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L
), .Label = c("Apples", "Pears"), class = "factor"), B = c(2L,
5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5")), df.2 = structure(list(A = structure(c(1L, 2L,
1L, 2L, 1L), .Label = c("Oranges", "Pineapples"), class = "factor"),
B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))

Remove column names in a DataFrame

In sparkR I have a DataFrame data.
When I type head(data) I get this output
C0 C1 C2 C3
1 id user_id foreign_model_id machine_id
2 1 3145 4 12
3 2 4079 1 8
4 3 1174 7 1
5 4 2386 9 9
6 5 5524 1 7
I want to remove C0, C1, C2, C3 because they cause problems later on. For example, when I use the filter function:
filter(data,data$machine_id==1)
it can't run because of this.
I have read the data like this
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")
SparkR turned the header into the first row and gave the DataFrame a new header because the header option defaults to "false". Set header="true" and you won't have to deal with this problem.
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv", header="true")
Try
colnames(data) <- unlist(data[1,])
data <- data[-1,]
> data
# id user_id foreign_model_id machine_id
#2 1 3145 4 12
#3 2 4079 1 8
#4 3 1174 7 1
#5 4 2386 9 9
#6 5 5524 1 7
If you wish, you can add rownames(data) <- NULL to correct for the row numbers after the deletion of the first row.
After this manipulation, you can select rows that correspond to certain criteria, like
subset(data, data$machine_id==1)
# id user_id foreign_model_id machine_id
#4 3 1174 7 1
In base R, the function filter() suggested in the OP is part of the stats namespace and is usually reserved for the analysis of time series.
data
data <- structure(list(C0 = structure(c(6L, 1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4", "5", "id"), class = "factor"),
C1 = structure(c(6L, 3L, 4L, 1L, 2L, 5L), .Label = c("1174", "2386",
"3145", "4079", "5524", "user_id"), class = "factor"),
C2 = structure(c(5L, 2L, 1L, 3L, 4L, 1L),
.Label = c("1", "4", "7", "9", "foreign_model_id"), class = "factor"),
C3 = structure(c(6L, 2L, 4L, 1L, 5L, 3L),
.Label = c("1", "12", "7", "8", "9", "machine_id"), class = "factor")),
.Names = c("C0", "C1", "C2", "C3"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
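To illustrate the remark about filter() above: in a plain R session with neither SparkR nor dplyr attached, filter() resolves to stats::filter(), which applies a linear filter to a series rather than selecting rows. A minimal sketch with made-up numbers (the objects `x`, `ma`, `toy` and `picked` are purely illustrative):

```r
# stats::filter() computes a moving average here; it does not pick rows
x <- 1:5
ma <- stats::filter(x, rep(1/3, 3))  # centred 3-point moving average
as.numeric(ma)
# the two end points are NA, the middle values are 2, 3, 4

# Row selection in base R is done with subset() or logical indexing
toy <- data.frame(id = 1:3, machine_id = c(12, 8, 1))
picked <- subset(toy, machine_id == 1)
picked
```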
Try this:
names <- c()
for (i in seq(along = names(data))) {
names <- c(names, toString(data[1,i]))
}
names(data) <- names
data <- data[-1,]
I can't use these answers because they don't run in SparkR: object of type 'S4' is not subsettable. I solved the problem this way, though I think there is probably a better way.
data <- withColumnRenamed(data, "C0","id")
data <- withColumnRenamed(data, "C1","user_id")
data <- withColumnRenamed(data, "C2","foreign_model_id")
data <- withColumnRenamed(data, "C3","machine_id")
And now I can successfully use the filter function as I want to.

Replacing loop in dplyr R

I am trying to write a function with dplyr without a loop, and there is something I do not know how to do.
Say we have TV stations (x, y, z) and months (2, 3). If I group by these, I get
this output, with a summarised numeric value:
TV months value
x 2 52
y 2 87
z 2 65
x 3 180
y 3 36
z 3 99
This is for the evaluated brand.
I will then have many brands, and I need to keep only those whose value is >= 0.8 * the evaluated brand's value and <= 1.2 * the evaluated brand's value.
For example, from the table below I would want to keep only the first two rows, and this should be done for all month and TV combinations.
brand TV MONTH value
sdg x 2 60
sdfg x 2 55
shs x 2 120
sdg x 2 11
sdga x 2 5000
As @akrun said, you need to use a combination of merging and subsetting. Here's a base R solution.
m <- merge(df, data, by.x=c("TV", "MONTH"), by.y=c("TV", "months"))
m[m$value.x >= m$value.y*0.8 & m$value.x <= m$value.y*1.2,][,-5]
# TV MONTH brand value.x
#1 x 2 sdg 60
#2 x 2 sdfg 55
Data
data <- structure(list(TV = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("x",
"y", "z"), class = "factor"), months = c(2L, 2L, 2L, 3L, 3L,
3L), value = c(52L, 87L, 65L, 180L, 36L, 99L)), .Names = c("TV",
"months", "value"), class = "data.frame", row.names = c(NA, -6L
))
df <- structure(list(brand = structure(c(2L, 1L, 4L, 2L, 3L), .Label = c("sdfg",
"sdg", "sdga", "shs"), class = "factor"), TV = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "x", class = "factor"), MONTH = c(2L,
2L, 2L, 2L, 2L), value = c(60L, 55L, 120L, 11L, 5000L)), .Names = c("brand",
"TV", "MONTH", "value"), class = "data.frame", row.names = c(NA,
-5L))

Make Dataframe loop code run faster

DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (the begin and finish columns) from the newer survey.
I want to match the begin and finish boundaries and then create two tables: one for the new sedimentation data and one for the new bore width data, characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# select the rows of myMatrix1 whose begin value lies
# between myMatrix2$begin[i] and myMatrix2$finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
# repeat the i-th sedimentation value once per matched row
myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2))
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1))
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.
Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here's a solution using apply:
myMatrix1 <- data.frame(sampleIDs=c(19990224,20000224),begin=c(4,5),finish=c(5,6))
myMatrix2 <- data.frame(begin=c(0,11),finish=c(10,367),sed=c(1.002,2.01),boreWidth=c(.014,.056))
glommer <- function(x,myMatrix2) {
x[4:5] <- as.numeric(myMatrix2[ myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed","boreWidth") ])
names(x)[4:5] <- c("sed","boreWidth")
return( x )
}
> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
sampleIDs begin finish sed boreWidth
[1,] 19990224 4 5 1.002 0.014
[2,] 20000224 5 6 1.002 0.014
I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have the sampleIDs and begin, finish be ints. That might avoid some problems with floating point error.
This solution assumes there are no boundary cases (e.g. the begin, finish times of myMatrix1 are entirely contained within the begin, finish times of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.
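For several million rows, a fully vectorised lookup may be worth trying instead of a per-row apply. The sketch below swaps in findInterval() and is not the approach used above; it assumes the regions in the new survey are sorted by begin and do not overlap, and the names `old_survey`, `new_survey`, `idx` and `result` are only illustrative. The values are copied from the question.

```r
# Old survey: sample intervals
old_survey <- data.frame(
  sampleIDs = c(19990224, 20000224, 20010203, 20019024, 20020201),
  begin     = c(4, 5, 6, 29, 51),
  finish    = c(5, 6, 8, 30, 52))

# New survey: regions, assumed sorted by begin and non-overlapping
new_survey <- data.frame(
  begin         = c(0, 11, 368, 921, 1415),
  finish        = c(10, 367, 920, 1414, 5278),
  sedimentation = c(1.002455, 2.094351, 0.450275, 2.250820, 0.114109),
  boreWidth     = c(0.014354, 0.056431, 0.154105, 1.004353, NA))

# For each old interval, find the last region whose begin is <= its begin
idx <- findInterval(old_survey$begin, new_survey$begin)
idx[idx == 0] <- NA  # starts before the first region

# Discard matches where the old interval runs past the end of the region
idx[!is.na(idx) & old_survey$finish > new_survey$finish[idx]] <- NA

# One result table carrying both derived columns
result <- cbind(old_survey,
                sedimentation   = new_survey$sedimentation[idx],
                boreWidthThresh = new_survey$boreWidth[idx] == 0)
result
```

With these values the first three samples fall in the region 0 to 10 and the last two in the region 11 to 367. How to treat intervals that straddle two regions is left as NA here, which is a choice rather than a requirement.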
