Aggregate single column in R

I have some data with only one variable, as follows:
dd <- data.frame (
x=sample (1:10, 100, T)
)
I want to aggregate it, for example to count each occurrence, but using only functions from the base packages:
dd |>
transform(y=1) |>
do(aggregate(y~x, data=., FUN = \(x) length(x)))
Is there a better solution?

Will this work?
as.data.frame(table(dd))
dd Freq
1 1 11
2 2 10
3 3 13
4 4 8
5 5 13
6 6 9
7 7 6
8 8 10
9 9 10
10 10 10
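If a count column with a custom name is preferred, aggregate() can also be called directly, without the helper column. This is a minimal base-R sketch; the column name n and the fixed seed are illustrative assumptions, not part of the original post.

```r
set.seed(42)  # assumption: fixed seed, only to make the sketch reproducible
dd <- data.frame(x = sample(1:10, 100, TRUE))

# Count the rows for each distinct value of x.
# 'n' is an illustrative name for the count column.
counts <- aggregate(list(n = dd$x), by = list(x = dd$x), FUN = length)

# The same counts via table(), with the value column explicitly named x
counts_tab <- as.data.frame(table(x = dd$x))
```

Both results contain one row per distinct value; the table() version stores x as a factor, while aggregate() keeps it as an integer.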

Related

How to find closest match from list in R

I have a list of numbers and would like to find which is the next highest compared to each number in a data.frame. I have:
list <- c(3,6,9,12)
X <- c(1:10)
df <- data.frame(X)
And I would like to add a variable to df holding the next highest number in the list, i.e.:
X Y
1 3
2 3
3 3
4 6
5 6
6 6
7 9
8 9
9 9
10 12
I've tried:
df$Y <- which.min(abs(list-df$X))
but that gives an error message and would just get the closest value from the list, not the next above.
Another approach is to use findInterval:
df$Y <- list[findInterval(X, list, left.open=TRUE) + 1]
> df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
You could do this...
df$Y <- sapply(df$X, function(x) min(list[list>=x]))
df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
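The two answers differ when a value is larger than everything in the lookup vector. The sketch below wraps the findInterval approach in a helper; the names next_highest and lookup are hypothetical (lookup is used instead of list to avoid masking the base function).

```r
lookup <- c(3, 6, 9, 12)

# Hypothetical helper: smallest element of 'lookup' that is >= x
next_highest <- function(x, lookup) {
  lookup <- sort(lookup)
  # left.open = TRUE uses intervals (a, b], so x equal to a lookup
  # value maps to that value rather than to the next one
  idx <- findInterval(x, lookup, left.open = TRUE) + 1
  lookup[idx]  # an index past the end yields NA
}

res <- next_highest(c(1, 3, 4, 12, 13), lookup)
# res is 3 3 6 12 NA
```

For 13, which exceeds every lookup value, this returns NA, whereas min(list[list >= x]) returns Inf with a warning because it is applied to an empty vector.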

dplyr calculate new columns in batch

I would like to add new columns to a data.frame using dplyr. One by one it is easy using mutate. However, I have a situation where I have a function that calculates several parameters based on some other column and I would like to add them to the table in one go. Suppose I have a function
f = function(x) {data.frame(A = x + 1, B = x + 2, C = x + 3)}
And I want to run this function against a column in a data.frame and add the results to the same data.frame, so
df = data.frame(x = 1:10)
df %>% XXX(f(x))
would result in data.frame like this:
x A B C
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
10 11 12 13
I know I have read about a function like XXX in the example above, but I'm unable to find it right now. Does anybody have a hint?
We can use do
library(dplyr)
df %>%
do(data.frame(., f(.$x)))
# x A B C
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
#4 4 5 6 7
#5 5 6 7 8
#6 6 7 8 9
#7 7 8 9 10
#8 8 9 10 11
#9 9 10 11 12
#10 10 11 12 13
Or
library(purrr)
df %>%
map_df(f) %>%
bind_cols(df, .)
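For comparison, the same table can be built without any package, since f already returns a data frame with one row per input value. A minimal base-R sketch, reusing f and df from the question (the name res is illustrative):

```r
f <- function(x) data.frame(A = x + 1, B = x + 2, C = x + 3)
df <- data.frame(x = 1:10)

# Bind the columns computed by f next to the original column
res <- cbind(df, f(df$x))
```

As far as I know, dplyr 1.0.0 and later also accept an unnamed data frame inside mutate() and unpack it into separate columns, so df %>% mutate(f(x)) may be the closest match to the XXX placeholder; treat that as an assumption to check against the installed dplyr version.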

subset dataframe based on conditions in vector

I have two dataframes
#df1
type <- c("A", "B", "C")
day_start <- c(5,8,4)
day_end <- c(12,10,11)
df1 <- cbind.data.frame(type, day_start, day_end)
df1
type day_start day_end
1 A 5 12
2 B 8 10
3 C 4 11
#df2
value <- 1:10
day <- 4:13
df2 <- cbind.data.frame(day, value)
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
10 13 10
I would like to subset df2 such that each level of factor "type" in df1 gets its own dataframe, only including the rows/days between day_start and day_end of this factor level.
Desired outcome for "A" would be:
list_of_dataframes$df_A
day value
1 5 2
2 6 3
3 7 4
4 8 5
5 9 6
6 10 7
7 11 8
8 12 9
I found this question on SO with the answer suggesting to use mapply(); however, I just cannot figure out how to adapt the code given there to fit my data and desired outcome. Can someone help me out?
The following solution assumes that you have all integer values for days, but if that assumption is plausible, it's an easy one-liner:
> apply(df1, 1, function(x) df2[df2$day %in% x[2]:x[3],])
[[1]]
day value
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
[[2]]
day value
5 8 5
6 9 6
7 10 7
[[3]]
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
You can use setNames to name the dataframes in the list:
setNames(apply(df1, 1, function(x) df2[df2$day %in% x[2]:x[3],]),df1[,1])
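One thing to keep in mind with this one-liner is that apply() first converts the data frame to a matrix. Because type is a character column, every row reaches the function as a character vector, and x[2]:x[3] works only because the colon operator coerces the strings back to numbers. The sketch below illustrates the coercion; the names row_classes and row_classes_num are illustrative.

```r
df1 <- data.frame(type = c("A", "B", "C"),
                  day_start = c(5, 8, 4),
                  day_end = c(12, 10, 11))

# With the character column present, every row is a character vector
row_classes <- apply(df1, 1, class)

# Restricting apply() to the numeric columns keeps the rows numeric
row_classes_num <- apply(df1[, c("day_start", "day_end")], 1, class)
```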
Yes, you can use mapply:
Define a function that will do what you want:
fun <- function(x,y) df2[df2$day >= x & df2$day <= y,]
Then use mapply to apply this function with every element of day_start and day_end:
final.output <- mapply(fun,df1$day_start, df1$day_end, SIMPLIFY=FALSE)
This will give you a list with the outputs you want:
final.output
[[1]]
day value
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
9 12 9
[[2]]
day value
5 8 5
6 9 6
7 10 7
[[3]]
day value
1 4 1
2 5 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
8 11 8
You can name each data.frame of the list with setNames:
final.output <- setNames(final.output,df1$type)
Or you can also put an attribute type on the data.frames of the list:
fun <- function(x,y, type){
df <- df2[df2$day >= x & df2$day <= y,]
attr(df, "type") <- as.character(type)
df
}
Then each data.frame of final.output will have an attribute so you know which type it is:
final.output <- mapply(fun,df1$day_start, df1$day_end,df1$type, SIMPLIFY=FALSE)
# check which type the first data.frame is
attr(final.output[[1]], "type")
[1] "A"
Finally, if you do not want a list with the 3 data.frames you can create a function that assigns the 3 data.frames to the global environment:
fun <- function(x,y, type){
df <- df2[df2$day >= x & df2$day <= y,]
name <- as.character(type)
assign(name, df, pos=.GlobalEnv)
}
mapply(fun,df1$day_start, df1$day_end, type=df1$type, SIMPLIFY=FALSE)
This will create 3 separate data.frames in the global environment named A, B and C.
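To get exactly the list_of_dataframes$df_A layout asked for in the question, the mapply idea can also be written with Map(), which always returns a list. This is a sketch under the assumption that the df_ prefix from the question is the desired naming scheme.

```r
df1 <- data.frame(type = c("A", "B", "C"),
                  day_start = c(5, 8, 4),
                  day_end = c(12, 10, 11))
df2 <- data.frame(day = 4:13, value = 1:10)

# One subset of df2 per row of df1, bounded by day_start and day_end
list_of_dataframes <- Map(
  function(s, e) df2[df2$day >= s & df2$day <= e, ],
  df1$day_start, df1$day_end
)

# Name the elements df_A, df_B, df_C
names(list_of_dataframes) <- paste0("df_", df1$type)
```

Because the comparison uses >= and <=, this version does not rely on the days being integers.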

repeatedly applying ave for computing group means in a data frame

The following code computes the group means of x and y separately for each group. Suppose I have many variables for which I need to repeat the same operation.
How would you suggest obtaining the same result with a single command? (I suppose tapply is needed, but I am not sure.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following @Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
ave.dat
# x y
# [1,] 3 4
# [2,] 3 4
# [3,] 3 4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
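A base-R variant that avoids renaming altogether is to loop over the columns with lapply() and assign the results under new names in one step. This is a sketch; the vector vars and the m_ prefix follow the question, and the approach works for variable names of any length.

```r
dat <- data.frame(group = rep(1:2, each = 3),
                  x = seq(1, 11, by = 2),
                  y = seq(2, 12, by = 2))

# Columns to average within each group
vars <- c("x", "y")

# ave() defaults to the mean and returns one value per row,
# so the results can be assigned straight back as new columns
dat[paste0("m_", vars)] <- lapply(dat[vars], function(col) ave(col, dat$group))
```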

Generate combination of data frame and vector

I know expand.grid creates all combinations of given vectors. But is there a way to generate all combinations of a data frame and a vector, treating each row of the data frame as a unit? For instance,
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
How can I create a new data frame that is the combination of df and c without expanding df:
df.c:
a b c
1 5 9
2 6 9
3 7 9
1 5 10
2 6 10
3 7 10
Thanks!
For me, the simplest way is merge(df, as.data.frame(c)):
a b c
1 1 5 9
2 2 6 9
3 3 7 9
4 1 5 10
5 2 6 10
6 3 7 10
This may not scale when your dataframe has more than two columns per row, but you can just use expand.grid on the first column and then merge the second column in.
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
combined <- expand.grid(a=df$a, c=c)
combined <- merge(combined, df)
> combined[order(combined$c), ]
a c b
1 1 9 5
3 2 9 6
5 3 9 7
2 1 10 5
4 2 10 6
6 3 10 7
You could also do something like this
do.call(rbind, lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
# or using rbindlist as a fast alternative to do.call(rbind, list)
library(data.table)
rbindlist(lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
or
rbindlist(Map(data.frame, c = 9:10, MoreArgs = list(a= 1:3,b=5:7)))
This question is really old but I found one more answer.
Use tidyr's expand_grid().
library(tidyr)
expand_grid(df, c)
# A tibble: 6 × 3
a b c
<int> <int> <int>
1 1 5 9
2 1 5 10
3 2 6 9
4 2 6 10
5 3 7 9
6 3 7 10
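The row order requested in the question (all rows of df for 9, then all rows for 10) can also be produced in base R by repeating row indices. A minimal sketch; c_vals is used here instead of c only to avoid masking the base function, and df.c follows the name in the question.

```r
df <- data.frame(a = 1:3, b = 5:7)
c_vals <- 9:10

# Repeat the whole block of rows once per element of c_vals ...
row_idx <- rep(seq_len(nrow(df)), times = length(c_vals))

# ... and repeat each element of c_vals once per row of df
df.c <- data.frame(df[row_idx, ], c = rep(c_vals, each = nrow(df)))
rownames(df.c) <- NULL  # drop the 1.1, 2.1, ... row names created by indexing
```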
