How to apply a function to grouped rows in R [duplicate] - r

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 6 years ago.
I have a data frame generated by
points_A = sample(1:6,6)
points_B = sample(1:6,6)
points_C = sample(1:6,6)
df <- data.frame( name = gl(3,2,labels=c("Luca","Mario","Paolo") ) , cbind(points_A,points_B,points_C) )
which display as
name points_A points_B points_C
1 Luca 5 2 3
2 Luca 3 3 1
3 Mario 1 5 2
4 Mario 6 6 4
5 Paolo 4 4 5
6 Paolo 2 1 6
I would like to apply a function (e.g. sum() ) to the rows grouped by the column name (1st column).
The output should be something like:
name points_A points_B points_C
1 Luca 8 5 4
2 Mario 7 11 6
3 Paolo 6 5 11
Any suggestions?

I like to do these things with data.table
library(data.table);
dt<-data.table(df) ;
dt[, function(column), by = group]
As "column" you can also set .SD to get multiple columns. "group" would be "name" in your example.

A (pretty raw) solution with data.table
require(data.table)
setDT(df)
df[, lapply(.SD, sum), by = name, .SDcols = 2:4]
name points_A points_B points_C
1: Luca 9 6 6
2: Mario 5 10 11
3: Paolo 7 5 4
EDIT:
A raw solution in base R:
t(sapply(split(df, df$name), function(x) colSums(x[, c("points_A", "points_B", "points_C")])))

Related

How to change a dataframe in R where you have multiple columns to just one column [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 2 years ago.
I have a data frame like this a
x <- data.frame("Name" = c("John","Dora"),"SN" = 1:2,"SN_1" = 7:8,"SN_2" = 5:6 )
Name SN SN_1 SN_2
1 John 1 7 5
2 Dora 2 8 6
And I want to transform it to:
Name SN
1 John 1
2 Dora 2
3 John 7
4 Dora 8
5 John 5
6 Dora 6
library(data.table)
setDT(x)
melt(x,
id.vars = "Name",
measure.vars = patterns(c("SN")))[,.(Name, SN = value)]

Sort data.frame or data.table using vector of column names [duplicate]

This question already has answers here:
Sort a data.table fast by Ascending/Descending order
(2 answers)
Order data.table by a character vector of column names
(2 answers)
Sort a data.table programmatically using character vector of multiple column names
(1 answer)
Closed 2 years ago.
I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference and thus very fast and memory efficient. Both functions work on data.table's as wel as on data.frame's.
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at which accepts string column names :
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or with the new version
DF %>% arrange(across(sortby))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
Also possible with dplyr:
DF %>%
arrange(get(sort_by))
But Ronaks answer is more elegant.

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (not important which one) i obtained this df:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
i want to obtain a new dataframe made by only the rows which didn't change during the steps; the result would be this one:
X Y Z T
1 2 4 2
7 5 NA 3
How could I do?
One option with base R would be to paste the rows of each dataset together and compare (==) to create a logical vector which we use for subsetting the new dataset
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that neither semi join, nor intersect or merge are the correct answers. merge and intersect will not handle duplicate rows properly. semi join will change order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = T) == ncol(df1),]
But I think akrun's way is more elegant and likely to perform better in terms of speed.

R join same row and calculate mean value [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
I have a data frame that looks like this:
data<-data.frame(y=c(1,1,2,2,3,4,5,5),x=c(5,5,10,10,5,10,5,5))
y x
1 1 5
2 1 5
3 2 10
4 2 30
5 3 5
6 4 10
7 5 4
8 5 8
How can a merge those rows with same value in y column and modify the x column value to the mean of them.
I would like something like this:
y x
1 1 5
2 2 20
3 3 5
4 4 10
7 5 6
I'm trying:
unique(data)
But it removes the values instead of doing the mean of same rows.
It is easy with dplyr. Like here:
library("dplyr")
data %>%
group_by(y) %>%
summarise(x=mean(x))
We can use aggregate
aggregate(x~y, data, mean)
User plyr.
# Create dummy data.
nel = 30
df <- data.frame(x = round(5*runif(nel)), y= round(10*runif(nel)))
# Summarise means
require(plyr)
df$x <- as.factor(df$x)
res <- ddply(df, .(x), summarise, mu=mean(y))

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.
I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here and can't seem to find a solution
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like to get it to look like this (summing columns 3 and 4 and leaving column 1 alone. I'd like it to look like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would great. Thanks!
Another solution which is a bit more straightforward is by using the library dplyr
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
if you want to sum over all other columns except those you want to "leave alone" then replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum))
I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
lapply(.SD, sum)), by=Name,
.SDcols = -"data.Point.1"]
# Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff 3 23 14
# 2: Hank 3 16 6
# 3: Tom 3 2 8
# 4: Bill 3 2 3
# 5: Mark 3 9 1
We group by Name, and for each group, get first element of data.Point.1, and for the rest of the columns, we compute sum by using base function lapply and looping it through the columns of .SD, which stands for Subset of Data. The columns in .SD is provided by .SDcols, to which we remove data.Point.1, so that all the other columns are provided to .SD.
Check the HTML vignettes for detailed info.
You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
x[3:4] <- colSums(x[3:4])
x[1,]} ))
Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data showed initially differed in the original post. After the edit on OP's post (by #dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Resources