Aggregating columns based on column names in R

I have this dataframe in R
Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6
I want to aggregate it so that all the Pro and all the Anti columns are summed by party,
for example:
Party ProSum AntiSum
R. 234. 245
D. 234. 245
How would I do that in R?

You can use:
library(tidyverse)
df %>%
pivot_longer(-Party,
names_to = c(".value", NA),
names_pattern = "([a-zA-Z]*)([0-9]*)") %>%
group_by(Party) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
# A tibble: 2 x 3
Party Pro Anti
<chr> <int> <int>
1 D 50 34
2 R 5 78
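To see why this works, it can help to look at the intermediate long table. The sketch below rebuilds the question's data (the name `long` is only illustrative) and stops before the summarise step. The special `".value"` entry tells pivot_longer to use the first regex group (Pro or Anti) as the output column name, and the `NA` entry discards the second group (the year).

```r
library(tidyr)

# Data from the question
df <- data.frame(
  Party    = c("R", "R", "D", "D"),
  Pro2005  = c(1L, 1L, 13L, 12L),
  Anti2005 = c(18L, 19L, 7L, 8L),
  Pro2006  = c(0L, 0L, 3L, 3L),
  Anti2006 = c(7L, 7L, 4L, 4L),
  Pro2007  = c(2L, 1L, 10L, 9L),
  Anti2007 = c(13L, 14L, 5L, 6L)
)

# One row per original row and year, with columns Party, Pro, Anti
long <- pivot_longer(
  df, -Party,
  names_to = c(".value", NA),
  names_pattern = "([a-zA-Z]*)([0-9]*)"
)
long
```

Each of the 4 original rows contributes 3 rows (one per year), so summing Pro and Anti by Party afterwards gives the totals shown above.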

I would suggest a tidyverse approach: reshape the data and then compute the sum of the values:
library(tidyverse)
#Data
df <- structure(list(Party = c("R", "R", "D", "D"), Pro2005 = c(1L,
1L, 13L, 12L), Anti2005 = c(18L, 19L, 7L, 8L), Pro2006 = c(0L,
0L, 3L, 3L), Anti2006 = c(7L, 7L, 4L, 4L), Pro2007 = c(2L, 1L,
10L, 9L), Anti2007 = c(13L, 14L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-4L))
The code:
df %>% pivot_longer(cols = -1) %>%
#Format strings
mutate(name=gsub('\\d+','',name)) %>%
#Aggregate
group_by(Party,name) %>% summarise(value=sum(value,na.rm=T)) %>%
pivot_wider(names_from = name,values_from=value)
The output:
# A tibble: 2 x 3
# Groups: Party [2]
Party Anti Pro
<chr> <int> <int>
1 D 34 50
2 R 78 5

Split by party and loop the sum over the Pro/Anti columns using sapply, then rbind the results.
res <- data.frame(Party=sort(unique(d$Party)), do.call(rbind, by(d, d$Party, function(x)
sapply(c("Pro", "Anti"), function(y) sum(x[grep(y, names(x))])))))
res
# Party Pro Anti
# D D 50 34
# R R 5 78
An outer solution is also suitable.
t(outer(c("Pro", "Anti"), c("R", "D"),
Vectorize(function(x, y) sum(d[d$Party %in% y, grep(x, names(d))]))))
# [,1] [,2]
# [1,] 5 78
# [2,] 50 34
Data:
d <- read.table(header=T, text="Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6 ")
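For comparison, here is a minimal base R sketch of the same totals that avoids grep by first collapsing the columns row-wise and then aggregating by party. The names `sums`, `ProSum`, and `AntiSum` are illustrative and follow the layout requested in the question.

```r
# Data from the question
d <- data.frame(
  Party    = c("R", "R", "D", "D"),
  Pro2005  = c(1L, 1L, 13L, 12L),
  Anti2005 = c(18L, 19L, 7L, 8L),
  Pro2006  = c(0L, 0L, 3L, 3L),
  Anti2006 = c(7L, 7L, 4L, 4L),
  Pro2007  = c(2L, 1L, 10L, 9L),
  Anti2007 = c(13L, 14L, 5L, 6L)
)

# Step 1: add up all Pro* and all Anti* columns within each row
sums <- data.frame(
  Party   = d$Party,
  ProSum  = rowSums(d[startsWith(names(d), "Pro")]),
  AntiSum = rowSums(d[startsWith(names(d), "Anti")])
)

# Step 2: add up those row totals within each party
res <- aggregate(cbind(ProSum, AntiSum) ~ Party, data = sums, FUN = sum)
res
```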

Related

Adding columns and insert info from a second dataframe R

Hello everyone. I have two data frames and I'd like to join information from one into the other in a specific way. Here is my first df, to which I'd like to add 6 columns (generic names col1, col2, and so on):
res1 res4 aa1234
1 AAAAAA 1 4 IVGG
2 AAAAAA 8 11 RPRQ
3 AAAAAA 10 13 RQFP
4 AAAAAA 12 15 FPFL
5 AAAAAA 20 23 NQGR
6 AAAAAA 32 35 HARF
here is the 2nd df:
res1 dist
1 3.711846
1 3.698985
2 4.180874
2 3.112819
3 3.559737
3 3.722107
4 3.842375
4 3.914970
5 3.361647
5 2.982788
6 3.245118
6 3.224230
7 3.538315
7 3.602273
8 3.185184
8 2.771583
9 4.276871
9 3.157737
10 3.933783
10 2.956738
Using "res1", I'd like to fill the 6 new columns of the 1st df with the first six values contained in "dist" of the second df corresponding to res1 = 1.
Next, the 1st df has res1 = 8, so I'd like the 6 new columns to hold the six values from res1 = 8 contained in "dist" of the 2nd df.
I'd like to obtain something like this:
res1 res4 aa1234 col1 col2 col3 col4 col5 col6
1 4 IVGG 3.71 3.79 4.18 3.11 3.55 3.72
8 11 RPRQ 3.18 2.77 4.27 3.15 3.93 2.95
10 13 RQFP
12 15 FPFL
20 23 NQGR
32 35 HARF
Please note that I have to do this on a large dataset and for 1000 or more files. Thanks!
You could create a sequence from res1 to res4 and then join the data with pdb (here turn and pdb appear to stand for the first and second data frames).
library(tidyverse)
turn %>%
mutate(res = map2(res1, res4, seq)) %>%
unnest(res) %>%
left_join(pdb, by = c('res' = 'res1')) %>%
group_by(res1 = as.character(res1)) %>%
mutate(col = paste0('col', row_number())) %>%
select(-res4, -res, -eleno) %>%
pivot_wider(names_from = col, values_from = dist)
We can use rowid from data.table
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df2 %>%
mutate(col = str_c("col", rowid(res1))) %>%
pivot_wider(names_from = col, values_from = dist) %>%
right_join(df1, by = 'res1')
-output
# A tibble: 6 x 4
# res1 col1 col2 res4
# <int> <dbl> <dbl> <int>
#1 1 3.71 3.70 4
#2 8 3.19 2.77 11
#3 10 3.93 2.96 13
#4 12 NA NA 15
#5 20 NA NA 23
#6 32 NA NA 35
data
df1 <- structure(list(res1 = c(1L, 8L, 10L, 12L, 20L, 32L), res4 = c(4L,
11L, 13L, 15L, 23L, 35L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(res1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L,
6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), dist = c(3.711846,
3.698985, 4.180874, 3.112819, 3.559737, 3.722107, 3.842375, 3.91497,
3.361647, 2.982788, 3.245118, 3.22423, 3.538315, 3.602273, 3.185184,
2.771583, 4.276871, 3.157737, 3.933783, 2.956738)), class = "data.frame",
row.names = c(NA,
-20L))
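The expected output in the question suggests one possible reading: for each res1 in the first data frame, take the six consecutive dist rows starting at the first matching res1 in the second data frame. The base R sketch below follows that assumption only; the helper `take_six` and the object `cols` are hypothetical names, and positions with no data are left as NA.

```r
df1 <- data.frame(res1 = c(1L, 8L, 10L, 12L, 20L, 32L),
                  res4 = c(4L, 11L, 13L, 15L, 23L, 35L))
df2 <- data.frame(
  res1 = rep(1:10, each = 2),
  dist = c(3.711846, 3.698985, 4.180874, 3.112819, 3.559737, 3.722107,
           3.842375, 3.91497, 3.361647, 2.982788, 3.245118, 3.22423,
           3.538315, 3.602273, 3.185184, 2.771583, 4.276871, 3.157737,
           3.933783, 2.956738)
)

# Hypothetical helper: six consecutive dist values from a starting row.
# An NA start or an index past the end of df2 yields NA.
take_six <- function(start) df2$dist[start + 0:5]

# match() returns the first row of df2 where res1 matches (NA if absent)
cols <- t(vapply(match(df1$res1, df2$res1), take_six, numeric(6)))
colnames(cols) <- paste0("col", 1:6)

out <- cbind(df1, cols)
out
```

Because this only uses vectorised indexing, it should be straightforward to wrap in a function and apply to each of the many files.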

calculate descriptives for a nested variable

I want to calculate the M, min, and max of a variable. Data were collected at different visits. My data look like this:
id visit V1
1 1 18
1 2 24
2 2 NA
2 3 5
2 4 6
I want it to look like this, where I have columns for the M, SD, min, and max for V1 for each participant.
id visit V1 M MIN MAX
1 1 18 21 18 24
2 2 3 4.67 3 6
In calculating the M, I want to take into account the number of visits (e.g., (18 + 24) / 2 visits). I tried this as a first step:
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1), na.rm = T)
When I try to handle the NAs by making sure they are not included, the na.rm = T results in a new column entitled "na.rm" with every value being true, which isn't what I want. Any thoughts on making this work?
The dplyr package makes this easy. You can group_by() a variable, and whatever you do after that only applies within the group. In dplyr notation, the %>% is a special operator that feeds the outcome of the function on the left into the first argument of the function on the right.
There are two ways to do it. The first way keeps all of the data, but your summary statistics are repeated in each row.
library(dplyr)
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1))
id visit V1 M MIN MAX
1 1 18 21 18 24
1 2 24 21 18 24
2 2 3 4.67 3 6
2 3 5 4.67 3 6
2 4 6 4.67 3 6
The second way provides only the summary statistics by the group.
library(dplyr)
df %>%
group_by(id) %>%
summarize(M = mean(V1), MIN = min(V1), MAX = max(V1))
id M MIN MAX
1 21 18 24
2 4.67 3 6
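On the na.rm problem from the question: na.rm is an argument of mean(), min(), and max(), not of mutate() or summarise(), which is why passing it to mutate() created a column called na.rm. A minimal sketch, using the question's data with the missing value kept in (the name `stats` is illustrative):

```r
library(dplyr)

# Same layout as the question, including the NA for id 2
df <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  visit = c(1, 2, 2, 3, 4),
  V1    = c(18, 24, NA, 5, 6)
)

# na.rm = TRUE goes inside each summary function call
stats <- df %>%
  group_by(id) %>%
  summarise(
    M   = mean(V1, na.rm = TRUE),
    MIN = min(V1, na.rm = TRUE),
    MAX = max(V1, na.rm = TRUE)
  )
stats
```

With the NA dropped, the mean for id 2 is taken over the two remaining visits only.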
You can try this dplyr approach, similar to @ThomasIsCoding's, which produces something close to what you want:
library(dplyr)
#Data
df <- structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
The code:
df %>% group_by(id) %>% mutate(M=mean(V1),Min=min(V1),Max=max(V1),SD=sd(V1))
Output:
# A tibble: 5 x 7
# Groups: id [2]
id visit V1 M Min Max SD
<int> <int> <int> <dbl> <int> <int> <dbl>
1 1 1 18 21 18 24 4.24
2 1 2 24 21 18 24 4.24
3 2 2 3 4.67 3 6 1.53
4 2 3 5 4.67 3 6 1.53
5 2 4 6 4.67 3 6 1.53
Maybe you want something like below
transform(df,
M = ave(V1, id, FUN = mean),
MIN = ave(V1, id, FUN = min),
MAX = ave(V1, id, FUN = max)
)
which gives
id visit V1 M MIN MAX
1 1 1 18 21.000000 18 24
2 1 2 24 21.000000 18 24
3 2 2 3 4.666667 3 6
4 2 3 5 4.666667 3 6
5 2 4 6 4.666667 3 6
Data
> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
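If V1 contains missing values, as in the question, the ave() version needs the NA handling inside FUN. A sketch under that assumption, with an NA placed in the data for illustration (the name `out` is illustrative):

```r
df <- data.frame(
  id    = c(1L, 1L, 2L, 2L, 2L),
  visit = c(1L, 2L, 2L, 3L, 4L),
  V1    = c(18, 24, NA, 5, 6)
)

# Wrap each summary in an anonymous function so na.rm can be passed along
out <- transform(df,
  M   = ave(V1, id, FUN = function(x) mean(x, na.rm = TRUE)),
  MIN = ave(V1, id, FUN = function(x) min(x, na.rm = TRUE)),
  MAX = ave(V1, id, FUN = function(x) max(x, na.rm = TRUE))
)
out
```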

Getting rowSums for triplicate records and retaining only the one with highest value

I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, retaining only the one replicate per animal with the highest rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want to know is how to keep only rows 2 and 6 (which have the highest rowSums among the triplicates per animal), but for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can create the row sums separately and use them to find the row with the maximum row sum in each group by using ave. Then use the resulting logical vector to subset the rows of the dataset.
nm1 <- startsWith(names(df1), "V")
OP updated the column names. In that case, either an index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
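The intermediate vectors make the logic easier to follow. The sketch below uses the column names from the updated question (Acet, Cell, and so on); the names `value_cols`, `row_total`, `group_max`, and `kept` are illustrative.

```r
df1 <- data.frame(
  ID   = c(4L, 4L, 4L, 12L, 12L, 12L),
  Trip = c("A", "B", "C", "A", "B", "C"),
  Acet = c(2L, 9L, 1L, 4L, 6L, 5L),
  Cell = c(4L, 3L, 2L, 6L, 8L, 5L),
  Fibe = c(9L, 7L, 4L, 7L, 1L, 7L),
  Mega = c(8L, 5L, 8L, 2L, 1L, 3L),
  Tera = c(3L, 5L, 6L, 3L, 2L, 3L)
)

value_cols <- setdiff(names(df1), c("ID", "Trip"))

# One total per row: 26 29 21 22 18 23
row_total <- rowSums(df1[value_cols], na.rm = TRUE)

# ave() repeats each group's maximum for every row of that group: 29 29 29 23 23 23
group_max <- ave(row_total, df1$ID, FUN = max)

# Keep the rows whose own total equals their group's maximum
kept <- df1[row_total == group_max, ]
kept
```

Note that if two replicates of the same animal tie for the maximum, both rows are kept by this comparison.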
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is preserved per ID, even if multiple rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
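In dplyr 1.0.0 and later, the arrange-then-slice(1) pattern can probably be shortened with slice_max(). This is a sketch rather than part of the original answer; `dat4` is an illustrative name, and with_ties = FALSE is what guarantees a single row per ID.

```r
library(dplyr)

dat <- data.frame(
  ID   = c(4L, 4L, 4L, 12L, 12L, 12L),
  Trip = c("A", "B", "C", "A", "B", "C"),
  V1   = c(2L, 9L, 1L, 4L, 6L, 5L),
  V2   = c(4L, 3L, 2L, 6L, 8L, 5L),
  V3   = c(9L, 7L, 4L, 7L, 1L, 7L),
  V4   = c(8L, 5L, 8L, 2L, 1L, 3L),
  V5   = c(3L, 5L, 6L, 3L, 2L, 3L)
)

dat4 <- dat %>%
  mutate(Sum = rowSums(across(starts_with("V")))) %>%
  group_by(ID) %>%
  # Keep exactly one row per ID, the one with the largest Sum
  slice_max(Sum, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-Sum)
dat4
```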

Aggregate by one variable but adding other variables [duplicate]

I have a data.frame with this structure:
id time var1 var2 var3
1 2 4 5 6
1 4 8 51 7
1 1 9 17 38
2 12 8 9 21
2 15 25 6 23
For each id, I want the row that contains the minimum time. In the example it would be this:
id time var1 var2 var3
1 1 9 17 38
2 12 8 9 21
I think that the aggregate function would be useful, but I'm not sure how to use it.
Your title may be misleading, since you really just want to keep the row with the minimum time for every id. Try this:
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, time) %>%
filter(row_number() == 1)
We can use by, do.call, and the ever useful which.min function to get what we need:
do.call('rbind', by(df, df$id, function(x) x[which.min(x$time), ]))
# id time var1 var2 var3
# 1 1 1 9 17 38
# 2 2 12 8 9 21
And if you suspect there may be more than one minimum value per id, you can eschew the which.min function and use which(x$time == min(x$time)):
do.call('rbind', by(df, df$id, function(x) x[which(x$time == min(x$time)), ]))
# id time var1 var2 var3
# 1 1 1 9 17 38
# 2 2 12 8 9 21
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L),
time = c(2L, 4L, 1L, 12L, 15L),
var1 = c(4L, 8L, 9L, 8L, 25L),
var2 = c(5L, 51L, 17L, 9L, 6L),
var3 = c(6L, 7L, 38L, 21L, 23L)),
.Names = c("id", "time", "var1", "var2", "var3"),
class = "data.frame", row.names = c(NA, -5L))
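Another base R sketch that avoids by() and do.call(): sort by id and time, then keep the first row seen for each id. The names `sorted` and `first_rows` are illustrative. Like which.min, this keeps a single row per id even when the minimum time is tied.

```r
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L),
  time = c(2L, 4L, 1L, 12L, 15L),
  var1 = c(4L, 8L, 9L, 8L, 25L),
  var2 = c(5L, 51L, 17L, 9L, 6L),
  var3 = c(6L, 7L, 38L, 21L, 23L)
)

# Smallest time comes first within each id
sorted <- df[order(df$id, df$time), ]

# duplicated() is FALSE only for the first occurrence of each id
first_rows <- sorted[!duplicated(sorted$id), ]
first_rows
```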
dplyr using the function slice
library(dplyr)
df %>%
group_by(id) %>%
slice(which.min(time))
Output:
Source: local data frame [2 x 5]
Groups: id [2]
id time var1 var2 var3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 9 17 38
2 2 12 8 9 21
sqldf
library(sqldf)
sqldf('SELECT id, MIN(time) time, var1, var2, var3
FROM df
GROUP BY id')
Output:
id time var1 var2 var3
1 1 1 9 17 38
2 2 12 8 9 21

Complex data frame transposition in R

I've tried searching for an answer for this, but most data.frame/matrix transpositions aren't as complicated as what I am trying to accomplish. Basically I have a data.frame which looks like
F M A
2008_b 1 5 6
2008_r 3 3 6
2008_a 4 1 5
2009_b 1 1 2
2009_r 5 4 9
2009_a 2 2 4
I'm trying to transpose it and rename the column and row names as such:
F_b M_b A_b F_r M_r A_r F_a M_a A_a
2008 1 5 6 3 3 6 4 1 5
2009 1 1 2 5 4 9 2 2 4
Essentially every three rows are being collapsed into a single row. I assume this can be done with some clever plyr or reshape2 commands, but I'm at a total loss as to how to accomplish it.
You could try
library(dplyr)
library(tidyr)
lvl <- c(outer(colnames(df), unique(gsub(".*_", "", rownames(df))),
FUN=paste, sep="_"))
res <- cbind(Var1=row.names(df), df) %>%
gather(Var2, value, -Var1) %>%
separate(Var1, c('Var11', 'Var12')) %>%
unite(VarN, Var2, Var12) %>%
mutate(VarN=factor(VarN, levels=lvl)) %>%
spread(VarN, value)
row.names(res) <- res[,1]
res1 <- res[,-1]
res1
# F_b M_b A_b F_r M_r A_r F_a M_a A_a
#2008 1 5 6 3 3 6 4 1 5
#2009 1 1 2 5 4 9 2 2 4
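For comparison, a base R sketch of the same reshaping that needs no packages. It assumes every year has the same suffixes in the same row order (b, r, a here); the names `year`, `type`, and `res` are illustrative, and the result is a matrix rather than a data.frame.

```r
df <- data.frame(
  F = c(1L, 3L, 4L, 1L, 5L, 2L),
  M = c(5L, 3L, 1L, 1L, 4L, 2L),
  A = c(6L, 6L, 5L, 2L, 9L, 4L),
  row.names = c("2008_b", "2008_r", "2008_a", "2009_b", "2009_r", "2009_a")
)

# Split the row names into the part before and after the underscore
year <- sub("_.*", "", rownames(df))
type <- sub(".*_", "", rownames(df))

# For each year, flatten its block row by row: F_b M_b A_b F_r M_r A_r ...
res <- t(sapply(split(df, year), function(block) c(t(as.matrix(block)))))

# Build the column names in the same order as the flattened values
colnames(res) <- c(outer(names(df), unique(type), paste, sep = "_"))
res
```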
data
df <- structure(list(F = c(1L, 3L, 4L, 1L, 5L, 2L), M = c(5L, 3L, 1L,
1L, 4L, 2L), A = c(6L, 6L, 5L, 2L, 9L, 4L)), .Names = c("F",
"M", "A"), class = "data.frame", row.names = c("2008_b", "2008_r",
"2008_a", "2009_b", "2009_r", "2009_a"))
