Summarize values between two rows, according to criteria

I have a data frame in which the values in the 'Age' columns need to be summed over each whole-number Size range.
At the moment the data frame looks like this:
Size Age 1 Age 2 Age 3
[1] 8 2 8 5
[2] 8.5 4 7 9
[3] 9 1 11 45
[4] 9.5 3 2 0
But I want this:
Size Age 1 Age 2 Age 3
[1+2] 8 6 15 14
[3+4] 9 4 13 45
Which function is best to use in R?
I thought about using rowwise() together with mutate(), but I haven't tried it and I don't know how to set the criteria.
Thank you in advance for the help :)

You can do this quite easily with the dplyr package. (You may need to run install.packages("dplyr") first if you haven't already.)
Group by a recomputed Size column, replacing the existing Size values with values rounded down to the nearest whole number. Then summarise across all the columns whose names start with "Age" and sum up the values.
library(dplyr)
my_df |>
  # overwrite Size with its floor so that 8 and 8.5 fall in the same group
  group_by(Size = floor(Size)) |>
  summarise(
    # sum every column whose name starts with "Age" within each group
    across(starts_with("Age"), sum)
  )
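For readers without dplyr, the same grouping can be sketched in base R with aggregate(). The data frame below is a hypothetical reconstruction of the question's table, and the column names Age1 to Age3 (without spaces) are an assumption.

```r
# Hypothetical reconstruction of the question's data (column names assumed)
my_df <- data.frame(
  Size = c(8, 8.5, 9, 9.5),
  Age1 = c(2, 4, 1, 3),
  Age2 = c(8, 7, 11, 2),
  Age3 = c(5, 9, 45, 0)
)

# Round Size down so that 8 and 8.5 (and 9 and 9.5) share a group
my_df$Size <- floor(my_df$Size)

# Sum every remaining column within each Size group
result <- aggregate(. ~ Size, data = my_df, FUN = sum)
print(result)
```

The formula . ~ Size means "every other column, grouped by Size", so no Age column has to be listed by name.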

Related

How to merge data to apply to all unique conditions of a column in second data set, even when not occurring

I am trying to insert new rows of data based on unique values of a column in my original data set. I have the following dummy data set:
sites<-c("10","10","11","11","12","12")
ID<-c("A","A","B","B","C","D")
value<-c("4","6","5","2","7","8")
dataframe<-data.frame(sites, ID, value)
sites<-c("10","10","11","11","12","12","13","14","15")
dataframe2<-data.frame(sites)
Producing:
sites ID value
10 A 4
10 A 6
11 B 5
11 B 2
12 C 7
12 D 8
sites
10
10
11
11
12
12
13
14
15
For each unique value in column ID, I would like every site number from the second data frame to be listed, and where there is no value I would like it to show 0.
So, for example, ID A would have all sites from dataframe2 listed, and where there is no value (i.e. for sites 11, 12, 13, 14 and 15) I would like value to be 0.
I have tried the following:
mergeddata<-merge(dataframe, dataframe2, by="sites", all.y=TRUE)
But that only adds the new sites at the bottom, with NA in every column other than sites. I want dataframe2 to be applied to each unique value in column ID, so that each ID has an occurrence of every site. I'm not sure what the best way to go about this would be; any help is much appreciated!

This could be a job for complete() from package tidyr. You can group your first dataset by ID and then use complete() to add rows for the site values from dataframe2 within each group.
This results in having at least one row for each site in each ID. I use the fill argument to add the 0 to value for the new rows (after converting value to numeric).
library(dplyr)
library(tidyr)
dataframe$value = as.numeric(as.character(dataframe$value))
dataframe %>%
  group_by(ID) %>%
  complete(sites = dataframe2$sites, fill = list(value = 0))
# A tibble: 26 x 3
# Groups: ID [4]
ID sites value
<fct> <chr> <dbl>
1 A 10 4
2 A 10 6
3 A 11 0
4 A 12 0
5 A 13 0
6 A 14 0
7 A 15 0
8 B 10 0
9 B 11 5
10 B 11 2
# ... with 16 more rows
Warning message:
Column `sites` joining factors with different levels, coercing to character vector
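The warning message arises because sites is a factor with different levels in the two datasets; complete() deals with this by converting the two columns to character instead. (From R 4.0.0 onward, data.frame() no longer converts strings to factors by default, so the warning should not appear there.)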
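Since the question started from merge(), here is a minimal base R sketch of the same idea: build every ID/site combination with expand.grid(), left-join the observed rows onto it, and replace the resulting NA values with 0. The names all_sites, grid and merged are illustrative and not from the original post.

```r
# Data from the question, with value stored as numeric from the start
dataframe <- data.frame(
  sites = c("10", "10", "11", "11", "12", "12"),
  ID    = c("A", "A", "B", "B", "C", "D"),
  value = c(4, 6, 5, 2, 7, 8),
  stringsAsFactors = FALSE
)
all_sites <- c("10", "11", "12", "13", "14", "15")

# Every combination of ID and site, whether or not it was observed
grid <- expand.grid(
  ID = unique(dataframe$ID),
  sites = all_sites,
  stringsAsFactors = FALSE
)

# Keep all combinations; unobserved ones get NA, which is then set to 0
merged <- merge(grid, dataframe, by = c("ID", "sites"), all.x = TRUE)
merged$value[is.na(merged$value)] <- 0
print(merged)
```

Repeated observations (such as the two rows for ID A at site 10) are kept, so the result has the same 26 rows as the complete() output.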

Creating sum of many columns from names in one column

I have a large data frame where I have one column (Phylum) that has repeated names and 253 other columns (each with a unique name) that have counts of the Phylum column. I would like to sum the counts within each column that correspond to each Phylum.
This is a simplified version of what my data look like:
Phylum sample1 sample2 sample3 ... sample253
1 P1 2 3 5 5
2 P1 2 2 10 2
3 P2 1 0 0 1
4 P3 10 12 3 1
5 P3 5 7 14 15
I have seen similar questions, but they are for fewer columns, where you can just list the names of the columns you want summed. I don't want to enter 253 unique column names.
I would like my results to look like this
Phylum sample1 sample2 sample3 ... sample253
1 P1 4 5 15 7
2 P2 1 0 0 1
3 P3 15 19 17 16
I would appreciate any help. Sorry for the format of the question, this is my first time asking for help on stackoverflow (rather than sleuthing).

If your starting file looks like this (test.csv):
Phylum,sample1,sample2,sample3,sample253
P1,2,3,5,5
P1,2,2,10,2
P2,1,0,0,1
P3,10,12,3,1
P3,5,7,14,15
Then you can use group_by and summarise_each from dplyr:
read_csv('test.csv') %>%
  group_by(Phylum) %>%
  summarise_each(funs(sum))
(I first loaded tidyverse with library(tidyverse).)
Note that if you only need this for one column, you can simply use summarise:
read_csv('test.csv') %>%
  group_by(Phylum) %>%
  summarise(sum(sample1))
summarise_each is needed to apply that function (funs(sum) in the code above) to every column. Both summarise_each and funs() have since been deprecated in dplyr; in current versions the equivalent call is summarise(across(everything(), sum)).
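The same per-group column sums can also be sketched in base R, without naming any of the sample columns. The data frame below simply retypes the test.csv rows shown above; the object names phyla and totals are illustrative.

```r
# Same rows as test.csv above, typed in directly
phyla <- data.frame(
  Phylum    = c("P1", "P1", "P2", "P3", "P3"),
  sample1   = c(2, 2, 1, 10, 5),
  sample2   = c(3, 2, 0, 12, 7),
  sample3   = c(5, 10, 0, 3, 14),
  sample253 = c(5, 2, 1, 1, 15)
)

# ". ~ Phylum" sums every other column within each Phylum
totals <- aggregate(. ~ Phylum, data = phyla, FUN = sum)
print(totals)
```

Because the formula uses a dot for "all other columns", this scales to 253 sample columns unchanged.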

Sum count data in a dataframe based on the size of an associated numeric variable

I have a data frame with data as follows (although my data set is much bigger)
ID Count Size
1 1 35
1 2 42
1 2 56
2 3 25
2 5 52
2 2 62
etc....
I would like to extract the total count for each ID, but split by whether the Size variable is <50 or >=50.
So far I have done this to get the total count for each unique ID:
aggregate(Count~ID, sum, data=df)
To produce this
ID Count
1 5
2 10
But I want to produce something like this instead
ID <50 >=50
1 3 2
2 3 7
I've tried searching on how best to do this and am sure it is really simple but I'm struggling on how best to achieve this...any help would be great thanks!

We could use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1), then, grouped by 'ID', take the sum of 'Count' for each of the logical indexes (Size < 50 and Size >= 50).
library(data.table)
setDT(df1)[, list(`<50` = sum(Count[Size < 50]),
                  `>=50` = sum(Count[Size >= 50])), by = ID]
# ID <50 >=50
#1: 1 3 2
#2: 2 3 7
A similar option with dplyr is
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise(`<50` = sum(Count[Size < 50]),
            `>=50` = sum(Count[Size >= 50]))
NOTE: It is better to name the columns less50 and greaterthanEq50 instead of the names suggested in the expected output, because non-syntactic names such as `<50` have to be wrapped in backticks every time they are used.

Continuing your idea, you can aggregate on df[df$Size<50,] instead of df, do the same for Size >= 50, and then merge the two results.
d1 = aggregate(Count~ID,sum,data=df[df$Size<50,])
d2 = aggregate(Count~ID,sum,data=df[df$Size>=50,])
merge(d1,d2,by="ID",all=TRUE)
This just builds on what you already did, so it is probably not the best approach. Note that both count columns are called Count, so merge() labels them Count.x (<50) and Count.y (>=50).
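Another base R option, sketched here as an alternative rather than as what either answer proposed, is to label each row with its size class and cross-tabulate with xtabs(). The column name SizeClass is made up for this example; the class labels follow the syntactic names recommended in the note above.

```r
# Data from the question
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2),
  Count = c(1, 2, 2, 3, 5, 2),
  Size  = c(35, 42, 56, 25, 52, 62)
)

# Label each row by size class; the factor levels fix the column order
df$SizeClass <- factor(
  ifelse(df$Size < 50, "less50", "greaterthanEq50"),
  levels = c("less50", "greaterthanEq50")
)

# Sum Count for every ID / SizeClass combination, then convert to a data frame
totals <- as.data.frame.matrix(xtabs(Count ~ ID + SizeClass, data = df))
print(totals)
```

The IDs end up as row names of totals. Combinations that never occur come out as 0 rather than NA, which differs from the merge() approach.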

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example:
Suppose I use a dataset with 10 observations and 2 variables and I want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4

df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows get tagged with consecutive group numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow(df) is divided by the group size (0 means the last group is complete):
nrow(df) %% 3 # change the divisor to your group size

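Assuming your data is df, you can do the following. rep() builds enough complete groups to cover all rows, and the [1:nrow(df)] subset trims the excess when the row count is not a multiple of 3:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]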
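Both parts of the question (labelling the groups and checking the last one) can be combined in a short sketch. The variable names group_size, last_group_n and is_incomplete are illustrative; the data are the ten rows shown in the question.

```r
# The ten rows from the question
df <- data.frame(
  speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
  dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17)
)
group_size <- 3

# Row i belongs to group ceiling(i / group_size)
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Count the rows in the final group and compare with the target size
last_group_n  <- sum(df$newcol == max(df$newcol))
is_incomplete <- last_group_n < group_size

print(df)
print(last_group_n)
print(is_incomplete)
```

Dividing the row index by the group size avoids building an over-long vector and trimming it, and works for any number of rows.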

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!

tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
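If your data is truly a matrix and not a data.frame, dat$ID does not work and the rows usually have no names to index by. Assuming the matrix has a column named ID, you can work around this with:
dat[as.integer(tapply(as.character(seq(nrow(dat))),dat[,"ID"],FUN=sample,1)),]
Don't be tempted to remove the as.character, as sample will give unintended results when a single number is passed to it: sample(4, 1) draws from 1:4 rather than always returning 4. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4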
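The single-value pitfall of sample() can also be sidestepped by sampling a position within each group with sample.int() instead of sampling the row numbers themselves. This is a sketch of that alternative, not the answer's own method; rows_by_id and picked_rows are made-up names and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only so that the sketch is reproducible

dat <- data.frame(
  ID    = c(1, 2, 2, 3, 4, 4),
  Value = c(10, 5, 8, 15, 7, 9)
)

# Row numbers belonging to each ID
rows_by_id <- split(seq_len(nrow(dat)), dat$ID)

# Pick one position per group; sample.int(n, 1) is safe even when n is 1
picked_rows <- vapply(
  rows_by_id,
  function(rows) rows[sample.int(length(rows), 1)],
  integer(1)
)

picked <- dat[picked_rows, ]
print(picked)
```

Because the sampled value is an index into rows rather than a row number, a group with a single row always returns that row.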

You can do that with dplyr like so (from dplyr 1.0.0 onward, slice_sample(n = 1) supersedes sample_n(1)):
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)

The idea is to reorder the rows randomly and then remove duplicates in that order, so that the first row kept for each ID is a random one.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]