Creating sum of many columns from names in one column [duplicate] - r

This question already has answers here:
Aggregate multiple columns at once [duplicate]
(2 answers)
Aggregating rows for multiple columns in R [duplicate]
(3 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I have a large data frame with one column (Phylum) that contains repeated names and 253 other columns (each with a unique name) that contain counts for each Phylum entry. I would like to sum the counts within each column for each Phylum.
This is a simplified version of what my data look like:
Phylum sample1 sample2 sample3 ... sample253
1 P1 2 3 5 5
2 P1 2 2 10 2
3 P2 1 0 0 1
4 P3 10 12 3 1
5 P3 5 7 14 15
I have seen similar questions, but they are for fewer columns, where you can just list the names of the columns you want summed. I don't want to enter 253 unique column names.
I would like my results to look like this
Phylum sample1 sample2 sample3 ... sample253
1 P1 4 5 15 7
2 P2 1 0 0 1
3 P3 15 19 17 16
I would appreciate any help. Sorry for the format of the question; this is my first time asking for help on Stack Overflow (rather than sleuthing).

If your starting file looks like this (test.csv):
Phylum,sample1,sample2,sample3,sample253
P1,2,3,5,5
P1,2,2,10,2
P2,1,0,0,1
P3,10,12,3,1
P3,5,7,14,15
Then you can use group_by and summarise_each from dplyr:
read_csv('test.csv') %>%
  group_by(Phylum) %>%
  summarise_each(funs(sum))
(I first loaded tidyverse with library(tidyverse).)
Note that if you only wanted to do this for one column, you could simply use summarise:
read_csv('test.csv') %>%
  group_by(Phylum) %>%
  summarise(sum(sample1))
summarise_each is needed to apply that function (here, funs(sum)) to every non-grouping column.
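In dplyr 1.0.0 and later, summarise_each() and funs() are deprecated in favour of across(). The following is a minimal sketch of the same aggregation in that style, assuming dplyr >= 1.0.0. The example rows are built inline instead of being read from test.csv, and the names `counts` and `phylum_sums` are only illustrative.

```r
library(dplyr)

# Inline copy of the example data; `counts` is a hypothetical name.
counts <- data.frame(
  Phylum    = c("P1", "P1", "P2", "P3", "P3"),
  sample1   = c(2, 2, 1, 10, 5),
  sample2   = c(3, 2, 0, 12, 7),
  sample3   = c(5, 10, 0, 3, 14),
  sample253 = c(5, 2, 1, 1, 15)
)

# Sum every non-grouping column within each Phylum.
# everything() skips the grouping column, so no sample names are typed out.
phylum_sums <- counts %>%
  group_by(Phylum) %>%
  summarise(across(everything(), sum))

phylum_sums
```

Because the columns are selected with everything(), the same call should work unchanged for all 253 sample columns.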

Related

Summarize values between two rows, according to criteria

I have a data frame where the values in the 'Age' columns need to be summed over each whole-number size range.
At the moment the data frame looks like this:
Size Age 1 Age 2 Age 3
[1] 8 2 8 5
[2] 8.5 4 7 9
[3] 9 1 11 45
[4] 9.5 3 2 0
But I want this:
Size Age 1 Age 2 Age 3
[1+2] 8 6 15 14
[3+4] 9 4 13 45
Which function is best suited for this in R?
I thought of using rowwise() together with mutate(), but I haven't tried it and I don't know how to set the criteria.
Thank you in advance for the help :)
You can do this quite easily with the dplyr library. (You may need to install.packages("dplyr") if you haven't already.)
Using dplyr functions, we can group by a new grouping column, size, replacing the existing size column with values that have been rounded down to the nearest whole number. Then we just summarise across all the columns that starts_with "Age" and sum up the values.
library(dplyr)
my_df |>
  group_by(size = floor(size)) |>
  summarise(
    across(starts_with("Age"), sum)
  )
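To try this on the four rows shown in the question, the data can be entered by hand. This is a sketch under two assumptions: that the size column is named `size` in lower case, as in the code above, and that the age columns are named "Age 1" to "Age 3". If the real data frame uses different names (for example `Size`), the code needs adjusting, since R is case-sensitive. The name `size_sums` is illustrative.

```r
library(dplyr)

# Hand-entered copy of the example rows; check.names = FALSE keeps the
# spaces in the "Age 1" ... "Age 3" column names.
my_df <- data.frame(
  size    = c(8, 8.5, 9, 9.5),
  `Age 1` = c(2, 4, 1, 3),
  `Age 2` = c(8, 7, 11, 2),
  `Age 3` = c(5, 9, 45, 0),
  check.names = FALSE
)

# floor() maps 8 and 8.5 to 8, and 9 and 9.5 to 9, which defines the groups.
size_sums <- my_df |>
  group_by(size = floor(size)) |>
  summarise(across(starts_with("Age"), sum))

size_sums
```

The native pipe `|>` needs R 4.1 or later; on older versions `%>%` from dplyr can be used in its place.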

Is there a way in R to make all possible combinations between rows of different columns? [duplicate]

This question already has answers here:
Unique combination of all elements from two (or more) vectors
(6 answers)
Generate list of all possible combinations of elements of vector
(10 answers)
Closed 2 years ago.
I have a df with one column, and I would like to build all combinations of the values in this column so that I get a new df with two columns, like the simple example below. (Note: my df has ~5000 rows.)
df
CG
1
2
3
##I would like a result similar to this:
> head(df1)
C1 C2
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Could someone help me?
Thank you in advance.
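One possible approach, sketched here with base R only, is expand.grid(), which builds every combination of the vectors it is given. The column names C1 and C2 follow the desired output in the question; the rest is an assumption about how the real data is laid out.

```r
# Example data from the question.
df <- data.frame(CG = c(1, 2, 3))

# expand.grid() varies its first argument fastest, so C2 is passed first
# and the columns are reordered afterwards to match the desired layout.
df1 <- expand.grid(C2 = df$CG, C1 = df$CG)[, c("C1", "C2")]

head(df1)
```

With about 5000 values the result has 5000 x 5000 = 25 million rows, so memory use is worth keeping in mind.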

identifying unique values of a grouped variable [duplicate]

This question already has answers here:
How to count the number of unique values by group? [duplicate]
(1 answer)
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 2 years ago.
I am trying to count the # of unique date values across multiple visits. Here is sample data:
id date
1 2017-08-31
1 2017-08-31
1 2017-05-06
2 2015-09-01
2 2015-11-01
3 2010-12-02
3 2010-12-02
I want a df that shows how many unique dates there are per participant. Something like this:
id total_visit
1 2
2 2
3 1
I tried this code, but it's not doing what I want it to do.
library(tidyverse)
df1 <- df %>% group_by(id) %>% count(distinct(date))
Can someone please help?
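The attempt above fails because distinct() is a verb that operates on a whole data frame, not on a single column inside count(). A minimal sketch of one way to get the desired table uses n_distinct() inside summarise(); the sample rows are re-entered by hand here, and the dates are assumed to be stored as Date values (the approach should work the same for character dates).

```r
library(dplyr)

# Hand-entered copy of the sample data.
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3),
  date = as.Date(c("2017-08-31", "2017-08-31", "2017-05-06",
                   "2015-09-01", "2015-11-01",
                   "2010-12-02", "2010-12-02"))
)

# n_distinct() counts the unique values of date within each id group.
df1 <- df %>%
  group_by(id) %>%
  summarise(total_visit = n_distinct(date))

df1
```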

Modify DataFrame, remove double Data with for each, R [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
(6 answers)
Closed 2 years ago.
I'm trying to modify a data frame because it contains duplicate values.
Data Frame:
Id Name Account
1 X 1
1 Y 2
1 Z 3
2 J 1
2 T 4
3 O 2
So when there are multiple rows with same Id I just want to keep the last row.
The desired output would be
Id Name Account
1 Z 3
2 T 4
3 O 2
This is my current code:
for (i in 1:(nrow(mylist)-1)) {
  if(mylist$Id[c(i)] == mylist$Id[c(i+1)]){
    mylist <- mylist[-c(i), ]
  }
}
I run into problems when a row is removed, because all the following rows get a lower index and the loop skips rows in the next step.
You can do this easily with the dplyr package:
library(dplyr)
mylist %>%
  group_by(Id) %>%
  slice(n()) %>%
  ungroup()
First you group_by the Id column. Afterwards you select only the last entry (slice(n())) of each group.
One option in base R, which relies on the rows already being sorted by Id, is:
mylist[cumsum(sapply(split(mylist,mylist$Id),nrow)),]
Id Name Account
3 1 Z 3
5 2 T 4
6 3 O 2
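Another base R sketch, offered as an alternative and not as part of the answers above, uses duplicated() with fromLast = TRUE. It does not depend on the rows being sorted by Id. The data frame is re-entered by hand, and `last_rows` is an illustrative name.

```r
# Hand-entered copy of the example data.
mylist <- data.frame(
  Id      = c(1, 1, 1, 2, 2, 3),
  Name    = c("X", "Y", "Z", "J", "T", "O"),
  Account = c(1, 2, 3, 1, 4, 2)
)

# duplicated(..., fromLast = TRUE) flags every row except the last
# occurrence of each Id, so negating it keeps only the last rows.
last_rows <- mylist[!duplicated(mylist$Id, fromLast = TRUE), ]

last_rows
```

Since this is a single vectorised subsetting step, it also avoids the index-shifting problem of deleting rows inside a for loop.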

How to sum a specific column of replicate rows in dataframe? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
How to group by two columns in R
(4 answers)
Closed 3 years ago.
I have a data frame that contains a lot of replicated rows. I would like to sum the last column across the replicated rows and remove the duplicates at the same time. Could anyone tell me how to do that?
The example is here:
name <- c("a","b","c","a","c")
position <- c(192,7,6,192,99)
score <- c(1,2,3,2,5)
df <- data.frame(name,position,score)
> df
name position score
1 a 192 1
2 b 7 2
3 c 6 3
4 a 192 2
5 c 99 5
#I would like to sum the score together if the first two columns are the
#same. The ideal result is like this way
name position score
1 a 192 3
2 b 7 2
3 c 6 3
4 c 99 5
Sincerely thank you for the help.
Try this:
library(dplyr)
df %>%
  group_by(name, position) %>%
  summarise(score = sum(score, na.rm = TRUE))
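For comparison, the same grouping can be sketched in base R with aggregate() and a formula, which needs no extra packages. The name `score_sums` is illustrative, and the row order of the result may differ from the order shown in the question.

```r
# Example data from the question.
df <- data.frame(
  name     = c("a", "b", "c", "a", "c"),
  position = c(192, 7, 6, 192, 99),
  score    = c(1, 2, 3, 2, 5)
)

# Sum score within each distinct (name, position) pair.
score_sums <- aggregate(score ~ name + position, data = df, FUN = sum)

score_sums
```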
