how to merge rows in a dataframe [duplicate] - r

This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Sum multiple variables by group [duplicate]
(2 answers)
Closed 2 years ago.
I have a data frame:
df1 <- data.frame(col1=c("x","y","y","z","z","z"), col2=c(0,1,0,0,0,1), col3=c(0,0,1,0,0,0), col4=c(1,0,0,0,1,0), col5=c(0,1,0,0,0,0))
I want to collapse it to one row per value of col1, like this:
df2 <- data.frame(col1=c("x","y","z"), col2=c(0,1,1), col3=c(0,1,0),col4=c(1,0,1), col5=c(0,1,0))
Could anyone help me, please? Thank you

A solution using dplyr. The idea is to group by col1 and sum all the other columns within each group.
library(dplyr)
df <- df1 %>%
group_by(col1) %>%
summarize_all(~sum(.)) %>%
ungroup()
df
# # A tibble: 3 x 5
# col1 col2 col3 col4 col5
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 x 0 0 1 0
# 2 y 1 1 0 1
# 3 z 1 0 1 0
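For comparison, the same group-and-sum can be sketched in base R with aggregate(), which needs no extra packages. This is only an alternative illustration, not part of the original answer; the name res is made up for the example.

```r
# Same data as in the question
df1 <- data.frame(col1 = c("x", "y", "y", "z", "z", "z"),
                  col2 = c(0, 1, 0, 0, 0, 1),
                  col3 = c(0, 0, 1, 0, 0, 0),
                  col4 = c(1, 0, 0, 0, 1, 0),
                  col5 = c(0, 1, 0, 0, 0, 0))

# "." on the left-hand side means "every column not used on the right",
# so col2..col5 are each summed within the groups defined by col1
res <- aggregate(. ~ col1, data = df1, FUN = sum)
res
```

Because each indicator is 1 at most once per group here, the sums stay 0 or 1 and the result matches the df2 asked for in the question.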

Related

Display sum after each row r [duplicate]

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Closed 2 years ago.
library(tibble)
df <- tibble(value = c(1, 2, 3))
df
# A tibble: 3 x 1
value
<dbl>
1 1
2 2
3 3
The goal is this:
# A tibble: 3 x 2
value Sum
<dbl> <dbl>
1 1 1
2 2 3
3 3 6
So just display the sum after each row. 1 = 1. 1+2 = 3. 3+3 = 6, and so on. I guess it's kinda easy, maybe with rowSums?
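This is a cumulative sum. rowSums adds across the columns of each row, so it is not the right tool here; base R has cumsum for running totals down a vector: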
df$Sum <- cumsum(df$value)
We could do the same while constructing the tibble:
library(tibble)
df <- tibble(value = 1:3, Sum = cumsum(value))
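To make the running total concrete, here is a small base R sketch that computes it two ways: with cumsum() directly, and with Reduce() accumulating the partial sums step by step. The variable names are only for illustration.

```r
value <- c(1, 2, 3)

# Built-in cumulative sum: 1, 1 + 2, 1 + 2 + 3
running <- cumsum(value)

# The same thing spelled out: keep every intermediate result of `+`
running_reduce <- Reduce(`+`, value, accumulate = TRUE)

running         # 1 3 6
running_reduce  # 1 3 6
```

In practice cumsum() is the one to use; the Reduce() form is there only to show what is being computed.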

Add columns using tidyverse language, but use column numbers instead of column names

library(tidyverse)
df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))
# # A tibble: 2 x 3
# col1 col2 col3
# <dbl> <dbl> <dbl>
# 1 5 6 9
# 2 2 4 9
I need to add columns 1 and 3. But the column names often change. So I can only use column numbers as opposed to the actual column name.
Attempt 1 works as expected.
Attempt 2 and 3 don't work.
What's wrong with my syntax? I can't use attempt 1 because next month the column names may be something else, but their relative positions will remain the same.
df %>% mutate(col4 = col1 + col3) # attempt 1
df %>% mutate(col4 = .[, 1] + .[, 3]) # attempt 2
df %>% {mutate(col4 = .[, 1] + .[, 3])} # attempt 3
If the selection is based on position, subset the columns by index and use rowSums. An advantage is that NA elements (if any) can be handled with na.rm:
df %>%
mutate(col4 = rowSums(.[c(1, 3)], na.rm = TRUE))
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 14
#2 2 4 9 11
As for the problem in the OP's attempts: to extract a single column as a vector we need [[ instead of [. With a tibble, df[, 1] or .[, 1] returns a one-column tibble rather than dropping to a vector, which is the behavior we might expect from a base data.frame.
df %>%
mutate(col4 = .[[1]] + .[[3]])
# A tibble: 2 x 4
# col1 col2 col3 col4
# <dbl> <dbl> <dbl> <dbl>
#1 5 6 9 14
#2 2 4 9 11
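The difference between [ and [[ described above can be checked directly. The following is a minimal sketch using the same tibble; the helper name total is made up for the example.

```r
library(tibble)

df <- tibble(col1 = c(5, 2), col2 = c(6, 4), col3 = c(9, 9))

# Single-bracket indexing keeps the tibble structure
is.data.frame(df[, 1])                 # TRUE: still a one-column tibble

# A base data.frame drops to a plain vector for the same call
is.numeric(as.data.frame(df)[, 1])     # TRUE

# Double-bracket indexing always extracts the column itself
is.numeric(df[[1]])                    # TRUE

# So position-based arithmetic works on the extracted vectors
total <- df[[1]] + df[[3]]
total                                  # 14 11
```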

From comma separate text to vector [duplicate]

This question already has answers here:
Dummify character column and find unique values [duplicate]
(7 answers)
Closed 3 years ago.
I have a data frame in which one column holds comma-separated values:
dframe = data.frame(id=c(1,2,43,53), title=c("text1,color","color,text2","text2","text3,text2"))
I want to convert it to one 0/1 indicator column per distinct value, showing whether that value is present in each row, as in this expected output:
dframe = data.frame(id=c(1,2,43,53), text1=c(1,0,0,0), color=c(1,1,0,0), text2=c(0,1,1,1), text3=c(0,0,0,1))
We can use separate_rows and spread from tidyverse:
library(tidyverse)
dframe %>%
separate_rows(title, sep = ",") %>%
mutate(id2 = 1) %>%
spread(title, id2, fill = 0)
Output:
# A tibble: 4 x 5
# Groups: id [4]
id color text1 text2 text3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0
2 2 1 0 1 0
3 43 0 0 1 0
4 53 0 0 1 1
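The same reshaping can be sketched in base R without any packages: split each title on commas, collect the distinct labels, and test each row for membership. This is an illustrative alternative, not the answer's method; parts, labels, flags and result are names invented for the example.

```r
dframe <- data.frame(id = c(1, 2, 43, 53),
                     title = c("text1,color", "color,text2", "text2", "text3,text2"),
                     stringsAsFactors = FALSE)

# One character vector of labels per row
parts <- strsplit(dframe$title, ",", fixed = TRUE)

# Distinct labels in order of first appearance
labels <- unique(unlist(parts))

# For each row, 1 if the label occurs in that row, else 0.
# sapply() returns labels in rows, so transpose to get one row per id.
flags <- t(sapply(parts, function(p) as.integer(labels %in% p)))
colnames(flags) <- labels

result <- cbind(dframe["id"], flags)
result
```

The columns come out in order of first appearance (text1, color, text2, text3), which matches the expected output in the question rather than the alphabetical order produced by spread.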

Count the number of times two values appear in a column based on the unique values of another column [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Tidyr how to spread into count of occurrence [duplicate]
(2 answers)
Closed 4 years ago.
I have the dataframe below:
year<-c("2000","2000","2001","2002","2000")
gender<-c("M","F","M","F","M")
YG<-data.frame(year,gender)
In this dataframe I want to count the number of "M" and "F" for every year and then create a new dataframe like:
year M F
1 2000 2 1
2 2001 1 0
3 2002 0 1
I tried something like:
library(dplyr)
ns<-YG %>%
group_by(year) %>%
count(YG$gender == "M")
A solution using reshape2 (with no value column specified, dcast falls back to counting rows per cell):
library(reshape2)
dcast(YG, year~gender)
year F M
1 2000 1 2
2 2001 0 1
3 2002 1 0
Or a different tidyverse solution:
YG %>%
group_by(year) %>%
summarise(M = length(gender[gender == "M"]),
F = length(gender[gender == "F"]))
year M F
<fct> <int> <int>
1 2000 2 1
2 2001 1 0
3 2002 0 1
Or as proposed by @zx8754:
YG %>%
group_by(year) %>%
summarise(M = sum(gender == "M"),
F = sum(gender == "F"))
We can use count and spread to get the df format and use fill = 0 in spread to fill in the 0s:
library(tidyverse)
YG %>%
group_by(year) %>%
count(gender) %>%
spread(gender, n, fill = 0)
Output:
# A tibble: 3 x 3
# Groups: year [3]
year F M
<fct> <dbl> <dbl>
1 2000 1 2
2 2001 0 1
3 2002 1 0
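For a plain count of one column by another, base R's table() already produces the cross-tabulation, including the zero cells. A minimal sketch with the question's data follows; counts and wide are names chosen for the example.

```r
year <- c("2000", "2000", "2001", "2002", "2000")
gender <- c("M", "F", "M", "F", "M")
YG <- data.frame(year, gender)

# Rows are years, columns are genders, cells are counts
counts <- table(YG$year, YG$gender)

# Convert the contingency table to a wide data frame
wide <- as.data.frame.matrix(counts)
wide
```

Here the years end up as row names rather than as a column; add one with wide$year <- rownames(wide) if a regular column is needed.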

Collapsing a dataframe in R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 6 years ago.
I'm attempting to collapse a dataframe onto itself. The aggregate function seems like my best bet, but I'm not sure how to have some columns sum while others remain the same.
My dataframe looks like this
A 1 3 2
A 2 3 4
B 1 2 4
B 4 2 2
How can I use the aggregate function or the ddply function to create something that looks like this:
A 3 3 6
B 5 2 6
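We can use dplyr. Here across() replaces summarise_each() and funs(), which are deprecated in current dplyr releases:
library(dplyr)
df1 %>%
group_by(col1) %>%
# keep the value if it is constant within the group, otherwise sum it
summarise(across(everything(), ~ if (n_distinct(.x) == 1) .x[1] else sum(.x)))
Another option, if 'col3' is constant within each group, is to keep it in the group_by and sum the other columns:
df1 %>%
group_by(col1, col3) %>%
summarise(across(everything(), sum))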
# col1 col3 col2 col4
# <chr> <int> <int> <int>
#1 A 3 3 6
#2 B 2 5 6
Or with aggregate
aggregate(.~col1+col3, df1, FUN = sum)
# col1 col3 col2 col4
#1 B 2 5 6
#2 A 3 3 6
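The question shows the data without column names, so the snippets above assume a data frame df1 with columns col1 to col4. A reproducible sketch under that assumption (the column names and integer types are guesses, not given in the question):

```r
# Assumed reconstruction of the question's data; names col1..col4 are hypothetical
df1 <- data.frame(col1 = c("A", "A", "B", "B"),
                  col2 = c(1L, 2L, 1L, 4L),
                  col3 = c(3L, 3L, 2L, 2L),
                  col4 = c(2L, 4L, 4L, 2L))

# col1 and col3 define the groups; col2 and col4 are summed within them
res <- aggregate(. ~ col1 + col3, data = df1, FUN = sum)
res
```

Putting col3 on the grouping side only works because it is constant within each col1 group; if it varied, each distinct (col1, col3) pair would get its own row.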
