Sum values from DF and make a new one

I have this dataframe in R:
ID <- c(rep("ID1" , 4) , rep("ID2" , 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2,4,6,8,1,3,5,7)
data.frame(ID, mut, count)
ID mut count
1 ID1 AC 2
2 ID1 TG 4
3 ID1 AG 6
4 ID1 TC 8
5 ID2 AC 1
6 ID2 TG 3
7 ID2 AG 5
8 ID2 TC 7
I want to create a new data frame in which count is summed according to the "mut" column.
Basically, for each ID, I want to sum the counts for mut = AC and TG, and the counts for mut = AG and TC, to obtain this:
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
I have absolutely no clue on how to do this!!
Thanks!!
M

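This pairs consecutive rows, so make sure each ID has an even number of rows.
df <- data.frame(ID, mut, count)
# Pair index for consecutive rows: 1, 1, 2, 2, 3, 3, ...
df$sek <- rep(1:(nrow(df) / 2), each = 2)
# Collapse each pair into one row and stack the results
do.call(rbind,
  by(df, list(df$sek), function(x) {
    data.frame(
      ID = x$ID[1],
      new_mut = paste0(x$mut, collapse = "-"),
      count = sum(x$count)
    )
  })
)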
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12

Using dplyr :
library(dplyr)
df %>%
group_by(ID, val = ceiling(match(mut, unique(mut))/2)) %>%
summarise(mut = paste0(mut,collapse="-"),
count = sum(count)) %>%
select(-val)
# ID mut count
# <chr> <chr> <dbl>
#1 ID1 AC-TG 6
#2 ID1 AG-TC 14
#3 ID2 AC-TG 4
#4 ID2 AG-TC 12
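If the pairing should depend on the mutation labels rather than on row order, a lookup table is one option. This is a minimal base R sketch, assuming the pairs are AC with TG and AG with TC as in the question; the `pairs` vector and the `res` name are illustrative only.

```r
ID <- c(rep("ID1", 4), rep("ID2", 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2, 4, 6, 8, 1, 3, 5, 7)
df <- data.frame(ID, mut, count)

# Hypothetical lookup: the combined label each mutation belongs to
pairs <- c(AC = "AC-TG", TG = "AC-TG", AG = "AG-TC", TC = "AG-TC")
df$new_mut <- unname(pairs[as.character(df$mut)])

# Sum count within each ID / new_mut combination
res <- aggregate(count ~ ID + new_mut, data = df, FUN = sum)
res <- res[order(res$ID, res$new_mut), ]
rownames(res) <- NULL
res
```

Because the grouping comes from the labels, this does not rely on the rows arriving in a particular order or on each ID having an even number of rows.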

Related

Collapse rows in R

I have a dataframe
df <- data.frame(id1 = c("a" , "b", "b", "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "a", "e"),
n1 = c(2,2,2,3),
n2 = c(2,1,1,1),
n3 = c(0,1,1,3),
n4 = c(0,1,1,2))
I want to collapse the 2nd and 3rd rows into one. Afterwards, I will aggregate by column id3 for rows sharing the same value (i.e. a).
My real dataframe is long and contains many different Latin names, so filtering by a name such as a doesn't make sense in this case. I am thinking of collapsing rows with the condition id3 == id2, but I could not do it. Any suggestions?
My desired output looks like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
b a a 2 1 1 1
c NA e 3 1 3 2
# After that, it should be
id1 id3 n1 n2 n3 n4
a a 4 3 1 1
c e 3 1 3 2
(I just updated the dataframe, sorry for my mistake)

Taking the distinct rows gives the first expected output:
library(dplyr)
df %>%
distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2
The final output follows from that: after the distinct step, group by the coalesced 'id2'/'id1' together with 'id3', then take the sum of the numeric columns.
df %>%
distinct %>%
group_by(id1 = coalesce(id2, id1), id3) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')
-output
# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
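For comparison, the same two steps can be sketched in base R. This assumes, as above, that the duplicated rows are exact duplicates and that a non-missing id2 should replace id1; the `dedup` and `res` names are illustrative only.

```r
df <- data.frame(id1 = c("a", "b", "b", "c"),
                 id2 = c(NA, "a", "a", NA),
                 id3 = c("a", "a", "a", "e"),
                 n1 = c(2, 2, 2, 3),
                 n2 = c(2, 1, 1, 1),
                 n3 = c(0, 1, 1, 3),
                 n4 = c(0, 1, 1, 2),
                 stringsAsFactors = FALSE)

# Step 1: drop exact duplicate rows (collapses rows 2 and 3)
dedup <- unique(df)

# Step 2: use id2 where it is present, otherwise keep id1
dedup$id1 <- ifelse(is.na(dedup$id2), dedup$id1, dedup$id2)

# Step 3: sum the numeric columns per id1 / id3 combination
res <- aggregate(cbind(n1, n2, n3, n4) ~ id1 + id3, data = dedup, FUN = sum)
res
```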

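Here is a slightly different way using slice after group_by instead of distinct. The data is regrouped on the coalesced id before summing, because mutate() cannot modify a grouping column:
df %>%
  group_by(id1, id3) %>%
  dplyr::slice(1L) %>%
  ungroup() %>%
  group_by(id1 = coalesce(id2, id1), id3) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")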
output:
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2

How to make a copy of every row and column in one table for every index of another table in R?

There are two dataframes, one with an index and another with no index. I want to make a new dataframe with the indices of the first and the rows and columns of the other in such a way that there is a copy of every data in the second table for each index.
df_A <- data.frame("index" = c("id1","id2","id3")
, variable_a = c(1,2,3)
, variable_b = c("x","f","d"))
df_B <- data.frame(variable_x = c("4124","414","123")
, variable_y = c(12,22,13)
, variable_z = c("q","w","d"))
The result should be:
df_C <- data.frame("index" = c("id1","id1","id1","id2","id2","id2","id3","id3","id3")
, variable_x = c("4124","414","123","4124","414","123","4124","414","123")
, variable_y = c(12,22,13,12,22,13,12,22,13)
, variable_z = c("q","w","d","q","w","d","q","w","d"))

This is a cross join (Cartesian product), which merge produces when the two inputs share no column names. It can be solved via
merge(df_B, df_A$index)
Which yields
> merge(df_B, df_A$index)
variable_x variable_y variable_z y
1 4124 12 q id1
2 414 22 w id1
3 123 13 d id1
4 4124 12 q id2
5 414 22 w id2
6 123 13 d id2
7 4124 12 q id3
8 414 22 w id3
9 123 13 d id3
You could correct the order of the columns like this:
merge(df_B, df_A$index)[,c(4, 1, 2, 3)]
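A cross join can be done in dplyr as well, if you prefer that (note that this keeps every column of df_A, not only the index):
dplyr::full_join(df_B, df_A, by = character())
In dplyr 1.1.0 and later, by = character() is deprecated in favor of dplyr::cross_join(df_B, df_A).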
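To keep the column name `index` instead of the generic `y`, one option is to pass a one-column data frame and request the Cartesian product explicitly with `by = NULL`. This is a minimal base R sketch using the data from the question; `df_C` is rebuilt here for illustration.

```r
df_A <- data.frame(index = c("id1", "id2", "id3"),
                   variable_a = c(1, 2, 3),
                   variable_b = c("x", "f", "d"),
                   stringsAsFactors = FALSE)
df_B <- data.frame(variable_x = c("4124", "414", "123"),
                   variable_y = c(12, 22, 13),
                   variable_z = c("q", "w", "d"),
                   stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product of the two inputs;
# df_A["index"] keeps the column name, unlike df_A$index
df_C <- merge(df_B, df_A["index"], by = NULL)

# Put the index column first
df_C <- df_C[, c("index", names(df_B))]
rownames(df_C) <- NULL
df_C
```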

Another option is to use tidyr::crossing
tidyr::crossing(df_A, df_B)
#----------
# A tibble: 9 x 6
index variable_a variable_b variable_x variable_y variable_z
<chr> <dbl> <chr> <chr> <dbl> <chr>
1 id1 1 x 123 13 d
2 id1 1 x 4124 12 q
3 id1 1 x 414 22 w
4 id2 2 f 123 13 d
5 id2 2 f 4124 12 q
6 id2 2 f 414 22 w
7 id3 3 d 123 13 d
8 id3 3 d 4124 12 q
9 id3 3 d 414 22 w

The following function, which uses dplyr, should help. Pass the data frame with the index as the first argument and the data frame without an index as the second; it returns the requested data frame.
library(dplyr)
merge_lines_with_index <- function(index_table, data_table){
  # Start with a one-row placeholder that has the target column names
  df <- data.frame(matrix(ncol = ncol(data_table) + 1))
  x <- names(data_table) %>% unlist()
  colnames(df) <- c("index", x)
  # Append one copy of data_table per value in the first column of index_table
  for (item in index_table %>% select(1) %>% unlist()) {
    new_data <- data_table %>%
      mutate("index" = item)
    df <- df %>% rbind(new_data)
  }
  # Drop the placeholder row
  return(df[-1,])
}

Row wise comparison of a dataframe in R

I have a data frame with multiple data points corresponding to each ID. When the status value is different between 2 timepoints for an ID, I want to flag the first status change. How do I achieve that in R ? Below is a sample dataset.
ID   Time  Status
ID1  0     X
ID1  6     X
ID1  12    Y
ID1  18    Z
Result dataset
ID   Time  Status  Flag
ID1  0     X
ID1  6     X
ID1  12    Y       1
ID1  18    Z

Here is a base R solution with ave. It creates a vector y that is equal to 1 every time the previous value is different from the current one. Then the Flag is computed with diff.
y <- with(df1, ave(Status, ID, FUN = function(x) c(0, x[-1] != x[-length(x)])))
df1$Flag <- c(0, diff(as.integer(y)) != 0)
df1
# ID Time Status Flag
#1 ID1 0 X 0
#2 ID1 6 X 0
#3 ID1 12 Y 1
#4 ID1 18 Z 0
Data
df1 <- read.table(text = "
ID Time Status
ID1 0 X
ID1 6 X
ID1 12 Y
ID1 18 Z
", header = TRUE)
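A more explicit variant that handles each ID separately may be easier to adapt. This is a minimal sketch, assuming the rows are already sorted by Time within each ID; `first_change` is a hypothetical helper name, not a function from any package.

```r
df1 <- read.table(text = "
ID Time Status
ID1 0 X
ID1 6 X
ID1 12 Y
ID1 18 Z
", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: 1 at the first position where the status differs
# from the previous one, 0 everywhere else
first_change <- function(status) {
  changed <- c(FALSE, status[-1] != status[-length(status)])
  flag <- integer(length(status))
  idx <- which(changed)
  if (length(idx) > 0) flag[idx[1]] <- 1L
  flag
}

# Apply the helper within each ID by passing row positions through ave()
df1$Flag <- ave(seq_len(nrow(df1)), df1$ID,
                FUN = function(i) first_change(df1$Status[i]))
df1
```

Passing row positions to `ave()` keeps the result an integer vector even though `Status` is character.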

You can use mutate() with ifelse() and lag(), then replace the non-first Flag==1 with 0s with replace():
df1%>%group_by(ID)%>%
mutate(Flag=ifelse(is.na(lag(Status)), 0,
as.integer(Time!=lag(Time) & Status!=lag(Status))))%>%
group_by(ID, Flag)%>%
mutate(Flag=replace(Flag, Flag==lag(Flag) & Flag==1, 0))
# A tibble: 4 x 4
# Groups: ID, Flag [2]
ID Time Status Flag
<fct> <int> <fct> <dbl>
1 ID1 0 X 0
2 ID1 6 X 0
3 ID1 12 Y 1
4 ID1 18 Z 0

How to transform "sequence" data to "from & to" in R

I have this dataset.
data.frame(
id = c("id1","id1","id1","id1","id2","id2","id2"),
seq = c(1,2,3,4,1,2,3),
obj = c("A","B","C","D","B","D","E")
)
id seq obj
1 id1 1 A
2 id1 2 B
3 id1 3 C
4 id1 4 D
5 id2 1 B
6 id2 2 D
7 id2 3 E
I want to transform the seq and obj variables into a from/to form, like this:
data.frame(
id = c("id1","id1","id1","id1","id1","id2","id2","id2","id2"),
from = c("start","A","B","C","D","start","B","D","E"),
to = c("A","B","C","D","end","B","D","E","end")
)
id from to
1 id1 start A
2 id1 A B
3 id1 B C
4 id1 C D
5 id1 D end
6 id2 start B
7 id2 B D
8 id2 D E
9 id2 E end
If we think of id as a runner's name, we can imagine the runner passing through checkpoints named obj in the order given by seq.
Do you have any idea how to do this?
Thank you.

The following should work:
df %>%
group_by(id) %>%
arrange(seq) %>%
summarize(from = c('start', obj), to = c(obj, 'end'), .groups = 'drop')
# A tibble: 9 x 3
id from to
<chr> <chr> <chr>
1 id1 start A
2 id1 A B
3 id1 B C
4 id1 C D
5 id1 D end
6 id2 start B
7 id2 B D
8 id2 D E
9 id2 E end
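If your initial data is already in the correct order (as in your given example), the arrange() call is unnecessary. However, with tabular data it's best not to assume a specific order. Note that in dplyr 1.1.0 and later, summarize() returning more than one row per group is deprecated; reframe() is the replacement and takes the same expressions (without the .groups argument).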
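The same reshaping can also be sketched in base R, which avoids depending on a particular dplyr version. It assumes the data frame from the question is stored as `df`; the `pieces` and `res` names are illustrative only.

```r
df <- data.frame(
  id = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"),
  seq = c(1, 2, 3, 4, 1, 2, 3),
  obj = c("A", "B", "C", "D", "B", "D", "E"),
  stringsAsFactors = FALSE
)

# Do not rely on the incoming row order
df <- df[order(df$id, df$seq), ]

# For each id, shift obj by one position and pad with "start" / "end"
pieces <- lapply(split(df, df$id), function(g) {
  data.frame(id = g$id[1],
             from = c("start", g$obj),
             to = c(g$obj, "end"),
             stringsAsFactors = FALSE)
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```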

R Merge only parts of one column into an existent column from another dataframe

I am aware that merging is a widely covered topic. If you think this is a duplicate, I am very happy to be pointed to the question that answers mine, but I haven't found it (sorry!). Thanks
I have two data frames:
require(dplyr)
set.seed(1)
large_df <- data_frame(id = rep(paste0('id',1:40), each = 3),
age = c(rep(NA,60),rep (sample(20), each = 3)),
col3 = rep(letters[1:20],6), col4 = rep(1:60,2))
small_df <- data_frame(id = paste0('id',1:20),
age = sample(20))
large_df contains incomplete data (large_df$age), which is contained in small_df. Now I would like to bring the information from small_df$age into large_df$age (merged by the correct 'id'). I think this must be possible via merge or one of the join functions from dplyr, but several combinations did not bring the result I would like.
I also tried a for loop over the rows:
for(i in nrow(large_df)) {
if (large_df[i,'id'] %in% small_df$id == TRUE) {
large_df[i,'age'] <- small_df$age[which(small_df$id %in% large_df[i,'id'])]
}
}
But this doesn't help; it doesn't even return any result. (Does anyone have an idea why not?)
My result would look like that:
large_df$age[1:60] <- rep(small_df$age, each = 3)
large_df
# A tibble: 120 x 4
id age col3 col4
<chr> <int> <chr> <int>
1 id1 6 a 1
2 id1 6 b 2
3 id1 6 c 3
4 id2 8 d 4
5 id2 8 e 5
6 id2 8 f 6
7 id3 11 g 7
8 id3 11 h 8
9 id3 11 i 9
10 id4 16 j 10
# ... with 110 more rows

Using your data frames this would do the trick.
result =
large_df %>%
left_join(small_df, by = 'id') %>%
mutate(age = ifelse(is.na(age.x), age.y, age.x)) %>%
dplyr::select(-age.x, -age.y)
result
# A tibble: 120 x 4
id col3 col4 age
<chr> <chr> <int> <int>
1 id1 a 1 19
2 id1 b 2 19
3 id1 c 3 19
4 id2 d 4 5
If both age.x and age.y are NA then NA would be output in age.
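As for why the loop in the question changes nothing visible: `for (i in nrow(large_df))` iterates over the single number 120 rather than over `seq_len(nrow(large_df))`, so only the last row is visited. A loop is not needed, though. Below is a minimal base R sketch of the same fill using `match()`; plain data frames are used instead of tibbles here, and `idx` and `fill` are illustrative names.

```r
set.seed(1)
large_df <- data.frame(
  id = rep(paste0("id", 1:40), each = 3),
  age = c(rep(NA, 60), rep(sample(20), each = 3)),
  col3 = rep(letters[1:20], 6),
  col4 = rep(1:60, 2),
  stringsAsFactors = FALSE
)
small_df <- data.frame(id = paste0("id", 1:20),
                       age = sample(20),
                       stringsAsFactors = FALSE)

# Position of each large_df id in small_df (NA when the id is absent)
idx <- match(large_df$id, small_df$id)

# Fill only rows whose age is missing and whose id exists in small_df
fill <- is.na(large_df$age) & !is.na(idx)
large_df$age[fill] <- small_df$age[idx[fill]]

head(large_df)
```

Existing non-missing ages are left untouched, and the column order of `large_df` stays as it was.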