I have this dataset.
data.frame(
id = c("id1","id1","id1","id1","id2","id2","id2"),
seq = c(1,2,3,4,1,2,3),
obj = c("A","B","C","D","B","D","E")
)
id seq obj
1 id1 1 A
2 id1 2 B
3 id1 3 C
4 id1 4 D
5 id2 1 B
6 id2 2 D
7 id2 3 E
I want to transform the seq and obj variables into a from/to form, like this:
data.frame(
id = c("id1","id1","id1","id1","id1","id2","id2","id2","id2"),
from = c("start","A","B","C","D","start","B","D","E"),
to = c("A","B","C","D","end","B","D","E","end")
)
id from to
1 id1 start A
2 id1 A B
3 id1 B C
4 id1 C D
5 id1 D end
6 id2 start B
7 id2 B D
8 id2 D E
9 id2 E end
If we think of id as a runner's name, we can imagine the runner passing through checkpoints named obj in the order given by seq.
Do you have any idea how to do this?
Thank you.
The following should work:
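# Note: in dplyr >= 1.1.0, summarize() returning several rows per group is
# deprecated; reframe() is the recommended replacement.
library(dplyr)
df %>%
group_by(id) %>%
arrange(seq) %>%
summarize(from = c('start', obj), to = c(obj, 'end'), .groups = 'drop')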
# A tibble: 9 x 3
id from to
<chr> <chr> <chr>
1 id1 start A
2 id1 A B
3 id1 B C
4 id1 C D
5 id1 D end
6 id2 start B
7 id2 B D
8 id2 D E
9 id2 E end
If your initial data is already in the correct order (as in your given example), the arrange() call is unnecessary. However, with tabular data it’s best not to assume a specific order.
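As a cross-check that does not depend on dplyr, the same from/to construction can be sketched in base R. The helper name to_edges and the shuffled input are assumptions made for illustration only; the point is that sorting by id and seq first makes the result independent of the input row order.

```r
# Example data from the question, deliberately shuffled to show that
# the row order of the input does not matter once we sort by id and seq.
df <- data.frame(
  id  = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"),
  seq = c(1, 2, 3, 4, 1, 2, 3),
  obj = c("A", "B", "C", "D", "B", "D", "E"),
  stringsAsFactors = FALSE
)
shuffled <- df[c(7, 2, 5, 1, 4, 6, 3), ]

# to_edges() is a hypothetical helper name, not part of any package.
to_edges <- function(d) {
  d <- d[order(d$id, d$seq), ]                 # do not trust the input order
  pieces <- lapply(split(d, d$id), function(g) {
    data.frame(
      id   = g$id[1],
      from = c("start", g$obj),                # prepend the start marker
      to   = c(g$obj, "end"),                  # append the end marker
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

edges <- to_edges(shuffled)
print(edges)
```

Each id with n checkpoints yields n + 1 rows, because both the start and the end transitions are added.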
I have a dataframe
df <- data.frame(id1 = c("a" , "b", "b", "c"),
id2 = c(NA,"a","a",NA),
id3 = c("a", "a", "a", "e"),
n1 = c(2,2,2,3),
n2 = c(2,1,1,1),
n3 = c(0,1,1,3),
n4 = c(0,1,1,2))
I want to collapse the 2nd and 3rd rows into one. Afterwards, I will aggregate by column id3 for rows sharing the same character (i.e. a).
My real dataframe is long and contains many different Latin names, so filtering by name (i.e. a) doesn't make sense in this case. I am thinking of collapsing rows with the condition id3 == id2, but I could not do it. Any suggestions?
My desired output looks like this:
id1 id2 id3 n1 n2 n3 n4
a NA a 2 2 0 0
b a a 2 1 1 1
c NA e 3 1 3 2
# After that, it should be
id1 id3 n1 n2 n3 n4
a a 4 3 1 1
c e 3 1 3 2
(I just updated the dataframe, sorry for my mistake)
We get the distinct rows to generate the first expected output:
library(dplyr)
df %>%
distinct
id1 id2 id3 n1 n2 n3 n4
1 a <NA> a 2 2 0 0
2 b a a 2 1 1 1
3 c <NA> e 3 1 3 2
The final output can be obtained from the above: after the distinct step, group by the coalesced 'id2' and 'id1' along with 'id3', and then sum the numeric columns.
df %>%
distinct %>%
group_by(id1 = coalesce(id2, id1), id3) %>%
summarise(across(where(is.numeric), sum), .groups = 'drop')
-output
# A tibble: 2 × 6
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
Here is a slightly different way using slice after group_by instead of distinct:
df %>%
group_by(id1, id3) %>%
dplyr::slice(1L) %>%
mutate(id1 = coalesce(id2,id1)) %>%
summarise(across(where(is.numeric), sum))
output:
id1 id3 n1 n2 n3 n4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 a a 4 3 1 1
2 c e 3 1 3 2
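For comparison, the same two steps (drop duplicate rows, then sum by the coalesced id) can be sketched in base R. This is only an illustration under the assumption that exact duplicate rows are the ones to collapse, as in the example data; unique() and aggregate() stand in for distinct() and summarise().

```r
df <- data.frame(
  id1 = c("a", "b", "b", "c"),
  id2 = c(NA, "a", "a", NA),
  id3 = c("a", "a", "a", "e"),
  n1 = c(2, 2, 2, 3),
  n2 = c(2, 1, 1, 1),
  n3 = c(0, 1, 1, 3),
  n4 = c(0, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Step 1: collapse exact duplicate rows (rows 2 and 3 are identical).
dedup <- unique(df)

# Step 2: use id2 where it is present, otherwise keep id1
# (the base R counterpart of coalesce(id2, id1)).
dedup$id1 <- ifelse(is.na(dedup$id2), dedup$id1, dedup$id2)

# Step 3: sum the numeric columns within each id1/id3 combination.
# Only the columns named in the formula are used, so the NAs in id2
# do not cause rows to be dropped.
result <- aggregate(cbind(n1, n2, n3, n4) ~ id1 + id3, data = dedup, FUN = sum)
print(result)
```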
For some reason, I have data in which a few columns are themselves data frames consisting of one column each. I want to "collapse" these data-frame columns into a single flat data frame.
library(tidyverse)
df <- tibble(col1=1:5,
col2=tibble(newcol=LETTERS[1:5]),
col3=tibble(newcol2=LETTERS[6:10]))
df
# A tibble: 5 x 3
col1 col2$newcol col3$newcol2
<int> <chr> <chr>
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
I have tried unnest(), but the function actually replicates the data frame/tibble of col2 and col3 for each row of col1, which is not what I want.
df2 <- df %>% unnest(cols = c(col2, col3))
df2
# A tibble: 25 x 3
col1 col2 col3
<int> <chr> <chr>
1 1 A F
2 1 B G
3 1 C H
4 1 D I
5 1 E J
6 2 A F
7 2 B G
8 2 C H
9 2 D I
10 2 E J
# ... with 15 more rows
The result that I want is as below:
df3 <- tibble(col1=1:5,
newcol=LETTERS[1:5],
newcol2=LETTERS[6:10])
df3
# A tibble: 5 x 3
col1 newcol newcol2
<int> <chr> <chr>
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
Any idea how to do this? Any help is much appreciated.
It looks like you only want to change the column names, or am I missing something here?
df<-df%>%mutate(col2=df$col2$newcol, col3=df$col3$newcol2)
After your comment, here you can find a more general version (might not be suitable for all use cases)
df1<-df%>%unnest(cols = c(1:3))%>%
group_by(col1)%>%
mutate(row=row_number())%>%
filter(row==col1)%>%
select(-row)
If I understand correctly, you have three dataframes, each containing one column, and you want to bring them all together in one dataframe. In that case cbind is an option.
df3 <- cbind(df, col2, col3)
Output:
col1 newcol newcol2
1 1 A F
2 2 B G
3 3 C H
4 4 D I
5 5 E J
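A more general way to flatten such data-frame columns, without naming each inner column by hand, can be sketched in base R as follows. The helper name flatten_df_cols is hypothetical, and the example builds the nested structure with a plain data.frame rather than a tibble, so treat it as an illustration of the idea rather than a drop-in answer. In the tidyverse, tidyr::unpack() is intended for this kind of column.

```r
# Build a data frame whose col2 and col3 are themselves one-column data frames.
df <- data.frame(col1 = 1:5)
df$col2 <- data.frame(newcol = LETTERS[1:5], stringsAsFactors = FALSE)
df$col3 <- data.frame(newcol2 = LETTERS[6:10], stringsAsFactors = FALSE)

# flatten_df_cols() is a hypothetical helper: it keeps ordinary columns
# under their own name and replaces each data-frame column by its inner
# columns, bound side by side.
flatten_df_cols <- function(d) {
  parts <- lapply(names(d), function(nm) {
    col <- d[[nm]]
    if (is.data.frame(col)) {
      col
    } else {
      setNames(data.frame(col, stringsAsFactors = FALSE), nm)
    }
  })
  do.call(cbind, parts)
}

flat <- flatten_df_cols(df)
print(flat)
```

Because the inner data frames are bound column-wise, the number of rows stays the same, unlike the unnest() attempt in the question.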
I have this dataframe in R:
ID <- c(rep("ID1" , 4) , rep("ID2" , 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2,4,6,8,1,3,5,7)
data.frame(ID, mut, count)
ID mut count
1 ID1 AC 2
2 ID1 TG 4
3 ID1 AG 6
4 ID1 TC 8
5 ID2 AC 1
6 ID2 TG 3
7 ID2 AG 5
8 ID2 TC 7
I want to create a new one where I sum the values of count based on "mut" column.
Basically, for each ID, I would sum the count from mut=AC and TG and from AG and TC, to obtain this:
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
I have absolutely no clue on how to do this!!
Thanks!!
M
This approach pairs consecutive rows, so make sure each ID has an even number of rows and that the rows to be combined are adjacent.
df=data.frame(ID, mut, count)
df$sek=rep(1:(nrow(df)/2),each=2)
do.call(rbind,
by(df,list(df$sek),function(x){
data.frame(
"ID"=x$ID[1],
"new_mut"=paste0(x$mut,collapse="-"),
"count"=sum(x$count)
)
})
)
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
Using dplyr :
library(dplyr)
df %>%
group_by(ID, val = ceiling(match(mut, unique(mut))/2)) %>%
summarise(mut = paste0(mut,collapse="-"),
count = sum(count)) %>%
select(-val)
# ID mut count
# <chr> <chr> <dbl>
#1 ID1 AC-TG 6
#2 ID1 AG-TC 14
#3 ID2 AC-TG 4
#4 ID2 AG-TC 12
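Both answers above rely on the position of the rows. If the pairing is known in advance, an explicit lookup table avoids that dependence. The sketch below assumes the pairs are exactly AC with TG and AG with TC, as stated in the question; the vector name pair_of is made up for illustration.

```r
ID <- c(rep("ID1", 4), rep("ID2", 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2, 4, 6, 8, 1, 3, 5, 7)
df <- data.frame(ID, mut, count, stringsAsFactors = FALSE)

# Explicit mapping from each mutation to the label of its pair
# (assumed from the question: AC pairs with TG, AG pairs with TC).
pair_of <- c(AC = "AC-TG", TG = "AC-TG", AG = "AG-TC", TC = "AG-TC")
df$new_mut <- unname(pair_of[df$mut])

# Sum the counts within each ID and pair, then sort for readability.
result <- aggregate(count ~ ID + new_mut, data = df, FUN = sum)
result <- result[order(result$ID, result$new_mut), ]
rownames(result) <- NULL
print(result)
```

With this variant the rows may arrive in any order, and an ID with a missing mutation simply contributes a smaller sum instead of shifting the pairing.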
I am aware that merging is a widely covered topic. If you think this is a duplicate, I am very happy to be put onto the question that answers my question, but I haven't found it (Sorry!). Thanks
I have two data frames:
require(dplyr)
set.seed(1)
large_df <- tibble(id = rep(paste0('id',1:40), each = 3),
age = c(rep(NA,60),rep (sample(20), each = 3)),
col3 = rep(letters[1:20],6), col4 = rep(1:60,2))
small_df <- tibble(id = paste0('id',1:20),
age = sample(20))
large_df contains incomplete data (large_df$age), which is contained in small_df. Now I would like to bring the information from small_df$age into large_df$age (merged by the correct 'id'). I think this must be possible via merge or one of the join functions from dplyr, but several combinations did not bring the result I would like.
I also tried a for loop over the rows:
for(i in nrow(large_df)) {
if (large_df[i,'id'] %in% small_df$id == TRUE) {
large_df[i,'age'] <- small_df$age[which(small_df$id %in% large_df[i,'id'])]
}
}
But this doesn't help; it doesn't even change anything. (Does anyone have an idea why not?)
The result should look like this:
large_df$age[1:60] <- rep(small_df$age, each = 3)
large_df
# A tibble: 120 x 4
id age col3 col4
<chr> <int> <chr> <int>
1 id1 6 a 1
2 id1 6 b 2
3 id1 6 c 3
4 id2 8 d 4
5 id2 8 e 5
6 id2 8 f 6
7 id3 11 g 7
8 id3 11 h 8
9 id3 11 i 9
10 id4 16 j 10
# ... with 110 more rows
Using your data frames this would do the trick.
result =
large_df %>%
left_join(small_df, by = 'id') %>%
mutate(age = ifelse(is.na(age.x), age.y, age.x)) %>%
dplyr::select(-age.x, -age.y)
result
# A tibble: 120 x 4
id col3 col4 age
<chr> <chr> <int> <int>
1 id1 a 1 19
2 id1 b 2 19
3 id1 c 3 19
4 id2 d 4 5
If both age.x and age.y are NA then NA would be output in age.
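As for why the for loop in the question appears to do nothing: nrow(large_df) is a single number, so for (i in nrow(large_df)) runs the body once, for the last row only, and that row's id (id40) is not in small_df. The sketch below shows this on a small made-up data set (the ids and ages are invented for illustration) and then fills the missing ages with match(), which needs no loop at all.

```r
# Small made-up example: ages are missing for id1 and id2 only.
large_df <- data.frame(
  id  = rep(paste0("id", 1:4), each = 2),
  age = c(NA, NA, NA, NA, 30, 30, 41, 41),
  stringsAsFactors = FALSE
)
small_df <- data.frame(
  id  = c("id1", "id2"),
  age = c(25, 52),
  stringsAsFactors = FALSE
)

# Looping over nrow(x) visits a single index, the last row.
visited <- c()
for (i in nrow(large_df)) visited <- c(visited, i)
print(visited)   # one index only; seq_len(nrow(large_df)) would give all rows

# Vectorised fill: look up each id in small_df and overwrite only
# the rows whose age is missing and whose id was found.
idx  <- match(large_df$id, small_df$id)
fill <- is.na(large_df$age) & !is.na(idx)
large_df$age[fill] <- small_df$age[idx[fill]]
print(large_df)
```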
I have a data frame like below:
Group1 Group2 Group3 Group4
A B A B
A C B A
B B B B
A C B D
A D C A
I want to add a new column to the data frame which will have the count of unique elements in each row. Desired output:
Group1 Group2 Group3 Group4 Count
A B A B 2
A C B A 3
B B B B 1
A C B D 4
A D C A 3
I am able to find such a count for each row using
length(unique(c(df[,c(1,2,3,4)][1,])))
I want to do the same thing for all rows in the data frame. I tried apply() with var=1 but without success. Also, it would be great if you could provide a more elegant solution to this.
We can use apply with MARGIN =1 to loop over the rows
df1$Count <- apply(df1, 1, function(x) length(unique(x)))
df1$Count
#[1] 2 3 1 4 3
Or using tidyverse
library(dplyr)
df1 %>%
rowwise() %>%
do(data.frame(., Count = n_distinct(unlist(.))))
# A tibble: 5 × 5
# Group1 Group2 Group3 Group4 Count
#* <chr> <chr> <chr> <chr> <int>
#1 A B A B 2
#2 A C B A 3
#3 B B B B 1
#4 A C B D 4
#5 A D C A 3
We can also use a regex to do this faster. It relies on the assumption that each cell contains only a single character.
nchar(gsub("(.)(?=.*?\\1)", "", do.call(paste0, df1), perl = TRUE))
#[1] 2 3 1 4 3
More detailed explanation is given here
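To see what the regular expression above does, it helps to look at the intermediate strings. The sketch below rebuilds the example data (the column values are taken from the table in the question) and prints each step: the pattern removes every character that occurs again later in the same string, so only the last occurrence of each letter survives and the remaining length equals the number of distinct values.

```r
df1 <- data.frame(
  Group1 = c("A", "A", "B", "A", "A"),
  Group2 = c("B", "C", "B", "C", "D"),
  Group3 = c("A", "B", "B", "B", "C"),
  Group4 = c("B", "A", "B", "D", "A"),
  stringsAsFactors = FALSE
)

# Step 1: paste the cells of each row into one string.
row_strings <- do.call(paste0, df1)
print(row_strings)

# Step 2: delete any character that is followed, somewhere later,
# by the same character (lookahead with a backreference).
deduped <- gsub("(.)(?=.*?\\1)", "", row_strings, perl = TRUE)
print(deduped)

# Step 3: the length of what is left is the count of unique values.
counts_regex <- nchar(deduped)

# Cross-check against the apply() approach, which does not rely on
# the single-character assumption.
counts_apply <- unname(apply(df1, 1, function(x) length(unique(x))))
print(counts_regex)
```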
Using duplicated in base R:
df$Count <- apply(df,1,function(x) sum(!duplicated(x)))
# Group1 Group2 Group3 Group4 Count
#1 A B A B 2
#2 A C B A 3
#3 B B B B 1
#4 A C B D 4
#5 A D C A 3
Although there are some pretty great solutions mentioned here, you can also use data.table:
DATA:
df <- data.frame(g1 = c("A","A","B","A","A"),g2 = c("B", "C", "B","C","D"),g3 = c("A","B","B","B","C"),g4 = c("B","A","B","D","A"),stringsAsFactors = F)
Code:
EDIT: After David Arenberg's comment, I used .I instead of 1:nrow(df). Thanks for the valuable comments.
library(data.table)
setDT(df)[, id := .I ]
df[, count := uniqueN(c(g1, g2, g3, g4)), by=id ]
df
Output:
> df
g1 g2 g3 g4 id count
1: A B A B 1 2
2: A C B A 2 3
3: B B B B 3 1
4: A C B D 4 4
5: A D C A 5 3