I am trying to prep my data and I am stuck with one issue. Let's say I have the following data frame:
df1
Name C1 Val1
A a x1
A a x2
A b x3
A c x4
B d x5
B d x6
...
and I want to narrow down the df to
df2
Name C1 Val
A a,b,c x1+x2+x3+x4
B d x5+x6
...
where a is a character value and x is a numeric value.
I have been trying sapply, rowsum and
df2 <- aggregate(df1, list(df1[,1]), FUN = summary)
but I just can't get the character values collected into a list for each Name.
Can someone help me obtain df2?
m <- function(x) if (is.numeric(x <- type.convert(x))) sum(x) else toString(unique(x))
aggregate(. ~ Name, df1, m)
Name C1 Val1
1 A a, b, c 10
2 B d 11
where
df1
Name C1 Val1
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B d 5
6 B d 6
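To see how m() decides between summing and collapsing, here is a quick check (a usage sketch I added, not part of the original answer; recent R versions may warn that as.is should be passed to type.convert(), but the result is the same):
m(c("1", "2", "3"))   # type.convert() yields numbers, so they are summed
#> [1] 6
m(c("a", "a", "b"))   # stays character, so the unique values are collapsed
#> [1] "a, b"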
This is your df; I used the numbers 1 to 6 in Val1:
df <-
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), C1 = structure(c(1L, 1L, 2L, 3L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Val1 = 1:6), row.names = c(NA,
-6L), class = "data.frame")
We just use summarise:
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(C1 = paste(unique(C1), collapse = ","), Val1 = sum(Val1))
# A tibble: 2 x 3
Name C1 Val1
<fct> <chr> <int>
1 A a,b,c 10
2 B d 11
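If the real data has more character and numeric columns than this example, the same idea can be written once with across() (a sketch of mine, assuming dplyr >= 1.0 is attached):
df %>%
  group_by(Name) %>%
  summarise(
    across(!where(is.numeric), ~ paste(unique(.x), collapse = ",")),  # collapse every non-numeric column
    across(where(is.numeric), sum)                                    # sum every numeric column
  )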
Quick and easy dplyr solution:
library(dplyr)
library(stringr)
df1 %>%
  mutate(Val1_num = as.numeric(str_extract(Val1, "\\d+"))) %>%
  group_by(Name) %>%
  summarise(C1 = paste(unique(C1), collapse = ","),
            Val1 = paste(unique(Val1), collapse = "+"),
            Val1_num = sum(Val1_num))
#> # A tibble: 2 x 4
#> Name C1 Val1 Val1_num
#> <chr> <chr> <chr> <dbl>
#> 1 A a,b,c x1+x2+x3+x4 10
#> 2 B d x5+x6 11
Or in base:
df2 <- aggregate(df1, list(df1[,1]), FUN = function(x) {
  if (all(grepl("\\d", x))) {
    sum(as.numeric(gsub("[^[:digit:]]", "", x)))
  } else {
    paste(unique(x), collapse = ",")
  }
})
df2
#> Group.1 Name C1 Val1
#> 1 A A a,b,c 10
#> 2 B B d 11
data
df1 <- read.csv(text = "
Name,C1,Val1
A,a,x1
A,a,x2
A,b,x3
A,c,x4
B,d,x5
B,d,x6", stringsAsFactors = FALSE)
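One optional cleanup of the base aggregate() result above (my own note, since the target df2 has no extra grouping column): Group.1 simply repeats Name and can be dropped.
df2$Group.1 <- NULL   # Group.1 duplicates the Name column in the aggregate() output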
I want to fill df2 with information from df1.
df1 as below
ID Mutation
1 A
2 B
2 C
3 A
df2 as below
ID A B C
1
2
3
For example, if mutation A is found in ID 1, then I want it marked as "Y" in df2.
So the df2 result should be
ID A B C
1  Y
2    Y Y
3  Y
I have hundreds of IDs and more than 20 mutations. How can I efficiently achieve this in R? Thanks!
Using data.table you can try
library(data.table)
setDT(df)
df2 <- dcast(df, formula = ID ~ Mutation)
df2[, c("A", "B", "C") := lapply(.SD, function(x) ifelse(is.na(x), " ", "Y")), ID]
df2
#Output
   ID A B C
1:  1 Y
2:  2   Y Y
3:  3 Y
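Since the real data has more than 20 mutations, hard-coding c("A", "B", "C") will not scale. A variant of the same data.table idea (my sketch, not the answer above) lets dcast() count occurrences and then converts every non-ID column at once:
library(data.table)
setDT(df)
# count occurrences of each Mutation per ID, then turn counts into "Y"/blank
df2 <- dcast(df, ID ~ Mutation, value.var = "Mutation", fun.aggregate = length)
cols <- setdiff(names(df2), "ID")
df2[, (cols) := lapply(.SD, function(x) ifelse(x > 0, "Y", "")), .SDcols = cols]
df2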
Create a new column with value 'Y' and cast the data to wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(value = 'Y') %>%
  pivot_wider(names_from = Mutation, values_from = value, values_fill = '')
# ID A B C
# <int> <chr> <chr> <chr>
#1 1 "Y" "" ""
#2 2 "" "Y" "Y"
#3 3 "Y" "" ""
data
df <- structure(list(ID = c(1L, 2L, 2L, 3L), Mutation = c("A", "B",
"C", "A")), class = "data.frame", row.names = c(NA, -4L))
I have a dataset with one ID column, 12 information columns (strings) and n rows. It looks like this:
ID Col1 Col2 Col3 Col4 Col5 ...
01 a b c d a
02 a a a a a
03 b b b b b
...
I need to go row by row and check whether that row (considering all of its columns) is equal to any other row in the dataset. My output needs to be two new columns: one indicating whether that particular row is equal to any other row, and a second column indicating which row it is equal to (in case of TRUE in the previous column).
I appreciate any suggestions.
Assuming DF in the Note at the end, sort it and create a column dup indicating whether there exists a prior duplicate row. Then set wx to the row number, in the original data frame, of that duplicate. Finally, sort back to the original order.
We have assumed that duplicate means that the columns other than ID are the same, but that is readily changed if need be. We have also assumed that we should mark the second and subsequent rows among duplicates, whereas the first is not so marked because up to that point it has no duplicate.
The question does not address the situation of more than 2 identical rows, but if that situation exists then each duplicate will point to the nearest prior row of which it is a duplicate.
o <- do.call("order", DF[-1])
DFo <- DF[o, ]
DFo$wx <- DFo$dup <- duplicated(DFo)
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
DFo[order(o), ] # back to original order
giving:
ID Col1 Col2 Col3 Col4 Col5 dup wx
1 1 a b c d a FALSE 0
2 2 a a a a a FALSE 0
3 3 b b b b b FALSE 0
4 1 a b c d a TRUE 1
Note
Lines <- "ID Col1 Col2 Col3 Col4 Col5
01 a b c d a
02 a a a a a
03 b b b b b"
DF <- read.table(text = Lines, header = TRUE)
DF <- DF[c(1:3, 1), ]
rownames(DF) <- NULL
giving:
> DF
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 1 a b c d a
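For completeness, a more compact base-R sketch of mine, under slightly different assumptions (every member of a duplicate group is flagged, and wx points at the first occurrence):
key <- do.call(paste, c(DF[-1], sep = "\r"))                   # one key per row, ID excluded
DF$dup <- duplicated(key) | duplicated(key, fromLast = TRUE)   # TRUE if an identical row exists anywhere
DF$wx  <- ifelse(DF$dup, match(key, key), NA)                  # row number of the first occurrence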
With a df like below:
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 3 b b b b b
You could try grouping by all columns and checking whether any count > 1 as well as pasting together row numbers (1:nrow(df)):
df <- transform(
  df,
  dupe = ave(ID, mget(names(df)), FUN = length) > 1,
  dupeRows = ave(1:nrow(df), mget(names(df)), FUN = toString)
)
As this would get you a number for each row, even when there are no duplicates, you could do:
df$dupeRows <- with(df,
  Map(function(x, y) toString(x[x != y]),
      strsplit(as.character(dupeRows), split = ', '),
      1:nrow(df)))
Output:
ID Col1 Col2 Col3 Col4 Col5 dupe dupeRows
1 1 a b c d a FALSE
2 2 a a a a a FALSE
3 3 b b b b b TRUE 4
4 3 b b b b b TRUE 3
Data
df <- structure(list(ID = c(1L, 2L, 3L, 3L), Col1 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col2 = structure(c(2L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col3 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), Col4 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "d"), class = "factor"), Col5 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor")), row.names = c(NA,
-4L), class = "data.frame")
A dplyr solution
library(dplyr)
df %>%
  mutate(row_num = 1:n(), is_dup = duplicated(df)) %>%
  group_by(across(-c(row_num, is_dup))) %>%
  mutate(
    has_copies = n() > 1L,
    which_row = if_else(is_dup, first(row_num), NA_integer_),
    row_num = NULL, is_dup = NULL
  )
Output
# A tibble: 5 x 8
# Groups: ID, Col1, Col2, Col3, Col4, Col5 [3]
ID Col1 Col2 Col3 Col4 Col5 has_copies which_row
<chr> <fct> <fct> <fct> <fct> <fct> <lgl> <int>
1 1 a b c d a FALSE NA
2 2 a a a a a FALSE NA
3 3 b b b b b TRUE NA
4 3 b b b b b TRUE 3
5 3 b b b b b TRUE 3
For each row that has more than one copy, has_copies is TRUE.
For a set of identical rows, I consider the first one the original and all the other rows duplicates. In this regard, which_row gives you the index of the original for each duplicate it found. In other words, if a row has no duplicate or is itself the original, it gives you NA.
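A small usage sketch of my own, assuming the pipeline above is saved to an object, say res: to list only the duplicates together with the row they copy, drop the grouping and filter on which_row.
res %>%                          # res = result of the mutate() pipeline above (hypothetical name)
  ungroup() %>%
  filter(!is.na(which_row))      # only the rows flagged as copies, plus the index of their original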
I understand we can use the dplyr function coalesce() to unite different columns, but is there such a function to unite rows?
I am struggling with a confusing, incomplete/doubled data frame with duplicate rows for the same id, but with different columns filled. E.g.:
id sex age source
12 M NA 1
12 NA 3 1
13 NA 2 2
13 NA NA NA
13 F 2 NA
and I am trying to achieve:
id sex age source
12 M 3 1
13 F 2 2
You can try:
library(dplyr)
library(tidyr)
#Data
df <- structure(list(id = c(12L, 12L, 13L, 13L, 13L), sex = structure(c(2L,
NA, NA, NA, 1L), .Label = c("F", "M"), class = "factor"), age = c(NA,
3L, 2L, NA, 2L), source = c(1L, 1L, 2L, NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
df %>%
  group_by(id) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)
# A tibble: 2 x 4
# Groups: id [2]
id sex age source
<int> <fct> <int> <int>
1 12 M 3 1
2 13 F 2 2
As mentioned by #A5C1D2H2I1M1N2O1R2T1, you can select the first non-NA value in each group. This can be done using dplyr:
library(dplyr)
df %>% group_by(id) %>% summarise(across(.fns = ~na.omit(.)[1]))
# A tibble: 2 x 4
# id sex age source
# <int> <fct> <int> <int>
#1 12 M 3 1
#2 13 F 2 2
Base R :
aggregate(.~id, df, function(x) na.omit(x)[1], na.action = 'na.pass')
Or data.table :
library(data.table)
setDT(df)[, lapply(.SD, function(x) na.omit(x)[1]), id]
Hi all, I have a data frame. I need to create additional columns that show at what position each category appears in ColA. Please refer to the expected output below.
df
ColB ColA
X A>B>C
U B>C>A
Z C>A>B
Expected output
df1
ColB ColA A B C
X A>B>C 1 2 3
U B>C>A 3 1 2
Z C>A>B 2 3 1
We can first bring ColA into separate rows, group_by ColB, give a unique row number to each entry, and then convert the data into wide format using pivot_wider.
library(dplyr)
library(tidyr)
df %>%
  mutate(ColC = ColA) %>%
  separate_rows(ColC, sep = ">") %>%
  group_by(ColB) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = ColC, values_from = row)
# ColB ColA A B C
# <fct> <fct> <int> <int> <int>
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
We can do this in base R:
df[LETTERS[1:3]] <- t(sapply(regmatches(df$ColA, gregexpr("[A-Z]", df$ColA)),
                             match, x = LETTERS[1:3]))
df
# ColB ColA A B C
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame",
row.names = c(NA,
-3L))
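To see what the base-R one-liner above is doing, it may help to look at the two intermediate pieces (an illustrative sketch):
# 1. extract the capital letters of each ColA entry as a list of character vectors
regmatches(df$ColA, gregexpr("[A-Z]", df$ColA))
# 2. for one of those vectors, find the position of A, B and C within it
match(LETTERS[1:3], c("B", "C", "A"))   # 3 1 2, the row produced for "B>C>A"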
I want to summarize data, create dynamically named columns, and store the results in different data frames.
The data is something like:
col1 col2 col3
A 1 200
B 1 300
A 2 400
k=c("A","B","C")
for(i in k)
{
group_data <- group_by(data[data$col1==i,], col2)
summary_i<- summarize(group_data ,paste("var",k[i],sep="_") = n())
}
Expected output:
Three data frames named summary_A, summary_B and summary_C, containing the variables var_A, var_B and var_C respectively.
As correctly pointed out by #MrFlick, there are better ways to manage your problem.
Anyway, here is a working version of your code:
data <- structure(list(col1 = structure(c(1L, 2L, 1L), .Label = c("A",
"B"), class = "factor"), col2 = c(1L, 1L, 2L), col3 = c(200L,
300L, 400L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA,
-3L))
library(dplyr)

k <- c("A", "B", "C")
for (i in seq_along(k)) {
  group_data <- group_by(data[data$col1 == k[i], ], col2)
  vark <- paste("var", i, sep = "_")
  eval(parse(text = paste0("summary_", i, " <- summarize(group_data, ", vark, " = n())")))
}
print(summary_1)
# A tibble: 2 x 2
# col2 var_1
# <int> <int>
# 1 1 1
# 2 2 1
print(summary_2)
# A tibble: 1 x 2
# col2 var_2
# <int> <int>
# 1 1 1
print(summary_3)
# A tibble: 0 x 2
# ... with 2 variables: col2 <int>, var_3 <int>
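As for the "better ways" mentioned above, one option (a sketch of mine, still assuming dplyr is attached) is to keep the summaries in a named list instead of creating objects with eval(parse(...)):
k <- c("A", "B", "C")
# one summary per element of k, collected in a named list
summaries <- lapply(k, function(kk) {
  data %>%
    filter(col1 == kk) %>%
    count(col2, name = paste("var", kk, sep = "_"))
})
names(summaries) <- paste("summary", k, sep = "_")
summaries$summary_A   # same as summary_1 above, but with the column named var_A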