R sum of column on a dataframe - r

I have a dataframe like this
a b
1 A.1 1
2 A.2 2
3 A.3 1
5 B.1 2
6 B.2 2
7 B.3 1
For each letter (A and B here), I need the sum of column b:
a b
1 A 4
2 B 5

One option is to use separate from tidyr to split column 'a' at the delimiter ., then group by the new 'a' and get the sum of 'b'.
library(tidyr)
library(dplyr)
separate(df1, a, into=c('a', 'a1')) %>%
group_by(a) %>%
summarise(b=sum(b))
# a b
#1 A 4
#2 B 5
Or we can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)). Use sub to remove the characters starting from ., followed by digits, use that as the grouping variable and get the sum of 'b'.
library(data.table)
setDT(df1)[,list(b=sum(b)) , by = .(a=sub('\\.\\d+$', '', a))]
# a b
#1: A 4
#2: B 5
Or a similar option using the formula method of aggregate from base R.
aggregate(b~cbind(a=sub('\\.\\d+$', '', a)), df1, FUN=sum)
# a b
# 1 A 4
# 2 B 5
Or using sqldf
library(sqldf)
sqldf('select substr(a, 1, instr(a, ".")-1) as a1,
sum(b) as b
from df1
group by a1')
# a1 b
#1 A 4
#2 B 5
data
df1 <- structure(list(a = c("A.1", "A.2", "A.3", "B.1", "B.2", "B.3"
), b = c(1L, 2L, 1L, 2L, 2L, 1L)), .Names = c("a", "b"),
class = "data.frame", row.names = c(NA, -6L))
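As a further base R sketch (not one of the answers above), the grouping key can be built with the same sub() pattern and then summed with tapply(). The names prefix and totals are only illustrative, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Same data as df1 above, rebuilt so the snippet is self-contained
df1 <- data.frame(a = c("A.1", "A.2", "A.3", "B.1", "B.2", "B.3"),
                  b = c(1L, 2L, 1L, 2L, 2L, 1L),
                  stringsAsFactors = FALSE)

# Strip the trailing ".<digits>" so only the letter remains
prefix <- sub("\\.\\d+$", "", df1$a)   # "A" "A" "A" "B" "B" "B"

# Sum b within each prefix; the result is named by prefix
totals <- tapply(df1$b, prefix, sum)   # A = 4, B = 5
totals
```

Seeing the intermediate prefix vector makes it easier to check the regular expression before using it as a grouping variable in data.table or aggregate.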

Related

How to merge two data frame which has jumbled column names

I have 2 data frames, df1 and df2, with the same column names but in a different column order. How can I merge them into df3 without creating additional columns or rows?
df1
a b c
1 3 6
df2
b c a
5 6 1
expected df3
a b c
1 3 6
1 5 6
I tried the code below, but it did not work
df3=merge(df1, df2, by = "col.names")
We may use bind_rows, which automatically matches columns by name; if a column is missing from one of the datasets, it is filled with NA for the rows of that dataset. The order of columns follows the first dataset passed to bind_rows, i.e. df1
library(dplyr)
bind_rows(df1, df2)
-output
a b c
1 1 3 6
2 1 5 6
data
df1 <- structure(list(a = 1L, b = 3L, c = 6L), class = "data.frame", row.names = c(NA,
-1L))
df2 <- structure(list(b = 5L, c = 6L, a = 1L), class = "data.frame", row.names = c(NA,
-1L))
Rearrange the columns of one data frame to match the other, so that both have the same order of column names, and then use rbind.
rbind(df1, df2[names(df1)])
# a b c
#1 1 3 6
#2 1 5 6
In this case, using rbind(df1, df2) should work too.
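The two behaviours described above can be sketched in base R: rbind() matches data frame columns by name, but, unlike bind_rows, it stops with an error when a column is missing from one input. The helper bind_fill and the data frame df4 below are hypothetical, introduced only for illustration; bind_fill is a rough stand-in for the NA-filling behaviour, not how dplyr implements it.

```r
df1 <- data.frame(a = 1L, b = 3L, c = 6L)
df2 <- data.frame(b = 5L, c = 6L, a = 1L)

# rbind() on data frames aligns columns by name, not by position
df3 <- rbind(df1, df2)
df3

# Hypothetical helper: add any missing columns as NA, then bind by name
bind_fill <- function(x, y) {
  all_cols <- union(names(x), names(y))
  for (nm in setdiff(all_cols, names(x))) x[[nm]] <- NA
  for (nm in setdiff(all_cols, names(y))) y[[nm]] <- NA
  rbind(x, y[names(x)])
}

# df4 is a made-up input that lacks column c
df4 <- data.frame(b = 5L, a = 1L)
filled <- bind_fill(df1, df4)
filled
```

In filled, the row that came from df4 has NA in column c, and the column order still follows df1.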

Count distinct in R groupby, first splitting the cells by ","?

I have data in format given below
a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A
What I need is, grouping by column 'a', the number of distinct values in column 'b':
a count
1 2
2 4
For 1 we have only 2 distinct values, i.e. A and B,
but for 2 we have 4, i.e. A, B, C and D.
I could first explode the data into tall format and then do the groupby, but since I have a few other aggregations to do, I was looking for a way to do it in one line.
Thanks in advance
We can use aggregate in base R :
aggregate(b~a,df, function(x) length(unique(unlist(strsplit(x, ',')))))
# a b
#1 1 2
#2 2 4
data
df <- structure(list(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L), b = c("A,B",
"A", "B", "A,B", "D,C", "A", "A")), class = "data.frame", row.names = c(NA, -7L))
Using tidyr::separate_rows and dplyr::n_distinct this could be achieved like so:
library(dplyr)
d %>%
tidyr::separate_rows(b) %>%
group_by(a) %>%
summarise(count = n_distinct(b))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 2
#> a count
#> <int> <int>
#> 1 1 2
#> 2 2 4
DATA
d <- read.table(text = "a b
1 A,B
1 A
1 B
2 A,B
2 D,C
2 A
2 A", header = TRUE)
Base R using Map():
# split() already names each element by its group in 'a' ("1", "2"), so those names are kept
unlist(Map(function(x){length(unique(trimws(unlist(strsplit(x, ",")))))},
with(df, split(b, a))))
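All of the answers above rest on the same chain: split the strings on commas, flatten, de-duplicate, count. A minimal sketch of that chain for one group, followed by the per-group version with sapply(); the intermediate names (group2, pieces, distinct, counts) are only illustrative.

```r
df <- data.frame(a = c(1L, 1L, 1L, 2L, 2L, 2L, 2L),
                 b = c("A,B", "A", "B", "A,B", "D,C", "A", "A"),
                 stringsAsFactors = FALSE)

# Step by step for the group a == 2
group2   <- df$b[df$a == 2]               # "A,B" "D,C" "A" "A"
pieces   <- unlist(strsplit(group2, ","))  # "A" "B" "D" "C" "A" "A"
distinct <- unique(pieces)                 # "A" "B" "D" "C"
length(distinct)                           # 4

# The same chain applied to every group; the names are the values of 'a'
counts <- sapply(split(df$b, df$a),
                 function(x) length(unique(unlist(strsplit(x, ",")))))
counts
```

If the cells may contain spaces after the commas, wrapping the pieces in trimws() before unique(), as the Map() answer does, avoids counting "A" and " A" separately.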

How to use R to get all pairs from two columns with index

I would like to use R to get all pairs of genes that share the same index. It may need a loop. For example, turn two columns with the gene name and index:
a 1,
b 1,
c 1,
d 2,
e 2
into a new matrix
a b 1,
b c 1,
a c 1,
d e 2
Can anyone help?
A tidyverse option using combn on a grouped data.frame:
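library(tidyverse)
# as.data.frame() and unnest(gene, names_sep = ) replace the deprecated as_data_frame() and unnest(.sep = )
df %>% group_by(index) %>%
summarise(gene = list(as.data.frame(t(combn(gene, 2))))) %>%
unnest(gene, names_sep = '_')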
## # A tibble: 4 × 3
## index gene_V1 gene_V2
## <int> <chr> <chr>
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
The same logic can be replicated in base R:
df2 <- aggregate(gene ~ index, df, function(x){t(combn(x, 2))})
do.call(rbind, apply(df2, 1, data.frame))
## index gene.1 gene.2
## 1 1 a b
## 2 1 a c
## 3 1 b c
## 4 2 d e
Data
df <- structure(list(gene = c("a", "b", "c", "d", "e"), index = c(1L,
1L, 1L, 2L, 2L)), .Names = c("gene", "index"), row.names = c(NA,
-5L), class = "data.frame")
Here is an option using data.table. Convert the 'data.frame' to a 'data.table' with setDT(df). Grouped by 'index', we get the combn of 'gene', transpose it, and set the names of the 2nd and 3rd columns (if needed).
library(data.table)
setnames(setDT(df)[, transpose(combn(gene, 2, FUN = list)),
by = index], 2:3, paste0("gene", 1:2))[]
# index gene1 gene2
#1: 1 a b
#2: 1 a c
#3: 1 b c
#4: 2 d e
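For readers who want to see the mechanics without any packages, here is a minimal base R sketch of the same idea: combn() produces one pair per column, t() turns that into one pair per row, and the per-index results are stacked with rbind(). The names genes_by_index and gene_pairs are hypothetical, and the guard for groups with fewer than two genes is an added assumption, since combn() would otherwise fail on them.

```r
df <- data.frame(gene = c("a", "b", "c", "d", "e"),
                 index = c(1L, 1L, 1L, 2L, 2L),
                 stringsAsFactors = FALSE)

# One pair per column; transpose to get one pair per row
combn(c("a", "b", "c"), 2)

genes_by_index <- split(df$gene, df$index)

gene_pairs <- do.call(rbind, lapply(names(genes_by_index), function(idx) {
  g <- genes_by_index[[idx]]
  if (length(g) < 2) return(NULL)          # a single gene has no pairs
  m <- t(combn(g, 2))                      # rows are pairs
  data.frame(gene1 = m[, 1], gene2 = m[, 2],
             index = as.integer(idx), stringsAsFactors = FALSE)
}))
gene_pairs
```

The result has four rows: (a, b), (a, c), (b, c) for index 1 and (d, e) for index 2.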

Specified rows with same character in some rows in dataframe in R

I have a data frame in which the same character is repeated across several rows:
a 1
a 3
a 7
b 4
b 8
I want to change it to:
a.1 1
a.2 3
a.3 7
b.1 4
b.2 8
Do you know any code in R for this?
Thanks a lot.
You can also use data.table package:
library(data.table)
setDT(df)[,ix:=paste(V1,1:.N, sep='.'),V1][]
# V1 V2 ix
#1: a 1 a.1
#2: a 3 a.2
#3: a 7 a.3
#4: b 4 b.1
#5: b 8 b.2
Data:
df = structure(list(V1 = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), V2 = c(1L, 3L, 7L, 4L, 8L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -5L))
In base R, you could do:
df$V1 <- with(df, paste(V1, ave(as.numeric(V1), V1, FUN = seq_along), sep="."))
print(df)
# V1 V2
#1 a.1 1
#2 a.2 3
#3 a.3 7
#4 b.1 4
#5 b.2 8
We can use dplyr/tidyr. We group by 'V1', create a sequence column ('VN'), unite the columns 'V1' and 'VN', and then rename the column.
library(dplyr)
library(tidyr)
df %>%
group_by(V1) %>%
mutate(VN = row_number()) %>%
unite(V1n, V1, VN, sep='.') %>%
rename(V1=V1n)
# V1 V2
# (chr) (int)
#1 a.1 1
#2 a.2 3
#3 a.3 7
#4 b.1 4
#5 b.2 8
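All three answers compute a running position within each group and paste it onto the label. A small sketch of that step on its own, assuming V1 is stored as character rather than factor (the default in recent R versions), in which case ave() can be fed seq_along(V1) instead of as.numeric(V1); the name pos is only illustrative.

```r
df <- data.frame(V1 = c("a", "a", "a", "b", "b"),
                 V2 = c(1L, 3L, 7L, 4L, 8L),
                 stringsAsFactors = FALSE)

# Position of each row within its V1 group
pos <- ave(seq_along(df$V1), df$V1, FUN = seq_along)   # 1 2 3 1 2

# Append the position to the label
df$V1 <- paste(df$V1, pos, sep = ".")
df
```

Because ave() writes each group's result back to the original row positions, this also works when the rows of a group are not adjacent.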

Replace NA with values in another row of same column for each group in r

Within each group, I need to replace the NAs in a given column with the non-NA value from another row of the same group.
Sample data:
id name
1 a
1 NA
2 b
3 NA
3 c
3 NA
desired output:
id name
1 a
1 a
2 b
3 c
3 c
3 c
Is there a way to perform this in r ?
Here is an approach using dplyr. From the data frame x we group by id and replace NA with the relevant values. I am assuming one unique value of name per id.
x <- data.frame(id = c(1, 1, 2, rep(3,3)),
name = c("a", NA, "b", NA, "c", NA), stringsAsFactors=FALSE)
library(dplyr)
x %>%
group_by(id) %>%
mutate(name = unique(name[!is.na(name)]))
# Source: local data frame [6 x 2]
# Groups: id
# id name
#1 1 a
#2 1 a
#3 2 b
#4 3 c
#5 3 c
#6 3 c
We can use data.table to do this. Convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'id', we replace the 'name' with the non-NA value in 'name'.
library(data.table)#v1.9.5+
setDT(df1)[, name:= name[!is.na(name)][1L] , by = id]
df1
# id name
#1: 1 a
#2: 1 a
#3: 2 b
#4: 3 c
#5: 3 c
#6: 3 c
NOTE: Here I assumed that there is only a single unique non-NA value within each 'id' group.
Or another option would be to join the dataset with the unique rows of the data after we order by 'id' and 'name'.
setDT(df1)
df1[unique(df1[order(id, name)], by='id'), on='id', name:= i.name][]
# id name
#1: 1 a
#2: 1 a
#3: 2 b
#4: 3 c
#5: 3 c
#6: 3 c
NOTE: The on argument requires data.table v1.9.6 or later; when this answer was written it was only available in the devel version (v1.9.5).
data
df1 <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 3L), name = c("a",
NA, "b", NA, "c", NA)), .Names = c("id", "name"),
class = "data.frame", row.names = c(NA, -6L))
Base R
d <- na.omit(df)
transform(df, name = d$name[match(id, d$id)])
This again assumes one unique value of name per id; if there are several, match() picks the first non-NA one.
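The match() lookup in the base R answer can be unpacked as follows. This is a sketch on the df1 sample data from this thread, rebuilt so it runs on its own; pos is an illustrative name.

```r
df1 <- data.frame(id = c(1L, 1L, 2L, 3L, 3L, 3L),
                  name = c("a", NA, "b", NA, "c", NA),
                  stringsAsFactors = FALSE)

# Rows that actually carry a name: one per id in this data
d <- na.omit(df1)

# For every row of df1, the position of the first row in d with the same id
pos <- match(df1$id, d$id)   # 1 1 2 3 3 3

# Look the name up through that position
df1$name <- d$name[pos]
df1
```

An id whose rows are all NA does not appear in d, so match() returns NA for it and its name stays NA.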
