Separating a column into multiple columns in R

Here is my data:
df1<-read.table(text= "Group
11Z-23456
12B-10000
22M-2000
12M-1100
33G-100",header=TRUE)
I want to get this data:
A B C Code
1 1 Z 23456
1 2 B 10000
2 2 M 2000
1 2 M 1100
3 3 G 100
As you can see from my data, I want to separate the values in the Group column. This is just a small sample; please consider a solution that also works for a large dataset.

Here is a base R approach you can try:
# split off the trailing code, then split the leading part ("11Z") into single characters
l <- strsplit(as.character(df1$Group), split = "-")
dfout <- setNames(data.frame(t(mapply(c, strsplit(sapply(l, `[[`, 1), split = ""),
                                      lapply(l, `[[`, 2)))),
                  c("A", "B", "C", "Code"))
Or using the tidyr package:
library(tidyr)
df1 %>%
  separate(Group, c("X", "Code"), sep = "-") %>%
  separate(X, c("A", "B", "C"), sep = 1:2)
Either way, the result is:
> dfout
A B C Code
1 1 1 Z 23456
2 1 2 B 10000
3 2 2 M 2000
4 1 2 M 1100
5 3 3 G 100

Using `data.table`:
library(data.table)
setDT(df1)
df1[, c("ABC", "Code") := tstrsplit(Group, "-")]
df1[, c("A", "B", "C") := tstrsplit(ABC, "")]
df1[, c("ABC", "Group") := NULL]
df1
# Code A B C
# 1: 23456 1 1 Z
# 2: 10000 1 2 B
# 3: 2000 2 2 M
# 4: 1100 1 2 M
# 5: 100 3 3 G
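For completeness, base R also offers a one-call alternative: strcapture() (available since R 3.4) applies a regex with capture groups and fills a prototype data frame. This is a sketch that assumes every Group value matches the digit-digit-letter-number pattern of the sample:
# each capture group becomes one column of the prototype
dfout <- strcapture("^(\\d)(\\d)([A-Z])-(\\d+)$", as.character(df1$Group),
                    proto = data.frame(A = integer(), B = integer(),
                                       C = character(), Code = integer()))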

Related

Creating a data frame with the smallest values grouped by a column

How can I create a new data frame with the smallest value per group of a column?
For example this df:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 0
B 6
C 1
D 0
D 4')
Now with:
test <- setDT(df)[, .SD[which.min(Value)], by=Gene]
I get this:
> test
Gene Value
1: A 10
2: B 0
3: C 1
4: D 0
But how can I use a second condition for Value > 0 here? I want to have this output:
> test
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
Could do:
setDT(df)[, .(Value = min(Value[Value > 0])), by=Gene]
Output:
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
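One edge case worth noting: if every Value in a group is 0, min(Value[Value > 0]) returns Inf with a warning. A sketch of an alternative that filters in i first, so such groups are simply dropped:
# filter rows before grouping; all-zero genes disappear from the result
setDT(df)[Value > 0, .(Value = min(Value)), by = Gene]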
Using the tidyverse, you can group, filter, and then summarise the minimum value:
library(tidyverse)
df2 <- df %>%
  group_by(Gene) %>%
  filter(Value > 0) %>%
  summarise(Value = min(Value))
# A tibble: 4 x 2
Gene Value
<fct> <dbl>
1 A 10
2 B 3
3 C 1
4 D 4
Using aggregate() from base R:
aggregate(Value ~ Gene, subset(df, Value > 0), min)
# Gene Value
#1 A 10
#2 B 3
#3 C 1
#4 D 4
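A newer dplyr idiom for the same task, as a sketch (slice_min() needs dplyr >= 1.0):
library(dplyr)
df %>%
  filter(Value > 0) %>%                      # drop zeros before taking the minimum
  group_by(Gene) %>%
  slice_min(Value, n = 1, with_ties = FALSE) %>%
  ungroup()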

Filter data.table with another data.table with different column names

I have this dataset:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
and I have used the solution from this post to create this summary:
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")
I would like to use the results from dt_score_1 to filter dt based on the area/zone and cluster/cluster_mode. So in a new data.table, the only rows taken from dt for area A should belong to cluster X, for area D they should be cluster Z etc.
If I'm understanding the question correctly, this is a merge of dt with dt_score_1 with the conditions area = zone, cluster = cluster_mode.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
# record area score cluster i.score cluster_pct cluster_freq record_freq
# 1: 1 A 1 X 1 100.00000 2 2
# 2: 2 A 1 X 1 100.00000 2 2
# 3: 7 B 1 X 1 66.66667 2 3
# 4: 8 B 1 X 1 66.66667 2 3
# 5: 11 C 2 X 1 100.00000 1 1
# 6: 12 C 1 X 1 100.00000 1 1
# 7: 14 D 1 Z 1 80.00000 4 5
# 8: 15 D 1 Z 1 80.00000 4 5
# 9: 16 D 1 Z 1 80.00000 4 5
# 10: 17 D 1 Z 1 80.00000 4 5
# 11: 20 D 3 Z 1 80.00000 4 5
For a more detailed explanation of the join-as-filter idiom, see the link below, posted by @Frank:
Perform a semi-join with data.table
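If you want a true semi-join, i.e. only dt's rows and columns without the summary columns appended, one common idiom is to ask the join for the matched row numbers (a sketch):
idx <- dt[dt_score_1, on = .(area = zone, cluster = cluster_mode),
          which = TRUE, nomatch = 0L]   # row numbers of dt that matched
dt[sort(unique(idx))]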

New columns based off existing column and column located next to it

My data frame looks like this:
ID t1 obs1 t2 obs2 t3 obs3
1 0 a 11 d 0 g
2 0 b 13 e 11 i
3 0 c 0 f 0 h
I need to make sure each ID has at least one t above 10 (and delete the row if not). Then I want to save the lowest t value above 10, and also save the corresponding obs, in new columns. The complicated part is that the lowest t above 10 could be in any column. The obs corresponding to a given t sits in the next column, which helps. So my resulting data frame would look like this:
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 0 a 11 d 0 g 11 d
2 0 b 13 e 11 i 11 i
With data.table, go to long format:
library(data.table)
setDT(dt)
dat = melt(dt, measure.vars = patterns("^t\\d+$", "^obs\\d+$"),
           value.name = c("t", "obs"))
setorder(dat, ID, variable)
# ID variable t obs
# 1: 1 1 0 a
# 2: 1 2 11 d
# 3: 1 3 0 g
# 4: 2 1 0 b
# 5: 2 2 13 e
# 6: 2 3 11 i
# 7: 3 1 0 c
# 8: 3 2 0 f
# 9: 3 3 0 h
Find max value per group and mark groups to keep:
IDDT = dat[order(-t),
.(max.variable = first(variable), max.t = first(t), max.obs = first(obs))
, by=ID]
IDDT[, keep := max.t > 10]
# ID max.variable max.t max.obs keep
# 1: 2 2 13 e TRUE
# 2: 1 2 11 d TRUE
# 3: 3 1 0 c FALSE
Find min value over 10 per kept group using a rolling update join:
IDDT[(keep), c("my.variable", "my.t", "my.obs") := {
m = .(ID = ID, t_thresh = 10)
dat[m, on=.(ID, t = t_thresh), roll=-Inf, .(x.variable, x.t, x.obs)]
}]
# ID max.variable max.t max.obs keep my.variable my.t my.obs
# 1: 2 2 13 e TRUE 3 11 i
# 2: 1 2 11 d TRUE 2 11 d
# 3: 3 1 0 c FALSE NA NA NA
I would stop here, with the main data in long format dat and the ID level variables in the separate table IDDT. To filter dat to groups that should be kept: dat[IDDT[(keep), .(ID)], on=.(ID)]. See ?data.table and the other intro materials mentioned when you load the package for details on the syntax.
See ?dcast if you insist on going back to wide.
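As a sketch of that route: filter dat to the kept IDs, then dcast() with both value columns (multiple value.var needs data.table >= 1.9.6); the result has one row per kept ID with columns t_1..t_3 and obs_1..obs_3:
wide <- dcast(dat[IDDT[(keep), .(ID)], on = .(ID)],
              ID ~ variable, value.var = c("t", "obs"))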
Using base R:
Drop all rows with no t-values above 10:
df1 <- df1[rowSums(df1[, grepl("^t", colnames(df1))] > 10) > 0, ]
Determine the group that contains the lowest value above 10 and then retrieve values:
df1$group <- apply(df1[grepl("^t", names(df1))], 1, function(x) which(x == min(x[x > 10])))
df1 <- cbind(df1, do.call(rbind, lapply(seq_len(nrow(df1)),
function(x) setNames(df1[x, paste0(c("t", "obs"), df1$group[x])],
c("lowesttabove10", "correspondingobs")))))
> df1
ID t1 obs1 t2 obs2 t3 obs3 group lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 2 11 d
2 2 0 b 13 e 11 i 3 11 i
My approach is not neat, but it still works; you can try it. (Note that str_detect() below requires stringr.)
library(dplyr)
library(reshape)
library(stringr)
df1 <- melt(dt, id = 'ID')
df2 <- df1 %>% group_by(ID) %>% filter(value > 10) %>%
  dplyr::slice(which.min(value)) %>% na.omit()
> df2
# A tibble: 2 x 3
# Groups: ID [2]
ID variable value
<int> <fctr> <chr>
1 1 t2 11
2 2 t3 11
df2$variable <- as.character(df2$variable)
C <- as.numeric(gsub("[[:alpha:]]", "", df2$variable))
dt <- dt[dt$ID %in% df2$ID, ]
for (i in 1:length(C)) {
  # pick the pair of columns (t<i>, obs<i>) whose names contain the digit
  DF1 <- dt[i, str_detect(names(dt), as.character(C[i]))]
  names(DF1) <- c('lowesttabove10', 'correspondingobs')
  if (i == 1) { DFF <- DF1 } else { DFF <- rbind(DFF, DF1) }
}
cbind(dt, DFF)
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 11 d
2 2 0 b 13 e 11 i 11 i
This solution uses dplyr and tidyr in one pipeline; dt is the original data, while dt2 is the final output. (A group_by(ID) is needed so the minimum is taken per ID, not globally.)
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(t_group, t_value, starts_with("t")) %>%
  gather(obs_group, obs_value, starts_with("obs")) %>%
  filter(gsub("t", "", t_group) == gsub("obs", "", obs_group)) %>%
  filter(t_value > 10) %>%
  group_by(ID) %>%                 # take the minimum per ID
  filter(t_value == min(t_value)) %>%
  ungroup() %>%
  select(ID, lowesttabove10 = t_value, correspondingobs = obs_value) %>%
  inner_join(dt, by = "ID") %>%
  select(colnames(dt), lowesttabove10, correspondingobs)
dt2
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 11 d
2 2 0 b 13 e 11 i 11 i
Data:
dt <- read.table(text = "ID t1 obs1 t2 obs2 t3 obs3
1 0 a 11 d 0 g
2 0 b 13 e 11 i
3 0 c 0 f 0 h",
header = TRUE, stringsAsFactors = FALSE)
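For completeness, a modern tidyr rewrite of this pipeline, as a sketch (pivot_longer() with a .value spec needs tidyr >= 1.0; slice_min() needs dplyr >= 1.0). names_pattern pairs each t column with its obs column directly, avoiding the double gather():
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(-ID, names_to = c(".value", "set"),
               names_pattern = "([a-z]+)(\\d+)") %>%
  filter(t > 10) %>%
  group_by(ID) %>%
  slice_min(t, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, lowesttabove10 = t, correspondingobs = obs) %>%
  inner_join(dt, by = "ID") %>%
  select(all_of(colnames(dt)), lowesttabove10, correspondingobs)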

Replace NAs in each column based on another vector using dplyr

I am trying to replace NAs in a data.frame of many columns, using another vector that gives the replacement value for each column. I know how I could replace each value using a function, but not how to look the value up in another vector. I am searching for a dplyr approach:
For example:
require(dplyr)
test <- data.frame(A = c(1,2,3,NA), B = c(4,5,NA,2), C = c(NA,2,2,NA), D = c(1,2,3,4))
replace_na <- c(A = 100, B = 200, C = 300)
# Replacing with the median works; instead, it should replace with the lookup value from the vector, matched by name or position
test %>% mutate_each_(funs(replace(., is.na(.), median(.,na.rm = T))), names(replace_na))
expected_result <- data.frame(A = c(1,2,3,100), B = c(4,5,200,2), C = c(300,2,2,300), D = c(1,2,3,4))
> expected_result
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 100 2 300 4
It is as easy as using the replace_na() function from the tidyr package (the lookup vector is renamed to replacements here so it does not shadow the function; see the Data block below):
library(tidyr)
test %>% replace_na(as.list(replacements))
The output:
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 100 2 300 4
This function takes a list naming, for each column, the value to use for its NAs, so it is possible to replace NAs in only selected columns. Example:
replacements2 <- list(B = 200, C = 300)
test %>% replace_na(replacements2)
output:
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 NA 2 300 4
As you can see, only the NAs in the B and C columns are replaced.
Data:
test <- data.frame(A = c(1,2,3,NA), B = c(4,5,NA,2), C = c(NA,2,2,NA), D = c(1,2,3,4))
replacements <- c(A = 100, B = 200, C = 300)
We can use Map() from base R:
test[names(replace_na)] <- Map(function(x,y)
replace(x, is.na(x), y), test[names(replace_na)], replace_na)
test
# A B C D
#1 1 4 300 1
#2 2 5 2 2
#3 3 200 2 3
#4 100 2 300 4
Or with the tidyverse:
library(tidyverse)
test %>%
select_at(names(replace_na)) %>%
map2_df(., replace_na, ~replace(., is.na(.), .y)) %>%
bind_cols(., select_at(test, setdiff(names(test), names(replace_na))))
# A tibble: 4 x 4
# A B C D
# <dbl> <dbl> <dbl> <dbl>
#1 1 4 300 1
#2 2 5 2 2
#3 3 200 2 3
#4 100 2 300 4
Or with set() from data.table:
library(data.table)
setDT(test)
for(j in names(replace_na)){
set(test, i = which(is.na(test[[j]])), j = j, value = replace_na[j])
}
test
# A B C D
#1: 1 4 300 1
#2: 2 5 2 2
#3: 3 200 2 3
#4: 100 2 300 4
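For reference, a more recent dplyr idiom that looks the replacement up by column name, as a sketch (across() and cur_column() need dplyr >= 1.0; replacements is the named vector from the Data block above):
library(dplyr)
library(tidyr)
test %>%
  mutate(across(all_of(names(replacements)),
                ~ replace_na(.x, replacements[[cur_column()]])))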

Frequency of rows by ID

The data set contains three variables: id, sex, and grade (factor).
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
For each ID, I need to see how many unique grades we have and then create a new column (call it N) to record the grade frequency. For instance, for ID=1 we have five unique values of grade, so N = 5; for ID=2 we have two unique values, so N = 2; for ID=4 we have two unique values (ignoring NA), so N = 2.
The final data set is
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)
New answer:
The uniqueN() function of data.table has an na.rm argument, which we can use as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade, na.rm = TRUE), by = id]
which gives:
> mydata
id sex grade n
1: 1 1 a 5
2: 1 1 b 5
3: 1 1 c 5
4: 1 1 d 5
5: 1 1 e 5
6: 2 0 x 2
7: 2 0 y 2
8: 2 0 y 2
9: 2 0 x 2
10: 3 0 q 1
11: 3 0 q 1
12: 3 0 q 1
13: 3 0 q 1
14: 4 1 a 2
15: 4 1 a 2
16: 4 1 a 2
17: 4 1 NA 2
18: 4 1 b 2
Old answer:
With data.table you could do this as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade[!is.na(grade)]), by = id]
or:
setDT(mydata)[, n := uniqueN(na.omit(grade)), by = id]
You could use the package data.table:
library(data.table)
setDT(mydata)
# NAs are removed here; it is up to you how to count them
mydata[, N_u := length(unique(grade[!is.na(grade)])), by = id]
Very short, readable and fast. It can also be done in base R:
# split(grade, id) splits the data into per-id subsets,
# rep() repeats each group's count once per row of the group,
# and unlist() flattens the per-group vectors into one vector
mydata$N <- unlist(lapply(split(mydata$grade, mydata$id), function(x) {
  rep(length(unique(x[!is.na(x)])), length(x))
}))
Because there was discussion on what is faster, let's do some benchmarking.
On the given dataset:
> test1
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030 100 b
uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809 3.985393 100 a
Larger dataset (10,000 observations, 1,000 ids):
> test2
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 11.84123 24.47122 37.09234 30.34923 47.55632 97.63648 100 a
uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743 100 b
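For reproducibility, a sketch of how such a benchmark can be set up (assumes the microbenchmark package; the timings above are from the original poster's machine):
library(microbenchmark)
library(data.table)
set.seed(1)
big <- data.table(id    = sample(1000, 1e4, replace = TRUE),
                  grade = sample(c(letters[1:5], NA), 1e4, replace = TRUE))
microbenchmark(
  length_unique = big[, n := length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = big[, n := uniqueN(grade, na.rm = TRUE), by = id]
)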
A dplyr option that makes use of dplyr::n_distinct and its na.rm-argument:
library(dplyr)
mydata %>% group_by(id) %>% mutate(N = n_distinct(grade, na.rm = TRUE))
#Source: local data frame [18 x 4]
#Groups: id [4]
#
# id sex grade N
# (dbl) (dbl) (fctr) (int)
#1 1 1 a 5
#2 1 1 b 5
#3 1 1 c 5
#4 1 1 d 5
#5 1 1 e 5
#6 2 0 x 2
#7 2 0 y 2
#8 2 0 y 2
#9 2 0 x 2
#10 3 0 q 1
#11 3 0 q 1
#12 3 0 q 1
#13 3 0 q 1
#14 4 1 a 2
#15 4 1 a 2
#16 4 1 a 2
#17 4 1 NA 2
#18 4 1 b 2
Looks like we have several votes for data.table, but you could also use the base R function ave():
# ave() returns a vector of the same type as its input (character here),
# so convert back to numeric
mydata$N <- as.numeric(ave(as.character(mydata$grade), mydata$id,
                           FUN = function(x) length(unique(x[!is.na(x)]))))
Use tapply() and a lookup table:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q",
"q","q","q", "a", "a", "a", NA, "b"))
uniqN <- tapply(mydata$grade, mydata$id, function(x) sum(!is.na(unique(x))))
mydata$N <- uniqN[mydata$id]
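One caveat: uniqN[mydata$id] indexes by position, which only works because the ids here happen to be 1 to 4. A sketch of the safer name-based lookup for arbitrary ids:
# tapply() names its result by the grouping values, so character
# indexing matches ids even when they are not 1..n
mydata$N <- uniqN[as.character(mydata$id)]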
Here is a dplyr method. I kept the summary table separate for tidy reasons.
library(dplyr)
summary =
mydata %>%
distinct(id, grade) %>%
filter(grade %>% is.na %>% `!`) %>%
count(id)
mydata %>%
left_join(summary)
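Note that count() names its output column n; to get the desired N directly, a sketch (count()'s name argument needs dplyr >= 0.8):
summary <- mydata %>%
  distinct(id, grade) %>%
  filter(!is.na(grade)) %>%
  count(id, name = "N")
mydata %>% left_join(summary, by = "id")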