Separating a column into multiple columns in R

Here is my data:
df1<-read.table(text= "Group
11Z-23456
12B-10000
22M-2000
12M-1100
33G-100",header=TRUE)
I want to get this data:
A B C Code
1 1 Z 23456
1 2 B 10000
2 2 M 2000
1 2 M 1100
3 3 G 100
As you can see from my data, I want to separate the values in the Group column. This is just a small sample; please consider a solution that also works for a large dataset.

Here is a base R approach you can try:
# split off the trailing code, then split the leading part ("11Z") into single characters
l <- strsplit(as.character(df1$Group), split = "-")
dfout <- setNames(data.frame(t(mapply(c, strsplit(sapply(l, `[[`, 1), split = ""),
                                      lapply(l, `[[`, 2)))),
                  c("A", "B", "C", "Code"))
Or using the tidyr package:
library(tidyr)
df1 %>%
  separate(Group, c("X", "Code"), sep = "-") %>%
  separate(X, c("A", "B", "C"), sep = 1:2)
Either way, the result is:
> dfout
A B C Code
1 1 1 Z 23456
2 1 2 B 10000
3 2 2 M 2000
4 1 2 M 1100
5 3 3 G 100

Using `data.table`:
library(data.table)
setDT(df1)
df1[, c("ABC", "Code") := tstrsplit(Group, "-")]
df1[, c("A", "B", "C") := tstrsplit(ABC, "")]
df1[, c("ABC", "Group") := NULL]
df1
# Code A B C
# 1: 23456 1 1 Z
# 2: 10000 1 2 B
# 3: 2000 2 2 M
# 4: 1100 1 2 M
# 5: 100 3 3 G
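For completeness, base R also offers a one-call alternative: strcapture() (available since R 3.4) applies a regex with capture groups and fills a prototype data frame. This is a sketch that assumes every Group value matches the digit-digit-letter-number pattern of the sample:
# each capture group becomes one column of the prototype
dfout <- strcapture("^(\\d)(\\d)([A-Z])-(\\d+)$", as.character(df1$Group),
                    proto = data.frame(A = integer(), B = integer(),
                                       C = character(), Code = integer()))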

Related

Creating a data frame with the smallest values grouped by a column

How can I create a new data frame with the smallest value per group of a column?
For example this df:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 0
B 6
C 1
D 0
D 4')
Now with:
test <- setDT(df)[, .SD[which.min(Value)], by=Gene]
I get this:
> test
Gene Value
1: A 10
2: B 0
3: C 1
4: D 0
But how can I use a second condition for Value > 0 here? I want to have this output:
> test
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
Could do:
setDT(df)[, .(Value = min(Value[Value > 0])), by=Gene]
Output:
Gene Value
1: A 10
2: B 3
3: C 1
4: D 4
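One edge case worth noting: if every Value in a group is 0, min(Value[Value > 0]) returns Inf with a warning. A sketch of an alternative that filters in i first, so such groups are simply dropped:
# filter rows before grouping; all-zero genes disappear from the result
setDT(df)[Value > 0, .(Value = min(Value)), by = Gene]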
Using the tidyverse, you can group, filter, and then summarise the minimum value:
library(tidyverse)
df2 <- df %>%
  group_by(Gene) %>%
  filter(Value > 0) %>%
  summarise(Value = min(Value))
# A tibble: 4 x 2
Gene Value
<fct> <dbl>
1 A 10
2 B 3
3 C 1
4 D 4
Using aggregate() from base R:
aggregate(Value ~ Gene, subset(df, Value > 0), min)
# Gene Value
#1 A 10
#2 B 3
#3 C 1
#4 D 4
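A newer dplyr idiom for the same task, as a sketch (slice_min() needs dplyr >= 1.0):
library(dplyr)
df %>%
  filter(Value > 0) %>%                      # drop zeros before taking the minimum
  group_by(Gene) %>%
  slice_min(Value, n = 1, with_ties = FALSE) %>%
  ungroup()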

Filter data.table with another data.table with different column names

I have this dataset:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
and I have used the solution from this post to create this summary:
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")
I would like to use the results from dt_score_1 to filter dt based on the area/zone and cluster/cluster_mode. So in a new data.table, the only rows taken from dt for area A should belong to cluster X, for area D they should be cluster Z etc.
If I'm understanding the question correctly, this is a merge of dt with dt_score_1 with the conditions area = zone, cluster = cluster_mode.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
# record area score cluster i.score cluster_pct cluster_freq record_freq
# 1: 1 A 1 X 1 100.00000 2 2
# 2: 2 A 1 X 1 100.00000 2 2
# 3: 7 B 1 X 1 66.66667 2 3
# 4: 8 B 1 X 1 66.66667 2 3
# 5: 11 C 2 X 1 100.00000 1 1
# 6: 12 C 1 X 1 100.00000 1 1
# 7: 14 D 1 Z 1 80.00000 4 5
# 8: 15 D 1 Z 1 80.00000 4 5
# 9: 16 D 1 Z 1 80.00000 4 5
# 10: 17 D 1 Z 1 80.00000 4 5
# 11: 20 D 3 Z 1 80.00000 4 5
For a more detailed explanation of the join-as-filter idiom, see the link below, posted by @Frank:
Perform a semi-join with data.table
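If you want a true semi-join, i.e. only dt's rows and columns without the summary columns appended, one common idiom is to ask the join for the matched row numbers (a sketch):
idx <- dt[dt_score_1, on = .(area = zone, cluster = cluster_mode),
          which = TRUE, nomatch = 0L]   # row numbers of dt that matched
dt[sort(unique(idx))]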

New columns based off existing column and column located next to it

My data frame looks like this:
ID t1 obs1 t2 obs2 t3 obs3
1 0 a 11 d 0 g
2 0 b 13 e 11 i
3 0 c 0 f 0 h
I need to make sure each ID has at least one t above 10 (and delete the row if not). Then I want to save the lowest t value above 10, and also save the corresponding obs, in new columns. The complicated part is that the lowest t above 10 could be in any column. The obs corresponding to a given t sits in the next column, which helps. So my resulting data frame would look like this:
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 0 a 11 d 0 g 11 d
2 0 b 13 e 11 i 11 i
With data.table, go to long format:
library(data.table)
setDT(dt)
dat = melt(dt, measure.vars = patterns("^t\\d+$", "^obs\\d+$"),
           value.name = c("t", "obs"))
setorder(dat, ID, variable)
# ID variable t obs
# 1: 1 1 0 a
# 2: 1 2 11 d
# 3: 1 3 0 g
# 4: 2 1 0 b
# 5: 2 2 13 e
# 6: 2 3 11 i
# 7: 3 1 0 c
# 8: 3 2 0 f
# 9: 3 3 0 h
Find max value per group and mark groups to keep:
IDDT = dat[order(-t),
.(max.variable = first(variable), max.t = first(t), max.obs = first(obs))
, by=ID]
IDDT[, keep := max.t > 10]
# ID max.variable max.t max.obs keep
# 1: 2 2 13 e TRUE
# 2: 1 2 11 d TRUE
# 3: 3 1 0 c FALSE
Find min value over 10 per kept group using a rolling update join:
IDDT[(keep), c("my.variable", "my.t", "my.obs") := {
m = .(ID = ID, t_thresh = 10)
dat[m, on=.(ID, t = t_thresh), roll=-Inf, .(x.variable, x.t, x.obs)]
}]
# ID max.variable max.t max.obs keep my.variable my.t my.obs
# 1: 2 2 13 e TRUE 3 11 i
# 2: 1 2 11 d TRUE 2 11 d
# 3: 3 1 0 c FALSE NA NA NA
I would stop here, with the main data in long format dat and the ID level variables in the separate table IDDT. To filter dat to groups that should be kept: dat[IDDT[(keep), .(ID)], on=.(ID)]. See ?data.table and the other intro materials mentioned when you load the package for details on the syntax.
See ?dcast if you insist on going back to wide.
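As a sketch of that route: filter dat to the kept IDs, then dcast() with both value columns (multiple value.var needs data.table >= 1.9.6); the result has one row per kept ID with columns t_1..t_3 and obs_1..obs_3:
wide <- dcast(dat[IDDT[(keep), .(ID)], on = .(ID)],
              ID ~ variable, value.var = c("t", "obs"))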
Using base R:
Drop all rows with no t-values above 10:
df1 <- df1[rowSums(df1[, grepl("^t", colnames(df1))] > 10) > 0, ]
Determine the group that contains the lowest value above 10 and then retrieve values:
df1$group <- apply(df1[grepl("^t", names(df1))], 1, function(x) which(x == min(x[x > 10])))
df1 <- cbind(df1, do.call(rbind, lapply(seq_len(nrow(df1)),
function(x) setNames(df1[x, paste0(c("t", "obs"), df1$group[x])],
c("lowesttabove10", "correspondingobs")))))
> df1
ID t1 obs1 t2 obs2 t3 obs3 group lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 2 11 d
2 2 0 b 13 e 11 i 3 11 i
My approach is not neat, but it still works; you can try it. (Note that str_detect() below requires stringr.)
library(dplyr)
library(reshape)
library(stringr)
df1 <- melt(dt, id = 'ID')
df2 <- df1 %>% group_by(ID) %>% filter(value > 10) %>%
  dplyr::slice(which.min(value)) %>% na.omit()
> df2
# A tibble: 2 x 3
# Groups: ID [2]
ID variable value
<int> <fctr> <chr>
1 1 t2 11
2 2 t3 11
df2$variable <- as.character(df2$variable)
C <- as.numeric(gsub("[[:alpha:]]", "", df2$variable))
dt <- dt[dt$ID %in% df2$ID, ]
for (i in 1:length(C)) {
  # pick the pair of columns (t<i>, obs<i>) whose names contain the digit
  DF1 <- dt[i, str_detect(names(dt), as.character(C[i]))]
  names(DF1) <- c('lowesttabove10', 'correspondingobs')
  if (i == 1) { DFF <- DF1 } else { DFF <- rbind(DFF, DF1) }
}
cbind(dt, DFF)
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 11 d
2 2 0 b 13 e 11 i 11 i
This solution uses dplyr and tidyr in one pipeline; dt is the original data, while dt2 is the final output. (A group_by(ID) is needed so the minimum is taken per ID, not globally.)
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(t_group, t_value, starts_with("t")) %>%
  gather(obs_group, obs_value, starts_with("obs")) %>%
  filter(gsub("t", "", t_group) == gsub("obs", "", obs_group)) %>%
  filter(t_value > 10) %>%
  group_by(ID) %>%                 # take the minimum per ID
  filter(t_value == min(t_value)) %>%
  ungroup() %>%
  select(ID, lowesttabove10 = t_value, correspondingobs = obs_value) %>%
  inner_join(dt, by = "ID") %>%
  select(colnames(dt), lowesttabove10, correspondingobs)
dt2
ID t1 obs1 t2 obs2 t3 obs3 lowesttabove10 correspondingobs
1 1 0 a 11 d 0 g 11 d
2 2 0 b 13 e 11 i 11 i
Data:
dt <- read.table(text = "ID t1 obs1 t2 obs2 t3 obs3
1 0 a 11 d 0 g
2 0 b 13 e 11 i
3 0 c 0 f 0 h",
header = TRUE, stringsAsFactors = FALSE)
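For completeness, a modern tidyr rewrite of this pipeline, as a sketch (pivot_longer() with a .value spec needs tidyr >= 1.0; slice_min() needs dplyr >= 1.0). names_pattern pairs each t column with its obs column directly, avoiding the double gather():
library(dplyr)
library(tidyr)
dt %>%
  pivot_longer(-ID, names_to = c(".value", "set"),
               names_pattern = "([a-z]+)(\\d+)") %>%
  filter(t > 10) %>%
  group_by(ID) %>%
  slice_min(t, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, lowesttabove10 = t, correspondingobs = obs) %>%
  inner_join(dt, by = "ID") %>%
  select(all_of(colnames(dt)), lowesttabove10, correspondingobs)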

Replace NAs in each column based on another vector using dplyr

I am trying to replace NAs in a data.frame of many columns, using another vector that gives the replacement value for each column. I know how I could replace each value using a function, but not how to look the value up in another vector. I am searching for a dplyr approach:
For example:
require(dplyr)
test <- data.frame(A = c(1,2,3,NA), B = c(4,5,NA,2), C = c(NA,2,2,NA), D = c(1,2,3,4))
replace_na <- c(A = 100, B = 200, C = 300)
# Replacing with the median works; instead, it should replace with the lookup value from the vector, matched by name or position
test %>% mutate_each_(funs(replace(., is.na(.), median(.,na.rm = T))), names(replace_na))
expected_result <- data.frame(A = c(1,2,3,100), B = c(4,5,200,2), C = c(300,2,2,300), D = c(1,2,3,4))
> expected_result
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 100 2 300 4
It is as easy as using the replace_na() function from the tidyr package (the lookup vector is renamed to replacements here so it does not shadow the function; see the Data block below):
library(tidyr)
test %>% replace_na(as.list(replacements))
The output:
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 100 2 300 4
This function takes a list naming, for each column, the value to use for its NAs, so it is possible to replace NAs in only selected columns. Example:
replacements2 <- list(B = 200, C = 300)
test %>% replace_na(replacements2)
output:
A B C D
1 1 4 300 1
2 2 5 2 2
3 3 200 2 3
4 NA 2 300 4
As you can see, only the NAs in the B and C columns are replaced.
Data:
test <- data.frame(A = c(1,2,3,NA), B = c(4,5,NA,2), C = c(NA,2,2,NA), D = c(1,2,3,4))
replacements <- c(A = 100, B = 200, C = 300)
We can use Map() from base R:
test[names(replace_na)] <- Map(function(x,y)
replace(x, is.na(x), y), test[names(replace_na)], replace_na)
test
# A B C D
#1 1 4 300 1
#2 2 5 2 2
#3 3 200 2 3
#4 100 2 300 4
Or with the tidyverse:
library(tidyverse)
test %>%
select_at(names(replace_na)) %>%
map2_df(., replace_na, ~replace(., is.na(.), .y)) %>%
bind_cols(., select_at(test, setdiff(names(test), names(replace_na))))
# A tibble: 4 x 4
# A B C D
# <dbl> <dbl> <dbl> <dbl>
#1 1 4 300 1
#2 2 5 2 2
#3 3 200 2 3
#4 100 2 300 4
Or with set() from data.table:
library(data.table)
setDT(test)
for(j in names(replace_na)){
set(test, i = which(is.na(test[[j]])), j = j, value = replace_na[j])
}
test
# A B C D
#1: 1 4 300 1
#2: 2 5 2 2
#3: 3 200 2 3
#4: 100 2 300 4
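For reference, a more recent dplyr idiom that looks the replacement up by column name, as a sketch (across() and cur_column() need dplyr >= 1.0; replacements is the named vector from the Data block above):
library(dplyr)
library(tidyr)
test %>%
  mutate(across(all_of(names(replacements)),
                ~ replace_na(.x, replacements[[cur_column()]])))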

Frequency of rows by ID

The data set contains three variables: id, sex, and grade (factor).
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
For each ID, I need to see how many unique grades we have and then create a new column (call it N) to record the grade frequency. For instance, for ID=1 we have five unique values of grade, so N = 5; for ID=2 we have two unique values, so N = 2; for ID=4 we have two unique values (ignoring NA), so N = 2.
The final data set is
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)
New answer:
The uniqueN() function of data.table has an na.rm argument, which we can use as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade, na.rm = TRUE), by = id]
which gives:
> mydata
id sex grade n
1: 1 1 a 5
2: 1 1 b 5
3: 1 1 c 5
4: 1 1 d 5
5: 1 1 e 5
6: 2 0 x 2
7: 2 0 y 2
8: 2 0 y 2
9: 2 0 x 2
10: 3 0 q 1
11: 3 0 q 1
12: 3 0 q 1
13: 3 0 q 1
14: 4 1 a 2
15: 4 1 a 2
16: 4 1 a 2
17: 4 1 NA 2
18: 4 1 b 2
Old answer:
With data.table you could do this as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade[!is.na(grade)]), by = id]
or:
setDT(mydata)[, n := uniqueN(na.omit(grade)), by = id]
You could use the package data.table:
library(data.table)
setDT(mydata)
# NAs are removed here; it is up to you how to count them
mydata[, N_u := length(unique(grade[!is.na(grade)])), by = id]
Very short, readable and fast. It can also be done in base R:
# split(grade, id) splits the data into per-id subsets,
# rep() repeats each group's count once per row of the group,
# and unlist() flattens the per-group vectors into one vector
mydata$N <- unlist(lapply(split(mydata$grade, mydata$id), function(x) {
  rep(length(unique(x[!is.na(x)])), length(x))
}))
Because there was discussion on what is faster, let's do some benchmarking.
On the given dataset:
> test1
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030 100 b
uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809 3.985393 100 a
Larger dataset (10,000 observations, 1,000 ids):
> test2
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 11.84123 24.47122 37.09234 30.34923 47.55632 97.63648 100 a
uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743 100 b
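For reproducibility, a sketch of how such a benchmark can be set up (assumes the microbenchmark package; the timings above are from the original poster's machine):
library(microbenchmark)
library(data.table)
set.seed(1)
big <- data.table(id    = sample(1000, 1e4, replace = TRUE),
                  grade = sample(c(letters[1:5], NA), 1e4, replace = TRUE))
microbenchmark(
  length_unique = big[, n := length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = big[, n := uniqueN(grade, na.rm = TRUE), by = id]
)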
A dplyr option that makes use of dplyr::n_distinct and its na.rm-argument:
library(dplyr)
mydata %>% group_by(id) %>% mutate(N = n_distinct(grade, na.rm = TRUE))
#Source: local data frame [18 x 4]
#Groups: id [4]
#
# id sex grade N
# (dbl) (dbl) (fctr) (int)
#1 1 1 a 5
#2 1 1 b 5
#3 1 1 c 5
#4 1 1 d 5
#5 1 1 e 5
#6 2 0 x 2
#7 2 0 y 2
#8 2 0 y 2
#9 2 0 x 2
#10 3 0 q 1
#11 3 0 q 1
#12 3 0 q 1
#13 3 0 q 1
#14 4 1 a 2
#15 4 1 a 2
#16 4 1 a 2
#17 4 1 NA 2
#18 4 1 b 2
Looks like we have several votes for data.table, but you could also use the base R function ave():
# ave() returns a vector of the same type as its input (character here),
# so convert back to numeric
mydata$N <- as.numeric(ave(as.character(mydata$grade), mydata$id,
                           FUN = function(x) length(unique(x[!is.na(x)]))))
Use tapply() and a lookup table:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q",
"q","q","q", "a", "a", "a", NA, "b"))
uniqN <- tapply(mydata$grade, mydata$id, function(x) sum(!is.na(unique(x))))
mydata$N <- uniqN[mydata$id]
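One caveat: uniqN[mydata$id] indexes by position, which only works because the ids here happen to be 1 to 4. A sketch of the safer name-based lookup for arbitrary ids:
# tapply() names its result by the grouping values, so character
# indexing matches ids even when they are not 1..n
mydata$N <- uniqN[as.character(mydata$id)]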
Here is a dplyr method. I kept the summary table separate for tidy reasons.
library(dplyr)
summary =
mydata %>%
distinct(id, grade) %>%
filter(grade %>% is.na %>% `!`) %>%
count(id)
mydata %>%
left_join(summary)
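Note that count() names its output column n; to get the desired N directly, a sketch (count()'s name argument needs dplyr >= 0.8):
summary <- mydata %>%
  distinct(id, grade) %>%
  filter(!is.na(grade)) %>%
  count(id, name = "N")
mydata %>% left_join(summary, by = "id")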