Split string without losing characters - R

I have two columns in a much larger dataframe that I am having difficulty splitting. I have used strsplit in the past when I was trying to split on a space, "," or some other delimiter. The hard part here is that I don't want to lose any information, and when I split some entries I will end up with missing values. I would like to end up with four columns in the end. Here's a sample of a couple of rows of what I have now.
age-gen surv-camp
45M 1LC
9F 0
12M 1AC
67M 1LC
Here is what I would like to ultimately get.
age gen surv camp
45 M 1 LC
9 F 0
12 M 1 AC
67 M 1 LC
I've done quite a lot of hunting around on here and have found a number of responses in Java, C++, HTML, etc., but I haven't found anything that explains how to do this in R when you have missing data.
I saw a post about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, or 2) when I don't have consistent numeric or character values in each row.

We loop through the columns of 'df1' (lapply(df1, ..)), create a delimiter after the numeric substring using sub, read each vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.
res <- do.call(cbind, lapply(df1, function(x)
  read.table(text = sub("(\\d+)", "\\1,", x),
             header = FALSE, sep = ",", stringsAsFactors = FALSE)))
colnames(res) <- scan(text=names(df1), sep=".", what="", quiet = TRUE)
res
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0
#3 12 M 1 AC
#4 67 M 1 LC
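To see what the intermediate sub() step produces, here is a quick check on the first column (just an illustration of where the comma delimiter ends up):
sub("(\\d+)", "\\1,", c("45M", "9F", "12M", "67M"))
# [1] "45,M" "9,F"  "12,M" "67,M"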
Or using separate from tidyr
library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert= TRUE) %>%
separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0 <NA>
#3 12 M 1 AC
#4 67 M 1 LC
Or, as @Frank mentioned, we can use tstrsplit from data.table
library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x)
  tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl = TRUE,
            type.convert = TRUE)), recursive = FALSE)]
EDIT: Added the convert = TRUE in separate to change the type of columns after the split.
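For intuition, the zero-width lookaround pattern used above can also be tested directly with base strsplit (a small sketch; an entry with no trailing letters, like "0", is simply left whole):
strsplit(c("45M", "0"), "(?<=[0-9])(?=[A-Za-z])", perl = TRUE)
# [[1]]
# [1] "45" "M"
#
# [[2]]
# [1] "0"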
data
df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC",
"0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"),
class = "data.frame", row.names = c(NA, -4L))

Related

How to "translate" variables in one data frame using a second data frame as a key?

I have a data frame with two string variables, and would like to convert them to numeric values using a separate "key" data frame. The example below is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a=1, b=2, etc.
Example:
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters, 1:26)
I need to reference the first element of V1 against the key, replace it with the corresponding value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then, when done with V1, move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck because I can't see a way to pass the sub() function more than a 1:1 comparison. Is there a different function I should be using?
Forgive me- I'm sure the solution must be simple but I'm quite new to R still.
Here are two approaches with base R:
using sapply()
x[] <- with(key, sapply(x, function(v) values[match(v,letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v,letters)])))
using as.matrix (similar to the unlist() approach by @Ronak Shah)
x[] <- with(key, values[match(as.matrix(x),letters)])
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
apply(x, 2, function(x) key[x]$value)
>
V1 V2
1 y a
2 d u
3 g u
4 a j
5 b v
6 w n
7 k j
8 n g
9 r i
10 s o
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
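With newer dplyr versions (1.0 or later), the same mapping can also be written with across() instead of the superseded mutate_all (just an equivalent sketch):
x %>% mutate(across(everything(), ~ key$values[match(.x, key$letters)]))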
data
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could use apply with both row and column margins, e.g. as.data.frame(apply(x, c(1,2), function(l) key[key$letters == l, c(2)])).
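For what it's worth, the same match() idea can also be phrased as a named-vector lookup, which some find easier to read (a small sketch, starting from the original x and the key with letters/values columns defined above):
lookup <- setNames(key$values, key$letters)  # named vector: a = 1, b = 2, ...
x[] <- lapply(x, function(v) unname(lookup[as.character(v)]))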

String manipulation in r where contents of interest are in a different order

I have a dataframe where I am attempting to extract the content of a column and then append it to the dataframe as a new column.
For example my dataframe looks like:
> head(df)
id event_params
1 {"type":"L","maximumangle":-87.618,"duration":25}
2 {"type":"L","maximumangle":1.62,"duration":25}
3 {"maximumangle":-29.661,"type":"L","duration":20}
I wish to extract the maximum angle and then append it to the existing dataframe as a new column titled maximumangle. My initial thought was to use the grep function. However, since maximumangle does not appear in the same position in each row, this will not work.
What can I do to achieve what I want?
1) Parse the last column using fromJSON in the rjson package. This adds all the JSON data.
library(rjson)
L <- lapply(as.character(DF$event_params), fromJSON)
cbind(DF, do.call("rbind", lapply(L, as.data.frame, stringsAsFactors = FALSE)))
giving:
id event_params type maximumangle duration
1 1 {"type":"L","maximumangle":-87.618,"duration":25} L -87.618 25
2 2 {"type":"L","maximumangle":1.62,"duration":25} L 1.620 25
3 3 {"maximumangle":-29.661,"type":"L","duration":20} L -29.661 20
2) We can simplify this slightly if you really only need maximumangle:
maximumangle <- function(x) fromJSON(as.character(x))$maximumangle
transform(DF, maximumangle = sapply(DF$event_params, maximumangle, USE.NAMES = FALSE))
giving:
id event_params maximumangle
1 1 {"type":"L","maximumangle":-87.618,"duration":25} -87.618
2 2 {"type":"L","maximumangle":1.62,"duration":25} 1.620
3 3 {"maximumangle":-29.661,"type":"L","duration":20} -29.661
Note
We assumed that the input in reproducible form is given by:
Lines <- '
id event_params
1 {"type":"L","maximumangle":-87.618,"duration":25}
2 {"type":"L","maximumangle":1.62,"duration":25}
3 {"maximumangle":-29.661,"type":"L","duration":20}'
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
1) We can use str_extract from stringr with a regex lookaround to match the string 'maximumangle' followed by a quote (") and a colon (:), and extract the pattern that follows it, i.e. zero or more minus signs (-*) followed by digits and dots ([0-9.]+)
library(dplyr)
library(stringr)
df %>%
mutate(maximumangle = as.numeric(str_extract(event_params,
'(?<=maximumangle":)-*[0-9.]+')))
# id event_params maximumangle
#1 1 {"type":"L","maximumangle":-87.618,"duration":25} -87.618
#2 2 {"type":"L","maximumangle":1.62,"duration":25} 1.620
#3 3 {"maximumangle":-29.661,"type":"L","duration":20} -29.661
2) Or the same can be done with base R using regexpr/regmatches
df$maximumangle <- as.numeric(regmatches(df$event_params,
regexpr('(?<=maximumangle":)-*[0-9.]+', df$event_params, perl = TRUE)))
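As a quick check of the lookbehind on a single string:
s <- '{"maximumangle":-29.661,"type":"L","duration":20}'
regmatches(s, regexpr('(?<=maximumangle":)-*[0-9.]+', s, perl = TRUE))
# [1] "-29.661"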
data
df <- structure(list(id = 1:3, event_params = c("{\"type\":\"L\",\"maximumangle\":-87.618,\"duration\":25}",
"{\"type\":\"L\",\"maximumangle\":1.62,\"duration\":25}", "{\"maximumangle\":-29.661,\"type\":\"L\",\"duration\":20}"
)), .Names = c("id", "event_params"), class = "data.frame", row.names = c(NA,
-3L))

replace for loop in r with function from apply family (large data sets)

I'm looking at a problem where we are trying to create time-series differences for a large number of pairs (40,000+), where each time series has 150+ daily points.
Each row represents a pair of two individuals we want to compare:
pairs = data.frame("number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A")
)
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
Each row represents the time series data for a particular individual:
ts = data.frame("name" = c("A","B","C","D"),
"day1" = c(10,12,54,13),
"day2" = c(2,8,47,29),
"day3" = c(1,5,14,36)
)
ts$name <- as.character(ts$name)
I have the following R code, where the goal is to create, for each pair of individuals (so 4 in my example), a new data frame that has the difference of their daily time series from the ts data frame. This works but is extremely slow when I try to run it on my real data set, where pairs has 40,000 rows and ts about 150 columns. Does anyone have an idea of how I could speed this up? I tried to use lapply but can't figure out how to create the difference and store it in a new time series. Thanks!!
diffs <- data.frame(matrix(ncol = ncol(ts)))
colnames(diffs) <- colnames(ts)
for (row in 1:nrow(pairs)) {
  row1 <- ts[ts$name == pairs[row, "name1"], ]
  row2 <- ts[ts$name == pairs[row, "name2"], ]
  difference <- rbind(row1, row2)
  difference[3, 1] <- pairs[row, "number"]
  difference[3, 2:ncol(difference)] <- difference[1, 2:ncol(difference)] - difference[2, 2:ncol(difference)]
  diffs <- rbind(diffs, difference[3, ])
}
A few remarks up front:
i) data.frame() has an argument stringsAsFactors, which you can set to FALSE, i.e.:
pairs = data.frame(
"number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A"),
stringsAsFactors = FALSE
)
ii) speeding up your code is not really a matter of replacing the for-loop with an apply, but rather a matter of data structures and processing efficiency. Relying on a package or function that loops internally in C++ rather than R, or writing C++ code yourself, will give you the biggest boost.
iii) I'll also provide a larger dummy example here, so that you and others can test and compare timings a little more easily:
# all combination of LETTERS, including identity pairs like A~A
pairs = cbind.data.frame(
"number" = seq(1, 676),
setNames(expand.grid(LETTERS, LETTERS), nm = c("name1", "name2"))
)
# expand.grid produces factor columns
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
ts = cbind.data.frame(
"name" = LETTERS,
matrix(sample.int(100, 150*26, replace = TRUE), ncol = 150),
stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])
iv) an improved version of your loop could then look like:
# initialize a full matrix (since the ID is a number too), allocating the necessary memory
diffs2 <- matrix(0, ncol = ncol(ts), nrow = nrow(pairs))
colnames(diffs2) <- colnames(ts)
# first column is given
diffs2[, 1] <- pairs$number
for (row in 1:nrow(pairs)) {
  row1 <- as.vector(as.matrix(ts[ts$name == pairs[row, "name1"], -1]))
  row2 <- as.vector(as.matrix(ts[ts$name == pairs[row, "name2"], -1]))
  diffs2[row, -1] <- row1 - row2
}
This is already several times faster than what you had, but it illustrates the awkwardness of keeping a time series in a data.frame; it should instead be an object of a class that allows working with the numeric data more directly and efficiently (there are several packages that offer time-series classes).
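To illustrate that point, here is a minimal sketch that keeps the daily values in a plain numeric matrix and indexes rows by name (using the dummy pairs and ts from above; a sketch of the idea rather than a benchmarked solution):
# numeric part of ts as a matrix, with the individual names as row names
m <- as.matrix(ts[, -1])
rownames(m) <- ts$name
# fully vectorised: subtract all name2 rows from all name1 rows in one go
diffs3 <- m[pairs$name1, , drop = FALSE] - m[pairs$name2, , drop = FALSE]
diffs3 <- cbind(number = pairs$number, diffs3)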
Now for an answer that is still fairly simple yet rather fast, using dplyr and tidyr:
# simple way of measuring time
start <- Sys.time()
xx <- tidyr::gather(ts, key = "day", value = "value", 2:151)
yy <- dplyr::left_join(pairs, xx, by = c("name1" = "name"))
zz <- dplyr::left_join(yy, xx, by = c("name2" = "name", "day" = "day"))
res <- dplyr::mutate(zz, diff = value.x - value.y)
end <- Sys.time()
duration <- end - start
duration
Time difference of 0.06700397 secs
You can also try the approaches from the previous two answers; it's clear that the mapply solution will be slow, and the data.table one isn't working fully yet and already looks slower and more complicated.
I have a data.table solution to help.
The idea is to switch to long format so we can use grouping operations (the equivalent of apply) and create permuted columns to make the pairs:
library(data.table)
name1idx <- unlist(lapply(pairs$name1, function(x) grep(x, ts$name)))
name2idx <- unlist(lapply(pairs$name2, function(x) grep(x, ts$name)))
plouf <- melt(setDT(ts), measure.vars = patterns("^day"), variable.name = "day")
plouf[, name1 := name[name1idx], by = day]
plouf[, value1 := value[name1idx], by = day]
plouf[, name2 := name[name2idx], by = day]
plouf[, value2 := value[name2idx], by = day]
plouf[, diff := value1 - value2]
plouf[, .(day, diff), by = .(name1, name2)]
name1 name2 day diff
1: A B day1 -2
2: A B day2 -6
3: A B day3 -4
4: B D day1 -1
5: B D day2 -21
6: B D day3 -31
7: C D day1 41
8: C D day2 18
9: C D day3 -22
10: D A day1 3
11: D A day2 27
12: D A day3 35
name1idx and name2idx are the indices of ts$name corresponding to pairs$name1 and pairs$name2, so you can build all the pairs this way.
I was looking for a solution in which the use of column names is dynamic and no column name other than name is used. mapply, dplyr and reshape2 have been used for this solution.
# library(reshape2)
library(dplyr)
# A function which filters the rows for a given pair and returns their difference
matchPair <- function(x, y){
  matchedRow <- ts %>%
    filter(name == x | name == y) %>%
    select(-name)
  data.frame(diff(as.matrix(matchedRow))) %>%
    mutate(name = paste0(x, '~', y))
}
df.r <- do.call(rbind, mapply(matchPair, pairs$name1, pairs$name2,
                              SIMPLIFY = FALSE))
# Row names are not meaningful, hence remove them.
row.names(df.r) <- NULL
#Result
#> df.r
# day1 day2 day3 name
#1 2 6 4 A~B
#2 1 21 31 B~D
#3 -41 -18 22 C~D
#4 3 27 35 D~A

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like the matches() helper in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p followed by one or more digits, then q and one or more digits, then c and one or more digits. The parentheses form capture groups, which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space, and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), and then read it in using read.dcf and convert the columns to numeric. This creates a matrix, whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
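As a quick check of the substitution used in #2:
gsub("\\D", " ", input)
# [1] " 3 10000 150" " 29 2990 98"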
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variable names names(df). If you want to retain only the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers>, you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() will return a vector of TRUE and FALSE values, indicating which elements of names(df) follow the structure you suggested. Afterwards you use the output of grepl() to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
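And if you just want the matching names themselves as a character vector, you can index with the grepl() result directly (using var_names_test from above):
var_names_test[grepl("p\\d+q\\d+c\\d", var_names_test)]
# [1] "p3q10000c150" "p29q2990c98"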

data.frame and splitting rows... not found a suitable solution for my data

I am struggling a bit with my tables. I am trying to split some variables (using R), but I am having difficulties with one specific column.
My dataset is like this:
test<-data.frame(
Chrom_no=c(1,1,2,3),
Region=c('12..13','22..23','100','34..36'),
Ref=c('AT','CG','A','AAA'),
Alt=c('TA','GA','T','CGG'),
Prob=c(99,98.7,99,99.9))
I want to separate all the regions that are grouped together. So far, I have managed to do this for all the columns except the 'Region' one:
ref2 <- strsplit(as.character(test$Ref), '')
alt2<-strsplit(as.character(test$Alt), '')
test2<-data.frame(
Chrom_no=rep(test$Chrom_no, vapply(ref2, FUN=length, FUN.VALUE=integer(1))),
Region=rep(test$Region, vapply(ref2, FUN=length, FUN.VALUE=integer(1))),
Ref=unlist(ref2),
Alt=unlist(alt2),
Prob=rep(test$Prob, vapply(ref2, FUN=length, FUN.VALUE=integer(1))))
I don't know how to fix that column: e.g. for '12..13', 12 should go with Ref=A and 13 with Ref=T (first and second character, respectively). Things get complicated, as some entries span 3 positions (with a corresponding range such as 22..24), and some will span more.
How could I solve this? I have been looking for a solution for the last couple of days, but I am still not sure how to proceed. I apologize if this has already been solved somewhere else.
P.S.: I am aware that in order to strsplit on the 'Region' column I need to use '\\..' as the separator.
If I understand your end goal correctly, you can look into using the "data.table" package. With it, you can set up your problem like the following:
library(data.table)
## Change your data.frame to a data.table
DT <- as.data.table(test)
## Convert the relevant columns to be characters instead of factors
DT[, c("Region", "Ref", "Alt") := lapply(.SD, as.character),
.SDcols = c("Region", "Ref", "Alt")]
DT[, list(Chrom_no = rep(Chrom_no, nchar(Ref)), # Expand the Chrom_no
Region = unlist(lapply( # Split Region and use
strsplit(Region, "..", TRUE), # the result to create
function(x) { # the range of values
x <- as.numeric(x) # needed
if (length(x) > 1) seq(x[1], x[2]) else x
})),
Ref = unlist(strsplit(Ref, "")), # Split Ref
Alt = unlist(strsplit(Alt, "")), # Split Alt
Prob = rep(Prob, nchar(Ref)))] # Expand Prob
# Chrom_no Region Ref Alt Prob
# 1: 1 12 A T 99.0
# 2: 1 13 T A 99.0
# 3: 1 22 C G 98.7
# 4: 1 23 G A 98.7
# 5: 2 100 A T 99.0
# 6: 3 34 A C 99.9
# 7: 3 35 A G 99.9
# 8: 3 36 A G 99.9
The above code can probably be streamlined a bit, but I thought this should be enough to get you started.
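As a small aside, the Region split used in the code above can be checked on its own; with fixed = TRUE the dots don't need escaping:
unlist(strsplit("12..13", "..", fixed = TRUE))
# [1] "12" "13"
seq(12, 13)
# [1] 12 13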
