Separate text to variables in R - r

I have in one column of the table this:
paragemcard-resp+insufcardioresp
dpco+pneumonia
posopperfulceragastrica+ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb+insuf.resp
dpco+dhca+#femur
posde#subtroncantéricaesqª+complicepidural
dpco+asma
And I want to separate them like this:
paragemcard-resp insufcardioresp
dpco pneumonia
posopperfulceragastrica ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb insuf.resp
dpco dhca #femur
posde#subtroncantéricaesqª complicepidural
dpco asma
But the problem is that they don't all have the same number of parts.
As you can see, line 3 has 2 variables and line 6 has 3.
And I want to create these separated values in the same table for further analysis.
Thanks

You can use read.table, but you should use count.fields or some kind of regex to figure out the correct number of columns first. Using Robert's "text" sample data:
Cols <- max(sapply(gregexpr("+", text, fixed = TRUE), length)) + 1
## Cols <- max(count.fields(textConnection(text), sep = "+"))
read.table(text = text, comment.char = "", header = FALSE,
           col.names = paste0("V", sequence(Cols)),
           fill = TRUE, sep = "+")
# V1 V2 V3
# 1 paragemcard-resp insufcardioresp
# 2 dpco pneumonia
# 3 posopperfulceragastrica ards
# 4 pos op hematoma #rim direito expontanea
# 5 miopatiaduchenne-erb insuf.resp
# 6 dpco dhca #femur
# 7 posde#subtroncantéricaesqª complicepidural
# 8 dpco asma
Also, possibly useful: the "stringi" library makes counting elements easy (as an alternative to the gregexpr step above).
library(stringi)
Cols <- max(stri_count_fixed(text, "+") + 1)
Why the need for the "Cols" step? read.table and family decide how many columns to use either by (1) the maximum number of fields detected within the first 5 rows of data or (2) the length of the col.names argument. In your example, the row with the most fields is the sixth one, so directly using read.csv or read.table would result in incorrectly wrapped data.
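To see the wrapping behaviour concretely, here is a minimal sketch with made-up data (not the question's data) where only the sixth row has three fields:

```r
# Six rows; only the last has 3 fields. read.table inspects the first 5 rows,
# decides on 2 columns, and (with fill = TRUE) wraps the extra field of row 6
# onto a new row. Supplying col.names with the true width fixes this.
text <- c("a+b", "a+b", "a+b", "a+b", "a+b", "a+b+c")
bad  <- read.table(text = text, sep = "+", fill = TRUE, header = FALSE)
good <- read.table(text = text, sep = "+", fill = TRUE, header = FALSE,
                   col.names = c("V1", "V2", "V3"))
nrow(bad)   # more than 6: row 6 got wrapped
nrow(good)  # 6, as intended
```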

You can use strsplit:
text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia",
          "posopperfulceragastrica+ards", "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp", "dpco+dhca+#femur",
          "posde#subtroncantéricaesqª+complicepidural", "dpco+asma")
strings <- strsplit(text, "+", fixed = TRUE)
maxlen <- max(sapply(strings, length))
strings <- lapply(strings, function(s) { length(s) <- maxlen; s })
strings <- data.frame(matrix(unlist(strings), ncol = maxlen, byrow = TRUE))
and it looks like
X1 X2 X3
1 paragemcard-resp insufcardioresp <NA>
2 dpco pneumonia <NA>
3 posopperfulceragastrica ards <NA>
4 pos op hematoma #rim direito expontanea <NA> <NA>
5 miopatiaduchenne-erb insuf.resp <NA>
6 dpco dhca #femur
7 posde#subtroncantéricaesqª complicepidural <NA>
8 dpco asma <NA>
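As another option (a sketch, not part of the answers above), stringr's str_split_fixed pads short rows automatically, so the manual `length<-` step is unnecessary. Note it pads with "" rather than NA:

```r
library(stringr)

# Small sample in the same shape as the question's data
text <- c("dpco+pneumonia", "dpco+dhca+#femur")

# n = 3 forces a 3-column character matrix; short rows are padded with ""
m  <- str_split_fixed(text, fixed("+"), n = 3)
df <- as.data.frame(m, stringsAsFactors = FALSE)
df[df == ""] <- NA   # convert the "" padding to NA to match the output above
```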

Related

Numbers stick together as characters

I have a dataset with measured values (txt file, whitespace separated) and some numbers stick together like this:
Currently all columns are of class "character", since after numeric conversion the stuck-together numbers became NA. I created a routine for negative numbers, which was easy so far:
library(readr)  # for read_table2

findandreplace <- function(file_name) {
  dat <- read_table2(file_name, col_names = FALSE)
  for (n in 0:9) {
    dat <- data.frame(lapply(dat, function(x) gsub(paste0(n, "-"), paste0(n, " -"), x)))
  }
  # save dat as txt and read it again
}
But now, I have no idea how to separate positive values. If you want you can use this MWE:
b = c("340.9","341","316.1","336.8316.39","378.8","315","386.57317.33",NA,NA)
a =c(1,2,3,4,5,6,7,8,9)
c = data.frame(a,b)
This is how it should be:
b = c("340.9","341","316.1","336.8","316.39","378.8","315","386.57", "317.33")
a =c(1,2,3,4,5,6,7,8,9)
c = data.frame(a,b)
x <- unlist(strsplit(gsub("(.*)(3(?>\\d{2}\\.))", "\\1 \\2", b, perl = TRUE), " "))
grep("\\d", x, value = TRUE)
[1] "340.9"  "341"    "316.1"  "336.8"  "316.39" "378.8"  "315"    "386.57" "317.33"
transform(c, b = grep("\\d", x, value = TRUE))
a b
1 1 340.9
2 2 341
3 3 316.1
4 4 336.8
5 5 316.39
6 6 378.8
7 7 315
8 8 386.57
9 9 317.33
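For clarity, here is a commented walk-through of that regex on the MWE (same assumption as the answer: every reading begins with "3" followed by two digits and a decimal point):

```r
b <- c("340.9", "341", "316.1", "336.8316.39", "378.8", "315",
       "386.57317.33", NA, NA)

# "(.*)(3(?>\\d{2}\\.))" captures everything before the LAST "3dd." group;
# the atomic group (?>...) stops the engine backtracking into those digits.
# The replacement re-inserts both captures with a space between them, so
# "336.8316.39" becomes "336.8 316.39"; values with no stuck pair pass through.
split_b <- gsub("(.*)(3(?>\\d{2}\\.))", "\\1 \\2", b, perl = TRUE)

# Split on the inserted space and keep only the number-shaped tokens
vals <- as.numeric(grep("\\d", unlist(strsplit(split_b, " ")), value = TRUE))
vals
# [1] 340.90 341.00 316.10 336.80 316.39 378.80 315.00 386.57 317.33
```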

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
  select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), and then read it in using read.dcf and convert the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
           data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i) {
  setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variables names names(df). If you want to only retain the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers> you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
grepl() will return a vector of TRUE and FALSE values, indicating which element in names(df) follows the structure you suggested. Afterwards you use the output of grepl() to subset your data frame.
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
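One caveat worth noting (a sketch, not from the answers above): the pattern as written is unanchored, so a name such as "p1q2c3x" would also match. Anchoring with ^ and $, as in the select example, restricts matches to names consisting only of the p/q/c structure:

```r
nms <- c("p3q10000c150", "V1", "p29q2990c98", "p1q2c3x")  # "p1q2c3x" is a made-up edge case

grepl("p\\d+q\\d+c\\d", nms)          # unanchored: also matches "p1q2c3x"
grep("^p\\d+q\\d+c\\d+$", nms, value = TRUE)  # anchored: only the exact structure
```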

Split String without losing character- R

I have two columns in a much larger dataframe that I am having difficulty splitting. I have used strsplit in the past when splitting on a space, comma, or some other delimiter. The hard part here is that I don't want to lose any information AND, after splitting, some rows will have missing pieces. I would like to end up with four columns. Here's a sample of a couple of rows of what I have now.
age-gen surv-camp
45M 1LC
9F 0
12M 1AC
67M 1LC
Here is what I would like to ultimately get.
age gen surv camp
45 M 1 LC
9 F 0
12 M 1 AC
67 M 1 LC
I've done quite a lot of hunting around on here and have found a number of responses in Java, C++, html etc., but I haven't found anything that explains how to do this in R and when you have missing data.
I saw this about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, 2) when I don't have consistent numeric or character values in each row.
We loop through the columns of 'df1' (lapply(df1, ..)), create a delimiter after the numeric substring using sub, read each vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.
res <- do.call(cbind, lapply(df1, function(x)
  read.table(text = sub("(\\d+)", "\\1,", x),
             header = FALSE, sep = ",", stringsAsFactors = FALSE)))
colnames(res) <- scan(text = names(df1), sep = ".", what = "", quiet = TRUE)
res
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0
#3 12 M 1 AC
#4 67 M 1 LC
Or using separate from tidyr
library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE) %>%
  separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0 <NA>
#3 12 M 1 AC
#4 67 M 1 LC
Or as #Frank mentioned, we can use tstrsplit from data.table
library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x)
  tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl = TRUE,
            type.convert = TRUE)), recursive = FALSE)]
EDIT: Added the convert = TRUE in separate to change the type of columns after the split.
data
df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"),
                      surv.camp = c("1LC", "0", "1AC", "1LC")),
                 .Names = c("age.gen", "surv.camp"),
                 class = "data.frame", row.names = c(NA, -4L))

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of solution simply splits each element of the interest_string factor in data object dat, using \\{ as the split indicator. This indicator has to be escaped and in R that requires two \. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
A nice and tidy data.table solution:
library(data.table)
DT <- data.table(read.table(textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header = TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
  list(interest = unlist(strsplit(interest_string, "{", fixed = TRUE)))
}, by = id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
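As a further option (a sketch not covered by the answers above): tidyr::separate_rows splits a column and replicates the id in one call. Because each code ends with a trailing {, the split produces an empty final piece, which is filtered out afterwards:

```r
library(tidyr)

dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", "<NA>", "ZT{"),
                  stringsAsFactors = FALSE)

# Split on "{" (escaped, since { is a regex metacharacter), one row per piece
res <- separate_rows(dat, interest_string, sep = "\\{")

# Drop the empty pieces produced by the trailing delimiter
res <- res[res$interest_string != "", ]
```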

Convert one row to multiple rows per subject in a data frame by splitting text data in R [duplicate]

This question already has answers here:
Split 1 Column into 2 Columns in a Dataframe [duplicate]
(1 answer)
Breaking up (melting) text data in a column in R?
(2 answers)
Closed 9 years ago.
I have a dataset with a patient identifier and a text field with a summary of medical findings (1 row per patient). I would like to create a dataset with multiple rows per patients by splitting the text field so that each sentence of the summary falls on a different line. Subsequently, I would like to text parse each line looking for certain keywords and negation terms. An example of the structure of the data frame is (the letters represent the sentences):
ID Summary
1 aaaaa. bb. c
2 d. eee. ff. g. h
3 i. j
4 k
I would like to split the text field at the “.” to convert it to:
ID Summary
1 aaaaa
1 bb
1 c
2 d
2 eee
2 ff
2 g
2 h
3 i
3 j
4 k
R code to create the initial data frame:
ID <- c(1, 2, 3, 4)
Summary <- c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k")
df <- data.frame(cbind(ID, Summary))
df$ID <- as.numeric(df$ID)
df$Summary <- as.character(df$Summary)
The following previous posting provides a nice solution:
Breaking up (melting) text data in a column in R?
I used the following code from that posting which works for this sample dataset:
dflong <- by(df, df$ID, FUN = function(x) {
  sentence = unlist(strsplit(x$Summary, "[.]"))
  data.frame(ID = x$ID, Summary = sentence)
})
dflong2 <- do.call(rbind, dflong)
However, when I try to apply to my larger dataset (>200,000 rows), I get the error message:
Error in data.frame(ID = x$ID, Summary = sentence) : arguments imply differing number of rows: 1, 0
I reduced the data frame down to test it on a smaller dataset and I still get this error message any time the number of rows is >57.
Is there another approach to take that can handle a larger number of rows? Any advice is appreciated. Thank you.
Use data.table:
library(data.table)
dt = data.table(df)
dt[, strsplit(Summary, ". ", fixed = T), by = ID]
# ID V1
# 1: 1 aaaaa
# 2: 1 bb
# 3: 1 c
# 4: 2 d
# 5: 2 eee
# 6: 2 ff
# 7: 2 g
# 8: 2 h
# 9: 3 i
#10: 3 j
#11: 4 k
There are many ways to address #agstudy's comment about empty Summary, but here's a fun one:
dt[, c(tmp = "", # doesn't matter what you put here, will delete in a sec
       # the point of having this is to force the size of the output table
       # which data.table will kindly fill with NA's for us
       Summary = strsplit(Summary, ". ", fixed = T)), by = ID][,
   tmp := NULL]
You get an error because for some rows you have no data ( summary column). Try this should work for you:
dflong <- by(df, df$ID, FUN = function(x) {
  sentence = unlist(strsplit(x$Summary, "[.]"))
  ## I just added this line to your solution
  if (length(sentence) == 0)
    sentence <- NA
  data.frame(ID = x$ID, Summary = sentence)
})
dflong2 <- do.call(rbind, dflong)
PS: This is slightly different from the data.table solution, which will remove rows where the summary equals '' (0 characters). That said, I would use a data.table solution here since you have more than 200,000 rows.
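A fully vectorized base-R alternative (a sketch, not from the answers above) avoids the by()/rbind loop entirely, which helps at 200,000+ rows: split once, then repeat each ID by the number of pieces using lengths():

```r
df <- data.frame(ID = c(1, 2, 3, 4),
                 Summary = c("aaaaa. bb. c", "d. eee. ff. g. h", "i. j", "k"),
                 stringsAsFactors = FALSE)

s <- strsplit(df$Summary, ". ", fixed = TRUE)
s[lengths(s) == 0] <- NA_character_   # guard rows with empty text, as above

dflong <- data.frame(ID      = rep(df$ID, lengths(s)),
                     Summary = unlist(s, use.names = FALSE))
```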