Convert a string to data frame, including column names - r

I have a string whose structure and length can vary, for example:
Input:
X <- ("A=12&B=15&C=15")
Y <- ("A=12&B=15&C=15&D=32&E=53")
I would like to convert each string to a data frame.
Output Expected:
Dataframe X
A B C
12 15 15
and Dataframe Y
A B C D E
12 15 15 32 53
What I tried was this:
X <- as.data.frame(strsplit(X, split="&"))
But this didn't work for me: it created only one column, and the column name was mangled.
P.S.: I cannot hard-code the column names because they can vary. At any given time a string will contain only one row.

One option is to extract the numeric part of the string and read it with read.table. The pattern [^0-9]+ matches one or more characters that are not digits; the first gsub replaces each such run with a space, and read.table reads the result. The column names are passed through the col.names argument: the second gsub replaces every run of characters that are not upper-case letters with a space, and scan splits what remains into a character vector.
f1 <- function(str1){
read.table(text=gsub("[^0-9]+", " ", str1),
col.names = scan(text=trimws(gsub("[^A-Z]+", " ", str1)),
what = "", sep=" ", quiet=TRUE))
}
f1(X)
# A B C
#1 12 15 15
f1(Y)
# A B C D E
#1 12 15 15 32 53
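To see what the two gsub calls inside f1 produce before read.table and scan take over, the intermediate strings can be printed on their own. This is only an illustrative sketch; the names values_txt and names_txt are made up here and do not appear in the answer.

```r
X <- "A=12&B=15&C=15"

# First gsub: every run of non-digits becomes a single space
values_txt <- gsub("[^0-9]+", " ", X)

# Second gsub: every run of non-upper-case characters becomes a space,
# then trimws removes the trailing space
names_txt <- trimws(gsub("[^A-Z]+", " ", X))

print(values_txt)  # " 12 15 15"
print(names_txt)   # "A B C"
```

Note that this approach assumes the keys consist only of upper-case letters and the values only of digits.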

You can try this too:
library(stringr)
res <- str_match_all(X, "([A-Z]+)=([0-9]+)")[[1]]
df <- as.data.frame(matrix(as.integer(res[,3]), nrow=1))
names(df) <- res[,2]
df
A B C
1 12 15 15
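A base R variant of the same idea, without extra packages, could split first on & and then on =. This is a minimal sketch under the assumption that every piece has the form key=value; parse_kv is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: turn "A=12&B=15" into a one-row data frame
parse_kv <- function(s) {
  # Split into "key=value" pieces, then split each piece at "="
  pairs <- strsplit(strsplit(s, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys <- vapply(pairs, `[`, character(1), 1)
  vals <- vapply(pairs, `[`, character(1), 2)
  # type.convert turns "12" into 12L; each value becomes its own column
  as.data.frame(setNames(as.list(type.convert(vals, as.is = TRUE)), keys))
}

X <- "A=12&B=15&C=15"
Y <- "A=12&B=15&C=15&D=32&E=53"
parse_kv(X)
parse_kv(Y)
```

Because the keys are taken from the string itself, nothing about the column names is hard-coded.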

Related

How can I make an array of two matrices with a vector?

Generate an array of two 3X4 matrices from a vector containing the elements 1 through 20. Fill by column.
Add suitable row names, column names and matrix names to the array.
Using the apply function, find the sum of the column elements in the array.
In one line of code, find the mean of each matrix in the array. Result should be a vector of two numbers, one for each matrix.
Above is what I'm trying to solve. What I've tried is the code below:
y <- array(1:20, dim = c(3,4,2))
print(y)
It seems like it works. How do I define fill by column here? Also, how can I add names using a single vector? Normally there should be at least two vectors like this,
result <- array(c(vector1, vector2), dim = c(3,3,2))
This is introductory R; the OP should read An Introduction to R, file R-intro.pdf, which comes with every installation of R.
To fill the matrices by column, nothing special needs to be done: R already fills arrays in column-major order.
Note that 3*4*2 == 24, and since only the 20 values 1:20 are supplied, the first 4 values are recycled to fill the last 4 cells.
y <- array(1:20, dim = c(3,4,2))
dimnames(y) <- list(paste("row", 1:3, sep = "_"),
paste("col", 1:4, sep = "_"),
paste("matrix", 1:2, sep = "_"))
y
#, , matrix_1
#
# col_1 col_2 col_3 col_4
#row_1 1 4 7 10
#row_2 2 5 8 11
#row_3 3 6 9 12
#
#, , matrix_2
#
# col_1 col_2 col_3 col_4
#row_1 13 16 19 2
#row_2 14 17 20 3
#row_3 15 18 1 4
Now the column sums of each matrix, i.e. along the 3rd dimension.
apply(y, 3, colSums)
# matrix_1 matrix_2
#col_1 6 42
#col_2 15 51
#col_3 24 40
#col_4 33 9
And the mean values of each matrix.
apply(y, 3, mean)
#matrix_1 matrix_2
# 6.50000 11.83333
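The column-major fill and the recycling of the first four values can be checked directly. The sketch below also shows colMeans with dims = 2 as a possible alternative one-liner for the per-matrix means; that alternative is an addition here, not part of the answer.

```r
y <- array(1:20, dim = c(3, 4, 2))

# Column-major fill: the first column of the first matrix holds 1, 2, 3
y[, 1, 1]

# Cells 21 to 24 are filled by recycling the start of the input vector
as.vector(y)[21:24]

# Mean over the first two dimensions, one value per matrix
colMeans(y, dims = 2)
```

Supplying 1:24 instead of 1:20 would fill the array without any recycling.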

Rename rownames

I would like to rename the rows by removing the part that is common to all row names
a b c
CDA_Part 1 4 4
CDZ_Part 3 4 4
CDX_Part 1 4 4
result
a b c
CDA 1 4 4
CDZ 3 4 4
CDX 1 4 4
1. Create a minimal reproducible example:
df <- data.frame(a = 1:3, b = 4:6)
rownames(df) <- c("CDA_Part", "CDZ_Part", "CDX_Part")
df
Returns:
a b
CDA_Part 1 4
CDZ_Part 2 5
CDX_Part 3 6
2. Suggested solution using base R's gsub:
rownames(df) <- gsub("_Part", "", rownames(df), fixed=TRUE)
df
Returns:
a b
CDA 1 4
CDZ 2 5
CDX 3 6
Explanation:
gsub finds and replaces parts of strings; by default the pattern is treated as a regular expression. The first three arguments are:
pattern the pattern to be replaced - i.e. "_Part"
replacement the string to be used as replacement - i.e. the empty string ""
x the string we want to replace something in - i.e. the rownames
An additional argument (not among the first three):
fixed indicates whether pattern is a regular expression (FALSE, the default) or "just" an ordinary string (TRUE) - i.e. here just a string
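The effect of fixed is easiest to see with a pattern that contains a regex metacharacter. The following sketch uses made-up inputs; the anchored variant with sub and "_Part$" is an extra illustration for the case where only a trailing "_Part" should be removed.

```r
nm <- c("CDA_Part", "CDZ_Part", "CDX_Part")

# Literal replacement, as in the suggested solution
plain <- gsub("_Part", "", nm, fixed = TRUE)

# Regex variant that only strips "_Part" at the end of the name
anchored <- sub("_Part$", "", c(nm, "My_Part_Part"))

# "." is literal with fixed = TRUE but matches any character as a regex
dot_fixed <- gsub(".", "", "a.b", fixed = TRUE)
dot_regex <- gsub(".", "", "a.b")

print(plain)      # "CDA" "CDZ" "CDX"
print(anchored)   # "CDA" "CDZ" "CDX" "My_Part"
print(dot_fixed)  # "ab"
print(dot_regex)  # ""
```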
Another approach is to use Reduce with intersect to determine the parts that are common to all row names. Note that I am assuming your data has the structure shown below, where an underscore separates two words. This solution works with both word_commonpart and commonpart_word, as in the example below.
Logic:
strsplit splits each row name at the underscore. A zero-width lookahead is used so that the underscore is not consumed by the split. Reduce with intersect then finds the pieces shared by all row names. These pieces are pasted together, joined with | into a single regex, and replaced with the empty string using gsub.
Input:
structure(list(a = 1:4, b = 4:7), class = "data.frame", row.names = c("CDA_Part",
"CDZ_Part", "CDX_Part", "Part_ABC"))
Solution:
red <- Reduce('intersect', strsplit(rownames(df), "(?=_)", perl = TRUE))
## 1. determine the parts common to all row names
e <- expand.grid(red, red)
## 2. build all combinations of the underscore and the remaining common parts
rownames(df) <- gsub(paste0(do.call('paste0', e[e$Var1 != e$Var2, ]), collapse = "|"), '', rownames(df))
## 3. keep only the combinations whose two parts differ and paste them together using do.call
## 4. use paste0 to build a regex separated by pipes
## 5. replace the common parts with nothing
Output:
> df
# a b
# CDA 1 4
# CDZ 2 5
# CDX 3 6
# ABC 4 7
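The same Reduce and intersect idea can be sketched without lookarounds by splitting on the underscore itself and dropping the shared tokens. This is a simplified variant under the assumption that row names are underscore-separated words; the variable names are illustrative.

```r
nm <- c("CDA_Part", "CDZ_Part", "CDX_Part", "Part_ABC")

# Split each name into its underscore-separated tokens
parts <- strsplit(nm, "_", fixed = TRUE)

# Tokens that occur in every name
common <- Reduce(intersect, parts)

# Drop the common tokens and glue whatever is left back together
cleaned <- vapply(parts,
                  function(p) paste(setdiff(p, common), collapse = "_"),
                  character(1))

print(common)   # "Part"
print(cleaned)  # "CDA" "CDZ" "CDX" "ABC"
```

The result could then be assigned with rownames(df) <- cleaned, provided the cleaned names are still unique.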

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like the matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), then read it in using read.dcf and convert the columns to numeric. This creates a matrix, whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variable names names(df). If you want to retain only the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers>, you could use the regex that Wiktor Stribiżew suggested in the comments like this:
valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars]
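grepl() returns a vector of TRUE and FALSE values indicating which elements of names(df) follow the structure you described. You then use that vector to subset your data frame. Note that this pattern is not anchored, so it also matches names that merely contain such a sequence; the anchored form "^p\\d+q\\d+c\\d+$" accepts only names that consist entirely of it.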
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
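The difference between an unanchored and an anchored pattern shows up as soon as a name merely contains the p#q#c# sequence. In the sketch below, "xp1q2c3_old" is a made-up name added for illustration; grep with value = TRUE returns the matching names themselves, which may be closer to the "add to a list" goal in the question.

```r
nms <- c("p3q10000c150", "p29q2990c98", "var1", "xp1q2c3_old")

# Unanchored: TRUE for any name that contains the sequence somewhere
loose <- grepl("p\\d+q\\d+c\\d", nms)

# Anchored: TRUE only when the whole name has the p#q#c# form
strict <- grepl("^p\\d+q\\d+c\\d+$", nms)

# Collect the matching names directly
matched <- grep("^p\\d+q\\d+c\\d+$", nms, value = TRUE)

print(loose)    # TRUE TRUE FALSE TRUE
print(strict)   # TRUE TRUE FALSE FALSE
print(matched)  # "p3q10000c150" "p29q2990c98"
```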

Split String without losing character- R

I have two columns in a much larger dataframe that I am having difficulty splitting. I have used strsplit in the past when I was trying to split using a "space", "," or some other delimiter. The hard part here is that I don't want to lose any information AND when I split some parts I will end up with missing information. I would like to end up with four columns in the end. Here's a sample of a couple of rows of what I have now.
age-gen surv-camp
45M 1LC
9F 0
12M 1AC
67M 1LC
Here is what I would like to ultimately get.
age gen surv camp
45 M 1 LC
9 F 0
12 M 1 AC
67 M 1 LC
I've done quite a lot of hunting around on here and have found a number of responses in Java, C++, html etc., but I haven't found anything that explains how to do this in R and when you have missing data.
I saw this about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, 2) when I don't have consistent numeric or character values in each row.
We loop through the columns of 'df1' (lapply(df1, ...)), insert a delimiter after the numeric substring using sub, read each vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.
res <- do.call(cbind, lapply(df1, function(x)
read.table(text=sub("(\\d+)", "\\1,", x),
header=FALSE, sep=",", stringsAsFactors=FALSE)))
colnames(res) <- scan(text=names(df1), sep=".", what="", quiet = TRUE)
res
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0
#3 12 M 1 AC
#4 67 M 1 LC
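The key step in the code above is the sub call, which turns each value into a two-field, comma-separated record. A small sketch with sample values from the question shows what read.table receives; the variable name delimited is illustrative.

```r
x <- c("45M", "9F", "1LC", "0")

# Insert a comma after the first run of digits; "\\1" is the captured number
delimited <- sub("(\\d+)", "\\1,", x)

print(delimited)  # "45,M" "9,F" "1,LC" "0,"
```

An entry such as "0" ends up as "0," with an empty second field, which is how the missing value survives the split.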
Or using separate from tidyr
library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert= TRUE) %>%
separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0 <NA>
#3 12 M 1 AC
#4 67 M 1 LC
Or as @Frank mentioned, we can use tstrsplit from data.table
library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x)
tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE,
type.convert=TRUE)), recursive = FALSE)]
EDIT: Added the convert = TRUE in separate to change the type of columns after the split.
data
df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC",
"0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"),
class = "data.frame", row.names = c(NA, -4L))
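For comparison, the same split can be sketched in base R with two sub calls per column, assuming every value is a run of digits optionally followed by letters. split_num_chr is a hypothetical helper written for this illustration.

```r
df1 <- data.frame(age.gen = c("45M", "9F", "12M", "67M"),
                  surv.camp = c("1LC", "0", "1AC", "1LC"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: leading digits go to 'num', the rest to 'chr'
split_num_chr <- function(x) {
  num <- as.integer(sub("^(\\d+).*$", "\\1", x))
  chr <- sub("^\\d+", "", x)
  chr[chr == ""] <- NA_character_  # nothing after the digits: missing
  data.frame(num = num, chr = chr, stringsAsFactors = FALSE)
}

res <- cbind(split_num_chr(df1$age.gen), split_num_chr(df1$surv.camp))
names(res) <- c("age", "gen", "surv", "camp")
res
```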

Merge multiple files after process into one data frame and assign file name to each colname

I have a list of files and I wrote a function to process each file and return two columns ("name" and "value").
file_list <- list.files(pattern=".txt")
sample_name <- sub (".*?lvl.(.*?).txt","\\1",file_list)
for (i in 1:length(file_list)){
x<- cleanMyData(file_list[i]) # this function returns a two column data
#then I want to merge all these processed data into one dataframe. Merge all "value" column based on the "name" column
# at the same time I want to put the file name in the corresponding column name. I already process the file name and put them into sample_name
}
To be more clear, here is my processed data for example:
file: apple.txt
name value
A 12
B 13
C 14
file: pear.txt
name value
A 15
B 14
C 20
D 21
Expected output:
Apple Pear
A 12 15
B 13 14
C 14 20
You could try
fns <- c("apple.txt", "pear.txt")
(df <-
Reduce(function(...) merge(..., all=F),
lapply(
seq(fns), function(x) {
read.table(fns[x],
header=TRUE,
col.names = c("name",
tools::file_path_sans_ext(fns)[x]))
})
)
)
# name apple pear
# 1 A 12 15
# 2 B 13 14
# 3 C 14 20
To capitalize the first characters, you could use something like
sub("\\b(\\w)", "\\U\\1", fns, perl=TRUE)
(see ?sub)
To get rid of the name column, you could use subset(df, select = -name).
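The same Reduce and merge pattern can be tried without any files on disk. In the sketch below the two data frames stand in for whatever cleanMyData returns, and the list names play the role of sample_name; both are assumptions made for illustration.

```r
# Stand-ins for the processed two-column results of each file
processed <- list(
  Apple = data.frame(name = c("A", "B", "C"), value = c(12, 13, 14)),
  Pear  = data.frame(name = c("A", "B", "C", "D"), value = c(15, 14, 20, 21))
)

# Rename each "value" column after its sample so merge keeps them apart
renamed <- Map(function(d, nm) setNames(d, c("name", nm)),
               processed, names(processed))

# Inner join on "name": only names present in every file are kept
merged <- Reduce(function(a, b) merge(a, b, by = "name"), renamed)

# all = TRUE keeps every name and fills the gaps with NA
merged_all <- Reduce(function(a, b) merge(a, b, by = "name", all = TRUE),
                     renamed)

merged
merged_all
```

With the inner join, row D is dropped because it appears only in the second data frame, which matches the expected output in the question.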
