Specifying column names with a defined structure - r

Assume I have a (user-defined covariance) matrix and I want to define the column names like this:
y <- matrix(rnorm(15*10),ncol=15)
colnames(y) <- c("Var1", "Cov12", "Cov13", "Cov14", "Cov15",
                 "Var2", "Cov23", "Cov24", "Cov25",
                 "Var3", "Cov34", "Cov35",
                 "Var4", "Cov45",
                 "Var5")
where each row contains the variances and covariances for a date t. I want a more general way to assign the column names as above, because I will not always have 5 variances. I tried something with rep and seq but didn't find a solution.

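Maybe not the most optimal way, but we can do the following (note that it uses the prefix "Var" for every name, including the covariance terms):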
n <- 5
paste0("Var", rep(1:n, n:1), unlist(sapply(2:n, function(x) c("",seq(x, n)))))
[1] "Var1" "Var12" "Var13" "Var14" "Var15" "Var2" "Var23" "Var24" "Var25" "Var3"
"Var34" "Var35" "Var4" "Var45" "Var5"
Breaking it down for better understanding
rep(1:n, n:1)
#[1] 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
unlist(sapply(2:n, function(x) c("",seq(x, n))))
#[1] "" "2" "3" "4" "5" "" "3" "4" "5" "" "4" "5" "" "5"
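We take these outputs and paste them element-wise with "Var" to get the column names. The second vector has 14 elements while the first has 15, so paste0 recycles its first element ("") for the final name, "Var5".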
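If the off-diagonal names should carry the "Cov" prefix as in the question, one possible variant builds the row and column indices of the upper triangle explicitly. This is only a sketch: cov_names is a hypothetical helper name, not part of the answer above, and for n of 10 or more the names become ambiguous (e.g. "Cov110") unless a separator is added.

```r
# Hypothetical helper: names for the upper triangle of an n x n covariance
# matrix, listed row by row ("Var" on the diagonal, "Cov" elsewhere).
cov_names <- function(n) {
  i <- rep(seq_len(n), times = n:1)                  # row index
  j <- unlist(lapply(seq_len(n), function(k) k:n))   # column index
  ifelse(i == j, paste0("Var", i), paste0("Cov", i, j))
}

y <- matrix(rnorm(15 * 10), ncol = 15)
colnames(y) <- cov_names(5)
colnames(y)
```

Because the indices are generated from n, the same call works for any number of variances, as long as the matrix has n * (n + 1) / 2 columns.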

Related

Max or min of a vector with strings containing only numbers in r. Precedence of 'stringed' numbers in a vector

I have a vector containing numbers in quotes (so they are actually strings), and I am trying to figure out the max and min of the vector. For example, in the vector x <- c("5", "12", "7"), according to R, max(x) is 7 and min(x) is 12. In this other vector, y <- c("1","12","13","14","15","10","38","19", "60"), max(y) is 60 and min(y) is 1. There seems to be a contradiction here. I have tried this several times and each time I get weird and even more contradictory results. Also, sort(x) gives "12", "5", "7" as the result. This doesn't make sense either. Could someone help explain what's happening? Thanks!
Strings are sorted alphabetically. We can verify the consistency of the ordering you observe by changing the strings of numbers to strings of letters, with 0 being a, 1 being b, etc.:
x <- c("5", "12", "7")
y <- c("1","12","13","14","15","10","38","19", "60")
digit_to_letter = function(x) {
  x = strsplit(x, "")
  lets = lapply(x, function(d) letters[as.integer(d) + 1])
  sapply(lets, paste, collapse = "")
}
Binding the original number strings to their "equivalent" letters and then sorting, we can see that the ordering you observed is the same as the familiar alphabetical ordering, just applied to numbers. Similarly, the min and max are consistent with the alphabetical ordering. You may have noticed similar ordering, for example, in a directory on your computer if you have files with names that start with numbers.
x_example = cbind(x, digit_to_letter(x))
y_example = cbind(y, digit_to_letter(y))
x_example[order(x), ]
# x
# [1,] "12" "bc"
# [2,] "5" "f"
# [3,] "7" "h"
y_example[order(y), ]
# y
# [1,] "1" "b"
# [2,] "10" "ba"
# [3,] "12" "bc"
# [4,] "13" "bd"
# [5,] "14" "be"
# [6,] "15" "bf"
# [7,] "19" "bj"
# [8,] "38" "di"
# [9,] "60" "ga"
If you want to use numbers as numbers, use as.numeric() or as.integer() to convert your number strings to a more appropriate class.
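As a minimal sketch of that conversion, using the vector from the question (x_num is just an illustrative name):

```r
x <- c("5", "12", "7")

max(x)                   # compared as strings, character by character

x_num <- as.numeric(x)   # convert the strings to numbers
max(x_num)               # compared as numbers
min(x_num)
sort(x_num)
```

After the conversion, min, max and sort follow numeric order rather than alphabetical order.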
One final example to illustrate a bit better:
z = as.character(c(1, 2, 10, 12, 100, 101, 121, 1000, 9))
cbind(z, digit_to_letter(z))[order(z), ]
# z
# [1,] "1" "b"
# [2,] "10" "ba"
# [3,] "100" "baa"
# [4,] "1000" "baaa"
# [5,] "101" "bab"
# [6,] "12" "bc"
# [7,] "121" "bcb"
# [8,] "2" "c"
# [9,] "9" "j"
In your case you are working with strings.
To address your specific problem you can use the destring() function available in taRifx package.
See the code below:
x <- c("5", "12", "7")
install.packages("taRifx")
library(taRifx)
y <- destring(x)
sort(y)
This will destring the values and now when you ask for:
min(y) will give you 5
max(y) will give you 12

Convert column values separated by comma to a numeric vector in R

I have a data frame "dfx" like the one below. I need to convert the values in "COUNTY_ID" to a vector to pass to a function.
dfx:
STATE COUNTY_ID
KS 15,21,33,101
OH 133,51,12
TX 15,21,37,51,65
I have converted the STATE to a vector like below:
st = as.vector(as.character(dfx$STATE))
But, I need to convert each row in "COUNTY_ID" column to a number/numeric vector. For example, c(15,21,33,101)
How can I achieve this in R?
Any help is appreciated.
cty_id <- lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)
DOES NOT work:
mclapply(cty_id[1], FUN = each_cty, st = st[1], mc.cores = detectCores() - 1)
DOES work:
mclapply(c(15,21,33,101), FUN = each_cty, st = st[1], mc.cores = detectCores() - 1)
Is this what you're after?
strsplit(as.character(dfx$COUNTY_ID), ",")
#[[1]]
#[1] "15" "21" "33" "101"
#
#[[2]]
#[1] "133" "51" "12"
#
#[[3]]
#[1] "15" "21" "37" "51" "65"
Explanation: strsplit(..., ",") splits every entry based on ",", and stores the result in a list of character vectors.
Or to get a list of numeric vectors:
lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)
#[[1]]
#[1] 15 21 33 101
#
#[[2]]
#[1] 133 51 12
#
#[[3]]
#[1] 15 21 37 51 65
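One likely reason the mclapply call in the question behaves differently with cty_id[1]: single brackets return a list of length one, so the function is applied once to the whole vector, whereas double brackets return the numeric vector itself. A small sketch, with dfx rebuilt by hand from the example data as an assumption:

```r
# Example data rebuilt from the question (an assumption about its exact form)
dfx <- data.frame(
  STATE = c("KS", "OH", "TX"),
  COUNTY_ID = c("15,21,33,101", "133,51,12", "15,21,37,51,65"),
  stringsAsFactors = FALSE
)

cty_id <- lapply(strsplit(as.character(dfx$COUNTY_ID), ","), as.numeric)

cty_id[1]     # a list of length 1 that holds the vector
cty_id[[1]]   # the numeric vector itself

identical(cty_id[[1]], c(15, 21, 33, 101))
```

Passing cty_id[[1]] to an lapply-style function iterates over the individual county ids, which matches the call that works in the question.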
How do you want to handle situations like the one in your example data, when KS has four distinct values of county_id, but OH has only three? If you seek to get one column per county_id, and you're ok with missing values in some of the cells, then the easiest thing is to use stringr::str_split_fixed().
> result <- stringr::str_split_fixed(dfx$COUNTY_ID, ",", n=5)
> result
[,1] [,2] [,3] [,4] [,5]
[1,] "15" "21" "33" "101" ""
[2,] "133" "51" "12" "" ""
[3,] "15" "21" "37" "51" "65"
Note that you need to know the maximum number of county_ids per row and pass it as the argument n above. You can be conservative and drop the all-empty columns later.
What you get out of this is a character matrix, with empty strings where a row has fewer values. You can then convert it to numeric as follows: class(result) <- 'numeric'; the empty strings become NA. After that, each row of the result matrix gives you the vector of interest. You may have to wrap it in na.omit() to be sure you only get numbers.
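If you would rather not depend on stringr or hard-code n, a base R sketch of the same idea is to pad every row with NA up to the longest row. This swaps in a different technique from str_split_fixed(); the county vector simply repeats the example values, and the other names are illustrative:

```r
county <- c("15,21,33,101", "133,51,12", "15,21,37,51,65")  # values from the question

parts <- strsplit(county, ",")
n_max <- max(lengths(parts))   # the widest row decides the number of columns

# Indexing past the end of a vector yields NA, which pads the short rows
mat <- t(vapply(parts, function(p) as.numeric(p[seq_len(n_max)]), numeric(n_max)))
mat

# Recover the vector for one row, dropping the padding
as.vector(na.omit(mat[2, ]))
```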

Change values from categorical to nominal in R

I want to change all the values in categorical columns by rank. Rank can be decided using the index of the sorted unique elements in the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries in the column replacing categorical values
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
  if(is.character(data[,i])){
    temp <- sort(unique(data[,i]))
    for(j in 1:nrow(data)){
      for(k in 1:length(temp)){
        if(data[j,i] == temp[k]){
          data[j,i] <- k}
      }
    }
  }
}
Please suggest a more efficient way to do this, if possible.
Thanks.
Here is a solution in base R. I create a helper function that converts each column to a factor using its sorted unique values as levels. This is similar to what you did, except I use as.integer to get the ranking values.
rank_fac <- function(col1)
  as.integer(factor(col1, levels = sort(unique(col1))))
Some data example:
dx <- data.frame(
col1= c("B2" ,"C4" ,"C5", "C1", "B5"),
col2=c("Verified" , "Source Verified", "Not Verified" , "Source Verified", "Source Verified")
)
Applying it without a for loop. It is better to use lapply here to avoid side effects.
data.frame(lapply(dx, rank_fac))
Results:
#   col1 col2
# 1    1    3
# 2    4    2
# 3    5    1
# 4    3    2
# 5    2    2
using data.table syntax-sugar
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
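A simpler solution, using only as.integer (this relies on the columns being factors, for example when the data frame was created with stringsAsFactors = TRUE):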
setDT(dx)[,lapply(.SD,as.integer)]
Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))
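A self-contained sketch of the match approach, restricted to character columns as in the question's loop. The column n is a hypothetical numeric column added only to show that it is left untouched:

```r
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified",
           "Source Verified", "Source Verified"),
  n = c(10, 20, 30, 40, 50),   # hypothetical numeric column
  stringsAsFactors = FALSE
)

# Only rewrite the character columns
is_chr <- vapply(dx, is.character, logical(1))
dx[is_chr] <- lapply(dx[is_chr], function(x) match(x, sort(unique(x))))
dx
```

match() does the lookup for a whole column at once, which avoids the nested loops over rows and levels.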

In R, compare two lists of strings, find parts of each element of the first list on the second

NOTE: I have updated the question to reflect specific patterns in the data.
Say that I have two vectors.
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1','A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')
I want to know how to get a vector, a data frame, a list, or whatever, that checks the levels vector and returns which levels of which variables were selected. Something that says:
A: 1, 3
B: low, high
D: (4.88e+03,9.18e+04]
Ultimately, there is a data frame X for which names_data = names(data) and levels_selected are some, but not all, of the levels in each of the variables. In the end what I want to do is to make a matrix (for, say for example, a random forest) using model.matrix where I want to include only the variables AND levels in levels_selected. Is there a straightforward way of doing so?
We can create a grouping variable ('grp') by keeping only the part of each element of 'levels_selected' that matches one of the 'names_data'. We then remove that prefix from each element and split the remaining substrings by 'grp' to get a list.
grp <- sub(paste0("^(", paste(names_data, collapse="|"), ").*"), "\\1", levels_selected)
value <- gsub(paste(names_data, collapse="|"), "",
levels_selected)
lst <- split(value, grp)
lst
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "x"
If we meant something like
library(qdapTools)
mtabulate(lst)
# 1 3 high low x
#A 1 1 0 0 0
#B 0 0 1 1 0
#D 0 0 0 0 1
Or another option is using strsplit
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
aggregate(V2~V1, d1, FUN= toString)
# V1 V2
#1 A 1, 3
#2 B low, high
#3 D x
and possibly the model.matrix would be
model.matrix(~V1+V2-1, d1)
Update
By using the OP's new example
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
split(d1$V2, d1$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
The first method also works with this example.
Update2
If there are no characters following the elements of 'names_data', we can filter those entries out
lst <- strsplit(levels_selected, paste0("(?<=(", paste(names_data,
collapse="|"), "))"), perl = TRUE)
d2 <- as.data.frame(do.call(rbind,lst[lengths(lst)==2]), stringsAsFactors=FALSE)
split(d2$V2, d2$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
An option that returns a list with levels as a vector stored under each corresponding name:
> setNames(lapply(names_data, function(x) gsub(x, "", levels_selected[grepl(x, levels_selected)])), names_data)
$A
[1] "1" "3"
$B
[1] "low" "high"
$C
character(0)
$D
[1] "x"
$E
character(0)
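Note that grepl(x, ...) and gsub(x, ...) match the name anywhere in the string. A variant that only looks at the start of each level, sketched here under the assumption that no name is a prefix of another name, can use startsWith() and substring(); sel is an illustrative name:

```r
names_data <- c("A", "B", "C", "D", "E", "F")
levels_selected <- c("A1", "A3", "Blow", "Bhigh", "D(4.88e+03,9.18+e+04]", "F")

# For each name, keep the levels that start with it and strip the prefix
sel <- lapply(setNames(names_data, names_data), function(nm) {
  hit <- levels_selected[startsWith(levels_selected, nm)]
  substring(hit, nchar(nm) + 1)
})

# Drop empty remainders (e.g. the bare "F") and names without any level
sel <- lapply(sel, function(v) v[nzchar(v)])
sel <- sel[lengths(sel) > 0]
sel
```

Because no regular expression is involved, characters such as "(" or "+" in the levels need no escaping.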
So this is a handy little function I extended from the regexpr help example, using perl-style regex
parseAll <- function(data, pattern) {
result <- gregexpr(pattern, data, perl = TRUE)
do.call(rbind,lapply(seq_along(data), function(i) {
if(any(result[[i]] == -1)) return("")
st <- data.frame(attr(result[[i]], "capture.start"))
le <- data.frame(attr(result[[i]], "capture.length") - 1)
mapply(function(start,leng) substring(data[i], start, start + leng), st, le)
}))
}
EDIT: It's extended because this one finds multiple matches of the pattern in each string, so you can look for, say, several patterns per line. A pattern like "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" (from the original regexpr help) then finds all instances of the pattern in each string, rather than just one.
Suppose I had data that looked like this:
dat <- c('A1','A2','A3','B3')
I could then search this data with the pattern 'A(?<A>.*)|B(?<B>.*)' to get a character matrix with the levels selected:
parseAll(dat,'A(?<A>.*)|B(?<B>.*)')
A B
[1,] "1" ""
[2,] "2" ""
[3,] "3" ""
[4,] "" "3"
This also shows which element had each level (though that may not be useful to you). I can generate these patterns programmatically from your vectors as well:
pattern <- paste(paste0(names_data,'(?<',names_data,'>.*)'),collapse = '|')
Your selected levels are then the unique non-empty elements of each column (the result is a matrix, so converting it to a list is easy enough).
This is my omnitool for this kind of thing; I hope it's handy.

Dataframe within dataframe?

Consider this example:
df <- data.frame(id=1:10,var1=LETTERS[1:10],var2=LETTERS[6:15])
fun.split <- function(x) tolower(as.character(x))
df$new.letters <- apply(df[ ,2:3],2,fun.split)
df$new.letters.var1
#NULL
colnames(df)
# [1] "id" "var1" "var2" "new.letters"
df$new.letters
# var1 var2
# [1,] "a" "f"
# [2,] "b" "g"
# [3,] "c" "h"
# [4,] "d" "i"
# [5,] "e" "j"
# [6,] "f" "k"
# [7,] "g" "l"
# [8,] "h" "m"
# [9,] "i" "n"
# [10,] "j" "o"
Would someone be so kind as to explain what is going on here? A new dataframe within a dataframe?
I expected this:
colnames(df)
# id var1 var2 new.letters.var1 new.letters.var2
The reason is that you assigned the 2-column matrix returned by apply to a single new column. So the result is a matrix stored in a single column. You can convert it back to a normal data.frame with
do.call(data.frame, df)
A more straightforward method is to assign two columns. I use lapply instead of apply because the columns may be of different classes: apply returns a matrix, so with mixed classes every column becomes 'character', whereas lapply returns a list and preserves each column's class.
df[paste0('new.letters.', names(df)[2:3])] <- lapply(df[2:3], fun.split)
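A short sketch of both points on a smaller version of the example: the assignment creates one column that holds a matrix, and do.call(data.frame, df) splits it into ordinary columns. The name flat is illustrative:

```r
df <- data.frame(id = 1:3, var1 = LETTERS[1:3], var2 = LETTERS[4:6])

# apply() returns a 3 x 2 character matrix, stored as ONE column of df
df$new.letters <- apply(df[, 2:3], 2, tolower)
ncol(df)
is.matrix(df$new.letters)

# Passing the columns to data.frame() expands the matrix into two columns
flat <- do.call(data.frame, df)
names(flat)
```

The expanded columns are named after the matrix column and its column names, which gives the new.letters.var1 and new.letters.var2 the question expected.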
@akrun solved 90% of my problem. But I had data.frames buried within data.frames, buried within data.frames and so on, without knowing the depth to which this was happening.
In this case, I thought sharing my recursive solution might be helpful to others searching this thread as I was:
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  # Recurse while any column is still a data.frame, keeping the recursive result
  if (any(vapply(y, is.data.frame, logical(1)))) y <- unnest_dataframes(y)
  y
}
new_data <- unnest_dataframes(df)
This approach sometimes has problems of its own, though. In that case it can help to separate all columns of class "data.frame" from the original data set, unnest them, and then cbind() everything back together, like so:
# Find all columns that are data.frame
# Assuming your data frame is stored in variable 'y'
data.frame.cols <- unname(vapply(y, is.data.frame, logical(1)))
z <- y[, !data.frame.cols]
# All columns of class "data.frame"
dfs <- y[, data.frame.cols]
# Recursively unnest each of these columns
unnest_dataframes <- function(x) {
  y <- do.call(data.frame, x)
  if (any(vapply(y, is.data.frame, logical(1)))) {
    # Keep the result of the recursive call; otherwise it is discarded
    y <- unnest_dataframes(y)
  } else {
    cat('Nested data.frames successfully unpacked\n')
  }
  y
}
df2 <- unnest_dataframes(dfs)
# Combine with original data
all_columns <- cbind(z, df2)
In this case R doesn't behave as one would expect, but if we dig deeper we can explain it. What is a data frame? As Norman Matloff says in his book (chapter 5):
a data frame is a list, with the components of that list being
equal-length vectors
The following code might be useful to understand.
class(df$new.letters)
[1] "matrix"
str(df)
'data.frame': 10 obs. of 4 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ var1 : Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
$ var2 : Factor w/ 10 levels "F","G","H","I",..: 1 2 3 4 5 6 7 8 9 10
$ new.letters: chr [1:10, 1:2] "a" "b" "c" "d" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "var1" "var2"
Maybe the reason why it looks strange is in the print methods. Consider this:
colnames(df$new.letters)
[1] "var1" "var2"
Maybe there is something in the print methods that combines the sub-names of objects and displays them all.
For example here the vectors that constitute the df are:
names(df)
[1] "id" "var1" "var2" "new.letters"
but in this case the component new.letters also has a dim attribute (in fact it is a matrix) whose column names are var1 and var2. See this code:
attributes(df$new.letters)
$dim
[1] 10 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "var1" "var2"
but when we print we see all of them as if they were separate vectors (and so separate columns of the data.frame!).
Edit: Print methods
Just out of curiosity, and to improve this answer, I looked at the methods of the print function:
methods(print)
The previous code produces a very long list of methods for the generic function print. Data frames have their own method in that list, print.data.frame. Since a data frame is a list underneath, I also looked at the method for listof:
getS3method("print", "listof")
function (x, ...)
{
nn <- names(x)
ll <- length(x)
if (length(nn) != ll)
nn <- paste("Component", seq.int(ll))
for (i in seq_len(ll)) {
cat(nn[i], ":\n")
print(x[[i]], ...)
cat("\n")
}
invisible(x)
}
<bytecode: 0x101afe1c8>
<environment: namespace:base>
Maybe I am wrong, but it seems to me that this code might contain useful information about why that happens, specifically where the condition if (length(nn) != ll) is checked.
