R - Print record in console like PSQL

Is there a way to print a record from a table in the R console so that it looks like PSQL output with \x enabled, i.e. printed vertically?
Something like this:
-Record1-
Var1 | xxx
Var2 | yyy
Var3 | zzz

kable from knitr will help you!
> df1
E F G H
1 A 0.9,1
2 B 0.98,0.34 0.98,0.34
3 C
> knitr::kable(df1)
|E |F |G |H |
|:--|:-----|:---------|:---------|
|A |0.9,1 | | |
|B | |0.98,0.34 |0.98,0.34 |
|C | | | |
>

If you gather the dataset using the tidyr package, it will basically be stored in that format.
If the original looks like..
Var1 Var2 Var3
Record1 xxx yyy zzz
The gathered dataset will look like...
Record1 Var1 xxx
Record1 Var2 yyy
Record1 Var3 zzz
Or perhaps even using str(YourData[x,]) will be sufficient if just for a quick check of a single record without needing to restructure the dataset.
Or, if you're set on the style in your example, this function is one (rough) way to approach it:
vertical_print_function <- function(data_vert, console = TRUE) {
  psql <- character()
  colnames_s <- colnames(data_vert)
  # pad the column names to a common width so the "|" separators line up
  var_names <- stringr::str_pad(colnames_s, width = max(nchar(colnames_s)))
  for (i in seq_len(nrow(data_vert))) {
    # header line for record i
    psql <- append(psql, paste0("\n\n-Record ", i, "-"))
    for (x in seq_len(ncol(data_vert))) {
      # one "name | value" line per column
      psql <- append(psql, paste0("\n", var_names[x], " | ", data_vert[i, colnames_s[x]]))
    }
  }
  if (console) {
    cat(psql, sep = "")
  } else {
    return(psql)
  }
}
With console = TRUE the result is printed immediately to the console (dangerous/pointless if there are many records). With console = FALSE the function just returns the character vector, which can then be exported to a text file.
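For comparison, here is a minimal base-R sketch of the same idea that needs no packages. The helper name print_vertical and the sample record are made up for illustration; it assumes every cell can be converted with as.character().

```r
# Hypothetical helper: print each row of a data frame vertically,
# roughly like psql with \x enabled.
print_vertical <- function(df) {
  # left-justify the column names to a common width
  labels <- formatC(names(df), width = max(nchar(names(df))), flag = "-")
  lines <- character()
  for (i in seq_len(nrow(df))) {
    lines <- c(lines, paste0("-Record ", i, "-"))
    # convert the single row to a character vector, one entry per column
    values <- vapply(df[i, , drop = FALSE], as.character, character(1))
    lines <- c(lines, paste(labels, values, sep = " | "))
  }
  cat(lines, sep = "\n")
  invisible(lines)
}

record <- data.frame(Var1 = "xxx", Var2 = "yyy", Var3 = "zzz")
out <- print_vertical(record)
```

The lines are returned invisibly so they can also be written to a file with writeLines() if needed.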

Related

Combining Grep and a For-Loop to Construct A Matrix (R)

I have a huge list of small data frames which I would like to meaningfully combine into one, however the logic around how to do so escapes me.
For instance, if I have a list of data frames that look something like this albeit with far more files, many of which I do not want in my data frame:
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv", "PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame. Which would look like this.
| |
AthosVersusAthos | PorthosVersusAthos | AramisVersusAthos
| |
------------------------------------------------------
| |
AthosVersusPorthos | PorthosVersusPorthos| AramisVersusPorthos
| |
------------------------------------------------------
| |
AthosVersusAramis | PorthosVersusAramis| AramisVersusAramis
| |
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
| Athos | Porthos | Aramis
-------|------------------------------------------------------
| 10 9 5 | |
Athos | 2 10 4 | |
| 3 0 10 | |
-------|------------------------------------------------------
| | |
Porthos | | |
| | |
-------|------------------------------------------------------
| | |
Aramis | | |
| | |
-------------------------------------------------------------
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for(i in 1:length(Musketeers)) {
for(j in 1:length(Musketeers)) {
CombinedMatrix <- cbind (
rbind(MyList[grep(paste0("^(", Musketeers[i],
")(?=.*Versus[", Musketeers[j], "]"), names(MyList),
value = T, perl=T)])
)
}
}
What I was trying to do was use my grep command (quite important given the number of files and the specificity with which I need to select them) together with rbind and cbind, so that the rows and the columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, and doing this once again for data frames starting with 'Porthos' and 'Aramis', and then combine those three columns, row-wise into a final dataframe.
I know I'm quite far off but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set, which I add below since it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
First make a reproducible example. Is it faithful? If so, I will add code to answer
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then the code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n){
re = paste0("^",n,"Versus")
sublist = MyList[grep(re, names(MyList))]
names(sublist) = sub(re, "", sub("\\.csv$","", names(sublist)))
# Make sure sublist is sorted correctly and contains info on all musketeers
sublist = sublist[Musketeers]
# Change row and col names so that they are unique in the final result
sublist = lapply(names(sublist), function(m) {
res = sublist[[m]]
rownames(res) = paste0(m,"_",rownames(res))
colnames(res) = paste0(n,"_",colnames(res))
res
})
do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm=Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
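As a cross-check on the block layout, the same matrix can also be assembled by looking each block up by its full name instead of using grep. This is only a sketch under the assumption that every "XVersusY.csv" entry exists in the list; the lower-case names (musketeers, blocks, columns, combined) are invented for this example.

```r
set.seed(1)
musketeers <- c("Athos", "Porthos", "Aramis")

# Model data: one 3x3 matrix per "FirstVersusSecond.csv" name
files <- paste0(as.vector(outer(musketeers, musketeers, paste, sep = "Versus")), ".csv")
blocks <- lapply(setNames(nm = files), function(f) matrix(rnorm(9), nrow = 3))

# One block column per first name, stacking the second names in order
columns <- lapply(musketeers, function(first) {
  do.call(rbind, lapply(musketeers, function(second) {
    blocks[[paste0(first, "Versus", second, ".csv")]]
  }))
})

# Bind the three block columns side by side
combined <- do.call(cbind, columns)
dim(combined)
```

Looking blocks up by name fails loudly (rbind receives NULL and the dimensions come out wrong) when a file is missing, so a check on dim(combined) is a cheap sanity test.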

Compare faster columns of different dataframes

Let's assume two dataframes: A and B containing data like the following one:
Dataframe: A Dataframe: B
ColA ColB1 ColB2
| Dog | | Lion | yes
| Lion | | Cat |
| Zebra | | Elephant |
| Bat | | Dog | yes
I want to compare the values of ColA with the values of ColB1 and insert yes in column ColB2 whenever there is a match. What I'm running is this:
for (i in 1:nrow(B)){
for (j in 1:nrow(A)){
if (B[i,1] == A[j,1]){
B[i,2] <- "yes"
}
}
}
In reality we're talking about 20000 lines. How could this be made faster?
You can use the %in% operator to determine membership:
B$ColB2 <- B$ColB1 %in% A$ColA
ColB2 will contain TRUE/FALSE depending on whether the value in ColB1 of data frame B was found in ColA of data frame A.
For more info see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.html
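If the literal "yes" marker from the question is wanted rather than TRUE/FALSE, the logical vector can be mapped with ifelse(). A small sketch using the sample values from the question (the empty string for non-matches is an assumption):

```r
A <- data.frame(ColA = c("Dog", "Lion", "Zebra", "Bat"), stringsAsFactors = FALSE)
B <- data.frame(ColB1 = c("Lion", "Cat", "Elephant", "Dog"), stringsAsFactors = FALSE)

# Vectorised membership test, mapped to "yes" / "" in one step
B$ColB2 <- ifelse(B$ColB1 %in% A$ColA, "yes", "")
B
```

This replaces the nested loop with a single vectorised call, which is what makes it fast for 20000 rows.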

Splitting strings and stacking them in one column

I've got a data frame with this structure:
> df
modifications
13-MOD:0057
13-MOD:0046
13-MOD:0051,13-MOD:0076
13-MOD:0036,13-MOD:0076,13-MOD:0016
13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125
13-MOD:0014 13-MOD:0156, 13-MOD:0956,13-MOD:0125...n
13-MOD:0012 ... n
To split the data I used this code:
df2 <- data.frame(str_split_fixed(df$modifications, ",", 20))
Basically, I get this data.
> df2
x1 | x2 | x3 | empty |
13-MOD:0057 | empty | empty | empty |
13-MOD:0046 | empty | empty | empty |
13-MOD:0051 | 13-MOD:0076 | empty | empty |
13-MOD:0036 | 13-MOD:0076 | 13-MOD:0016 | empty |
13-MOD:0256 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125
13-MOD:0014 | 13-MOD:0156 | 13-MOD:0956 | 13-MOD:0125 | ... n
13-MOD:0012 | ... | ...n
What I want is to remove the empty values and stack the data from columns X2, X3, X4 ... n under the first one, X1.
To do that I was using this:
df3 <- melt(setDT(df2), # set df to a data.table
measure.vars = list(c(1:20)), # set column groupings
value.name = 'V')[ # set output name scheme
, -1, with = F]
To remove the empty values:
df3[df3==""] <- NA
histo3 = subset(df3, V1 != 'NA')
But I don't know why I get an error about the length of the column in the melt function. Do you know of an easier way to do this?
Reproducible example:
df <- data.frame(modifications=c("UNIMOD:108,UNIMOD:108","UNIMOD:108","UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108,UNIMOD:108","UNIMOD:108"))
Could it be something like this?
library(stringr)
# input dataset
s <- c('13-MOD:0057', '13-MOD:0046', '13-MOD:0051,13-MOD:0076', '13-MOD:0036,13-MOD:0076,13-MOD:0016', '13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125')
s
[1] "13-MOD:0057"
[2] "13-MOD:0046"
[3] "13-MOD:0051,13-MOD:0076"
[4] "13-MOD:0036,13-MOD:0076,13-MOD:0016"
[5] "13-MOD:0256,13-MOD:0156,13-MOD:0956,13-MOD:0125"
# get the individual lengths
lengths <- sapply(str_split(s,','), function(x){ length(x) })
# create the dataframe splitting in N columns
as.data.frame(str_split_fixed(s, ',', max(lengths)))
V1 V2 V3 V4
1 13-MOD:0057
2 13-MOD:0046
3 13-MOD:0051 13-MOD:0076
4 13-MOD:0036 13-MOD:0076 13-MOD:0016
5 13-MOD:0256 13-MOD:0156 13-MOD:0956 13-MOD:0125
UPDATE 1
To stack all the non-empty cells into a single column
# create the dataframe splitting in N columns
first.matrix <- str_split_fixed(s, ',', max(lengths))
# select only the cells != ""
first.matrix[which(first.matrix!="")]
[1] "13-MOD:0057" "13-MOD:0046" "13-MOD:0051" "13-MOD:0036" "13-MOD:0256" "13-MOD:0076"
[7] "13-MOD:0076" "13-MOD:0156" "13-MOD:0016" "13-MOD:0956" "13-MOD:0125"
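If the intermediate wide table is not needed, a base-R sketch can go straight from the comma-separated strings to one stacked column. Note that, unlike the matrix approach above, this keeps the values in row order; the names parts and stacked and the shortened sample data are just for illustration.

```r
df <- data.frame(
  modifications = c("13-MOD:0057",
                    "13-MOD:0051,13-MOD:0076",
                    "13-MOD:0036,13-MOD:0076,13-MOD:0016"),
  stringsAsFactors = FALSE
)

# Split every string on commas and flatten the resulting list
parts <- unlist(strsplit(df$modifications, ",", fixed = TRUE))

# Drop stray whitespace and any empty pieces
parts <- trimws(parts)
stacked <- data.frame(V1 = parts[parts != ""], stringsAsFactors = FALSE)
stacked
```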

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions and I would like to evaluate their combinations, and then I'd like to get binary values for these logical values (True = 1, False = 0). The conditions themselves may change or grow as my project progresses, and so I'd like to have one place within the script where I can alter these conditional statements, while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
-id -- | -x -- | -y -- | -var_above2_belowpt6 -- | -var_above2_belowpt4 -- | etc.
1 | 11 | 0.5 | 1 | 0 |
2 | 4 | 0.9 | 0 | 0 |
3 | 8 | 0.11 | 1 | 1 |
etc. ....
Looks like the dreaded eval(parse()) does it (hard to think of a much easier way ...). Then use storage.mode()<- to convert from logical to integer ...
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
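Putting the question's setup and this answer together, the whole flow could look roughly like the sketch below. Converting res with as.data.frame() before the assignment is my own cautious choice, not something the answer requires.

```r
df <- data.frame(id = c(1, 2, 3, 4, 5), x = c(11, 4, 8, 9, 12),
                 y = c(0.5, 0.9, 0.11, 0.6, 0.5))

names1 <- c("above2", "above5")
conditions1 <- c("df$x > 2", "df$x > 5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")

# All pairwise combinations of names and of condition strings, in the same order
names_combinations <- as.vector(t(outer(names1, names2, paste, sep = "_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep = " & ")))

# Evaluate each condition string; the result is a logical matrix, one column per condition
res <- sapply(condition_combinations, function(x) eval(parse(text = x)))
storage.mode(res) <- "integer"

# Attach the 0/1 columns under readable names
df[paste0("var_", names_combinations)] <- as.data.frame(res)
df
```

Because eval(parse()) runs arbitrary text as code, the condition strings should only ever come from the script itself, never from user input.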

R data frame column comparison without for loops, complex case

Let's suppose I have a data frame with the following parameters
DATA <- data.frame(ROWID, ID1, NAME1, ...IDn, NAMEn)
Sample of what the data might look like:
ROWID | ID1 | NAME1 | ID2 | NAME2 | IDn | NAMEn
001 | 001 | FAS | 002 | MAS | 999 | ZOO
002 | 003 | BIN | 004 | DUN | 998 | SOO
Where I have 201 columns by 10k+ rows. What I would like to do is to reshape this data such that for each row in the original DATA, I produce a set of rows in a subsequent data frame. Each row would consist of the originating ROWID, IDa, NAMEa, IDb, NAMEb pairs such that the first is matched with all others (99 pairs containing ID1, 98 with ID2, and so on). This would occur for each row producing a large data frame of all possible combinations within rows for every row. The result would look like:
ROWID1 | ID1 | NAME1 | ID2 | NAME2
ROWID1 | ID1 | NAME1 | ID3 | NAME3
...
ROWID1 | ID2 | NAME2 | ID3 | NAME3
...
ROWID2 | ID1 | NAME1 | ID2 | NAME2
ROWID2 | ID1 | NAME1 | ID3 | NAME3
...
The code I produced to do this is as follows. It works great, but only on smaller data frames. The full data frame is painfully slow, and I am hoping to have alternatives to speed it up using functions or something else of which I am unaware. Thanks in advance!!
DATA <- data.frame(as described above)
META <- data.frame(ROWID=numeric(0),ID1=numeric(0),
BUS1=character(0),ID2=numeric(0),BUS2=character(0))
for (i in 1:length(DATA$ROWID)) {
SET <- data.frame(ROWID=numeric(0),ID1=numeric(0),
BUS1=character(0),ID2=numeric(0),BUS2=character(0))
ROWID <- DATA[i,1]
for (x in seq(3,ncol(DATA),2)) {
for (y in seq(x,ncol(DATA),2)) {
ID1 <- DATA[i,x-2]
BUS1 <- DATA[i,x]
ID2 <- DATA[i,y-2]
BUS2 <- DATA[i,y]
if (!is.na(BUS1) && !is.na(BUS2)) {
NEW <- cbind(ROWID, ID1, BUS1, ID2, BUS2)
SET <- rbind(SET, NEW)
}
}
}
META <- rbind(META, SET)
}
Here is my way to write it, which includes all three optimizations I wrote as comments. Also, be careful: your code had some bugs in addressing the columns, which I have hopefully also fixed.
require('compiler')
enableJIT(3)
DATA2 = as.matrix(DATA)
nPairs = (ncol(DATA2)-1)/2 # number of (ID, NAME) pairs per row
# you want a matrix instead of a data.frame, and you want to pre-allocate its size:
# each row of DATA2 yields choose(nPairs, 2) combinations
META2 <- matrix(character(), ncol=5, nrow=nrow(DATA2)*choose(nPairs, 2))
colnames(META2) = c("ROWID","ID1","BUS1","ID2","BUS2")
k=0
for (i in 1:nrow(DATA2)) {
  for (x in seq(3,ncol(DATA2)-2,2)) {
    for (y in seq(x+2,ncol(DATA2),2)) {
      k=k+1
      # no need to use temporary variables; x-1 / y-1 are the ID columns, x / y the names
      META2[k,] = c(DATA2[i,1],DATA2[i,x-1], DATA2[i,x], DATA2[i,y-1], DATA2[i,y])
    }
  }
}
META2 = as.data.frame(META2) # converting back to data.frame
# the IDs are numeric, the BUS columns hold the names and stay character
META2$ID1 = as.numeric(META2$ID1)
META2$ID2 = as.numeric(META2$ID2)
I will let you handle the case in which BUS1 or BUS2 is NA yourself: basically, you need to skip these lines (without incrementing the variable k), and after the loops you need to crop your matrix to remove the trailing empty rows.
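A different route, sketched here as an assumption-laden alternative rather than as part of the answer above, is to loop over the column pairs instead of over the rows, so each step handles all rows at once. It assumes the layout ROWID, ID1, NAME1, ID2, NAME2, ... and uses a tiny made-up DATA with three pairs; NA names are dropped inside each piece.

```r
DATA <- data.frame(ROWID = c(1, 2),
                   ID1 = c(1, 3), NAME1 = c("FAS", "BIN"),
                   ID2 = c(2, 4), NAME2 = c("MAS", "DUN"),
                   ID3 = c(999, 998), NAME3 = c("ZOO", "SOO"),
                   stringsAsFactors = FALSE)

n_pairs <- (ncol(DATA) - 1) / 2
combos <- combn(n_pairs, 2)  # each column is one (a, b) pair with a < b

pieces <- lapply(seq_len(ncol(combos)), function(k) {
  a <- combos[1, k]
  b <- combos[2, k]
  # pair p lives in columns 2p (ID) and 2p + 1 (NAME)
  out <- data.frame(ROWID = DATA$ROWID,
                    ID1 = DATA[[2 * a]], BUS1 = DATA[[2 * a + 1]],
                    ID2 = DATA[[2 * b]], BUS2 = DATA[[2 * b + 1]],
                    stringsAsFactors = FALSE)
  # keep only rows where both names are present
  out[!is.na(out$BUS1) & !is.na(out$BUS2), ]
})

META <- do.call(rbind, pieces)
META <- META[order(META$ROWID), ]  # group the combinations by original row
```

With 100 pairs this builds 4950 data frames of up to 10k rows each and binds them once at the end, which avoids the repeated rbind inside the innermost loop.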

Resources