Compare columns of different dataframes faster - r

Let's assume two dataframes, A and B, containing data like the following:
Dataframe A:        Dataframe B:

| ColA  |           | ColB1    | ColB2 |
| Dog   |           | Lion     | yes   |
| Lion  |           | Cat      |       |
| Zebra |           | Elephant |       |
| Bat   |           | Dog      | yes   |
I want to compare the values of ColA to the values of ColB1 and insert "yes" into column ColB2 whenever there is a match. What I'm running is this:
for (i in 1:nrow(B)) {
  for (j in 1:nrow(A)) {
    if (B[i, 1] == A[j, 1]) {
      B[i, 2] <- "yes"
    }
  }
}
In reality we're talking about 20,000 rows. How could this be made faster?

You can use the %in% operator to determine membership:
B$ColB2 <- B$ColB1 %in% A$ColA
ColB2 will contain TRUE/FALSE depending on whether the value in ColB1 of data frame B was found in ColA of data frame A.
For more info see:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/match.html
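If you specifically want the literal "yes" from the question rather than TRUE/FALSE, a minimal variant (assuming NA is acceptable for non-matches) is:
B$ColB2 <- ifelse(B$ColB1 %in% A$ColA, "yes", NA)  # "yes" where matched, NA otherwise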

Related

Selecting row by min value and merging

This is a row selection problem in R.
I want to get only one row from each dataset, based on the minimum value of the variable fifteen. I have tried the two approaches below; neither of them returns the desired output.
For the first list element, SPX1[[1]], the data.frame is set up like this:
SPX1[[1]]
+---+---------+------------+---------+
|   | stkPx   | expirDate  | fifteen |
+---+---------+------------+---------+
| 1 | 1461.62 | 2013-01-19 | 2       |
| 2 | 1461.25 | 2013-01-25 | 8       |
| 3 | 1461.35 | 2013-02-01 | 3       |
| . | .       | .          | .       |
| . | .       | .          | .       |
+---+---------+------------+---------+
The first approach is aggregating and then merging. As this has to be done for a list, the code is in a loop:
df.agg <- list() # creates a list
for (l in 1:length(SPX1)) {
  df.agg[[l]] <- SPX1[[l]] %>%
    aggregate(fifteen ~ ticker, data = SPX1[[l]], min) # Find the minimum value of fifteen for each ticker
  df.minSPX1 <- merge(df.agg[[l]], SPX1[[l]]) # Merge the datasets so we only get the row with the min fifteen value
}
I get:
Error in get(as.character(FUN), mode = "function", envir = envir) :
object 'SPX1' of mode 'function' was not found
Another approach; however, it just changes all the values of the first column to one and doesn't delete any rows when merging:
TESTER<- which.min(SPX1[[1]]$fifteen) # Finds which row has minimum value of fifteen
df.minSPX1 <- merge(TESTER, SPX1[[1]],by.x=by) #Try to merge so I only get the row with min. fifteen
I have tried reading other answers on SO, but maybe because of the way the lists are set up this won't work?
I hope you can tell me where I'm getting this wrong.
Try this:
df<- lapply(SPX, function(x) x[x$fifteen==min(x$fifteen),])
df<- as.data.frame(df)
Edit:
As suggested by @Gregor-Thomas, this will work when there is a tie:
df<- lapply(SPX, function(x) x[which.min(x$fifteen), ])
df<- as.data.frame(df)
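Note that as.data.frame() on a list of one-row data frames spreads them out across columns rather than stacking them. If the goal is one combined data frame with one row per list element, a minimal sketch (assuming all elements of SPX share the same columns) is:
rows <- lapply(SPX, function(x) x[which.min(x$fifteen), ])
df <- do.call(rbind, rows)  # one row per list element, stacked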

Combining Grep and a For-Loop to Construct A Matrix (R)

I have a huge list of small data frames which I would like to meaningfully combine into one; however, the logic of how to do so escapes me.
For instance, I have a list of data frames that looks something like this, albeit with far more files, many of which I do not want in my data frame:
MyList = c("AthosVersusAthos.csv", "AthosVerusPorthos.csv", "AthosVersusAramis.csv", "PorthosVerusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv", "AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVerusPothos.csv", "BobVersusMary.csv", "LostCities.txt")
What I want is to assemble these into one large data frame, which would look like this:
AthosVersusAthos   | PorthosVersusAthos   | AramisVersusAthos
-------------------+----------------------+--------------------
AthosVersusPorthos | PorthosVersusPorthos | AramisVersusPorthos
-------------------+----------------------+--------------------
AthosVersusAramis  | PorthosVersusAramis  | AramisVersusAramis
Or perhaps more correctly (with sample numbers in only one portion of the matrix):
        |   Athos    | Porthos | Aramis
--------+------------+---------+--------
        | 10  9  5   |         |
Athos   |  2 10  4   |         |
        |  3  0 10   |         |
--------+------------+---------+--------
Porthos |            |         |
--------+------------+---------+--------
Aramis  |            |         |
-----------------------------------------
What I have managed so far is:
Musketeers = c("Athos", "Porthos", "Aramis")
for (i in 1:length(Musketeers)) {
  for (j in 1:length(Musketeers)) {
    CombinedMatrix <- cbind(
      rbind(MyList[grep(paste0("^(", Musketeers[i],
                               ")(?=.*Versus[", Musketeers[j], "]"),
                        names(MyList), value = T, perl = T)])
    )
  }
}
What I was trying to do was combine my grep command (quite important given the number of files and the specificity with which I need to select them) with rbind and cbind so that the rows and columns of the matrix are meaningfully concatenated.
My general plan was to merge all the data frames starting with 'Athos' into one column, do the same for the data frames starting with 'Porthos' and 'Aramis', and then combine those three columns row-wise into a final data frame.
I know I'm quite far off but I can't quite get my head around where to start.
Edit: @PierreGramme generated a useful model data set, which I will add below, since I imagine it would have been useful to provide it originally.
Musketeers = c("Athos", "Porthos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPorthos.csv", "AthosVersusAramis.csv",
"PorthosVersusAthos.csv", "PorthosVersusPorthos.csv", "PorthosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPorthos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
First make a reproducible example. Is it faithful? If so, I will add code to my answer:
Musketeers = c("Athos", "Pothos", "Aramis")
MyList = c("AthosVersusAthos.csv", "AthosVersusPothos.csv", "AthosVersusAramis.csv",
"PothosVersusAthos.csv", "PothosVersusPothos.csv", "PothosVersusAramis.csv",
"AramisVersusAthos.csv", "AramisVersusPothos.csv", "AramisVersusAramis.csv",
"BobVersusMary.csv", "LostCities.txt")
MyList = lapply(setNames(nm=MyList), function(x) matrix(rnorm(9), nrow=3, dimnames=list(c("a","b","c"), c("x","y","z"))) )
And then is it correct that you would like to concatenate 9 of these matrices into your combined matrix shaped as you described?
Edit:
Then here is the code solving your problem:
# Helper function to extract the relevant portion of MyList and rbind() it
makeColumns = function(n) {
  re = paste0("^", n, "Versus")
  sublist = MyList[grep(re, names(MyList))]
  names(sublist) = sub(re, "", sub("\\.csv$", "", names(sublist)))
  # Make sure sublist is sorted correctly and contains info on all musketeers
  sublist = sublist[Musketeers]
  # Change row and col names so that they are unique in the final result
  sublist = lapply(names(sublist), function(m) {
    res = sublist[[m]]
    rownames(res) = paste0(m, "_", rownames(res))
    colnames(res) = paste0(n, "_", colnames(res))
    res
  })
  do.call(rbind, sublist)
}
lColumns = lapply(setNames(nm = Musketeers), makeColumns)
CombinedMatrix = do.call(cbind, lColumns)
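With the model data above, a quick sanity check of the result might look like this (the matrix values are random, but the shape and the names should match):
dim(CombinedMatrix)       # 9 x 9
rownames(CombinedMatrix)  # "Athos_a" "Athos_b" "Athos_c" "Porthos_a" ... "Aramis_c"
colnames(CombinedMatrix)  # "Athos_x" "Athos_y" "Athos_z" "Porthos_x" ... "Aramis_z"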

R - Print record in console like PSQL

Is there a way to print a record from a table in the R console so it looks like psql with \x (expanded display) enabled? Meaning, print it vertically?
Something like this:
-Record1-
Var1 | xxx
Var2 | yyy
Var3 | zzz
kable from knitr will help you!
> df1
  E     F         G         H
1 A 0.9,1
2 B       0.98,0.34 0.98,0.34
3 C
> knitr::kable(df1)
|E |F |G |H |
|:--|:-----|:---------|:---------|
|A |0.9,1 | | |
|B | |0.98,0.34 |0.98,0.34 |
|C | | | |
>
If you gather the dataset using the tidyr package, it will basically be stored in that format.
If the original looks like...
        Var1  Var2  Var3
Record1  xxx   yyy   zzz
The gathered dataset will look like...
Record1 Var1 xxx
Record1 Var2 yyy
Record1 Var3 zzz
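A minimal sketch of that reshaping with tidyr (assuming the record name sits in its own column; gather() has since been superseded by pivot_longer(), but both work):
library(tidyr)
wide <- data.frame(Record = "Record1", Var1 = "xxx", Var2 = "yyy", Var3 = "zzz")
long <- gather(wide, key = "Variable", value = "Value", -Record)
long
#    Record Variable Value
# 1 Record1     Var1   xxx
# 2 Record1     Var2   yyy
# 3 Record1     Var3   zzz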
Or perhaps even using str(YourData[x,]) will be sufficient if just for a quick check of a single record without needing to restructure the dataset.
Or, if you're set on the style in your example, this function is one (rough) way to approach it:
library(magrittr) # for the %>% pipe used below; stringr is also required

vertical_print_function <- function(data_vert, console = TRUE) {
  psql <- character()
  for (i in 1:nrow(data_vert)) {
    psql <- append(psql, paste0("\n\n-Record ", i, "-"))
    colnames_s <- colnames(data_vert)
    var_names <- colnames_s %>% stringr::str_pad(width = max(nchar(.)))
    for (x in 1:ncol(data_vert)) {
      psql <- append(psql, paste0("\n", var_names[x], " | ", data_vert[i, colnames_s[x]]))
    }
  }
  if (console == TRUE) {
    cat(psql)
  } else {
    return(psql)
  }
}
console = TRUE prints immediately to the console (dangerous/pointless if there are many records); FALSE just returns the string, which can then be exported to a text file.
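For example, on the first two rows of the built-in mtcars data (purely as an illustration), it prints something like:
vertical_print_function(head(mtcars, 2))
# -Record 1-
#  mpg | 21
#  cyl | 6
# ...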

Getting a dataframe of logical values from a vector of statements

I have a number of lists of conditions and I would like to evaluate their combinations, and then get binary values for these logical values (TRUE = 1, FALSE = 0). The conditions themselves may change or grow as my project progresses, so I'd like to have one place within the script where I can alter these conditional statements while the rest of the script stays the same.
Here is a simplified, reproducible example:
# get the data
df <- data.frame(id = c(1,2,3,4,5), x = c(11,4,8,9,12), y = c(0.5,0.9,0.11,0.6, 0.5))
# name and define the conditions
names1 <- c("above2","above5")
conditions1 <- c("df$x > 2", "df$x >5")
names2 <- c("belowpt6", "belowpt4")
conditions2 <- c("df$y < 0.6", "df$y < 0.4")
# create an object that contains the unique combinations of these conditions and their names, to be used for labeling columns later
names_combinations <- as.vector(t(outer(names1, names2, paste, sep="_")))
condition_combinations <- as.vector(t(outer(conditions1, conditions2, paste, sep=" & ")))
# create a dataframe of the logical values of these conditions
condition_combinations_logical <- ????? # This is where I need help
# lapply to get binary values from these logical vectors
df[paste0("var_", names_combinations)] <- +(condition_combinations_logical)
to get output that could look something like:
id | x  | y    | var_above2_belowpt6 | var_above2_belowpt4 | etc.
1  | 11 | 0.5  | 1                   | 0                   |
2  | 4  | 0.9  | 0                   | 0                   |
3  | 8  | 0.11 | 1                   | 1                   |
etc. ...
It looks like the dreaded eval(parse()) does it (it's hard to think of a much easier way...). Then use storage.mode()<- to convert from logical to integer:
res <- sapply(condition_combinations,function(x) eval(parse(text=x)))
storage.mode(res) <- "integer"
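Putting it together with the setup from the question (and with the missing parenthesis in the assignment line fixed), the result looks roughly like this:
res <- sapply(condition_combinations, function(x) eval(parse(text = x)))
storage.mode(res) <- "integer"
df[paste0("var_", names_combinations)] <- res
df
#   id  x    y var_above2_belowpt6 var_above2_belowpt4 var_above5_belowpt6 var_above5_belowpt4
# 1  1 11 0.50                   1                   0                   1                   0
# ...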

R data frame column comparison without for loops, complex case

Let's suppose I have a data frame with the following parameters
DATA <- data.frame(ROWID, ID1, NAME1, ...IDn, NAMEn)
Sample of what the data might look like:
ROWID | ID1 | NAME1 | ID2 | NAME2 | IDn | NAMEn
001   | 001 | FAS   | 002 | MAS   | 999 | ZOO
002   | 003 | BIN   | 004 | DUN   | 998 | SOO
I have 201 columns and 10k+ rows. What I would like to do is reshape this data so that each row of the original DATA produces a set of rows in a new data frame. Each output row would consist of the originating ROWID plus an IDa, NAMEa, IDb, NAMEb pair, such that the first pair is matched with all the others (99 pairs containing ID1, 98 with ID2, and so on). This happens for every row, producing a large data frame of all possible within-row combinations. The result would look like:
ROWID1 | ID1 | NAME1 | ID2 | NAME2
ROWID1 | ID1 | NAME1 | ID3 | NAME3
...
ROWID1 | ID2 | NAME2 | ID3 | NAME3
...
ROWID2 | ID1 | NAME1 | ID2 | NAME2
ROWID2 | ID1 | NAME1 | ID3 | NAME3
...
The code I produced to do this is as follows. It works, but only on smaller data frames; on the full data frame it is painfully slow, and I am hoping for alternatives to speed it up, using functions or something else of which I am unaware. Thanks in advance!
DATA <- data.frame(as described above)
META <- data.frame(ROWID = numeric(0), ID1 = numeric(0),
                   BUS1 = character(0), ID2 = numeric(0), BUS2 = character(0))
for (i in 1:length(DATA$ROWID)) {
  SET <- data.frame(ROWID = numeric(0), ID1 = numeric(0),
                    BUS1 = character(0), ID2 = numeric(0), BUS2 = character(0))
  ROWID <- DATA[i, 1]
  for (x in seq(3, ncol(DATA), 2)) {
    for (y in seq(x, ncol(DATA), 2)) {
      ID1 <- DATA[i, x - 2]
      BUS1 <- DATA[i, x]
      ID2 <- DATA[i, y - 2]
      BUS2 <- DATA[i, y]
      if (!is.na(BUS1) && !is.na(BUS2)) {
        NEW <- cbind(ROWID, ID1, BUS1, ID2, BUS2)
        SET <- rbind(SET, NEW)
      }
    }
  }
  META <- rbind(META, SET)
}
Here is my way of writing it, which includes all three of the optimizations I mentioned in the comments. Also, be careful! Your code had some bugs in how it addressed the columns... which I have hopefully also fixed.
require('compiler')
enableJIT(3)
DATA2 = as.matrix(DATA)
# you want a matrix instead of a data.frame, and you want to pre-allocate its size
META2 <- matrix(character(), ncol = 5, nrow = nrow(DATA2) * (ncol(DATA2) - 2)^2 / 2)
colnames(META2) = c("ROWID", "ID1", "BUS1", "ID2", "BUS2")
k = 0
for (i in 1:nrow(DATA2)) {
  for (x in seq(3, ncol(DATA2) - 2, 2)) {
    for (y in seq(x + 2, ncol(DATA2), 2)) {
      k = k + 1
      # no need to use temporary variables
      META2[k, ] = c(DATA2[i, 1], DATA2[i, x - 1], DATA2[i, x], DATA2[i, y - 1], DATA2[i, y])
    }
  }
}
META2 = as.data.frame(META2) # converting back to a data.frame
META2$BUS1 = as.numeric(META2$BUS1)
META2$BUS2 = as.numeric(META2$BUS2)
I will let you handle the case in which BUS1 or BUS2 is NA yourself - basically, you need to not add those rows (and not increment the variable k), and after the loops you need to crop your matrix to remove the trailing empty rows (see the short sketch below).
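A minimal sketch of that final cropping step (assuming k still holds the index of the last row that was filled):
META2 <- META2[seq_len(k), , drop = FALSE]  # drop the unused pre-allocated rows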
