I've finally broken my habit of writing loops in R. These days I mostly compute new columns and then run calculations and aggregations on them.
But I have a question regarding cbind which I use for adding columns.
Is there a better way than using cbind for things like this?
I always name the new column in this tedious way. Is there anything cleverer or simpler out there?
library(quantmod)
getSymbols("^GSPC")
GSPC <- cbind(GSPC, lag(Cl(GSPC), k=1)) #Doing some new column calculation
names(GSPC)[length(GSPC[1,])] <- "Laged_1_Cl" #Naming this new column
GSPC <- cbind(GSPC, lag(Cl(GSPC), k=2))
names(GSPC)[length(GSPC[1,])] <- "Laged_2_Cl"
tail(GSPC)
** EDITED **
Roman Luštrik added a great solution in comments below.
GSPC$Laged_3_Cl <- lag(Cl(GSPC), k=3)
tail(GSPC)
One way of adding new variables to a data.frame is through the $ operator. The help page (?"$") shows the common usage in the form
x$i <- value
Where i is the new variable name and value are its associated values.
You can name the new column on the left side of the assignment like so:
exdat <- data.frame(lets = LETTERS[1:10],
nums = 1:10)
exdat$combo <- paste0(exdat$lets, exdat$nums)
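When the column name is itself built in code (for example, one column per lag), `[[<-` accepts the name as a string. This is a minimal sketch on a plain data frame rather than the xts object from the question; `lag_vec` is a hypothetical helper written for this illustration and is not part of quantmod or xts.

```r
# Hypothetical helper: shift a vector down by k positions, padding with NA
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

prices <- data.frame(Close = c(10, 11, 12, 13, 14))

# Build each column name as a string and assign it with [[<-
for (k in 1:2) {
  prices[[paste0("Lagged_", k, "_Cl")]] <- lag_vec(prices$Close, k)
}

print(prices)
```

The same `[[` form works for reading a column whose name is stored in a variable.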
At the start of a project, I am running some performance checks to decide whether to rely mainly on data frames with dplyr or on data.table. One thing my project will require is repeated row lookups in different contexts using a row ID column value (not the index).
After doing some reading, it seemed that a data.table (from the data.table package) with a properly defined key would provide the best performance. But from my checks below, filtering on a data frame appears to be faster.
I surmise from this post that using matrices is the fastest way of assigning values, but I am surprised that the two data frame approaches in my code below are both faster than the data.table one. Matrices are the fastest of all.
I am new to data.table (and R in general), so can someone point out what I am doing wrong here and how to improve the data.table code so that its performance matches that of data frames?
library("data.table")
library("microbenchmark")
# Create testing data set with character IDs - here we only have an ID
# column and one column, but in real application there will be many
#columns
dt <- data.table(ID = paste0("A", seq(1,5000, by = 1)),
Value = runif(5000,1,100))
setkey(dt, ID) #Set the key on the data table
#Get the table as a data frame
df <- as.data.frame(dt)
rownames(df) <- df$ID #Set row names on data frame
# Get a matrix version -
matx <- matrix(df$Value, nrow = nrow(df), ncol = 1)
rownames(matx) <- df$ID #Set row names on matrix
id <- "A567"
dt.by.key = function(id.value) {
return(dt[.(id.value)]$Value)
}
df.by.filter = function(id.value) {
return(df[df$ID == id.value, ]$Value)
}
df.by.rowname = function(id.value) {
return(df[[id.value, "Value"]])
}
matrix.by.rowname = function(id.value) {
return(matx[[id.value, 1]])
}
microbenchmark(
dt.by.key(id),
df.by.filter(id),
df.by.rowname(id),
matrix.by.rowname(id),
check = "equal",
times = 1000
)
Results below:
Note that I am testing the lookup on the data frame in two ways: by filtering on the ID column, and by row name, where the row names are set from the ID column. The matrix lookup appears to be the fastest.
Note: At first I presented my performance testing code using system.time with for loops. Most of the comments below focused on the for loops instead of the main question, so I have refactored my example code. Thanks to @r2evans for suggesting the use of microbenchmark.
Any advice related specifically to improving lookup performance on data.table is much appreciated!
I'm new to R and was wondering if there is a way to assign names to the columns of a matrix without using the colnames() function.
#creating two vectors
player <- c(rep('dark',5),rep('light',5))
piece <-c('king','queen','pawn','pawn','knight','bishop','king','rook','pawn','pawn')
#creating a matrix
matrix2 <- c(player, piece)
dim(matrix2) <- c(10, 2)
#this would work perfectly, but I was looking for an alternative method that doesn't use
#the colnames() function
colnames(matrix2) <- c('player','piece')
I also know that using cbind() would give me a matrix with column names as those of the two vectors
matrix2<-cbind(player,piece)
But I don't want to create my matrix with the cbind() function. I wanted to know if there is a way to name the columns of the matrix, other than using the colnames() function, after creating the matrix as I have above.
Difficult to answer. Do you mean like this?
dimnames(matrix2) <- list(c(1:10), c("player", "piece"))
EDIT, without naming the rows (see comments, @akrun mentioned that earlier):
dimnames(matrix2) <- list(NULL, c("player", "piece"))
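The names can also be supplied at the moment the matrix is created, through the `dimnames` argument of `matrix()`. A short sketch using the same two vectors from the question:

```r
player <- c(rep("dark", 5), rep("light", 5))
piece <- c("king", "queen", "pawn", "pawn", "knight",
           "bishop", "king", "rook", "pawn", "pawn")

# dimnames is a list: row names first (NULL for none), then column names
matrix2 <- matrix(c(player, piece), ncol = 2,
                  dimnames = list(NULL, c("player", "piece")))

colnames(matrix2)
```

This avoids both `colnames()` and `cbind()`, at the cost of building the matrix with `matrix()` instead of assigning `dim()` afterwards.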
I am trying to loop through a large address data set (300,000+ lines) based on a common factor for each observation, ID2. This data set contains addresses from two different sources, and I am trying to find matches between them. To determine a match, I want to loop through each ID2 as a factor and search for a line from each of the two data sets (the building and property data sets). Here is a picture of my desired output: [picture of desired output]
Here is a sample code of what I have tried
PROPERTYNAME=c("Vista 1","Vista 1","Vista 1","Chesnut Street","Apple Street","Apple Street")
CITY=c("Pittsburgh","Pittsburgh","Pittsburgh","Boston","New York","New York")
STATE= c("PA","PA","PA","MA","NY","NY")
ID2=c(1,1,1,2,3,3)
IsBuild=c(1,0,0,0,1,1)
IsProp=c(0,1,1,1,0,0)
df=data.frame(PROPERTYNAME,CITY,STATE,ID2,IsBuild,IsProp)
for(i in levels(as.factor(df$ID2))){
for(row in 1:nrow(df)){
df$Any_Build[row][i]<-ifelse(as.numeric(df$IsBuild[row][i])==1)
df$Any_Prop[row][i]<-ifelse(as.numeric(df$IsProp[row][i])==1)
}
}
I've tried nested for loops but have had no luck, and I am struggling with R's apply functions. I would appreciate any help. Thank you!
If your main dataset is called D and the building data set is called B and the property dataset is called P, you can do the following:
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2
If you want some data in B, like let's say an address, you can use merge:
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE, all.y = FALSE)
If every row in B has an address, then the NAs in the new address column in D should coincide with the FALSEs in D$inB.
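A small worked sketch of both steps. The three data frames and the addresses are made up for illustration; the real D, B and P would come from the asker's data.

```r
# Made-up toy data standing in for the main, building and property sets
D <- data.frame(ID2 = c(1, 2, 3, 4))
B <- data.frame(ID2 = c(1, 3),
                address = c("12 Vista Rd", "5 Apple St"),
                stringsAsFactors = FALSE)
P <- data.frame(ID2 = c(1, 2))

# Membership flags: TRUE where the ID2 appears in the other data set
D$inB <- D$ID2 %in% B$ID2
D$inP <- D$ID2 %in% P$ID2

# Left join: keep every row of D; address is NA where B has no match
D <- merge(D, B[c("ID2", "address")], by = "ID2", all.x = TRUE)

print(D)
```

In this toy data the rows with `NA` in `address` are exactly the rows where `inB` is `FALSE`.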
How does ID2 affect the output? If it doesn't have any effect, you can use the same logic as in your example code without the loop. ifelse is vectorized, so you don't have to run it once per row.
Edited formatting:
LIHTCComp1$AnyBuild <- ifelse(LIHTCComp1$IsBuild ==1,TRUE,FALSE)
LIHTCComp1$AnyProp <- ifelse(LIHTCComp1$IsProp ==1,TRUE,FALSE)
Hope this helps.
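If ID2 does matter, one possible reading is that each row should be flagged when any row sharing its ID2 is a building (or property) row. That reading is an assumption, since the picture of the desired output is not available. Under it, `ave()` gives a loop-free sketch on the question's sample data:

```r
df <- data.frame(
  PROPERTYNAME = c("Vista 1", "Vista 1", "Vista 1",
                   "Chesnut Street", "Apple Street", "Apple Street"),
  ID2     = c(1, 1, 1, 2, 3, 3),
  IsBuild = c(1, 0, 0, 0, 1, 1),
  IsProp  = c(0, 1, 1, 1, 0, 0)
)

# ave() applies FUN within each ID2 group and returns a vector of the
# original length; a group maximum of 1 means at least one flagged row
df$Any_Build <- ave(df$IsBuild, df$ID2, FUN = max) == 1
df$Any_Prop  <- ave(df$IsProp,  df$ID2, FUN = max) == 1

print(df)
```

Rows where both flags are TRUE belong to an ID2 that appears in both sources.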
I have a column header stored in a variable as follows:
a <- get("colA") # this variable changes and was obtained using regexp
The value of a is actually a column header called Nimu.
I also have a data frame (BigData) that has Nimu as a column header along with the other columns. How can I use cbind/data.frame to select only a few columns, including Nimu, into a new data frame?
I have tried:
data <- cbind(BigData$Miu,BigData$sil,BigData$a)
But this did not work. R did not like BigData$a. Any suggestions? Thanks.
Something like this should work:
a <- get("colA")
b <- get("colB")
c <- get("colC")
cols = c(a, b, c)
df_subset = BigData[cols]
I do think your solution using get is probably sub-optimal and not needed, but without more context it is hard to say.
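A self-contained sketch of the same idea without `get`. The toy `BigData` below is made up, with column names taken from the question, and `a` is assumed to already hold the column name as a string.

```r
# Toy stand-in for BigData; only the column names come from the question
BigData <- data.frame(Miu = 1:3, sil = 4:6, Nimu = 7:9, other = 10:12)

a <- "Nimu"  # column name held in a variable

# BigData$a looks for a column named "a"; [[ ]] uses the value of a instead
one_col <- BigData[[a]]

# Character indexing selects several columns, mixing literals and variables
new_data <- BigData[, c("Miu", "sil", a)]

print(new_data)
```

Indexing with a character vector keeps the result a data frame with its column names, whereas `cbind` on bare vectors would return an unnamed matrix.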
I am a beginner to R programming and am trying to add one extra column to a matrix having 50 columns. This new column would be the avg of first 10 values in that row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the below error
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 : incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row; if I use this cbind in the loop, all the old values are overwritten.
How do I add the average of the first 10 values as a new column? Is there a better way to do this than looping over rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets it is however faster (and arguably simpler) to use:
cbind(a, rowMeans(a[,1:10]) )
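A quick check, on the same example matrix, that the two approaches agree. The column name `first10ave` is only an illustrative choice.

```r
a <- matrix(1:5000, nrow = 100)

# One mean per row, computed over the first 10 columns, in two ways
via_apply    <- apply(a[, 1:10], 1, mean)
via_rowMeans <- rowMeans(a[, 1:10])

all.equal(via_apply, via_rowMeans)

# Naming the argument in cbind() names the new column
a <- cbind(a, first10ave = via_rowMeans)
dim(a)
```

The result stays a matrix, now with 51 columns.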
Methinks you are overthinking this.
a <- matrix(1:5000, nrow=100)
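# rowMeans over the first 10 columns gives one average per row;
# note that transform() returns a data frame, not a matrix
a <- transform(a, first10ave = rowMeans(a[, 1:10]))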