I have a fairly complex set of operations that I need to apply to four different dummy variables that share the same core name but end in a different number. I would like to apply these operations in one go rather than repeating the code four times.
As an example, here's a made-up dataset for illustrative purposes:
n <- c(1:100)
var1 <-NA
var1[n < 20] <- 1
var1[n >50] <- 0
var2 <-NA
var2[n < 30] <- 1
var2[n >50] <- 0
var3 <-NA
var3[n < 10] <- 1
var3[n >40] <- 0
var4 <-NA
var4[n < 20] <- 1
var4[n > 50] <- 0
df <- data.frame(var1, var2, var3, var4, n)
There are three main steps I need to loop over for these variables: subset the data frame, create a new variable for each of the original ones, and write the results into a new data frame. Please don't ask me why I need to do this; these steps are part of a much larger piece of code.
These are the steps I need to perform but on all 4:
df_sub <- subset(df, !is.na(df$var1))
sample1 <- nrow(df_sub[df_sub$var1 == 1, ])
if (sample1 < 35) {
  a1 <- NA
} else {
  a1 <- mean(df_sub$n[df_sub$var1 == 1])
}
# a2, a3 and a4 would be computed the same way from var2, var3 and var4
new_df <- data.frame(a1, a2, a3, a4)
I was thinking of looping over the suffix, but I cannot figure out how R deals with this. I found a solution for creating a variable in a loop through assign() (https://stats.stackexchange.com/questions/10838/produce-a-list-of-variable-name-in-a-for-loop-then-assign-values-to-them).
However, I still cannot figure out how to deal with the subset, or, more generally, how to loop over a number in a variable name rather than over a column number, a list, etc.
Alternatively, is there a way to write a function that creates variables and exports them to the environment outside the function, so that I can apply it to var1 - var4 in df and still get four different versions of a (a1 - a4) in new_df?
You can start the loop and update the variable over which you work by using get() and then use assign(). As an example:
for (i in 1:number_of_variables){
variable=get(paste0("var",i))
... work on the variable ...
# Returns
assign(paste0("df_sub",i), ... your result ...)
}
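As an alternative to get() and assign(), the same steps can be sketched by looping over the column names and collecting the results in a named vector. This is only a sketch under stated assumptions: the helper mean_if_enough() and its min_n argument are hypothetical names, not part of the original code, and the example data is rebuilt with ifelse() (using a cutoff of 50 for var4, which is an assumption).

```r
# Rebuild the example data (var4 cutoff of 50 is an assumption)
n <- 1:100
df <- data.frame(
  var1 = ifelse(n < 20, 1, ifelse(n > 50, 0, NA)),
  var2 = ifelse(n < 30, 1, ifelse(n > 50, 0, NA)),
  var3 = ifelse(n < 10, 1, ifelse(n > 40, 0, NA)),
  var4 = ifelse(n < 20, 1, ifelse(n > 50, 0, NA)),
  n = n
)

# Hypothetical helper: mean of n where the dummy equals 1,
# or NA when fewer than min_n such rows exist
mean_if_enough <- function(name, data, min_n = 35) {
  sub <- data[!is.na(data[[name]]), ]
  ones <- sub$n[sub[[name]] == 1]
  if (length(ones) < min_n) NA_real_ else mean(ones)
}

vars <- paste0("var", 1:4)                     # "var1" ... "var4"
a <- sapply(vars, mean_if_enough, data = df)   # one value per variable
new_df <- as.data.frame(as.list(setNames(a, paste0("a", 1:4))))
```

With the threshold of 35 from the question, every group in this toy data is too small, so all four columns of new_df are NA; a lower min_n returns the group means instead. Referring to columns by name with data[[name]] avoids creating numbered objects in the global environment.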
It may sound trivial and the solution is probably quite simple but I can't figure it out.
I just want to combine all my variables in a data.frame. I wonder if there is a way to do that without choosing them one by one, but instead telling R that I want to use all of the already existing variables?
var1 <- c(1,2)
var2 <- c(3,4)
Instead of doing this
df <- data.frame(var1, var2)
I want to do something like this
df <- data.frame(-ALL_VARIABLES_IN_ENVIRONMENT-)
I've tried ls() (and objects()), also in combination with unquote() and names(), but this only gives me a vector of names (unquoted or not) and not the environment's objects.
var1 <- 1:3
var2 <- 1:3
data.frame(sapply(ls(), get))
# var1 var2
# 1 1 1
# 2 2 2
# 3 3 3
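A variant of the same idea, sketched with mget(), which fetches several objects by name and returns them as a named list. The separate environment env is an assumption made here so the example does not pick up unrelated objects from the workspace; it also assumes all variables have the same length.

```r
# Keep the variables in their own environment (an assumption for this sketch)
env <- new.env()
assign("var1", c(1, 2), envir = env)
assign("var2", c(3, 4), envir = env)

# mget() returns a named list of the objects; as.data.frame() turns it into columns
df_all <- as.data.frame(mget(ls(env), envir = env))
```

Because the list is converted column by column, each column keeps its own type, whereas sapply() would coerce everything to a single matrix type first.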
I would like to ask if it is possible to apply this function to a data.table approach:
myfunction <- function(i) {
a <- test.dt[i, 1:21, with = F]
final <- t((t(b) == a) * value)
final[is.na(final)] <- 0
sum.value <- rowSums(final)
final1 <- cbind(train.dt, sum.value)
final1 <- final1[order(-sum.value),]
final1 <- final1[final1$sum.value > 0,]
suggestion <- unique(final1[, 22, with = F])
suggestion <- suggestion[1:5, ]
return(suggestion)
}
This is a custom kNN function I wrote for character columns. It returns the top 5 suggestions/predictions. However, it performs poorly on large test data, and so far I have not been able to tune it myself.
The variables used are as follows:
train.dt -- the training data, includes 22 columns (21 features, 1 label column)
test.dt -- the test data, same structure as training data
value -- a vector that contains the weights/importance value of 21 features
sum.value -- sum of all the weights on value vector (sum(value))
b -- has the same data as the training data, but excluding the label column
a -- has the same data as the test data, but excluding the label column
suggestion -- the output
Also, I want to use lapply (or another appropriate member of the apply family) with this function. The i argument is the row number in the test data, meaning I want to apply the function to each row of the test data. I have not managed to do this yet.
I hope this is clear, and thank you in advance!
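The row-wise application described above can be sketched with lapply over the row indices. This is a simplified base R illustration, not the original function: the toy data has two feature columns instead of 21, and the names train, test and suggest are hypothetical.

```r
# Toy data: two character features plus a label (the real data has 21 features)
train <- data.frame(f1 = c("a", "a", "b"), f2 = c("x", "y", "y"),
                    label = c("L1", "L2", "L3"), stringsAsFactors = FALSE)
test <- data.frame(f1 = c("a", "b"), f2 = c("y", "y"), stringsAsFactors = FALSE)
value <- c(2, 1)                        # feature weights
b <- as.matrix(train[, c("f1", "f2")])  # training features without the label

# Hypothetical helper: weighted match score of test row i against every training row
suggest <- function(i, top = 5) {
  a <- unlist(test[i, c("f1", "f2")])
  # t(b) has one column per training row, so `a` and `value` recycle down the features
  score <- colSums((t(b) == a) * value)
  ord <- order(-score)
  ord <- ord[score[ord] > 0]            # drop rows with no matching feature
  head(unique(train$label[ord]), top)
}

# One result per test row
suggestions <- lapply(seq_len(nrow(test)), suggest)
```

Here seq_len(nrow(test)) supplies the row numbers, so each list element of suggestions holds the labels for one test row, ordered by score.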
I'm trying to change column names over multiple data sets. I have tried writing the following function to do this:
# simplified test data #
df1<-as.data.frame(c("M","F"))
colnames(df1)<-"M1"
# my function #
rename_cols<-function(df){
colnames(df)[names(df) == "M1"] <- "sex"
}
rename_cols(df1)
However when testing this function on df1, the column is always called "M1" instead of "sex". How can I correct this?
SOLUTION - THANKS TO DAVID ARENBERG
rename_cols<-function(df){
colnames(df)[names(df) == "M1"] <- "sex"
df
}
df1<-rename_cols(df1)
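Since the goal was to rename columns across multiple data sets, one way to extend this is to keep the data frames in a list and apply the function to each element. A minimal sketch, assuming a hypothetical second data frame df2 and a list called datasets:

```r
rename_cols <- function(df) {
  colnames(df)[names(df) == "M1"] <- "sex"
  df  # return the modified copy
}

# Hypothetical collection of data sets that share the "M1" column
datasets <- list(
  df1 = data.frame(M1 = c("M", "F")),
  df2 = data.frame(M1 = c("F", "F"), age = c(30, 41))
)

# lapply returns a new list with the renamed data frames
datasets <- lapply(datasets, rename_cols)
```

Columns other than M1 are left untouched, and the result has to be assigned back because the function works on a copy.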
Here is another solution which gets around the problem of functions operating in a temporary space:
df <- as.data.frame(c("M","F"))
colnames(df) <- "M1"
rename_cols <- function(df) {
colnames(df)[names(df) == "M1"] <<- "sex"
}
> rename_cols(df) # this will operate directly on the 'df' object
> df
sex
1 M
2 F
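Using the global assignment operator <<- makes the name changes to the data frame df "stick". Granted, this solution is not ideal, because the function now has a side effect: <<- assigns to the object named df in the enclosing (here global) environment, so it only works because the global data frame is also called df, whatever is passed as the argument. But I feel this is in the spirit of what you were trying to do originally.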
I am trying to convert cross-sectional data into an adjacency matrix, as I want to analyze how often certain variables are present together with social network analysis.
In case empirical examples would help with the logic, it's basically analogous to presenting 4 people with a choice of three objects; they can choose from 0 to 3 of the objects. I'd like to analyze how commonly different objects were chosen together and visualize this as a network of preferences.
The data is set up as cross-sectional data, below:
ID1 <- c(1,0,0)
ID2 <- c(1,0,1)
ID3 <- c(1,1,1)
ID4 <- c(0,0,0)
IDs <- c("1","2","3","4")
df <- data.frame(rbind(ID1, ID2, ID3, ID4))
df <- cbind(IDs, df)
colnames(df) <- c("ID", "Var1", "Var2", "Var3")
I'd like to create a weighted adjacency matrix for Var1, Var2 and Var3, with each cell containing the total number of times the two variables occur together among the observations.
So the basic procedure I was thinking about is to create a separate matrix for each row (each ID number) with a 1 or 0 for each cell indicating whether or not both variables are present for the ID. And then add these matrices together, so the final matrix gives the total number of joint appearances.
I've been looking around and haven't quite gotten it right. I thought of using outer, but it would need to work for each column in sequence. This answer was pretty close, but I wasn't exactly sure how they were adding together the values. I ended up with a list of matrices, but the values didn't correspond to the initial data:
Convert categorical data in data frame to weighted adjacency matrix
This answer was also close, although it seemed to have a different type of data. It gave me an adjacency matrix based on the IDs:
http://r.789695.n4.nabble.com/Conversion-to-Adjacency-Matrix-td794102.html
Here is very messy code to manually create a matrix for one observation, just so you get a sense for what I'm going for (using a vector representing just the first ID observation)
ID1 <- c(1,0,0)
var1 <- ID1[[1]]
var2 <- ID1[[2]]
var3 <- ID1[[3]]
onetwo <- var1 * var2
onethree <- var1 * var3
twothree <- var2 * var3
oneone <- var1 * var1
twotwo <- var2 * var2
threethree <- var3 * var3
rows1 <- rbind(oneone, onetwo, onethree)
rows2 <- rbind(onetwo, twotwo, twothree)
rows3 <- rbind(onethree, twothree, threethree)
df2 <- cbind(rows1, rows2, rows3)
This is obviously not ideal. My actual dataset has 198 observations and 33 variables, so even with looping or the apply functions it would be very inefficient.
I can't tell if I'm making this more difficult than it needs to be, or if I'm trying to force my data to do something it wasn't meant to do. But if anyone has run into this sort of task before, please let me know. Is there a way to create the desired adjacency matrix directly? Should I convert this into an edge list first, and is there a good way to do that? Is there code that would make the first step (creating a matrix for each row of the data frame) more efficient?
Thanks for your help,
I'm not sure if I understand the question, but is this what you want?
nc=33
nr=198
m3<-matrix(sample(0:1,nc*nr,replace=TRUE),nrow=nr)
df3<-data.frame(m3)
m3b <- matrix(0, nrow=nc, ncol=nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    t3 <- table(df3[,i], df3[,j])
    m3b[i,j] = t3[2,2]  # t3[2,2] contains the count of df3[,i] = df3[,j] = 1
    # or
    # t3 = sum(df3[,i]==df3[,j] & df3[,i] == 1)
    # m3b[i,j] = t3
  }
}
or, if you want the sum of the product, which gives the same result if everything is 1 or 0
m3c <- matrix(0, nrow=nc, ncol=nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    sv = 0
    for (k in seq(1, nr)) {
      vi = df3[k,i]
      vj = df3[k,j]
      sv = sv + vi*vj
    }
    m3c[i,j] = sv
  }
}
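The sum of products in the loops above is the definition of the matrix product t(X) %*% X, so for 0/1 data the same counts can be obtained in one step with crossprod(). A short sketch using the four observations from the question; the matrix name X is an assumption.

```r
# Rows are observations (IDs), columns are the 0/1 variables from the question
X <- rbind(ID1 = c(1, 0, 0),
           ID2 = c(1, 0, 1),
           ID3 = c(1, 1, 1),
           ID4 = c(0, 0, 0))
colnames(X) <- c("Var1", "Var2", "Var3")

# crossprod(X) is t(X) %*% X: cell [i, j] sums X[k, i] * X[k, j] over all rows k
adj <- crossprod(X)

# The diagonal holds how often each variable occurs on its own;
# set it to 0 if self-ties are not wanted in the network
adj_no_self <- adj
diag(adj_no_self) <- 0
```

For a data frame, as.matrix() on the variable columns (without the ID column) would give a suitable X.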
I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop that I wrote that does this, but it is very inefficient and I'd like something quicker and more efficient.
datanew <- c()
for (i in 1:51) {
  for (k in 1:6) {
    for (m in 1:4) {
      sub <- subset(data, data$var1==i & data$var2==k)
      sub[,4:(ncol(sub)-1)] <- filingstat0711[i,k,m]*sub[,4:(ncol(sub)-1)]
      sub$newvar <- m
      datanew <- rbind(datanew, sub)
    }
  }
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), ncol=100, nrow=20e2))
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(3, ncol(mydf)-3-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
  mapply(function(i, k)
    # The function that you are `mapply`ing is:
    #   rbind'ing a list of dataframes, which were subsetted by matching var1 & var2
    #   and then multiplying by a value in filingstat
    do.call(rbind,
      # iterating over m
      lapply(1:4, function(m)
        # the cbind is for adding the newvar=m, at the end of the subtable
        cbind(
          # we transpose twice: first the subset to multiply our vector.
          # Then the result, to get back our original form
          t( t(subset(mydf, var1==i & mydf$var2==k)) *
            (multpVec * filingstat0711[i,k,m] + invVec)),
          # this is an argument to cbind
          "newvar"=m)
      )),
    # the two lists you are passing as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )
# flatten the data frame
newdf <- do.call(rbind, newdf)
Two points to note:
1. Try not to use names like data, table, df, sub, etc. for your objects, since these are also the names of existing R functions. In the above code I used mydf in place of data.
2. You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation.
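Another possibility, sketched here as an assumption rather than a tested replacement for the code above, is to avoid subsetting altogether: indexing the array with a three-column matrix looks up one scalar per row, so each category m needs only a single vectorized multiplication. The function name expand_by_category and the toy objects are hypothetical.

```r
# Hypothetical helper: one scaled copy of `d` per category m, stacked with rbind
expand_by_category <- function(d, scal, cols) {
  do.call(rbind, lapply(seq_len(dim(scal)[3]), function(m) {
    out <- d
    # matrix indexing: row r picks scal[var1[r], var2[r], m]
    f <- scal[cbind(d$var1, d$var2, m)]
    out[cols] <- out[cols] * f   # f has one factor per row
    out$newvar <- m
    out
  }))
}

# Toy example: 3 rows, 2 x 2 x 2 array of scalars
toy <- data.frame(var1 = c(1, 2, 1), var2 = c(1, 1, 2), id = 1:3,
                  x = c(10, 20, 30), y = c(1, 2, 3), last = c(7, 8, 9))
scal <- array(1:8, dim = c(2, 2, 2))

# Scale columns 4 to ncol - 1, as in the original loop
res <- expand_by_category(toy, scal, cols = 4:(ncol(toy) - 1))
```

Note that the rows come out ordered by m and then by the original row order, not by i and k as in the loop, so a sort would be needed if the order matters.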