Separating database by attribute using a loop - r

I am trying to separate a database by year using a loop in R, but I'm having trouble saving the multiple results. My code is:
d<- read.csv("BD_070218.csv")
results<-NULL
for(i in 1990:2015){
ano<-d[which(d$year==i),]
results[[i]] <- ano
}

I think I understand your question.
Two potential methods.
# Set a seed
set.seed(1)
# Create example dataframe
d <- data.frame(
a=1:120,
year=sample(1990:2015,120,replace = TRUE),
d=sample(letters,120,replace = TRUE)
)
# Method 1: Nested dataframes in a list
results<-list()
for(i in 1990:2015){
ano<-d[which(d$year==i),]
results[[paste0("year_", i)]] <- ano
}
str(results)
results[["year_2012"]]
# Method 2: individual dataframes
for(i in 1990:2015){
ano<-d[which(d$year==i),]
assign(paste0("year_",i), ano, envir = .GlobalEnv)
}
str(year_2000)
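As an alternative to both loops, base R's split() builds a similar named list in one call. This is a minimal sketch: the example data frame d is re-created so the block runs on its own, and the year_ prefix is added only for consistency with Method 1.

```r
set.seed(1)
d <- data.frame(
  a = 1:120,
  year = sample(1990:2015, 120, replace = TRUE),
  d = sample(letters, 120, replace = TRUE)
)

# split() returns one data frame per distinct value of d$year,
# named after that value ("1990", "1991", ...)
by_year <- split(d, d$year)
names(by_year) <- paste0("year_", names(by_year))

# Number of rows that ended up in each year's data frame
sapply(by_year, nrow)
```

Unlike the loops, split() creates no element for a year that does not occur in the data, so check names(by_year) before indexing by a specific year.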

Related

how to make a loop to fetch one variable from 1000 dataframes

I have data frames named V1 ... V1000. Each one contains a variable with the same name, 'var1.predict'. I'm having a hard time writing a loop that concatenates all of these variables into one new data frame.
This is the call I want to turn into a loop:
df <- cbind.data.frame(model_V1$var1.pred,model_V2$var1.pred,.....model_V1000$var1.pred)
I hope someone can help solve this.
thank you
a new dataframe formed by taking one variable from each dataframe
I assume that you mean you have 1,000 dataframes V1... V1000 each with the column var1.predict and you want to extract the predictions column from each df. If so, there are a few methods outlined below with a little reprex:
# putting dummy data in to the global env
lapply(1:3, \(i) {
assign(paste0("V", i), data.frame(v1 = rnorm(5),
v2 = rnorm(5),
var1.predict = rnorm(5)), envir = .GlobalEnv)
})
df_list <- list(V1, V2, V3)
# using a for loop and do.call
pred_cols <- list()
for (df in df_list) {
pred_cols <- c(pred_cols, list(df[["var1.predict"]]))
}
# cbind() on plain vectors gives a matrix; convert it and keep the result
pred_cols_df <- as.data.frame(do.call(cbind, pred_cols))
pred_cols_df
# using a loop without do.call
for (i in seq_along(df_list)) {
if (i == 1) {
pred_cols_df <- df_list[[1]][["var1.predict"]]
} else {
pred_cols_df <- cbind(pred_cols_df, df_list[[i]][["var1.predict"]])
}
}
pred_cols_df <- as.data.frame(pred_cols_df)
pred_cols_df
# using lapply
pred_cols <- lapply(df_list, `[`, "var1.predict")
# each element is a one-column data frame, so cbind() already returns a data frame
pred_cols_df <- do.call(cbind, pred_cols)
pred_cols_df
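With 1,000 data frames, typing them into list() by hand is impractical. Here is a sketch of collecting them by name with mget(), assuming the objects really are called V1 ... V1000 in the current environment. Three dummy ones are created here, and n is a stand-in for 1000.

```r
set.seed(42)
n <- 3  # stand-in for 1000
for (i in seq_len(n)) {
  assign(paste0("V", i),
         data.frame(v1 = rnorm(5), var1.predict = rnorm(5)))
}

# mget() looks the objects up by name and returns them as a named list
df_list <- mget(paste0("V", seq_len(n)))

# Pull the same column from every data frame; sapply() binds the
# equal-length vectors into a matrix with one column per data frame
pred_cols_df <- as.data.frame(sapply(df_list, `[[`, "var1.predict"))
pred_cols_df
```

Because the list is named, the columns of the result carry the names of the data frames they came from.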

R loop to create data frames with 2 counters

What I want is to create 60 data frames with 500 rows in each. I tried the below code and, while I get no errors, I am not getting the data frames. However, when I do a View on the as.data.frame, I get the view, but no data frame in my environment. I've been trying for three days with various versions of this code:
getDS <- function(x){
for(i in 1:3){
for(j in 1:30000){
ID_i <- data.table(x$ID[j: (j+500)])
}
}
as.data.frame(ID_i)
}
getDS(DATASETNAME)
We can use outer (on a small example)
out1 <- c(outer(1:3, 1:3, Vectorize(function(i, j) list(x$ID[j:(j + 5)]))))
lapply(out1, as.data.table)
--
The issue in the OP's function is that ID_i is overwritten on every iteration of the loop, so earlier results are never stored. To keep them, we can initialize a list and store each result in it:
getDS <- function(x) {
ID_i <- vector('list', 3)
for(i in 1:3) {
for(j in 1:3) {
ID_i[[i]][[j]] <- data.table(x$ID[j:(j + 5)])
}
}
ID_i
}
do.call(c, getDS(x))
data
x <- data.table(ID = 1:50)
I'm not sure the description matches the code, so I'm a little unsure what the desired result is. That said, it is usually not helpful to split a data.table because the built-in by-processing makes it unnecessary. If for some reason you do want to split into a list of data.tables you might consider something along the lines of
getDS <- function(x, n=5, size = nrow(x)/n, column = "ID", reps = 3) {
x <- x[1:(n*size), ..column]
index <- rep(1:n, each = size)
replicate(reps, split(x, index),
simplify = FALSE)
}
getDS(data.table(ID = 1:20), n = 5)
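If the goal is what the question describes, consecutive blocks of 500 rows, a grouping index built from the row number may be enough. This base-R sketch uses a plain data.frame and assumes non-overlapping chunks are wanted (the question's j:(j+500) windows overlap, so this is a guess at the intent). chunk_size and the dummy data are illustrative.

```r
x <- data.frame(ID = 1:30000)
chunk_size <- 500

# Rows 1-500 get index 1, rows 501-1000 get index 2, and so on
index <- ceiling(seq_len(nrow(x)) / chunk_size)

# One data frame per chunk, returned as a list
chunks <- split(x, index)

length(chunks)       # number of chunks
nrow(chunks[[1]])    # rows in the first chunk
```

Keeping the pieces in a list means chunks[[k]] replaces 60 separately named objects in the environment.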

Running multiple iterations of K-Means with different values for number of centroids

I have a large dataset and I am trying to run a K-means cluster analysis. However, I want to repeat this with multiple iterations by changing the number of centroids. Here's what I've done so far:
# import data
week1 <- read.csv("WEEK1.csv", header = TRUE)
week2 <- read.csv("WEEK2.csv", header = TRUE)
week3 <- read.csv("WEEK3.csv", header = TRUE)
week4 <- read.csv("WEEK4.csv", header = TRUE)
data <- rbind(week1, week2, week3, week4)
# variable names
for(i in 1:50){
assign(paste("cluster", i, sep = ""), i)
}
I've spent a long time trying to figure out how to "recall" my cluster variables in a for loop so that I can do something like this:
for (i in 1:50){
cluster[i] <- kmeans(data, i, nstart = 1)
}
Any thoughts?
Maybe this could help: put the candidate numbers of clusters in a vector and store the results in a list. My example uses at most 3 centroids and the mtcars dataset, because you have not posted your data.
vector <- c() # an empty vector
for(i in 1:3){ # a loop that creates the
# various n of clusters
vector[i] <- assign(paste("cluster", i, sep = ""), i)
}
Now we can create the list of kmeans:
list_k <- list() # an empty list
for (i in vector){ # fill it with the kmeans
list_k[[i]] <- kmeans(mtcars, i, nstart = 1)
}
To access an individual kmeans fit, use:
list_k[[3]]
To access a single component of a fit (here the first one), use:
list_k[[3]][1]
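The helper vector is not strictly needed: lapply() over the candidate values of k returns the list of fits directly. Here is a sketch on mtcars, assuming the total within-cluster sum of squares is what you want to compare across k. The names k_values, fits and wss are illustrative.

```r
set.seed(123)            # kmeans() picks random starting centroids
k_values <- 1:3

# One kmeans fit per candidate number of centroids
fits <- lapply(k_values, function(k) kmeans(mtcars, centers = k, nstart = 1))
names(fits) <- paste0("k", k_values)

# Total within-cluster sum of squares for each k
wss <- sapply(fits, function(fit) fit$tot.withinss)
wss
```

Naming the list elements lets you write fits$k3 instead of remembering which position holds which k.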

Combining multiple function arguments inside list2env(lapply(),)

I'm working with ten training datasets, train1 through train10, and would like to repeat the following statements for 1 through 10 with a single block of code:
train_y_1 <- c(train1$y)
train1$y <-NULL
train_x_1 <- data.matrix(train1)
olsfit_1 <- cv.glmnet(y=train_y_1, x=train_x_1, alpha=1, family="gaussian")
I've read in the forums that lapply() is preferable to for loops. My code:
# Create empty data frames and list (to be populated with values in main program)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10)), envir=.GlobalEnv)
# Create y and x input matrices and run ten lasso regressions
list2env(lapply(mget(paste0('train', 1:10)), mget(paste0('train_y_', 1:10)), mget(paste0('train_x_', 1:10)), mget(paste0('lasso_', 1:10)),
function(a,b,c,d)
{
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}), envir=.GlobalEnv)
which produces the error message:
Error in match.fun(FUN) :
'mget(paste0("train_y_", 1:10))' is not a function, character or symbol
So it looks like R is confused by the four mget() functions which I intended to be reading in values for the a,b,c,d arguments, but I'm not sure how to proceed next.
Any suggestions?
You want to keep all your data in lists whenever possible, to avoid polluting the global environment with a bunch of variables. The code below is untested and assumes train is a list holding your ten training data frames (it is not defined here). Then you could do something like:
trainy <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10))
trainx <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10))
lasso <- setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10))
f <- function(a,b,c,d) {
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}
mapply(f, train, trainy, trainx, lasso, SIMPLIFY=FALSE)
Although, since your lists are just initializing variables, you probably just want to loop (apply) over a list of your training data,
lapply(train, function(x) {
... # the statements you want to repeat
list(...) # return a list of the three data.frames
})
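Here is one way the skeleton above might be filled in. To keep the sketch runnable without extra packages, lm.fit() from base R stands in for cv.glmnet(); the train list and its columns are made-up dummy data.

```r
set.seed(1)
# Dummy stand-in for the ten training sets, kept in one named list
train <- lapply(1:10, function(i) {
  data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
})
names(train) <- paste0("train", 1:10)

results <- lapply(train, function(dat) {
  y <- dat$y
  dat$y <- NULL
  x <- data.matrix(dat)
  # Stand-in model: ordinary least squares with an intercept column.
  # In the question this would be
  # cv.glmnet(y = y, x = x, alpha = 1, family = "gaussian")
  fit <- lm.fit(x = cbind(Intercept = 1, x), y = y)
  list(y = y, x = x, fit = fit)
})

results$train3$fit$coefficients
```

Each element of results bundles the response, the predictor matrix and the fitted model for one training set, so nothing has to be written to the global environment.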
We can achieve this with the following code.
# Load libraries
library(dplyr);library(glmnet)
# Gather all the variables in global into a list
fit = mget(paste0("train", 1:10), envir = .GlobalEnv) %>%
# Pipe each element of the list into `cv.glmnet` function
lapply(function(dat) {cv.glmnet(y = dat$y,
x = data.matrix(dat %>% mutate(y = NULL)),
alpha = 1,
family = "gaussian")})
Your output will be neatly stored in fit, which is a list with 10 elements. You can call each element with fit[[i]]. For example coef(fit[[1]]) pulls out the coefs for train1 and lapply(fit, coef) pulls the coef for all 10 models and stores them in a list.

For i loop, calling different dataframes

I'm new to loops and I have a problem with accessing a variable from the i-th data frame.
I'm able to access each data frame correctly, but problems arise when I try to access a specific variable inside each data frame:
Example:
for (i in 1:15) {
assign(
paste("model", i, sep = ""),
(lm(response ~ variable, data = eval(parse(text = paste("data", i, sep = "")))))
)
plot(data[i]$response, predict.lm(eval(parse(text = paste("model", i, sep = ""))))) #plot obs vs preds
}
Here I'm doing a simple one variable linear model 15 times, which works just fine. Problems come when I try to plot the results. How should I call data[i] response?
Let's say there are multiple data frames named data1 ... data15 and that no other objects have names beginning with "data". Let's also assume that each of those data frames has columns named 'response' and 'variable'. Then this would gather the data frames into a list and draw a separate plot of the linear regression line for each:
dlist <- lapply(ls(pattern = "^data"), get)
lapply(dlist, function(df) {
  # empty plot with the right axis ranges, then the fitted line
  plot(NA, xlim = range(df$variable), ylim = range(df$response))
  abline(coef(lm(response ~ variable, data = df)))
})
If you wanted to name the data frames in that list, reuse the same ls() call so the names match the order in which the objects were collected (ls() sorts alphabetically, so data10 comes before data2):
names(dlist) <- ls(pattern = "^data")
There are many other assignments you could make in the context of this loop, but you would need to describe the desired results better than with failed efforts.
Here's modified code that should work. It fits a one-variable lm model, calculates the correlation between predicted and observed values, and stores it in a pre-allocated matrix. It also plots these values.
Thanks Thomas for help.
par(mfrow=c(4,5))
results.matrix <- matrix(NA, nrow = 20, ncol = 2)
colnames(results.matrix) <- c("Subset","Correlation")
for (i in seq_along(datalist)) {
  model <- lm(response ~ variable, data = datalist[[i]])
  pred <- predict.lm(model)
  # named ct so that it does not mask the base function cor()
  ct <- cor.test(pred, datalist[[i]]$response)
  plot(pred, datalist[[i]]$response, xlab="pred", ylab="obs")
  results.matrix[i, 1] <- i
  results.matrix[i, 2] <- ct$estimate
}
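The same bookkeeping can be done without pre-sizing a matrix. This sketch assumes datalist is a list of data frames that each have response and variable columns; dummy data is generated here so the block runs on its own.

```r
set.seed(1)
datalist <- lapply(1:3, function(i) {
  variable <- rnorm(30)
  data.frame(variable = variable, response = 2 * variable + rnorm(30))
})

# Correlation between fitted and observed values, one number per data frame
correlations <- sapply(datalist, function(df) {
  model <- lm(response ~ variable, data = df)
  cor(fitted(model), df$response)
})

results <- data.frame(Subset = seq_along(datalist), Correlation = correlations)
results
```

Since the result is sized by the list itself, there is no hard-coded row count to keep in sync with the number of data frames.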

Resources