How to address several objects named in the same structure in R? - r

I'd like to know how to include every object that fulfils certain naming requirements in my arguments in R. Let's say the objects are all called something like this
var01 var02 var03 var04 varnn
What I would do in Stata for instance would be simply this
tab1 var*
and it would tabulate every variable with the first 3 letters "var".
In an earlier version of this post I was quite vague about what I actually want to do in my R project, so here goes. I've got a for loop that iterates over 650 instances, with the goal of appending 6 datasets for each of these instances. However, for some instances (I don't know which), not all 6 datasets exist, which is why the rbind command, written like this, fails:
rbind(data01, data02, data03, data04, data05, data06)
I'd therefore like to run something like this
rbind(data*)
So as to account for missing datasets.
Sorry for the confusion, I wasn't being clear enough when I originally wrote the question.
Just for reference, here is the whole loop:
for(i in 1:650){
try(part1 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/09July/",MP.ID[i],".csv")))
try(part2 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/08July/",MP.ID[i],".csv")))
try(part3 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/16July/",MP.ID[i],".csv")))
try(part4 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/17July/",MP.ID[i],".csv")))
try(part5 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/24July/",MP.ID[i],".csv")))
try(part6 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/24July/",MP.ID[i],".csv")))
allParts <- ls(pattern = "^part*")
allNames <- paste(allParts, collapse = ", ") # this is just what I tried just now, didn't work though
combined.df <- rbind(ALL THE DATASETS WITH PART))
}

Data
var01 <- sample(2, 10, TRUE)
var02 <- sample(2, 10, TRUE)
var03 <- sample(2, 10, TRUE)
vvv01 <- sample(2, 10, TRUE) # variable which should not be tabulated
Code
allV <- ls(pattern = "^var.*") # vector of all variables starting with 'var'
lapply(allV, function(.) table(get(.)))
Explanation
With ls you get all variables which are named according to the pattern you provide. Then, you loop over all these variables, retrieve the variable by its name and tabulate it.
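As a variation on the same idea, base R's mget() fetches several objects by name in one call and returns them as a named list, so each table stays labelled with its variable name. A minimal sketch, using small made-up vectors in place of the random ones above (the object names objs and tabs are only illustrative):

```r
var01 <- c(1, 2, 2, 1)
var02 <- c(2, 2, 2, 1)
vvv01 <- c(1, 1, 1, 1)   # should not be picked up

# names of all objects starting with "var"
allV <- ls(pattern = "^var")

# mget() returns a named list holding the objects themselves
objs <- mget(allV)

# one table per variable, named after the variable
tabs <- lapply(objs, table)
names(tabs)
```

Because the result is a named list, a single table can be pulled out afterwards with tabs[["var01"]].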
Update
With your recent changes what I would do is the following:
allV <- lapply(ls(pattern = "^part.*"), get) #stores all part variables in a list
combined.df <- do.call(rbind, allV) # rbinds all of them
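One thing to watch out for when this runs inside the loop: part objects created in an earlier iteration stay in the workspace, so if a file is missing for the current MP, ls() would still find the part left over from the previous one unless it is removed first. A possible way around this is to skip the intermediate part objects entirely and collect only the files that exist. The sketch below is an assumption about how that could look; the helper read_existing and the file names are made up, and it writes toy files to a temporary directory so it can run on its own:

```r
# Hypothetical helper: read whichever of the given CSV files exist and
# stack them into one data frame (NULL if none of them exist).
read_existing <- function(paths) {
  paths <- paths[file.exists(paths)]        # skip missing files
  if (length(paths) == 0) return(NULL)
  parts <- lapply(paths, read.csv)          # one data frame per file
  do.call(rbind, parts)                     # assumes identical columns
}

# Toy demonstration in a temporary directory (made-up file names)
tmp <- tempdir()
write.csv(data.frame(id = 1:2, text = c("a", "b")),
          file.path(tmp, "scrape1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3, text = "c"),
          file.path(tmp, "scrape3.csv"), row.names = FALSE)

paths <- file.path(tmp, c("scrape1.csv", "scrape2.csv", "scrape3.csv"))
combined.df <- read_existing(paths)   # scrape2.csv is missing and is skipped
nrow(combined.df)
```

In the original loop, paths would be the six paste0(...) file names built for MP.ID[i].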

Related

Issues with correlation function in R

I am writing a function that takes a directory and a threshold (default of 0) for complete cases as arguments and then performs the cor function for two of the columns. The function runs, kind of. However, whenever I run it it seems like it is skipping everything I have in the function and returns numeric(0). I can't figure out what is causing it to return that. I ran each line through the console and made sure each line worked on its own. But once it comes time to put it all together it seems like it all falls apart. Below is what I'm working with.
corr <- function(directory, threshold =0){
files <- list.files(path = directory, pattern = "*.csv", full.names = T)
cr <- numeric()
for(i in seq(files)){
data <- read.csv(files[i], )
cleandata <- complete.cases
data <- data[cleandata, ]
if (nrow(data) > threshold){
correlation <- cor(data$sulfate,data$nitrate)
cr <- append(cr, correlation)
}
}
return(cr)
}
Edit : removed last part of question. After re-reading I realized that I would not get what I wanted from return(files) even if it worked correctly. I should also note that adding the return(files) changed my output from numeric(0) to character(0)
How about something like this? You put all the files in a data frame and then map out the correlations. As per @r2evans's comment, this adds an NA when the threshold condition is not met, and as per @Limey's comment, this is more efficient than a for loop with append. The benefit here is that each correlation stays alongside its file name, so you can see which file has which correlation.
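library(tidyverse)
corr <- function(directory, threshold = 0){
tibble(file = list.files(path = directory,
pattern = "\\.csv$", # pattern is a regular expression, not a glob
full.names = TRUE)) |>
mutate(data = map(file, read_csv),
comp = map(data, ~ .x[complete.cases(.x), ]), # keep complete rows only
nrw = map_dbl(comp, nrow), # number of complete rows per file
# correlate only when enough complete rows remain, otherwise NA
cr = map2_dbl(comp, nrw,
~ if (.y > threshold) cor(.x$sulfate, .x$nitrate) else NA_real_))
}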
corr_df <- corr(directory = "user/AndS./stack_questions")
corr_vec <- pull(corr_df, cr)
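If you would rather stay in base R, roughly the same idea can be written with vapply(). This is only a sketch under a few assumptions: the helper name corr_base is made up, every file is assumed to have sulfate and nitrate columns, and the toy CSV files are written to a temporary directory purely so the example runs on its own:

```r
# Hypothetical base-R helper: one correlation per CSV file, NA when the
# file does not have more complete rows than the threshold.
corr_base <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  vapply(files, function(f) {
    d <- read.csv(f)
    d <- d[complete.cases(d), ]             # keep fully observed rows only
    if (nrow(d) > threshold) cor(d$sulfate, d$nitrate) else NA_real_
  }, numeric(1))
}

# Toy data in a temporary directory (values invented for illustration)
dir_demo <- file.path(tempdir(), "corr_demo")
dir.create(dir_demo, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 6, 1)),
          file.path(dir_demo, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(NA, 1), nitrate = c(1, NA)),
          file.path(dir_demo, "002.csv"), row.names = FALSE)

res <- corr_base(dir_demo, threshold = 2)
res
```

The result is a numeric vector named by file path, so the link between file and correlation is kept here as well.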

Looping over lists, extracting certain elements and delete the list?

I am trying to create an efficient code that opens data files containing a list, extracts one element within the list, stores it in a data frame and then deletes this object before opening the next one.
My idea is to do this with loops. Unfortunately, I am quite new to loops and don't know how to write the code.
I have managed to open the data-sets using the following code:
for(i in 1995:2015){
objects = paste("C:/Users/...",i,"agg.rda", sep=" ")
load(objects)
}
The problem is that each data-set is extremely large and R cannot open all of them at once. Therefore, I am now trying to extract an element within each list called: tab_<<i value >>_agg[["A"]] (for example tab_1995_agg[["A"]]), then delete the object and iterate over each i (which are different years).
I have tried using the following code but it does not work
for(i in unique(1995:2015)){
objects = paste("C:/Users/...",i,"agg.rda", sep=" ")
load(objects)
tmp = cat("tab",i,"_agg[[\"A\"]]" , sep = "")
y <- rbind(y, tmp)
rm(list=objects)
}
I apologize for any silly mistake (or question) and greatly appreciate any help.
Here’s a possible solution using a function to rename the object you’re loading in. I got loadRData from here. The loadRData function makes this a bit more approachable because you can load in the object with a different name.
Create some data for a reproducible example.
tab2000_agg <-
list(
A = 1:5,
b = 6:10
)
tab2001_agg <-
list(
A = 1:5,
d = 6:10
)
save(tab2000_agg, file = "2000_agg.rda")
save(tab2001_agg, file = "2001_agg.rda")
rm(tab2000_agg, tab2001_agg)
Using your loop idea.
loadRData <- function(fileName){
load(fileName)
get(ls()[ls() != "fileName"])
}
y <- list()
for(i in 2000:2001){
objects <- paste("", i, "_agg.rda", sep="")
data_list <- loadRData(objects)
tmp <- data_list[["A"]]
y[[as.character(i)]] <- tmp # index by name so the list is not padded with NULL entries up to position i
rm(data_list)
}
y <- do.call(rbind, y)
You could also turn it into a function rather than use a loop.
getElementA <- function(year){ # named getElementA so that base::getElement is not masked
objects <- paste0(year, "_agg.rda")
data_list <- loadRData(objects)
tmp <- data_list[["A"]]
return(tmp)
}
y <- lapply(2000:2001, getElementA)
y <- do.call(rbind, y)
Created on 2022-01-14 by the reprex package (v2.0.1)
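A slightly different way to write the loading helper, in case it is useful: load() accepts an envir argument and returns the names of the objects it restored, so the file can be loaded into a throwaway environment instead of relying on ls() inside the function. The sketch below is an assumption rather than the original loadRData; the name load_single is made up, it assumes each file holds exactly one object, and it saves a toy file to a temporary directory so it runs on its own:

```r
# Hypothetical variant of loadRData(): load into a fresh environment so the
# loaded object never touches the global workspace.
load_single <- function(file_name) {
  env <- new.env()
  obj_names <- load(file_name, envir = env)  # load() returns the object names
  stopifnot(length(obj_names) == 1)          # assumes one object per file
  env[[obj_names]]
}

# Toy file in a temporary directory (made-up content)
tab2000_agg <- list(A = 1:5, b = 6:10)
demo_file <- file.path(tempdir(), "2000_agg.rda")
save(tab2000_agg, file = demo_file)
rm(tab2000_agg)

element_A <- load_single(demo_file)[["A"]]
element_A
```

Since the environment is local to the function, the large object becomes eligible for garbage collection as soon as the wanted element has been extracted.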

use name of dataframe on a list of dataframes

I am trying to solve a problem from a question I posted previously, "looping inside list in r".
Is there a way to get the name of a data frame that is in a list of data frames?
I have put a series of data frames in a list, and I want to apply myfunction to each of them. But I do not know how to get the name of each data frame in order to pass it to the nameofprocesseddf argument of myfunction.
Here is how I build the list of data frames, along with the code I have so far. Any suggestions on how I can make this work?
library(missForest)
library(dplyr)
myfunction <- function (originaldf, proceseddf, nonproceseddf, nameofprocesseddf=character){
NRMSE <- nrmse(proceseddf, nonproceseddf, originaldf)
comment(nameofprocesseddf) <- nameofprocesseddf
results <- as.data.frame(list(comment(nameofprocesseddf), NRMSE))
names(results) <- c("Dataset", "NRMSE")
return(results)
}
a <- data.frame(value = rnorm(100), cat = c(rep(1,50), rep(2,50)))
da1 <- data.frame(value = rnorm(100,4), cat2 = c(rep(2,50), rep(3,50)))
dataframes <- dir(pattern = ".txt")
list_dataframes <- llply(dataframes, read.table, header = T, dec=".", sep=",")
n <- length(dataframes)
# Here is where I do not know how to get the name of the `i` dataframe
for (i in 1:n){
modified_list <- llply(list_dataframes, myfunction, originaldf = a, nonproceseddf = da1, proceseddf = list_dataframes[i], nameof processeddf= names(list_dataframes[i]))
write.table(file = sprintf("myfile/%s_NRMSE20%02d.txt", dataframes[i]), modified_list[[i]], row.names = F, sep=",")
}
As a matter of fact, the name of a data frame is not an attribute of the data frame. It's just an expression used to refer to the object. Hence the "name" of the data frame here is literally 'list_dataframes[i]'.
Since I assume you want to name each data frame after its text file, without the extension, I suggest something like the following (it requires the stringr library):
nameofprocesseddf = substr(dataframes[i],start = 1,stop = str_length(dataframes[i])-4)
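Another option, which does not need stringr, is to attach the file names to the list itself and then loop over names() instead of over positions. The sketch below uses tools::file_path_sans_ext() from the tools package that ships with R; the file names and the toy data frames are made up to stand in for the output of dir() and read.table():

```r
# Made-up file names standing in for dir(pattern = ".txt")
dataframes <- c("imputed_knn.txt", "imputed_mean.txt")

# Toy data frames standing in for the contents of those files
list_dataframes <- list(data.frame(value = 1:3), data.frame(value = 4:6))

# Name each list element after its file, minus the extension
names(list_dataframes) <- tools::file_path_sans_ext(dataframes)

# The name is now available alongside each data frame
for (nm in names(list_dataframes)) {
  df <- list_dataframes[[nm]]
  cat(nm, "has", nrow(df), "rows\n")
}
```

Inside such a loop, nm could be passed as nameofprocesseddf and df as proceseddf.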

Deleting rows in a sequence for MULTIPLE lists in R

I know how to delete rows in a sequence for a SINGLE list:
data <- data.table('A' = c(1,2,3,4), 'B' = c(900,6,'NA',2))
row.remove <- data[!(data$A = seq(from=1,to=4,by=2) )]
However, I would like to know how to do so with MULTIPLE lists.
Code I've tried:
file.number <- c(1:5)
data <- setNames(lapply(paste(file.number,".csv"), read.csv, paste(file.number)) # this line imports the lists from csv files - works
data.2 <- lapply(data, data.table) # seems to work
row.remove <- lapply(data.2, function(x) x[!(data.2$A = seq(from=1,to=4,by=2)) # no error message, but deletes all the rows
I feel like I'm missing something obvious, any help will be greatly appreciated.
Solution:
for (i in 1:5){
file.number = i
data <- read.csv(paste0(file.number, ".csv")) # read one file per iteration
data <- as.data.table(data)
row.remove <- data[!(data$A %in% seq(from=1,to=4,by=2))] # %in% tests membership; = would assign
}
Instead of analyzing all the lists simultaneously, this analyzes them one by one. It's not a full solution, more of a workaround.
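For what it's worth, the lapply attempt in the question seems to fail for two reasons: the function body refers to the whole list (data.2$A) rather than to the element x, and it uses = where a membership test such as %in% is needed. A minimal sketch of the list-wise version, written with plain data frames and invented values so it runs without any packages (with data.table objects the same filter can be written as x[!(A %in% rows_to_drop)]):

```r
# Toy list of data frames standing in for the imported CSV files
data <- list(
  "1" = data.frame(A = 1:4, B = c(900, 6, NA, 2)),
  "2" = data.frame(A = 1:4, B = c(5, 7, 9, 11))
)

rows_to_drop <- seq(from = 1, to = 4, by = 2)   # 1 and 3

# Refer to the element `x`, not the whole list, and test membership with
# %in% rather than assigning with `=`
row.remove <- lapply(data, function(x) x[!(x$A %in% rows_to_drop), ])

row.remove[["1"]]$A
```

The result is again a named list, with one filtered data frame per file.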

How to create a vector or list of tables in R

I'm trying to scrape a bunch of tables from a website. I would like to be able to store them all in one or more variables - basically for easy access.
The below code is what I have so far, I'm using the XML Package which I have found works well on a single table but can't get this to work for more than one table.
i <- 1
N <- 3
DSFL1<- 'http://website/results/2012_aussies_thu/results/'
DSFL2 <- '.html'
SportHTML <- vector(length=N)
vectorOfTables <- vector(length=N)
for ( i in i:N) {
DSVL <- i
SportHTML[i] <- paste(DSFL1,DSVL,DSFL2, sep="")
Sport.table <- readHTMLTable(SportHTML[i], header=T, which=3,stringsAsFactors=F)
vectorOfTables[1] <- Sport.table
i <- i + 1
}
Any help would be appreciated.
Your tables being objects of length > 1 (and possibly differing lengths), they must go into a list. So you should do:
vectorOfTables <- vector(mode = "list", length = N)
and when you assign inside the loop, do:
vectorOfTables[[i]] <- Sport.table
However, you can avoid a for loop and create a list using lapply:
SportHTML <- paste0(DSFL1, 1:N, DSFL2)
ListOfTables <- lapply(SportHTML, readHTMLTable, header = TRUE,
which = 3, stringsAsFactors = FALSE)
and it is also a lot more concise as you can see.
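Once the tables are in a list, it can also help to name the elements so that a table can be looked up by page rather than by position. A small sketch of that, where fake_read_table is a made-up stand-in for readHTMLTable() so the example runs without network access:

```r
N <- 3

# Hypothetical stand-in for readHTMLTable(): returns a small data frame
# whose number of rows differs from page to page
fake_read_table <- function(page) {
  data.frame(page = page, rank = seq_len(page))
}

ListOfTables <- lapply(1:N, fake_read_table)
names(ListOfTables) <- paste0("page", 1:N)   # access by name later on

ListOfTables[["page2"]]                  # a single table
rows_per_table <- sapply(ListOfTables, nrow)
rows_per_table                           # number of rows in each table
```

Tables of different sizes sit side by side in the list without any problem, which is exactly what an atomic vector cannot do.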

Resources