R: Looping through dataframes and subsetting - r

I have a number of dataframes (imported from CSV) that have the same structure. I would like to loop through all these dataframes and keep only two of these columns.
The loop below does not seem to work, any ideas why? Would ideally like to do this using a loop as I am trying to get better at using these.
frames <- ls()
for (frame in frames){
frame <- subset(frame, select = c("Col_A","Col_B"))
}
Cheers in advance for any advice.

For anyone interested I used Richard Scriven's idea of reading in the dataframes as one object, with a function added that showed where the file had been imported from. This allowed me to then use the Plyr package to manipulate the data:
library(plyr)
dataframes <- list.files(path = TEESMDIR, full.names = TRUE)
## Define a function to add the filename to the dataframe
read_csv_filename <- function(filename){
ret <- read.csv(filename)
ret$Source <- filename #EDIT
ret
}
list_dataframes <- ldply(dataframes, read_csv_filename)
selection <- llply(list_dataframes, subset, select = c(var1,var3))

The basic problem is that ls() returns a character vector of all the names of the objects in your environment, not the objects themselves. To get and replace an object using a character variable containing it's name, you can use the get()/assign() functions. You could re-write your function as
frames <- ls()
for (frame in frames){
assign(frame, subset(get(frame), select = c("Col_A","Col_B")))
}

Related

How to iteratively name dataset

Here is an example of my code:
library(Rcpp)
library(readxl)
Sheets<-readxl::excel_sheets("~/data.xlsx")
sheet_names <- sheets[grepl("String", sheets, ignore.case=TRUE)
for (i in 1:length(sheet_names)){
dataset[i] <- read_excel("data.xlsx", sheet = sheet_names[i])
}
What I would like this to do is to return a dataset named "dataset1" where i=1 and "dataset2" where i=2 and so on. Alternatively, I would like to use the name of the sheet itself i.e. sheet_names[i] but when attempting to use that it overwrites the strings in the variable.
I would be grateful for any suggestions on this.
Consider storing the data in a list instead of creating multiple dataframes in global environment. These objects are difficult to manage and they pollute the global environment.
library(readxl)
Sheets <- excel_sheets("~/data.xlsx")
sheet_names <- grep("String", sheets, ignore.case=TRUE, value = TRUE)
list_data <- lapply(sheet_names, function(x) read_excel("data.xlsx", sheet = x))
list_data is a list of dataframes. If you need to access individual dataframes you can use list_data[[1]] to get the 1st dataframe, list_data[[2]] to get the 2nd one and so on.

Looping through similar dataframes to apply changes using for

I have dataframes in which one column has to suffer a modification, handling correctly NAs, characters and digits. Dataframes have similar names, and the column of interest is shared.
I made a for loop to change every row of the column of interest correctly. However I had to create an intermediary object "df" in order to accomplish that.
Is that necessary? or the original dataframes can be modified directly.
sheet1 <- read.table(text="
data
15448
something_else
15334
14477", header=TRUE, stringsAsFactors=FALSE)
sheet2 <- read.table(text="
data
16448
NA
16477", header=TRUE, stringsAsFactors=FALSE)
sheets<-ls()[grep("sheet",ls())]
for(i in 1:length(sheets) ) {
df<-NULL
df<-eval(parse(text = paste0("sheet",i) ))
for (y in 1:length(df$data) ){
if(!is.na(as.integer(df$data[y])))
{
df[["data"]][y]<-as.character(as.Date(as.integer(df$data[y]), origin = "1899-12-30"))
}
}
assign(eval(as.character(paste0("sheet",i))),df)
}
As #d.b. mentions, consider interacting on a list of dataframes especially if similarly structured since you can run same operations using apply procedures plus you save on managing many objects in global environment. Also, consider using the vectorized ifelse to update column.
And if ever you really need separate dataframe objects use list2env to convert each element to separate object. Below wraps as.* functions with suppressWarnings since you do want to return NA.
sheetList <- mget(ls(pattern = "sheet[0-9]"))
sheetList <- lapply(sheetList, function(df) {
df$data <- ifelse(is.na(suppressWarnings(as.integer(df$data))), df$data,
as.character(suppressWarnings(as.Date(as.integer(df$data),
origin = "1899-12-30"))))
return(df)
})
list2env(sheetList, envir=.GlobalEnv)

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each column. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names--I just want extract the entry in a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind..not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again, and use a cbind, so I can substract two columns from each other.
I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.

R - New variables over several data frames in a loop

In R, I have several datasets, and I want to use a loop to create new variables (columns) within each of them:
All dataframes have the same name structure, so that is what I am using to loop through them. Here is some pseudo-code with what I want to do
Name = Dataframe_1 #Assume the for-loop goes from Dataframe_1 to _10 (loop not shown)
#Pseudo-code
eval(as.name(Name))$NewVariable <- c("SomeString") #This is what I would like to do, but I get an error ("could not find function eval<-")
As a result, I should have the same dataframe with one extra column (NewVariable), where all rows have the value "SomeString".
If I use eval(as.name(Name)) I can call up the dataframe Name with no problem, but none of the usual data frame operators seem to work with that particular call (not <- assignment, or $ or [[]])
Any ideas would be appreciated, thanks in advance!
We can place the datasets in a list and create a new column by looping over the list with lapply. If needed, the original dataframe objects can be updated with list2env.
lst <- mget(paste0('Dataframe_', 1:10))
lst1 <- lapply(lst, transform, NewVariable = "SomeString")
list2env(lst1, envir = .GlobalEnv())
Or another option is with assign
nm1 <- ls(pattern = "^Dataframe_\\d+")
nm2 <- rep("NewVariable", length(nm1))
for(j in seq_along(nm1)){
assign(nm1[j], `[<-`(get(nm1[j]), nm2[j], value = "SomeString"))
}

Using for loops to match pairs of data frames in R

Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a ‘for loop’ that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt example loops for this particular use.
My data frames end with either “_df1.csv” or ‘_df2.csv”. Each pair, that I wish to merge into an output data frame, has an identical number at the being of the file name (i.e. 543_df1.csv and 543_df2.csv).
I have created a character string for each of the two types of file in my directory using the list.files command as below:
df1files <- list.files(path="~/Desktop/combined files” pattern="*_df1.csv", full.names=T, recursive=FALSE)
df2files <- list.files(path="="~/Desktop/combined files ", pattern="*_df2.csv", full.names=T, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop starting with something along the following lines, to prevent me from having to manually merge the pairs:
for(i in 1:length(df2files)){ ……
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
}
# apply that merge command to the list of lists
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames)

Resources