Apologies if this has been asked before. It's at the limit of my understanding of R, so I'm not even sure of the correct language in which to couch the query (hence, my inability to identify duplicate questions).
In my environment, I have an unknown number of objects (dataframes), each of which has an unknown number of columns that have meaningful names but with nonsense endings, which make it hard to reference them. The meaningful parts of the column names are usually followed by a double period and some further text. I want to automate finding and removing the meaningless suffixes. All the objects I want to modify have ".dat" in their names. Here's my attempt at an example:
# create some objects in my environment
a <- "a string, not of interest to me"
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)
# find the dataframes that I want to manipulate
dfs <- ls(pattern = ".dat")
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# and, the bit that doesn't work: change the problematic column names to their shorter alternatives
names(get(df))[problem.cols] <<- parts
return(0)
})
If I run this line by line, it does everything I want, up to and including names(get(df))[problem.cols], which it knows are the names of the columns in the dataframe I'm trying to alter. However, it won't assign the altered names to that, yielding the error message: Error in get(*tmp*) : invalid first argument.
I'm open to alternative approaches to achieve my desired end-point. However, I'm also intrigued by why this doesn't work and how, more generally, it's possible to alter an object referenced using "get()". Thanks in advance for any advice - and apologies if this is so naive it's been a waste of your time just reading it.
FWIW, I can see the similarity to this question but I can't adapt the answer to my needs.
Actually, I eventually made the link to using the "assign" function. This seems to work (so I've posted it here, in case it helps anyone else) - but I'd still be interested in alternative solutions:
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# change the problematic column names to their shorter alternatives
nms[problem.cols] <- parts
names(dat) <- nms
assign(df, dat, envir = .GlobalEnv)
return(0)
})
Related
I have a problem with cleaning up my code. I understand I could type this all out but we don't want that obviously.
I have only dataframes in my global environment. They are all "data.frame".
I want to check the dimensions of all of them and put that in a tibble. I managed that somehow. I also would like to change their colnames() tolower() which works easy if I just type the name of the data.frame, but there's more than 2 and I want it done automatically. Then I also want to mutate all data.frames in the same way.
Small example of my code:
library(tidyverse)
x <- data.frame(letters[1:2]) #To create the data
y <- data.frame(letters[3:4])
dfs <- as.list(ls()) #I take whatever is in my environment
I managed below to get a tibble of the dimensions:
z <- as_tibble(lapply(seq_along(dfs),
function(j) dim(get(dfs[[j]]))), .name_repair = "unique")
colnames(z) <- dfs
Now for the colnames of all the data.frames stored in my list I basically want to perform this code:
colnames(dfs[[1]]) <- tolower(colnames(dfs[[1]])
but that returns NULL as I found out earlier. So I used get() in there to make it work for the dimensions. But if I use get() to assign colnames it says it can't find function "get<-".
Since all colnames for all dataframes are the same (just different nrows()) I could save the lowercase colnames as value and use that, but that doesn't take away that it cant find the get<- function.
names <- tolower(colnames(x))
sapply(seq_along(dfs),
function(j) colnames(get(dfs[[j]])) <- names)
*Error in colnames(get(dfs[[j]])) <- names :
could not find function "get<-"*
as for the mutating part I tried a for loop:
for(i in seq_along(dfs)){
get(dfs[[i]]) <- get(dfs[[i]]) %>% mutate(cd = ab)
}
But it's the same issue.
Could anyone help clearing this problem for me? (and if a cleaner code for the dimensions is available that would be highly appreciated)
I am just trying to up my coding skills. I would have been long done if I just typed it all out but that defeats the purpose.
Thanks!
-JK
Using base R
lapply(dfs, function(x) transform(setNames(x, tolower(names(x))), X = c('a', 'b')))
I'm using a data.frame that contains many data.frames. I'm trying to access these sub-data.frames within a loop. Within these loops, the names of the sub-data.frames are contained in a string variable. Since this is a string, I can use the [,] notation to extract data from these sub-data.frames. e.g. X <- "sub.df"and then df[42,X] would output the same as df$sub.df[42].
I'm trying to create a single row data.frame to replace a row within the sub-data.frames. (I'm doing this repeatedly and that's why my sub-data.frame name is in a string). However, I'm having trouble inserting this new data into these sub-data.frames. Here is a MWE:
#Set up the data.frames and sub-data.frames
sub.frame <- data.frame(X=1:10,Y=11:20)
df <- data.frame(A=21:30)
df$Z <- sub.frame
Col.Var <- "Z"
#Create a row to insert
new.data.frame <- data.frame(X=40,Y=50)
#This works:
df$Z[3,] <- new.data.frame
#These don't (even though both sides of the assignment give the correct values/dimensions):
df[,Col.Var][6,] <- new.data.frame #Gives Warning and collapses df$Z to 1 dimension
df[7,Col.Var] <- new.data.frame #Gives Warning and only uses first value in both places
#This works, but is a work-around and feels very inelegant(!)
eval(parse(text=paste0("df$",Col.Var,"[8,] <- new.data.frame")))
Are there any better ways to do this kind of insertion? Given my experience with R, I feel like this should be easy, but I can't quite figure it out.
I have a list of 185 data frames called WaFramesNumeric. Each dataframe has several hundred columns and thousands of rows. I want to edit every data frame, so that it leaves all numeric columns as well as any non-numeric columns that I specify.
Using:
for(i in seq_along(WaFramesNumeric)) {
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
}
successfully makes each dataframe contain only its numeric columns.
I've tried to amend this with lines to add specific columns. I have tried:
for (i in seq_along(WaFramesNumeric)) {
a <- WaFramesNumeric[[i]]$Device_Name
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][,sapply(WaFramesNumeric[[i]],is.numeric)]
cbind(WaFramesNumeric[[i]],a)
}
and in an attempt to call the column numbers of all integer columns as well as the specific ones and then combine based on that:
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]],is.numeric))
m <- match("Cost_Center",colnames(WaFramesNumeric[[i]]))
n <- match("Device_Name",colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]][,i,combine]
}
These all return errors and I am stumped as to how I could do this. WaFramesNumeric is a copy of another list of dataframes (WaFramesNumeric <- WaFramesAll) and so I also tried adding the specific columns from the WaFramesAll but this was not successful.
I appreciate any advice you can give and I apologize if any of this is unclear.
You are mistakenly assuming that the last commmand in a for loop is meaningful. It is not. In fact, it is being discarded, so since you never assigned it anywhere (the cbind and the indexing of WaFramesNumeric...), it is silently discarded.
Additionally, you are over-indexing your data.frame in the third code block. First, it's using i within the data.frame, even though i is an index within the list of data.frames, not the frame itself. Second (perhaps caused by this), you are trying to index three dimensions of a 2D frame. Just change the last indexing from [,i,combine] to either [,combine] or [combine].
Third problem (though perhaps not seen yet) is that match will return NA if nothing is found. Indexing a frame with an NA returns an error (try mtcars[,NA] to see). I suggest that you can replace match with grep: it returns integer(0) when nothing is found, which is what you want in this case.
for (i in seq_along(WaFramesNumeric)) {
f <- which(sapply(WaFramesNumeric[[i]], is.numeric))
m <- grep("Cost_Center", colnames(WaFramesNumeric[[i]]))
n <- grep("Device_Name", colnames(WaFramesNumeric[[i]]))
combine <- c(f,m,n)
WaFramesNumeric[[i]] <- WaFramesNumeric[[i]][combine]
}
I'm not sure what you mean by "an attempt to call the column numbers of all integer columns...", but in case you want to go through a list of data frames and select some columns based on some function and keep given a column name you can do like this:
df <- data.frame(a=rnorm(20), b=rnorm(20), c=letters[1:20], d=letters[1:20], stringsAsFactors = FALSE)
WaFramesNumeric <- rep(list(df), 2)
Selector <- function(data, select_func, select_names) {
select_func <- match.fun(select_func)
idx_names <- match(select_names, colnames(data))
idx_names <- idx_names[!is.na(idx_names)]
idx_func <- which(sapply(data, select_func))
idx <- unique(c(idx_func, idx_names))
return(data[, idx])
}
res <- lapply(X = WaFramesNumeric, FUN = Selector, select_names=c("c"), select_func = is.numeric)
I would like to add a column containing the year (found in the file name) to each column. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names--I just want extract the entry in a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind..not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again, and use a cbind, so I can substract two columns from each other.
I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.
Suppose I have following data frame:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
I want to get following data frame at the end:
mydataframe[-which(is.na(mydataframe$ID)),]
I need to do this kind of cleaning (and other similar manipulations) with many other data frames. So, I decided to assign a name to mydataframe, and variable of interest.
dbname <- "mydataframe"
varname <- "ID"
attach(get(dbname))
I get an error in the following line, understandably.
get(dbname) <- get(dbname)[-which(is.na(get(varname))),]
detach(get(dbname))
How can I solve this? (I don't want to assign to a new data frame, even though it seems only solution right now. I will use "dbname" many times afterwards.)
Thanks in advance.
There is no get<- function, and there is no get(colname) function (since colnames are not first class objects), but there is an assign() function:
assign(dbname, get(dbname)[!is.na( get(dbname)[varname] ), ] )
You also do not want to use -which(.). It would have worked here since there were some matches to the condition. It will bite you, however, whenever there are not any rows that match and instead of returning nothing as it should, it will return everything, since vec[numeric(0)] == vec. Only use which for "positive" choices.
As #Dason suggests, lists are made for this sort of work.
E.g.:
# make a list with all your data.frames in it
# (just repeating the one data.frame 3x for this example)
alldfs <- list(mydataframe,mydataframe,mydataframe)
# apply your function to all the data.frames in the list
# have replaced original function in line with #DWin and #flodel's comments
# pointing out issues with using -which(...)
lapply(alldfs, function(x) x[!is.na(x$ID),])
The suggestion to use a list of data frames is good, but I think people are assuming that you're in a situation where all the data frames are loaded simultaneously. This might not necessarily be the case, eg if you're working on a number of projects and just want some boilerplate code to use in all of them.
Something like this should fit the bill.
stripNAs <- function(df, var) df[!is.na(df[[var]]), ]
mydataframe <- stripNAs(mydataframe, "ID")
cars <- stripNAs(cars, "speed")
I can totally understand your need for this, since I also frequently need to cycle through a set of data frames. I believe the following code should help you out:
mydataframe <- data.frame(ID=c(1,2,NA,4,5,NA),score=11:16)
#define target dataframe and varname
dbname <- "mydataframe"
varname <- "ID"
tmp.df <- get(dbname) #get df and give it a temporary name
col.focus <- which(colnames(tmp.df) == varname) #define the column of focus
tmp.df <- tmp.df[which(!is.na(tmp.df[,col.focus])),] #cut out the subset of the df where the column of focus is not NA.
#Result
ID score
1 1 11
2 2 12
4 4 14
5 5 15