I would like to add a column containing the year (found in the file name) to the data from each file. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "\\.shp$")  # pattern is a regular expression, not a glob
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names; I just want to extract the entry for a column from the file name. This thread asks the same question, but has no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind, but I'm not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this runs without error:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='^points_'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again, and use a cbind, so I can subtract two columns from each other.
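One way to keep the per-year objects separate while still adding the year is to collect them in a named list and store the result of the loop, since lapply and Map return modified copies and leave the original objects untouched (which is why points_2014.shp$Year stayed NULL above). A minimal sketch, assuming plain data frames as stand-ins for the shapefile objects and a hypothetical jobs column:

```r
# Stand-ins for the objects created by assign(); the real ones are spatial
# points objects, and the id/jobs columns are hypothetical.
points_2013.shp <- data.frame(id = 1:3, jobs = c(10, 20, 30))
points_2014.shp <- data.frame(id = 1:3, jobs = c(12, 18, 33))

nms <- ls(pattern = "^points_[0-9]{4}")   # object names, e.g. "points_2013.shp"
pts <- mget(nms)                          # named list of the objects

# Add the year taken from characters 8 to 11 of each name, and keep the result
pts <- Map(function(d, nm) {
  d$year <- substr(nm, 8, 11)
  d
}, pts, nms)

# With the years side by side in one list, columns can be subtracted directly
# (this assumes the rows are in the same order in both years)
change <- pts[["points_2014.shp"]]$jobs - pts[["points_2013.shp"]]$jobs
```

The original points_2014.shp object is unchanged here; list2env(pts, envir = .GlobalEnv) would write the modified copies back if separate objects are needed.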
I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.
Apologies if this has been asked before. It's at the limit of my understanding of R, so I'm not even sure of the correct language in which to couch the query (hence, my inability to identify duplicate questions).
In my environment, I have an unknown number of objects (dataframes), each of which has an unknown number of columns that have meaningful names but with nonsense endings, which make it hard to reference them. The meaningful parts of the column names are usually followed by a double period and some further text. I want to automate finding and removing the meaningless suffixes. All the objects I want to modify have ".dat" in their names. Here's my attempt at an example:
# create some objects in my environment
a <- "a string, not of interest to me"
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)
# find the dataframes that I want to manipulate
dfs <- ls(pattern = "\\.dat$")  # escape the dot so it matches a literal "."
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# and, the bit that doesn't work: change the problematic column names to their shorter alternatives
names(get(df))[problem.cols] <<- parts
return(0)
})
If I run this line by line, it does everything I want, up to and including names(get(df))[problem.cols], which it knows are the names of the columns in the dataframe I'm trying to alter. However, it won't assign the altered names to that, yielding the error message: Error in get(*tmp*) : invalid first argument.
I'm open to alternative approaches to achieve my desired end-point. However, I'm also intrigued by why this doesn't work and how, more generally, it's possible to alter an object referenced using "get()". Thanks in advance for any advice - and apologies if this is so naive it's been a waste of your time just reading it.
FWIW, I can see the similarity to this question but I can't adapt the answer to my needs.
Actually, I eventually made the link to using the "assign" function. This seems to work (so I've posted it here, in case it helps anyone else) - but I'd still be interested in alternative solutions:
# loop through the objects in question, finding and changing the problematic column names
colrename <- lapply(dfs, function(df){
# get the relevant dataframe
dat <- get(df)
# find its column names
nms <- names(dat)
# find the column names with the problematic ".." suffixes
problem.cols <- grep("\\.\\.",nms)
# pull out the meaningful first parts of each problematic name
parts <- strsplit(nms[problem.cols],"\\.\\.")
parts <- sapply(parts, function(x) x[1])
# change the problematic column names to their shorter alternatives
nms[problem.cols] <- parts
names(dat) <- nms
assign(df, dat, envir = .GlobalEnv)
return(0)
})
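As for why the first attempt fails: a replacement call such as names(x) <- value needs a plain variable name as its target, so with get(df) on the left R ends up looking for a "get<-" function, which does not exist. An alternative to assign() is to pull the data frames into a named list, fix the names there, and write the list back. A minimal sketch under the same example objects as in the question:

```r
# Example objects from the question
a <- "a string, not of interest to me"
b.dat <- data.frame(col1 = 1:2, col2..gibberish = 3:4)
c.dat <- data.frame(col1..some.text = 5:6, col2 = 7:8)

# Collect every object whose name ends in ".dat" into a named list
dfs <- mget(ls(pattern = "\\.dat$"))

# Drop everything from the first ".." onwards in each column name
dfs <- lapply(dfs, function(dat) {
  names(dat) <- sub("\\.\\..*$", "", names(dat))
  dat
})

# Write the cleaned data frames back over the originals in the current
# environment (the global environment when run at top level)
list2env(dfs, envir = environment())
```

Keeping the data frames in the list and skipping the final list2env step is usually easier to work with if more processing follows.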
I have a problem with cleaning up my code. I understand I could type this all out but we don't want that obviously.
I have only dataframes in my global environment. They are all "data.frame".
I want to check the dimensions of all of them and put them in a tibble, which I managed somehow. I would also like to convert their colnames() with tolower(), which is easy if I just type the name of each data.frame, but there are more than 2 and I want it done automatically. Then I also want to mutate all the data.frames in the same way.
Small example of my code:
library(tidyverse)
x <- data.frame(letters[1:2]) #To create the data
y <- data.frame(letters[3:4])
dfs <- as.list(ls()) #I take whatever is in my environment
I managed below to get a tibble of the dimensions:
z <- as_tibble(lapply(seq_along(dfs),
function(j) dim(get(dfs[[j]]))), .name_repair = "unique")
colnames(z) <- dfs
Now for the colnames of all the data.frames stored in my list I basically want to perform this code:
colnames(dfs[[1]]) <- tolower(colnames(dfs[[1]]))
but that returns NULL as I found out earlier. So I used get() in there to make it work for the dimensions. But if I use get() to assign colnames it says it can't find function "get<-".
Since the colnames are the same for all the dataframes (only nrow() differs), I could save the lowercase colnames as a value and use that, but that doesn't change the fact that it can't find the get<- function.
names <- tolower(colnames(x))
sapply(seq_along(dfs),
function(j) colnames(get(dfs[[j]])) <- names)
Error in colnames(get(dfs[[j]])) <- names :
could not find function "get<-"
As for the mutating part, I tried a for loop:
for(i in seq_along(dfs)){
get(dfs[[i]]) <- get(dfs[[i]]) %>% mutate(cd = ab)
}
But it's the same issue.
Could anyone help clearing this problem for me? (and if a cleaner code for the dimensions is available that would be highly appreciated)
I am just trying to up my coding skills. I would have been long done if I just typed it all out but that defeats the purpose.
Thanks!
-JK
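Using base R, and assuming dfs holds the data frames themselves rather than their names (for example dfs <- mget(ls())):
dfs <- lapply(dfs, function(x) transform(setNames(x, tolower(names(x))), X = c('a', 'b')))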
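The three steps from the question (dimensions, lowercase names, a new column) can be sketched in base R by working on a named list of the data frames instead of a list of their names. This is only a sketch: the column name AB is an assumption, chosen so that the mutate(cd = ab) step from the question has something to copy.

```r
# Example data; the column name AB is hypothetical
x <- data.frame(AB = letters[1:2])
y <- data.frame(AB = letters[3:4])

# Named list of the data frames themselves (not just their names)
dfs <- mget(c("x", "y"))

# One row per data frame, with its number of rows and columns
dims <- t(sapply(dfs, dim))
colnames(dims) <- c("rows", "cols")

# Lowercase the column names and add the new column in every data frame
dfs <- lapply(dfs, function(d) {
  names(d) <- tolower(names(d))
  d$cd <- d$ab
  d
})

# Optional: overwrite the original objects with the modified copies
list2env(dfs, envir = environment())
```

The dims matrix can be passed to as.data.frame() or tibble::as_tibble() if a table is preferred.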
I have dataframes in which one column needs to be modified, handling NAs, characters and digits correctly. The dataframes have similar names and share the column of interest.
I wrote a for loop that changes every row of the column of interest correctly. However, I had to create an intermediate object "df" to accomplish that.
Is that necessary, or can the original dataframes be modified directly?
sheet1 <- read.table(text="
data
15448
something_else
15334
14477", header=TRUE, stringsAsFactors=FALSE)
sheet2 <- read.table(text="
data
16448
NA
16477", header=TRUE, stringsAsFactors=FALSE)
sheets<-ls()[grep("sheet",ls())]
for(i in 1:length(sheets) ) {
df<-NULL
df<-eval(parse(text = paste0("sheet",i) ))
for (y in 1:length(df$data) ){
if(!is.na(as.integer(df$data[y])))
{
df[["data"]][y]<-as.character(as.Date(as.integer(df$data[y]), origin = "1899-12-30"))
}
}
assign(eval(as.character(paste0("sheet",i))),df)
}
As @d.b. mentions, consider iterating over a list of dataframes, especially if they are similarly structured, since you can run the same operations with the apply family and you avoid managing many objects in the global environment. Also, consider using the vectorized ifelse to update the column.
And if you ever really need separate dataframe objects, use list2env to convert each list element to a separate object. The code below wraps the as.* functions in suppressWarnings, since returning NA for non-numeric entries is the intended behavior.
sheetList <- mget(ls(pattern = "sheet[0-9]"))
sheetList <- lapply(sheetList, function(df) {
df$data <- ifelse(is.na(suppressWarnings(as.integer(df$data))), df$data,
as.character(suppressWarnings(as.Date(as.integer(df$data),
origin = "1899-12-30"))))
return(df)
})
list2env(sheetList, envir=.GlobalEnv)
I am using the ExtremeBounds package which provides as a result a multi level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kind of things, for example to export the results of different specifications into different Excel Sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$b$aa$A,B=exampleList$b$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
This makes a lot of assumptions about the desired output format (the question was a bit vague).
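If the goal is exactly the collectedResults list built by hand in the question (one data frame per top-level element, taken from its aa entry), a single lapply may be enough. A sketch, assuming that reading of the question is the right one:

```r
set.seed(42)  # only so the example is reproducible
exampleList <- list(
  a = list(aa = data.frame(A = rnorm(10), B = rnorm(10)),
           bb = data.frame(A = rnorm(10), B = rnorm(10))),
  b = list(aa = data.frame(A = rnorm(10), B = rnorm(10)),
           bb = data.frame(A = rnorm(10), B = rnorm(10))))

# For each top-level element, keep columns A and B of its "aa" data frame
collectedResults <- lapply(exampleList, function(s) s$aa[, c("A", "B")])

names(collectedResults)
```

The names of the top-level list carry over to the result, so nothing has to be named by hand.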
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators
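The same result can probably be reached without eval(parse()) by iterating over the results list itself, since lapply hands each element to the function directly. The sketch below uses a mock list with the same spec / bounds nesting rather than real ExtremeBounds output; make_bounds and the beta column are made up for illustration.

```r
cols <- c("type", "cdf.mu.normal", "cdf.above.mu.normal",
          "cdf.mu.generic", "cdf.above.mu.generic")

# Hypothetical helper that builds a stand-in for a "bounds" data frame
make_bounds <- function(n) {
  df <- data.frame(type = rep("doubtful", n), beta = rnorm(n))
  for (cl in cols[-1]) df[[cl]] <- runif(n)
  df
}

# Mock object with the same nesting as eba_results: spec -> bounds
mock_results <- list(spec1 = list(bounds = make_bounds(3)),
                     spec2 = list(bounds = make_bounds(4)))

# One data frame per specification, keeping only the columns of interest
collected <- lapply(mock_results, function(res) res$bounds[, cols])
```

With the real object, replacing mock_results by eba_results should give one data frame per specification, named spec1, spec2 and so on.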
I have a set of data files in my working directory that I want to process for analysis. I first read them inside an lapply call, and then I want to process the columns and order the rows by group. So most of the time I need to combine dplyr and lapply to process my data faster.
I looked around the web and checked some books, but most of the examples are simple ones and do not cover combining the two.
Here is the sample code which I'm using:
files <- mixedsort(dir(pattern = "*.txt",full.names = FALSE)) # to read data
data <- lapply(files,function(x){
tmp <- read.table(file=x, fill=T, sep = "\t", dec=".", header=F,stringsAsFactors=F)
df <- tmp [!grepl(c("AC"),tmp $V1),]
new.df <- select(df, V1:V26)
new.df <- apply(new.df, function(x){ x[11:26] <- x[11:26]/10000;x })
I am getting the following error:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Here is a reproducible example that resembles my data. Let's say I want to process the 2nd and 3rd columns of dat and group by the let column. When I try to put the fun command below inside the data code above, I get an error. Any guidance will be appreciated.
dat <- lapply(1:3, function(x)data.frame(let=sample(letters,4),a=sort(runif(20,0,10000),decreasing=TRUE), b=sort(runif(20,0,10000),decreasing=TRUE), c=rnorm(20),d=rnorm(20)))
fun <- lapply(dat, function(x){x[2:3] <-x[2:3] /10000; x})
As mentioned in the comments to your question, the apply call was causing the error: it is missing its MARGIN argument, so the function is matched to MARGIN and FUN is left empty. However, I don't think apply is what you want here, because it coerces your dataframe to a matrix.
Using just dplyr syntax, your problem can be solved like this:
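tmp %>%
filter(!grepl("AC",V1)) %>%
select(V1:V26) %>%
# divide by 10000, as in the question; across() (dplyr >= 1.0.0) replaces the deprecated mutate_each(funs())
mutate(across(V11:V26, ~ .x / 10000))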
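For comparison, the reproducible dat list from the question can be processed in base R with a single lapply and no inner apply. This is a sketch under the assumption that "group by let" means ordering the rows so that equal let values sit together.

```r
set.seed(1)  # only so the example is reproducible
dat <- lapply(1:3, function(i) data.frame(
  let = sample(letters, 4),
  a = sort(runif(20, 0, 10000), decreasing = TRUE),
  b = sort(runif(20, 0, 10000), decreasing = TRUE),
  c = rnorm(20),
  d = rnorm(20)))

processed <- lapply(dat, function(df) {
  df[2:3] <- df[2:3] / 10000   # rescale the 2nd and 3rd columns in place
  df[order(df$let), ]          # put rows with the same letter next to each other
})
```

The function body is where the dplyr pipeline would go in the file-reading version, with the data frame returned as the last expression.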