List vs Df difference within lapply bulk download - r

I was using the function below to bulk load several tab-delimited .txt data files. Before I decided to subset inside the lapply, the loaded frames appeared as a list in RStudio, and do.call turned that list into a data frame.
I then added two essential subset steps inside the function to remove rows with empty values. Now, before the final do.call, df already shows up as a data frame rather than a list. Although the result still combines correctly with do.call, I am wondering what the main difference is between a list and a data frame, specifically in this situation.
df <- lapply(temp, function(x) {
  dfs <- read.table(x,
                    header = TRUE,
                    fill = TRUE,
                    na.strings = "N/A",
                    sep = "\t",
                    stringsAsFactors = TRUE)
  # Added, then changed type of df before do.call
  dfs <- dfs[!dfs$Variable %in% "", ]
  dfs <- dfs[!dfs$X5TH %in% "", ]
  #
  dfs <- cbind(codes, dfs)
  return(dfs)
})
CR13distr <- do.call("rbind", df)
remove(df)
I have checked the results, and they seem to be identical outputs (except the removal of empty rows). However, to be sure, does that intermediate change from list to df change anything significant?
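For what it's worth, lapply() always returns a list, and subsetting rows of a data frame inside the function still returns a data frame, so the intermediate object should remain a list of data frames either way. A minimal sketch to check this, assuming two small temporary files with hypothetical contents in place of the real data:

```r
# Two small tab-separated files standing in for the real data (hypothetical contents).
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines(c("Variable\tX5TH", "a\t1", "\t2", "b\t3"), paths[1])
writeLines(c("Variable\tX5TH", "c\t4", "d\t5"), paths[2])

frames <- lapply(paths, function(p) {
  d <- read.table(p, header = TRUE, fill = TRUE, na.strings = "N/A",
                  sep = "\t", stringsAsFactors = TRUE)
  # Row filtering returns a data frame, so each list element keeps its type.
  d[!d$Variable %in% "", ]
})

class(frames)        # "list": lapply() always returns a list
class(frames[[1]])   # "data.frame": each element is a data frame
combined <- do.call("rbind", frames)
class(combined)      # "data.frame": the stacked result
```

Running class() on the intermediate object, as here, is a quick way to confirm what is actually being passed to do.call.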

Related

Merging DF in a list individually to another DF

I have a list of dataframes with varying dimensions filled with data and row/col names of Countries. I also have a "master" dataframe outside of this list that is blank with square dimensions of 189x189.
I wish to merge each dataframe inside the list individually on top of the "master" sheet, preserving the square matrix dimensions. I have been able to achieve this individually using this code:
rownames(Trade) <- Trade$X
Trade <- Trade[, 2:length(Trade)]
Full[row.names(Trade), colnames(Trade)] <- Trade
With "Full" being my master sheet and "Trade" being an individual df.
I have attempted to create a function to apply this process to a list of dataframes but am unable to properly do this.
Function and code in question:
DataMerge <- function(df) {
rownames(df) <- df$Country
Trade <- Trade[, 2:length(Trade)]
Country[row.names(df), colnames(df)] <- df
}
Applied using :
DataMergeDF <- lapply(TradeMatrixDF, DataMerge)
filenames <- paste0("Merged",names(DataMergeDF), ".csv")
mapply(write.csv, DataMergeDF, filenames)
Country <- read.csv("FullCountry.csv")
However what ends up happening is that the data does not end up merging properly / the dimensions are not preserved.
I asked a question pertaining to this issue a few days ago (CSV generated from not matching to what I have in R) , but I have a suspicion that I am running into this issue due to my use of "lapply". However, I am not 100% sure.
If we return 'Country' at the end of the function, it should work. It is also better to pass the other data as an argument:
DataMerge <- function(Country, df) {
  rownames(df) <- df$Country
  df <- df[, 2:length(df)]
  Country[row.names(df), colnames(df)] <- df
  Country
}
then, we call the function as
DataMergeDF <- lapply(TradeMatrixDF, DataMerge, Country = Country)
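The corrected function can be tried on toy data. This is only a sketch: the 3 x 3 master and the two small trade frames are hypothetical stand-ins for the 189 x 189 sheet and the real list.

```r
DataMerge <- function(Country, df) {
  rownames(df) <- df$Country
  df <- df[, 2:length(df)]
  Country[row.names(df), colnames(df)] <- df
  Country                     # return the filled master, not the assignment value
}

countries <- c("A", "B", "C")
# Blank square master (hypothetical 3 x 3 stand-in)
Country <- as.data.frame(matrix(NA_real_, 3, 3,
                                dimnames = list(countries, countries)))

# Hypothetical list of trade data frames with varying dimensions
TradeMatrixDF <- list(
  first  = data.frame(Country = c("A", "B"), A = c(0, 1), B = c(2, 0)),
  second = data.frame(Country = "C", A = 5, C = 0)
)

DataMergeDF <- lapply(TradeMatrixDF, DataMerge, Country = Country)
sapply(DataMergeDF, dim)      # every result keeps the master's dimensions
```

Because the master is passed in as an argument and returned, each list element gets its own filled copy while the original Country stays blank.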

R: transforming multiple sets to dataframes at once

I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all these datasets. One of them is transforming all of them into dataframes.
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and so on for the remaining rows (31 lines of this). Obviously all these lines of code take too much space, and there must be a faster (and more elegant) way to accomplish this. In fact, I tried this:
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. This method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformations on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
I find this error message very unenlightening. I have no idea what to do to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.
As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so allows you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Organizes your environment and workspace with maintenance of one object with easy reference by number or name instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in global environment use eapply or mget filtering on data frame objects. Each returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
Alternatively, source your data frames originally from file sources using list objects such as list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
You lose no functionality of data frame if stored inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your needed operations with lapply, such as renaming columns with setNames, which can be used on the right-hand side of an assignment. Other operations such as aggregate or lm follow the same pattern:
new_teachers_df_list <- lapply(teachers_df_list,
                               function(df) setNames(df, paste0("col_", 1:18)))
new_teachers_agg_list <- lapply(new_teachers_df_list,
                                function(df) aggregate(col_1 ~ col_2, df, sum))
new_teachers_model_list <- lapply(new_teachers_df_list,
                                  function(df) summary(lm(col_1 ~ col_2, df)))
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)
Maybe you could use a list to store all your data frames. It seems to work, but you then need a way to extract the data frames from the list afterwards.
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
colnames(x) <- 1:2
return(x)
})
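As for the original error, a likely cause is that c() applied to data frames concatenates their columns into one flat list, so the function receives plain vectors rather than data frames. A small sketch with two hypothetical data frames:

```r
a <- data.frame(x = 1:3, y = 4:6)
b <- data.frame(x = 7:9, y = 10:12)

flat <- c(a, b)            # c() concatenates the columns into one flat list
length(flat)               # 4 vectors, not 2 data frames
is.data.frame(flat[[1]])   # FALSE: colnames<- on such a vector raises the error

teachers <- list(a = a, b = b)   # list() keeps each data frame intact
renamed <- lapply(teachers, function(df) {
  colnames(df) <- c("first", "second")
  df                       # return the modified copy, not the assignment value
})
names(renamed$a)
```

Note that the function has to return the data frame explicitly; a function whose last expression is `colnames(x) <- ...` returns only the assigned names.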

Saving data frames to values in a list

I have a list of titles that I would like to iterate over, creating and saving a data frame for each. I have tried using the paste() function (as seen below), but that does not work for me. Any advice would be greatly appreciated.
samples <- list("A","B","C")
for (i in samples){
paste(i,sumT,sep="_") <- data.frame(col1=NA,col1=NA)
}
My desired output is three empty data frames named: A_sumT, B_sumT and C_sumT
Here's an answer with purrr.
library(purrr) # also provides the %>% pipe
samples <- list("A", "B", "C")
samples %>%
purrr::map(~ data.frame()) %>%
purrr::set_names(~ paste(samples, "sumT", sep="_"))
Consider creating a list of dataframes and avoid many separate objects flooding global environment as this example can extend to hundreds and not just three. Plus with this approach, you will maintain one container capable of running bulk operations across all dataframes.
By using sapply below on a character vector, you create a named list:
samples <- c("A","B","C") # OR unlist(list("A","B","C"))
df_list <- sapply(samples, function(x) data.frame(col1=NA,col2=NA), simplify=FALSE)
# RUN ANY DATAFRAME OPERATION
head(df_list$A)
tail(df_list$B)
summary(df_list$C)
# BULK OPERATIONS
stacked_df <- do.call(rbind, df_list)
wide_df <- do.call(cbind, df_list)
merged_df <- Reduce(function(x,y) merge(x,y,by="col1"), df_list)
Or if you need to rename list
# RENAME LIST
df_list <- setNames(df_list, paste0(samples, "_sumT"))
# RUN ANY DATAFRAME OPERATION
head(df_list$A_sumT)
tail(df_list$B_sumT)
summary(df_list$C_sumT)
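If separate objects named A_sumT, B_sumT and C_sumT are still required, base R's list2env() can create them from the named list. A minimal sketch; `target` is a hypothetical scratch environment used here so the example does not touch the workspace:

```r
samples <- c("A", "B", "C")
df_list <- sapply(samples, function(x) data.frame(col1 = NA, col2 = NA),
                  simplify = FALSE)
df_list <- setNames(df_list, paste0(samples, "_sumT"))

# 'target' is a hypothetical scratch environment; pass envir = .GlobalEnv
# instead to create the objects in the workspace.
target <- new.env()
list2env(df_list, envir = target)
ls(target)
```

Keeping the list as the primary container and exporting only when needed preserves the ability to run bulk operations.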

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each data frame. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names; I just want to extract the entry for a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind, but I'm not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again and use a cbind, so I can subtract two columns from each other.
I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.
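The NULL results in the question are consistent with how lapply works: it returns modified copies and leaves the original objects unchanged. A sketch of a list-based alternative to get/assign, where plain data frames stand in for the spatial objects (an assumption for illustration):

```r
# Plain data frames stand in for the spatial objects (an assumption for illustration).
points_2013 <- data.frame(id = 1:2)
points_2014 <- data.frame(id = 3:5)

pts <- mget(ls(pattern = "^points_"))        # named list of the matching objects
pts <- Map(function(d, nm) {
  d$year <- sub(".*_(\\d{4}).*", "\\1", nm)  # year taken from the object name
  d
}, pts, names(pts))

pts$points_2014$year    # the copy inside the list carries the new column
points_2014$year        # the original object is untouched, hence NULL
```

Working from the named list afterwards avoids having to write the results back with assign.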

lapply and dplyr combination to process nested data frames

I have a set of data frames stored as files in a folder that I want to process for analysis. I first read them inside an lapply call, then I want to process their columns and order their rows by group. So most of the time I need to combine dplyr and lapply functions to process my data faster.
I looked through out the web and check some books but most of the examples are easy ones and do not cover combination of these two functions.
Here is the sample code which I'm using:
files <- mixedsort(dir(pattern = "*.txt",full.names = FALSE)) # to read data
data <- lapply(files,function(x){
tmp <- read.table(file=x, fill=T, sep = "\t", dec=".", header=F,stringsAsFactors=F)
df <- tmp [!grepl(c("AC"),tmp $V1),]
new.df <- select(df, V1:V26)
new.df <- apply(new.df, function(x){ x[11:26] <- x[11:26]/10000;x })
I am getting the following error:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Here is a reproducible example that resembles my data. Let's say I want to process the 2nd and 3rd columns of dat and group by the let column. When I try to put the fun command below inside the data code above, I get an error. Any guidance will be appreciated.
dat <- lapply(1:3, function(x)data.frame(let=sample(letters,4),a=sort(runif(20,0,10000),decreasing=TRUE), b=sort(runif(20,0,10000),decreasing=TRUE), c=rnorm(20),d=rnorm(20)))
fun <- lapply(dat, function(x){x[2:3] <-x[2:3] /10000; x})
As mentioned in the comments to your question, the apply call was causing the error: it has no MARGIN argument, so the function is matched to MARGIN and FUN is left missing. However, I don't think apply is what you want anyway, because it coerces your data frame to a matrix.
Using just dplyr syntax, your problem can be solved like this:
tmp %>%
filter(!grepl("AC",V1)) %>%
select(V1:V26) %>%
mutate_each(funs(./10000), V11:V26)
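Since mutate_each() and funs() are deprecated in current dplyr, the same idea can be written with across() and wrapped in lapply to cover every data frame in the list. This is a sketch assuming dplyr 1.0 or later, using toy data shaped like the reproducible example; the ordering by let and a is an illustrative choice, not something specified in the question.

```r
library(dplyr)

set.seed(1)
# Toy list of data frames resembling the reproducible example
dat <- lapply(1:3, function(i) data.frame(
  let = rep(letters[1:4], 5),
  a = runif(20, 0, 10000),
  b = runif(20, 0, 10000),
  c = rnorm(20)
))

processed <- lapply(dat, function(df) {
  df %>%
    mutate(across(a:b, ~ .x / 10000)) %>%   # rescale the selected columns
    arrange(let, desc(a))                   # order rows within each 'let' value
})

head(processed[[1]])
```

The anonymous function passed to lapply is the place to chain any further dplyr verbs, so each file gets the same pipeline.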
