lapply and dplyr combination to process nested data frames

lapply and dplyr combination to process nested data frames - r

I have a list of dataframes inside of my folder directory which I want to process for analyses. I read them by using inside of lapply function first, then I want to process its columns and order its rows by grouping. Therefore most of times I needed to combine dplyr and lapply functions to process faster of my data.
I looked through out the web and check some books but most of the examples are easy ones and do not cover combination of these two functions.
Here is the sample code which I'm using:
files <- mixedsort(dir(pattern = "*.txt",full.names = FALSE)) # to read data
data <- lapply(files,function(x){
tmp <- read.table(file=x, fill=T, sep = "\t", dec=".", header=F,stringsAsFactors=F)
df <- tmp [!grepl(c("AC"),tmp $V1),]
new.df <- select(df, V1:V26)
new.df <- apply(new.df, function(x){ x[11:26] <- x[11:26]/10000;x })
I am getting the following error:
Error in match.fun(FUN) : argument "FUN" is missing, with no default
Here is the reproducible example which looks like my data. Lets say I want to process 2nd and 3rd column of my dat and group by let column. When I try to put below fun command inside of data code above I got error. Any guidance will be appreciated.
dat <- lapply(1:3, function(x)data.frame(let=sample(letters,4),a=sort(runif(20,0,10000),decreasing=TRUE), b=sort(runif(20,0,10000),decreasing=TRUE), c=rnorm(20),d=rnorm(20)))
fun <- lapply(dat, function(x){x[2:3] <-x[2:3] /10000; x})

as mentioned in the comments to your question, the apply function was causing the error. However I don't think apply is what you want, because it aggregates your dataframe.
using just dplyr-syntax your problem can be solved like this:
tmp %>%
filter(!grepl("AC",V1)) %>%
select(V1:V26) %>%
mutate_each(funs(./1000), V11:V26)

Related

rownames on multiple dataframe with for loop in R

I have several dataframe. I want the first column to be the name of each row.
I can do it for 1 dataframe this way :
# Rename the row according the value in the 1st column
row.names(df1) <- df1[,1]
# Remove the 1st column
df1 <- df1[,-1]
But I want to do that on several dataframe. I tried several strategies, including with assign and some get, but with no success. Here the two main ways I've tried :
# Getting a list of all my dataframes
my_df <- list.files(path="data")
# 1st strategy, adapting what works for 1 dataframe
for (i in 1:length(files_names)) {
rownames(get(my_df[i])) <- get(my_df[[i]])[,1] # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is Could not find function 'get>-'
# 2nd strategy using assign()
for (i in 1:length(my_df)) {
assign(rownames(get(my_df[[i]])), get(my_df[[i]])[,1]) # The problem seems to be in this line
my_df[i] <- my_df[i][,-1]
}
# The error is : Error in assign(rownames(my_df[i]), get(my_df[[i]])[, 1]) : first argument incorrect
I really don't see what I missed. When I type get(my_df[i]) and get(my_df[[i]])[,1], it works alone in the console...
Thank you very much to those who can help me :)

You may write the code that you have in a function, read the data and pass every dataframe to the function.
change_rownames <- function(df1) {
row.names(df1) <- df1[,1]
df1 <- df1[,-1]
df1
}
my_df <- list.files(path="data")
list_data <- lapply(my_df, function(x) change_rownames(read.csv(x)))

We can use a loop function like lapply or purrr::map to loop through all the data.frames, then use dplyr::column_to_rownames, which simplifies the procedure a lot. No need for an explicit for loop.
library(purrr)
library(dplyr)
map(my_df, ~ .x %>% read.csv() %>% column_to_rownames(var = names(.)[1]))

colnames and mutate on multiple dataframes

I have a problem with cleaning up my code. I understand I could type this all out but we don't want that obviously.
I have only dataframes in my global environment. They are all "data.frame".
I want to check the dimensions of all of them and put that in a tibble. I managed that somehow. I also would like to change their colnames() tolower() which works easy if I just type the name of the data.frame, but there's more than 2 and I want it done automatically. Then I also want to mutate all data.frames in the same way.
Small example of my code:
library(tidyverse)
x <- data.frame(letters[1:2]) #To create the data
y <- data.frame(letters[3:4])
dfs <- as.list(ls()) #I take whatever is in my environment
I managed below to get a tibble of the dimensions:
z <- as_tibble(lapply(seq_along(dfs),
function(j) dim(get(dfs[[j]]))), .name_repair = "unique")
colnames(z) <- dfs
Now for the colnames of all the data.frames stored in my list I basically want to perform this code:
colnames(dfs[[1]]) <- tolower(colnames(dfs[[1]])
but that returns NULL as I found out earlier. So I used get() in there to make it work for the dimensions. But if I use get() to assign colnames it says it can't find function "get<-".
Since all colnames for all dataframes are the same (just different nrows()) I could save the lowercase colnames as value and use that, but that doesn't take away that it cant find the get<- function.
names <- tolower(colnames(x))
sapply(seq_along(dfs),
function(j) colnames(get(dfs[[j]])) <- names)
*Error in colnames(get(dfs[[j]])) <- names :
could not find function "get<-"*
as for the mutating part I tried a for loop:
for(i in seq_along(dfs)){
get(dfs[[i]]) <- get(dfs[[i]]) %>% mutate(cd = ab)
}
But it's the same issue.
Could anyone help clearing this problem for me? (and if a cleaner code for the dimensions is available that would be highly appreciated)
I am just trying to up my coding skills. I would have been long done if I just typed it all out but that defeats the purpose.
Thanks!
-JK

Using base R
lapply(dfs, function(x) transform(setNames(x, tolower(names(x))), X = c('a', 'b')))

R: transforming multiple sets to dataframes at once

I have 31 datasets corresponding to data about 31 teachers. I need to perform multiple transformations on all these datasets. One of them is transforming all of them into dataframes
class(alexandre)
[1] "tbl_df" "tbl" "data.frame"
As I said, I have 31 similar datasets, and I need to transform all into dataframes. My code to do so has been
alexandre <- as.data.frame(alexandre)
adrian <- as.data.frame(adrian)
akemi <- as.data.frame(akemi)
arcanjo <- as.data.frame(arcanjo)
ana_barbara <- as.data.frame(ana_barbara)
brigida <- as.data.frame(brigida)
cleiton <- as.data.frame(cleiton)
daniela <- as.data.frame(daniela)
davi <- as.data.frame(davi)
eliezer <- as.data.frame(eliezer)
eduardo <- as.data.frame(eduardo)
eustaquio <- as.data.frame(eustaquio)
gilberto <- as.data.frame(gilberto)
gilmar <- as.data.frame(gilmar)
jorge <- as.data.frame(jorge)
juarez <- as.data.frame(juarez)
junior <- as.data.frame(junior)
... and add some rows to this code (31 lines of this). Obviously all these lines of code take too much space and there must be a faster(and more elegant) way to accomplish this. In fact, I tried this
teachers <- c(alexandre, akemi, adrian, brigida, davi, ...)
cnames <- function(x){
colnames(x) <- c(1:18)
}
mapply(cnames, teachers)
Then I would do all the work with a few lines of code. And this method (form a vector containing all datasets, then use mapply on the vector) would make my work much easier because, as I said, I have to perform multiple transformation on all these datasets.
This code does not work, however. I get the following error:
Error in `colnames<-`(`*tmp*`, value = c(1:18)) :
attempt to set 'colnames' on an object with less than two dimensions
This error message is very unenlightening, I find. I have no idea what to do to to make the code work, which is obviously why I'm here. Any other methods to accomplish what I'm trying to do are welcome. Thanks.

As commented and often discussed in the R tag of SO, simply use a list to maintain all your individual, similarly structured data frames. Doing so allows you the following benefits:
Easily run operations consistently across all items using loops or apply family calls without separate naming assignments.
Organizes your environment and workspace with maintenance of one object with easy reference by number or name instead of 31 objects flooding your global environment.
Facilitates data frame migrations and handling with rbind, cbind, split, by, or other operations.
To create a list of all current data frames in global environment use eapply or mget filtering on data frame objects. Each returns a named list of data frames.
teachers_df_list <- Filter(is.data.frame, eapply(.GlobalEnv, identity))
teachers_df_list <- Filter(is.data.frame, mget(x=ls()))
Alternatively, source your data frames originally from file sources using list objects such as list.files:
teachers_df_list <- lapply(list.files(...), function(f) read.csv(f, ...))
You lose no functionality of data frame if stored inside a list.
head(teachers_df_list$alexandre)
tail(teachers_df_list$adrian)
summary(teachers_df_list$akemi)
...
Then run your needed operations with lapply like renaming columns with right-hand side function, setNames. Run other needed operations: aggregate or lm.
new_teachers_df_list <- lapply(teachers_df_list,
function(df) setNames(df, paste0("col_", c(1:18)))
new_teachers_agg_list <- lapply(teachers_df_list,
function(df) aggregate(col1 ~ col2, df, sum))
new_teachers_model_list <- lapply(teachers_df_list,
function(df) summary(lm(col1 ~ col2, df)))
Even compile all data frames into one master version using do.call + rbind:
# ADD A TEACHER INDICATOR COLUMN
new_teachers_df_list <- Map(function(df, n) transform(df, teacher=n),
new_teachers_df_list, names(new_teachers_df_list))
# BUILD SINGLE DF
teachers_df <- do.call(rbind, new_teachers_df_list)
Even split master version back into individual groupings if needed later on:
# SPLIT BACK TO LIST OF DFs
teachers_df_list <- split(teachers_df, teachers_df$teacher)

Maybe you could use a list to stock all your data.frame. It seems to work, but you need to find a way to extract all data.frame in the list after that.
df_1 <- data.frame(c(0, 1, 0), c(3, 4, 5))
df_2 <- data.frame(c(0, 1, 0), c(3, 4, 5))
l <- list(df_1, df_2)
lapply(l, function(x){
colnames(x) <- 1:2
return(x)
})

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each column. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names--I just want extract the entry in a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind..not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again, and use a cbind, so I can substract two columns from each other.

I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.

List vs Df difference within lapply bulk download

I was using this function below to bulk load several data txt files. Before deciding to subset within the lapply, my loaded frames were loaded as a list in RStudio. It was transformed into a df using the do call.
I decided to add two essential subset arguments within the function to remove empty columns. Without the final do.call, df first became a data frame, not a list. Albeit it still loads in correctly for do call, I am wondering what the main difference is with lists and df, specifically with regards to this situation.
df <- lapply(temp, function(x) {
dfs<- read.table(x,
header=TRUE,
fill=TRUE,
na.strings= "N/A",
sep="\t",
stringsAsFactors = TRUE)
#Added, then changed type of df before do.call
dfs<- dfs[!dfs$Variable %in% "",]
dfs<- dfs[!dfs$X5TH %in% "",]
#
dfs<- cbind(codes, dfs)
return(dfs)
})
CR13distr<- do.call("rbind", df)
remove(df)
I have checked the results, and they seem to be identical outputs (except the removal of empty rows). However, to be sure, does that intermediate change from list to df change anything significant?

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

lapply and dplyr combination to process nested data frames - r

Related

rownames on multiple dataframe with for loop in R

colnames and mutate on multiple dataframes

R: transforming multiple sets to dataframes at once

Use R to add a column to multiple dataframes using lapply

List vs Df difference within lapply bulk download

Categories

Resources