Passing dataframe as argument to function - r

I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument to the function and read the information I need from the individual rows. However, when I try to use it as an argument, I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this I get a list with the column names, but all values are NULL.
Expected output:
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and also other values from the same row of the data frame, later).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?

Here is one way:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row){
# apply() passes each row as a named character vector
aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
# the mean is stored as text, because row is a character vector
row["mean.aa"] <- mean(aa)
as.list(row)
})
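The attempt in the question never sees a row: a data frame is a list of columns, so Map() calls the function once per column (Name, SN, Age) rather than once per row. Below is a minimal sketch of a row-wise alternative, assuming it is acceptable to index the rows with lapply() over seq_len(nrow(DF)). The name row_list is illustrative and not from the original post. Unlike apply(), this keeps each column's type, so SN stays an integer and mean.aa a number.

```r
DF <- data.frame(Name = c("A", "B"), SN = 1:2,
                 Age = c("21,34,456,567,23,123,34",
                         "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
                 stringsAsFactors = FALSE)

# One list element per row; row_list is a hypothetical name
row_list <- lapply(seq_len(nrow(DF)), function(i) {
  row <- as.list(DF[i, ])                         # one-row data frame -> list
  aa <- as.numeric(strsplit(row$Age, ",")[[1]])   # split the string, then convert
  row$mean.aa <- mean(aa)
  row
})

str(row_list[[1]])
```

Further per-row values can be added to row inside the function in the same way as mean.aa.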

Related

How to dynamically complement colnames in a list of data.frames by information from a vector

I have a list of data.frames in which the name of the first column of each data.frame should be extended with information taken from a vector.
Example:
set.seed(1)
df1 <- data.frame(matrix(sample(32), ncol = 8))
names(df1) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")
set.seed(2)
df2 <- data.frame(matrix(sample(32), ncol = 8))
names(df2) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")
list_dfs <- list(df1, df2)
add_info <- c("add1", "other")
How can I add information from add_info to change the colname for a1 in df1 to "a1 add1" and a1 in df2 to "a1 other" in a scalable way within the given list structure? The other colnames are not supposed to be changed.
I tried several approaches setting colnames using paste0 within lapply or a for loop and reviewed similar questions on SO but couldn't solve this problem so far.
You can do the following:
list_dfs <- lapply(1:length(list_dfs), function(i) {
setNames(list_dfs[[i]],paste(names(list_dfs[[i]]),add_info[[i]]))
})
Now the first data frame in the list has each of its original column names concatenated with the first element of add_info, and the second has its column names concatenated with the second element of add_info. You can easily scale this to longer lists of data.frames and correspondingly longer add_info vectors.
Update:
If you only want to change the first name, do
list_dfs <- lapply(1:length(list_dfs), function(i) {
lastNames <- names(list_dfs[[i]])[2:NCOL(list_dfs[[i]])]
firstName <- paste(names(list_dfs[[i]])[1],add_info[[i]])
setNames(list_dfs[[i]],c(firstName,lastNames))
})
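A variant of the same idea, sketched with Map() so that each data frame is paired directly with its add_info element and no index is needed. The name renamed is illustrative. This assumes list_dfs and add_info have the same length and are in matching order.

```r
set.seed(1)
df1 <- data.frame(matrix(sample(32), ncol = 8))
names(df1) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")
set.seed(2)
df2 <- data.frame(matrix(sample(32), ncol = 8))
names(df2) <- paste(rep(c("a", "b"), each = 4), 1:4, sep = "")
list_dfs <- list(df1, df2)
add_info <- c("add1", "other")

# Walk over the data frames and the extra labels in parallel
renamed <- Map(function(df, info) {
  names(df)[1] <- paste(names(df)[1], info)   # only the first column name changes
  df
}, list_dfs, add_info)

names(renamed[[1]])
names(renamed[[2]])
```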

Loop through rows in list of dataframes and extract data. (Nested "apply" functions)

I am new to R and trying to do things the "R" way, which means no for loops. I would like to loop through a list of dataframes, loop through each row in the dataframe, and extract data based on criteria and store in a master dataframe.
Some issues I am having are with accessing the "global" dataframe. I am unsure of the best approach (global variable, pass by reference).
I have created an abstract example to try to show what needs to be done:
rm(list=ls())## CLEAR WORKSPACE
assign("last.warning", NULL, envir = baseenv())## CLEAR WARNINGS
# Generate a descriptive name with name and size
generateDescriptiveName <- function(animal.row, animalList.vector){
name <- animal.row["animal"]
size <- animal.row["size"]
# if in list of interest prepare name for master dataframe
if (any(grepl(name, animalList.vector))){
return (paste0(name, "Sz-", size))
}
}
# Animals of interest
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
jungleAnimals <- c("ants", "parrot", "cheetah")
jungleSizes <- c(0.1, 1, 50)
jungle.df <- data.frame(jungleAnimals, jungleSizes)
fieldAnimals <- c("elephant", "lion", "hyena")
fieldSizes <- c(1000, 100, 80)
field.df <- data.frame(fieldAnimals, fieldSizes)
forestAnimals <- c("squirrel", "deer", "lizard")
forestSizes <- c(1, 40, 0.2)
forest.df <- data.frame(forestAnimals, forestSizes)
ecosystems.list <- list(jungle.df, field.df, forest.df)
# Final master list
descriptiveAnimal.df <- data.frame(name = character(), descriptive.name = character())
# apply to all dataframes in list
lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
# apply to each row in dataframe
output <- apply(ecosystem.df, 1, function(row){generateDescriptiveName(row, animalList.vector)})
if(!is.null(output)){
# Add generated names to unique master list (no duplicates)
}
})
The end result would be:
name descriptive.name
1 "parrot" "parrot Sz-1"
2 "cheetah" "cheetah Sz-50"
3 "elephant" "elephant Sz-1000"
4 "deer" "deer Sz-40"
5 "lizard" "lizard Sz-0.2"
I did not use your function generateDescriptiveName() because I think it is a bit too laborious. I also do not see a reason to use apply() within lapply(). Here is my attempt to generate the desired output. It is not perfect but I hope it helps.
df_list <- lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
temp <- ecosystem.df[ecosystem.df$animal %in% animalList.vector, ]
if(nrow(temp) > 0){
data.frame(name = temp$animal, descriptive.name = paste0(temp$animal, " Sz-", temp$size))
}
})
do.call("rbind",df_list)
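The same result can be sketched by stacking the data frames first and filtering once, which avoids handling empty per-ecosystem results. This assumes every data frame has exactly two columns in the order animal, size. The names all_animals, keep and result are illustrative.

```r
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
jungle.df <- data.frame(jungleAnimals = c("ants", "parrot", "cheetah"),
                        jungleSizes = c(0.1, 1, 50))
field.df <- data.frame(fieldAnimals = c("elephant", "lion", "hyena"),
                       fieldSizes = c(1000, 100, 80))
forest.df <- data.frame(forestAnimals = c("squirrel", "deer", "lizard"),
                        forestSizes = c(1, 40, 0.2))
ecosystems.list <- list(jungle.df, field.df, forest.df)

# Give every data frame the same column names, then stack them
all_animals <- do.call(rbind, lapply(ecosystems.list, setNames, c("animal", "size")))

# Keep only the animals of interest and build the descriptive name
keep <- all_animals[all_animals$animal %in% animalList.vector, ]
result <- data.frame(name = keep$animal,
                     descriptive.name = paste0(keep$animal, " Sz-", keep$size))
result
```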

How to check identical for multiple R objects?

Suppose I have a list object like this:
set.seed(123)
df <- data.frame(x = rnorm(5), y = rbinom(5,2,0.5))
rownames(df) <- LETTERS[1:5]
ls <- list(df1 = df, df2 = df, df3 = df)
My question is how to quickly check whether the row names are identical across the three elements (data frames) in ls.
You can try
all(sapply(ls, rownames) == rownames(ls[[1]]))
To check only the name of the ith row, you can modify this to
all(sapply(ls, rownames)[i, ] == rownames(ls[[1]])[i])
You can get a list of row names with:
Map(rownames, ls)
so you can check that all the data frames have the same row names by testing that this list contains only one unique row-name vector:
length(unique(Map(rownames, ls))) == 1
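The == comparison above relies on every data frame having the same number of rows. Here is a sketch of a check built on identical(), which compares whole row-name vectors and so also covers data frames of different lengths. The helper same_rownames and the list name dfs (used instead of ls, to avoid masking the base function ls()) are illustrative, not from the original posts.

```r
set.seed(123)
df <- data.frame(x = rnorm(5), y = rbinom(5, 2, 0.5))
rownames(df) <- LETTERS[1:5]
dfs <- list(df1 = df, df2 = df, df3 = df)

# Hypothetical helper: TRUE if every data frame has the same row names,
# in the same order, as the first one
same_rownames <- function(frames) {
  first <- rownames(frames[[1]])
  all(vapply(frames[-1], function(d) identical(rownames(d), first), logical(1)))
}

same_rownames(dfs)        # TRUE

changed <- dfs
rownames(changed$df3)[1] <- "Z"
same_rownames(changed)    # FALSE
```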

R: Looping through list of dataframes in a vector

I have a dataset where I only want to loop through certain columns in a dataframe one at a time to create a graph. The structure of my dataframe consists of data that I parsed from a larger dataset into a vector containing multiple dataframes.
I want to call one column from one dataframe in the vector. I want to loop on the dataframes to call each column.
See example below:
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4))
my.list <- list(d1, d2)
All I have to work with is my.list
How would I do this?
You can use lapply to plot each of the individual data frames in your list. For example,
d1 <- data.frame(y1=c(1,2,3),y2=c(4,5,6),y3=c(7,8,9))
d2 <- data.frame(y1=c(3,2,1),y2=c(6,5,4),y3=c(11,12,13))
mylist <- list(d1, d2)
par(mfrow=c(2,1))
# lapply on a subset of columns
lapply(mylist, function(x) plot(x$y2, x$y3))
You don't need a for loop to get the data points. You can select the columns by name.
# a toy dataframe
d <- data.frame(A = 1:20, B = sample(c(FALSE, TRUE), 20, replace = TRUE),
C = LETTERS[1:20], D = rnorm(20, 0, 1))
col_names <- c("A", "B", "D") # names of columns I want to get
d[,col_names] # returns a dataset with the values of the columns you want
Here is a solution to your problem using a for loop:
# a toy list of data frames
mylist <- list(dat1 = data.frame(A = 1:20, B = LETTERS[1:20]),
dat2 = data.frame(A = 21:40, B = LETTERS[1:20]),
dat3 = data.frame(A = 41:60, B = LETTERS[1:20]))
col_names <- c("A") # name of columns I want to get
for (i in seq_along(mylist)){
# you can do whatever you want with what is returned;
# here I am just printing it
print(names(mylist)[i]) # name of the data frame
print(mylist[[i]][,col_names]) # values in Column A
}
I think the simplest answer to your question is to use double brackets.
for (i in 1:length(my.list)) {
print(my.list[[i]]$column)
}
That works assuming all of the data frames in your list use the same column names (replace column with the actual name, e.g. y1). You could also select the column by its position in the data frame if you wanted.
Yes, lapply can be more elegant, but in some situations a for loop makes more sense.
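For comparison, a short sketch of the lapply form applied to the data from the question, assuming each data frame has a column named y1. The name y1_values is illustrative.

```r
d1 <- data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6))
d2 <- data.frame(y1 = c(3, 2, 1), y2 = c(6, 5, 4))
my.list <- list(d1, d2)

# Pull the same column out of every data frame; `[[` extracts by name
y1_values <- lapply(my.list, `[[`, "y1")
y1_values
```

Each element of y1_values is a plain numeric vector that can then be passed to plot() or any other function.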

How to add columns to a blank data frame column by column in R?

I am trying to create a data.frame and then add some columns to it.
I tried the following code, but it does not work:
test.dim <- as.data.frame(matrix(nrow=0, ncol=4))
names <- c("A", "B", "C", "D")
colnames(test.dim) <- names
for (i in 1:4) {
name = names[i]
# do some calculations, at last get another data.frame named x.data
mean.data <- apply(x.data, 1, mean)
test.dim[, name] <- mean.data
}
Usually one would already have a data.frame (call it df) and simply add columns by calling df$newColName = values or df[,newColNames] = frame_of_values.
Your question indicates that you are separating the creation of your values from putting them in the data frame (which I do not recommend). But if you really want to start from an empty (zero-row) frame, here are some options:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame(matrix(NA, nrow = 1, ncol = length(colnamesToAdd)))
colnames(test.dim) = colnamesToAdd
test.dim = test.dim[-1,]
Another option:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame("USELESS" = NA)
test.dim[,colnamesToAdd] = NA
test.dim = test.dim[-1,-1]
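The loop in the question most likely fails because a data frame with zero rows cannot take a column of non-zero length. One way around this, sketched here, is to compute each column first and build the data frame only at the end. The matrix x.data below is a made-up stand-in for the asker's calculation, and col_names and cols are illustrative names.

```r
set.seed(42)
col_names <- c("A", "B", "C", "D")

# Compute one column per name; x.data is a hypothetical placeholder
cols <- lapply(col_names, function(nm) {
  x.data <- matrix(runif(15), nrow = 5)   # stand-in for the real calculation
  rowMeans(x.data)                        # same result as apply(x.data, 1, mean)
})

# Assemble the data frame once all columns exist
test.dim <- as.data.frame(setNames(cols, col_names))
dim(test.dim)
```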
If you are looking to add a group mean to your table and repeat it on every row of the corresponding factor level:
library(data.table)
test.dim = data.table("FACTOR" = sample(letters[1:4],100,replace=TRUE), "VALUE" = runif(100), "MEAN" = NA)
means = test.dim[,list(AVG=mean(VALUE)),by="FACTOR"]
# without data.table: by(test.dim$VALUE, test.dim$FACTOR, mean)
# normally I would use the foreach package instead of this last for loop
for(x in 1:nrow(means)) {
test.dim$MEAN[test.dim$FACTOR==means$FACTOR[x]] = means$AVG[x]
}
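In base R the same per-group mean can be sketched without a loop using ave(), which returns a vector as long as its input with each value replaced by its group's summary. This uses a plain data.frame rather than a data.table, and the column names simply mirror the example above.

```r
set.seed(1)
test.dim <- data.frame(FACTOR = sample(letters[1:4], 100, replace = TRUE),
                       VALUE = runif(100))

# ave() applies FUN within each FACTOR group and repeats the result per row
test.dim$MEAN <- ave(test.dim$VALUE, test.dim$FACTOR, FUN = mean)
head(test.dim)
```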
