Extract then rbind data.frames from nested lists - R

I have a function that outputs a large matrix (Mat1) and a small data frame (Smalldf1); I store them together in a list called "Result".
This function runs inside a loop, so I create many versions of my "Result" list until the loop ends, appending each one to a list called "FINAL".
At the end of the loop I have a list called FINAL that contains many smaller "Result" lists, each holding a small data frame and a large matrix.
I want to rbind all of the small data frames together to form one larger data frame called DF1, but I'm not sure how to access them now that they sit in a list within a list.
A similar question on here gave a solution like this:
DF1 <- do.call("rbind", lapply(FINAL, function(x) x["Smalldf1"]))
However, this gives a single column called "Smalldf1" whose entries are the printed representation of each Smalldf1, i.e. the column literally contains list(X1 = "xxx", X2 = "xxx", X3 = "xxx"). I need it broken out into the original format: three columns holding the information.
Any help would be great.

I'll turn my comment into an answer. This could be your data:
df <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9)
FINAL <- list(Result = list(Smalldf1 = df, Mat1 = as.matrix(df)),
              Result = list(Smalldf1 = df + 1, Mat1 = as.matrix(df + 1)))
You can use lapply to extract the first (or Nth; just change the 1) element of each nested list, and then rbind the results, either with do.call or with dplyr. Note the use of [[ rather than [: x["Smalldf1"] returns a length-one list (which is why you saw the list(...) text in a single column), while [[ returns the data frame itself.
# Doing it in base R:
do.call("rbind", lapply(FINAL, "[[", 1))
# Or doing it with dplyr:
library(dplyr)
lapply(FINAL, "[[", 1) %>% bind_rows()
#### X1 X2 X3
#### 1 1 4 7
#### 2 2 5 8
#### 3 3 6 9
#### 4 2 5 8
#### 5 3 6 9
#### 6 4 7 10
This should be your expected result.
WARNING:
The dplyr solution doesn't work on old versions of dplyr (I tested it on dplyr_0.5.0, but it returns an error on dplyr_0.2, for instance).
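Extracting by name instead of by position also works, and is more robust if the order of the elements inside each "Result" ever changes; the same pattern stacks the large matrices:
do.call("rbind", lapply(FINAL, "[[", "Smalldf1"))
# the matrices can be combined the same way:
do.call("rbind", lapply(FINAL, "[[", "Mat1"))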

Related

How do I use dataframe names as inputs for var in for loop (R language)?

In R, I defined the following function:
race_ethn_tab <- function(x) {
  x %>%
    group_by(RAC1P) %>%
    tally(wt = PWGTP) %>%
    print(n = 15)
}
The function simply generates a weighted tally for a given dataset; for example, race_ethn_tab(ca_pop_2000) produces a simple 9 x 2 table:
  RAC1P         n
1 Race 1 22322824
2 Race 2  2144044
3 Race 3   228817
4 Race 4     1827
5 Race 5    98823
6 Race 6  3722624
7 Race 7   116176
8 Race 8  3183821
9 Race 9  1268095
I have to do this for several datasets (approx. 10 distinct ones), and it's easier for me to keep the dfs distinct rather than bind them and create a year variable. So I am trying to use either a for loop or purrr::map() to iterate through my list of dfs.
Here is what I tried:
dfs_test <- as.list(as_tibble(ca_pop_2000),
                    as_tibble(ca_pop_2001),
                    as_tibble(ca_pop_2002),
                    as_tibble(ca_pop_2003),
                    as_tibble(ca_pop_2004))
# Attempt 1: Using a for loop
for (i in dfs_test) {
  race_ethn_tab(i)
}
# Attempt 2: Using purrr::map
race_ethn_outs <- map(dfs_test, race_ethn_tab)
Both attempts tell me that group_by can't be applied to a factor object, but I can't figure out why the elements of dfs_test are being registered as factors, given that I am forcing them into the tibble class. I would appreciate any tips on my approach, or alternative approaches that could make sense here.
This, from @RonakShah, was exactly what was needed:
Your code should work if you use list instead of as.list. See the output of
as.list(as_tibble(mtcars), as_tibble(iris)) vs list(as_tibble(mtcars),
as_tibble(iris)) – Ronak Shah Oct 2 at 0:23
The reason: as.list() applied to a tibble converts that single tibble into a list of its columns (the remaining arguments are silently ignored), so the elements of dfs_test were individual column vectors (some of them factors), not data frames.
We can use mget to return a list of datasets, then loop over the list and apply the function
dfs_test <- mget(paste0("ca_pop_", 2000:2004))
It can also be made more general if we use ls:
dfs_test <- mget(ls(pattern = '^ca_pop_\\d{4}$'))
map(dfs_test, race_ethn_tab)
This is much easier when there are hundreds of objects already created in the global environment, compared to typing out
list(ca_pop_2000, ca_pop_2001, ..., ca_pop_2020)
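Since mget() returns a named list, the names (here the years) survive into the output of map() and can be carried along when stacking the tallies. A minimal sketch, assuming race_ethn_tab() is defined as above (print() returns the tally invisibly, so map() still captures it):
library(purrr)
library(dplyr)
race_ethn_outs <- map(dfs_test, race_ethn_tab)
# one combined table, with the source dataset recorded in its own column
bind_rows(race_ethn_outs, .id = "dataset")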

Use starts_with() outside of a selecting function to define a vector

I have a large original data.frame that I filter to form smaller data.frames throughout the analysis.
The original data.frame has the format of:
> head(Moment_all)
Exterior Interior Sections Spacing UG Names
1 0.02736669 0.03067941 84-12 12 UG-84 Sample 1
2 0.53220402 0.53739861 124-9 9 UG-124 AASHTO
3 0.54016470 0.54016538 116-9 9 UG-116 Sample 10
4 0.54151540 0.51516650 124-9 9 UG-124 Sample 8
5 0.54663913 0.52989489 124-9 9 UG-124 Sample ./124-9-DIA
6 0.54960475 0.51772120 116-9 9 UG-116 Mean
I define the rows that I want to exclude for one of the subsets as:
notbaseline <- c(starts_with("Sample ./"), "Mean", "AASHTO")
Then I define new data.frame as:
data.exterior <- Moment_all[!grepl(paste(notbaseline, collapse = "|"), Moment_all$Names), ]
The problem is that I cannot use starts_with() outside of a selecting function when defining this vector, and the Names column contains various kinds of strings without an obvious common pattern I could match directly.
Is there a better way of removing the rows based on the character in the Names column?
Thank you!
starts_with() is a dplyr function that is meant to help select particular columns of a data frame.
What you'd want to do is something like the below:
library(dplyr)
library(stringr)
filter(Moment_all, str_detect(Names, "Sample \\./", negate = TRUE), !Names %in% c("Mean", "AASHTO"))
We can use base R with subset and startsWith. Note that matching the prefix "Sample" alone would also drop baseline rows such as "Sample 1", so the prefix should include the "./" part (startsWith matches fixed strings, so no escaping is needed):
subset(Moment_all, !startsWith(Names, "Sample ./") & !Names %in% c("Mean", "AASHTO"))
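If you would rather keep a single vector of patterns, as in the original attempt, regular expressions can stand in for starts_with() entirely. A base R sketch ("^" anchors each pattern to the start of the string):
notbaseline <- c("^Sample \\./", "^Mean$", "^AASHTO$")
data.exterior <- Moment_all[!grepl(paste(notbaseline, collapse = "|"), Moment_all$Names), ]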

melt giving several value columns

I am reading in parameter estimates from some results files that I would like to compare side by side in a table, but I can't get the data frame into the structure I want (parameter name, values from file 1, values from file 2).
When I read in the files I get a wide data frame with each parameter in a separate column, which I would like to transform to "long" format using melt. But that gives only one value column. Any idea how to get several value columns without using a for loop?
paraA <- c(1, 2)
paraB <- c(6, 8)
paraC <- c(11, 9)
Source <- c("File1", "File2")
parameters <- data.frame(paraA, paraB, paraC, Source)
wrong_table <- melt(parameters, by = "Source")
You can use melt in combination with dcast to get what you want; this is in fact the intended pattern of use, which is why the functions have the names they do (note that melt's grouping argument is id.vars, not by):
library(reshape2)
m <- melt(parameters, id.vars = "Source")
dcast(m, variable ~ Source)
# variable File1 File2
# 1 paraA 1 2
# 2 paraB 6 8
# 3 paraC 11 9
Converting @alexis's comment to an answer, transpose (t()) pretty much does what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
# File1 File2
# paraA 1 2
# paraB 6 8
# paraC 11 9
I've used setNames above to conveniently rename the resulting data.frame in one step.
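On current tidyverse versions the same reshape can also be written with tidyr's pivot functions; a sketch, assuming tidyr >= 1.0.0 is available:
library(tidyr)
parameters %>%
  pivot_longer(-Source, names_to = "variable") %>%
  pivot_wider(names_from = Source, values_from = value)
# gives one row per parameter, with a column per file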

R applying to a line

I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame; the number of columns can vary.
For each row, I have to check whether at least one of these columns is not NA (basically any(!is.na(df[namescolumns])) per row), and then subset the rows for which that is TRUE.
any(!is.na(df[1, ][namescolumns])) works well, but only for the first row.
I could easily write a for loop, which is my first reflex as a programmer, and it works for one row at a time, but I'm sure it's not the R way and that there is a way to do this with one of the apply functions (lapply, mapply, sapply, tapply or another); I just can't figure out which one or how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
apply coerces the data frame to a matrix first, but since is.na() still recognizes NAs after that coercion, the result is a plain logical vector that can be used to subset the rows directly.
You can use a combination of lapply and Reduce:
no.na.in.cols <- Reduce(`&`, lapply(colnames, function(name) !is.na(df[name])))
This gives a vector that is TRUE only where none of the columns in colnames contain NA, which can in turn be used to subset the data:
df[no.na.in.cols, ]
For example, given:
df <- data.frame(a = c(1, 2, 3, 4, NA, 6, 7),
                 b = c(2, 4, 6, 8, 10, 12, 14),
                 c = c("one", "two", "three", "four", "five", "six", "seven"),
                 d = c("a", NA, "c", "d", "e", "f", "g"))
colnames <- c("a", "d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
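As an aside, the asker's literal condition (at least one of the columns is non-NA) can be written without apply() at all; rowSums() over a logical matrix does it in one vectorised step:
# TRUE where at least one of the selected columns is non-NA
keep <- rowSums(!is.na(df[colnames])) > 0
df[keep, ]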

Read multidimensional group data in R

I have done a lot of googling but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups: A, B, C) while the other columns hold values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and compute the column means:
dt <- read.table(file_name, header = TRUE)  # gives warnings
apply(dt, 2, mean)  # gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the mean (column-wise) for each group. Any help?
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean)  # works because data.frames are lists; dt[-1] drops the non-numeric Tag column
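Since the remaining columns are all numeric, colMeans() is an even more direct route (same caveat: drop the character Tag column first):
colMeans(dt[-1])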
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
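Another base R option worth knowing is aggregate(), which computes the group-wise means straight from a formula interface; a sketch over the same dt:
# mean of v1, v2 and v3 within each level of Tag
aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)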
