How to perform selective rbind on datasets in R

I am trying to use rbind to append different datasets countrywise. The list of the datasets is
data <- c('a1','a2','a3','b1','b2','bu3','bu4','c1','c3')
code <- c('a','b','bu','c')
The structure of the data is somewhat like this -
countrya1 <- c("a","a","a")
yeara1 <- c("1","1","1")
inca1 <- c("1","2","3")
a1 <- data.frame(countrya1,yeara1,inca1)
countrya2 <- c("a","a","a")
yeara2 <- c("2","2","2")
inca2 <- c("1","4","3")
a2 <- data.frame(countrya2,yeara2,inca2)
countryb1 <- c("b","b","b")
yearb1 <- c("1","1","1")
incb1 <- c("1","2","7")
b1 <- data.frame(countryb1,yearb1,incb1)
countryb2 <- c("b","b","b")
yearb2 <- c("2","2","2")
incb2 <- c("6","2","3")
b2 <- data.frame(countryb2,yearb2,incb2)
The code that I used to combine all the datasets is as follows -
df=NULL
for (i in seq_along(data)){
df1 <-read.dta(data[i])
df <-rbind(df,df1)
}
This binds all the datasets together in df.
Is there a way to bind a1, a2, a3 together, b1, b2 together, and so on? In short, I want to bind the datasets by 'code'. Is there a way to do it in R?
Thanks in advance for the help.

We can split 'data' into groups by stripping the digits from each name, then read each group's datasets with read.dta (from the foreign package) and rbind the datasets that share a prefix:
lst <- lapply(split(data, sub("\\d+", "", data)),
function(x) do.call(rbind, lapply(x, read.dta)))
If we want to use 'code' instead, loop through it with grep. The pattern is anchored so that 'b' does not also match 'bu3' and 'bu4':
lapply(code, function(x) do.call(rbind, lapply(grep(paste0("^", x, "\\d+$"), data, value = TRUE), read.dta)))
data
code <- c('a','b','bu','c')
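The grouping step can be tried without any files on disk. The following is a minimal sketch, assuming small in-memory data frames that all share the same column names (rbind on data frames requires matching names); `make_df` and `datasets` are hypothetical stand-ins for the .dta files, not part of the original question.

```r
# Hypothetical helper: builds a tiny data frame with fixed column names
make_df <- function(country, year) {
  data.frame(country = country, year = year, inc = 1:3)
}

# Hypothetical stand-ins for the datasets that would be read with read.dta
datasets <- list(
  a1  = make_df("a", 1),
  a2  = make_df("a", 2),
  b1  = make_df("b", 1),
  bu3 = make_df("bu", 3)
)

# Group the dataset names by their prefix (name with trailing digits removed)
groups <- split(names(datasets), sub("\\d+$", "", names(datasets)))

# Row-bind the data frames within each group
combined <- lapply(groups, function(nms) do.call(rbind, datasets[nms]))
```

Here `combined` is a named list with one data frame per prefix, so `combined$a` holds the rows of a1 and a2 while `combined$bu` stays separate from `combined$b`.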

Related

Merge dataframes stored in two lists of the same length

I have two long lists of large dataframes that are equal in length. I want to merge Dataframe1 (from list1) with Dataframe1 (from list2) and Dataframe2 (from list1) with Dataframe2 (from list2) etc...
Below is a minimal reproducible example and some attempts.
#### EXAMPLE
#Create Dataframes
df_1 <- data.frame(rbind(c("Bah",NA,2,3,4),c("Bug",NA,5,6,NA)))
df_2 <- data.frame(rbind(c("Blu",7,8,9,10),c(NA,NA,NA,12,13)))
df_3 <- data.frame(rbind(c("Bah",NA,21,32,43),c("Rgh",NA,51,63,NA)))
df_4 <- data.frame(rbind(c("Gar",7,8,9,10),c("Ghh",NA,NA,121,131)))
#Create Lists
list1 <- list(df_1,df_2)
list2 <- list(df_3,df_4)
#Set column and row names for each dataframe
colnames(list1[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list1[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
rownames(list1[[1]]) <- c("1","2")
rownames(list1[[2]]) <- c("1","2")
rownames(list2[[1]]) <- c("1","2")
rownames(list2[[2]]) <- c("1","2")
My desired output is a list of the same length as the input lists but with each dataframe merged by position into a single dataframe. The following yields my desired output for the dataframes and list but is low throughput.
#### DESIRED OUTPUT
DesiredOutput_DF1_Format <- merge(list1[[1]],list2[[1]], all = TRUE, by = "SampleID")
DesiredOutput_DF2_Format <- merge(list1[[2]],list2[[2]], all = TRUE, by = "SampleID")
DesiredOutput_List <- list(DesiredOutput_DF1_Format, DesiredOutput_DF2_Format)
How can I generate an output list in my desired format in a high-throughput way using an apply-like approach?
#### ATTEMPTS
#Attempt1:
attempt1 <- mapply(cbind, list1, list2, simplify=FALSE)
#Attempt2:
My instinct is to use `lapply`, but I can't figure out how to make it iterate through two lists simultaneously.
#Attempt3: Works but the order of the output list appears inverted. This is not intuitive, though it is easily corrected... There has to be a cleaner way.
output_list <- list()
dataset_iterator <- 1:length(list1)
for (x in dataset_iterator) {
df1 <- data.frame(list1[[x]])
df2 <- data.frame(list2[[x]])
df_merged <- data.frame(merge(df1, df2, by = "SampleID", all=TRUE))
output_list <- append(output_list, list(df_merged), 0)
}
Based on the code shown, we may need Map (or mapply with SIMPLIFY = FALSE)
out <- Map(merge, list1, list2, MoreArgs = list(all = TRUE, by = "SampleID"))
Checking against the expected output:
> identical(DesiredOutput_List, out)
[1] TRUE
Or using tidyverse
library(purrr)
library(dplyr)
map2(list1, list2, full_join, by = "SampleID")
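To see the pairwise behaviour of Map in isolation, here is a small sketch on toy data. The lists `left` and `right` and the key column `id` are made up for illustration and are not the data from the question.

```r
# Two hypothetical lists of equal length; element i of one pairs with element i of the other
left <- list(
  data.frame(id = c(1, 2), x = c(10, 20)),
  data.frame(id = c(3, 4), x = c(30, 40))
)
right <- list(
  data.frame(id = c(2, 5), y = c("b", "e")),
  data.frame(id = c(3, 4), y = c("c", "d"))
)

# Map walks both lists in parallel and returns a list in the same order;
# all = TRUE requests a full outer join, keeping unmatched ids from both sides
merged <- Map(function(l, r) merge(l, r, by = "id", all = TRUE), left, right)
```

The result keeps the input order, so `merged[[1]]` is the merge of the first pair and `merged[[2]]` the merge of the second, with no need to append inside a loop.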

Rbinding large list of dataframes after I did some data cleaning on the list

My problem is, that I can't merge a large list of dataframes before doing some data cleaning. But it seems like my data cleaning is missing from the list.
I have 43 xlsx-files, which I've put in a list.
Here's my code for that part:
file.list <- list.files(recursive = TRUE, pattern = '\\.xlsx$')
dat = lapply(file.list, function(i){
x = read.xlsx(i, sheet = 1, startRow = 2, colNames = TRUE,
skipEmptyCols = TRUE, skipEmptyRows = TRUE)
# Create column with file name
x$file = i
# Return data
x
})
I then did some data cleaning. Some of the dataframes had empty columns that weren't skipped during loading, and some columns I just didn't need.
Example of how I removed one column (X1) from all dataframes in the list:
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
I also applied column names:
colnames <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "AFVBOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS", "SPECIFIK1",
"SPECIFIK2", "SPECIFIK3", "PERIODE","FILE")
dat <- lapply(dat, setNames, colnames)
My problem is that when I open the list or look at its elements, my data cleaning is missing.
And I can't bind the dataframes before the data cleaning, since they don't all have the same columns.
What am I doing wrong here?
EDIT: Sample data
# Sample data
a <- c("a","b","c")
b <- c(1,2,3)
X1 <- c("", "","")
c <- c("a","b","c")
X2 <- c(1,2,3)
X1 <- c("", "","")
df1 <- data.frame(a,b,c,X1)
df2 <- data.frame(a,b,c,X1,X2)
# Putting in list
dat <- list(df1,df2)
# Removing unwanted columns
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
dat <- lapply(dat, function(x) { x["X2"] <- NULL; x })
# Setting column names
colnames <- c("Alpha", "Beta", "Gamma")
dat <- lapply(dat, setNames, colnames)
# Merging dataframes
df <- do.call(rbind,dat)
So I've just found that with my sample data this goes smoothly.
I had to reopen the list in View mode to see the changes I made. That doesn't change the fact that when I write to csv and reopen the file, all the data cleaning is missing (I haven't tried this with my sample data).
I am wondering if it's because I've changed the merge?
# My merge when I wrote this question:
df <- do.call("rbindlist", dat)
# My merge now:
df <- do.call(rbind,dat)
When I use my real data it doesn't go as smoothly, so I guess the sample data is bad. I don't know what I'm doing wrong, so I can't give better sample data.
The message I get when merging with rbind:
error in rbind(deparse.level ...) numbers of columns of arguments do not match
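One possible way to narrow this error down is to compare the column counts of the list elements before binding. The sketch below uses a toy list as a hypothetical stand-in for the 43 files, and the column names in `keep` are assumptions for illustration only.

```r
# Toy list with mismatched columns (hypothetical stand-in for the real files)
dat <- list(
  data.frame(a = 1:2, b = 3:4, X1 = c("", "")),
  data.frame(a = 5:6, b = 7:8, X1 = c("", ""), X2 = c(NA, NA))
)

# Column count per element; rbind needs these to be equal
col_counts <- sapply(dat, ncol)

# Keep only the columns every data frame is supposed to have
keep <- c("a", "b")
cleaned <- lapply(dat, function(x) x[, keep, drop = FALSE])

# lapply returns a new list and never changes `dat` in place,
# so the cleaned result has to be assigned and then bound
df <- do.call(rbind, cleaned)
```

If `col_counts` still differs between elements after cleaning, some file has an extra or missing column, and looking at `lapply(dat, names)` should show which one.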

Building forvalues loops in R

[Working with R 3.2.2]
I have three data frames with the same variables. I need to modify the value of some variables and change the name of the variables (rename the columns). Instead of doing this data frame by data frame, I would like to use a loop.
This is the code I want to run:
#Change the values of the variables
vlist <- c("var1", "var2", "var3")
dataframe0[,vlist] <- dataframe0[,vlist]/10
dataframe1[,vlist] <- dataframe1[,vlist]/10
dataframe2[,vlist] <- dataframe2[,vlist]/10
#Change the name of the variables
colnames(dataframe0)[colnames(dataframe0)=="var1"] <- "temp_min"
colnames(dataframe0)[colnames(dataframe0)=="var2"] <- "temp_max"
colnames(dataframe0)[colnames(dataframe0)=="var3"] <- "prep"
colnames(dataframe1)[colnames(dataframe1)=="var1"] <- "temp_min"
colnames(dataframe1)[colnames(dataframe1)=="var2"] <- "temp_max"
colnames(dataframe1)[colnames(dataframe1)=="var3"] <- "prep"
colnames(dataframe2)[colnames(dataframe2)=="var1"] <- "temp_min"
colnames(dataframe2)[colnames(dataframe2)=="var2"] <- "temp_max"
colnames(dataframe2)[colnames(dataframe2)=="var3"] <- "prep"
I know the logic to do it with programs like Stata, with a forvalues loop:
#Change the values of the variables
forvalues i=0/2 {
dataframe`i'[,vlist] <- dataframe`i'[,vlist]/10
#Change the name of the variables
colnames(dataframe`i')[colnames(dataframe`i')=="var1"] <- "temp_min"
colnames(dataframe`i')[colnames(dataframe`i')=="var2"] <- "temp_max"
colnames(dataframe`i')[colnames(dataframe`i')=="var3"] <- "prep"
}
But, I am not able to reproduce it in R. How should I proceed? Thanks in advance!
I would work with a list of data frames; you can still split it afterwards if really needed:
df1 <- data.frame("id"=1:10,"var1"=11:20,"var2"=11:20,"var3"=11:20,"test"=1:10)
df2 <- df1
df3 <- df1
dflist <- list(df1,df2,df3)
for (i in seq_along(dflist)) {
dflist[[i]]['test'] <- dflist[[i]]['test']/10
colnames( dflist[[i]] )[ colnames(dflist[[i]]) %in% c('var1','var2','var3') ] <- c('temp_min','temp_max','prep')
# optionally reassign df1-3 to their list value:
# assign(paste0("df",i),dflist[[i]])
}
The benefit of using a list is that you can access the data frames more easily in a programmatic way.
I changed your code from three calls to one: colnames() returns a vector, so you can subset it and replace the names in one pass. This assumes var1 to var3 always appear in the same order.
Addendum: if you want a single dataset at the end, you can use do.call(rbind, dflist) or, with the data.table package, rbindlist(dflist).
More details on working with lists of data.frames are in Gregor's answer here
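As a variation on the loop above, the same transformation can be wrapped in a function and applied with lapply. This is a sketch under assumptions: `rescale_and_rename` is a hypothetical helper, and the sample values are invented so the division by 10 is easy to check.

```r
vlist     <- c("var1", "var2", "var3")
new_names <- c("temp_min", "temp_max", "prep")

# Hypothetical helper: divides the listed columns by 10, then renames them.
# match() looks each old name up by position, so column order does not matter.
rescale_and_rename <- function(d) {
  d[vlist] <- d[vlist] / 10
  names(d)[match(vlist, names(d))] <- new_names
  d
}

# Invented sample data standing in for dataframe0, dataframe1, dataframe2
dataframe0 <- data.frame(id = 1:3,
                         var1 = c(10, 20, 30),
                         var2 = c(40, 50, 60),
                         var3 = c(70, 80, 90))
frames <- list(dataframe0 = dataframe0,
               dataframe1 = dataframe0,
               dataframe2 = dataframe0)

# Apply the same steps to every data frame; names of the list are preserved
frames <- lapply(frames, rescale_and_rename)
```

Keeping the list named means each result can still be reached as `frames$dataframe1`, which is roughly what the Stata-style `dataframe`i'` indexing was after.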

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to give it the same name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there an R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})
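Building on the get() idea, a named list makes it possible to reuse each data frame's name when saving. The sketch below makes several assumptions: the two data frames are invented, only the "remove empty columns" step is shown, and saveRDS with a temporary directory is swapped in for the save() call and output folder from the question.

```r
# Invented sample data frames; the real ones would already exist in the workspace
Beverages    <- data.frame(date = 1:3, x = c(1, 2, 3), empty = NA)
Construction <- data.frame(date = 1:3, y = c(4, 5, 6), empty = NA)

# Fetch the data frames by name into a named list
frame_names <- c("Beverages", "Construction")
frames <- mget(frame_names, envir = environment())

# Hypothetical cleaning step: drop columns that contain only NA
drop_empty_cols <- function(d) d[, colSums(is.na(d)) < nrow(d), drop = FALSE]
cleaned <- lapply(frames, drop_empty_cols)

# Save each cleaned data frame under its own name (temporary directory assumed)
out_dir <- tempdir()
for (nm in names(cleaned)) {
  saveRDS(cleaned[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}
```

Because the list keeps the original names, the loop over `names(cleaned)` answers the "same name as X" problem: the name is available as a string while the data frame is available as `cleaned[[nm]]`.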

Apply function between lists of data frames

I have 2 lists (my.listA and my.listB) in R including 3 data frames each:
da1 <- data.frame(x=c(1,2,3),y=c(4,5,6))
da2 <- data.frame(x=c(3,2,1),y=c(6,5,4))
da3 <- data.frame(x=c(5,4,1),y=c(8,5,7))
my.listA <- list(da1, da2, da3)
db1 <- data.frame(z=c(2))
db2 <- data.frame(z=c(3))
db3 <- data.frame(z=c(4))
my.listB <- list(db1, db2, db3)
I am trying to obtain a new list (my.listAB) so that it includes 3 data frames showing the element by element product of the data frames in my.listA and my.listB paired according to the number at the end of the data frames' names, that is, the product of elements in da1 by elements in db1, the product of da2 by db2 and the product of da3 by db3.
This would be my desired result:
dab1 <- data.frame(x=c(2,4,6),y=c(8,10,12))
dab2 <- data.frame(x=c(9,6,3),y=c(18,15,12))
dab3 <- data.frame(x=c(20,16,4),y=c(32,20,28))
my.listAB <- list(dab1 , dab2 , dab3)
I tried the following, but it did not work:
for (i in 1:3) {
my.listAB <- my.listA[[i]]*my.listB[[i]]
};
Ideally someone could guide me towards a solution using the lapply function?
Many thanks!
You can use
l <- lapply(1:3, function(x) my.listA[[x]] * my.listB[[x]]$z)
or
l <- list()
for (x in 1:3)
l[[x]] <- my.listA[[x]] * my.listB[[x]]$z
In addition to the lapply and for loop options suggested by @lukeA in the comments, you could also try Map
r1 <- Map(`*`, my.listA,unlist(my.listB))
identical(r1, my.listAB)
#[1] TRUE
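A slightly more explicit variant of the Map call extracts the multiplier inside the function rather than unlisting the second list first. This is a sketch on a shortened version of the question's data (two pairs instead of three).

```r
# Shortened version of the question's lists
my.listA <- list(
  data.frame(x = c(1, 2, 3), y = c(4, 5, 6)),
  data.frame(x = c(3, 2, 1), y = c(6, 5, 4))
)
my.listB <- list(data.frame(z = 2), data.frame(z = 3))

# Pair the lists element by element; b$z is the single multiplier for each pair
my.listAB <- Map(function(a, b) a * b$z, my.listA, my.listB)
```

Pulling out `b$z` avoids the dimension mismatch that arises when a 3 x 2 data frame is multiplied directly by a 1 x 1 data frame, which is why the original for loop failed.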
