Merging DF in a list individually to another DF - r

I have a list of dataframes with varying dimensions filled with data and row/col names of Countries. I also have a "master" dataframe outside of this list that is blank with square dimensions of 189x189.
I wish to merge each dataframe inside the list individually on top of the "master" sheet perserving the square matrix dimensions. I have been able to achieve this individually using this code:
rownames(Trade) <- Trade$X
Trade <- Trade[, 2:length(Trade)]
Full[row.names(Trade), colnames(Trade)] <- Trade
With "Full" being my master sheet and "Trade" being an individual df.
I have attempted to create a function to apply this process to a list of dataframes but am unable to properly do this.
Function and code in question:
DataMerge <- function(df) {
rownames(df) <- df$Country
Trade <- Trade[, 2:length(Trade)]
Country[row.names(df), colnames(df)] <- df
}
Applied using :
DataMergeDF <- lapply(TradeMatrixDF, DataMerge)
filenames <- paste0("Merged",names(DataMergeDF), ".csv")
mapply(write.csv, DataMergeDF, filenames)
Country <- read.csv("FullCountry.csv")
However what ends up happening is that the data does not end up merging properly / the dimensions are not preserved.
I asked a question pertaining to this issue a few days ago (CSV generated from not matching to what I have in R) , but I have a suspicion that I am running into this issue due to my use of "lapply". However, I am not 100% sure.

If we return the 'Country' at the end it should work. Also, better to pass the other data as an argument
DataMerge <- function(Country, df) {
rownames(df) <- df$Country
df <- df[, 2:length(df)]
Country[row.names(df), colnames(df)] <- df
Country
}
then, we call the function as
DataMergeDF <- lapply(TradeMatrixDF, DataMerge, Country = Country)

Related

Loop in R is only adding the first and last set of data to dataframe

I'm trying to loop through an API, to get data from specific sitecodes and merge it into one dataframe, and for some reason the following code is only getting the original dataframe (RoyalLondon_List) and the last sensor (CDP0004)
SiteCodes_all <- c('CLDP0002', 'CLDP0003', 'CLDP0004')
for(i in 1:length(SiteCodes_all)) {
allsites <- paste0(Base,Node,SiteCodes_all[i],'/',Pollutant,StartTime,EndTime,Averaging,Key)
temp_raw <- GET(allsites)
temp_list <- fromJSON(rawToChar(temp_raw$content))
df <- rbind(RoyalLondon_List, temp_list)
}
Any help appreaciated!
The above code combines the previous data and not the looped API url
Try use this
df <- RoyalLondon_List
for(i in 1:length(SiteCodes_all)) {
allsites <- paste0(Base,Node,SiteCodes_all[i],'/',Pollutant,StartTime,EndTime,Averaging,Key)
temp_raw <- GET(allsites)
temp_list <- fromJSON(rawToChar(temp_raw$content))
df <- dplyr::bind_rows(df, temp_list)
}
dplyr::bind_rows() is a function in the dplyr package that allows you to combine multiple dataframes by appending the rows of one dataframe to the bottom of another.see here to more info about it.

Use R to add a column to multiple dataframes using lapply

I would like to add a column containing the year (found in the file name) to each column. I've spent several hours googling this, but can't get it to work. Am I making some simple error?
Conceptually, I'm making a list of the files, and then using lapply to calculate a column for each file in the list.
I'm using data from Census OnTheMap. Fresh download. All files are named thus: "points_2013" "points_2014" etc. Reading in the data using the following code:
library(maptools)
library(sp)
shps <- dir(getwd(), "*.shp")
for (shp in shps) assign(shp, readShapePoints(shp))
# the assign function will take the string representing shp
# and turn it into a variable which holds the spatial points data
My question is very similar to this one, except that I don't have a list of file names--I just want extract the entry in a column from the file name. This thread has a question, but no answers. This person tried to use [[ instead of $, with no luck. This seems to imply the fault may be in cbind vs. rbind..not sure. I'm not trying to output to csv, so this is not fully relevant.
This is almost exactly what I am trying to do. Adapting the code from that example to my purpose yields the following:
dat <- ls(pattern="points_")
dat
ldf = lapply(dat, function(x) {
# Add a column with the year
dat$Year = substr(x,8,11)
return(dat)
})
ldf
points_2014.shp$Year
But the last line still returns NULL!
From this thread, I adapted their solution. Omitting the do.call and rbind, this seems to work:
lapply(points,
function(x) {
dat=get(x)
dat$year = sub('.*_(.*)$','\\1',x)
return(dat)
})
points_2014.shp$year
But the last line returns a null.
Starting to wonder if there is something wrong with my R in some way. I tested it using this example, and it works, so the trouble is elsewhere.
# a dataframe
a <- data.frame(x = 1:3, y = 4:6)
a
# make a list of several dataframes, then apply function
#(change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {
names(x) <- c("a", "b")
return(x)})
my.list
After some help from this site, my final code was:
#-------takes all the points files, adds the year, and then binds them together
points2<-do.call(rbind,lapply(ls(pattern='points_*'),
function(x) {
dat=get(x)
dat$year = substr(x,8,11)
dat
}))
points2$year
names(points2)
It does, however, use an rbind, which is helpful in the short term. In the long term, I will need to split it again, and use a cbind, so I can substract two columns from each other.
I use the following Code:
for (i in names.of.objects){
temp <- get(i)
# do transformations on temp
assign(i, temp)
}
This works, but is definitely not performant, since it does assignments of the whole data twice in a call by value manner.

R - Subset a Dataframe with a Programmatically built Formula

I'm working with a large data frame that is pulled from a data lake which I need to subset according to multiple different columns and run an analysis on. The basic subsettings come from an external Excel file which I read in and generate all possible combinations of. I want something to loop through each of these columns and subset my data accordingly.
A few of the subsettings follow a similar form to:
data_settings <- data.frame(country = rep(c('DE','RU','US','CA','BR'),6),
transport=rep(c('road','air','sea')),
category = rep(c('A','B')))
And my data lake extract has a form like:
df <- data.frame(country = rep(unique(data_settings$country),6),
transport = rep(unique(data_settings$transport),10),
category = rep(c('A','B'),15),
values = round(runif(30) * 10))
I need to subset the data according to each of the rows in my data_settings data frame, so I built a loop which constructs the formula according to what is in my data_settings data frame.
for(i in 1:nrow(data_settings)){
sub_string <- paste0(names(data_settings[1]), '==', data_settings[i,1])
for(j in 2:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
sub_string <- paste0(sub_string, ' & ', col," == ","'",val,"'")
}
df_sub <- subset(df, formula(sub_string))
}
This successfully builds my strings which I try to pass to formula or as.formula, but I receive an error at that point. I've tried a few different formulations without any success. In my actual case, there are thousands of combinations with different columns and values to filter against.
Thanks in advance for your help!
Try this:
merge(data_settings, df)
I worked with my previous approach a bit more today without using subset, filter, etc. and put this together which seems to do what I want well enough by filtering recursively according to the next item in the data_settings frame.
for(i in 1:nrow(data_settings)){
df_sub <- df
for(j in 1:ncol(data_settings)){
col <- names(data_settings)[j]
val <- as.character(data_settings[i,j])
df_col <- grep(col, names(df))
df_sub <- df_sub[df_sub[,df_col] == val,]
}
# Run further analysis here...
}

R for loop - appending results outside the loop

I'm taking an introductory R-programming course on Cousera. The first assignment has us evaluating a list of hundreds of csv files in a specified directory ("./specdata/). Each csv file, in turn, contains hundreds of records of sample pollutant data in the atmosphere - a date, a sulfite sample, a nitrate sample, and an ID of that identifies the sampling location.
The assignment asks us to create a function that takes the pollutant an id or range of ids for sampling location and returns a sample mean, given the supplied arguments.
My code (below) uses a for loop to use the id argument to only read the files of interest (seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of the csv file to a variable. I then make that variable a data frame and use rbind to append to it the file read in during each loop. I use na.omit to remove the missing files from the variable. Then I use rbind to append the result of each iteration of the loop to variable. When I print the data frame variable within the loop, I can see the entire full list, subgrouped by id. But when I print the variable outside the loop, I only see the last element in the id vector.
I would like to create a consolidated list of all records matching the id argument within the loop, then pass the consolidate list outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? Seems like it could work. Any help would be most appreciated. I searched StackOverflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
id1 <- id[1]
id2 <- id[length(id)]
for (i in id1:id2) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df <- rbind(df)
print(df)
}
# would like a consolidated list of records here to to do more stuff, e.g. filter on pollutant and calcuate mean
}
You can just define the data frame outside the for loop and append to it. Also you can skip some steps in between... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
x <- list.files(path=directory, pattern="*.csv")
x <- paste(directory, x, sep="")
df_final <- data.frame()
for (i in id) {
df <- read.csv(x[i], header = TRUE)
df <- data.frame(df)
df <- na.omit(df)
df_final <- rbind(df_final, df)
print(df)
}
# would like a consolidated list of records here to to do more stuff, e.g. filter on pollutant and calcuate mean
return(df_final)
}
by only calling df <- rbind(df) you are effectively overwriting df everytime. You can fix this by doing something like this:
df = data.frame() # empty data frame
for(i in 1:10) { # for all you csv files
x <- mean(rnorm(10)) # some new information
df <- rbind(df, x) # bind old dataframe and new value
}
By the way, if you know how big df will be beforehand then this is not the proper way to do it.

R: Adress objects deep inside lists with filter commands inside functions/loops (ExtremeBounds package)

I am using the ExtremeBounds package which provides as a result a multi level list with (amongst others) dataframes at the lowest level. I run this package over several specifications and I would like to collect some columns of selected dataframes in these results. These should be collected by specification (spec1 and spec2 in the example below) and arranged in a list of dataframes. This list of dataframes can then be used for all kind of things, for example to export the results of different specifications into different Excel Sheets.
Here is some code which creates the problematic object (just run this code blindly, my problem only concerns how to deal with the kind of list it creates: eba_results):
library("ExtremeBounds")
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,2,0.2),
var3=rnorm(30),var4=rnorm(30),var5=rnorm(30))
spec1 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4"))
spec2 <- list(y=c("var1"),
freevars=c("var2"),
doubtvars=c("var3","var4","var5"))
indicators <- c("spec1","spec2")
ebaFun <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, weights = "lri", family = binomial(logit))}
eba_results <- lapply(mget(indicators),ebaFun) #eba_results is the object in question
Manually I know how to access each element, for example:
eba_results$spec1$bounds$type #look at str(eba_results) to see the different levels
So "bounds" is a dataframe with identical column names for both spec1 and spec2. I would like to collect the following 5 columns from "bounds":
type, cdf.mu.normal, cdf.above.mu.normal, cdf.mu.generic, cdf.above.mu.generic
into one dataframe per spec. Manually this is simple but ugly:
collectedManually <-list(
manual_spec1 = data.frame(
type=eba_results$spec1$bounds$type,
cdf.mu.normal=eba_results$spec1$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec1$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec1$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec1$bounds$cdf.above.mu.generic),
manual_spec2= data.frame(
type=eba_results$spec2$bounds$type,
cdf.mu.normal=eba_results$spec2$bounds$cdf.mu.normal,
cdf.above.mu.normal=eba_results$spec2$bounds$cdf.above.mu.normal,
cdf.mu.generic=eba_results$spec2$bounds$cdf.mu.generic,
cdf.above.mu.generic=eba_results$spec2$bounds$cdf.above.mu.generic))
But I have more than 2 specifications and I think this should be possible with lapply functions in a prettier way. Any help would be appreciated!
p.s.: A generic example to which hrbrmstr's answer applies but which turned out to be too simplistic:
exampleList = list(a=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))),
b=list(aa=data.frame(A=rnorm(10),B=rnorm(10)),bb=data.frame(A=rnorm(10),B=rnorm(10))))
and I want to have an object which collects, for example, all the A and B vectors into two data frames (each with its respective A and B) which are then a list of data frames. Manually this would look like:
dfa <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
dfb <- data.frame(A=exampleList$a$aa$A,B=exampleList$a$aa$B)
collectedResults <- list(a=dfa, b=dfb)
There's probably a less brute-force way to do this.
If you want lists of individual columns this is one way:
get_col <- function(my_list, col_name) {
unlist(lapply(my_list, function(x) {
lapply(x, function(y) { y[, col_name] })
}), recursive=FALSE)
}
get_col(exampleList, "A")
get_col(exampleList, "B")
If you want a consolidated data.frame of indicator columns this is one way:
collect_indicators <- function(my_list, indicators) {
lapply(my_list, function(x) {
do.call(rbind, c(lapply(x, function(y) { y[, indicators] }), make.row.names=FALSE))
})[[1]]
}
collect_indicators(exampleList, c("A", "B"))
If you just want to bring the individual data.frames up a level to make it easier to iterate over to write to a file:
unlist(exampleList, recursive=FALSE)
Much assumption about the true output format is being made (the question was a bit vague).
There is a brute force way which works but is dependent on several named objects:
collectEBA <- function(x){
df <- paste0("eba_results$",x,"$bounds")
df <- eval(parse(text=df))[,c("type",
"cdf.mu.normal","cdf.above.mu.normal",
"cdf.mu.generic","cdf.above.mu.generic")]
df[is.na(df)] <- "NA"
df
}
eba_export <- lapply(indicators,collectEBA)
names(eba_export) <- indicators

Resources