I am having trouble writing R code that splits a data frame by the values in one of its columns without manually specifying a subset command for each value. Below is a reproducible script in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially, what I am trying to do is separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store each part in its own new data frame. I have been trying to use the following command:
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores the pieces in a list, and I don't know how to extract the list elements individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible, as I need the script to be dynamic), or does anyone know how to use the command shown above to turn the list elements into individual data frames? Also, is it possible to name each data frame after its model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
Model Receptor.name X(m.) Y(m.) Z(m.) nox PM10 PM2.5
a Model_A R1 358723 171704 1 36.8185 4.02227 1.38895
b Model_A R2 358723 171704 2.6 36.4473 4.01161 1.37479
c Model_A R3 358723 171704 5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
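To pull a single data frame back out of that list, index it by name. Below is a minimal sketch on a small stand-in for the question's data (only a few of its rows and columns; the names model_a and model_b are just illustrative):

```r
# Small stand-in for the question's data (a subset of its rows and columns).
df <- data.frame(
  Model = c("Model_A", "Model_A", "Model_B", "Model_C"),
  Receptor.name = c("R1", "R2", "R1", "R1"),
  PM10 = c(4.02227, 4.01161, 2.44756, 2.23880),
  stringsAsFactors = FALSE
)

dflist <- split(df, df$Model)

# The list elements are named after the values in df$Model.
names(dflist)

# Pull out one data frame by name; both forms return a data frame.
model_a <- dflist[["Model_A"]]
model_b <- dflist$Model_B
```

Because the names come from the data, nothing has to be hard-coded when new models appear; names(dflist) tells you what is available.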
Why a list? Lists let you use lapply, a looping function that applies an operation to every list element. A quick example: let's say you want a frequency table for both PM variables in each of the three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.
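As a rough illustration of reusing lapply output, here is a sketch on stand-in data. One caveat: the question's script builds each row with c() from strings and numbers, so the numeric columns end up stored as text; the sketch assumes you convert them first. The object names pm10_values and pm10_means are made up for the example.

```r
# Stand-in data with PM10 stored as text, as it would be after the question's script.
df <- data.frame(
  Model = c("Model_A", "Model_A", "Model_B", "Model_B"),
  PM10 = c("4.02227", "4.01161", "2.44756", "2.42099"),
  stringsAsFactors = FALSE
)

# Convert the text column to numbers before doing arithmetic on it.
# as.character() first keeps this safe if the column is a factor.
df$PM10 <- as.numeric(as.character(df$PM10))

dflist <- split(df, df$Model)

# Store the result of one lapply call...
pm10_values <- lapply(dflist, function(x) x$PM10)

# ...and feed it into another apply-style call: one mean per model.
pm10_means <- vapply(pm10_values, mean, numeric(1))
```

The result pm10_means is a named numeric vector with one entry per model, so it can be indexed by model name just like the list.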
I have this lookup DATA FRAME:
VAR1=c('X1')
VAR2=c('X2')
VAR3=c('X3')
VAR4=c('X4')
VAR5=c('NA')
df<-data.frame(VAR1,VAR2,VAR3,VAR4,VAR5)
which I need to cross reference with a main DATA FRAME so that I select variables X1 to X5. Sometimes, like the example, column 5 is simply NA.
I would typically use something like the below:
main_data <-subset(main_data, select=c(df[1,1],df[1,2],df[1,3]))
main_data <-subset(main_data, select=c(df[1,1:max(col(df))]))
but there are NAs, and moreover I will have a dynamic count of columns and these don't work.
The other idea is to use grepl on main_data but I cannot get it to work with more than one variable at a time:
main_data <- main_data[, grepl(paste0(df[1:max(col(df))], colnames(main_data)))]
I am certain there is a straightforward way to do this but I cannot find it.
With Roman's help I got it:
df<-as.vector(unlist(df))
main_data<-main_data[, names(main_data) %in% df]
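A slightly more defensive variant of the same idea, as a sketch: it drops missing lookup entries explicitly and uses drop = FALSE so the result stays a data frame even when only one column matches. The contents of main_data and the name lookup are invented for the example, and the missing entry is assumed to be a real NA rather than the string 'NA'.

```r
# Hypothetical main data; X9 is a column we do not want.
main_data <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9, X9 = 10:12)

# Lookup with one missing entry (a real NA here, not the string "NA").
lookup <- data.frame(
  VAR1 = "X1", VAR2 = "X2", VAR3 = "X3", VAR5 = NA_character_,
  stringsAsFactors = FALSE
)

# Flatten the lookup row to a character vector and drop missing entries.
wanted <- unlist(lookup[1, ], use.names = FALSE)
wanted <- wanted[!is.na(wanted)]

# Keep only names that actually exist in main_data, in lookup order.
keep <- intersect(wanted, names(main_data))
selected <- main_data[, keep, drop = FALSE]
```

Using intersect() means lookup entries that do not exist in main_data (including a literal 'NA' string) are silently skipped rather than raising an "undefined columns selected" error.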
I am a novice R user and am trying to come to terms with the 'apply' family of functions, which I now need to use because of the complexity of my data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set, and I can split the data frame into two sets from the interlacing rows, but I am stumped by the syntax for writing a function that applies the following operation to every element of the list. Should 'lapply' with 'seq_along' do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
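Note that the one-liner above drops any column containing at least one empty string. If you want to drop only columns that are empty in every row of a set, as in the question's per-set code, a sketch like the following may be closer. The helper name drop_empty_cols is made up for the example.

```r
# Rebuild the question's example data (as character columns).
myDF <- as.data.frame(
  rbind(c("v1", as.character(1:10)),
        c("v1", letters[1:10]),
        c("v2", c(as.character(1:6), rep("", 4))),
        c("v2", c(letters[1:6], rep("", 4)))),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop columns whose values are all empty strings.
drop_empty_cols <- function(x) {
  is_empty <- vapply(x, function(col) all(col == ""), logical(1))
  x[, !is_empty, drop = FALSE]
}

# Apply the helper to every set; the result is a list named "v1" and "v2".
trimmed <- lapply(split(myDF, myDF$V1), drop_empty_cols)
```

No seq_along is needed here: lapply passes each data frame in the list to the function directly.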
I have a list of 26 data frames called score.list, and I have written code that tells me which data frames are not complete. This code gives me the name of each such data frame within the list, but it doesn't tell me its index in the list.
For example, the code tells me that a data frame named p08 and another named p18 are not complete. Each therefore needs to be combined with the data frame that follows it. So if the data frame named p08 is score.list[[8]], it should be combined with score.list[[9]]: the newly made data frame should replace score.list[[8]], and score.list[[9]] should then be deleted from the list.
I'm guessing something like the code below may work to combine and replace a data frame, but I'm not sure whether it does:
score.list[[8]] <- rbind(score.list[[8]], score.list[[9]])
This is what I tried, but it didn't quite work because it didn't make a new data frame after combining. I also get this error message:
Error in if (names(score.list[i]) == names(score.list[i + 1])) { :
missing value where TRUE/FALSE needed
for(i in 1:length(score.list)){
if(names(score.list[i])==names(score.list[i+1])) {
a <- score.list[i]
b <- score.list[i+1]
score.list[[i]] <- rbind(a, b)
print(score.list[[i]])
}
}
The reason I wrote if(names(score.list[i])==names(score.list[i+1])) is that the data frames that need to be combined have the same name in the list. A data frame that is not complete has the same name as the one that follows it, so the name of score.list[[8]] is the same as the name of score.list[[9]].
Please let me know if there are confusing parts.. I tried to write it as clear as I can. Thank you!
This should help you:
## a list example
score.list <-
list(l1= data.frame(x=1),
l2=data.frame(x=2),
l3= data.frame(x=3))
## use %in% to select some elements
## here I am selecting list l1 and l3
do.call(rbind,
score.list[names(score.list) %in% c('l1','l3')])
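Since the question says the pieces to be combined share a name, one possible way to merge them without a loop is to group the list positions by name and rbind each group. This is only a sketch on made-up data, and it assumes that every pair of same-named elements should be merged (not only adjacent ones).

```r
# Made-up list in which p08 appears twice (an incomplete frame and its continuation).
score.list <- list(
  p07 = data.frame(x = 1),
  p08 = data.frame(x = 2),
  p08 = data.frame(x = 3),
  p09 = data.frame(x = 4)
)

# Group positions by name, keeping the original order of first appearance.
grp <- factor(names(score.list), levels = unique(names(score.list)))
positions <- split(seq_along(score.list), grp)

# rbind the data frames in each group; single-element groups pass through unchanged.
merged <- lapply(positions, function(idx) {
  do.call(rbind, unname(score.list[idx]))
})
```

The result has one element per distinct name, so the duplicate entry disappears without any explicit deletion. This also avoids indexing score.list[i + 1] past the end of the list, which is what produces the "missing value where TRUE/FALSE needed" error in the loop.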
Is it possible to append a column to data frame in the following scenario?
dfWithData <- data.frame(start=c(1,2,3), end=c(11,22,33))
dfBlank <- data.frame()
How can I append the column start from dfWithData to dfBlank?
It looks like the data should be added when the data frame is being initialized. I can do this:
dfBlank <- data.frame(dfWithData[1])
but I am more interested if it is possible to append columns to an empty (but inti)
I would suggest simply subsetting the data frame you get back from the RODBC call. Something like:
df[,c('A','B','D')]
perhaps, or you can also subset the columns you want with their numerical position, i.e.
df[,c(1,2,4)]
dfBlank[1:nrow(dfWithData),"start"] <- dfWithData$start
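An alternative sketch: a plain data.frame() has zero rows, which is why a direct dfBlank$start <- dfWithData$start fails with a replacement-length error. If you create the empty data frame with the right number of rows up front, ordinary column assignment works.

```r
dfWithData <- data.frame(start = c(1, 2, 3), end = c(11, 22, 33))

# A data frame with no columns but as many rows as dfWithData.
dfBlank <- data.frame(row.names = seq_len(nrow(dfWithData)))

# With matching row counts, ordinary column assignment works.
dfBlank$start <- dfWithData$start
```

Further columns can then be appended the same way, for example dfBlank$end <- dfWithData$end.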