Separating data frame based on column values - r

I am having a bit of trouble with trying to script a code in R so that it separates a data frame based on the character in a data frame column without manually specifying a subset command. Below is the script for reproduction in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially what I am trying is to separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store them in new and different data frames. I have been trying to use the following command
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores them in lists, and I don't know how to extract the lists individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible as I need the script to be dynamic and relative) or does anyone know how to use the last command shown above to separate the lists into individual data frames? Also if possible, is it possible to name the data frame after the model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!

list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
Model Receptor.name X(m.) Y(m.) Z(m.) nox PM10 PM2.5
a Model_A R1 358723 171704 1 36.8185 4.02227 1.38895
b Model_A R2 358723 171704 2.6 36.4473 4.01161 1.37479
c Model_A R3 358723 171704 5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
Why a list? Lists allow you the use of lapply - a looping function that applies an operation over every list element. A quick example: Let's say you'd want to get a frequency table for both PM variables in your data for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NamesProper (which isa factor variable)
All I want the loop to do is each iteration, generate a new data frame with the name "NamesList[name]" and I want that data frame to contain a subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple I just can;t figure out how to get r to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but incorrect in the sense that it is widely deprecated by experienced users of R. Instead what should be done is to create a single list with named elements each of which contains the data from a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.

R for loop: creating data frames using split?

I have data that I want to separate by date, I have managed to do this manually through:
tsssplit <- split(tss, tss$created_at)
and then creating dataframes for each list which I then use.
t1 <- tsssplit[[1]]
t2 <- tsssplit[[2]]
But I don't know how many splits I will need, as sometimes the og data frame may may have 6 dates to split up by, and sometimes it may have 5, etc. So I want to create a for loop.
Within the for loop, I want to incorporate this code, which connects to a function:
bscore3 <- score.sentiment(t3$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
Then I want to be able to create a new data frame that has the scores for each list.
So essentially I want the for loop to:
split the data into lists using split
split each list into a separate data frames for each different day
Come out with a score for each data frame
Put that into a new data frame
It doesn't have to be exactly like this as long as I can come up with a visualisation of the scores at the end.
Thanks!
It is not recommended to create separate dataframes in the global environment, they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating list of dataframes. You can then iterate over each dataframe in the list and apply the function on each one of them.
Using by this would look like as :
by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result
result

using rbind to combine all data sets the names of all data set start with common characters

I want to combine all rows of different data sets. The names of all data sets starts with test. All data sets have same number of observations. I know i can combine it by using rbind(). But typing the names of every data set will take a lot of time. Suggest me some better approach.
rbind(test1,test2,test3,test4)
Try first obtaining a vector of all matching objects using ls() with the pattern ^test:
dfs <- lapply(ls(pattern="^test"), function(x) get(x))
result <- rbindlist(dfs)
I am taking the suggestion by #Rohit to use rbindlist to make our lives easier to rbind together a list of data frames.
Second line of above code will work only if data sets are in data.table form or data frame form. IF data sets are in xts/zoo format then one have to make slight improvement use do.call() function.
## First make a list of all your data sets as suggested above
list_xts <- lapply(ls(pattern="^test"), function(x) get(x))
## then use do call and rbind()
xts_results<-do.call(rbind,list_xts)

How to create multiple data frames with separate names in r based on an input variable

I am very new to r and am working on developing code to simulate a set of equations for my job.
I want to create multiple empty data frames based on an input variable. That is, if n=4, I want to create 4 separate data frames with separate names such as x1, x2, x3, x4. If n=10, i want 10 data frames, etc.
I want to be able to see these data frames in the global environment (that open up looking similar to an excel sheet).
Code
To make the answer generic, since that seems to be what you want, I would make a list, then populate that list with dataframes.
my_list <- list()
for (i in seq(10)) {
my_list[[i]] = data.frame(x=runif(100), y=rnorm(100))
}
Explanation
Upon execution of this code, you will have a list with 10 items, labelled 1 - 10. Each of those items is its own dataframe, with 2 columns: one containing 100 uniform random numbers, and another containing 100 Gaussian random numbers (chosen from a standard normal distribution).
If you want to access, say, the third dataframe in the list, you'd simply type
my_list[[3]]
to get the contents of that dataframe.
(Lists use the double bracket notation in R, and you just have to "get used to it". It's fairly easy to figure out how to use them properly, though. E.g., my_list[3] will return a list with only 1 item in it, which is that third dataframe. But my_list[[3]] - notice the extra bracket - will return a dataframe, the third dataframe.)
Use R Studio to run R and to get an Excel-spreadsheety look at your data:
View (name.of.your.list [[n]])
where name.of.your.list is the name of your list of data.frames, and n is the n'th data.frame you want to view.
If you will have a list of lists of data.frames, then just keep tagging [[n's]]
View (name.of.your.list [[n]][[n2]])
As an example:
dat.all = list ()
dat.all [[1]] = list ()
dat.all [[1]][[1]] = data.table ("lol" = 1:5, "whatever" = 6:10)
View (dat.all [[1]][[1]])
Also, if you are new to R like me, then I suggest learning data.table instead of data.frame, it is much more powerful, and will probably prevent you from having to make lists of lists of data.frames.
Cheers.

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have large, ragged, data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data(characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])

Resources