R data.tree recurse - r

I'm having trouble understanding how to create a data.tree from a data frame. I have a data frame with two columns:
EmpID
SupervisorEmpID
Code:
OfficeOrg <- read_csv("hierarchy")
OfficeOrg$pathString <- paste("Root",
OfficeOrg$SupervisorEmpID, OfficeOrg$EmpID, sep = "/")
RptTree <- as.Node(OfficeOrg)
The sample data has 25 rows. By inspecting the data, I can see that there are five levels. That is to say, I expect the RptTree object to show EmpIDs grouped under SupervisorEmpID to a depth of five.
Root
|_TopLevelSupervisor
  |_SecondLevelSupervisor
    |_ThirdLevelSupervisor
      |_Employee
Instead, I see only three levels. The root, one for each SupervisorEmpID and the employees.
Root
|_Supervisor
  |_Employee
The tree isn't being built by recursing through all levels.
Usually this means that I'm staring something in the face, but not recognizing it.
What am I missing?

After searching off and on for several days, I found the solution to my problem in this Stack Overflow post:
data.tree nodes through Id's
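For readers who want to see why the original code stops at three levels: the pathString in the question only ever contains Root, one supervisor, and one employee, so as.Node has no way to know about the levels above. Below is a minimal base R sketch, not taken from the linked post, that walks up the supervisor chain to build the full path for each employee. The sample data and the helper name build_path are hypothetical, and the sketch assumes the top-level supervisor has NA as SupervisorEmpID and that the hierarchy contains no cycles.

```r
# Hypothetical sample data: E1 is the top-level supervisor (no supervisor of its own).
OfficeOrg <- data.frame(
  EmpID           = c("E1", "E2", "E3", "E4", "E5"),
  SupervisorEmpID = c(NA,   "E1", "E2", "E3", "E3"),
  stringsAsFactors = FALSE
)

# Walk up the supervisor chain for one employee and return the full path.
# Assumes no cycles; a cycle in the data would make this loop forever.
build_path <- function(id, emp, sup) {
  path <- id
  repeat {
    parent <- sup[match(id, emp)]
    if (is.na(parent)) break        # reached the top (or an unknown ID)
    path <- c(parent, path)
    id <- parent
  }
  paste(c("Root", path), collapse = "/")
}

OfficeOrg$pathString <- vapply(
  OfficeOrg$EmpID, build_path, character(1),
  emp = OfficeOrg$EmpID, sup = OfficeOrg$SupervisorEmpID,
  USE.NAMES = FALSE
)

print(OfficeOrg$pathString)
# With data.tree installed, as.Node(OfficeOrg) can then build the tree from pathString.
```

The data.tree package also provides FromDataFrameNetwork for parent/child tables, which may avoid building pathString by hand; check its documentation for the expected column order.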

Related

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors and dataframe B which classifies these GO descriptors into larger groups.
Example of dataframe A
dfA
Example of dataframe B
dfB
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Categories associated with their specific GO_IDs -- bonus points if it removes duplicate hits on the GO_Categories.
Looking something like this...
Example of Ideal Solution
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of Acceptable Solution
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
mutate(GO_Cat_1 = c('No', 'Yes')[1+str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay, however, it does return an error along the lines of
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!
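The length warning comes from str_detect recycling the vector of patterns against the rows of dfA instead of testing every pattern against every row. One possible way around it, sketched here in base R, is to test each row against all reference IDs and collect the unique categories that match. The sample data is hypothetical, as the original examples were images, and the sketch assumes that GO_IDs in dfA is a single delimited string per gene and that dfB has columns GO_IDs and GO_Category.

```r
# Hypothetical stand-ins for the two data frames.
dfA <- data.frame(
  GeneID = c("g1", "g2", "g3"),
  GO_IDs = c("GO:0000001; GO:0000002", "GO:0000003", "GO:0000009"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  GO_IDs      = c("GO:0000001", "GO:0000002", "GO:0000003"),
  GO_Category = c("CatX", "CatX", "CatY"),
  stringsAsFactors = FALSE
)

# For each gene, find which reference IDs occur in its GO_IDs string,
# then keep each matching category only once.
hits <- lapply(dfA$GO_IDs, function(ids) {
  matched <- vapply(dfB$GO_IDs, grepl, logical(1), x = ids, fixed = TRUE)
  unique(dfB$GO_Category[matched])
})

# Collapse the categories into one string per gene ("" when nothing matched).
dfA$GO_Category <- vapply(hits, paste, character(1), collapse = "; ")
print(dfA)
```

This loops over every row and every reference ID, so on more than 500,000 genes it may be slow; splitting the GO_IDs into one row per ID and merging against dfB would likely scale better.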

Extracting different vectors from a single column of data (in R)

I have a small problem, which I don't think is too hard, but I couldn't find any answer here (maybe I phrased my research wrong so please excuse me if the question has already been asked!)
I am importing data from an excel sheet which is split in two columns as in the following picture:
Now, I am trying to import all the data in the second column to my R script, but by splitting it into different vectors: one vector for category A, one for category B, etc... by keeping the data points in the order they are in the file (because as it happens, they are in chronological order).
Now, the categories each have a different number of elements, however, they are ordered alphabetically (ie you'll never find an A in the B's, for example). So I guess that makes it easier, but I'm still a novice with R and I don't really know how to proceed without getting really messy with the code and I know there's probably a simple way of doing it.
Does anyone have an idea on how to treat this nicely please? :)
We can use split in base R to return a list of vectors of 'Data' based on the unique values in 'Category'
lst1 <- split(df1$Data, df1$Category)
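As a small illustration of the split approach, here is a sketch with hypothetical data; the column names Category and Data follow the answer above, and the values are made up. Within each group, split keeps the elements in their original order, so the chronological order is preserved.

```r
# Hypothetical data: categories in alphabetical blocks, values in chronological order.
df1 <- data.frame(
  Category = c("A", "A", "B", "B", "B", "C"),
  Data     = c(1.2, 3.4, 5.6, 7.8, 9.0, 2.1),
  stringsAsFactors = FALSE
)

# One vector per category, collected in a named list.
lst1 <- split(df1$Data, df1$Category)

# Individual vectors are then available by name.
vecA <- lst1$A
vecB <- lst1[["B"]]
print(lst1)
```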

Rstudio - how to write smaller code

I'm brand new to programming and am picking up RStudio as a stats tool.
I have a dataset which includes multiple questionnaires divided by weeks, and I'm trying to organize the data into meaningful chunks.
Right now this is what my code looks like:
w1a=table(qwest1,talm1)
w2a=table(qwest2,talm2)
w3a=table(quest3,talm3)
Where quest and talm are the names of the variable and the number denotes the week.
Is there a way to compress all those lines into one line of code so that I could make w1a,w2a,w3a... each their own object with the corresponding questionnaire added in?
Thank you for your help, I'm very new to coding and I don't know the etiquette or all the vocabulary.
This might do what you wanted (but not what you asked for):
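# SIMPLIFY = FALSE keeps the result a list even when all tables have the same dimensions
tbl_list <- mapply(table, list(qwest1, qwest2, quest3),
                   list(talm1, talm2, talm3), SIMPLIFY = FALSE)
names(tbl_list) <- c('w1a', 'w2a','w3a')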
You are committing a fairly typical new-R-user error in creating multiple similarly named and structured objects but not putting them in a list. This is my effort at pushing you in that direction. Could also have been done via:
qwest_lst <- list(qwest1, qwest2, quest3)
talm_lst <- list(talm1, talm2, talm3)
tbl_lst <- mapply(table, qwest_lst, talm_lst, SIMPLIFY = FALSE)
names(tbl_lst) <- paste0('w', 1:3, 'a')
There are other ways to programmatically access objects with character vectors using get or mget.
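A minimal sketch of the name-based approach, assuming the variables are consistently named quest1..quest3 and talm1..talm3 (the values below are hypothetical): mget looks the objects up by name and returns them as a list, and Map pairs them up and always returns a list.

```r
# Hypothetical questionnaire data for three weeks.
quest1 <- c("yes", "no", "yes"); talm1 <- c(1, 1, 2)
quest2 <- c("no", "no", "yes");  talm2 <- c(2, 1, 2)
quest3 <- c("yes", "yes", "no"); talm3 <- c(1, 2, 2)

# Look the objects up by their names instead of typing each one.
quest_lst <- mget(paste0("quest", 1:3), envir = environment())
talm_lst  <- mget(paste0("talm", 1:3), envir = environment())

# Build one table per week and name the results w1a, w2a, w3a.
tbl_lst <- Map(table, quest_lst, talm_lst)
names(tbl_lst) <- paste0("w", 1:3, "a")

print(tbl_lst$w1a)
```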

Best approach to splitting up clusters of data

I am working on a way to split up data in a CSV file based on a timestamp.
For example, for a given object ID, check each entry's date and see if it is within a given, allowed range. So if a set of rows in the table were:
OBJECT ID - Info - Date
obj1 xyz 1/1/12
obj1 xyw 1/2/12
obj1 cya 1/3/12
obj1 abc 2/1/12
...
In this example, the fourth entry is well outside of the area of time that the other entries are in. Therefore, my desired behavior is for a script to assign that entry to a new object, say 'obj2' for example, such that it is separated from data within its own cluster. Note that the dataset this will be applied to will be somewhat large, at the very least in the 10s of thousands, so I don't know if manual algorithms will be fast enough.
I'm using R for the moment to try to get this done using the pam and pamk functions (pam from the cluster package, pamk from fpc). This gives me a plot of the clusters (I think), but I don't know how to apply this information to the actual data.
Any thoughts or ideas on the best way to do this?
I figured out a solution using the following steps:
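library(fpc)  # provides pamk()
# Convert the timestamps to numeric seconds since the epoch
data$date <- as.numeric(as.POSIXct(data$date, format = "date_format_here"))
# Split the data using the object ID as the grouping factor
splitData <- split(data, f = data$id)
# Iterate over the split groups, appending each row's cluster number to its ID using paste
# (assumes the ID is in column 1 and dataColumnIndex points at the numeric date column)
for (i in seq_along(splitData)) {
  pamk.result <- pamk(splitData[[i]][dataColumnIndex])
  splitData[[i]][, 1] <- paste(splitData[[i]][, 1],
                               pamk.result$pamobject$clustering,
                               sep = "delimiter_here")
}
newData <- do.call(rbind, splitData)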
Anyway, this is a rough outline of how I approached the problem. Maybe this will give some ideas to others down the line.
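For comparison, here is a simpler alternative that does not use pamk at all: start a new cluster whenever the gap between consecutive entries of the same object exceeds a threshold. This is a base R sketch of a different technique from the one outlined above. The threshold max_gap_days and the column names are hypothetical, and the sample dates from the question are assumed to be month/day/year.

```r
# Sample rows from the question, with dates read as month/day/year.
data <- data.frame(
  id   = c("obj1", "obj1", "obj1", "obj1"),
  info = c("xyz", "xyw", "cya", "abc"),
  date = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03", "2012-02-01")),
  stringsAsFactors = FALSE
)
max_gap_days <- 7  # hypothetical threshold; tune to the data

# Sort so that consecutive rows within an id are in time order.
data <- data[order(data$id, data$date), ]

# Within each id, bump the cluster counter whenever the gap exceeds the threshold.
data$cluster <- ave(as.numeric(data$date), data$id, FUN = function(d) {
  cumsum(c(TRUE, diff(d) > max_gap_days))
})

# Derive the new object ID from the original ID and the cluster number.
data$new_id <- paste(data$id, data$cluster, sep = "_")
print(data)
```

Because this is a single sort plus vectorized operations, it should handle tens of thousands of rows without trouble, though it only fits cases where "cluster" means "no large gap between neighbours".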

nested for loops in R to parse csv files?

Edit: I've corrected the typo in the code (copy and paste error). I can't add an example of the csv files, as it's too complex to model in a simple example (I tried...).
I've spent hours looking through similarly titled questions to solve a for loop problem in R, and have tried a lot of different approaches, but I'm having no luck.
I have many different csv files, each of which has a set of 10 separate strings (variables) identifying a specific row (e.g., names = c("Delta values", "Scream factor", "nightmare mode")). Two rows below such a string, I need the max value of that row of data. I can create a loop that scans the csv files for a single such value using the following:
Test files: test1.csv, test2.csv, test3.csv, test4.csv
names<-list.files(pattern=".csv")
DF <- NULL
for (i in names){
  dat <- read.csv(i, header=FALSE, stringsAsFactors=FALSE)
  index <- which(dat=="Delta values", arr.ind=TRUE)
  row=as.numeric(rownames(dat)[index[1]])
  aver=dat[row+2,]
  p=max(na.omit(as.numeric(aver)))
  DF=rbind(DF, p)
  colnames(DF)=dat[index]
}
However, my problem comes in trying to generalize it, so that I get a data frame returned indicating the file each value was retrieved from as a row (not "p") and looping over the files so that I can retrieve the next several variables, while appending to the same data frame so that I end up with a data frame listing by row the filename the variable was derived from, and each variable listed in a separate column.
I'm pretty sure I need a nested loop listing the values I want to retrieve as calculated by "p" but I can't find any good examples describing how to iteratively loop using such an approach, and append the new variables to the growing data frame while staying consistent with the row numbering by file.
Please help!
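One way the generalization could look, as a sketch rather than a tested solution for the real files: iterate over the files in an outer step and over the labels in an inner step, collecting one named row per file. The demo files, the helper name max_below, and the two-label list are hypothetical; the file layout is assumed to be a label, then one row to skip, then the row of values.

```r
# Write two small hypothetical files so the sketch is self-contained.
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
writeLines(c("Delta values,,", "a,b,c", "1,5,3",
             "Scream factor,,", "a,b,c", "2,9,4"),
           file.path(demo_dir, "test1.csv"))
writeLines(c("Delta values,,", "a,b,c", "7,2,6",
             "Scream factor,,", "a,b,c", "8,1,0"),
           file.path(demo_dir, "test2.csv"))

labels <- c("Delta values", "Scream factor")
files  <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Max of the row two below the first cell equal to `label` (NA if not found).
max_below <- function(dat, label) {
  hit <- which(dat == label, arr.ind = TRUE)
  if (nrow(hit) == 0) return(NA_real_)
  vals <- suppressWarnings(as.numeric(unlist(dat[hit[1, "row"] + 2, ])))
  max(vals, na.rm = TRUE)
}

# Outer step: files. Inner step: labels. Each file yields one named numeric vector.
rows <- lapply(files, function(f) {
  dat <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)
  vapply(labels, function(l) max_below(dat, l), numeric(1))
})

# One row per file, one column per label, plus the file name.
DF <- data.frame(file = basename(files), do.call(rbind, rows),
                 check.names = FALSE)
print(DF)
```

Collecting the rows in a list and binding them once at the end avoids growing the data frame inside the loop, which keeps the row-to-file correspondence straightforward.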
