R naming convention/tricks for many columns in data.table / data.frame

I have a list of, say, n=10 data.tables (or data.frames).
Performing names(myList) returns the unique table names.
Performing names(myList[[i]]) (for i in 1:n) returns identical output for each value of i - i.e. each data.table has identical column names.
I need to merge all the data tables into one large data table, but would like to preserve the name of the list data.table for each column somehow, in order to keep an overview of where each column originated from.
Is there a trick to doing this, such as giving the columns keys? Or must one just prepend the table name to each of the columns in the final result? This would make the names pretty long in my case.
I want to avoid having to remember (or think about) which columns belong to which table. Just for comparison's sake, I'd like to run str(myBigTable) or summary(myBigTable) and see something like the grouped layout Excel shows in the screenshot [image omitted], but displayed vertically in R.
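One possible approach (a sketch only; the list myList and its contents below are made up for illustration, and this is not an accepted answer from the thread) is to prefix each table's column names with its list name before binding, so str() or summary() shows where each column came from; when stacking rows instead of columns, data.table's rbindlist(..., idcol = "source") keeps the origin in a dedicated column rather than in the names.

library(data.table)

# toy stand-in for the list described above (names and contents are made up)
myList <- list(
  tblA = data.table(x = 1:3, y = letters[1:3]),
  tblB = data.table(x = 4:6, y = letters[4:6])
)

# prefix each table's column names with its list name, then bind column-wise
prefixed <- Map(function(dt, nm) setnames(copy(dt), paste(nm, names(dt), sep = ".")),
                myList, names(myList))
myBigTable <- do.call(cbind, prefixed)

str(myBigTable)   # columns appear as tblA.x, tblA.y, tblB.x, tblB.y

# alternative when stacking rows instead of columns:
# myBigTable <- rbindlist(myList, idcol = "source")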

Related

R: Searching a column in a dataframe for matches to a reference list in another dataframe

I am trying to categorize genes with multiple GO descriptors into bins based on what those GO descriptors are related to. I have dataframe A, which contains the raw data associated with a list of geneIDs (>500,000) and their associated GO descriptors, and dataframe B, which classifies these GO descriptors into larger groups.
Example of dataframe A (dfA): [screenshot omitted]
Example of dataframe B (dfB): [screenshot omitted]
Ideally, the final output would reference the entire list and generate a new column in dataframe A classifying the GeneIDs into the GO_Category's associated with its specific GO_IDs -- bonus points if it removes duplicate hits on the GO_Categorys.
Looking something like this...
Example of the ideal solution: [screenshot omitted]
However, I know that the ideal solution might be difficult to obtain, and I already have dataframe B listed out based on the unique GO_Categories so a solution like this might be easier to obtain.
Example of an acceptable solution: [screenshot omitted]
So far I have struggled with getting any command to search for partial strings using a list from another dataframe with the goal of returning all matches.
I have had partial success with the acceptable solution approach and using:
dfA <- dfA %>%
  mutate(GO_Cat_1 = c('No', 'Yes')[1 + str_detect(dfA$GO_IDs, as.character(dfB$GO_IDs))])
The solution seems okay; however, it does return an error along the lines of:
problem with mutate() column GO_Cat_1.
i GO_Cat_1 = ...[].
i longer object length is not a multiple of shorter object length
I have also tried to look into applying grepl/grep - but struggled to feed it a list of terms to look for partial string matches in dfA.
Any assistance is greatly appreciated!
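One possible sketch toward the "acceptable solution" layout (the column names GeneID, GO_IDs and GO_Category, the separator, and the toy data below are assumptions, not from the original post): collapse each category's GO IDs into a single alternation pattern so that str_detect() receives one pattern per call, which avoids the length-recycling error shown above.

library(dplyr)
library(stringr)

# made-up stand-ins for dfA and dfB; the real column names and format may differ
dfA <- data.frame(GeneID = c("gene1", "gene2", "gene3"),
                  GO_IDs = c("GO:0001; GO:0002", "GO:0003", "GO:0002"))
dfB <- data.frame(GO_IDs      = c("GO:0001", "GO:0002", "GO:0003"),
                  GO_Category = c("Metabolism", "Metabolism", "Transport"))

# one Yes/No column per GO_Category: collapse that category's IDs into a
# single regex alternation and test each gene's GO_IDs string against it
for (categ in unique(dfB$GO_Category)) {
  pattern <- str_c(dfB$GO_IDs[dfB$GO_Category == categ], collapse = "|")
  dfA[[categ]] <- if_else(str_detect(dfA$GO_IDs, pattern), "Yes", "No")
}
dfA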

Use variable to hold dataframe name

I have four dataframes, df1, df2, df3, and df4. They are all formatted identically.
I would like to be able to store the dataframe name in a variable, and access that dataframe later. I can do the below, but this just copies the entire dataframe. Is there a way to do this without copying the whole dataframe?
chart.df <- df1
plot(chart.df$x, chart.df$y)
Note that this is just an example. I would like to do other things aside from just plotting.
In some circumstances, you can store the names of the data.frames as a character vector and then use get() to access the objects. In my experience, @Joran's solution is more flexible, as you can loop (or apply) through the list items either by name or by position, depending on your application.
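A brief illustrative sketch of both ideas (df1 and its contents here are made up):

df1 <- data.frame(x = 1:10, y = rnorm(10))

chart.name <- "df1"
chart.df   <- get(chart.name)        # look the object up by its name
plot(chart.df$x, chart.df$y)

# list-based approach: mget() collects the named objects into a named list,
# which can then be indexed by name or position without get()
dfs <- mget(c("df1"))
plot(dfs[[chart.name]]$x, dfs[[chart.name]]$y)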

List rows with specific columns with a specific value in R from dataset

I have a dataset called "flights" and I am attempting to list all the rows that have the value of "Escanaba, Michigan" in the column Destination. I would like to show 5 columns and then all the rows that apply to Escanaba.
Currently I have...
flights[,c("FlightDate","Carrier","Destination","DestCityName","AirTime")]
That works perfectly for what I want, except it shows all rows.
How do I call out a specific value from a column in a dataset?
This is a pretty basic indexing question (see e.g. here, which was the first hit when I googled "R indexing"); you need to construct a logical vector that is TRUE for the relevant rows.
flights[flights$Destination == "Escanaba, Michigan",
        c("FlightDate", "Carrier", "Destination", "DestCityName", "AirTime")]
A prettier alternative for interactive work (not entirely safe for programmatic use):
subset(flights, Destination == "Escanaba, Michigan",
       select = c(FlightDate, Carrier,
                  Destination, DestCityName, AirTime))
If you want to allow for more than one possible value of Destination, try %in%.
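For example (the second destination below is purely illustrative):

flights[flights$Destination %in% c("Escanaba, Michigan", "Marquette, Michigan"),
        c("FlightDate", "Carrier", "Destination", "DestCityName", "AirTime")]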

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200) { names(projectFileAllCombined)[i + 1] <- variableNames[i] }
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online are examples where people rename a few columns with c(), and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, which I don't know of.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
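A quick toy illustration (projectFileAllCombined and variableNames below are made-up stand-ins for the question's objects):

projectFileAllCombined <- data.frame(id = 1:3, a = rnorm(3), b = rnorm(3))
variableNames <- c("height", "weight")

# vectorised replacement: renames every column except the first in one call
colnames(projectFileAllCombined)[-1] <- variableNames
names(projectFileAllCombined)
# [1] "id"     "height" "weight"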

How to order a matrix by all columns

Ok, I'm stuck in a dumbness loop. I've read through the helpful ideas at How to sort a dataframe by column(s)?, but need one more hint. I'd like a function that takes a matrix with an arbitrary number of columns and sorts by all columns in sequence. E.g., for a matrix foo with N columns,
it does the equivalent of foo[order(foo[,1], foo[,2], ..., foo[,N]), ]. I am happy to use a with or by construction, and if necessary define the colnames of my matrix, but I can't figure out how to automate the collection of arguments to order (or to with).
Or, I should say, I could build the entire bloody string with paste and then call it, but I'm sure there's a more straightforward way.
The most elegant (for certain values of "elegant") way would be to turn it into a data frame, and use do.call:
foo[do.call(order, as.data.frame(foo)), ]
This works because a data frame is just a list of variables with some associated attributes, and can be passed to functions expecting a list.
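For example, with a made-up three-column matrix foo:

set.seed(1)
foo <- matrix(sample(1:3, 30, replace = TRUE), ncol = 3)

# as.data.frame(foo) turns the columns into list elements, which do.call()
# then passes to order() as separate arguments
foo[do.call(order, as.data.frame(foo)), ]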
