R column numbers - r

I have been working with many different datasets lately and need a quick way to identify the column numbers of particular columns. For example, I have a dataset with 75 variables (columns). The variables I need are in the middle of the dataset, and I know their names, e.g. g, h, i, j, and k. Rather than writing out the names each time I want to use, change, or reference them, I usually use the column numbers, i.e.
for (i in 35:39) { do bla bla bla}
The usual way I find the column numbers is to look at the data frame and count columns until I reach the one I want, then count how many there are to get my 35:39. Is there a better way to do this? Is there a better way to find out that column/variable g is column number 35 and column/variable k is number 39?

Just an expanded version of my comment. As I said, there are several ways to do this, and I do not think a single right one exists. Here is a possible solution (if I understand what you want to achieve, of course).
as.data.frame(cbind(column = 1:ncol(iris),names = names(iris)))
column names
1 1 Sepal.Length
2 2 Sepal.Width
3 3 Petal.Length
4 4 Petal.Width
5 5 Species
This way you can see which name corresponds to which column number.
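A slightly tidier variant of the same lookup table, as a sketch: building it with data.frame() directly keeps the column numbers as integers, whereas cbind() first coerces everything to character. The name lookup is just an illustrative variable.

```r
# Lookup table of column positions and names for the built-in iris data
lookup <- data.frame(column = seq_along(iris), name = names(iris))
lookup

# The position column stays integer, so it can be used for indexing directly
petal_width_col <- lookup$column[lookup$name == "Petal.Width"]
petal_width_col
```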

If you want to see which column is named g you could do
which(names(mydataframe) == 'g')
which gives you the index of the column with name "g".

You can use match instead of which, since you need just one column (match is probably faster as well, I suppose).
match('g',names(mydataframe))
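Applied to the original question, match() can also take several names at once, which gives the 35:39 range without counting by hand. A minimal sketch, assuming a made-up 75-column data frame in which columns 35 to 39 are named g to k:

```r
# Hypothetical 75-column data frame; columns 35 to 39 are named g..k
mydataframe <- as.data.frame(matrix(0, nrow = 2, ncol = 75))
names(mydataframe)[35:39] <- c("g", "h", "i", "j", "k")

# match() returns one position per requested name, in the order requested
ends <- match(c("g", "k"), names(mydataframe))
cols <- ends[1]:ends[2]

for (i in cols) {
  # placeholder operation: add 1 to each selected column
  mydataframe[[i]] <- mydataframe[[i]] + 1
}

# Columns can also be selected by name, without any numbers
subset_by_name <- mydataframe[c("g", "h", "i", "j", "k")]
```

Selecting by name is often the more robust choice, because the positions change whenever columns are added or removed.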

Related

How to label CCA-Plot with row.names in R

I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a dataframe (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL etc.) as row names and species occurrences. The dataframe looks as follows:
   species 1 species 2 species 3
AB         0         3         1
DB         1         6         0
DL         3         4         2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a plain data.frame but something else: the result claims to be a data.frame, but it is not one. We do matrix algebra in cca and therefore convert the input to a matrix, which works for a standard data.frame but not for the object you supplied. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow the names to be preserved; it would have worked without removing this column (as long as the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() may be a solution.
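One way to sidestep the issue, sketched in base R rather than with reshape2: build the site-by-species table as an ordinary matrix with row names. Here tapply() is swapped in for cast(), and the long-format columns Treatment, species and count are assumptions about what fungi_ls looks like; the values are the ones from the table in the question.

```r
# Assumed long format: one row per treatment/species combination
fungi_ls <- data.frame(
  Treatment = rep(c("AB", "DB", "DL"), each = 3),
  species   = rep(c("species 1", "species 2", "species 3"), times = 3),
  count     = c(0, 3, 1, 1, 6, 0, 3, 4, 2)
)

# tapply() with two grouping vectors returns a plain matrix with dimnames
comm <- tapply(fungi_ls$count,
               list(fungi_ls$Treatment, fungi_ls$species),
               sum)

is.matrix(comm)   # TRUE
rownames(comm)    # "AB" "DB" "DL"
```

A plain matrix like comm carries its row names in its dimnames, so passing it to cca() should let the treatments appear as site labels.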

Subtraction of rows from data table in R

I'm new to this site (and new to R) so I hope this is the right way to approach my problem.
I searched at this site but couldn't find the answer I'm looking for.
My problem is the following:
I have imported a table from a database into R (it says it's a data frame) and I want to subtract the values in a particular column (row by row). After that, I'd like to assign these differences to a new column called 'Difference' in the same data frame.
Could anyone please tell me how to do this?
Many thanks,
Arjan
To add a new column, just do df$newcol <- value, where df is the name of your data frame and newcol is the name you want, in this case "Difference". If you want to subtract one existing column from another, just use arithmetic operations.
df$Difference <- (df$col1 - df$col2)
I'm going to assume you want to subtract the values in one column from those in another; is this correct? This can be done pretty easily, see the code below.
First I'm just going to make up some data.
df <- data.frame(v1 = rnorm(10,100,4), v2 = rnorm(10,25,4))
You can subtract the values in one column from another by doing just that.
Use $ to specify columns. Assigning to a new name after the $ creates a new column.
df$Differences <- df$v1 - df$v2
df
v1 v2 Differences
1 98.63754 29.54652 69.09102
2 99.49724 24.27766 75.21958
3 102.73056 25.01621 77.71435
4 100.87495 26.92563 73.94933
5 103.01357 17.46149 85.55208
6 97.24901 20.82983 76.41917
7 100.73915 27.95460 72.78454
8 98.14175 24.19351 73.94824
9 102.63738 21.74604 80.89133
10 105.78443 16.79960 88.98483
Hope this helps
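If "row by row" instead means the difference between each row and the previous one within a single column, base R's diff() covers that reading. A minimal sketch with a made-up column named value; the first row has no predecessor, so it gets NA.

```r
# Made-up data: a single numeric column
df <- data.frame(value = c(10, 13, 12, 20))

# diff() returns n - 1 differences, so pad the first row with NA
df$Difference <- c(NA, diff(df$value))
df
```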

How to replace only the final character of multiple variable names in R?

Below is some background information about my dataset if you want to understand where my question comes from (I actually want to merge datasets, so maybe somebody knows a more efficient way).
The question:
How to replace only the final character of a variable name in R with nothing (for multiple variables)?
I tried using the sub() function and it worked fine, however, some variable names contain the character I want to change multiple times (e.g. str2tt2). I only want to 'remove' or replace the last '2' with blank space.
Example:
Suppose I have a dataset with these variable names, and I only want to remove the last '_2' characters, I tried this:
h_2ello_2 how_2 are_2 you_2
1 1 3 5 7
2 2 4 6 8
names(data) <- sub('_2', '', names(data))
Output:
hello_2 how are you
1 1 3 5 7
2 2 4 6 8
Now, I want my code to remove the last '_2', so that it returns 'h_2ello' instead of hello_2.
Does anyone know how to?
Thank you in advance!
Background information:
I am currently trying to build a dataset from three separate ones. These three different ones are from three different measurement moments, and thus their variable names include a character after each variable name respective to their measurement moment. That is, for measurement moment 2, the variable names are scoreA2, scoreB2, scoreC2 and for measurement moment 3, the variable names are scoreA3, scoreB3 and scoreC3.
Since I want to merge these files together, I want to remove the '2' and '3' in the datasets and then merge them so that it seems like everyone was measured at the same moment.
However, some score names include the character 2 and 3 as well. For example: str2tt2 is the variable name for Stroop card 2 total time measurement moment 2. I only want to remove the last '2', but when using the sub() function I only remove the first one.
We need to use the metacharacter $, which anchors the match to the end of the string, on the original dataset column names
names(data) <- sub('_2$', '', names(data))
names(data)
#[1] "h_2ello" "how" "are" "you"
In the OP's code, the pattern _2 matches the first instance in h_2ello_2, because sub() replaces only the first match, and so removes the _2 from h_2. Instead, we need to specify that the match must be at the end of the string.
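The same anchor works for the variable names described in the background information, where the trailing digit marks the measurement moment. A small sketch using names in the style of the question; the character class [23] covers both moments in one pattern.

```r
# Names in the style of the question; only the trailing digit should go
old_names <- c("str2tt2", "scoreA2", "scoreB3", "str3tt3")

# [23]$ matches a single 2 or 3 only when it is the last character
new_names <- sub("[23]$", "", old_names)
new_names
# [1] "str2tt" "scoreA" "scoreB" "str3tt"
```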

r - How can I "add" additional information to column names without altering the names themselves?

I have a matrix with individual column names (the row names are not important), like this
TestMat<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat)<-c("A","B","C","D","E")
TestMat
For various reasons, but mostly because a package will later need it, I can't alter the values in the matrix and they all have to be integers.
Now I want to categorize my column names (e.g. A, B and D into "Group 1" and C and E into "Group 2"). The idea is that the matrix will get smaller later on, as values in the matrix are randomly diminished. As soon as a column sum reaches zero, that column will be dropped. Along the way I want to see how the fraction/size of one group changes compared to the other groups.
I thought the easiest way would be to just name all the corresponding columns identical:
TestMat2<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat2)<-c("Group1","Group1","Group2","Group1","Group2")
TestMat2
But this gives me error-messages later on in the analysis, as R starts numbering the identical column-names in a way of "Group1" "Group1.1" "Group2" "Group1.2" "Group2.1".
I have tried my luck with "class", "attr" and "factor" commands to my column names, but don't get anywhere.
Is there a trick or command, I've maybe never heard of?
As per the comments, why not put the grouping in another variable, then something like:
> TestMat<-matrix(1:25,ncol=5,nrow=5)
> colnames(TestMat)<-c("A","B","C","D","E")
> F=factor(c("Group1","Group1","Group2","Group1","Group2"))
... do something to your matrix...
> summary(F[colSums(TestMat) > 40])
Group1 Group2
1 2
Is that it (subs. 40 for 0)?
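A variation on this idea that keeps working after columns are dropped: store the grouping as a vector named by column name and look groups up by colnames() rather than by position. This is only a sketch; the names grp and kept are illustrative, and the threshold of 40 stands in for the real drop rule.

```r
TestMat <- matrix(1:25, ncol = 5, nrow = 5)
colnames(TestMat) <- c("A", "B", "C", "D", "E")

# Grouping keyed by column name instead of column position
grp <- c(A = "Group1", B = "Group1", C = "Group2", D = "Group1", E = "Group2")

# Drop some columns; drop = FALSE keeps a matrix even if one column remains
kept <- TestMat[, colSums(TestMat) > 40, drop = FALSE]

# Look the groups up by whatever column names are left
group_counts <- table(grp[colnames(kept)])
group_counts
```

Because the lookup goes through the names, the grouping and the matrix cannot drift apart when the matrix shrinks.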
The Bioconductor package Biobase defines a class ExpressionSet that allows annotations on rows and columns of a matrix
library(Biobase)
exprs = matrix(1:25,ncol=5,nrow=5, dimnames=list(NULL, LETTERS[1:5]))
df = data.frame(grp=c("Group1","Group1","Group2","Group1","Group2"),
row.names=colnames(exprs))
eset = ExpressionSet(exprs, AnnotatedDataFrame(df))
You can access columns in the data frame with $, subset with [, and extract with exprs(), e.g.,
> exprs(eset[, eset$grp == "Group1"])
A B D
1 1 6 16
2 2 7 17
3 3 8 18
4 4 9 19
5 5 10 20
or
> eset[,colSums(exprs(eset)) > 40]$grp
[1] Group2 Group1 Group2
Levels: Group1 Group2
The GenomicRanges package defines a similar class SummarizedExperiment when the rows are annotated with genomic ranges.
This coordinated integration of data and annotation on data is a really good thing, reducing the chance for 'clerical' errors when matrix and annotation are independent; I'm surprised so many comments suggest that you separately maintain two structures.
Thanks for all the helpful comments. I haven't posted here since my original post, because I first wanted to try all promising approaches and find a final solution to my problem.
I tried the Biobase package with its option for annotations, as well as Stephen's idea of grouping everything via a second variable.
As it turned out, as soon as the matrix diminished in size (as a part of the analysis) the external grouping failed, as column-numbers and grouping didn't match anymore and I couldn't find a way to combine the Bioconductor approach and my code.
I found a (somewhat roundabout) solution, though, if anybody cares:
I already stated that if I give the columns of a group identical names, R later numbers them and they are thus no longer identical.
But I then just compared the first few characters, as many as necessary, to identify the proper group:
length(colnames(TestMat2)[substr(colnames(TestMat2),1,6) == "Group1"])
This way I can always check the fraction of one group of columns versus the others.
Thanks for your answers and help. I learned a lot and I think Bioconductor will come in handy in the future.
Cheers!

Trying to use user-defined function to populate new column in dataframe. What is going wrong?

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This testDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee... it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which the employee visited that location. For example, EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
  LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
  return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
  with(TestDF[do.call(order, TestDF), ],
       which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which()
If you just want the sequence of visits by employee try this:
(Changed the first argument to Month, since that is what determines the sequence of locations)
with(TestDF, ave(Month, Employee, FUN=seq))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is take the months, looking at groups of the months by employee separately, and give me the rank order of the months (where they fall in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
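A small self-contained illustration of the recycling described here, using a made-up function that, like EmployeeLocationNumber, only ever returns one value. The names df, times_ten_first, recycled and per_row are purely illustrative.

```r
df <- data.frame(x = c(2, 5, 9))

# Like EmployeeLocationNumber, this only looks at the first element
times_ten_first <- function(v) v[1] * 10

# One call with the whole vector: a single value, recycled down the column
df$recycled <- times_ten_first(df$x)

# One call per element: a vector as long as the data frame
df$per_row <- sapply(df$x, times_ten_first)

df
#   x recycled per_row
# 1 2       20      20
# 2 5       20      50
# 3 9       20      90
```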
I'll get back to you on that :)
Dito.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
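One caveat with unlist() here is that it returns the results grouped by employee, which only lines up with the rows if the data frame is already sorted by Employee. A sketch of the same split/rank steps that puts the values back in the original row order with unsplit(), using a deliberately shuffled, made-up version of the data:

```r
# Shuffled, made-up data: rows are not grouped by employee
TestDF <- data.frame(Employee = c(3, 1, 2, 1, 3, 2, 1, 3, 1),
                     Month    = c(10, 5, 4, 1, 1, 10, 11, 5, 6))

by_emp <- split(TestDF$Month, TestDF$Employee)  # one vector of months per employee
ranked <- lapply(by_emp, rank)                  # rank the months within each employee

# unsplit() reverses split(), restoring the original row order
TestDF$ELN <- unsplit(ranked, TestDF$Employee)

# Same result as the ave() approach
same_as_ave <- isTRUE(all.equal(TestDF$ELN,
                                ave(TestDF$Month, TestDF$Employee, FUN = rank)))
```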
Update again Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
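The same logical indexing can pull out a single value from one column based on another, without getting a data frame back that then has to be flattened. A short sketch on the TestDF from the question; month_at_8 and locs_emp3 are illustrative names.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Index one column by a condition on another: returns a plain vector
month_at_8 <- TestDF$Month[TestDF$Location == 8]
month_at_8   # 10

# Row condition plus a single column name also returns a plain vector
locs_emp3 <- TestDF[TestDF$Employee == 3, "Location"]
locs_emp3    # 4 2 8
```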

Resources