Column indexing based on row value - r

I have the data frame:
DT=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100))
The objective is to create a new column Quantity that selects the value for each row in the column equal to Price, such that:
DT.Objective=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100),
Quantity= c(400,200,200,100,100))
The dataset is very large so efficiency is important. I currently use and looking to make more efficient:
Names <- names(DT)
DT$Quantity<- DT[Names][cbind(seq_len(nrow(DT)), match(DT$Price, Names))]
For some reason the column names in the example come with an "X" in front of them, whereas in the actual data there is no X.
Cheers.

We can do this with row/column indexing after removing the prefix 'X' using sub or substring and then do the match as showed in the OP's post
DT$Quantity <- DT[cbind(1:nrow(DT), match(DT$Price, sub("^X", "", names(DT))))]
DT$Quantity
#[1] 400 200 200 100 100
The X is attached as prefix when the column names starts with numbers. One way to take care of this would be using check.names=FALSE in the data.frame call or read.csv/read.table

#akrun is correct, check.names=TRUE is the default behavior for data.frame(); from the man page:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
If possible, you may want to make your column names a bit more descriptive.

Related

How to remove duplicate column names in R?

I've got a big data frame, and like to remove the duplicate column
For simplicity, let's pretend this is my data:
df <- data.frame(id1 = c("Aa","Aa","Ba","Ca","Da"), id2 = c(2,1,4,5,10), location=c(351,261,101,91,51), comment=c(35,26,10,9,5), comment=c(5,16,25,14,11), hight=c(15,21,5,19,18), check.names = FALSE)
I can remove the duplicate column name "comment" using:
df <- df[!duplicated(colnames(df))]
However, when I apply same code in my real dataframe it returns an error:
Error in `[.data.table`(SNV_wild, !duplicated(colnames(SNV_wild))) :
i evaluates to a logical vector length 1883 but there are 60483 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Sorry, I can't post real data since it is quite large which you can see in error.
How can I troubleshoot this - I have gone through all columns names and there are duplicate column name.
Thank you in advance
Your real dataframe is of class data.table, while your small example is not. You can try:
df[,!duplicated(colnames(df)), with=F]

Assigning Unnamed Columns To Another DataFrame

I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named used the $ as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of having my column names messed up with X338. added to every subsequent column. In an attempt to modify this I found the following "fix"
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 Column, it didn't transfer any data for the remaining 21 columns. The code in # just results in the same problem of having X388. show up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the Mangroves$pop column with the real data. Then the data and column names are already added.
Mangroves <- cbind(Mangroves$pop, MangrovesRaw)
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)

Assigning name to rows in R

I would like to assign names to rows in R but so far I have only found ways to assign names to columns. My data is in two columns where the first column (geo) is assigned with the name of the specific location I'm investigating and the second column (skada) is the observed value at that specific location. To clarify, I want to be able to assign names for every location instead of just having them all in one .txt file so that the data is easier to work with. Anyone with more experience than me that knows how to handle this in R?
First you need to import the data to your global environment. Try the function read.table()
To name rows, try
(assuming your data.frame is named df):
rownames(df) <- df[, "geo"]
df <- df[, -1]
Well, your question is not that clear...
I assume you are trying to create a data.frame with named rows. If you look at the data.frame help you can see the parameter row.names description
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
which means you can manually specify the row names when you create the data.frame or the column containing the names. The former can be achived as follows
d = data.frame(x=rnorm(10), # 10 random data normally distributed
y=rnorm(10), # 10 random data normally distributed
row.names=letters[1:10] # take the first 10 letters and use them as row header
)
while the latter is
d = data.frame(x=rnorm(10), # 10 random data normally distributed
y=rnorm(10), # 10 random data normally distributed
r=letters[1:10], # take the first 10 letters
row.names=3 # the column with the row headers is the 3rd
)
If you are reading the data from a file I will assume you are using the command read.table. Many of its parameters are the same of data.frame, in particular you will find that the row.headers parameter works the same way:
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
Finally, if you have already read the data.frame and you want to change the row names, Pierre's answer is your solution

How to remove ending from sample names

I am trying to remove endings from sample names in my data frame. There are about 200 samples so I was hoping there was a way to end the name before the first - (common to each sample).
Examples of names are:
Glyc.1.20C.1wk-ATGGTTCACCCG-CATCAGTACGCC-R1.fastq
Glyc.1.20C.2m-CACTACGCTAGA-GTTCCTCCATTA-R1.fastq
Glyc.1.20C.2wk-GCTCGAAGATTC-CGAGGGAAAGTC-R1.fastq
Glyc.1.20C.3m-GTAGGTGCTTAC-GCATAAACGACT-R1.fastq
Using the change colnames(x) <- c("Glyc.1.20C.1wk, etc) would take me forever.
Any ideas?
If df is your dataframe, take the names, remove everything after the first -, and reset the names to the new short values...
names(df) <- gsub("\\-.+","",names(df))

Change part of data frame to character, then to numeric in R

I have a simple problem. I have a data frame with 121 columns. columns 9:121 need to be numeric, but when imported into R, they are a mixture of numeric and integers and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over either rows, cols, or both, of a dataframe and apply any function, so to make sure all your columns from 9:121 are numeric, you can do the following:
table[,9:121] <- apply(table[,9:121],2, function(x) as.numeric(as.character(x)))
table[,1:8] <- apply(table[,1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly I specify in the apply function the table I want to loop over - in this case the subset of your table we want to make changes to, then we specify the number 2 to indicate columns, and finally give the name of the as.numeric or as.character functions. The assignment operator then replaces the old values in your table with the new ones of correct format.
-EDIT: Just changed the first line as I recalled that if you convert from a factor to a number, what you get is the integer of the factor level and not the number you think you are getting to factors first need to be converted to characters, then numbers, which was can do just by wrapping as.character inside as.numeric.
When you read in the table use strinsAsFactors=FALSE then there will not be any factors.

Resources