I have a big data frame and would like to remove its duplicate columns.
For simplicity, let's pretend this is my data:
df <- data.frame(id1 = c("Aa","Aa","Ba","Ca","Da"), id2 = c(2,1,4,5,10), location=c(351,261,101,91,51), comment=c(35,26,10,9,5), comment=c(5,16,25,14,11), hight=c(15,21,5,19,18), check.names = FALSE)
I can remove the duplicate column name "comment" using:
df <- df[!duplicated(colnames(df))]
However, when I apply the same code to my real data frame, it returns an error:
Error in `[.data.table`(SNV_wild, !duplicated(colnames(SNV_wild))) :
i evaluates to a logical vector length 1883 but there are 60483 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Sorry, I can't post the real data since it is quite large, as you can see from the error.
How can I troubleshoot this? I have gone through all the column names and there are duplicate column names.
Thank you in advance
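Your real data frame is of class data.table, while your small example is a plain data.frame. In a data.table, a single argument inside [ ] is treated as a row filter, which is why the logical vector of length 1883 (one element per column) is checked against the 60483 rows. Select the columns in the j position instead:
df[, !duplicated(colnames(df)), with = FALSE]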
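A minimal sketch of how the two cases could be handled in one place. The helper name drop_duplicate_columns is hypothetical (it is not part of base R or data.table); it only branches on the class of its input and uses the indexing form appropriate to each. The data.table branch is only reached when a data.table is passed in, so the example itself runs in base R.

```r
# Hypothetical helper: keep only the first occurrence of each column name.
drop_duplicate_columns <- function(x) {
  keep <- !duplicated(names(x))
  if (inherits(x, "data.table")) {
    # data.table: columns are selected in j; with = FALSE treats `keep`
    # as a plain logical vector rather than an expression over columns
    x[, keep, with = FALSE]
  } else {
    # plain data.frame: logical column index; drop = FALSE keeps a data frame
    x[, keep, drop = FALSE]
  }
}

# The small example from the question (a plain data.frame)
df <- data.frame(id1 = c("Aa", "Aa", "Ba", "Ca", "Da"), id2 = c(2, 1, 4, 5, 10),
                 location = c(351, 261, 101, 91, 51), comment = c(35, 26, 10, 9, 5),
                 comment = c(5, 16, 25, 14, 11), hight = c(15, 21, 5, 19, 18),
                 check.names = FALSE)

deduped <- drop_duplicate_columns(df)
names(deduped)
```

Running class(SNV_wild) on the real object is a quick way to confirm which branch applies.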
I am simply trying to create a dataframe.
I read in data by doing:
example <- read.csv(choose.files(), header=TRUE, sep=";")
The data contains 2 columns with 8736 rows plus a header.
I then simply want to combine this with a column of another data frame that has the same number of rows (!) by doing:
data_frame <- as.data.frame(example$x, example$y, otherdata$z)
It produces the following warning:
Warning message:
In as.data.frame.numeric(example$x, example$y, otherdata$z) :
'row.names' is not a character vector of length 8736 -- omitting it. Will be an error!
I have never had this problem before. It seems so easy to tackle, but I can't figure it out at the moment.
Overview
As long as nrow(example) equals length(otherdata$z), use cbind.data.frame() to combine the columns into one data frame. An advantage of cbind.data.frame() is that there is no need to refer to the individual columns of example when binding them with otherdata$z.
# create a new data frame that adds the 'z' field from another source
df_example <- cbind.data.frame(example, otherdata$z)
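The warning in the question arises because the second positional argument of as.data.frame() is row.names, so example$y is not treated as a column. A small sketch of two ways to build the combined data frame; the contents of example and otherdata here are made-up stand-ins for the real files.

```r
# Made-up stand-ins for the objects in the question
example   <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
otherdata <- data.frame(z = c(10, 20, 30))

# Check the precondition before binding
stopifnot(nrow(example) == length(otherdata$z))

# Option 1: name each column explicitly with data.frame()
combined <- data.frame(x = example$x, y = example$y, z = otherdata$z)

# Option 2: bind the whole data frame and name the new column
combined2 <- cbind.data.frame(example, z = otherdata$z)

names(combined)
names(combined2)
```

Naming the new column (z = ...) avoids ending up with a column literally called "otherdata$z".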
I have the data frame:
DT=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100))
The objective is to create a new column Quantity that selects the value for each row in the column equal to Price, such that:
DT.Objective=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100),
Quantity= c(400,200,200,100,100))
The dataset is very large, so efficiency is important. I currently use the following and am looking to make it more efficient:
Names <- names(DT)
DT$Quantity<- DT[Names][cbind(seq_len(nrow(DT)), match(DT$Price, Names))]
For some reason the column names in the example come with an "X" in front of them, whereas in the actual data there is no X.
Cheers.
We can do this with row/column indexing after removing the prefix 'X' using sub or substring, and then do the match as shown in the OP's post:
DT$Quantity <- DT[cbind(1:nrow(DT), match(DT$Price, sub("^X", "", names(DT))))]
DT$Quantity
#[1] 400 200 200 100 100
The X is attached as a prefix when a column name starts with a number. One way to take care of this is to use check.names=FALSE in the data.frame call or in read.csv/read.table.
@akrun is correct, check.names=TRUE is the default behavior for data.frame(); from the man page:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
If possible, you may want to make your column names a bit more descriptive.
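A short sketch of the effect of check.names, using a cut-down version of the question's data. The object names with_prefix and no_prefix are illustrative only. With check.names = FALSE the price values can be matched against the column names directly, without stripping a prefix.

```r
# Default (check.names = TRUE): names are passed through make.names()
with_prefix <- data.frame(Price = c(2.1, 2.2),
                          "2.1" = c(400, 100),
                          "2.2" = c(600, 200))
names(with_prefix)   # "Price" "X2.1" "X2.2"

# check.names = FALSE keeps the names exactly as given
no_prefix <- data.frame(Price = c(2.1, 2.2),
                        "2.1" = c(400, 100),
                        "2.2" = c(600, 200),
                        check.names = FALSE)
names(no_prefix)     # "Price" "2.1" "2.2"

# Row/column matrix indexing: one (row, column) pair per row
idx <- cbind(seq_len(nrow(no_prefix)),
             match(no_prefix$Price, names(no_prefix)))
quantity <- no_prefix[idx]
quantity             # 400 200
```

One caveat: match() compares the prices as character strings, so a price of 2.0 is converted to "2" and would not match a column named "2.0".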
To remove rows from a data frame, I use the following command:
data <- data[-1, ]
for example to remove the first row. I need to remove the first 6 rows, so I used the following:
data <- data[-c(1,2,3,4,5,6), ]
OR
data <- data[-(1:6), ]
This works as far as removing the rows, but it introduces a new column called row.names that I cannot get rid of unless I use the command:
row.names(data) <- NULL
What is the reason for this? Is there a better way of removing a number of rows/columns with one command?
Example:
after the following code:
tquery <- tquery[-(1:6), ]
This is the data:
Although it seems as such, you are not actually adding a column to the data. What you are seeing is just a result of using View(). The function is showing the "row.names" attribute of the data frame as the first column, but you didn't really add the column.
This is expected and documented behavior. From the Details section of help(View)
If there are row names on the data frame that are not 1:nrow, they are displayed in a separate first column called row.names.
So since you subsetted the data, the row names are technically not 1:nrow any more and hence the new column is introduced in the viewer.
Print your data in the console and you'll see the difference.
View(mtcars) ## because the mtcars row names are not 1:nrow
versus
mtcars
Basically, don't trust View() to display an exact representation of the actual data. Instead use attributes(), *names(), dim(), length(), etc. or just peek at the data with head().
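A small sketch of what happens to the row names when rows are dropped, using a made-up data frame rather than the tquery object from the question:

```r
df <- data.frame(a = 1:8, b = letters[1:8])

# Drop the first six rows
trimmed <- df[-(1:6), ]

names(trimmed)                      # still only "a" and "b": no new column
names_before <- rownames(trimmed)   # "7" "8": the original row labels are kept

# Resetting the row names renumbers them from 1
rownames(trimmed) <- NULL
names_after <- rownames(trimmed)    # "1" "2"
```

After the reset the row names are 1:nrow again, which is why the extra row.names column disappears from the viewer.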
See the R help via "?row.names" for more info. From the documentation: "All data frames have a row names attribute".
?row.names ## get more information about row.names from r help
row.names is not a new column, but rather an attribute of every data frame. It is metadata and is ignored by most operations on the columns. Note that some export functions, such as write.csv(), write the row names by default unless you pass row.names = FALSE. This is similar to how Excel shows row numbers in the left margin, which are reference data for the application.
str(your_dataframe) ## see that those columns don't exist
colnames(your_dataframe) ## see column names
I'm getting the following error in R:
argument lengths differ.
I have a data set I would like to order on two columns, first on caseID, then on a column that contains a timestamp. I use the following code:
mydata <- mydata[order(mydata[ ,col1], mydata[ ,col2], decreasing = FALSE),]
Col1 and col2 are two variables holding an integer. I have looked at similar questions and tried the solutions that were proposed there, but nothing worked ;).
Could someone please help me?
Kind regards
R thinks that your two columns have different lengths. That sometimes happens when you accidentally access a column that does not exist, so check the values of col1 and col2 to make sure they are appropriate numbers. Also look at length(mydata[,col1]) and length(mydata[,col2]) to see whether the two values match. Finally, check for a missing comma or other punctuation: if the syntax is not exactly right, you can get a list of length 1 or a single-element vector that does not match the other vector in length.
I was having this same problem, but was able to get my code working. Try this code, where col1 and col2 stand for the actual column names:
with(mydata, mydata[order(col1, col2), ])
order() sorts in increasing order by default, so adding the argument decreasing = FALSE was not necessary. Hope that helps.
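A minimal sketch of the checks and the ordering when col1 and col2 hold column positions, as in the question. The data and column names (caseID, stamp, value) are invented for illustration.

```r
# Invented data: a case ID, a timestamp-like number, and a payload column
mydata <- data.frame(caseID = c(2, 1, 2, 1),
                     stamp  = c(10, 30, 5, 20),
                     value  = c("a", "b", "c", "d"))

col1 <- 1L   # position of caseID
col2 <- 2L   # position of stamp

# Both keys must exist and have one element per row
stopifnot(col1 <= ncol(mydata), col2 <= ncol(mydata))
stopifnot(length(mydata[[col1]]) == length(mydata[[col2]]))

# [[ ]] always returns a single column as a vector
sorted <- mydata[order(mydata[[col1]], mydata[[col2]]), ]
sorted
```

Using [[ ]] rather than [ , ] avoids surprises when the object is a tibble or data.table, where single-column [ , ] indexing does not necessarily return a vector.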
It may be worth checking out this similar post, which uses the dplyr package to solve the problem and helped me: Arrange within a group with dplyr
This might do the trick:
library(dplyr)
mydata <- mydata %>%
  arrange(
    col1,
    col2,
    desc(col3)
  )
I created a random forest and predicted the classes of my test set, which are living happily in a dataframe:
row.names class
564028 1
275747 1
601137 0
922930 1
481988 1
...
The row.names attribute tells me which row is which, before I did various operations that scrambled the order of the rows during the process. So far so good.
Now I would like to get a general feel for the accuracy of my predictions. To do this, I need to take this data frame and reorder it in ascending order according to the row.names attribute. This way, I can compare the observations, row-wise, to the labels, which I already know.
Forgive me for asking such a basic question, but for the life of me, I can't find a good source of information regarding how to do such a trivial task.
The documentation implores me to:
use attr(x, "row.names") if you need to retrieve an integer-valued set of row names.
but this leaves me with nothing but NULL.
My question is, how can I use row.names which has been loyally following me around in the various incarnations of dataframes throughout my workflow? Isn't this what it is there for?
None of the other solutions would actually work.
It should be:
# Assuming the data frame is called df
df[ order(as.numeric(row.names(df))), ]
Row names in R are character strings, so without the as.numeric() call the rows are arranged lexically as 1, 10, 11, ... and so on.
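A small sketch of the difference, using a made-up one-column data frame of predictions (preds is a hypothetical name). drop = FALSE is included so the single-column result stays a data frame and keeps its row names.

```r
# Made-up predictions with numeric-looking row names in scrambled order
preds <- data.frame(class = c(1, 0, 1, 1),
                    row.names = c("10", "2", "100", "1"))

# Character ordering: "1" < "10" < "100" < "2"
lexical <- preds[order(row.names(preds)), , drop = FALSE]

# Numeric ordering: 1 < 2 < 10 < 100
by_number <- preds[order(as.numeric(row.names(preds))), , drop = FALSE]

row.names(lexical)
row.names(by_number)
```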
This worked for me:
new_df <- df[ order(row.names(df)), ]
If you have only one column in your data frame, as in my case, you have to add drop = FALSE:
df[ order(rownames(df)), , drop = FALSE]
For completeness:
@BondedDust's answer works perfectly for the rownames attribute, but your example does not use the rownames attribute. The output provided in your question indicates use of a column named "row.names", which isn't the same thing (all listed in @BondedDust's comment). Here is the answer if you wish to sort by the "row.names" column in the example given in your question (there is another posting on this, located here). This answer assumes you are using a data frame named "df", with one column named "row.names":
ordered.df <- df[order(df$row.names),] #this orders the df by the "row.names" column
Alternatively, to order by the first column (same thing if you're still using your example):
ordered.df <- df[order(df[,1]),] #this orders the df by the first column
Hope this is helpful!
The "[" function returns rows in the order of any vector that can be matched to rownames(), so no separate sorting step is needed if the names you pass are already in the order you want:
df[ rownames(df) , ]
You might have thought it would be necessary to use:
df[ order(rownames(df)) , ]
But for row names 1:100 that would have given you a lexical ordering (1, 10, 100, 11, 12, ..., 2, 20, 21, ...), because row names are stored as character strings.
Assuming your data frame is named 'df', you can create a new ordered data frame 'ord.df' that contains the row names of df as well as its values with the following line of code:
ord.df <- cbind(rownames(df)[order(rownames(df))], df[order(rownames(df)), ])
new_df <- df[ order(row.names(df)), ]
or something similar won't work. After this statement, new_df no longer has row names. I guess a better solution is to add the row names as a column, sort by it, and then set it as the row names again.
You can simply sort your df by its row names (in character order) using this:
df <- df[sort(rownames(df)),]
and then do what you want!