I have a data.frame like this:
id<-c("001-020", "001-010", "001-051")
name<-c("Fred", "Sue", "Liam")
df<-data.frame(id, name)
I tried:
df[with(df, order(id)), ]
# id name
# 2 001-010 Sue
# 1 001-020 Fred
# 3 001-051 Liam
which orders the data.frame correctly, but doesn't touch the rownames.
How may I reorder the data.frame using the ascending order of the id field and rewrite the rownames in one go?
You could try
newdf <- df[with(df, order(id)), ]
row.names(newdf) <- NULL
Or it can be done in a single step
newdf <- `row.names<-`(df[with(df,order(id)),], NULL)
Setting row.names to NULL will also work when you have an empty data.frame.
d1 <- data.frame()
row.names(d1) <- NULL
d1
#data frame with 0 columns and 0 rows
If we do the same with 1:nrow(d1), it fails, because nrow(d1) is 0 and 1:nrow(d1) therefore evaluates to c(1L, 0L):
row.names(d1) <- 1:nrow(d1)
#Error in `row.names<-.data.frame`(`*tmp*`, value = c(1L, 0L)) :
#invalid 'row.names' length
Or another option is data.table
library(data.table)#v1.9.4+
setorder(setDT(df), id)[]
Or
setDT(df)[order(id)]
Or using sqldf
library(sqldf)
sqldf('select * from df
order by id')
You can simply assign new rownames:
df2 <- df[with(df, order(id)), ]
rownames(df2) <- 1:nrow(df2)
And a cleaner solution with magrittr:
library(magrittr)
df %>% extract(order(df$id), ) %>% set_rownames(1:nrow(df))
I am surprised it's not in the previous answers.
What you are looking for is arrange from plyr:
library(plyr)
arrange(df, id)
# id name
#1 001-010 Sue
#2 001-020 Fred
#3 001-051 Liam
Since row names are stored as an attribute on the object, perhaps structure() would be appropriate here:
structure(df[order(df$id),],row.names=rownames(df));
## id name
## 1 001-010 Sue
## 2 001-020 Fred
## 3 001-051 Liam
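If the row names of df were not already 1..n, a variant of the same idea (just a sketch) would be to build the sequence explicitly:
structure(df[order(df$id), ], row.names = seq_len(nrow(df)))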
Related
I have two dataframes: 1) an old dataframe (let's call it "df1") and 2) an updated dataframe ("df2"). I need to identify what has been added to or removed from df1 to create df2. So, I need a new dataframe with a new column identifying what rows should be added to or removed from df1 in order to get df2.
The two dataframes are of differing lengths, and Vessel_ID is the only unique identifier.
Here is a reproducible example:
df1 <- data.frame(Name=c('Vessel1', 'Vessel2', 'Vessel3', 'Vessel4', 'Vessel5'),
Vessel_ID=c('1','2','3','4','5'), special_NO=c(10,20,30,40,50),
stringsAsFactors=F)
df2 <- data.frame(Name=c('Vessel1', 'x', 'y', 'Vessel3', 'x', 'Vessel6'), Vessel_ID=c('1', '6', '7', '3', '5', '10'), special_NO=NA, stringsAsFactors=F)
Ideally I would want an output like this:
df3
Name Vessel_ID special_NO add_remove
Vessel2 2 20 remove
Vessel4 4 40 remove
Vessel6 10 NA add
x 6 NA add
y 7 NA add
Also, if the Vessel_ID matches, I want to substitute the special_NO from df1 for NA in df2...but maybe that's for another question.
I tried adding a new column to both df1 and df2 to identify which data frame each row originally belonged to, then merging the data frames and using the duplicated() function. This seemed to work, but I still wasn't sure which rows to remove or to add, and I got different results depending on whether I specified fromLast=T or fromLast=F.
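A minimal sketch of that attempt (the source column name is illustrative, not from the original code):
df1$source <- "df1"
df2$source <- "df2"
both <- rbind(df1, df2)
# duplicated() alone flags only one row of each duplicated pair, which is why
# fromLast = TRUE/FALSE gives different results; combining both directions
# keeps only the Vessel_IDs that appear in exactly one data frame
both[!(duplicated(both$Vessel_ID) | duplicated(both$Vessel_ID, fromLast = TRUE)), ]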
An approach using bind_rows
library(dplyr)
bind_rows(df1 %>% mutate(add_remove="remove"),
df2 %>% mutate(add_remove="add")) %>%
group_by(Vessel_ID) %>%
filter(n() == 1) %>%
ungroup()
# A tibble: 5 × 4
Name Vessel_ID special_NO add_remove
<chr> <chr> <dbl> <chr>
1 Vessel2 2 20 remove
2 Vessel4 4 40 remove
3 x 6 NA add
4 y 7 NA add
5 Vessel6 10 NA add
Thanks for the comment! That looks like it would work too. Here's another solution a friend gave me, mostly in base R (the join uses dplyr::full_join, with a base merge() alternative noted below):
df1$old_new <- "old"
df2$old_new <- "new"
#' Use the full_join function in the dplyr package to join both data.frames based on Name and Vessel_ID
df.comb <- dplyr::full_join(df1, df2, by = c("Name", "Vessel_ID"))
#' If you want to go fully base, you can use the merge() function to get the same result.
# df.comb <- merge(df1, df2, by = c("Name", "Vessel_ID"), all = TRUE, sort = FALSE)
#' Create a new column that sets the 'status' of a row
#' If old_new.x is NA, that row came from df2, so it is "new"
df.comb$status[is.na(df.comb$old_new.x)] <- "new"
# If old_new.x is not NA and old_new.y is NA then that row was in df1, but isn't in df2, so it has been "deleted"
df.comb$status[!is.na(df.comb$old_new.x) & is.na(df.comb$old_new.y)] <- "deleted"
# If old_new.x is not NA and old_new.y is not NA then that row was in both df1 and df2 = "same"
df.comb$status[!is.na(df.comb$old_new.x) & !is.na(df.comb$old_new.y)] <- "same"
# only keep the columns you need
df.comb <- df.comb[, c("Name", "Vessel_ID", "special_NO", "status")]
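As for the aside about substituting special_NO from df1 where Vessel_ID matches, a minimal base R sketch (assuming Vessel_ID is unique within each data frame) could be:
idx <- match(df2$Vessel_ID, df1$Vessel_ID)
df2$special_NO <- ifelse(is.na(idx), df2$special_NO, df1$special_NO[idx])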
I am trying to find a way to add a column to df1 with information from df2, conditional on the content of each row in df1 without looping through df1.
Specifically, I want to add general information from df2 ("mammal") as a new column to already existing specific information in df1 ("tiger").
The following code works, but I am looking for a faster/vectorized/more elegant version of it, because it is (of course) very slow.
for (i in 1:nrow(df1)) {
  subCategories <- unlist(df1$categories_split[i])
  currentAggrCategories <- unique(df2[df2$subcategory %in% subCategories, 2])
  if (length(currentAggrCategories) == 0) {
    currentAggrCategories <- NA
  }
  df1$aggregatedCategories[[i]] <- currentAggrCategories
}
Data looks like this:
df1:
name sex categories_split
===== === ================
john m c(tiger)
clara f c(crocodile)
ben m c(butterfly, metalmark)
df2:
subcategory category
=========== ============
tiger mammal
crocodile reptile
butterfly insect
metalmark insect
Note that, due to the data structure (which is unfortunately given), there might be multiple hits in df2 which might be unique or not.
Thanks a lot for your help!
Unlist and match the column from df1 to the data in df2
idx <- match(unlist(df1$categories_split), df2$subcategory)
Add the re-listed matches to the original data; this exploits unlist() / relist() semantics to retain the original geometry.
df1$aggregate <- relist(df2$category[idx], df1$categories_split)
Use either stringsAsFactors = FALSE when constructing df1, or as.character(df2$category[idx]) during the relist, to avoid coercion of the factor to integer. Post-process as needed, e.g.,
df1$aggregate = lapply(df1$aggregate, unique)
Use vapply(df1$aggregate, unique, character(1)) if the expectation is that the aggregate column contains a single element.
Here is a base solution:
#unlist the categories_split column
namedf <- do.call(rbind, by(df1, df1$name, function(x) {
data.frame(name=x$name, sex=x$sex, categories_split=unlist(x$categories_split))
}))
rownames(namedf) <- NULL
#perform lookup
namedf$category <- df2$category[match(namedf$categories_split, df2$subcategory)]
namedf
or a data.table solution:
library(data.table)
setDT(df1)
setDT(df2)
df2[
df1[.(name, sex), .(categories_split=unlist(categories_split)), by=.(name, sex), on=.(name, sex)],
on=c("subcategory"="categories_split")]
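If the intent is simply to unnest categories_split per name/sex and then join on the subcategory, a shorter variant along the same lines (a sketch, not the original answer's code) would be:
df2[df1[, .(categories_split = unlist(categories_split)), by = .(name, sex)],
    on = c(subcategory = "categories_split")]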
data:
df1 <- data.frame(name=c("john","clara","ben"),
sex=c("m","f","m"))
df1$categories_split <- list("tiger", "crocodile", c("butterfly","metalmark"))
df2 <- read.table(text="subcategory category
tiger mammal
crocodile reptile
butterfly insect
metalmark insect", header=TRUE)
Is there any way to use a string stored in a variable as a column name in a new data frame? The real and expected results are shown below:
col.name <- 'col1'
df <- data.frame(col.name=1:4)
print(df)
# Real output
col.name
1 1
2 2
3 3
4 4
# Expected output
col1
1 1
2 2
3 3
4 4
I'm aware that I can create the data frame and then use names() to rename the column, or use df[, col.name] on an existing object, but I'd like to know whether there is any other solution that can be used while creating the data frame.
You cannot pass a variable into the name of an argument like that.
Instead what you can do is:
df <- data.frame(placeholder_name = 1:4)
names(df)[names(df) == "placeholder_name"] <- col.name
or use the default name (data.frame(1:4) names the column "X1.4", derived from the expression):
df <- data.frame(1:4)
names(df)[names(df) == "X1.4"] <- col.name
or assign by position:
df <- data.frame(1:4)
names(df)[1] <- col.name
or if you only have one column just replace the entire names attribute:
df <- data.frame(1:4)
names(df) <- col.name
There's also the set_names function in the magrittr package that you can use to do this last solution in one step:
library(magrittr)
df <- set_names(data.frame(1:4), col.name)
But set_names is just an alias for:
df <- `names<-`(data.frame(1:4), col.name)
which is part of base R. Figuring out why this expression works and makes sense will be a good exercise.
In addition to ssdecontrol's answer, there is a second option.
You're looking for mget. First assign the desired name to a variable, then assign the values to a variable named by that string. After that, mget will look up the string and pass the result to data.frame.
assign("col.name", "col1")
assign(paste(col.name), 1:4)
df <- data.frame(mget(col.name))
print(df)
col1
1 1
2 2
3 3
4 4
I don't recommend you do this, but:
col.name <- 'col1'
eval(parse(text=paste0('data.frame(', col.name, '=1:4)')))
I want to get the number of unique values in each of the columns of a data frame.
Let's say I have the following data frame:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
then it should return that there are 3 distinct values for v1, and 2 for v2.
I tried unique(DF), but it does not work, since every row is distinct.
Or using unique:
rapply(DF,function(x)length(unique(x)))
v1 v2
3 2
sapply(DF, function(x) length(unique(x)))
In dplyr:
DF %>% summarise_all(funs(n_distinct(.)))
Here's one approach:
> lapply(DF, function(x) length(table(x)))
$v1
[1] 3
$v2
[1] 2
This basically tabulates the unique values per column. Using length on that tells you the number. Removing length will show you the actual table of unique values.
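For example, dropping length shows those tables of values directly:
lapply(DF, table)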
For the sake of completeness: Since CRAN version 1.9.6 of 19 Sep 2015, the data.table package includes the helper function uniqueN() which saves us from writing
function(x) length(unique(x))
when calling one of the siblings of apply():
sapply(DF, data.table::uniqueN)
v1 v2
3 2
Note that neither the data.table package needs to be loaded nor DF coerced to class data.table in order to use uniqueN(), here.
In dplyr (>= 1.0.0, June 2020):
DF %>% summarize_all(n_distinct)
v1 v2
1 3 2
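summarize_all is superseded as of dplyr 1.0.0; an across()-based equivalent (assuming dplyr >= 1.0.0) would be:
DF %>% summarise(across(everything(), n_distinct))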
I think a function like this would give you what you are looking for. In addition to the number of unique values, it also reports how many NAs there are in each of the data frame's columns. Simply plug in your data frame, and you are good to go.
totaluniquevals <- function(df) {
x <<- data.frame("Row Name"= numeric(0), "TotalUnique"=numeric(0), "IsNA"=numeric(0))
result <- sapply(df, function(x) length(unique(x)))
isnatotals <- sapply(df, function(x) sum(is.na(x)))
#Now Create the Row names
for (i in 1:length(colnames(df))) {
x[i,1] <<- (names(result[i]))
x[i,2] <<- result[[i]]
x[i,3] <<- isnatotals[[i]]
}
return(x)
}
Test:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
totaluniquevals(DF)
Row.Name TotalUnique IsNA
1 v1 3 0
2 v2 2 0
You can then use unique on whatever column, to see what the specific unique values are.
unique(DF$v2)
[1] a b
Levels: a b
This should work for getting the number of unique values for a single variable:
length(unique(datasetname$variablename))
This will give you the unique values of the first column of the data frame:
unique(DF[, 1])
I'm working from this answer, trying to automate the second argument to plyr::rename, as suggested by Jared.
In short, they rename some columns in a data frame using plyr, like this:
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df
newNames <- c("new_col1", "new_col2", "new_col3")
oldNames <- names(df)
require(plyr)
df <- rename(df, c("col1"="new_col1", "col2"="new_col2", "col3"="new_col3"))
df
In passing Jared writes '[a]nd you can be creative in making that second argument to rename so that it is not so manual.'
I've tried being creative like this,
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df
secondArgument <- paste0('"', oldNames, '"','=', '"',newNames, '"',collapse = ',')
df <- rename(df, secondArgument)
df
But it does not work. Can anyone help me automate this?
Thanks!
Update Sun Sep 9 11:55:42PM
I realized I should have been more specific in my question. I'm using plyr::rename because, in my real-life example, I have other variables and I don't always know the position of the variables I want to rename. I'll add an update to my question.
My case looks like this, but with 100+ variables:
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
df2 <- rename(df2, c("col1"="new_col1", "col3"="new_col3"))
df2
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
newNames <- c("new_col1", "new_col3")
oldNames <- names(df[,c('col1', 'col3')])
secondArgument <- paste0('"', oldNames, '"','=', '"',newNames, '"',collapse = ',')
df2 <- rename(df2, secondArgument)
df2
Please add an comment if there is anything I need to clarify.
Solution to modified question:
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
newNames <- c("new_col1", "new_col3")
oldNames <- names(df2[,c('col1', 'col3')])
(Isn't oldNames equal to c('col1','col3') by definition?)
Solution with plyr:
secondArgument <- setNames(newNames,oldNames)
library(plyr)
df2 <- rename(df2, secondArgument)
df2
Or in base R you could do:
names(df2)[match(oldNames,names(df2))] <- newNames
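With a fresh df2 and the oldNames/newNames vectors above, this renames only the matched columns and leaves the rest untouched:
names(df2)
# [1] "new_col1" "col2"     "new_col3"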
Set the names on newNames to the names from oldNames:
R> names(newNames) <- oldNames
R> newNames
col1 col2 col3
"new_col1" "new_col2" "new_col3"
R> df <- rename(df, newNames)
R> df
new_col1 new_col2 new_col3
1 1 3 6
2 2 4 7
3 3 5 8
plyr::rename requires a named character vector, with new names as values, and old names as names.
This should work:
names(newNames) <- oldNames
df <- rename(df, newNames)
df
new_col1 new_col2 new_col3
1 1 3 6
2 2 4 7
3 3 5 8