How to use bind_rows() and ignore column names [duplicate] - r

This question already has answers here:
Simplest way to get rbind to ignore column names
(2 answers)
Closed 4 years ago.
This question has probably been answered before, but I can't seem to find the answer. How do you use bind_rows() to just union two tables and ignore the column names?
The documentation on bind_rows() has the following example:
#Columns don't need to match when row-binding
bind_rows(data.frame(x = 1:3), data.frame(y = 1:4))
This returns column x and y. How do I just get a single column back without having to change the column names?
Desired output, I don't really care what the column name ends up being:
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4

You can do this with a quick two-line function:
force_bind <- function(df1, df2) {
  colnames(df2) <- colnames(df1)
  bind_rows(df1, df2)
}
force_bind(df1, df2)
Output:
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4

I think we still need to change the names here:
bind_rows(data.frame(x = 1:3), setNames(rev(data.frame(y = 1:4)), names(data.frame(x = 1:3))))
x
1 1
2 2
3 3
4 1
5 2
6 3
7 4
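Generalizing the idea above, here is a sketch for binding any number of data frames by position; bind_rows_by_position is a made-up helper name, and it assumes every frame has the same number of columns as the first:

```r
library(dplyr)

# Bind any number of data frames by column position, ignoring names.
# Assumes every frame has the same number of columns as the first.
bind_rows_by_position <- function(...) {
  dfs <- list(...)
  template <- names(dfs[[1]])
  bind_rows(lapply(dfs, setNames, template))
}

bind_rows_by_position(data.frame(x = 1:3), data.frame(y = 1:4))
```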


How to modify the variable names by combining current variable names and row 1 values? [duplicate]

This question already has answers here:
Concatenate column name and 1st row in R
(2 answers)
Pasting the first row to the column name within a list
(1 answer)
Closed 1 year ago.
How can I modify raw_dataframe to wished_dataframe?
raw_dataframe <- data.frame(
  category = c('a', '1', '2', '3', '4'),
  subcategory = c('b', '3', '2', '1', '0'),
  item = c('wd', '4', '5', '7', '0'))
wished_dataframe <- data.frame(
  category_a = c('1', '2', '3', '4'),
  subcategory_b = c('3', '2', '1', '0'),
  item_wd = c('4', '5', '7', '0'))
I actually have many csv files with the same structure as 'raw_dataframe', and I want to combine the column names and the values in row 1 into the new variable names. Can anyone help?
# Paste colnames with values of row 1
colnames(raw_dataframe) <- paste0(colnames(raw_dataframe), "_", raw_dataframe[1, ])
# Remove row 1 and save in `wished_dataframe`
wished_dataframe <- raw_dataframe[-1, ]
A dplyr way: We could use rename_with:
library(dplyr)
raw_dataframe %>%
  rename_with(~ paste0(., "_", raw_dataframe[1, ])) %>%
  slice(-1)
category_a subcategory_b item_wd
1 1 3 4
2 2 2 5
3 3 1 7
4 4 0 0
An option with janitor
library(janitor)
library(stringr)
library(dplyr)
row_to_names(raw_dataframe, 1) %>%
rename_with(~ str_c(names(raw_dataframe), '_', .))
category_a subcategory_b item_wd
2 1 3 4
3 2 2 5
4 3 1 7
5 4 0 0
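Since the asker has many csv files, the base-R step above can be wrapped in a small function and applied over a list of frames. A sketch, where the file paths are hypothetical placeholders:

```r
# Paste the header onto the values of row 1, then drop row 1.
fix_header <- function(df) {
  names(df) <- paste0(names(df), "_", unlist(df[1, ]))
  df[-1, , drop = FALSE]
}

# Hypothetical batch usage over the asker's files:
# files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
# fixed <- lapply(files, function(f) fix_header(read.csv(f, colClasses = "character")))

raw_dataframe <- data.frame(
  category = c('a', '1', '2', '3', '4'),
  subcategory = c('b', '3', '2', '1', '0'),
  item = c('wd', '4', '5', '7', '0'))
fix_header(raw_dataframe)
```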

How do I use the tidyverse packages to get a running total of unique values occurring in a column? [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Closed 3 years ago.
I'm trying to use the tidyverse (whatever package is appropriate) to add a column (via mutate()) that is a running total of the unique values that have occurred in the column so far. Here is some toy data, showing the desired output.
data.frame("n"=c(1,1,1,6,7,8,8),"Unique cumsum"=c(1,1,1,2,3,4,4))
Who knows how to accomplish this in the tidyverse?
Here is an option with group_indices
library(dplyr)
df1 %>%
  mutate(unique_cumsum = group_indices(., n))
# n unique_cumsum
#1 1 1
#2 1 1
#3 1 1
#4 6 2
#5 7 3
#6 8 4
#7 8 4
data
df1 <- data.frame("n"=c(1,1,1,6,7,8,8))
Here's one way, using the fact that a factor will assign a sequential value to each unique item, and then converting the underlying factor codes with as.numeric:
data.frame("n" = c(1, 1, 1, 6, 7, 8, 8)) %>%
  mutate(unique_cumsum = as.numeric(factor(n)))
n unique_cumsum
1 1 1
2 1 1
3 1 1
4 6 2
5 7 3
6 8 4
7 8 4
Another solution:
df <- data.frame("n"=c(1,1,1,6,7,8,8))
df <- df %>% mutate(`unique cumsum` = cumsum(!duplicated(n)))
This should work even if your data is not sorted.
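Note that group_indices() has since been superseded in newer dplyr releases. A base-R sketch that gives the same result for this data uses match(), which numbers each value by the order of its first appearance:

```r
# Base-R alternative: match() numbers each value by the order of its
# first appearance, matching group_indices() for this data.
df1 <- data.frame(n = c(1, 1, 1, 6, 7, 8, 8))
df1$unique_cumsum <- match(df1$n, unique(df1$n))
df1
```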

Create a new dataframe according to the contrast between two similar df [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have a dataframe made like this:
X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3
After several steps (it doesn't matter which ones) I obtained this df:
X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3
I want to obtain a new dataframe made up of only the rows that didn't change during the steps; the result would be this one:
X Y Z T
1 2 4 2
7 5 NA 3
How can I do this?
One option with base R would be to paste the rows of each dataset together and compare them (==) to create a logical vector, which we then use to subset the dataset:
dfO[do.call(paste, dfO) == do.call(paste, df),]
# X Y Z T
#1 1 2 4 2
#3 7 5 NA 3
where 'dfO' is the old dataset and 'df' is the new
You can use dplyr's intersect function:
library(dplyr)
intersect(d1, d2)
# X Y Z T
#1 1 2 4 2
#2 7 5 NA 3
This is a data.frame-equivalent of base R's intersect function.
In case you're working with data.tables, that package also provides such a function:
library(data.table)
setDT(d1)
setDT(d2)
fintersect(d1, d2)
# X Y Z T
#1: 1 2 4 2
#2: 7 5 NA 3
Another dplyr solution: semi_join.
dt1 %>% semi_join(dt2, by = colnames(.))
X Y Z T
1 1 2 4 2
2 7 5 NA 3
Data
dt1 <- read.table(text = "X Y Z T
1 2 4 2
3 2 1 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = " X Y Z T
1 2 4 2
3 2 NA 4
7 5 NA 3",
header = TRUE, stringsAsFactors = FALSE)
I am afraid that neither semi_join, nor intersect, nor merge is the correct answer. merge and intersect will not handle duplicate rows properly, and semi_join will change the order of the rows.
From this perspective, I think the only correct one so far is akrun's.
You could also do something like:
df1[rowSums(((df1 == df2) | (is.na(df1) & is.na(df2))), na.rm = T) == ncol(df1),]
But I think akrun's way is more elegant and likely to perform better in terms of speed.
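To illustrate the duplicate-row caveat mentioned above, here is a small sketch (the two frames are invented for illustration): the row-wise paste comparison keeps each matching copy of a duplicated row, while intersect() collapses duplicates under set semantics:

```r
library(dplyr)

# Frames where a duplicated row stayed the same and another row changed.
old <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
new <- data.frame(a = c(1, 1, 2), b = c("x", "x", "z"))

# Row-wise comparison keeps both unchanged "1 x" rows.
old[do.call(paste, old) == do.call(paste, new), ]

# intersect() returns distinct common rows only: a single "1 x" row.
intersect(old, new)
```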

Delete Duplicates when Merging DF [duplicate]

This question already has answers here:
Select only the first row when merging data frames with multiple matches
(4 answers)
Closed 5 years ago.
I know, I know.... Another merging DF question, but please hear me out: I have searched SO for an answer on this and found none.
I am merging two DFs, one smaller than the other, and doing a left merge to match the longer DF up to the smaller one.
This works well except for one issue: rows get added to the left (smaller) df when the right (longer) df has duplicates.
An Example:
Row<-c("a","b","c","d","e")
Data<-(1:5)
df1<-data.frame(Row,Data)
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
names(df2)<-c("Row","Data2")
DATA<-merge(x = df1, y = df2, by = "Row", all.x = TRUE)
>DATA
Row Data Data2
1 a 1 1
2 b 2 2
3 b 2 3
4 c 3 4
5 d 4 5
6 e 5 6
See the extra "b" row? That is what I want to get rid of. I want to keep the left DF strictly as-is: if there are 5 rows in DF1, the merged result should have only 5 rows.
Like this...
Row Data Data2
1 a 1 1
2 b 2 2
3 c 3 4
4 d 4 5
5 e 5 6
Where it only takes the first match and moves on.
I realize the merge function is only doing its job here, so is there another way to get my expected result? Or is there a post-merge modification that should be done instead?
Thank you for your help and time.
Research:
How to join (merge) data frames (inner, outer, left, right)?
deleting duplicates
Merging two data frames with different sizes and missing values
We can use the duplicated function as follows:
DATA[!duplicated(DATA$Row),]
Row Data Data2
1 a 1 1
2 b 2 2
4 c 3 4
5 d 4 5
6 e 5 6
It's also possible like this, dropping the duplicate keys from df2 before merging:
merge(x = df1, y = df2[!duplicated(df2$Row), ], by = "Row", all.x = TRUE)
# Row Data Data2
#1 a 1 1
#2 b 2 2
#3 c 3 4
#4 d 4 5
#5 e 5 6
Since you only want the first row and don't care what variables are chosen, then you can use this code (before you merge):
Row2<-c("a","b","b","c","d","e","f","g","h")
Data2<-(1:9)
df2<-data.frame(Row2,Data2)
library(dplyr)
df2 %>%
group_by(Row2) %>%
slice(1)
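The same idea fits into a single dplyr pipeline: deduplicate the keys with distinct() and then left_join(), which preserves df1's row count. A sketch using the question's data:

```r
library(dplyr)

df1 <- data.frame(Row = c("a", "b", "c", "d", "e"), Data = 1:5)
df2 <- data.frame(Row = c("a", "b", "b", "c", "d", "e", "f", "g", "h"),
                  Data2 = 1:9)

# Keep only the first row per key in df2, then left-join onto df1.
DATA <- left_join(df1, distinct(df2, Row, .keep_all = TRUE), by = "Row")
DATA
```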

Apply a maximum value to whole group [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 6 years ago.
I have a df like this:
Id count
1 0
1 5
1 7
2 5
2 10
3 2
3 5
3 4
and I want to get the maximum count and apply that to the whole "group" based on ID, like this:
Id count max_count
1 0 7
1 5 7
1 7 7
2 5 10
2 10 10
3 2 5
3 5 5
3 4 5
I've tried pmax, slice etc. I'm generally having trouble working with data that is in interval-specific form; if you could direct me to tools well-suited to that type of data, would really appreciate it!
Figured it out with help from Gavin Simpson here: Aggregate a dataframe on a given column and display another column
maxcount <- aggregate(count ~ Id, data = df, FUN = max)
new_df<-merge(df, maxcount)
Better way:
df$max_count <- with(df, ave(count, Id, FUN = max))
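For completeness, a dplyr equivalent of the ave() approach above: a grouped mutate() broadcasts the group maximum back to every row. A sketch with the question's data:

```r
library(dplyr)

df <- data.frame(Id = c(1, 1, 1, 2, 2, 3, 3, 3),
                 count = c(0, 5, 7, 5, 10, 2, 5, 4))

# max(count) is computed per Id group and recycled to every row in it.
result <- df %>%
  group_by(Id) %>%
  mutate(max_count = max(count)) %>%
  ungroup()
result
```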
