Merge 2 csv files into one with different columns in R [duplicate] - r

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 3 years ago.
I have 2 csv datasets, each one with about 10k columns. The datasets are extracted from the same source, but the column sequence of these datasets is different (there are some new columns on the 2nd ds). So, I want to merge data of the 2nd dataset into the first one, keeping the column sequence of the first dataset. How can I do this?
Here follows an example:
Dataset 1:
Brand   Year  Model   Price
Ford    2010  Taurus  5K
Toyota  2015  Yaris   4K
Dataset 2:
Brand      Year  Model  Color      Location  Price
Chevrolet  2013  Spark  Dark Gray  PHI       2K
I would like to ignore the new columns (color, location) on the 2nd dataset and add the data with the same columns (brand, year, model, price) of the 2nd dataset into the first one.
Thanks in advance.

If you want to append the two datasets, try using bind_rows from the dplyr library. Use the first dataset as the first argument.
Here's a reproducible example you can modify if this result isn't what you are looking for. Remember, a reproducible example means that you provide code that others can run when they are testing solutions for you. Your example currently doesn't let users copy data into R and test a solution. Try using dput on a small dataset to give folks on Stack Overflow some data to work with.
library(dplyr)
# Make up data
df <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- data.frame(a = c(5,6), b = c(2, 3), c = c(7, 8), d = c(1, 5))
# determine columns to remove from df2:
remove.these <- setdiff(colnames(df2), colnames(df))
# remove them before binding to save time
df2 <- select(df2, -all_of(remove.these))
# bind two dataframes together
finaldf <- bind_rows(df, df2)
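For the stated goal of keeping only the first dataset's columns, in the first dataset's order, a base R alternative to the dplyr approach above is to index the second data frame by the first one's column names before binding. This is a minimal sketch using the cars from the question; the names ds1, ds2 and combined are made up for illustration, and it assumes every column of ds1 also exists in ds2 (otherwise the indexing step raises an error).

```r
# Data from the question, typed in by hand (Price kept as text, e.g. "5K")
ds1 <- data.frame(Brand = c("Ford", "Toyota"),
                  Year  = c(2010, 2015),
                  Model = c("Taurus", "Yaris"),
                  Price = c("5K", "4K"),
                  stringsAsFactors = FALSE)
ds2 <- data.frame(Brand = "Chevrolet", Year = 2013, Model = "Spark",
                  Color = "Dark Gray", Location = "PHI", Price = "2K",
                  stringsAsFactors = FALSE)

# ds2[names(ds1)] keeps only ds1's columns, in ds1's order,
# so Color and Location are dropped before the rows are appended
combined <- rbind(ds1, ds2[names(ds1)])
print(combined)
```

With files on disk, ds1 and ds2 would come from read.csv() instead of being typed in.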

Related

Merge two data set lead to duplicate rows or no duplicate rows but with NA data in R by Tidyverse

I am trying to merge two data sets that share a "Breed" column of dog breeds. data1 contains dog traits and their scores; data2 contains the same breeds as data1 with their popularity rank in America from 2013 to 2020. I am having trouble merging the two into one. The result either shows NA for the 2013-2020 rank information, or it contains duplicate rows for the same breed, one row with the data from data set 1 and another with the data from data set 2. The closest I can get is merge(x, y, by = 'row.names', all = TRUE), which brings in all the data correctly but produces two duplicated columns, Breed.x and Breed.y. I am looking for a way to get a single Breed column with all the data filled in correctly.
Here is the data I am using. breed_traits is data set 1, and breed_rank_all is data set 2, which I want to merge into breed_traits.
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
This is the call that comes closest, but it produces a Breed.y column:
breed_total <- merge(breed_traits, breed_rank_all, by = c('row.names') , all =TRUE)
breed_total
I tried a left join as well, but it shows NA for the 2013-2020 ranks:
library(dplyr)
breed_traits |> left_join(breed_rank_all, by = c('Breed'))
I also tried this, which returns duplicated rows for the same breed:
merge(breed_traits, breed_rank_all, by = c('row.names', 'Breed'), all = TRUE)
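When a join on a key column returns only NA while a join on row names lines up, one thing worth checking is whether the keys differ in invisible characters, such as non-breaking spaces. The sketch below uses made-up toy data, not the real CSV files, and the names traits and ranks are hypothetical. It only illustrates the symptom and one possible fix, and is not a confirmed diagnosis of the data above.

```r
# Toy stand-ins (hypothetical values). The Breed keys in `traits`
# contain non-breaking spaces (\u00a0) instead of ordinary spaces.
traits <- data.frame(Breed = c("French\u00a0Bulldogs", "Great\u00a0Danes"),
                     Playfulness = c(5, 4),
                     stringsAsFactors = FALSE)
ranks <- data.frame(Breed = c("French Bulldogs", "Great Danes"),
                    Rank2020 = c(2, 15),
                    stringsAsFactors = FALSE)

# Left join on the raw keys: nothing matches, so Rank2020 is all NA
raw_join <- merge(traits, ranks, by = "Breed", all.x = TRUE)

# Replace the non-breaking spaces in the key, then join again
traits$Breed <- gsub("\u00a0", " ", traits$Breed, fixed = TRUE)
clean_join <- merge(traits, ranks, by = "Breed", all.x = TRUE)
print(clean_join)
```

Comparing setdiff(traits$Breed, ranks$Breed) before and after cleaning is a quick way to see which keys fail to match.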

How to summarize two data frames by matching date columns?

I have two data frames, Original and base:
Original <- data.frame(Bond = c("A", "B", "C", "D"),
                       Date = c("19-11-2021", "19-11-2021", "19-11-2021", "17-11-2021"),
                       Rate = c("O_11", "O_12", "O_13", "O_31"))
base<- data.frame(Date = c("19-11-2021","18-11-2021","17-11-2021"), Rate =c("B_1","B_2","B_3"))
Here I would like to calculate the rate differential between Original and Base for each bond of each date w.r.t. the base rate. The output should be in the following format -
Note: The original data frame contains numerical values of the Original and Base Rates
I tried using group_by() but wasn't able to get much further. Please help me with this; even a suggestion would help.
It seems you want to join on date. With dplyr you can use an inner_join, assuming there is a date in base for every record in Original:
library(dplyr)
Output <- Original %>%
  inner_join(base, by = "Date") %>%
  mutate(Rate_Diff = paste0(Rate.x, "-", Rate.y), Rate = Rate.x) %>%
  select(-Rate.x, -Rate.y)
> Output
  Bond       Date Rate_Diff Rate
1    A 19-11-2021  O_11-B_1 O_11
2    B 19-11-2021  O_12-B_1 O_12
3    C 19-11-2021  O_13-B_1 O_13
4    D 17-11-2021  O_31-B_3 O_31
Edit: I see the note now. In that case you can replace the paste0 call with a subtraction of the actual columns:
mutate(Rate_Diff = Rate.x - Rate.y, Rate=Rate.x)
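To see the numeric version end to end, here is a small base R sketch of the same join-then-subtract idea. It swaps dplyr's inner_join for merge(), which also performs an inner join by default. The rate values are invented for illustration, and the column names Rate_orig and Rate_base come from the suffixes chosen here, not from the question.

```r
# Same bonds and dates as in the question, with made-up numeric rates
Original <- data.frame(Bond = c("A", "B", "C", "D"),
                       Date = c("19-11-2021", "19-11-2021", "19-11-2021", "17-11-2021"),
                       Rate = c(5.5, 5.25, 4.75, 5.0),
                       stringsAsFactors = FALSE)
base <- data.frame(Date = c("19-11-2021", "18-11-2021", "17-11-2021"),
                   Rate = c(4.5, 4.0, 4.25),
                   stringsAsFactors = FALSE)

# Inner join on Date; the clashing Rate columns get explicit suffixes
Output <- merge(Original, base, by = "Date", suffixes = c("_orig", "_base"))

# Rate differential of each bond relative to the base rate on its date
Output$Rate_Diff <- Output$Rate_orig - Output$Rate_base

# merge() sorts by the key, so restore bond order and tidy the columns
Output <- Output[order(Output$Bond),
                 c("Bond", "Date", "Rate_orig", "Rate_base", "Rate_Diff")]
print(Output)
```

Bonds whose date has no row in base are dropped by the inner join; passing all.x = TRUE to merge() would keep them with NA in Rate_Diff.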

Extract rows from second Dataframe which are newly added compare to first Dataframe [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 2 years ago.
I have two data frames and need to find the rows in the second one that are newly added. The first data frame has some rows; the second can contain a few rows from the first plus some other rows. I need the rows that are not in the first data frame, that is, the rows that appear only in the second.
Below is an example with the expected output:
comp1<- data.frame(sector =c('Sector_123','Sector_456','Sector_789','Sector_101','Sector_111','Sector_113','Sector_115','Sector_117'), id=c(1,2,3,4,5,6,7,8) ,stringsAsFactors = FALSE)
comp2 <- data.frame(sector = c('Sector_456','Sector_789','Sector_000','Sector_222'), id=c(2,3,6,5), stringsAsFactors = FALSE)
The expected output should look like this:
sector      id
Sector_000  6
Sector_222  5
I should not use any other libraries, such as compare or data.table.
Any suggestions?
This assumes rows are matched on the sector column only. To compare on all columns, drop the by argument from the dplyr call.
We could use dplyr:
library(dplyr)
anti_join(comp2, comp1, by="sector")
gives us
> anti_join(comp2, comp1, by="sector")
      sector id
1 Sector_000  6
2 Sector_222  5
With base R we could use
comp2[!comp2$sector %in% comp1$sector,]
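The base R line above matches on sector only. If a row should count as "new" only when the whole row is absent from the first data frame, one possible base R sketch is to build a composite key from all shared columns and compare those keys. The names cols, key1, key2 and new_rows are made up here, and the approach assumes the separator character never occurs in the data.

```r
comp1 <- data.frame(sector = c('Sector_123', 'Sector_456', 'Sector_789', 'Sector_101',
                               'Sector_111', 'Sector_113', 'Sector_115', 'Sector_117'),
                    id = c(1, 2, 3, 4, 5, 6, 7, 8), stringsAsFactors = FALSE)
comp2 <- data.frame(sector = c('Sector_456', 'Sector_789', 'Sector_000', 'Sector_222'),
                    id = c(2, 3, 6, 5), stringsAsFactors = FALSE)

# Columns present in both data frames
cols <- intersect(names(comp1), names(comp2))

# One text key per row, pasted from all shared columns
key1 <- do.call(paste, c(comp1[cols], sep = "\r"))
key2 <- do.call(paste, c(comp2[cols], sep = "\r"))

# Rows of comp2 whose key does not occur in comp1
new_rows <- comp2[!key2 %in% key1, ]
print(new_rows)
```

On this example the result is the same two rows as the sector-only match, because the shared sectors also share their ids.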

Using R, how do I take a list of players and points and create a dataframe of teams and top five player pts per game

I have a dataframe of all NBA players, their team and their points per game. I want to create a new data frame listing team names as the first column, and the next five columns are the pts per game of their five leading scorers.
so... (made up numbers)
ATL 17.2 14.3 12.2 10.2 9.4
I'm trying to work through what might get me there. I'm thinking I need to create subsets of the first data frame for each team (listing each of their scorers), then sort all 30 data frames and then move the first 5 values in the pts per game column into a new data frame using [0:4].
Is there an easy way to use a for loop to create all 30 data frames? Maybe if I created a list for each team name and then did something like....
for i in list:
create data frame i from ALLPLAYERS[TEAM = i]
Then I could use some other sort to sort them and add them into the final data frame.
Sorry, I know the "code" portion above isn't really the code, it's just what I'm thinking, I need to find the exact wording.
This works using data.table.
library(data.table)
nba = data.table(player = 1:100, team = rep(LETTERS[1:10], each = 10), ppg = 1:100)
nba[, as.list(tail(sort(ppg), 5)), by = team]
I use unrealistic points-per-game values, but they make it easy to see what is happening.
Here's some example code for one strategy (top 2 scorers):
set.seed(123)
df <- data.frame(team = LETTERS[1:2], player = replicate(8, paste0(sample(letters, 5, T), collapse = "")), score = sample(1:20, 8, T))
aggregate(score~team, data = df[order(-df$score), ], head, 2)
# team score.1 score.2
# 1 A 9 5
# 2 B 10 9
Using the packages library(dplyr) and library(tidyr), along with the fake data generated by DaveTurek above, here is a step-by-step solution:
Generate fake data:
nba=data.frame(player=1:100,team=rep(LETTERS[1:10],each=10),ppg=1:100)
Select only the top 5 scorers per team by grouping, sorting, and slicing:
top_scorers <- nba %>% group_by(team) %>% arrange(-ppg) %>% slice(1:5)
Create a new variable called scoreRank that assigns their rank within the team, where 1 is highest scoring and 5 is 5th highest scoring:
top_scorers <- top_scorers %>% group_by(team) %>% mutate(scoreRank = rank(-ppg))
Drop the player name column and cast as a data frame (the latter is necessary due to a bug in dplyr):
top_scorers <- as.data.frame(top_scorers %>% select(-player))
Spread the data frame into the desired wide format, instead of its current long format:
result <- spread(top_scorers,scoreRank,ppg)
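The idea from the question (one subset per team, sort each, take the first five) can also be sketched in base R without an explicit for loop: split() builds the per-team pieces in one call. This reuses the fake data above; the names ppg_by_team, top5 and wide and the rank1 to rank5 column labels are made up for illustration. Note that R indexes from 1, so the [0:4] in the question corresponds to the first five elements, taken here with head(x, 5). The sketch assumes every team has at least five players.

```r
# Fake data: 10 teams (A to J) with 10 players each
nba <- data.frame(player = 1:100,
                  team = rep(LETTERS[1:10], each = 10),
                  ppg = 1:100,
                  stringsAsFactors = FALSE)

# One vector of points per game for each team
ppg_by_team <- split(nba$ppg, nba$team)

# Sort each team's values in decreasing order and keep the top five
top5 <- lapply(ppg_by_team, function(x) head(sort(x, decreasing = TRUE), 5))

# Stack the per-team vectors into a matrix (one row per team),
# then turn it into a data frame with the team name as first column
wide <- do.call(rbind, top5)
result <- data.frame(team = rownames(wide), wide, row.names = NULL)
names(result)[-1] <- paste0("rank", 1:5)
print(result)
```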

dplyr to reference two data frame (summarize function) in R

I created a data frame from a data set with unique marketing sources. Let's say I have 20 unique marketing sources in this new data frame D1. I want to add another column that has the count of times this marketing source was in my original data frame. I'm trying to use the dplyr package but not sure how to reference more than one data frame.
original data has 16000 observations
new data frame has 20 observations as there are only 20 unique marketing sources.
How to use summarize in dplyr to reference two data frames?
My objective is to find the percentage of marketing sources.
My original data frame has two columns: NAME, MARKETING_SOURCE
This data frame has 16,000 observations and 20 distinct marketing sources (email, event, sales call, etc)
I created a new data frame with only the unique MARKETING_SOURCES and called that data frame D1
In my new data frame, I want to add another column that has the number of times each marketing source appeared in the original data frame.
My new Data frame should have two columns: MARKETING_SOURCE, COUNT
I don't know if you need to use dplyr for something like this...
First let's create some data.frames:
df1 <- data.frame(source = letters[sample(1:26, 400, replace = T)])
df2 <- data.frame(source = letters, count = NA)
Then we can use table() to get the frequencies:
counts <- table(df1$source)
df2$count <- counts
head(df2)
  source count
1      a    10
2      b    22
3      c    12
4      d    17
5      e    18
6      f    18
UPDATE:
In response to @MrFlick's wise comment below, you can take the names() of the output from table() to ensure the order is preserved:
df2$source <- names(counts)
Certainly not quite as elegant, and it would be even less elegant if df2 had other columns, but it is sufficient for the simple case presented above.
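Another way to avoid relying on row order is to build the count table directly from the original data, since as.data.frame() on a table() result keeps each source next to its count. The sketch below uses invented data with the question's column names NAME and MARKETING_SOURCE. The PERCENT column is an addition aimed at the stated objective of finding percentages, and 200 rows stand in for the 16,000 observations.

```r
set.seed(1)
# Made-up stand-in for the original data frame
original <- data.frame(NAME = paste0("person", 1:200),
                       MARKETING_SOURCE = sample(c("email", "event", "sales call"),
                                                 200, replace = TRUE),
                       stringsAsFactors = FALSE)

# One row per distinct source with its count, built in a single step
D1 <- as.data.frame(table(MARKETING_SOURCE = original$MARKETING_SOURCE),
                    responseName = "COUNT",
                    stringsAsFactors = FALSE)

# Share of each source as a percentage of all observations
D1$PERCENT <- 100 * D1$COUNT / sum(D1$COUNT)
print(D1)
```

Because the sources and counts come from the same table, there is no separate list of unique sources to keep in sync.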
