I created a data frame from a data set with unique marketing sources. Let's say I have 20 unique marketing sources in this new data frame D1. I want to add another column that has the count of times this marketing source was in my original data frame. I'm trying to use the dplyr package but not sure how to reference more than one data frame.
The original data has 16,000 observations.
The new data frame has 20 observations, as there are only 20 unique marketing sources.
How to use summarize in dplyr to reference two data frames?
My objective is to find the percentage of marketing sources.
My original data frame has two columns: NAME, MARKETING_SOURCE
This data frame has 16,000 observations and 20 distinct marketing sources (email, event, sales call, etc)
I created a new data frame with only the unique MARKETING_SOURCES and called that data frame D1
In my new data frame, I want to add another column that has the number of times each marketing source appeared in the original data frame.
My new Data frame should have two columns: MARKETING_SOURCE, COUNT
I don't know if you need to use dplyr for something like this...
First let's create some data.frames:
df1 <- data.frame(source = letters[sample(1:26, 400, replace = TRUE)])
df2 <- data.frame(source = letters, count = NA)
Then we can use table() to get the frequencies:
counts <- table(df1$source)
df2$count <- counts
head(df2)
  source count
1      a    10
2      b    22
3      c    12
4      d    17
5      e    18
6      f    18
UPDATE:
In response to @MrFlick's wise comment below, you can take the names() of the output from table() to ensure order is preserved:
df2$source <- names(counts)
This is certainly not as elegant, and would be even less so if df2 had other columns, but it is sufficient for the simple case presented above.
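Since the question asks about dplyr, here is a minimal sketch of the same count done with count() and mutate(). The data frame named original and its contents are made up for illustration; only the column names NAME and MARKETING_SOURCE come from the question. Counting directly from the original data means there is no need to reference a second data frame at all.

```r
library(dplyr)

# Made-up stand-in for the 16,000-row original data (hypothetical values)
set.seed(1)
original <- data.frame(
  NAME = paste0("person", 1:400),
  MARKETING_SOURCE = sample(c("email", "event", "sales call"), 400, replace = TRUE)
)

# One row per marketing source, with its count and its share of all rows
D1 <- original %>%
  count(MARKETING_SOURCE, name = "COUNT") %>%
  mutate(PERCENT = 100 * COUNT / sum(COUNT))
```

D1 then has the columns MARKETING_SOURCE, COUNT and PERCENT, which also covers the stated goal of getting percentages.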
I am trying to merge two data sets that share a "Breed" column of dog breeds. data1 has dog traits and their scores; data2 has the same breeds as data1 with their popularity rank in America from 2013 to 2020. I have trouble merging the two data sets into one: the result either shows NA for the 2013-2020 rank information or has duplicate rows for each breed, one row with the data from data set 1 and another with the data from data set 2. The closest I can get is merge(x, y, by = 'row.names', all = TRUE), which brings in all the data correctly but with two duplicated columns, Breed.x and Breed.y. I am looking for a way to get a single Breed column with all the data filled in correctly.
Here is the data I am using. breed_traits is data set 1, and breed_rank_all is data set 2, which I want to merge into breed_traits.
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
This is the call that came closest, but it leaves a Breed.y column:
breed_total <- merge(breed_traits, breed_rank_all, by = 'row.names', all = TRUE)
breed_total
I tried a left join as well, but it shows NA for the 2013-2020 ranks:
library(dplyr)
breed_traits |> left_join(breed_rank_all, by = c('Breed'))
I also tried this, which returns duplicated rows for the same breed:
merge(breed_traits, breed_rank_all, by = c('row.names', 'Breed'), all = TRUE)
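One thing worth checking, offered as a guess rather than a confirmed diagnosis: a join on Breed that returns only NA usually means the key strings are not byte-for-byte identical, for example because one file uses non-breaking spaces (U+00A0) where the other uses ordinary spaces. The sketch below uses two small made-up data frames (traits and ranks, with hypothetical values) instead of the real CSVs to show the symptom and one way to normalise the key before joining on the single Breed column.

```r
# Toy stand-ins for breed_traits and breed_rank_all (hypothetical values).
# The first Breed in `traits` contains a non-breaking space (U+00A0).
traits <- data.frame(Breed = c("Golden\u00a0Retriever", "Bulldog"),
                     score = c(5, 3),
                     stringsAsFactors = FALSE)
ranks <- data.frame(Breed = c("Golden Retriever", "Bulldog"),
                    rank_2020 = c(4, 5),
                    stringsAsFactors = FALSE)

# Joining as-is leaves NA for the breed whose name does not match exactly
naive <- merge(traits, ranks, by = "Breed", all.x = TRUE)

# Replace the non-breaking space with a regular space, then join again
traits$Breed <- gsub("\u00a0", " ", traits$Breed, fixed = TRUE)
fixed <- merge(traits, ranks, by = "Breed", all.x = TRUE)
```

If the real data shows the same pattern, applying the same gsub() to breed_traits$Breed before left_join(breed_rank_all, by = 'Breed') should give one Breed column with the ranks filled in.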
Complete R novice here.
I have a wide-form data frame which includes a variable for participant_number, with each participant providing two responses (score) under a within-subjects manipulation (code).
However, I have three separate sets of values which corresponded to the participant numbers in three different (between subjects) experimental groups (e.g. control, active_1, active_2).
How can I use these sets of values to create a variable in my main data frame which indicates what experimental group the participant belongs to?
Any help, much appreciated.
The package "dplyr" is quite useful for this kind of thing. Let's consider a small working example:
df <- data.frame(ID=c(1:7))
ListActive1 <- c(1,3)
ListActive2 <- c(2,5)
ListControl <- c(4,7,6)
df is the main data frame containing the ID of the participant (and of course it may have further columns, e.g. the score etc.) The three vectors contain for each group the IDs of the participants belonging to this particular group, e.g. the participants with ID 2 and 5 belong to the group "Active2".
Now we create a new column in the main data frame using the command mutate which comes with the dplyr package (make sure to install and load it).
library(dplyr)
df <- mutate(df, group = case_when(
  ID %in% ListActive1 ~ "Active1",
  ID %in% ListActive2 ~ "Active2",
  ID %in% ListControl ~ "Control"))
The command case_when checks for each participant in which of the lists the ID appears and then puts the corresponding label in the new column group.
  ID   group
1  1 Active1
2  2 Active2
3  3 Active1
4  4 Control
5  5 Active2
6  6 Control
7  7 Control
I have 2 csv datasets, each one with about 10k columns. The datasets are extracted from the same source, but the column sequence of these datasets is different (there are some new columns on the 2nd ds). So, I want to merge data of the 2nd dataset into the first one, keeping the column sequence of the first dataset. How can I do this?
Here follows an example:
Dataset 1:
Brand   Year  Model   Price
Ford    2010  Taurus  5K
Toyota  2015  Yaris   4K
Dataset 2:
Brand      Year  Model  Color      Location  Price
Chevrolet  2013  Spark  Dark Gray  PHI       2K
I would like to ignore the new columns (color, location) on the 2nd dataset and add the data with the same columns (brand, year, model, price) of the 2nd dataset into the first one.
Thanks in advance.
If you want to append the two datasets, try using bind_rows from the dplyr library. Use the first dataset as the first argument.
Here's a reproducible example you can modify if this result doesn't get you what you are looking for. Remember, a reproducible example means that you provide code that others can run when they are testing solutions for you. Your example currently doesn't allow users to copy data into R and test a solution. Try using dput on a small dataset to get some data for folks on Stack Overflow to use.
library(dplyr)
# Make up data
df <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- data.frame(a = c(5,6), b = c(2, 3), c = c(7, 8), d = c(1, 5))
# determine columns to remove from df2:
remove.these <- setdiff(colnames(df2), colnames(df))
# remove them before binding to save time
df2 <- select(df2, -all_of(remove.these))
# bind two dataframes together
finaldf <- bind_rows(df, df2)
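For comparison, here is a base R variant of the same idea, sketched under the assumption that every column of the first data frame also exists in the second. Indexing the second data frame by names(df) drops the extra columns and puts the shared ones in the first data frame's order in a single step.

```r
# Same made-up data as in the answer above
df  <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- data.frame(a = c(5, 6), b = c(2, 3), c = c(7, 8), d = c(1, 5))

# Keep only the columns of df, in df's order, then append the rows
finaldf <- rbind(df, df2[names(df)])
```

With roughly 10k columns this avoids computing the set difference explicitly, but it will stop with an error if the second dataset is missing any column of the first.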
I have two data frames with totally different column names and values.
Example :
Data Frame 1 ->
company value
A 10
B 11
A 9
Data Frame 2 ->
id value2
Q 7
W 8
E 9
This question has several parts that I want to achieve:
1. Extract the unique values of the COMPANY column from data frame 1 (unique companies).
2. Copy the unique values obtained above into a NEW COLUMN in Data Frame 2 RANDOMLY (only the company field).
3. Merge the two data frames based on the unique value column. (This is only for testing, hence why I need this step.)
All help is appreciated!!
Thank you in advance.
You could try something like this:
company <- unique(df1$company)
df2$new_column <- sample(company, nrow(df2), replace = TRUE)
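The two lines above cover the first two parts of the question. A minimal end-to-end sketch that also includes the merge might look like the following. It rebuilds the example data from the question; naming the new column company (so that both data frames share the key) and setting a seed are choices made here for illustration.

```r
# Example data from the question
df1 <- data.frame(company = c("A", "B", "A"), value = c(10, 11, 9),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("Q", "W", "E"), value2 = c(7, 8, 9),
                  stringsAsFactors = FALSE)

set.seed(1)  # only so the random assignment is repeatable

# 1. Unique companies from data frame 1
company <- unique(df1$company)

# 2. Randomly assign one of them to each row of data frame 2
df2$company <- sample(company, nrow(df2), replace = TRUE)

# 3. Merge on the shared company column
merged <- merge(df1, df2, by = "company")
```

Because company is not unique in either data frame, merge() returns every matching pair of rows, so the result can have more rows than either input.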
Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table that contains rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
I would like it to return the 5 rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic but I can't figure it out.
We can use rank
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank),],5)
If we don't need to create a column, we can use order directly (as @Jaap mentioned, in three alternative ways):
#order descending and get the first 5 rows
head(mysample[order(-mysample$kWh),],5)
#order ascending and get the last 5 rows
tail(mysample[order(mysample$kWh),],5)
#or just use sequence as index to get the rows.
mysample[order(-mysample$kWh)[1:5], ]
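As for why the attempt in the question returns so few rows: == compares element by element and recycles the length-5 vector along the 100 values, so a row is kept only when its value happens to line up with the same position in the recycled vector. A small sketch of the membership test with %in%, which is probably what was intended, on freshly generated sample data (the seed is arbitrary):

```r
set.seed(42)
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))

# The five largest kWh values
top5 <- tail(sort(mysample$kWh), 5)

# %in% asks "is this value one of the five?" for every row
by_match <- mysample[mysample$kWh %in% top5, ]

# Same rows via order(), sorted from highest to lowest
by_order <- mysample[order(-mysample$kWh)[1:5], ]
```

The %in% version keeps the original row order and would return more than five rows if the fifth-largest value were tied, whereas the order() version always returns exactly five.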