This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN? [duplicate]
(3 answers)
Closed 2 years ago.
I have two tables. The columns I am interested in from table 1 are "Year" and "CompanyName". Table 2 has 3 columns, including "Year" and "CompanyName".
How can I join these two tables together? The problem I have is that table 1 has many rows that share the same key values, for example "Year" = "2004" and "CompanyName" = "Adidas". e.g.
# There are many other columns
Year CompanyName Spent
1 2004 Adidas 50
2 2004 Nike 34
3 2004 Adidas 45
4 2005 Reebok 33
5 2006 Reebok 11
6 2006 Adidas 47
7 2007 Nike 33
8 2007 Reebok 92
9 2007 Nike 01
10 2007 Adidas 23
#I want to join this to it
Year CompanyName Loss
1 2004 Nike 23
2 2004 Adidas 22
3 2005 Reebok 633
4 2006 Reebok 2
5 2006 Adidas 09
6 2007 Reebok 22
7 2007 Nike 34
I want to join the tables so that whenever Year is 2004 and CompanyName is Adidas, a Loss column is added with the value 22.
Thank You!
You can do that with a left join on both key columns:
library(dplyr)
df3 <- df1 %>%
  left_join(df2, by = c("Year", "CompanyName"))
Just make sure df2 has no duplicate combinations of Year and CompanyName, otherwise the join will duplicate the matching rows of df1. You can remove duplicates with dplyr::distinct(df2, Year, CompanyName, .keep_all = TRUE), but that keeps only the first row per combination and might drop relevant information. If you're not certain about it, it might make sense to aggregate by those two dimensions instead:
df2 %>%
  group_by(Year, CompanyName) %>%
  summarise(Loss = sum(Loss))
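Putting both steps together, here is a minimal sketch using the sample rows from the question. The names df1, df2 and df3 follow the answer; df2_unique is a name introduced only for this illustration, and the Spent and Loss values are assumed to be numeric.

```r
library(dplyr)

# Sample data from the question (Spent and Loss stored as numbers)
df1 <- data.frame(
  Year = c(2004, 2004, 2004, 2005, 2006, 2006, 2007, 2007, 2007, 2007),
  CompanyName = c("Adidas", "Nike", "Adidas", "Reebok", "Reebok",
                  "Adidas", "Nike", "Reebok", "Nike", "Adidas"),
  Spent = c(50, 34, 45, 33, 11, 47, 33, 92, 1, 23)
)
df2 <- data.frame(
  Year = c(2004, 2004, 2005, 2006, 2006, 2007, 2007),
  CompanyName = c("Nike", "Adidas", "Reebok", "Reebok",
                  "Adidas", "Reebok", "Nike"),
  Loss = c(23, 22, 633, 2, 9, 22, 34)
)

# Collapse df2 to one row per key so the join cannot multiply rows of df1
df2_unique <- df2 %>%
  group_by(Year, CompanyName) %>%
  summarise(Loss = sum(Loss), .groups = "drop")

# Every row of df1 is kept; keys missing from df2 get Loss = NA
df3 <- left_join(df1, df2_unique, by = c("Year", "CompanyName"))

# Sanity check: the join did not add or drop rows
stopifnot(nrow(df3) == nrow(df1))
```

In this sample, 2007 Adidas has no match in df2, so its Loss is NA after the join.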
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
Suppose I have 2 different datasets for countries. Both contain the same countries, but their columns differ slightly:
dataset A:
col1 covid_cases region
russia 2 2
israel 3 1
russia 2 3
russia 2 4
russia 2 1
russia 2 6
dataset B:
col1 covid_cases income
russia 2 low
russia 2 low
israel 3 high
The region column and income column are independent.
In my original datasets I have 100 countries.
What's an efficient way to get this type of dataset:
col1 covid_cases region income
russia 2 2 low
israel 3 1 high
russia 2 3 low
russia 2 4 low
russia 2 1 low
russia 2 6 low
The order of rows in the dataset doesn't matter. I'm not interested in simply taking one column from one dataset and appending it to the other. I want to add the income column so that each value matches the country's income, as given in dataset B.
Maybe try this:
library(dplyr)
#Code: drop covid_cases from df2, keep one row per country, then join on col1
newdf <- df1 %>%
  left_join(df2 %>%
              select(-covid_cases) %>%
              filter(!duplicated(col1)),
            by = "col1")
Output:
col1 covid_cases region income
1 russia 2 2 low
2 israel 3 1 high
3 russia 2 3 low
Using your new dataframes, the code will work too:
col1 covid_cases region income
1 russia 2 2 low
2 israel 3 1 high
3 russia 2 3 low
4 russia 2 4 low
5 russia 2 1 low
6 russia 2 6 low
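A self-contained variant of the same idea, as a sketch only: it rebuilds the two sample datasets, reduces dataset B to a country-to-income lookup, and checks that no country maps to two different income values before joining. The name income_lookup is introduced here for illustration.

```r
library(dplyr)

# Dataset A and dataset B from the question
df1 <- data.frame(
  col1 = c("russia", "israel", "russia", "russia", "russia", "russia"),
  covid_cases = c(2, 3, 2, 2, 2, 2),
  region = c(2, 1, 3, 4, 1, 6)
)
df2 <- data.frame(
  col1 = c("russia", "russia", "israel"),
  covid_cases = c(2, 2, 3),
  income = c("low", "low", "high")
)

# Keep each distinct (country, income) pair once
income_lookup <- df2 %>% distinct(col1, income)

# Guard: stop if a country appears with more than one income value
stopifnot(!any(duplicated(income_lookup$col1)))

# Attach income to every row of df1 by country
newdf <- left_join(df1, income_lookup, by = "col1")
```

Unlike filter(!duplicated(col1)), which silently keeps the first row per country, the guard makes conflicting income values fail loudly.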
I am having difficulty getting my data from a .csv file into tidy data format.
My data set is made up of descriptive data (age, country, etc.) followed by 15 condition columns that I would like to collapse into a single column (long format). I have tried 'melting' the data in a few ways, but it does not turn out the way I intended. Below are a few things I have tried; I know it is kind of messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a column "Vignette" that will serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
gather(Days, Age, -Vignette)
dat$new_sp = NULL
names(dat) <- gsub("new_sp", "", names(dat))
dat_tidy<-melt(
data=dat,
id=0:180,
variable.name="Vignette",
value.name="Days",
na.rm=TRUE
)
dat_tidy<- mutate(dat_tidy,
Days= sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA"
I have tried to get rid of NAs but it doesn't seem to do anything.
I am guessing you are loading the melt function from reshape2. I recommend trying tidyr instead, which is essentially the next generation of reshape2.
Your error is presumably caused by the argument id=0:180. This asks melt to keep columns 0 to 180 as "identifier" columns and melt the rest (i.e. create a new row for each value in each remaining column).
When you request more column indices than the data.frame has columns, the non-existent columns come back as NA - you asked for them, so you get them! melt then cannot find an id variable named NA.
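To see where the NA comes from, here is a minimal sketch with a hypothetical three-column stand-in for dat. It only illustrates how R indexes column names; that melt resolves numeric ids this way is an assumption based on the error message.

```r
# Hypothetical three-column stand-in for `dat`
dat <- data.frame(ID = 1:2, V01 = c(7, 7), V02 = c(2, 10))

# Index 0 is silently dropped; indices past the last column yield NA
id_names <- names(dat)[0:5]
print(id_names)

# Requested "columns" that do not exist in the data
missing_ids <- setdiff(id_names, names(dat))
print(missing_ids)
```

With only three real columns, positions 4 and 5 come back as NA, which matches the "id variables not found in data: NA" message.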
I would recommend loading tidyr, as it is newer. Recent versions include pivot_longer(), a newer verb that is more intuitive, but I'll give you a solution with the older gather() semantics:
library(tidyr)
dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
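# or a bit more verbose: name each vignette column explicitly
# (only the first four are listed here; add the remaining ones up to V15)
dat_tidy <- dat %>% gather('Vignette', 'Days', V01, V02, V03, V04)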
And check out the comment by @heck1 for tips on asking even better questions.
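For reference, the same reshaping can be sketched with pivot_longer(), one of the newer tidyr verbs (tidyr 1.0.0 or later). The small data frame below is a hypothetical stand-in for dat with only three vignette columns; the real data has V01 to V15.

```r
library(tidyr)

# Hypothetical stand-in for `dat`: two descriptive columns, three vignettes
dat <- data.frame(
  ID = c(1, 2, 3),
  Country = c("France", "Spain", "Spain"),
  V01 = c(7, 7, 7),
  V02 = c(2, 10, 5),
  V14 = c(NA, 14, 0)
)

dat_tidy <- pivot_longer(
  dat,
  cols = matches("^V[0-9]+$"),  # only columns named V followed by digits
  names_to = "Vignette",        # former column names go here
  values_to = "Days",           # cell values go here
  values_drop_na = TRUE         # drop rows whose value is NA
)

print(dat_tidy)
```

The regular expression is stricter than starts_with("V"), so a descriptive column whose name merely begins with V would not be melted by accident.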
This question already has answers here:
Aggregate by specific year in R
(2 answers)
Closed 5 years ago.
I have this as part of a dataset of about 6000 rows:
ÅR LM RE AGE PA REC
1 2012 PKORT Stockholm <19 17973 35508
2 2012 PKORT Stockholm 20-24 31042 63229
3 2012 PKORT Stockholm 25-29 27305 64558
4 2012 PKORT Stockholm 30-34 18256 42726
5 2012 PKORT Stockholm 35-39 13200 32145
6 2012 PKORT Stockholm 40< 9458 24422
7 2012 PKORT Stockholm 40< 6123 16152
I want to sum PA and REC over all rows where AGE is "40<", so that the data frame no longer contains repeated rows for the same factor level.
I have tried aggregate and tapply, and I also assumed that R would understand that both "40<" rows should be summed when lm functions are applied.
This seems like a really easy operation, any help is appreciated.
We can do this with dplyr
library(dplyr)
# group_by_() is deprecated in current dplyr; across() does the same job
df1 %>%
  filter(AGE == "40<") %>%
  group_by(across(all_of(names(df1)[1:3]))) %>%
  summarise(across(c(PA, REC), sum))
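Note that filtering on AGE == "40<" returns only the summed "40<" rows. If the goal is to keep the other age groups as well, one option is to group by every non-numeric column and sum within each group. This is a sketch under that assumption; the column AR stands in for the year column of the question, and df_sum is a name introduced here.

```r
library(dplyr)

# Sample rows from the question (AR stands in for the year column)
df1 <- data.frame(
  AR  = rep(2012, 7),
  LM  = "PKORT",
  RE  = "Stockholm",
  AGE = c("<19", "20-24", "25-29", "30-34", "35-39", "40<", "40<"),
  PA  = c(17973, 31042, 27305, 18256, 13200, 9458, 6123),
  REC = c(35508, 63229, 64558, 42726, 32145, 24422, 16152)
)

# Group by everything except the two value columns, then sum them.
# Groups with a single row are unchanged; the two "40<" rows collapse.
df_sum <- df1 %>%
  group_by(across(-c(PA, REC))) %>%
  summarise(across(c(PA, REC), sum), .groups = "drop")

print(df_sum)
```

For this sample the result has six rows, with the "40<" row holding PA = 15581 and REC = 40574.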
This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Add column with order counts
(2 answers)
Closed 6 years ago.
I have the following data set:
id year
2 20332 2005
3 6383 2005
14 20332 2006
15 6806 2006
16 23100 2006
I would like to have an additional column that counts how many years each id has appeared in the data so far:
id year Counter
2 20332 2005 1
3 6383 2005 1
14 20332 2006 2
15 6806 2006 1
16 23100 2006 1
The dataset is currently not sorted according to the year. I thought about mutate rather than a function.
Any ideas? Thanks!
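We can use ave from base R. It numbers the rows of each id in their current order, so sort the data by year first if it is not already sorted: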
df1$Counter <- with(df1, ave(id, id, FUN = seq_along))
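Since the question mentions mutate and an unsorted dataset, here is a dplyr sketch that ranks the years within each id, so the result does not depend on row order. The rows below are the sample data, deliberately shuffled so that the 2006 row of id 20332 comes first.

```r
library(dplyr)

# Sample data with rows deliberately out of year order
df1 <- data.frame(
  id   = c(20332, 6383, 20332, 6806, 23100),
  year = c(2006, 2005, 2005, 2006, 2006)
)

# row_number(year) ranks the years within each id (ties broken by position),
# so the earliest year of an id gets 1, the next gets 2, and so on
df1 <- df1 %>%
  group_by(id) %>%
  mutate(Counter = row_number(year)) %>%
  ungroup()

print(df1)
```

Here the first row (id 20332, year 2006) gets Counter 2 even though it appears before the 2005 row.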
This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
Dataframe has the columns:
State Sex Year Name Number Percent
For every state and every year, I need to keep the one male and the one female row with the highest percentage.
Example:
Washington M 2011 John 34 0.46
Washington F 2011 Mary 42 0.67
Washington M 2012 John 46 0.46
Washington F 2012 Mary 64 0.67
and so on for every State and year.
You can try
library(dplyr)
df %>%
  group_by(State, Year, Sex) %>%
  slice(which.max(Percent))
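A runnable sketch of the same selection using slice_max(), available in dplyr 1.0.0 and later. The first four rows are from the question; the Liam and Emma rows are invented for illustration so that each group has something to filter out.

```r
library(dplyr)

df <- data.frame(
  State   = "Washington",
  Sex     = c("M", "F", "M", "F", "M", "F"),
  Year    = c(2011, 2011, 2012, 2012, 2011, 2011),
  Name    = c("John", "Mary", "John", "Mary", "Liam", "Emma"),  # last two are hypothetical
  Number  = c(34, 42, 46, 64, 10, 12),
  Percent = c(0.46, 0.67, 0.46, 0.67, 0.10, 0.12)
)

# One row per State, Year and Sex: the row with the largest Percent.
# with_ties = FALSE keeps a single row even if two names share the maximum.
top <- df %>%
  group_by(State, Year, Sex) %>%
  slice_max(Percent, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top)
```

Whether ties should be dropped or kept depends on the use case; setting with_ties = TRUE returns all rows that share the maximum.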