Add rows and complete dyad by group - r

I have a dataset in a dyadic format and sorted by group and I am trying to add an observation to each group. I need this observation to also be integrated with the other pairs. Below is a reproducible example to show what I mean. Data is a simplified version of my dataset (it contains more groups essentially).
data <- data.frame(country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
year = c(2001,2001,2002,2002,2002,2002),
id = c(1,1,1,1,2,2))
> data
country1 country2 year id
1 BEL FRA 2001 1
2 FRA BEL 2001 1
3 BEL FRA 2002 1
4 FRA BEL 2002 1
5 AUS ITA 2002 2
6 ITA AUS 2002 2
I would like to add a different country to each group. For instance, say I would like to add Luxembourg to group 1 and Portugal to group 2.
This is what the output I need should look like:
> data
country1 country2 year id
1 BEL FRA 2001 1
2 FRA BEL 2001 1
3 LUX BEL 2001 1
4 LUX FRA 2001 1
5 BEL LUX 2001 1
6 FRA LUX 2001 1
7 BEL FRA 2002 1
8 FRA BEL 2002 1
9 LUX BEL 2002 1
10 LUX FRA 2002 1
11 BEL LUX 2002 1
12 FRA LUX 2002 1
13 AUS ITA 2002 2
14 ITA AUS 2002 2
15 POR AUS 2002 2
16 POR ITA 2002 2
17 AUS POR 2002 2
18 ITA POR 2002 2
I found a workaround, but I don't know how to simplify it or automate it.
library(dplyr)
id1 <- data %>%
  filter(id == 1) %>%
  mutate(country3 = "LUX")
id1_1 <- id1 %>%
  select(!country2) %>%
  rename("country2" = "country3") %>%
  distinct()
id1_2 <- id1 %>%
  select(!country1) %>%
  rename("country1" = "country3") %>%
  distinct()
id1_2 <- id1_2[, c(2, 1, 3, 4)]
id1 <- rbind(id1_1, id1_2)
data <- rbind(data, id1)
This completes the dyads but it is quite tedious to do since I am trying to add about 100 countries to a hundred groups.
I can create either a vector or a data frame containing all the countries I need to add (and arrange them by group if necessary), but I just don't know how to use them to fill the main data. Thanks for any tips!

Would something like this work for you?
library(tidyverse)
data <- data.frame(country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
year = c(2001,2001,2002,2002,2002,2002),
id = c(1,1,1,1,2,2))
additions <- tribble(
~id, ~country1,
1, "LUX",
2, "POR"
)
unique_combos <- data |>
distinct(id, year, country1) |>
rows_append(additions) |>
expand(year, nesting(id, country1)) |>
filter(!is.na(year))
unique_combos |>
rename(country2 = country1) |>
full_join(unique_combos) |>
filter(country1 != country2) |>
arrange(id, year, country1, country2)
#> Joining, by = c("year", "id")
#> # A tibble: 24 × 4
#> year id country2 country1
#> <dbl> <dbl> <chr> <chr>
#> 1 2001 1 FRA BEL
#> 2 2001 1 LUX BEL
#> 3 2001 1 BEL FRA
#> 4 2001 1 LUX FRA
#> 5 2001 1 BEL LUX
#> 6 2001 1 FRA LUX
#> 7 2002 1 FRA BEL
#> 8 2002 1 LUX BEL
#> 9 2002 1 BEL FRA
#> 10 2002 1 LUX FRA
#> # … with 14 more rows
Created on 2022-06-29 by the reprex package (v2.0.1)
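The pipeline above returns 24 rows because expand() crosses every year with every group, so group 2 also gets pairs for 2001. If only the years actually observed within each group are wanted (the 18 rows shown in the question), a base R sketch along these lines may help. The names members, years, left, right and pairs are illustrative, and it assumes every existing country appears at least once in country1, which holds when both directions of each dyad are present.

```r
# Toy data from the question; stringsAsFactors = FALSE keeps countries as character
data <- data.frame(
  country1 = c("BEL", "FRA", "BEL", "FRA", "AUS", "ITA"),
  country2 = c("FRA", "BEL", "FRA", "BEL", "ITA", "AUS"),
  year = c(2001, 2001, 2002, 2002, 2002, 2002),
  id = c(1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)
additions <- data.frame(id = c(1, 2), country1 = c("LUX", "POR"),
                        stringsAsFactors = FALSE)

# Members of each group: countries already present plus the addition
members <- unique(rbind(data[c("id", "country1")], additions))
# Years actually observed in each group
years <- unique(data[c("id", "year")])

# Cross members with members inside each id-year, then drop self-pairs
left <- merge(years, members, by = "id")
right <- setNames(members, c("id", "country2"))
pairs <- merge(left, right, by = "id")
pairs <- pairs[pairs$country1 != pairs$country2,
               c("country1", "country2", "year", "id")]
pairs <- pairs[order(pairs$id, pairs$year, pairs$country1, pairs$country2), ]
rownames(pairs) <- NULL
nrow(pairs)
```

Because the years are joined by id, group 2 is only completed for 2002, and the additions table can hold as many group/country rows as needed.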

Related

r data.table adjust min and max years only if each set has at least one incrementing obs

I have a data set that holds an id, location, start year, end year, age1 and age2. For each group, defined by id, location, age1 and age2, I would like to create a new start and end year. For instance, I may have three entries for China covering age 0 to age 4. One is 2000-2000, another is 2001-2001, and the last is 2005-2005. Since the years increment by 1 across the first two entries, I'd want their corresponding newstart and newend to be 2000-2001. The third entry would have newstart == 2005 and newend == 2005, as it is not part of a continuous set of years.
The data table I have resembles the following, except that it has thousands of entries and many combinations:
id location start end age1 age2
1 brazil 2000 2000 0 4
1 brazil 2001 2001 0 4
1 brazil 2002 2002 0 4
2 argentina 1990 1991 1 1
2 argentina 1991 1991 2 2
2 argentina 1992 1992 2 2
2 argentina 1993 1993 2 2
3 belize 2001 2001 0.5 1
3 belize 2005 2005 1 2
I want to alter the data table so that it will look like the following
id location start end age1 age2 newstart newend
1 brazil 2000 2000 0 4 2000 2002
1 brazil 2001 2001 0 4 2000 2002
1 brazil 2002 2002 0 4 2000 2002
2 argentina 1990 1991 1 1 1991 1991
2 argentina 1991 1991 2 2 1991 1993
2 argentina 1992 1992 2 2 1991 1993
2 argentina 1993 1993 2 2 1991 1993
3 belize 2001 2001 0.5 1 2001 2001
3 belize 2005 2005 1 2 2005 2005
I have tried creating a variable that tracks the difference between the previous year and the current year using lag. I then created newstart and newend from the min start and max end. I have found that this only works for a run of two consecutive years. For a longer run it doesn't work, as it has no way of tracking how many observations in each grouping have years that increase by 1. I believe I need some type of loop.
Is there a more efficient way to accomplish this?
data.table
You tagged with data.table, so my first suggestion is this:
library(data.table)
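# start a new run whenever the gap to the previous start year is not 1
# (assumes rows are sorted by start within id)
dat[, contiguous := cumsum(c(TRUE, diff(start) != 1)), by = .(id)]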
dat[, c("newstart", "newend") := .(min(start), max(end)), by = .(id, contiguous)]
dat[, contiguous := NULL]
dat
# id location start end age1 age2 newstart newend
# 1: 1 brazil 2000 2000 0.0 4 2000 2002
# 2: 1 brazil 2001 2001 0.0 4 2000 2002
# 3: 1 brazil 2002 2002 0.0 4 2000 2002
# 4: 2 argentina 1990 1991 1.0 1 1990 1993
# 5: 2 argentina 1991 1991 2.0 2 1990 1993
# 6: 2 argentina 1992 1992 2.0 2 1990 1993
# 7: 2 argentina 1993 1993 2.0 2 1990 1993
# 8: 3 belize 2001 2001 0.5 1 2001 2001
# 9: 3 belize 2005 2005 1.0 2 2005 2005
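The question defines a group by id, location, age1 and age2, whereas the output above groups by id alone, which is why every argentina row gets 1990-1993. A sketch of the same idea with the full grouping key follows. The column name run and the vector keys are illustrative, and the label is built with cumsum() so that it increases only at a gap between consecutive start years.

```r
library(data.table)

# Example data from the question
dat <- data.table(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  location = rep(c("brazil", "argentina", "belize"), times = c(3, 4, 2)),
  start = c(2000, 2001, 2002, 1990, 1991, 1992, 1993, 2001, 2005),
  end = c(2000, 2001, 2002, 1991, 1991, 1992, 1993, 2001, 2005),
  age1 = c(0, 0, 0, 1, 2, 2, 2, 0.5, 1),
  age2 = c(4, 4, 4, 1, 2, 2, 2, 1, 2)
)

keys <- c("id", "location", "age1", "age2")
# diff() only makes sense on sorted years
setorderv(dat, c(keys, "start"))

# run increases by one each time the gap between consecutive start years is not 1
dat[, run := cumsum(c(TRUE, diff(start) != 1)), by = keys]
dat[, c("newstart", "newend") := .(min(start), max(end)), by = c(keys, "run")]
dat[, run := NULL]
dat
```

With this key the single argentina row for ages 1-1 forms its own group, and the three rows for ages 2-2 get 1991-1993.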
base R
If instead you really just mean data.frame, then
dat <- transform(dat, contiguous = ave(start, id, FUN = function(a) cumsum(c(TRUE, diff(a) != 1))))
dat <- transform(dat,
newstart = ave(start, id, contiguous, FUN = min),
newend = ave(end , id, contiguous, FUN = max)
)
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
dat
# id location start end age1 age2 newstart newend contiguous
# 1 1 brazil 2000 2000 0.0 4 2000 2002 1
# 2 1 brazil 2001 2001 0.0 4 2000 2002 1
# 3 1 brazil 2002 2002 0.0 4 2000 2002 1
# 4 2 argentina 1990 1991 1.0 1 1990 1993 1
# 5 2 argentina 1991 1991 2.0 2 1990 1993 1
# 6 2 argentina 1992 1992 2.0 2 1990 1993 1
# 7 2 argentina 1993 1993 2.0 2 1990 1993 1
# 8 3 belize 2001 2001 0.5 1 2001 2001 1
# 9 3 belize 2005 2005 1.0 2 2005 2005 2
dat$contiguous <- NULL
Interesting point I just learned about ave: it uses interaction(...) (all grouping variables), which is going to give all possible combinations, not just the combinations observed in the data. Because of that, the FUNction may be called with zero data. In this case, it did, giving the warnings. One could suppress this with function(a) suppressWarnings(min(a)) instead of just min.
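Another way to avoid the empty groups, sketched here on a cut-down version of the data, is to build a single grouping factor with interaction(..., drop = TRUE) and pass only that to ave. The names run and key are illustrative.

```r
# Cut-down example: id 1 has one run, id 3 has two runs
dat <- data.frame(
  id = c(1, 1, 1, 3, 3),
  start = c(2000, 2001, 2002, 2001, 2005),
  end = c(2000, 2001, 2002, 2001, 2005)
)

# Label runs of consecutive years within each id
dat$run <- ave(dat$start, dat$id,
               FUN = function(a) cumsum(c(TRUE, diff(a) != 1)))

# One factor holding only the id/run combinations that occur in the data
key <- interaction(dat$id, dat$run, drop = TRUE)

dat$newstart <- ave(dat$start, key, FUN = min)
dat$newend <- ave(dat$end, key, FUN = max)
dat$run <- NULL
dat
```

Since key has no unused levels, FUN is never called on an empty vector, so there is nothing to suppress.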
We could use dplyr. After grouping by 'id', take the difference between consecutive 'start' values, start a new run ('grp') whenever that difference is not 1, and create 'newstart' and 'newend' as the min of 'start' and the max of 'end' within each run.
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(grp = cumsum(c(TRUE, diff(start) != 1))) %>%
  group_by(grp, .add = TRUE) %>%
  mutate(newstart = min(start), newend = max(end))
Output:
# A tibble: 9 x 9
# Groups: id, grp [4]
# id location start end age1 age2 grp newstart newend
# <int> <chr> <int> <int> <dbl> <int> <int> <int> <int>
#1 1 brazil 2000 2000 0 4 1 2000 2002
#2 1 brazil 2001 2001 0 4 1 2000 2002
#3 1 brazil 2002 2002 0 4 1 2000 2002
#4 2 argentina 1990 1991 1 1 1 1990 1993
#5 2 argentina 1991 1991 2 2 1 1990 1993
#6 2 argentina 1992 1992 2 2 1 1990 1993
#7 2 argentina 1993 1993 2 2 1 1990 1993
#8 3 belize 2001 2001 0.5 1 1 2001 2001
#9 3 belize 2005 2005 1 2 2 2005 2005
Or with data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(start) != 1)), by = id
][, c('newstart', 'newend') := .(min(start), max(end)), by = .(id, grp)][, grp := NULL]

R Count values in DF variable by groups

Using following dataset:
set.seed(2)
origin <- rep(c("DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR","DEU", "GBR", "ITA", "NLD", "CAN", "MEX", "USA", "CHN", "JPN", "KOR"), 4)
year <- rep(c(rep(1998, 10), rep(2000, 10)), 2)
type <- sample(1:10, size=length(origin), replace=TRUE)
value <- sample(100:10000, size=length(origin), replace=TRUE)
test.df <- as.data.frame(cbind(origin, year, type, value))
rm(origin, year, type, value)
### add some (6) missing values
test.df$value[sample(1:length(test.df$value), 6, replace = FALSE)] <- NA
I want to count how many types there are by country (origin) per year.
I tried:
count(test.df, origin, year)
and
test.df %>% group_by(origin, year) %>% count()
but I am not sure how to interpret these results.
Of course, if value is NA, R should not count that row.
To remove the rows where value is NA, use filter:
test.df %>% group_by(origin,year) %>%
filter(!is.na(value)) %>% count()
# A tibble: 20 x 3
# Groups: origin, year [20]
origin year n
<fct> <fct> <int>
1 CAN 1998 4
2 CAN 2000 3
3 CHN 1998 3
4 CHN 2000 4
5 DEU 1998 4
6 DEU 2000 4
7 GBR 1998 4
8 GBR 2000 4
9 ITA 1998 3
10 ITA 2000 4
11 JPN 1998 3
12 JPN 2000 3
13 KOR 1998 4
14 KOR 2000 4
15 MEX 1998 4
16 MEX 2000 4
17 NLD 1998 3
18 NLD 2000 4
19 USA 1998 4
20 USA 2000 4
Note, however, that this doesn't count how many types there are in each group, but how many rows there are. If you want to count the number of unique types, you can do this:
test.df %>% group_by(origin,year) %>%
filter(!is.na(value)) %>%
summarize(n_distinct(type)) # thanks, @Frank!
# A tibble: 20 x 3
# Groups: origin [?]
origin year `n_distinct(type)`
<fct> <fct> <int>
1 CAN 1998 3
2 CAN 2000 3
3 CHN 1998 2
4 CHN 2000 3
5 DEU 1998 4
6 DEU 2000 3
7 GBR 1998 4
8 GBR 2000 4
9 ITA 1998 3
10 ITA 2000 4
11 JPN 1998 3
12 JPN 2000 2
13 KOR 1998 4
14 KOR 2000 4
15 MEX 1998 3
16 MEX 2000 3
17 NLD 1998 2
18 NLD 2000 3
19 USA 1998 3
20 USA 2000 4
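To see the two counts side by side, here is a small sketch on made-up data (toy, n_rows and n_types are illustrative names, not from the question). It computes the number of rows and the number of distinct types per group in one summarise() call.

```r
library(dplyr)

# Made-up data: DEU has three rows but only two distinct types;
# GBR has one row with a missing value
toy <- data.frame(
  origin = c("DEU", "DEU", "DEU", "GBR", "GBR"),
  year = c(1998, 1998, 1998, 1998, 1998),
  type = c(1, 1, 2, 3, 3),
  value = c(10, 20, 30, 40, NA)
)

res <- toy %>%
  filter(!is.na(value)) %>%          # drop rows with a missing value first
  group_by(origin, year) %>%
  summarise(
    n_rows = n(),                    # what count() reports
    n_types = n_distinct(type),      # number of different types
    .groups = "drop"
  )
res
```

For DEU the two columns differ (three rows, two types), which is the distinction described above.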

Filling missing levels

I have the following type of dataframe:
Country <- rep(c("USA", "AUS", "GRC"),2)
Year <- 2001:2006
Level <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)
df
Country Year Level
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 <NA>
5 AUS 2005 <NA>
6 GRC 2006 <NA>
I want to fill the missing values in the rightmost column (Level) with the correct level label for each country.
So the expected outcome should be like this:
Country Year Level
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 rich
5 AUS 2005 middle
6 GRC 2006 poor
In base R, you could use ave():
transform(df, Level = ave(Level, Country, FUN = na.omit))
# Country Year Level
# 1 USA 2001 rich
# 2 AUS 2002 middle
# 3 GRC 2003 poor
# 4 USA 2004 rich
# 5 AUS 2005 middle
# 6 GRC 2006 poor
Another possibility is to use a join. Here we merge the Country and Year columns with the Country/Level pairs from the NA-omitted data. The outcome is the same, just in a different row order.
lookup <- na.omit(df)[c("Country", "Level")]
merge(df[c("Country", "Year")], lookup)
# Country Year Level
# 1 AUS 2002 middle
# 2 AUS 2005 middle
# 3 GRC 2003 poor
# 4 GRC 2006 poor
# 5 USA 2001 rich
# 6 USA 2004 rich
We can group by 'Country' and get the non-NA unique value
library(dplyr)
df %>%
group_by(Country) %>%
dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups: Country [3]
# Country Year Level
# <fctr> <int> <fctr>
#1 USA 2001 rich
#2 AUS 2002 middle
#3 GRC 2003 poor
#4 USA 2004 rich
#5 AUS 2005 middle
#6 GRC 2006 poor
If dplyr is loaded along with plyr, it is better to write dplyr::mutate or dplyr::summarise explicitly so that the dplyr version is used. plyr has functions with the same names, and depending on the load order they can mask the dplyr ones and change the behavior.
You can do it using data.table and zoo:
library(data.table)
library(zoo)
setDT(df)
# na.rm = FALSE keeps leading NAs, so the result always has one value per row
df[, Level := na.locf(Level, na.rm = FALSE), by = Country]
This will give you:
Country Year Level
1: USA 2001 rich
2: AUS 2002 middle
3: GRC 2003 poor
4: USA 2004 rich
5: AUS 2005 middle
6: GRC 2006 poor
library(dplyr)
df %>%
group_by(Country) %>%
mutate(Level = replace(Level, is.na(Level), unique(na.omit(Level))))
Country Year Level
<fctr> <int> <fctr>
1 USA 2001 rich
2 AUS 2002 middle
3 GRC 2003 poor
4 USA 2004 rich
5 AUS 2005 middle
6 GRC 2006 poor
Or, more succinctly, applying @suchait's idea to use na.locf:
df %>%
group_by(Country) %>%
mutate(Level = zoo::na.locf(Level))
A solution using dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
arrange(Country) %>%
fill(Level) %>%
arrange(Year)
# Country Year Level
# 1 USA 2001 rich
# 2 AUS 2002 middle
# 3 GRC 2003 poor
# 4 USA 2004 rich
# 5 AUS 2005 middle
# 6 GRC 2006 poor
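The arrange-then-fill approach relies on each country's non-missing Level coming first after sorting. A variant that does not depend on row order, sketched here under the assumption that each country has a single true level, groups by Country and fills in both directions with .direction = "downup". The data below is a reordered version of the example in which USA starts with a missing value.

```r
library(dplyr)
library(tidyr)

# Variant of the example where the known level for USA comes second
df <- data.frame(
  Country = c("USA", "AUS", "GRC", "USA", "AUS", "GRC"),
  Year = 2001:2006,
  Level = c(NA, "middle", "poor", "rich", NA, NA)
)

filled <- df %>%
  group_by(Country) %>%
  fill(Level, .direction = "downup") %>%  # fill down, then up, within each country
  ungroup()
filled
```

Grouping keeps values from leaking across countries, and the original row order is preserved without a second arrange().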
Here is another data.table solution which updates on join using a lookup table which is created from the given dataset itself:
library(data.table)
# i.Level refers to the Level column of the lookup table passed in i
setDT(df)[df[!is.na(Level)], on = .(Country), Level := i.Level][]
Country Year Level
1: USA 2001 rich
2: AUS 2002 middle
3: GRC 2003 poor
4: USA 2004 rich
5: AUS 2005 middle
6: GRC 2006 poor

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 disease (death), in the original data frame are many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
That is, I would like to add rows with value = NA for the missing years, for each country and each disease.
For example, HIV data for Italy are missing from 2002 to 2004, so I would add rows for those years with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.
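A scaled-down sketch of the same two steps on made-up data may make the mechanics easier to follow (small, grid and full are illustrative names):

```r
# Made-up data: France has no row for 2000
small <- data.frame(
  country = c("Italy", "Italy", "France"),
  year = c(2000, 2001, 2001),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Step 1: every combination of the unique countries and years
grid <- expand.grid(lapply(small[c("country", "year")], unique),
                    stringsAsFactors = FALSE)

# Step 2: left join the observed data onto the grid;
# combinations that were never observed get value = NA
full <- merge(grid, small, all.x = TRUE)
full
```

The grid has 2 x 2 = 4 rows, and the only combination missing from the data (France, 2000) comes back with value NA.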
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two columns carry the same information (here indx and country, where one determines the other), it will over-expand or leave NAs where you don't want them. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
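A different technique from the unite/separate workaround is tidyr's nesting() helper, which tells complete() to use only the indx/country pairs that occur in the data. A minimal sketch on made-up data (small and out are illustrative names):

```r
library(tidyr)

# Made-up data: indx is fully determined by country
small <- data.frame(
  indx = c(1, 1, 4),
  country = c("Italy", "Italy", "France"),
  year = c(2000, 2001, 2000),
  death = c("hiv", "cancer", "hiv"),
  value = 1:3
)

# nesting() keeps indx and country together, so no Italy/4 or France/1 rows appear
out <- complete(small, nesting(indx, country), death, year)
out
```

Here 2 observed indx/country pairs x 2 diseases x 2 years give 8 rows, and the column types of indx and country are left unchanged because nothing is pasted together and split again.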

Add the value of two lines and create a new line

I'm new to R and am having trouble modifying my data frame:
id <- c(1, 2,3,4,5,6,7,8,9,10)
number <- c(1,1,1,1,1,1,8,8,2,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0)
What I want is a similar data frame, but with 5 new rows that hold the sum of the value column for M and F, like this:
id <- c(1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15)
number <- c(1,1,1,1,1,1,8,8,2,2,1,1,1,8,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium","France", "France", "France", "Spain", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996,2010,2011,2010,2009,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F","T","T","T","T","T")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis","hiv","hiv","cancer","cancer","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0,16,2,170,647,0)
More clearly:
> whatIhave
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
> whatIwant
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
11 11 1 France 2010 T hiv 16
12 12 1 France 2011 T hiv 2
13 13 1 France 2010 T cancer 170
14 14 8 Spain 2009 T cancer 647
15 15 2 Belgium 1996 T tubercolosis 0
A new value T has been created in the sex column, indicating the sum F + M.
The 5 new rows are the last 5.
There are 5 new rows because I have to add the F and M values for each country, by year, by disease. number is related to the country, and id simply identifies each row.
My data frame is obviously much bigger than this.
How can I do this?
Thanks
Here is a quite fast solution using the data.table approach:
library(data.table)
# calculate the sums and store it in a separate data table dtpart2
dtpart2 <- setDT(df)[ , .(value= sum(value)), by = .(number, country, year, disease)]
# create columns of sex and id
dtpart2[, id := max(df$id)+1: nrow(dtpart2) ][, sex := "T"]
# set the same column order as in the original data frame
setcolorder(dtpart2, names(df))
# Append the two data sets
newdata <- rbind(df,dtpart2)
#> id number country year sex disease value
#> 1: 1 1 France 2010 M hiv 15
#> 2: 2 1 France 2010 F hiv 1
#> 3: 3 1 France 2011 M hiv 0
#> 4: 4 1 France 2011 F hiv 2
#> 5: 5 1 France 2010 M cancer 50
#> 6: 6 1 France 2010 F cancer 120
#> 7: 7 8 Spain 2009 M cancer 600
#> 8: 8 8 Spain 2009 F cancer 47
#> 9: 9 2 Belgium 1996 M tubercolosis 0
#> 10: 10 2 Belgium 1996 F tubercolosis 0
#> 11: 11 1 France 2010 T hiv 16
#> 12: 12 1 France 2011 T hiv 2
#> 13: 13 1 France 2010 T cancer 170
#> 14: 14 8 Spain 2009 T cancer 647
#> 15: 15 2 Belgium 1996 T tubercolosis 0
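For readers who prefer dplyr, the same append-the-totals idea can be sketched as follows. This is not part of the answer above; the data is a four-row subset of the question's example, and totals and newdata are illustrative names.

```r
library(dplyr)

# Four-row subset of the example data
df <- data.frame(
  id = 1:4,
  number = c(1, 1, 8, 8),
  country = c("France", "France", "Spain", "Spain"),
  year = c(2010, 2010, 2009, 2009),
  sex = c("M", "F", "M", "F"),
  disease = c("hiv", "hiv", "cancer", "cancer"),
  value = c(15, 1, 600, 47)
)

# One total row per number/country/year/disease combination
totals <- df %>%
  group_by(number, country, year, disease) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  mutate(sex = "T", id = max(df$id) + row_number())

# bind_rows() matches columns by name, so no reordering is needed
newdata <- bind_rows(df, totals)
newdata
```

Grouping by number as well as country keeps the country code intact in the total rows instead of summing it.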
DATA:
df <- data.frame(id, number, country, year, sex, disease, value)
A base R alternative using aggregate():
df <- data.frame(
  number = c(1,1,1,1,1,1,8,8,2,2),
  country = c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium"),
  year = c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996),
  sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  disease = c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis"),
  value = c(15,1,0,2,50,120,600,47,0,0))
# sum value within each number/country/disease/year combination
# (number is a grouping variable here, so it is not summed)
df2 <- aggregate(df["value"],
  by = list(number = df$number, country = df$country, disease = df$disease, year = df$year),
  FUN = sum)
df2$sex <- "T"
# reorder the columns to match df before appending
newdf <- rbind(df, df2[names(df)])
newdf
number country year sex disease value
1 1 France 2010 M hiv 15
2 1 France 2010 F hiv 1
3 1 France 2011 M hiv 0
4 1 France 2011 F hiv 2
5 1 France 2010 M cancer 50
6 1 France 2010 F cancer 120
7 8 Spain 2009 M cancer 600
8 8 Spain 2009 F cancer 47
9 2 Belgium 1996 M tubercolosis 0
10 2 Belgium 1996 F tubercolosis 0
11 2 Belgium 1996 T tubercolosis 0
12 8 Spain 2009 T cancer 647
13 1 France 2010 T cancer 170
14 1 France 2010 T hiv 16
15 1 France 2011 T hiv 2
