I have a dataframe in the following format:
df <- data.frame(year = c(2000, 2000, 2000, 2000, 2000, 2004, 2004, 2004, 2004, 2004,
2010, 2010, 2010, 2010, 2010),
city = c("City A", "City B", "City C", "City D", "City E",
"City A", "City B", "City C", "City D", "City E",
"City A", "City B", "City C", "City D", "City E"),
constant_y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
df
year city constant_y
1 2000 City A 1
2 2000 City B 2
3 2000 City C 3
4 2000 City D 4
5 2000 City E 5
6 2004 City A 6
7 2004 City B 7
8 2004 City C 8
9 2004 City D 9
10 2004 City E 10
11 2010 City A 11
12 2010 City B 12
13 2010 City C 13
14 2010 City D 14
15 2010 City E 15
I'd like to fill in the missing years for each city, using the data from the most recent prior year for that city. In effect, I want to duplicate rows while changing the year value, grouped by city. Below is the output I am trying to get for each city (City A as an example):
year city constant_y
1 2000 City A 1
2 2001 City A 1
3 2002 City A 1
4 2003 City A 1
5 2004 City A 6
6 2005 City A 6
7 2006 City A 6
8 2007 City A 6
9 2008 City A 6
10 2009 City A 6
11 2010 City A 11
12 2011 City A 11
13 2012 City A 11
14 2013 City A 11
15 2014 City A 11
16 2015 City A 11
17 2016 City A 11
18 2017 City A 11
19 2018 City A 11
20 2019 City A 11
And then the same for City B, C, D etc. (using their "constant_y" values in prior years). E.g. City B would have 2 until 2003, then 7 from 2004 to 2009, and then 12 from 2010 to 2019.
So yes, I just want to add rows that carry each city's "constant_y" forward to the following years. My data stops at some year (2010), but I want to use the values from 2010 to extend it some years further, e.g. to 2019 in the example above. I hope I am not overcomplicating this, but I am not sure how to solve it.
Here's one method that starts by finding all possible city/year combinations, joining this on the original data, and then filling (via last-observation-carry-forward techniques) constant_y per-city.
dplyr
library(dplyr)
library(tidyr) # expand, fill
df %>%
expand(city, year = do.call(seq, as.list(range(year)))) %>%
full_join(df, by = c("city", "year")) %>%
arrange(city, year) %>%
fill(constant_y)
# # A tibble: 55 x 3
# city year constant_y
# <chr> <dbl> <dbl>
# 1 City A 2000 1
# 2 City A 2001 1
# 3 City A 2002 1
# 4 City A 2003 1
# 5 City A 2004 6
# 6 City A 2005 6
# 7 City A 2006 6
# 8 City A 2007 6
# 9 City A 2008 6
# 10 City A 2009 6
# # ... with 45 more rows
Granted, this only goes out to 2010, since that is the last year in your original data. If you need it to go beyond the original data, change the expand() call to
df %>%
expand(city, year = do.call(seq, as.list(range(c(year, 2019))))) %>%
... # ^^^^^^^^^^^^^ different
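For reference, the extended pipeline can be written out in full. This is a sketch rather than the answer's exact code: it swaps expand() plus full_join() for tidyr::complete() with full_seq(), treats 2019 as the assumed end year (stored in a hypothetical end_year variable), and adds group_by(city) so that fill() cannot carry a value from one city into the next.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built with rep() for brevity.
df <- data.frame(
  year = rep(c(2000, 2004, 2010), each = 5),
  city = rep(paste("City", LETTERS[1:5]), times = 3),
  constant_y = as.numeric(1:15)
)

end_year <- 2019  # assumed target year, taken from the question's example

filled <- df %>%
  # add one row per city for every year from the first year to end_year
  complete(city, year = full_seq(c(year, end_year), 1)) %>%
  arrange(city, year) %>%
  # fill within each city only (last observation carried forward)
  group_by(city) %>%
  fill(constant_y) %>%
  ungroup()
```

The grouping makes no difference for this particular data, because every city has a value in 2000, but it matters as soon as a city's first observed year is later than the overall minimum.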
base R
# library(zoo) # na.locf
df2 <- merge(
df,
expand.grid(city = unique(df$city), year = do.call(seq, as.list(range(df$year)))),
by = c("city", "year"), all = TRUE)
df2$constant_y <- ave(df2$constant_y, df2$city, FUN = zoo::na.locf, na.rm = FALSE)
subset(df2, city == "City A")
# city year constant_y
# 1 City A 2000 1
# 2 City A 2001 1
# 3 City A 2002 1
# 4 City A 2003 1
# 5 City A 2004 6
# 6 City A 2005 6
# 7 City A 2006 6
# 8 City A 2007 6
# 9 City A 2008 6
# 10 City A 2009 6
# 11 City A 2010 11
(The same 2010 vs. 2019 change applies here.)
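If zoo is not available, the carry-forward step can be sketched with a small helper in base R. The locf() function below is hypothetical (it is not part of base R or zoo), and the end year of 2019 is an assumption taken from the question's example.

```r
df <- data.frame(
  year = rep(c(2000, 2004, 2010), each = 5),
  city = rep(paste("City", LETTERS[1:5]), times = 3),
  constant_y = as.numeric(1:15)
)

# Hypothetical helper: last observation carried forward.
# Leading NAs (before the first observed value) stay NA.
locf <- function(x) {
  idx <- cumsum(!is.na(x))       # index of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]   # idx == 0 maps to the leading NA
}

# All city/year combinations from the first year out to 2019.
grid <- expand.grid(city = unique(df$city),
                    year = seq(min(df$year), 2019),
                    stringsAsFactors = FALSE)

df2 <- merge(df, grid, by = c("city", "year"), all = TRUE)
df2 <- df2[order(df2$city, df2$year), ]
df2$constant_y <- ave(df2$constant_y, df2$city, FUN = locf)
```

Sorting by city and year before calling ave() matters here, because locf() fills in row order within each city.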
data.table
library(data.table)
DT <- as.data.table(df) # canonical would be `setDT(df)` instead
DT <- DT[, CJ(city = unique(city), year = do.call(seq, as.list(range(year))))
][DT, constant_y := i.constant_y, on = .(city, year)
][, constant_y := nafill(constant_y, type = "locf"), by = .(city)]
DT
# city year constant_y
# <char> <int> <num>
# 1: City A 2000 1
# 2: City A 2001 1
# 3: City A 2002 1
# 4: City A 2003 1
# 5: City A 2004 6
# 6: City A 2005 6
# 7: City A 2006 6
# 8: City A 2007 6
# 9: City A 2008 6
# 10: City A 2009 6
# ---
# 46: City E 2001 5
# 47: City E 2002 5
# 48: City E 2003 5
# 49: City E 2004 10
# 50: City E 2005 10
# 51: City E 2006 10
# 52: City E 2007 10
# 53: City E 2008 10
# 54: City E 2009 10
# 55: City E 2010 15
Below is what my code and data frame look like.
#Get country counts
countries <- as.data.frame(table(na.omit(co_df$country)))
print(countries)
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
I would like to add 4 new rows to the above countries data frame such that it looks like the below:
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
I performed the below rbind function but it gave me an error; I also tried merge(countries, Addcountries, by = Null) and the as.data.frame function but these too gave me errors.
Addcountries <- data.frame(c(11, 12, 13, 14), c("Uruguay", "Saudi Arabia", "Japan", "Australia"), c("25", "19", "11", "10"))
names(Addcountries) <- c("Var1", "Freq")
countries2 <- rbind(countries, Addcountries)
print(countries2)
This is likely a silly issue but I would appreciate any help here since I'm new to R.
You may also use dplyr::add_row():
library(dplyr)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
To check it:
countries <- read.table(text = " Var1 Freq
Austria 6
Canada 4
France 1
Germany 23
India 17
Italy 1
Russia 2
Sweden 1
UK 2
USA 10", header = TRUE)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
Create a dataframe with two columns and rbind.
Addcountries <- data.frame(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10), stringsAsFactors = FALSE)
countries2 <- rbind(countries, Addcountries)
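To see why the original attempt failed and this one works, the sketch below contrasts the two on a cut-down version of the data. The object names good, bad and quoted are made up for illustration.

```r
countries <- data.frame(Var1 = c("Austria", "Canada"), Freq = c(6, 4))

# Same column names and compatible types: rbind() appends the rows.
good <- data.frame(Var1 = c("Uruguay", "Japan"), Freq = c(25, 11))
countries2 <- rbind(countries, good)

# Three columns against two: rbind() refuses to combine them.
bad <- data.frame(c(11, 12), c("Uruguay", "Japan"), c("25", "11"))
bad_msg <- tryCatch(rbind(countries, bad),
                    error = function(e) conditionMessage(e))

# Quoted numbers do not raise an error, but they turn Freq into character.
quoted <- rbind(countries,
                data.frame(Var1 = "Japan", Freq = "11", stringsAsFactors = FALSE))
class(quoted$Freq)
```

Row numbers such as 11 to 14 are row names, not a column, so they do not need to be supplied when building the rows to append.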
I'm new to R, so I am having some trouble modifying my data frame:
id <- c(1, 2,3,4,5,6,7,8,9,10)
number <- c(1,1,1,1,1,1,8,8,2,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0)
What I want is a similar data frame, but with 5 new rows that hold the sum of the value column over M and F, like this:
id <- c(1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15)
number <- c(1,1,1,1,1,1,8,8,2,2,1,1,1,8,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium","France", "France", "France", "Spain", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996,2010,2011,2010,2009,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F","T","T","T","T","T")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis","hiv","hiv","cancer","cancer","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0,16,2,170,647,0)
More clearly:
> whatIhave
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
> whatIwant
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
11 11 1 France 2010 T hiv 16
12 12 1 France 2011 T hiv 2
13 13 1 France 2010 T cancer 170
14 14 8 Spain 2009 T cancer 647
15 15 2 Belgium 1996 T tubercolosis 0
A new value T has been created in the sex column, indicating the sum F + M.
The 5 new rows are the last 5.
There are 5 new rows because I have to add the F and M values for each country, by year and by disease. number is tied to the country, and id simply identifies each row.
My real data frame is obviously much bigger than this.
How can I do this?
Thanks
Here is a quite fast solution using the data.table approach:
library(data.table)
# calculate the sums and store it in a separate data table dtpart2
dtpart2 <- setDT(df)[ , .(value= sum(value)), by = .(number, country, year, disease)]
# create columns of sex and id
dtpart2[, id := max(df$id) + seq_len(nrow(dtpart2))][, sex := "T"]
# set the same column order as in the original data frame
setcolorder(dtpart2, names(df))
# Append the two data sets
newdata <- rbind(df,dtpart2)
#> id number country year sex disease value
#> 1: 1 1 France 2010 M hiv 15
#> 2: 2 1 France 2010 F hiv 1
#> 3: 3 1 France 2011 M hiv 0
#> 4: 4 1 France 2011 F hiv 2
#> 5: 5 1 France 2010 M cancer 50
#> 6: 6 1 France 2010 F cancer 120
#> 7: 7 8 Spain 2009 M cancer 600
#> 8: 8 8 Spain 2009 F cancer 47
#> 9: 9 2 Belgium 1996 M tubercolosis 0
#> 10: 10 2 Belgium 1996 F tubercolosis 0
#> 11: 11 1 France 2010 T hiv 16
#> 12: 12 1 France 2011 T hiv 2
#> 13: 13 1 France 2010 T cancer 170
#> 14: 14 8 Spain 2009 T cancer 647
#> 15: 15 2 Belgium 1996 T tubercolosis 0
DATA:
df <- data.frame(id, number, country, year, sex, disease, value)
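The new ids in the data.table answer are built by adding a sequence to max(id). One detail worth knowing when writing that expression by hand is that `:` binds more tightly than `+` in R. The sketch below uses made-up stand-in values (max_id, n_new) to show the difference.

```r
max_id <- 10   # stands in for max(df$id)
n_new  <- 5    # stands in for nrow(dtpart2)

ids_implicit <- max_id + 1:n_new          # parsed as max_id + (1:n_new)
ids_explicit <- max_id + seq_len(n_new)   # same result, intent is clearer
ids_wrong    <- (max_id + 1):n_new        # counts DOWN from 11 to 5
```

seq_len() also behaves sensibly when there are no new rows: seq_len(0) is empty, whereas 1:0 yields c(1, 0).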
A base R alternative using aggregate() (this version leaves out the id column):
df <-
  data.frame(
    number = c(1,1,1,1,1,1,8,8,2,2),
    country = c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium"),
    year = c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996),
    sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
    disease = c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis"),
    value = c(15,1,0,2,50,120,600,47,0,0))
# sum value over sex within each number/country/disease/year group;
# number is a grouping variable here, so it is kept rather than summed
df2 <- aggregate(value ~ number + country + disease + year, data = df, FUN = sum)
df2$sex <- "T"
# put the columns in the same order as df before appending
df2 <- df2[, names(df)]
newdf <- rbind(df, df2)
newdf
number country year sex disease value
1 1 France 2010 M hiv 15
2 1 France 2010 F hiv 1
3 1 France 2011 M hiv 0
4 1 France 2011 F hiv 2
5 1 France 2010 M cancer 50
6 1 France 2010 F cancer 120
7 8 Spain 2009 M cancer 600
8 8 Spain 2009 F cancer 47
9 2 Belgium 1996 M tubercolosis 0
10 2 Belgium 1996 F tubercolosis 0
11 2 Belgium 1996 T tubercolosis 0
12 8 Spain 2009 T cancer 647
13 1 France 2010 T cancer 170
14 1 France 2010 T hiv 16
15 1 France 2011 T hiv 2
To use the treemap function in googleVis, the hierarchy needs to be flattened into two columns (a node and its parent). Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
What is the most efficient way to transform this data into that form?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
# top level: the root node, which has no parent
topLev <- a %>% group_by(scal) %>% summarize(val = sum(val), fac = sum(fac)) %>%
transmute(Region = scal, Parent = NA_character_, val, fac)
# middle level: continents, whose parent is the root
midLev <- a %>% group_by(scal, cont) %>% summarize(val = sum(val), fac = sum(fac)) %>%
ungroup() %>% select(Region = cont, Parent = scal, val, fac)
# bottom level: countries, whose parent is their continent
bottomLev <- a %>% select(Region = country, Parent = cont, val, fac)
answer <- bind_rows(topLev, midLev, bottomLev)
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
It is interesting that, in Regions, the summaries for the continents and the globe aren't the sum of their components (or their min, max, mean, or any normalized value).
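The same roll-up can also be sketched in base R, which makes the summed totals easy to check. The rollup() helper below is hypothetical, written only for this illustration; it assumes the value columns are named val and fac as in the example data.

```r
a <- data.frame(
  scal = rep("Global", 7),
  cont = c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
  country = c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
  val = c(71, 89, 58, 2, 38, 5, 48),
  fac = c(2, 3, 10, 9, 11, 1, 11),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sum val/fac per node in column `child` and look up
# each node's parent in column `parent` (NULL means the node is the root).
rollup <- function(data, child, parent = NULL) {
  out <- aggregate(data[c("val", "fac")],
                   by = list(Region = data[[child]]), FUN = sum)
  out$Region <- as.character(out$Region)
  out$Parent <- if (is.null(parent)) {
    NA_character_
  } else {
    data[[parent]][match(out$Region, data[[child]])]
  }
  out[c("Region", "Parent", "val", "fac")]
}

answer <- rbind(rollup(a, "scal"),
                rollup(a, "cont", "scal"),
                rollup(a, "country", "cont"))
```

Within each level the rows come out in alphabetical order of Region, so the row order differs from the dplyr result, but the node/parent pairs and totals are the same.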