Filling in Rows with Missing Data in R

I want to write some R code for a problem I couldn't find an answer to on Stack Overflow. I am manipulating a dataset of continent data and need to calculate cumulative values for each year. This is a snapshot of what the data frame looks like:
Continent Year Value Cumulative Value
<chr> <dbl> <dbl> <dbl>
1 Europe 2000. 10. 10.
2 Asia 2000. 30. 30.
3 Africa 2000. 67. 67.
4 N. America 2000. 23. 23.
5 S. America 2000. 19. 19.
6 Europe 2001. 3. 13.
7 Asia 2001. 4. 34.
8 Africa 2001. 3. 70.
9 Europe 2002. 3. 16.
10 Asia 2002. 9. 43.
11 Africa 2002. 2. 72.
12 N. America 2002. 4. 27.
13 S. America 2002. 90. 109.
My issue is that not every continent has a value in every year, yet I still need a cumulative value for that year; for a missing year, a continent's cumulative value should simply carry over from its previous year.
For example, in 2001, N. America and S. America do not have a row, and I would like both to show up with value = 0 and cumulative values of 23 and 19, respectively, the same as in the previous year (2000). I am unsure what code would accomplish this, so any advice would be greatly appreciated.
Continent Year Value Cumulative Value
N. America 2001. 0. 23.
S. America 2001. 0. 19.
Let me know if I should provide more information and thanks again!
data
structure(list(Continent = c("Europe", "Asia", "Africa", "N. America",
"S. America", "Europe", "Asia", "Africa", "Europe", "Asia", "Africa",
"N. America", "S. America"), Year = c(2000, 2000, 2000, 2000,
2000, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 2002), Value = c(10,
30, 67, 23, 19, 3, 4, 3, 3, 9, 2, 4, 90), `Cumulative Value` = c(10,
30, 67, 23, 19, 13, 34, 70, 16, 43, 72, 27, 109)), .Names = c("Continent",
"Year", "Value", "Cumulative Value"), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))

The following should work; the output below was produced from the dput() data shared above.
library(dplyr)
library(tidyr)
# note: the zoo package is also needed, for na.locf()
complete(your_data, Continent, Year, fill = list(Value = 0)) %>%
  group_by(Continent) %>%
  mutate(`Cumulative Value` = zoo::na.locf(`Cumulative Value`))
# A tibble: 15 x 4
# Groups: Continent [5]
Continent Year Value `Cumulative Value`
<chr> <dbl> <dbl> <dbl>
1 Africa 2000 67 67
2 Africa 2001 3 70
3 Africa 2002 2 72
4 Asia 2000 30 30
5 Asia 2001 4 34
6 Asia 2002 9 43
7 Europe 2000 10 10
8 Europe 2001 3 13
9 Europe 2002 3 16
10 N. America 2000 23 23
11 N. America 2001 0 23
12 N. America 2002 4 27
13 S. America 2000 19 19
14 S. America 2001 0 19
15 S. America 2002 90 109
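As a side note, because complete() fills the missing Value entries with 0, the cumulative column can also be recomputed from scratch instead of carried forward. A minimal sketch, assuming the dput() data above is stored in your_data:
library(dplyr)
library(tidyr)
your_data %>%
  complete(Continent, Year, fill = list(Value = 0)) %>%
  group_by(Continent) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(`Cumulative Value` = cumsum(Value)) %>%
  ungroup()
This reproduces the carried-forward totals because a filled-in Value of 0 leaves the running sum unchanged.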

Here's a tidyverse option:
library(tidyverse)
df %>%
  complete(Continent, Year) %>%
  replace_na(list(Value = 0)) %>%
  fill(`Cumulative Value`)
# A tibble: 15 x 4
Continent Year Value `Cumulative Value`
<chr> <dbl> <dbl> <dbl>
1 Africa 2000 67 67
2 Africa 2001 3 70
3 Africa 2002 2 72
4 Asia 2000 30 30
5 Asia 2001 4 34
6 Asia 2002 9 43
7 Europe 2000 10 10
8 Europe 2001 3 13
9 Europe 2002 3 16
10 N. America 2000 23 23
11 N. America 2001 0 23
12 N. America 2002 4 27
13 S. America 2000 19 19
14 S. America 2001 0 19
15 S. America 2002 90 109
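One caveat: fill() works top to bottom over the whole frame unless you group first. With this data every continent has a year-2000 row, so nothing leaks across continent boundaries, but on data where a continent's first year is missing it is safer to group. A sketch under that assumption:
library(tidyverse)
df %>%
  complete(Continent, Year) %>%
  replace_na(list(Value = 0)) %>%
  group_by(Continent) %>%
  fill(`Cumulative Value`) %>%
  ungroup()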

Related

Adding rows to fill "missing" years that are based on the values of the groups in the previous year

I have a dataframe in the following format:
df <- data.frame(year = c(2000, 2000, 2000, 2000, 2000, 2004, 2004, 2004, 2004, 2004,
                          2010, 2010, 2010, 2010, 2010),
                 city = c("City A", "City B", "City C", "City D", "City E",
                          "City A", "City B", "City C", "City D", "City E",
                          "City A", "City B", "City C", "City D", "City E"),
                 constant_y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15))
df
year city constant_y
1 2000 City A 1
2 2000 City B 2
3 2000 City C 3
4 2000 City D 4
5 2000 City E 5
6 2004 City A 6
7 2004 City B 7
8 2004 City C 8
9 2004 City D 9
10 2004 City E 10
11 2010 City A 11
12 2010 City B 12
13 2010 City C 13
14 2010 City D 14
15 2010 City E 15
I'd like to fill in/add the missing years for each city, using data from that city's prior year. In other words, duplicate rows while changing the year column value, grouping by city. Below is the output I'm trying to get for each city (City A as an example):
year city constant_y
1 2000 City A 1
2 2001 City A 1
3 2002 City A 1
4 2003 City A 1
5 2004 City A 6
6 2005 City A 6
7 2006 City A 6
8 2007 City A 6
9 2008 City A 6
10 2009 City A 6
11 2010 City A 11
12 2011 City A 11
13 2012 City A 11
14 2013 City A 11
15 2014 City A 11
16 2015 City A 11
17 2016 City A 11
18 2017 City A 11
19 2018 City A 11
20 2019 City A 11
And then the same for City B, C, D etc. (using their "constant_y" values in prior years). E.g. City B would have 2 until 2003, then 7 from 2004 to 2009, and then 12 from 2010 to 2019.
So yes, I just want to add rows that carry each city's "constant_y" forward into the following years. My data stops at a certain year (2010), but I want to use the 2010 values to extend it some years further, e.g. to 2019 in the example above. I hope I am not overcomplicating this, but I am not sure how to solve it.
Here's one method that starts by finding all possible city/year combinations, joining this on the original data, and then filling (via last-observation-carry-forward techniques) constant_y per-city.
dplyr
library(dplyr)
library(tidyr) # expand, fill
df %>%
  expand(city, year = do.call(seq, as.list(range(year)))) %>%
  full_join(df, by = c("city", "year")) %>%
  arrange(city, year) %>%
  fill(constant_y)
# # A tibble: 55 x 3
# city year constant_y
# <chr> <dbl> <dbl>
# 1 City A 2000 1
# 2 City A 2001 1
# 3 City A 2002 1
# 4 City A 2003 1
# 5 City A 2004 6
# 6 City A 2005 6
# 7 City A 2006 6
# 8 City A 2007 6
# 9 City A 2008 6
# 10 City A 2009 6
# # ... with 45 more rows
Granted, this only goes out to 2010, since that's all that was in your original data. If you need it to go beyond the original data, then change to
df %>%
  expand(city, year = do.call(seq, as.list(range(c(year, 2019))))) %>%
  ...                              # range(c(year, 2019)) is the only change
base R
# library(zoo) # na.locf
df2 <- merge(
  df,
  expand.grid(city = unique(df$city), year = do.call(seq, as.list(range(df$year)))),
  by = c("city", "year"), all = TRUE)
df2$constant_y <- ave(df2$constant_y, df2$city,
                      FUN = function(y) zoo::na.locf(y, na.rm = FALSE))
subset(df2, city == "City A")
# city year constant_y
# 1 City A 2000 1
# 2 City A 2001 1
# 3 City A 2002 1
# 4 City A 2003 1
# 5 City A 2004 6
# 6 City A 2005 6
# 7 City A 2006 6
# 8 City A 2007 6
# 9 City A 2008 6
# 10 City A 2009 6
# 11 City A 2010 11
(Same note as above about 2010 vs 2019.)
data.table
library(data.table)
DT <- as.data.table(df) # canonical would be `setDT(df)` instead
DT <- DT[, CJ(city = unique(city), year = do.call(seq, as.list(range(year))))
         ][DT, constant_y := i.constant_y, on = .(city, year)
         ][, constant_y := nafill(constant_y, type = "locf"), by = .(city)]
DT
# city year constant_y
# <char> <int> <num>
# 1: City A 2000 1
# 2: City A 2001 1
# 3: City A 2002 1
# 4: City A 2003 1
# 5: City A 2004 6
# 6: City A 2005 6
# 7: City A 2006 6
# 8: City A 2007 6
# 9: City A 2008 6
# 10: City A 2009 6
# ---
# 46: City E 2001 5
# 47: City E 2002 5
# 48: City E 2003 5
# 49: City E 2004 10
# 50: City E 2005 10
# 51: City E 2006 10
# 52: City E 2007 10
# 53: City E 2008 10
# 54: City E 2009 10
# 55: City E 2010 15
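A tidyr-only variant of the same approach is sketched below, assuming df is the data frame from the question and that you want to extend the series to 2019 as above:
library(dplyr)
library(tidyr)
df %>%
  complete(city, year = full_seq(c(year, 2019), period = 1)) %>%
  group_by(city) %>%
  fill(constant_y) %>%
  ungroup()
Here complete() builds the full city/year grid in one step, and the grouped fill() does the per-city carry-forward.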

How to add multiple new rows to a data.frame in R?

Below is what my code and data frame look like.
#Get country counts
countries <- as.data.frame(table(na.omit(co_df$country)))
print(countries)
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
I would like to add 4 new rows to the above countries data frame such that it looks like the below:
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
I performed the below rbind function but it gave me an error; I also tried merge(countries, Addcountries, by = Null) and the as.data.frame function but these too gave me errors.
Addcountries <- data.frame(c(11, 12, 13, 14), c("Uruguay", "Saudi Arabia", "Japan", "Australia"), c("25", "19", "11", "10"))
names(Addcountries) <- c("Var1", "Freq")
countries2 <- rbind(countries, Addcountries)
print(countries2)
This is likely a silly issue but I would appreciate any help here since I'm new to R.
You may also use dplyr::add_row():
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
                      Freq = c(25, 19, 11, 10))
To check it with reproducible data:
countries <- read.table(text = " Var1 Freq
Austria 6
Canada 4
France 1
Germany 23
India 17
Italy 1
Russia 2
Sweden 1
UK 2
USA 10", header =T)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
Create a data frame with the same two columns and rbind it:
Addcountries <- data.frame(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
                           Freq = c(25, 19, 11, 10), stringsAsFactors = FALSE)
countries2 <- rbind(countries, Addcountries)
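For reference, the original attempt most likely failed because Addcountries was built from three unnamed vectors (a row-number column plus the two data columns) while names() supplied only two names, and Freq was quoted as character. A quick sanity check before binding (just a sketch, not specific to this data):
stopifnot(identical(names(countries), names(Addcountries)))  # columns must match by name
countries2 <- rbind(countries, Addcountries)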

Add the value of two lines and create a new line

I'm new to R, so I'm having some trouble modifying my data frame:
id <- c(1, 2,3,4,5,6,7,8,9,10)
number <- c(1,1,1,1,1,1,8,8,2,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0)
What I want is a similar data frame but with 5 new rows that contain the sum of the value column over M and F, like this:
id <- c(1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15)
number <- c(1,1,1,1,1,1,8,8,2,2,1,1,1,8,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium","France", "France", "France", "Spain", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996,2010,2011,2010,2009,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F","T","T","T","T","T")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis","hiv","hiv","cancer","cancer","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0,16,2,170,647,0)
More concretely:
> whatIhave
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
> whatIwant
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
11 11 1 France 2010 T hiv 16
12 12 1 France 2011 T hiv 2
13 13 1 France 2010 T cancer 170
14 14 8 Spain 2009 T cancer 647
15 15 2 Belgium 1996 T tubercolosis 0
This creates a new value "T" in the sex column indicating the sum F + M.
The 5 new rows are the last 5.
There are 5 rows because I have to add the F and M values for each country, by year and by disease. Number is related to the country. Id simply indicates the id of each row.
My data frame is obviously much bigger than this.
How can I do this?
Thanks
Here is a quite fast solution using the data.table approach:
library(data.table)
# calculate the sums and store it in a separate data table dtpart2
dtpart2 <- setDT(df)[ , .(value= sum(value)), by = .(number, country, year, disease)]
# create columns of sex and id
dtpart2[, id := max(df$id) + seq_len(.N)][, sex := "T"]
# set the same column order as in the original data frame
setcolorder(dtpart2, names(df))
# Append the two data sets
newdata <- rbind(df,dtpart2)
#> id number country year sex disease value
#> 1: 1 1 France 2010 M hiv 15
#> 2: 2 1 France 2010 F hiv 1
#> 3: 3 1 France 2011 M hiv 0
#> 4: 4 1 France 2011 F hiv 2
#> 5: 5 1 France 2010 M cancer 50
#> 6: 6 1 France 2010 F cancer 120
#> 7: 7 8 Spain 2009 M cancer 600
#> 8: 8 8 Spain 2009 F cancer 47
#> 9: 9 2 Belgium 1996 M tubercolosis 0
#> 10: 10 2 Belgium 1996 F tubercolosis 0
#> 11: 11 1 France 2010 T hiv 16
#> 12: 12 1 France 2011 T hiv 2
#> 13: 13 1 France 2010 T cancer 170
#> 14: 14 8 Spain 2009 T cancer 647
#> 15: 15 2 Belgium 1996 T tubercolosis 0
DATA:
df <- data.frame(id, number, country, year, sex, disease, value)
An alternative in base R builds the "T" totals with aggregate():
df <-
  data.frame(
    number = c(1, 1, 1, 1, 1, 1, 8, 8, 2, 2),
    country = c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium"),
    year = c(2010, 2010, 2011, 2011, 2010, 2010, 2009, 2009, 1996, 1996),
    sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
    disease = c("hiv", "hiv", "hiv", "hiv", "cancer", "cancer", "cancer", "cancer", "tubercolosis", "tubercolosis"),
    value = c(15, 1, 0, 2, 50, 120, 600, 47, 0, 0))
df2 <- aggregate(df[, colnames(df) %in% c("number", "value")],
                 by = list(df$country, df$disease, df$year), FUN = sum)
df2$sex <- "T"
colnames(df2) <- c("country", "disease", "year", "number", "value", "sex")
df2 <- df2[, colnames(df2) %in% c("number", "country", "year", "sex", "disease", "value")]
newdf <- rbind(df,df2)
newdf
number country year sex disease value
1 1 France 2010 M hiv 15
2 1 France 2010 F hiv 1
3 1 France 2011 M hiv 0
4 1 France 2011 F hiv 2
5 1 France 2010 M cancer 50
6 1 France 2010 F cancer 120
7 8 Spain 2009 M cancer 600
8 8 Spain 2009 F cancer 47
9 2 Belgium 1996 M tubercolosis 0
10 2 Belgium 1996 F tubercolosis 0
11 4 Belgium 1996 T tubercolosis 0
12 16 Spain 2009 T cancer 647
13 2 France 2010 T cancer 170
14 2 France 2010 T hiv 16
15 2 France 2011 T hiv 2
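A dplyr version of the same idea is sketched below (my own addition, not part of the answers above). It assumes the df built just above (without an id column) and dplyr >= 1.0 for the .groups argument; grouping on number as well keeps it unchanged in the totals rather than summing it, which matches the desired output more closely:
library(dplyr)
totals <- df %>%
  group_by(number, country, year, disease) %>%
  summarise(value = sum(value), .groups = "drop") %>%
  mutate(sex = "T")
newdf <- bind_rows(df, totals)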

How to flatten a data.frame for use with the googleVis treemap?

In order to use the treemap function in googleVis, the data needs to be flattened so that each row holds a node, its parent, and the size and colour values. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But what is the most efficient way to transform this data?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
cbind(NA, a %>% group_by(scal) %>% summarize(val = sum(val), fac = sum(fac))) -> topLev
names(topLev) <- c("Parent", "Region", "val", "fac")
a %>% group_by(scal, cont) %>% summarize(val = sum(val), fac = sum(fac)) %>%
  select(Region = cont, Parent = scal, val, fac) -> midLev
a[, 2:5] %>% select(Region = country, Parent = cont, val, fac) -> bottomLev
bind_rows(topLev, midLev, bottomLev) %>% select(2, 1, 3, 4) -> answer
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interesting that in the googleVis example the summaries for the continents and the globe aren't the sums of their components (or the min/max/mean/normalized values...).
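To actually draw the treemap from the flattened frame, the call would look roughly like this sketch (the column mapping follows gvisTreeMap's idvar/parentvar/sizevar/colorvar interface; the rest is illustrative):
library(googleVis)
tm <- gvisTreeMap(answer,
                  idvar = "Region", parentvar = "Parent",
                  sizevar = "val", colorvar = "fac")
plot(tm)  # opens the chart in a browser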

How to remove rows in data frame after frequency tables in R

I have 3 data frames in which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured in a manner similar to the data frame called x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know how many countries each continent has, so I did
x2 <- table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. For the example above, that would be the 2 countries in Asia, and I want my final data set to look like the one presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the different data frames, so I need one universal method that I can apply to all of them.
Try
library(dplyr)
x %>%
  group_by(Continent) %>%
  filter(n() > 2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or using the x2
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
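Another dplyr option, sketched under the assumption that your dplyr is recent enough to have the name argument of add_count() (dplyr >= 0.8):
library(dplyr)
x %>%
  add_count(Continent, name = "N") %>%
  filter(N > 2) %>%
  select(-N)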
