How to add multiple new rows to a data.frame in R?

Below is what my code and data frame look like.
#Get country counts
countries <- as.data.frame(table(na.omit(co_df$country)))
print(countries)
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
I would like to add 4 new rows to the above countries data frame such that it looks like the below:
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
I performed the below rbind function but it gave me an error; I also tried merge(countries, Addcountries, by = Null) and the as.data.frame function but these too gave me errors.
Addcountries <- data.frame(c(11, 12, 13, 14), c("Uruguay", "Saudi Arabia", "Japan", "Australia"), c("25", "19", "11", "10"))
names(Addcountries) <- c("Var1", "Freq")
countries2 <- rbind(countries, Addcountries)
print(countries2)
This is likely a silly issue but I would appreciate any help here since I'm new to R.

You can also use dplyr::add_row():
library(dplyr)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Check it:
countries <- read.table(text = " Var1 Freq
Austria 6
Canada 4
France 1
Germany 23
India 17
Italy 1
Russia 2
Sweden 1
UK 2
USA 10", header = TRUE)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10

Create a data frame with the same two columns as countries, then use rbind():
Addcountries <- data.frame(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10), stringsAsFactors = FALSE)
countries2 <- rbind(countries, Addcountries)
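The rbind() call in the question most likely fails because Addcountries was built with three columns (the row numbers 11 to 14 were passed in as data) while countries has only two; the counts were also quoted, so they would have been strings. A minimal sketch of both cases, assuming a small stand-in for countries rather than the asker's co_df data (the names bad, good and failed are illustrative):

```r
# Stand-in for the table()-derived data frame (assumed, not the asker's real data)
countries <- data.frame(Var1 = c("Austria", "Canada", "France"),
                        Freq = c(6L, 4L, 1L),
                        stringsAsFactors = FALSE)

# Three columns, as in the question: the row numbers are data here, not row names
bad <- data.frame(c(11, 12), c("Uruguay", "Japan"), c("25", "11"))
failed <- inherits(try(rbind(countries, bad), silent = TRUE), "try-error")

# Two columns with matching names and numeric counts bind cleanly
good <- data.frame(Var1 = c("Uruguay", "Japan"),
                   Freq = c(25L, 11L),
                   stringsAsFactors = FALSE)
countries2 <- rbind(countries, good)
print(countries2)
```

Row numbers never need to be supplied: rbind() renumbers the combined rows automatically.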

Related

Filling in Rows with Missing Data

I have a specific code that I want to write in R that I couldn't find an answer to on Stack Overflow. I am manipulating a dataset of continents data and am looking to calculate cumulative values for each year. This is a snapshot of what the df looks like:
Continent Year Value Cumulative Value
<chr> <dbl> <dbl> <dbl>
1 Europe 2000. 10. 10.
2 Asia 2000. 30. 30.
3 Africa 2000. 67. 67.
4 N. America 2000. 23. 23.
5 S. America 2000. 19. 19.
6 Europe 2001. 3. 13.
7 Asia 2001. 4. 34.
8 Africa 2001. 3. 70.
9 Europe 2002. 3. 16.
10 Asia 2002. 9. 43.
11 Africa 2002. 2. 72.
12 N. America 2002. 4. 27.
13 S. America 2002. 90. 109.
My issue is that not every continent has a value every year, yet I still need the cumulative value for that year. The cumulative value for that year would be the same for that specific continent as the previous year.
For example, in 2001, N. America and S. America do not have a row, and I would like both to show up with value = 0 and cumulative value as 23 and 19, respectively, the same as the previous year (in year 2000). I am unsure what code would accomplish this so any advice would be greatly appreciated.
Continent Year Value Cumulative Value
N. America 2001. 0. 23.
S. America 2001. 0. 19.
Let me know if I should provide more information and thanks again!
data
structure(list(Continent = c("Europe", "Asia", "Africa", "N. America",
"S. America", "Europe", "Asia", "Africa", "Europe", "Asia", "Africa",
"N. America", "S. America"), Year = c(2000, 2000, 2000, 2000,
2000, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 2002), Value = c(10,
30, 67, 23, 19, 3, 4, 3, 3, 9, 2, 4, 90), `Cumulative Value` = c(10,
30, 67, 23, 19, 13, 34, 70, 16, 43, 72, 27, 109)), .Names = c("Continent",
"Year", "Value", "Cumulative Value"), row.names = c(NA, -13L), class = c("tbl_df",
"tbl", "data.frame"))
This should work, but is untested since your data isn't shared in a copy/pasteable way. Share dput(your_sample_data) and I will test/debug.
library(dplyr)
library(tidyr)
complete(your_data, Continent, Year, fill = list(Value = 0)) %>%
group_by(Continent) %>%
mutate(`Cumulative Value` = zoo::na.locf(`Cumulative Value`))
# A tibble: 15 x 4
# Groups: Continent [5]
Continent Year Value CV
<chr> <dbl> <dbl> <dbl>
1 Africa 2000 67 67
2 Africa 2001 3 70
3 Africa 2002 2 72
4 Asia 2000 30 30
5 Asia 2001 4 34
6 Asia 2002 9 43
7 Europe 2000 10 10
8 Europe 2001 3 13
9 Europe 2002 3 16
10 N. America 2000 23 23
11 N. America 2001 0 23
12 N. America 2002 4 27
13 S. America 2000 19 19
14 S. America 2001 0 19
15 S. America 2002 90 109
Here's a tidyverse option:
library(tidyverse)
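df %>%
complete(Continent, Year) %>%
replace_na(list(Value = 0)) %>%
# fill within each continent so a gap never inherits another continent's total
group_by(Continent) %>%
fill(Cumulative) %>%
ungroup()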
# A tibble: 15 x 4
Continent Year Value Cumulative
<chr> <int> <dbl> <int>
1 Africa 2000 67 67
2 Africa 2001 3 70
3 Africa 2002 2 72
4 Asia 2000 30 30
5 Asia 2001 4 34
6 Asia 2002 9 43
7 Europe 2000 10 10
8 Europe 2001 3 13
9 Europe 2002 3 16
10 N. America 2000 23 23
11 N. America 2001 0 23
12 N. America 2002 4 27
13 S. America 2000 19 19
14 S. America 2001 0 19
15 S. America 2002 90 109
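Since the cumulative value is a running sum of Value within each continent, it can also be recomputed once the missing rows are filled with zero, instead of being carried forward. A base R sketch under that assumption, using a shortened copy of the data (the names grid, filled and Cumulative are illustrative):

```r
# Shortened copy of the data: N. America has no row for 2001
df <- data.frame(
  Continent = c("Europe", "N. America", "Europe", "Europe", "N. America"),
  Year      = c(2000, 2000, 2001, 2002, 2002),
  Value     = c(10, 23, 3, 3, 4),
  stringsAsFactors = FALSE
)

# Every Continent/Year combination, including the ones missing from df
grid <- expand.grid(Continent = unique(df$Continent),
                    Year = unique(df$Year),
                    stringsAsFactors = FALSE)

# Left join onto the full grid; absent combinations come back as NA, then become 0
filled <- merge(grid, df, by = c("Continent", "Year"), all.x = TRUE)
filled$Value[is.na(filled$Value)] <- 0

# Order by year within continent, then take the running sum per continent
filled <- filled[order(filled$Continent, filled$Year), ]
filled$Cumulative <- ave(filled$Value, filled$Continent, FUN = cumsum)
print(filled)
```

Recomputing the sum avoids any dependence on the order in which rows are filled, at the cost of ignoring an existing cumulative column.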

Cumulative count of character vector

I want to make a cumulative count of country names from a data frame:
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany",
"Vietnam"), year= c(1834, 1846, 1847, 1852, 1860, 1865))
I have tried different versions of count(), cumsum() and tally() but can't seem to get it right.
Output should look like:
country year n
Sweden 1834 1
Germany 1846 2
Sweden 1847 2
Sweden 1852 2
Germany 1860 2
Vietnam 1865 3
library(dplyr)
df %>% mutate(count = cumsum(!duplicated(country))) %>% as_tibble()
#> # A tibble: 6 x 3
#> country year count
#> <fctr> <dbl> <int>
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
or
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
df %>% mutate(var2=dist_cum(country))
#> country year var2
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
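The same running count of distinct countries can be had without any packages. A minimal base R sketch, assuming the rows are already in the order the count should follow (here, by year):

```r
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany",
                             "Vietnam"),
                 year = c(1834, 1846, 1847, 1852, 1860, 1865))

# duplicated() is FALSE at the first occurrence of each country, so the
# running sum of its negation counts the distinct countries seen so far
df$n <- cumsum(!duplicated(df$country))
print(df)
```

If the data is not sorted, ordering it first (for example with df[order(df$year), ]) keeps the count aligned with time.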
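You can try this (note that it counts the rows for each country and year combination, not a running count of distinct countries):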
library(ggplot2)
library(plyr)
df<-data.frame(country=c("Sweden","Germany","Sweden","Sweden","Germany","Vietnam", "Germany"),year= c(1834,1846,1847,1852,1860,1865,1860))
counts <- ddply(df, .(df$country, df$year), nrow)
The output is:
> counts
df$country df$year V1
1 Germany 1846 1
2 Germany 1860 2
3 Sweden 1834 1
4 Sweden 1847 1
5 Sweden 1852 1
6 Vietnam 1865 1

Add the value of two lines and create a new line

I'm new to R, so I'm having trouble modifying my data frame:
id <- c(1, 2,3,4,5,6,7,8,9,10)
number <- c(1,1,1,1,1,1,8,8,2,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0)
What I want is a similar data frame with 5 new rows that hold the sum of the value column for M and F, like this:
id <- c(1, 2,3,4,5,6,7,8,9,10,11,12,13,14,15)
number <- c(1,1,1,1,1,1,8,8,2,2,1,1,1,8,2)
country <- c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium","France", "France", "France", "Spain", "Belgium")
year <- c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996,2010,2011,2010,2009,1996)
sex <- c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F","T","T","T","T","T")
disease <- c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis","hiv","hiv","cancer","cancer","tubercolosis")
value <- c(15,1,0,2,50,120,600,47,0,0,16,2,170,647,0)
More clearly:
> whatIhave
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
> whatIwant
id number country year sex disease value
1 1 1 France 2010 M hiv 15
2 2 1 France 2010 F hiv 1
3 3 1 France 2011 M hiv 0
4 4 1 France 2011 F hiv 2
5 5 1 France 2010 M cancer 50
6 6 1 France 2010 F cancer 120
7 7 8 Spain 2009 M cancer 600
8 8 8 Spain 2009 F cancer 47
9 9 2 Belgium 1996 M tubercolosis 0
10 10 2 Belgium 1996 F tubercolosis 0
11 11 1 France 2010 T hiv 16
12 12 1 France 2011 T hiv 2
13 13 1 France 2010 T cancer 170
14 14 8 Spain 2009 T cancer 647
15 15 2 Belgium 1996 T tubercolosis 0
A new value T in the sex column indicates the sum F + M.
The 5 new rows are the last 5.
There are 5 new rows because I have to add the F and M values for each country, by year and by disease. number is tied to the country, and id simply identifies each row.
My real data frame is obviously much bigger than this.
How can I do this?
Thanks
Here is a quite fast solution using the data.table approach:
library(data.table)
# calculate the sums and store it in a separate data table dtpart2
dtpart2 <- setDT(df)[ , .(value= sum(value)), by = .(number, country, year, disease)]
# create columns of sex and id
dtpart2[, id := max(df$id) + seq_len(.N)][, sex := "T"]
# set the same column order as in the original data frame
setcolorder(dtpart2, names(df))
# Append the two data sets
newdata <- rbind(df,dtpart2)
#> id number country year sex disease value
#> 1: 1 1 France 2010 M hiv 15
#> 2: 2 1 France 2010 F hiv 1
#> 3: 3 1 France 2011 M hiv 0
#> 4: 4 1 France 2011 F hiv 2
#> 5: 5 1 France 2010 M cancer 50
#> 6: 6 1 France 2010 F cancer 120
#> 7: 7 8 Spain 2009 M cancer 600
#> 8: 8 8 Spain 2009 F cancer 47
#> 9: 9 2 Belgium 1996 M tubercolosis 0
#> 10: 10 2 Belgium 1996 F tubercolosis 0
#> 11: 11 1 France 2010 T hiv 16
#> 12: 12 1 France 2011 T hiv 2
#> 13: 13 1 France 2010 T cancer 170
#> 14: 14 8 Spain 2009 T cancer 647
#> 15: 15 2 Belgium 1996 T tubercolosis 0
DATA:
df <- data.frame(id, number, country, year, sex, disease, value)
Alternatively, in base R with aggregate():
df <-
data.frame(
number = c(1,1,1,1,1,1,8,8,2,2),
country = c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Belgium", "Belgium"),
year = c(2010,2010,2011,2011,2010,2010,2009,2009,1996,1996),
sex = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
disease = c("hiv","hiv","hiv","hiv","cancer","cancer","cancer","cancer","tubercolosis","tubercolosis"),
value = c(15,1,0,2,50,120,600,47,0,0))
# sum value within each number/country/year/disease group
# (number is a grouping key here, so it is not summed itself)
df2 <- aggregate(value ~ number + country + year + disease, data = df, FUN = sum)
df2$sex <- "T"
# put the columns in the same order as df before appending
df2 <- df2[, names(df)]
newdf <- rbind(df, df2)
newdf
number country year sex disease value
1 1 France 2010 M hiv 15
2 1 France 2010 F hiv 1
3 1 France 2011 M hiv 0
4 1 France 2011 F hiv 2
5 1 France 2010 M cancer 50
6 1 France 2010 F cancer 120
7 8 Spain 2009 M cancer 600
8 8 Spain 2009 F cancer 47
9 2 Belgium 1996 M tubercolosis 0
10 2 Belgium 1996 F tubercolosis 0
11 8 Spain 2009 T cancer 647
12 1 France 2010 T cancer 170
13 1 France 2010 T hiv 16
14 1 France 2011 T hiv 2
15 2 Belgium 1996 T tubercolosis 0
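The question also asks for the id column to continue past the last existing id. A base R sketch of that, on a cut-down copy of the data; number is treated as a grouping key alongside country, year and disease so that it is carried through rather than summed (the names totals and result are illustrative):

```r
# Cut-down copy of the question's data (two country/disease groups only)
df <- data.frame(
  id      = 1:4,
  number  = c(1, 1, 8, 8),
  country = c("France", "France", "Spain", "Spain"),
  year    = c(2010, 2010, 2009, 2009),
  sex     = c("M", "F", "M", "F"),
  disease = c("hiv", "hiv", "cancer", "cancer"),
  value   = c(15, 1, 600, 47),
  stringsAsFactors = FALSE
)

# One total per number/country/year/disease group
totals <- aggregate(value ~ number + country + year + disease, data = df, FUN = sum)
totals$sex <- "T"

# Continue the id sequence after the largest existing id
totals$id <- max(df$id) + seq_len(nrow(totals))

# Reorder the columns to match df, then append
result <- rbind(df, totals[, names(df)])
print(result)
```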

How to flatten data.frame for use with googlevis treemap?

In order to use the treemap function on googleVis, data needs to be flattened into two columns. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how can this data be transformed most efficiently?
If we use dplyr, this script will transform the data correctly:
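library(dplyr)
# top level: the global totals, which have no parent
topLev <- a %>% group_by(scal) %>% summarize(val = sum(val), fac = sum(fac)) %>%
ungroup() %>% mutate(Parent = NA_character_) %>% select(Region = scal, Parent, val, fac)
# middle level: continent totals, with the global row as parent
midLev <- a %>% group_by(scal, cont) %>% summarize(val = sum(val), fac = sum(fac)) %>%
ungroup() %>% select(Region = cont, Parent = scal, val, fac)
# bottom level: the countries themselves, with their continent as parent
bottomLev <- a %>% select(Region = country, Parent = cont, val, fac)
answer <- bind_rows(topLev, midLev, bottomLev)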
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interestingly, in the googleVis Regions example the values for the continents and the globe are not the sums of their components (nor their min, max or mean).
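The same flattening can be written once for any number of hierarchy levels. A base R sketch, assuming each node name belongs to exactly one parent and that parent totals should be sums of their children; flatten_hierarchy and its arguments are hypothetical names, not part of googleVis:

```r
a <- data.frame(
  scal    = rep("Global", 7),
  cont    = c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
  country = c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
  val     = c(71, 89, 58, 2, 38, 5, 48),
  fac     = c(2, 3, 10, 9, 11, 1, 11),
  stringsAsFactors = FALSE
)

# Hypothetical helper: level_cols are ordered from root to leaf,
# measures are the numeric columns to sum at every level
flatten_hierarchy <- function(data, level_cols, measures) {
  pieces <- lapply(seq_along(level_cols), function(i) {
    node <- level_cols[i]
    # Sum the measures over each distinct node at this depth
    sums <- aggregate(data[measures], by = data[node], FUN = sum)
    # The root has no parent; otherwise look up the parent of each node
    parent <- if (i == 1) {
      NA_character_
    } else {
      as.character(data[[level_cols[i - 1]]][match(sums[[node]], data[[node]])])
    }
    data.frame(Region = as.character(sums[[node]]), Parent = parent,
               sums[measures], stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

flat <- flatten_hierarchy(a, c("scal", "cont", "country"), c("val", "fac"))
print(flat)
```

The result has the Region and Parent columns that the treemap expects as its id and parent variables.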

How to remove rows in data frame after frequency tables in R

I have 3 data frames from which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured similarly to the data frame x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did
x2<-table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. For the example above it will be the 2 countries in Asia and I want my final data set to look like presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the data frames, so I need one general method that I can apply to all of them.
Try
library(dplyr)
x %>%
group_by(Continent) %>%
filter(n()>2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or using the x2
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
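Since the question asks for one method that works across several data frames, the table-based filter can be wrapped in a small function. A base R sketch; keep_common_groups and min_n are hypothetical names, and the threshold of 3 is taken from the example, where Asia with 2 countries is dropped:

```r
x <- data.frame(
  Country   = c("Kenya", "Gabon", "Spain", "Belgium", "China",
                "Nigeria", "Holland", "Italy", "Japan"),
  Continent = c("Africa", "Africa", "Europe", "Europe", "Asia",
                "Africa", "Europe", "Europe", "Asia"),
  Ranking   = c(17, 23, 4, 3, 10, 14, 1, 5, 6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows whose group has at least min_n members
keep_common_groups <- function(data, group_col, min_n = 3) {
  counts <- table(data[[group_col]])
  keep <- names(counts)[counts >= min_n]
  data[data[[group_col]] %in% keep, , drop = FALSE]
}

result <- keep_common_groups(x, "Continent")
print(result)

# The same call applies to a list of data frames (x is reused here as a stand-in)
all_results <- lapply(list(x, x), keep_common_groups, group_col = "Continent")
```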
