Reformat input-output table in R

I have an input output table with the origin (input field) as rows and the destination (output field) as columns. Here's an example:
Mexico Thailand Vietnam
USA 0 3 6
Italy 3 7 8
France 9 3 1
Germany 3 6 7
I want to convert the table so that the origin is in column 1, the destination is in column 2, and the value is in column 3, so that it looks like this:
origin destination value
USA Mexico 0
USA Thailand 3
USA Vietnam 6
Italy Mexico 3
Italy Thailand 7
Italy Vietnam 8
France Mexico 9
France Thailand 3
France Vietnam 1
Germany Mexico 3
Germany Thailand 6
Germany Vietnam 7

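There is a simple solution using the melt function from the reshape2 package:
# sample data
Mexico <- c(0, 3, 9, 3)
Thailand <- c(3, 7, 3, 6)
Vietnam <- c(6, 8, 1, 7)
names <- c("USA", "Italy", "France", "Germany")
df <- data.frame(names, Mexico, Thailand, Vietnam)
library(reshape2)
# names holds the origin; every other column is melted into (destination, value) pairs
melt(df, id.vars = "names", variable.name = "destination", value.name = "value")
The tidyr package has similar functionality:
library(tidyr)
# gather every column except names into (destination, value) pairs
gather(df, key = "destination", value = "value", -names)
Both calls return the rows grouped by destination rather than by origin.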
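If the origin-first row order from the question matters, the reshaping can also be sketched in base R without any package. This is only one possible approach; the names io, dest and long are illustrative and do not come from the answers above.

```r
# Sample data from the question; `io`, `dest` and `long` are illustrative names.
io <- data.frame(
  origin   = c("USA", "Italy", "France", "Germany"),
  Mexico   = c(0, 3, 9, 3),
  Thailand = c(3, 7, 3, 6),
  Vietnam  = c(6, 8, 1, 7)
)

# Every column except the origin column is a destination.
dest <- setdiff(names(io), "origin")

# Transposing before flattening walks the table row by row,
# so all destinations of one origin stay together.
long <- data.frame(
  origin      = rep(io$origin, each = length(dest)),
  destination = rep(dest, times = nrow(io)),
  value       = as.vector(t(as.matrix(io[dest])))
)

print(long)
```

The rep() calls rely on the same row-by-row order as the flattened values, which is why the matrix is transposed first.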

Related

Creating a new column when two columns satisfy certain conditions in R

My data is like this:
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany South Korea NA
China Russia NA NA NA NA
What I want to do is create a new variable that flags rows where the country column matches any of the supporter columns (supporter1 to supporter5); for instance, country France and supporter2 France are the same. In that case the new variable should be 1, and 0 otherwise.
I expect to have this:
country supporter1 supporter2 supporter3 supporter4 supporter5 new variable
USA Albania Germany USA NA NA 1
France USA France NA NA NA 1
UK UK Chile Peru NA NA 1
Germany USA Iran Mexico India Pakistan 0
USA China Spain NA NA NA 0
Cuba Cuba UK Germany South Korea NA 1
China Russia NA NA NA NA 0
Update: a dplyr-only solution using if_any:
library(dplyr)
df %>%
rowwise() %>%
mutate(new_var = as.integer(as.logical(if_any(starts_with("supporter"), ~ . %in% country))))
country supporter1 supporter2 supporter3 supporter4 supporter5 new_var
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
First answer (also correct). Here is one possible solution:
Work row-wise.
For each of the columns supporter1 to supporter5, check whether it matches country.
Unite all the new columns into one, then use an ifelse statement to return 1 or 0.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
rowwise() %>%
mutate(across(supporter1:supporter5, ~ifelse(. %in% country, 1,0), .names = "new_{col}")) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
mutate(New_Col = ifelse(str_detect(New_Col, "1"), 1,0))
country supporter1 supporter2 supporter3 supporter4 supporter5 New_Col
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 USA Albania Germany USA NA NA 1
2 France USA France NA NA NA 1
3 UK UK Chile Peru NA NA 1
4 Germany USA Iran Mexico India Pakistan 0
5 USA China Spain NA NA NA 0
6 Cuba Cuba UK Germany South Korea NA 1
7 China Russia NA NA NA NA 0
Here is a base R solution.
First, mapply checks each supporter* column for equality with country; NA values are treated as FALSE. Then as.integer/rowSums turns rows with at least one TRUE into 1, and all others into 0.
eq <- mapply(\(x, y){x == y & !is.na(x)}, df1[-1], df1[1])
as.integer(rowSums(eq) != 0)
#[1] 1 1 1 0 0 1 0
df1$new_variable <- as.integer(rowSums(eq) != 0)
Data
df1 <- read.table(text = "
country supporter1 supporter2 supporter3 supporter4 supporter5
USA Albania Germany USA NA NA
France USA France NA NA NA
UK UK Chile Peru NA NA
Germany USA Iran Mexico India Pakistan
USA China Spain NA NA NA
Cuba Cuba UK Germany 'South Korea' NA
China Russia NA NA NA NA
", header = TRUE)
Another solution is to check, row by row, whether country is present in one of the supporter columns:
df <- data.frame(country=c("USA","France","UK","Germany","USA","Cuba","China"),
supporter1=c("Albania","USA","UK","USA","China","Cuba","Russia"),
supporter2=c("Germany","France","Chile","Iran","Spain","UK","NA"),
supporter3=c("USA","NA","Peru","Mexico","NA","Germany","NA"),
supporter4=c("NA","NA","NA","India","NA","South Korea","NA"),
supporter5=c("NA","NA","NA","Pakistan","NA","NA","NA"))
The check itself:
df$new <- sapply(seq(1,nrow(df)), function(x) ifelse(df$country[x] %in% df[x,2:6],1,0))
> df$new
[1] 1 1 1 0 0 1 0
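The row-wise checks above can also be written as a single vectorised comparison. The sketch below is one way to do it in base R, assuming the missing supporters are real NA values rather than the string "NA"; the names supp and hits are illustrative only.

```r
# Data from the question, with real NA values; `supp` and `hits` are illustrative names.
supp <- data.frame(
  country    = c("USA", "France", "UK", "Germany", "USA", "Cuba", "China"),
  supporter1 = c("Albania", "USA", "UK", "USA", "China", "Cuba", "Russia"),
  supporter2 = c("Germany", "France", "Chile", "Iran", "Spain", "UK", NA),
  supporter3 = c("USA", NA, "Peru", "Mexico", NA, "Germany", NA),
  supporter4 = c(NA, NA, NA, "India", NA, "South Korea", NA),
  supporter5 = c(NA, NA, NA, "Pakistan", NA, NA, NA)
)

# Compare every supporter column with the country column in one step.
# The country vector is recycled down each column, so row i is compared with country i.
hits <- supp[-1] == supp$country

# NA comparisons are ignored; a row gets 1 if at least one supporter matched.
supp$new_variable <- as.integer(rowSums(hits, na.rm = TRUE) > 0)

print(supp$new_variable)
```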

How to add new multiple rows to data.frame on R?

Below is how my code and dataframe looks like.
#Get country counts
countries <- as.data.frame(table(na.omit(co_df$country)))
print(countries)
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
I would like to add 4 new rows to the above countries data frame such that it looks like the below:
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
I performed the below rbind function but it gave me an error; I also tried merge(countries, Addcountries, by = Null) and the as.data.frame function but these too gave me errors.
Addcountries <- data.frame(c(11, 12, 13, 14), c("Uruguay", "Saudi Arabia", "Japan", "Australia"), c("25", "19", "11", "10"))
names(Addcountries) <- c("Var1", "Freq")
countries2 <- rbind(countries, Addcountries)
print(countries2)
This is likely a silly issue but I would appreciate any help here since I'm new to R.
You may also use dplyr::add_row():
library(dplyr)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Check it with reproducible data:
countries <- read.table(text = " Var1 Freq
Austria 6
Canada 4
France 1
Germany 23
India 17
Italy 1
Russia 2
Sweden 1
UK 2
USA 10", header =T)
countries %>% add_row(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10))
Var1 Freq
1 Austria 6
2 Canada 4
3 France 1
4 Germany 23
5 India 17
6 Italy 1
7 Russia 2
8 Sweden 1
9 UK 2
10 USA 10
11 Uruguay 25
12 Saudi Arabia 19
13 Japan 11
14 Australia 10
Create a dataframe with two columns and rbind.
Addcountries <- data.frame(Var1 = c("Uruguay", "Saudi Arabia", "Japan", "Australia"),
Freq = c(25, 19, 11, 10), stringsAsFactors = FALSE)
countries2 <- rbind(countries, Addcountries)
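A likely reason the rbind in the question failed is that Addcountries was built with three columns (a row index, the name, and the count as text) but was given only two names, so it did not line up with countries. The sketch below reproduces that mismatch on a shortened table; the names bad, good and combined are illustrative, and the two-row data is a cut-down version of the question's table.

```r
# Shortened version of the `countries` table from the question.
countries <- data.frame(Var1 = c("Austria", "Canada"), Freq = c(6, 4))

# Three unnamed columns, as in the question: row index, name, count stored as text.
bad <- data.frame(c(11, 12), c("Uruguay", "Japan"), c("25", "11"))

# rbind() on data frames needs the same number of columns; here it stops with an error.
attempt <- try(rbind(countries, bad), silent = TRUE)
print(inherits(attempt, "try-error"))

# Two columns with the same names and types as `countries` bind without trouble.
good <- data.frame(Var1 = c("Uruguay", "Japan"), Freq = c(25, 11))
combined <- rbind(countries, good)
print(combined)
```

Row numbers do not need to be supplied; rbind continues the numbering on its own.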

cumulative count of character vector

I want to make a cumulative count of country names from a data frame:
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany",
"Vietnam"), year= c(1834, 1846, 1847, 1852, 1860, 1865))
I have tried different versions of count(), cumsum() and tally() but can’t seem to get it right.
Output should look like:
country year n
Sweden 1834 1
Germany 1846 2
Sweden 1847 2
Sweden 1852 2
Germany 1860 2
Vietnam 1865 3
library(dplyr)
df %>% mutate(count = cumsum(!duplicated(country))) %>% as_tibble()
#> # A tibble: 6 x 3
#> country year count
#> <fctr> <dbl> <int>
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
or
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
df %>% mutate(var2=dist_cum(country))
#> country year var2
#> 1 Sweden 1834 1
#> 2 Germany 1846 2
#> 3 Sweden 1847 2
#> 4 Sweden 1852 2
#> 5 Germany 1860 2
#> 6 Vietnam 1865 3
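Both answers above compute a running number of distinct countries: a row adds 1 only when its country has not appeared earlier. The same idea can be sketched in base R without dplyr; the data frame name events is illustrative.

```r
# Data from the question; `events` is an illustrative name.
events <- data.frame(
  country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam"),
  year    = c(1834, 1846, 1847, 1852, 1860, 1865)
)

# duplicated() is FALSE the first time a country appears and TRUE afterwards,
# so summing its negation counts the distinct countries seen so far.
events$n <- cumsum(!duplicated(events$country))

print(events)
```

This assumes the rows are already in the order in which the count should accumulate, here by year.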
You can try this, which counts the rows for each combination of country and year (it is not a cumulative count):
library(plyr)
df <- data.frame(country = c("Sweden", "Germany", "Sweden", "Sweden", "Germany", "Vietnam", "Germany"), year = c(1834, 1846, 1847, 1852, 1860, 1865, 1860))
counts <- ddply(df, .(df$country, df$year), nrow)
The output is:
> counts
df$country df$year V1
1 Germany 1846 1
2 Germany 1860 2
3 Sweden 1834 1
4 Sweden 1847 1
5 Sweden 1852 1
6 Vietnam 1865 1

How to flatten data.frame for use with googlevis treemap?

In order to use the treemap function on googleVis, data needs to be flattened into two columns. Using their example:
> library(googleVis)
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
However, in the real world this information more frequently looks like this:
> a <- data.frame(
+ scal=c("Global", "Global", "Global", "Global", "Global", "Global", "Global"),
+ cont=c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
+ country=c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
+ val=c(71, 89, 58, 2, 38, 5, 48),
+ fac=c(2,3,10,9,11,1,11))
> a
scal cont country val fac
1 Global Europe France 71 2
2 Global Europe Sweden 89 3
3 Global Europe Germany 58 10
4 Global America Mexico 2 9
5 Global America USA 38 11
6 Global Asia China 5 1
7 Global Asia Japan 48 11
But how can this data be transformed most efficiently?
If we use dplyr, this script will transform the data correctly:
library(dplyr)
cbind(NA,a %>% group_by(scal) %>% summarize(val=sum(val),fac=sum(fac))) -> topLev
names(topLev) <- c("Parent","Region","val","fac")
a %>% group_by(scal,cont) %>% summarize(val=sum(val),fac=sum(fac)) %>%
select(Region=cont,Parent=scal,val,fac) -> midLev
a[,2:5] %>% select(Region=country,Parent=cont,val,fac) -> bottomLev
bind_rows(topLev,midLev,bottomLev) %>% select(2,1,3,4) -> answer
We can verify this by comparing dataframes:
> answer
Source: local data frame [11 x 4]
Region Parent val fac
1 Global NA 311 47
2 America Global 40 20
3 Asia Global 53 12
4 Europe Global 218 15
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
> Regions
Region Parent Val Fac
1 Global <NA> 10 2
2 America Global 2 4
3 Europe Global 99 11
4 Asia Global 10 8
5 France Europe 71 2
6 Sweden Europe 89 3
7 Germany Europe 58 10
8 Mexico America 2 9
9 USA America 38 11
10 China Asia 5 1
11 Japan Asia 48 11
Interestingly, in the googleVis Regions example the values for the continents and the globe aren't the sum of their components (or min/max/ave/mean/normalized...)
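The same three-level flattening can be sketched in base R with aggregate and rbind. This is an alternative to the dplyr script above, not part of the original answer; like that script it assumes parent values are sums of their children, and the names top, mid, bottom and flat are illustrative.

```r
# Wide hierarchy from the question.
a <- data.frame(
  scal    = rep("Global", 7),
  cont    = c("Europe", "Europe", "Europe", "America", "America", "Asia", "Asia"),
  country = c("France", "Sweden", "Germany", "Mexico", "USA", "China", "Japan"),
  val     = c(71, 89, 58, 2, 38, 5, 48),
  fac     = c(2, 3, 10, 9, 11, 1, 11)
)

# Top level: one row for the root, which has no parent.
top <- data.frame(Region = unique(a$scal), Parent = NA_character_,
                  val = sum(a$val), fac = sum(a$fac))

# Middle level: one row per continent, summed over its countries.
mid <- aggregate(cbind(val, fac) ~ cont + scal, data = a, FUN = sum)
names(mid)[1:2] <- c("Region", "Parent")

# Bottom level: the countries themselves, with their continent as parent.
bottom <- data.frame(Region = a$country, Parent = a$cont,
                     val = a$val, fac = a$fac)

# rbind() matches the columns by name.
flat <- rbind(top, mid, bottom)
print(flat)
```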

How to remove rows in data frame after frequency tables in R

I have 3 data frames from which I have to find the continents with fewer than 3 countries and remove those countries (rows). The data frames are structured in a manner similar to a data frame called x below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 China Asia 10
6 Nigeria Africa 14
7 Holland Europe 01
8 Italy Europe 05
9 Japan Asia 06
First I wanted to know the number of countries per continent, so I did
x2<-table(x$Continent)
x2
Africa Europe Asia
3 4 2
Then I wanted to identify the continents with fewer than 3 countries
x3 <- x2[x2 < 3]
x3
Asia
2
My problem now is how to remove these countries. For the example above it will be the 2 countries in Asia and I want my final data set to look like presented below:
row Country Continent Ranking
1 Kenya Africa 17
2 Gabon Africa 23
3 Spain Europe 04
4 Belgium Europe 03
5 Nigeria Africa 14
6 Holland Europe 01
7 Italy Europe 05
The number of continents with fewer than 3 countries will vary among the different data frames, so I need one universal method that I can apply to all.
Try
library(dplyr)
x %>%
group_by(Continent) %>%
filter(n()>2)
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#5 6 Nigeria Africa 14
#6 7 Holland Europe 01
#7 8 Italy Europe 05
Or using the x2
subset(x, Continent %in% names(x2)[x2>2])
# row Country Continent Ranking
#1 1 Kenya Africa 17
#2 2 Gabon Africa 23
#3 3 Spain Europe 04
#4 4 Belgium Europe 03
#6 6 Nigeria Africa 14
#7 7 Holland Europe 01
#8 8 Italy Europe 05
A very easy way with "data.table" would be:
library(data.table)
as.data.table(x)[, N := .N, by = Continent][N > 2]
# row Country Continent Ranking N
# 1: 1 Kenya Africa 17 3
# 2: 2 Gabon Africa 23 3
# 3: 3 Spain Europe 4 4
# 4: 4 Belgium Europe 3 4
# 5: 6 Nigeria Africa 14 3
# 6: 7 Holland Europe 1 4
# 7: 8 Italy Europe 5 4
In base R you can try:
x[with(x, ave(rep(TRUE, nrow(x)), Continent, FUN = function(y) length(y) > 2)), ]
# row Country Continent Ranking
# 1 1 Kenya Africa 17
# 2 2 Gabon Africa 23
# 3 3 Spain Europe 4
# 4 4 Belgium Europe 3
# 6 6 Nigeria Africa 14
# 7 7 Holland Europe 1
# 8 8 Italy Europe 5
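Since the question asks for one method that can be applied to all three data frames, the table-based answer can be wrapped in a small function. The sketch below is one way to do that; drop_small_groups and its arguments are hypothetical names, and the threshold of 3 matches the n() > 2 used in the answers.

```r
# Hypothetical helper: keep only the rows whose group has at least `min_n` members.
drop_small_groups <- function(data, group_col, min_n) {
  counts <- table(data[[group_col]])
  keep <- names(counts)[counts >= min_n]
  data[data[[group_col]] %in% keep, , drop = FALSE]
}

# Example data from the question.
x <- data.frame(
  Country   = c("Kenya", "Gabon", "Spain", "Belgium", "China",
                "Nigeria", "Holland", "Italy", "Japan"),
  Continent = c("Africa", "Africa", "Europe", "Europe", "Asia",
                "Africa", "Europe", "Europe", "Asia"),
  Ranking   = c(17, 23, 4, 3, 10, 14, 1, 5, 6)
)

# Asia has only 2 countries, so both of its rows are dropped.
result <- drop_small_groups(x, "Continent", 3)
print(result)

# The same call works on a list of data frames.
cleaned <- lapply(list(x, x), drop_small_groups, group_col = "Continent", min_n = 3)
```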
