Using apply function to calculate the mean of a column - r

After splitting a data frame into multiple data frames by country, I wanted to calculate the mean of the column centralization in each country's data frame. tapply worked, but when I tried sapply(), every country got the mean value of the first country. I cannot figure out why, and I am asked to use sapply as an exercise, so I would like to know how I can improve my code. Any pointers would be appreciated (it might be a dumb mistake).
INPUT/my code:
strikes.df = read.csv("http://www.stat.cmu.edu/~pfreeman/strikes.csv")
strikes.by.country=split(strikes.df,strikes.df$country)
my.fun=function(x=strikes.by.country){
l=length(strikes.by.country)
for (i in 1:l){
return(strikes.by.country[[i]]$centralization %>% mean)
}
}
sapply(strikes.by.country, my.fun)
#using tapply()
tapply(strikes.df[,"centralization",],INDEX=strikes.df[,"country",],FUN=mean)
OUTPUT
Australia Austria Belgium Canada Denmark
0.374644 0.374644 0.374644 0.374644 0.374644
Finland France Germany Ireland Italy
0.374644 0.374644 0.374644 0.374644 0.374644
Japan Netherlands New.Zealand Norway Sweden
0.374644 0.374644 0.374644 0.374644 0.374644
Switzerland UK USA
0.374644 0.374644 0.374644
Australia Austria Belgium Canada Denmark
0.374644022 0.997670495 0.749485177 0.002244134 0.499958552
Finland France Germany Ireland Italy
0.750374065 0.002729909 0.249968231 0.499711882 0.250699502
Japan Netherlands New.Zealand Norway Sweden
0.124675342 0.749602699 0.375940378 0.875341821 0.875253817
Switzerland UK USA
0.499990005 0.375946785 0.002390639
I was instructed to use sapply after using split; that's why the only thing that occurred to me was using a for loop.

Better use sapply on the unique country names. Actually there's no need to split anything.
sapply(unique(strikes.df$country), function(x)
mean(strikes.df[strikes.df$country == x, "centralization"]))
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
But if you depend on using split as well, you may do:
sapply(split(strikes.df$centralization, strikes.df$country), mean)
# Australia Austria Belgium Canada Denmark Finland France
# 0.374644022 0.997670495 0.749485177 0.002244134 0.499958552 0.750374065 0.002729909
# Germany Ireland Italy Japan Netherlands New.Zealand Norway
# 0.249968231 0.499711882 0.250699502 0.124675342 0.749602699 0.375940378 0.875341821
# Sweden Switzerland UK USA
# 0.875253817 0.499990005 0.375946785 0.002390639
Or write it in two lines:
s <- split(strikes.df$centralization, strikes.df$country)
sapply(s, mean)
Edit
If splitting the whole data frame is required, do
s <- split(strikes.df, strikes.df$country)
sapply(s, function(x) mean(x[, "centralization"]))
or
foo <- function(x) mean(x[, "centralization"])
sapply(s, foo)
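As for why the original my.fun returned 0.374644 for every country, there seem to be two causes: the function never uses its argument x (it always reads the global strikes.by.country), and return() inside the for loop exits during the first iteration, so only the first country's mean is ever computed. sapply already does the looping, so the function only has to handle one data frame. A minimal sketch with a small made-up data frame (toy.df and its values are invented for illustration, they are not the strikes data):

```r
# Toy data standing in for strikes.df (values are made up for illustration)
toy.df <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  centralization = c(1, 3, 10, 20, 5)
)
toy.by.country <- split(toy.df, toy.df$country)

# Buggy pattern: ignores x and returns during the first loop iteration
bad.fun <- function(x = toy.by.country) {
  for (i in seq_along(toy.by.country)) {
    return(mean(toy.by.country[[i]]$centralization))
  }
}
bad <- sapply(toy.by.country, bad.fun)    # every element is the mean for "A"

# Fixed: sapply passes one data frame at a time as x, so no loop is needed
good.fun <- function(x) mean(x$centralization)
good <- sapply(toy.by.country, good.fun)  # one mean per country
```

Here bad repeats the mean of group "A" three times, while good holds a separate mean for each group.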

Using the gapminder::gapminder dataset as example data this can be achieved like so:
The example code computes mean life expectancy (lifeExp) by continent.
# sapply: simplifies. returns a vector
sapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> Africa Americas Asia Europe Oceania
#> 48.86533 64.65874 60.06490 71.90369 74.32621
# lapply: returns a list
lapply(split(gapminder::gapminder, gapminder::gapminder$continent), function(x) mean(x$lifeExp, na.rm = TRUE))
#> $Africa
#> [1] 48.86533
#>
#> $Americas
#> [1] 64.65874
#>
#> $Asia
#> [1] 60.0649
#>
#> $Europe
#> [1] 71.90369
#>
#> $Oceania
#> [1] 74.32621

Related

Build identity matrix from dataframe (sparsematrix) in R

I am trying to create an identity matrix from a dataframe. The dataframe is like so:
i<-c("South Korea", "South Korea", "France", "France","France")
j <-c("Rwanda", "France", "Rwanda", "South Korea","France")
distance <-c(10844.6822,9384,6003,9384,0)
dis_matrix<-data.frame(i,j,distance)
dis_matrix
i j distance
1 South Korea South Korea 0.0000
2 South Korea Rwanda 10844.6822
3 South Korea France 9384.1793
4 France Rwanda 6003.3498
5 France South Korea 9384.1793
6 France France 0.0000
I am trying to create a matrix that will look like this:
South Korea France Rwanda
South Korea 0 9384.1793 10844.6822
France 9384.1793 0 6003.3498
Rwanda 10844.6822 6003.3498 0
I have tried using sparseMatrix from the Matrix package as described here (Create sparse matrix from data frame).
The issue is that i and j have to be integers, and I have character strings. I am unable to find another function that does what I am looking for. I would appreciate any help. Thank you.
A possible solution:
tidyr::pivot_wider(dis_matrix, id_cols = i, names_from = j,
values_from = distance, values_fill = 0)
#> # A tibble: 2 × 4
#> i Rwanda France `South Korea`
#> <chr> <dbl> <dbl> <dbl>
#> 1 South Korea 10845. 9384 0
#> 2 France 6003 0 9384
You can use igraph::get.adjacency to create the desired matrix. You can also create a sparse matrix with sparse = TRUE.
library(igraph)
g <- graph.data.frame(dis_matrix, directed = FALSE)
get.adjacency(g, attr="distance", sparse = FALSE)
South Korea France Rwanda
South Korea 0.00 9384 10844.68
France 9384.00 0 6003.00
Rwanda 10844.68 6003 0.00
We may convert the first two columns to factor with levels specified as the unique values from both columns, and then use xtabs from base R
un1 <- unique(unlist(dis_matrix[1:2]))
dis_matrix[1:2] <- lapply(dis_matrix[1:2], factor, levels = un1)
xtabs(distance ~ i + j, dis_matrix)
-output
j
i South Korea France Rwanda
South Korea 0.00 9384.00 10844.68
France 9384.00 0.00 6003.00
Rwanda 0.00 0.00 0.00
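On the original sparseMatrix attempt: the character labels can be turned into integer indices with match() against one shared vector of country names, which is probably the smallest change to that approach. The sketch below assumes the Matrix package is installed; the names countries, sp and dense are just illustrative, and the pmax() step for mirroring entries assumes distances are never negative.

```r
library(Matrix)

i <- c("South Korea", "South Korea", "France", "France", "France")
j <- c("Rwanda", "France", "Rwanda", "South Korea", "France")
distance <- c(10844.6822, 9384, 6003, 9384, 0)

# One shared lookup table so row and column indices refer to the same countries
countries <- unique(c(i, j))

sp <- sparseMatrix(
  i = match(i, countries),   # character labels -> integer row indices
  j = match(j, countries),   # character labels -> integer column indices
  x = distance,
  dims = c(length(countries), length(countries)),
  dimnames = list(countries, countries)
)

# Dense copy; mirror the entries that were only given in one direction
# (e.g. France -> Rwanda but not Rwanda -> France)
dense <- as.matrix(sp)
dense <- pmax(dense, t(dense))
```

Unlike the xtabs result above, the Rwanda row is filled in here because the matrix is mirrored explicitly.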

Subtract two strings in dplyr row-wise for R dataframe

I have two columns and need a third that subtracts one from the other, using dplyr.
Very simple example for the sake of clarity. Split/separate approach not valid in my case.
x <- c("FRANCE","GERMANY","RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x,y)
cities
x y
1 FRANCE Paris FRANCE
2 GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA
Expected results:
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin
3 RUSSIA Moscow RUSSIA Moscow
What I've tried so far (to no avail):
This returns the same df, but new keeps the country instead of the city (the opposite of what I want):
cities %>% mutate(new = setdiff(x,y))
x y new
1 FRANCE Paris FRANCE FRANCE
2 GERMANY Berlin GERMANY GERMANY
3 RUSSIA Moscow RUSSIA RUSSIA
Conversely, setdiff in reverse order just returns y unchanged:
cities %>% mutate(new = setdiff(y,x))
x y new
1 FRANCE Paris FRANCE Paris FRANCE
2 GERMANY Berlin GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA Moscow RUSSIA
Using gsub to remove the country worked only for the first row and issued a warning:
cities %>% mutate(new = gsub(x,"",y))
Warning message:
In gsub(x, "", y) :
argument 'pattern' has length > 1 and only the first element will be used
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin GERMANY
3 RUSSIA Moscow RUSSIA Moscow RUSSIA
We can use stringr::str_replace:
library(tidyverse)
cities %>%
mutate_if(is.factor, as.character) %>%
mutate(new = trimws(str_replace(y, x, "")))
# x y new
#1 FRANCE Paris FRANCE Paris
#2 GERMANY Berlin GERMANY Berlin
#3 RUSSIA Moscow RUSSIA Moscow
Here is a solution with base R:
x <- c("FRANCE","GERMANY","RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x,y,stringsAsFactors = F)
cities$new = mapply(function(a,b)
{setdiff(strsplit(a,' ')[[1]],strsplit(b,' ')[[1]])}, cities$y, cities$x)
Output:
x y new
1 FRANCE Paris FRANCE Paris
2 GERMANY Berlin GERMANY Berlin
3 RUSSIA Moscow RUSSIA Moscow
Hope this helps!
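As for the gsub warning in the question: gsub and sub accept only a single pattern, so passing the whole x column means just "FRANCE" is used for every row. One way around that, sketched here in base R, is to call sub() once per row with mapply(). The argument fixed = TRUE is an assumption that the country names should be matched literally rather than as regular expressions.

```r
x <- c("FRANCE", "GERMANY", "RUSSIA")
y <- c("Paris FRANCE", "Berlin GERMANY", "Moscow RUSSIA")
cities <- data.frame(x, y, stringsAsFactors = FALSE)

# sub() takes one pattern at a time, so pair each pattern with its own string
cities$new <- mapply(
  function(pattern, string) trimws(sub(pattern, "", string, fixed = TRUE)),
  cities$x, cities$y,
  USE.NAMES = FALSE
)
```

The same mapply() call could also be placed inside dplyr's mutate() if a pipeline is preferred.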

Join 2 dataframes together if two columns match

I have 2 dataframes:
CountryPoints
From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1
and another dataframe with neighbouring/bordering countries:
From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy
I would like to add another column to CountryPoints called Neighbour (Y/N), depending on whether the From/To pair is found in the neighbouring/bordering countries dataframe. Is this somehow possible? It is a kind of join, but the result should be a boolean column.
The result should be:
From.country To.Country points Neighbour
Belgium Finland 4 Y
Belgium Germany 5 Y
Malta Italy 12 Y
Malta UK 1 N
In the question below it shows how you can merge but it doesn't show how you can add that extra boolean column
Two alternative approaches:
1) with base R:
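# compare the (From, To) pair as one key; matching each column
# separately would also flag pairs that never occur together in df2
idx <- paste(df1$From.country, df1$To.Country, sep = "\t") %in%
paste(df2$From.country, df2$To.Country, sep = "\t")
df1$Neighbour <- c('N','Y')[1 + idx]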
2) with data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[, Neighbour := 'N'][df2, on = .(From.country, To.Country), Neighbour := 'Y'][]
which both give (data.table-output shown):
From.country To.Country points Neighbour
1: Belgium Finland 4 Y
2: Belgium Germany 5 Y
3: Malta Italy 12 Y
4: Malta UK 1 N
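Since the question describes this as a kind of join, a left join with base R's merge() is another option: tag every row of the neighbour table first, then fill the non-matches. This is only a sketch; res is an illustrative name, and note that merge() may reorder rows by the key columns.

```r
df1 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta", "Malta"),
  To.Country   = c("Finland", "Germany", "Italy", "UK"),
  points       = c(4, 5, 12, 1)
)
df2 <- data.frame(
  From.country = c("Belgium", "Belgium", "Malta"),
  To.Country   = c("Finland", "Germany", "Italy")
)

# Tag every known neighbour pair, then left-join on both key columns
df2$Neighbour <- "Y"
res <- merge(df1, df2, by = c("From.country", "To.Country"), all.x = TRUE)

# Pairs without a match come back as NA; label them "N"
res$Neighbour[is.na(res$Neighbour)] <- "N"
```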
Borrowing the idea from this post:
df1$Neighbour <- duplicated(rbind(df2[, 1:2], df1[, 1:2]))[ -seq_len(nrow(df2)) ]
df1
# From.country To.Country points Neighbour
# 1 Belgium Finland 4 TRUE
# 2 Belgium Germany 5 TRUE
# 3 Malta Italy 12 TRUE
# 4 Malta UK 1 FALSE
What about something like this?
sortpaste <- function(x) paste0(sort(x), collapse = "_");
df1$Neighbour <- apply(df1[, 1:2], 1, sortpaste) %in% apply(df2[, 1:2], 1, sortpaste)
# From.country To.Country points Neighbour
#1 Belgium Finland 4 TRUE
#2 Belgium Germany 5 TRUE
#3 Malta Italy 12 TRUE
#4 Malta UK 1 FALSE
Sample data
df1 <- read.table(text =
"From.country To.Country points
Belgium Finland 4
Belgium Germany 5
Malta Italy 12
Malta UK 1", header = T)
df2 <- read.table(text =
"From.country To.Country
Belgium Finland
Belgium Germany
Malta Italy", header = T)

Summarize data using doBy package at region level

I have a dataset Data as below,
Region Country Market Price
EUROPE France France 30.4502
EUROPE Israel Israel 5.14110965
EUROPE France France 8.99665
APAC CHINA CHINA 2.6877232
APAC INDIA INDIA 60.9004
AFME SL SL 54.1729685
LA BRAZIL BRAZIL 56.8606917
EUROPE RUSSIA RUSSIA 11.6843732
APAC BURMA BURMA 63.5881232
AFME SA SA 115.0733685
I would like to summarize the data at Region level and get the sum of Price for every Region.
I want the output to look like this:
Data Output
Region Country Price
EUROPE France 30.4502
EUROPE Israel 5.14110965
EUROPE France 8.99665
EUROPE RUSSIA 11.6843732
Europe 56.27233285
APAC BURMA 63.5881232
APAC CHINA 2.6877232
APAC INDIA 60.9004
Apac 127.1762464
AFME BAHARAIN 54.1729685
AFME SA 115.0733685
AFME 169.246337
LA BRAZIL 56.8606917
LA 56.8606917
I have used the summaryBy function of the doBy package. I have tried the code below.
summaryBy
myfun1 <- function(x){c(s=sum(x))}
DB= summaryBy(Price ~ Region + Country, data=Data, FUN=myfun1)
Any help in this regard is very much appreciated.
You can do this by using dplyr to generate a summary table:
library(dplyr)
totals <- data %>% group_by(Region) %>% summarise(Country="",Price=sum(Price))
And then merging the summary with the rest of the data:
summary <- rbind(data[-3], totals)
Then you can sort by Region to put the summary with the region:
summary <- summary %>% arrange(Region)
Output:
Region Country Price
1 AFME SL 54.1730
2 AFME SA 115.0734
3 AFME 169.2463
4 APAC CHINA 2.6877
5 APAC INDIA 60.9004
6 APAC BURMA 63.5881
7 APAC 127.1762
8 EUROPE France 30.4502
9 EUROPE Israel 5.1411
10 EUROPE France 8.9967
11 EUROPE RUSSIA 11.6844
12 EUROPE 56.2723
13 LA BRAZIL 56.8607
14 LA 56.8607
You have to split data by the Region factor and sum Price for each level:
lapply(split(data, data$Region), function(x) sum(x$Price))
Or, if you need to present result as you have shown:
totals = lapply(split(data, data$Region), function(x) rbind(x,data.frame(Region=unique(x$Region), Country="", Market="", Price=sum(x$Price))))
do.call(rbind, totals)
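If only the per-Region totals are needed, base R's aggregate() with a formula is a compact alternative. This is a sketch that rebuilds the question's data by hand (the Market column is left out) and does not use doBy; region.totals is an illustrative name.

```r
Data <- data.frame(
  Region  = c("EUROPE", "EUROPE", "EUROPE", "APAC", "APAC",
              "AFME", "LA", "EUROPE", "APAC", "AFME"),
  Country = c("France", "Israel", "France", "CHINA", "INDIA",
              "SL", "BRAZIL", "RUSSIA", "BURMA", "SA"),
  Price   = c(30.4502, 5.14110965, 8.99665, 2.6877232, 60.9004,
              54.1729685, 56.8606917, 11.6843732, 63.5881232, 115.0733685)
)

# One row per Region holding the summed Price, sorted by Region
region.totals <- aggregate(Price ~ Region, data = Data, FUN = sum)
```

The result has one row each for AFME, APAC, EUROPE and LA, matching the subtotal rows in the output shown above.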

Mean of time - hh:mm:ss - group by a variable

I need to calculate the mean of Time by Country. Time is a time variable in hh:mm:ss format.
This command with(df,tapply(as.numeric(times(df$Time)),Country,mean))
is not returning the mean in hh:mm:ss.
Country Time
1 Germany 2:26:21
2 Germany 2:19:19
3 Brazil 2:06:34
4 USA 2:06:17
5 Eth 2:18:58
6 Japan 2:08:35
7 Morocco 2:05:27
8 Germany 2:13:57
9 Romania 2:21:30
10 Spain 2:07:23
Output:
>with(df,tapply(as.numeric(times(df$Time)),Country,mean))
Andorra Australia Brazil Canada China
0.09334491 0.09634259 0.09578125 0.09634645 0.09481192
Eritrea Ethiopia France Germany Great Britain
0.09709491 0.09010031 0.10025463 0.09713349 0.09524306
Ireland Italy Japan Kenya Morocco
0.09593750 0.09520255 0.09579630 0.08934854 0.09400463
New Zeland Peru Poland Romania Russia
0.09664931 0.09809606 0.09638889 0.09875000 0.09327932
Spain Switzerland Uganda United States Zimbabwe
0.09314236 0.09620949 0.10068287 0.09399016 0.09892940
I see you've discovered the agony of working with date and time values in R...
Is this what you had in mind?
df$nTime <- difftime(strptime(df$Time,"%H:%M:%S"),
strptime("00:00:00","%H:%M:%S"),
units="secs")
df.means <- aggregate(df$nTime,by=list(df$Country),mean)
df.means$Time <- format(.POSIXct(df.means$x,tz="GMT"), "%H:%M:%S")
df.means
#   Group.1 x Time
# 1 Brazil 7594.000 02:06:34
# 2 Eth 8338.000 02:18:58
# 3 Germany 8392.333 02:19:52
# 4 Japan 7715.000 02:08:35
# 5 Morocco 7527.000 02:05:27
# 6 Romania 8490.000 02:21:30
# 7 Spain 7643.000 02:07:23
# 8 USA 7577.000 02:06:17
The first line adds a column nTime which is the time, in seconds, since midnight.
The second line calculates the means.
The third line converts back to H:M:S.
The problem you were having is that strptime(...), when forced to convert to numeric, returns the number of seconds between 1970-01-01 and the indicated time today. So, a really big number. This code just subtracts out the number of seconds between 1970-01-01 and 00:00:00 today.
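The same idea can be sketched without any date classes by converting hh:mm:ss to seconds by hand. The helpers to.seconds and to.hms are hypothetical names introduced here, and the result is rounded to the nearest second. The last line also assumes that the numbers in the question's output are fractions of a day, which is how chron's times class stores values, so multiplying by 86400 gives seconds.

```r
df <- data.frame(
  Country = c("Germany", "Germany", "Brazil", "USA", "Eth",
              "Japan", "Morocco", "Germany", "Romania", "Spain"),
  Time = c("2:26:21", "2:19:19", "2:06:34", "2:06:17", "2:18:58",
           "2:08:35", "2:05:27", "2:13:57", "2:21:30", "2:07:23"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "h:mm:ss" strings -> seconds
to.seconds <- function(t) {
  parts <- strsplit(t, ":", fixed = TRUE)
  sapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)))
}

# Hypothetical helper: seconds -> "hh:mm:ss", rounded to the nearest second
to.hms <- function(s) {
  s <- as.integer(round(s))
  sprintf("%02d:%02d:%02d", s %/% 3600L, (s %% 3600L) %/% 60L, s %% 60L)
}

# Mean time per country, in seconds and formatted
mean.secs <- tapply(to.seconds(df$Time), df$Country, mean)
mean.hms  <- setNames(to.hms(mean.secs), names(mean.secs))

# A day fraction such as the Germany value in the question's output
germany.from.fraction <- to.hms(0.09713349 * 86400)
```

With this sample, the Germany entry of mean.hms agrees with the 02:19:52 shown in the table above.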
Are you trying to do this -
dades$Time <- strptime(dades$Time,'%H:%M:%S')
by(dades$Time, dades$Country, mean)
If I didn't understand your question, can you please post sample output.
