I need to add a new column containing a specific calculation to a data frame.
Basically I need to calculate an indicator which is the sum of the past 5 observations (in column "value1"), multiplied by 100 and divided by column "value2" (this one not as a sum, just the single observation), for my sample data below.
Somewhat like this (it's not formal notation):
indicator = [sum of the past 5 value1 / value2] * 100
The indicator must be calculated by country.
If countries or dates are "mixed up" in the data frame, the formula needs to recognize and sum only the correct values, in the correct order.
If there is an NA value in Value1, the formula should ignore that row in the computation. E.g. 31/12, 1/01, 2/01, 3/01, 4/01 = NA, 05/01 --> the indicator for 06/01 will then take into account the past 5 valid observations: 31/12, 1/01, 2/01, 3/01, 05/01.
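To make that concrete with toy numbers (purely illustrative, not taken from the sample data below): if the last six Value1 observations for a country are 10, 20, 30, 40, NA, 50 and the current day's Value2 is 4, the indicator for that day should be (10 + 20 + 30 + 40 + 50) * 100 / 4 = 3750, because the NA row is skipped and the window reaches one day further back.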
Important -> only use base R
Example of the data frame (my actual data frame is more complex):
set.seed(1)
Country <- c(rep("USA", 10),rep("UK", 10), rep("China", 10))
Value1 <- sample(x = c(120, 340, 423), size = 30, replace = TRUE)
Value2 <- sample(x = c(1,3,5,6,9), size = 30, replace = TRUE)
date <- seq(as.POSIXct('2020/01/01'),
as.POSIXct('2020/01/30'),
by = "1 day")
df = data.frame(Country, Value1, Value2, date)
Thank you all very much in advance. This one has been very hard to crack :D
Since it has to be done group-wise but in base R, you could use the split-apply-bind method:
df2 <- do.call(rbind, lapply(split(df, df$Country), function(d) {
d <- d[order(d$date),]
d$computed <- 100 * d$Value1 / d$Value2
d$Result <- NA
for(i in 5:nrow(d)) d$Result[i] <- sum(tail(na.omit(d$computed[seq(i)]), 5))
d[!names(d) %in% "computed"]
}))
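# The next two lines restore the original row order: rbind() over the split list
# leaves rownames like "USA.1", so the digits after the dot are the original row index.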
rn <- sapply(strsplit(rownames(df2), "\\."), function(x) as.numeric(x[2]))
`rownames<-`(df2[rn,], NULL)
#> Country Value1 Value2 date Result
#> 1 USA 423 9 2020-01-01 NA
#> 2 USA 120 3 2020-01-02 NA
#> 3 USA 120 3 2020-01-03 NA
#> 4 USA 423 5 2020-01-04 NA
#> 5 USA 120 1 2020-01-05 33160.00
#> 6 USA 120 1 2020-01-06 40460.00
#> 7 USA 120 3 2020-01-07 40460.00
#> 8 USA 340 1 2020-01-08 70460.00
#> 9 USA 423 6 2020-01-09 69050.00
#> 10 USA 340 9 2020-01-10 60827.78
#> 11 UK 340 5 2020-01-11 NA
#> 12 UK 423 6 2020-01-12 NA
#> 13 UK 423 3 2020-01-13 NA
#> 14 UK 340 1 2020-01-14 NA
#> 15 UK 120 3 2020-01-15 65950.00
#> 16 UK 120 9 2020-01-16 60483.33
#> 17 UK 423 1 2020-01-17 95733.33
#> 18 UK 423 9 2020-01-18 86333.33
#> 19 UK 340 1 2020-01-19 86333.33
#> 20 UK 340 3 2020-01-20 93666.67
#> 21 China 340 1 2020-01-21 NA
#> 22 China 340 9 2020-01-22 NA
#> 23 China 423 3 2020-01-23 NA
#> 24 China 120 1 2020-01-24 NA
#> 25 China 340 9 2020-01-25 67655.56
#> 26 China 340 5 2020-01-26 40455.56
#> 27 China 120 5 2020-01-27 39077.78
#> 28 China 340 9 2020-01-28 28755.56
#> 29 China 340 9 2020-01-29 20533.33
#> 30 China 423 5 2020-01-30 25215.56
Created on 2022-06-08 by the reprex package (v2.0.1)
Here's an option - not sure if the calculation is as you intend:
split_df <- split(df, Country)
split_df <- lapply(split_df, function(x) {
x <- x[order(x$date),]
x$index <- nrow(x):1
x$indicator <- ifelse(x$index <= 5, sum(x$Value2[x$index <= 5]) * 100 / x$Value2, NA)
x$index <- NULL
return(x)
})
final_df <- do.call(rbind, split_df)
Country Value1 Value2 date indicator
China.21 China 120 3 2020-01-21 NA
China.22 China 423 5 2020-01-22 NA
China.23 China 340 6 2020-01-23 NA
China.24 China 120 3 2020-01-24 NA
China.25 China 340 9 2020-01-25 NA
China.26 China 423 6 2020-01-26 366.6667
China.27 China 120 3 2020-01-27 733.3333
China.28 China 340 3 2020-01-28 733.3333
China.29 China 120 5 2020-01-29 440.0000
China.30 China 340 5 2020-01-30 440.0000
UK.11 UK 423 1 2020-01-11 NA
UK.12 UK 340 6 2020-01-12 NA
UK.13 UK 423 1 2020-01-13 NA
UK.14 UK 423 5 2020-01-14 NA
UK.15 UK 340 6 2020-01-15 NA
UK.16 UK 340 1 2020-01-16 2400.0000
UK.17 UK 120 5 2020-01-17 480.0000
UK.18 UK 423 9 2020-01-18 266.6667
UK.19 UK 120 6 2020-01-19 400.0000
UK.20 UK 423 3 2020-01-20 800.0000
USA.1 USA 423 1 2020-01-01 NA
USA.2 USA 423 5 2020-01-02 NA
USA.3 USA 423 5 2020-01-03 NA
USA.4 USA 423 6 2020-01-04 NA
USA.5 USA 423 1 2020-01-05 NA
USA.6 USA 340 5 2020-01-06 600.0000
USA.7 USA 340 5 2020-01-07 600.0000
USA.8 USA 423 6 2020-01-08 500.0000
USA.9 USA 423 5 2020-01-09 600.0000
USA.10 USA 423 9 2020-01-10 333.3333
In base R you could do:
transform(df, Results = ave(Value1, Country,
                            FUN = function(x) replace(x, !is.na(x),
                                  filter(na.omit(x), rep(1, 5), sides = 1))) / Value2)
Country Value1 Value2 date Results
1 USA 120 1 2020-01-01 NA
2 USA 423 6 2020-01-02 NA
3 USA 120 1 2020-01-03 NA
4 USA 340 6 2020-01-04 NA
5 USA 120 5 2020-01-05 224.6000
6 USA 423 3 2020-01-06 475.3333
7 USA 423 3 2020-01-07 475.3333
8 USA 340 6 2020-01-08 274.3333
9 USA 340 6 2020-01-09 274.3333
10 USA 423 6 2020-01-10 324.8333
11 UK 423 3 2020-01-11 NA
12 UK 120 6 2020-01-12 NA
13 UK 120 1 2020-01-13 NA
14 UK 120 1 2020-01-14 NA
15 UK 340 6 2020-01-15 187.1667
16 UK 340 1 2020-01-16 1040.0000
17 UK 340 3 2020-01-17 420.0000
18 UK 340 5 2020-01-18 296.0000
19 UK 423 3 2020-01-19 594.3333
20 UK 120 3 2020-01-20 521.0000
21 China 423 9 2020-01-21 NA
22 China 120 3 2020-01-22 NA
23 China 120 1 2020-01-23 NA
24 China 120 5 2020-01-24 NA
25 China 120 5 2020-01-25 180.6000
26 China 340 6 2020-01-26 136.6667
27 China 120 5 2020-01-27 164.0000
28 China 120 1 2020-01-28 820.0000
29 China 340 6 2020-01-29 173.3333
30 China 340 9 2020-01-30 140.0000
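In case the filter() call above looks opaque: in a plain base R session it is stats::filter(), and with a kernel of five 1s and sides = 1 it returns a trailing (one-sided) rolling sum. A minimal standalone sketch on a toy vector (not the data above):
x <- c(2, 4, 6, 8, 10, 12)
stats::filter(x, rep(1, 5), sides = 1)
# the first four positions are NA (no complete 5-observation window yet);
# position 5 is 2 + 4 + 6 + 8 + 10 = 30 and position 6 is 4 + 6 + 8 + 10 + 12 = 40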
I've been trying to learn the most basic items first and then expand the complexity. So for this one, how would I modify the last line so that it creates a rolling 12-month average for each seriescode? In this case, it would produce an average of roughly 8 for seriescode 100 and roughly 27 for seriescode 110.
First, here is the sample data:
Monthx<- c(201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011,201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
Manipulations
library(dplyr)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA))
You would just need to add a group_by(seriescode), which will then perform the mutate() functions per seriescode:
Monthx<- c(201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011,201911,201912,20201
,20202,20203,20204,20205,20206,20207
,20208,20209,202010,202011)
empx <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,21,22,23,24,25,26,27,28,29,20,31,32,33)
seriescode<-c(100,100,100,100,100,100,100,100,100,100,100,100,100,110,110,110,110,110,110,110,110,110,110,110,110,110)
ces12x <- data.frame(Monthx,empx,seriescode)
ces12x<- ces12x %>% mutate(year = substr(as.numeric(Monthx),1,4),
month = substr(as.numeric(Monthx),5,7),
date = as.Date(paste(year,month,"1",sep ="-")))
Month_ord <- order(Monthx)
ces12x<-ces12x %>% group_by(seriescode) %>% mutate(ravg = zoo::rollmeanr(empx, 12, fill = NA)) # add the group_by(seriescode)
This produces the output:
# A tibble: 26 x 7
# Groups: seriescode [2]
Monthx empx seriescode year month date ravg
<dbl> <dbl> <dbl> <chr> <chr> <date> <dbl>
1 201911 1 100 2019 11 2019-11-01 NA
2 201912 2 100 2019 12 2019-12-01 NA
3 20201 3 100 2020 1 2020-01-01 NA
4 20202 4 100 2020 2 2020-02-01 NA
5 20203 5 100 2020 3 2020-03-01 NA
6 20204 6 100 2020 4 2020-04-01 NA
7 20205 7 100 2020 5 2020-05-01 NA
8 20206 8 100 2020 6 2020-06-01 NA
9 20207 9 100 2020 7 2020-07-01 NA
10 20208 10 100 2020 8 2020-08-01 NA
11 20209 11 100 2020 9 2020-09-01 NA
12 202010 12 100 2020 10 2020-10-01 6.5
13 202011 13 100 2020 11 2020-11-01 7.5
14 201911 21 110 2019 11 2019-11-01 NA
15 201912 22 110 2019 12 2019-12-01 NA
16 20201 23 110 2020 1 2020-01-01 NA
17 20202 24 110 2020 2 2020-02-01 NA
18 20203 25 110 2020 3 2020-03-01 NA
19 20204 26 110 2020 4 2020-04-01 NA
20 20205 27 110 2020 5 2020-05-01 NA
21 20206 28 110 2020 6 2020-06-01 NA
22 20207 29 110 2020 7 2020-07-01 NA
23 20208 20 110 2020 8 2020-08-01 NA
24 20209 31 110 2020 9 2020-09-01 NA
25 202010 32 110 2020 10 2020-10-01 25.7
26 202011 33 110 2020 11 2020-11-01 26.7
If you want to continue using the tidyverse for this, the following should do the trick:
library(dplyr)
ces12x %>%
group_by(seriescode) %>%
arrange(date) %>%
slice(tail(row_number(), 12)) %>%
summarize(ravg = mean(empx))
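Should you ever need the same last-12-observation average without dplyr, a rough base R sketch (assuming ces12x already has the date column built as above) is:
ces12x_sorted <- ces12x[order(ces12x$date), ]
# mean of the last 12 rows per seriescode, taken in date order
aggregate(empx ~ seriescode, data = ces12x_sorted,
          FUN = function(v) mean(tail(v, 12)))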
I have just begun coding in R. I am trying to manipulate data, but I have the following issue:
I have 2 different tables (simplified).
The first one (player_df) is as follows:
name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack
...
The second table is the salary by club and experience, in millions (salary_df):
experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18
...
I would like to add a new column to the first table, named, let's say, salary_estimation, which takes two variables into account, here experience and the club.
For example, for "luc", who plays for "FCB" and has "2" years of experience, the output should be "2.5".
In Excel this would be an INDEX/MATCH formula, but in R I don't know which function I should use.
How should I approach the problem?
Data:
df1 <- read.table(text = 'name experience Club age Position
luc 2 FCB 18 Goalkeeper
jean 9 Real 26 midfielder
ronaldo 14 FCB 32 Goalkeeper
jean 9 Real 26 midfielder
messi 11 Liverpool 35 midfielder
tevez 6 Chelsea 27 Attack
inzaghi 9 Juve 34 Defender
kwfni 17 Bayern 40 Attack
Blabla 9 Real 25 midfielder
wdfood 11 Liverpool 33 midfielder
player2 7 Chelsea 28 Attack
player3 10 Juve 34 Defender
fgh 17 Bayern 40 Attack', header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = 'experience FCB BAYERN Juve Real Chelsea
1 1.5 1.3 1 4 3
2 2.5 2 2.4 5 4
3 3.4 3.1 3.5 6.3 5
4 5 4.5 6.7 9 6
5 7.1 6.9 9 12 7
6 9 8 10 15 10
7 10 9 12 16 15
8 14 12 13 19 16
9 14.5 17 15 20 17
10 15 19 17 23 18', header = TRUE, stringsAsFactors = FALSE)
Code:
library('data.table')
setDT(df2)[, Chelsea := as.numeric(Chelsea)]
df2 <- melt(df2, id.vars = "experience", variable.name = "Club", value.name = "Salary" )
df2[df1, on = c("experience", "Club"), nomatch = NA]
Output:
# experience Club Salary name age Position
# 1: 2 FCB 2.5 luc 18 Goalkeeper
# 2: 9 Real 20.0 jean 26 midfielder
# 3: 14 FCB NA ronaldo 32 Goalkeeper
# 4: 9 Real 20.0 jean 26 midfielder
# 5: 11 Liverpool NA messi 35 midfielder
# 6: 6 Chelsea 10.0 tevez 27 Attack
# 7: 9 Juve 15.0 inzaghi 34 Defender
# 8: 17 Bayern NA kwfni 40 Attack
# 9: 9 Real 20.0 Blabla 25 midfielder
# 10: 11 Liverpool NA wdfood 33 midfielder
# 11: 7 Chelsea 15.0 player2 28 Attack
# 12: 10 Juve 17.0 player3 34 Defender
# 13: 17 Bayern NA fgh 40 Attack
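For reference, the closest base R analogue of Excel's INDEX/MATCH is match() combined with matrix indexing. A minimal sketch, using the wide df2 exactly as read in above (i.e. before the melt step); spelling mismatches (Bayern vs BAYERN) and clubs missing from the salary table (Liverpool) come back as NA, just as in the output above:
row_idx <- match(df1$experience, df2$experience)   # locate the experience row
col_idx <- match(df1$Club, names(df2))             # locate the club column
df1$salary_estimation <- as.matrix(df2)[cbind(row_idx, col_idx)]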
One possible solution is to join the first table (let's say it is player_df) with a "long format" version of the second table, salary_df, using experience and Club as keys. You can do this with the tidyverse package.
library(tidyverse)
player_df %>%
mutate(Club = str_to_title(Club)) %>%
left_join(
salary_df %>%
pivot_longer(-experience, names_to = "Club", values_to = "salary_estimation") %>%
mutate(Club = str_to_title(Club)) )
# Joining, by = c("experience", "Club")
# # A tibble: 13 x 6
# name experience Club age Position salary_estimation
# <chr> <dbl> <chr> <dbl> <chr> <dbl>
# 1 luc 2 Fcb 18 Goalkeeper 2.5
# 2 jean 9 Real 26 midfielder 20
# 3 ronaldo 14 Fcb 32 Goalkeeper NA
# 4 jean 9 Real 26 midfielder 20
# 5 messi 11 Liverpool 35 midfielder NA
# 6 tevez 6 Chelsea 27 Attack 10
# 7 inzaghi 9 Juve 34 Defender 15
# 8 kwfni 17 Bayern 40 Attack NA
# 9 Blabla 9 Real 25 midfielder 20
# 10 wdfood 11 Liverpool 33 midfielder NA
# 11 player2 7 Chelsea 28 Attack 15
# 12 player3 10 Juve 34 Defender 17
# 13 fgh 17 Bayern 40 Attack NA
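Note that the str_to_title() calls are only there to normalize the club spellings on both sides (e.g. BAYERN vs Bayern) so that the join keys line up; clubs that simply do not appear in salary_df, such as Liverpool, still come out as NA.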
I find it a bit hard to find the right words for what I'm trying to do.
Say I have this dataframe:
library(dplyr)
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Greece 2017 33
5 Hungary 2017 67
6 Italy 2017 38
7 Canada 2009 88
8 France 2009 91
9 Germany 2009 93
10 Greece 2009 NA
11 Hungary 2009 NA
12 Italy 2009 NA
Now I want to delete the rows that have NA values in 2009, but I also want to remove the 2017 rows for those same countries. I would like to get the following result:
# A tibble: 74 x 3
country year conf_perc
<chr> <dbl> <dbl>
1 Canada 2017 77
2 France 2017 45
3 Germany 2017 60
4 Canada 2009 88
5 France 2009 91
6 Germany 2009 93
We can use any() after grouping by 'country':
library(dplyr)
df1 %>%
group_by(country) %>%
filter(!any(is.na(conf_perc)))
# A tibble: 6 x 3
# Groups: country [3]
# country year conf_perc
# <chr> <int> <int>
#1 Canada 2017 77
#2 France 2017 45
#3 Germany 2017 60
#4 Canada 2009 88
#5 France 2009 91
#6 Germany 2009 93
A base R solution:
foo <- df$year == 2009 & is.na(df$conf_perc)
bar <- df$year == 2017 & df$country %in% unique(df$country[foo])
df[-c(which(foo), which(bar)), ]
# country year conf_perc
# 1 Canada 2017 77
# 2 France 2017 45
# 3 Germany 2017 60
# 7 Canada 2009 88
# 8 France 2009 91
# 9 Germany 2009 93
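Another base R route, mirroring the grouped any(is.na(...)) idea directly, is ave() (a sketch, using the same df as in the base R answer above):
keep <- !ave(is.na(df$conf_perc), df$country, FUN = any)   # TRUE for countries with no NA
df[keep, ]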
My df2:
League freq
18 England 108
27 Italy 79
20 Germany 74
43 Spain 64
19 France 49
39 Russia 34
31 Mexico 27
47 Turkey 24
32 Netherlands 23
37 Portugal 21
49 United States 18
29 Japan 16
25 Iran 15
7 Brazil 13
22 Greece 13
14 Costa 11
45 Switzerland 11
5 Belgium 10
17 Ecuador 10
23 Honduras 10
42 South Korea 9
2 Argentina 8
48 Ukraine 7
3 Australia 6
11 Chile 6
12 China 6
15 Croatia 6
35 Norway 6
41 Scotland 6
34 Nigeria 5
I tried to select Europe like this:
europe <- subset(df2, nrow(x=18, 27, 20) select=c(1, 2))
What is the most effective way to select Europe, Africa, Asia, etc. from df2?
You either need to identify which countries are on which continents by hand, or you might be able to scrape this information from somewhere:
(basic strategy from Scraping html tables into R data frames using the XML package)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_European_countries_by_area"
tables <- readHTMLTable(theurl)
library(stringr)
europe_names <- str_extract(as.character(tables[[1]]$Country),"[[:alpha:] ]+")
head(sort(europe_names))
## [1] "Albania" "Andorra" "Austria" "Azerbaijan" "Belarus"
## [6] "Belgium"
## there's also a 'Total' entry in here but it's probably harmless ...
subset(df2,League %in% europe_names)
Of course you'd have to figure this out again for Asia, America, etc.
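The "by hand" option mentioned at the start is just a hard-coded lookup vector; a sketch built from the European league names visible in df2 (where to draw the line for transcontinental countries like Russia or Turkey is up to you):
europe_by_hand <- c("England", "Italy", "Germany", "Spain", "France", "Russia",
                    "Turkey", "Netherlands", "Portugal", "Greece", "Switzerland",
                    "Belgium", "Ukraine", "Croatia", "Norway", "Scotland")
subset(df2, League %in% europe_by_hand)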
So here's a slightly different approach from @BenBolker's, using the countrycode package.
library(countrycode)
cdb <- countrycode_data # database of countries
df2[toupper(df2$League) %in% cdb[cdb$continent=="Europe",]$country.name,]
# League freq
# 27 Italy 79
# 20 Germany 74
# 43 Spain 64
# 19 France 49
# 32 Netherlands 23
# 37 Portugal 21
# 22 Greece 13
# 45 Switzerland 11
# 5 Belgium 10
# 48 Ukraine 7
# 15 Croatia 6
# 35 Norway 6
One problem you're going to have is that "England" is not a country in any database (rather, "United Kingdom"), so you'll have to deal with that as a special case.
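One simple way to handle that special case is to recode the league name before doing the continent lookup, e.g. into a hypothetical country_name column:
df2$country_name <- ifelse(df2$League == "England", "United Kingdom", df2$League)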
Also, this database considers the "Americas" as a continent.
df2[toupper(df2$League) %in% cdb[cdb$continent=="Americas",]$country.name,]
so to get just South America you have to use the region field:
df2[toupper(df2$League) %in% cdb[cdb$region=="South America",]$country.name,]
# League freq
# 7 Brazil 13
# 17 Ecuador 10
# 2 Argentina 8
# 11 Chile 6