I have the following DataFrame in R:
Y ... Price Year Quantity Country
010190 ... 4781 2021 4 Germany
010190 ... 367 2021 3 Germany
010190 ... 4781 2021 6 France
010190 ... 250 2021 3 France
020190 ... 690 2021 NA USA
020190 ... 10 2021 6 USA
...... ... .... .. ...
217834 ... 56 2021 3 USA
217834 ... 567 2021 9 USA
As you see, the numbers in the Y column start with 01..., 02..., 21... I want to aggregate such rows from 6 digits down to 2 digits, grouping by the categorical columns (e.g. Country and Year) and summing the numerical columns like Quantity and Price. I also want rows with NAs to be taken into account during the calculation. So, in the end, I want this kind of output:
Y Price Year Quantity Country
01 5148 2021 7 Germany
01 5031 2021 9 France
02 700 2021 6 USA
.. .... ... .... ...
21 623 2021 12 USA
You can use group_by and summarize from dplyr
library(dplyr)
df %>%
  # map each distinct Y code to a sequential group number, formatted as two digits
  mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = '%02d')) %>%
  group_by(Y, Year, Country) %>%
  summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups: Y, Year [3]
#> Y Year Country Price Quantity
#> <chr> <int> <chr> <int> <int>
#> 1 01 2021 France 5031 9
#> 2 01 2021 Germany 5148 7
#> 3 02 2021 USA 700 NA
#> 4 03 2021 USA 623 12
Update, per the request:
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(across(c(Price, Quantity), ~ sum(., na.rm = TRUE)))
We could use substr() to get the first two characters from Y, then group_by() and summarise() with sum():
library(dplyr)
df %>%
  mutate(Y = substr(Y, 1, 2)) %>%
  group_by(Y, Year, Country) %>%
  summarise(Price = sum(Price, na.rm = TRUE),
            Quantity = sum(Quantity, na.rm = TRUE))
Y Year Country Price Quantity
<chr> <dbl> <chr> <dbl> <dbl>
1 01 2021 France 5031 9
2 01 2021 Germany 5148 7
3 02 2021 USA 700 6
4 21 2021 USA 623 12
Using aggregate and the substring of Y.
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(Y, 1, 2)), sum)
# Y Year Country Quantity Price
# 1 10 2021 France 9 5031
# 2 10 2021 Germany 7 5148
# 3 20 2021 USA 7 700
# 4 21 2021 USA 12 623
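Note that Y shows up here as 10 and 20 rather than 01 and 02 because Y is stored as an integer in dat, so the leading zero of 010190 is lost before substr() runs. A minimal fix, if you want the two-digit prefixes back, is to pad Y to six digits first (a sketch against the same dat):
# pad Y to six digits so the leading zero survives, then take the first two characters
aggregate(cbind(Quantity, Price) ~ Y + Year + Country,
          transform(dat, Y = substr(sprintf("%06d", Y), 1, 2)), sum)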
Data:
dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L,
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...",
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L,
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L,
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla",
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany",
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA,
-8L))
I'm unsure how to structure my pivot_longer() call when I have both annual and monthly data. For example I have:
wide <- data.frame(region_name = character(), # Create empty data frame
total_population_2019 = numeric(),
total_population_2020 = numeric(),
mean_temperature_2019_1 = numeric(),
mean_temperature_2019_2 = numeric(),
mean_temperature_2020_1 = numeric(),
mean_temperature_2020_2 = numeric(),
stringsAsFactors = FALSE)
wide[1, ] <- list("funville", 50000, 51250, 26.3, 24.6, 25.7, 24.9)
region_name total_population_2019 total_population_2020 mean_temperature_2019_1 mean_temperature_2019_2 mean_temperature_2020_1 mean_temperature_2020_2
funville 50000 51250 26.3 24.6 25.7 24.9
I'm able to pivot on the monthly columns using spread:
long <- pivot_longer(wide, cols = 4:7, names_to = c("layer", "year", "month"),
                     names_pattern = "(.*)_(.*)_?_(.*)") %>%
  group_by(layer) %>%
  mutate(n = 1:n()) %>%
  spread(layer, value) %>%
  select(-n)
which gives
region_name total_population_2019 total_population_2020 year month mean_temperature
1 funville 50000 51250 2019 1 26.3
2 funville 50000 51250 2019 2 24.6
3 funville 50000 51250 2020 1 25.7
4 funville 50000 51250 2020 2 24.9
I'd like to now have a population column where the values are attributed for each row/month that falls in that year, ideally would look like:
desired.df <- data.frame(region_name = c("funville", "funville", "funville", "funville"),
year = c("2019", "2019", "2020", "2020"),
month = c("1", "2", "1", "2"),
population = c("50000", "50000", "51250", "51250"),
mean_temperature = c("26.3", "24.6", "25.7", "24.9"))
which gives
region_name year month population mean_temperature
1 funville 2019 1 50000 26.3
2 funville 2019 2 50000 24.6
3 funville 2020 1 51250 25.7
4 funville 2020 2 51250 24.9
Does anyone have a solution? Thanks in advance
One option would be to use the names_pattern argument and the special .value token. To make this work, I first add a helper month suffix to your population columns. Additionally, I use tidyr::fill() to fill up the population column:
library(dplyr)
library(tidyr)
wide |>
  rename_with(~ paste(.x, 1, sep = "_"), starts_with("total")) |>
  pivot_longer(-region_name,
               names_to = c(".value", "year", "month"),
               names_pattern = "^(.*?)_(\\d+)_(\\d+)$") |>
  group_by(year) |>
  fill(total_population) |>
  arrange(year)
#> # A tibble: 4 × 5
#> # Groups: year [2]
#> region_name year month total_population mean_temperature
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 funville 2019 1 50000 26.3
#> 2 funville 2019 2 50000 24.6
#> 3 funville 2020 1 51250 25.7
#> 4 funville 2020 2 51250 24.9
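An alternative sketch, not required by the answer above: pivot the population and temperature columns separately, then join on region and year. This assumes the same wide data frame and column names as in the question:
library(dplyr)
library(tidyr)

# long table of yearly populations
pop <- wide |>
  pivot_longer(starts_with("total_population"),
               names_to = "year", names_prefix = "total_population_",
               values_to = "population") |>
  select(region_name, year, population)

# long table of monthly temperatures
temp <- wide |>
  pivot_longer(starts_with("mean_temperature"),
               names_to = c("year", "month"),
               names_pattern = "mean_temperature_(\\d+)_(\\d+)",
               values_to = "mean_temperature") |>
  select(region_name, year, month, mean_temperature)

# attach the matching year's population to every month
left_join(temp, pop, by = c("region_name", "year"))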
I want to calculate the weighted variance using the weights provided in the dataset, grouping by country and capital city; however, the function returns NAs:
library(Hmisc) #for the 'wtd.var' function
weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(across(starts_with("winter"), wtd.var))
The provided output from the console (when in long format):
# A tibble: 35 x 3
# Groups: country [35]
country capital_city winter
<chr> <chr> <dbl>
1 ALBANIA Tirane NA
2 AUSTRIA Vienna NA
3 BELGIUM Brussels NA
4 BULGARIA Sofia NA
5 CROATIA Zagreb NA
6 CYPRUS Nicosia NA
7 CZECHIA Prague NA
8 DENMARK Copenhagen NA
9 ESTONIA Tallinn NA
10 FINLAND Helsinki NA
# … with 25 more rows
This is the code that I used to get the data from a wide format into a long format:
# pivot everything except columns 31:33 (country, capital_city, weight) into long format,
# then drop the generated "name" column and rename the value column to "winter"
weather_winter <- weather_winter %>% pivot_longer(-c(31:33))
weather_winter$name <- NULL
names(weather_winter)[4] <- "winter"
Some example data:
structure(list(`dec-wet_2011` = c(12.6199998855591, 12.6099996566772,
14.75, 11.6899995803833, 18.2899990081787), `dec-wet_2012` = c(13.6300001144409,
14.2199993133545, 14.2299995422363, 16.1000003814697, 18.0299987792969
), `dec-wet_2013` = c(4.67999982833862, 5.17000007629395, 4.86999988555908,
7.56999969482422, 5.96000003814697), `dec-wet_2014` = c(14.2999992370605,
14.4799995422363, 13.9799995422363, 15.1499996185303, 16.1599998474121
), `dec-wet_2015` = c(0.429999977350235, 0.329999983310699, 1.92999994754791,
3.30999994277954, 7.42999982833862), `dec-wet_2016` = c(1.75,
1.29999995231628, 3.25999999046326, 6.60999965667725, 8.67999935150146
), `dec-wet_2017` = c(13.3400001525879, 13.3499994277954, 15.960000038147,
10.6599998474121, 14.4699993133545), `dec-wet_2018` = c(12.210000038147,
12.4399995803833, 11.1799993515015, 10.75, 18.6299991607666),
`dec-wet_2019` = c(12.7199993133545, 13.3800001144409, 13.9899997711182,
10.5299997329712, 12.3099994659424), `dec-wet_2020` = c(15.539999961853,
16.5200004577637, 11.1799993515015, 14.7299995422363, 13.5499992370605
), `jan-wet_2011` = c(8.01999950408936, 7.83999967575073,
10.2199993133545, 13.8899993896484, 14.5299997329712), `jan-wet_2012` = c(11.5999994277954,
11.1300001144409, 12.5500001907349, 10.1700000762939, 22.6199989318848
), `jan-wet_2013` = c(17.5, 17.4099998474121, 15.5599994659424,
13.3199996948242, 20.9099998474121), `jan-wet_2014` = c(12.5099992752075,
12.2299995422363, 15.210000038147, 9.73999977111816, 9.63000011444092
), `jan-wet_2015` = c(17.6900005340576, 16.9799995422363,
11.75, 9.9399995803833, 19), `jan-wet_2016` = c(15.6099996566772,
15.5, 14.5099992752075, 10.3899993896484, 18.4499988555908
), `jan-wet_2017` = c(9.17000007629395, 9.61999988555908,
9.30999946594238, 15.8499994277954, 11.210000038147), `jan-wet_2018` = c(8.55999946594238,
9.10999965667725, 13.2599992752075, 9.85999965667725, 15.8899993896484
), `jan-wet_2019` = c(17.0699996948242, 16.8699989318848,
14.5699996948242, 19.0100002288818, 19.4699993133545), `jan-wet_2020` = c(6.75999975204468,
6.25999975204468, 6.00999975204468, 5.35999965667725, 8.15999984741211
), `feb-wet_2011` = c(9.1899995803833, 8.63999938964844,
6.21999979019165, 9.82999992370605, 4.67999982833862), `feb-wet_2012` = c(12.2699995040894,
11.6899995803833, 8.27999973297119, 14.9399995803833, 13.0499992370605
), `feb-wet_2013` = c(15.3599996566772, 15.9099998474121,
17.0599994659424, 13.3599996566772, 16.75), `feb-wet_2014` = c(10.1999998092651,
11.1399993896484, 13.8599996566772, 10.7399997711182, 7.35999965667725
), `feb-wet_2015` = c(11.9200000762939, 12.2699995040894,
8.01000022888184, 14.5299997329712, 5.71999979019165), `feb-wet_2016` = c(14.6999998092651,
14.7799997329712, 16.7899990081787, 4.90000009536743, 19.3500003814697
), `feb-wet_2017` = c(8.98999977111816, 9.17999935150146,
11.7699995040894, 6.3899998664856, 13.9899997711182), `feb-wet_2018` = c(16.75,
16.8599987030029, 12.0599994659424, 16.1900005340576, 8.51000022888184
), `feb-wet_2019` = c(7.58999967575073, 7.26999998092651,
8.21000003814697, 7.57999992370605, 8.81999969482422), `feb-wet_2020` = c(10.6399993896484,
10.4399995803833, 13.4399995803833, 8.53999996185303, 19.939998626709
), country = c("SERBIA", "SERBIA", "SLOVENIA", "GREECE",
"CZECHIA"), capital_city = c("Belgrade", "Belgrade", "Ljubljana",
"Athens", "Prague"), weight = c(20.25, 19.75, 14.25, 23.75,
14.25)), row.names = c(76L, 75L, 83L, 16L, 5L), class = "data.frame")
Your code seems to provide the right answer now that there is more data:
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 27.2
2 GREECE Athens 14.6
3 SERBIA Belgrade 19.1
4 SLOVENIA Ljubljana 16.3
Is this what you were looking for?
I took the liberty of streamlining your code:
weather_winter <- weather_winter %>%
  pivot_longer(-c(31:33), values_to = "winter") %>%
  select(-name)

weather_winter.std <- weather_winter %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter))
With only one "winter" column, there's no need for the across().
Finally, you are not using the weights. If these are needed, then change the last line to:
summarise(winter = wtd.var(winter, weights = weight))
To give:
# A tibble: 4 x 3
# Groups: country [4]
country capital_city winter
<chr> <chr> <dbl>
1 CZECHIA Prague 26.3
2 GREECE Athens 14.2
3 SERBIA Belgrade 18.8
4 SLOVENIA Ljubljana 15.8
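For completeness, a sketch of the whole pipeline with the weights included, assuming the example data above is stored in weather_winter and Hmisc is installed:
library(Hmisc)
library(dplyr)
library(tidyr)

weather_winter.std <- weather_winter %>%
  # columns 31:33 are country, capital_city and weight
  pivot_longer(-c(country, capital_city, weight), values_to = "winter") %>%
  select(-name) %>%
  group_by(country, capital_city) %>%
  summarise(winter = wtd.var(winter, weights = weight), .groups = "drop")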
I'm sorry if this question has already been answered, but I don't really know how to phrase my question.
I have a data frame structured in this way:
country   year   score
France    2020   10
France    2019   9
Germany   2020   15
Germany   2019   14
I would like to have a new column called previous_year_score that looks up in the data frame the "score" of the same country for "year - 1". In this case, France 2020 would have a previous_year_score of 9, while France 2019 would have NA.
You can use match() for this. I imagine there are plenty of other solutions too.
Data:
df <- structure(list(country = c("France", "France", "Germany", "Germany"
), year = c(2020L, 2019L, 2020L, 2019L), score = c(10L, 9L, 15L,
14L), prev_score = c(9L, NA, 14L, NA)), row.names = c(NA, -4L
), class = "data.frame")
Solution:
# for each row, find the index of the row with the same country and the previous year;
# match() returns NA when no such row exists
i <- match(paste(df[[1]], df[[2]] - 1), paste(df[[1]], df[[2]]))
df$prev_score <- df[i, 3]
You can use the following solution:
library(dplyr)
df %>%
  group_by(country) %>%
  arrange(year) %>%
  mutate(prev_val = ifelse(year - lag(year) == 1, lag(score), NA))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 Germany 2019 14 NA
3 France 2020 10 9
4 Germany 2020 15 14
Using case_when
library(dplyr)
df1 %>%
  arrange(country, year) %>%
  group_by(country) %>%
  mutate(prev_val = case_when(year - lag(year) == 1 ~ lag(score)))
# A tibble: 4 x 4
# Groups: country [2]
country year score prev_val
<chr> <int> <int> <int>
1 France 2019 9 NA
2 France 2020 10 9
3 Germany 2019 14 NA
4 Germany 2020 15 14
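A data.table variant of the same idea (just a sketch, assuming df is the data frame from the match() answer): order by year within country, then compare against shift():
library(data.table)
setDT(df)
setorder(df, country, year)
# previous year's score, only when the prior row is exactly one year earlier
df[, prev_score := fifelse(year - shift(year) == 1L, shift(score), NA_integer_), by = country]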
So this is my data frame. Country1 represents the country people currently live in (e.g. Germany) and Country2 represents the country they lived in 5 years before moving to Country1.
Country1   Country2   Weight   obs
Germany    Germany    4        1
Germany    Germany    119      2
France     Germany    3        3
France     Germany    2        4
Italy      France     1        5
Basically, what I want is to sum the Weight column for each combination and then multiply that sum by the observation number in the obs column. For example, the first row has the combination Germany to Germany, so I want to sum the weights in the Weight column (119 + 4 = 123) and then multiply the result by the obs value of that row (123 * 1 = 123). For the second row it is the same: the summed weight for Germany is (119 + 4 = 123), and this result is multiplied by the obs of that row (123 * 2 = 246). In the third row the sum of weights is (3 + 2 = 5), which is then multiplied by the obs for that row (5 * 3 = 15), and so on.
The output that I want is represented by the column x and it would be something like this.
Country1   Country2   Weight   obs   x
Germany    Germany    4        1     123
Germany    Germany    119      2     246
France     Germany    3        3     15
France     Germany    2        4     20
Italy      France     1        5     5
Also, the formula I'm trying to apply is essentially x = obs * sum(Weight), where the sum runs over the rows of the same group.
You could also solve it as follows:
df1$x <- tapply(df1$Weight, df1$Country1, sum)[df1$Country1] * df1$obs
Country1 Country2 Weight obs x
1 Germany Germany 4 1 123
2 Germany Germany 119 2 246
3 France Germany 3 3 15
4 France Germany 2 4 20
5 Italy France 1 5 5
Try this:
library(dplyr)
#Code
new <- df %>%
  group_by(Country1) %>%
  mutate(x = sum(Weight) * obs)
Output:
# A tibble: 5 x 5
# Groups: Country1 [3]
Country1 Country2 Weight obs x
<chr> <chr> <int> <int> <int>
1 Germany Germany 4 1 123
2 Germany Germany 119 2 246
3 France Germany 3 3 15
4 France Germany 2 4 20
5 Italy France 1 5 5
Some data used:
#Data
df <- structure(list(Country1 = c("Germany", "Germany", "France", "France",
"Italy"), Country2 = c("Germany", "Germany", "Germany", "Germany",
"France"), Weight = c(4L, 119L, 3L, 2L, 1L), obs = 1:5), class = "data.frame", row.names = c(NA,
-5L))
We can use data.table methods
library(data.table)
setDT(df1)[, x := sum(Weight) * obs, by = Country1][]
-output
# Country1 Country2 Weight obs x
#1: Germany Germany 4 1 123
#2: Germany Germany 119 2 246
#3: France Germany 3 3 15
#4: France Germany 2 4 20
#5: Italy France 1 5 5
Or using base R with ave
df1$x <- with(df1, ave(Weight, Country1, FUN = sum) * obs)
data
df1 <- structure(list(Country1 = c("Germany", "Germany", "France", "France",
"Italy"), Country2 = c("Germany", "Germany", "Germany", "Germany",
"France"), Weight = c(4L, 119L, 3L, 2L, 1L), obs = 1:5),
class = "data.frame", row.names = c(NA,
-5L))
Below is a sample data set
area periodyear period employment date
01 2020 08 100 2020-08-01
01 2020 09 105 2020-09-01
01 2020 10 110 2020-10-01
02 2020 08 101 2020-08-01
02 2020 09 102 2020-09-01
02 2020 10 103 2020-10-01
The question is how I get R to return the last TWO rows for each area. I created the date column using the following code so that there is a single value (instead of periodyear and period) that a max can be found for.
library(lubridate)  # for ymd()
substate$date <- ymd(paste(substate$PERIODYEAR, substate$PERIOD, "1", sep = "-"))
I know how to find the max value of a column (date, in this instance) but am unclear how to create a data frame that looks like the one below:
area periodyear period employment date
01 2020 09 105 2020-09-01
01 2020 10 110 2020-10-01
02 2020 09 102 2020-09-01
02 2020 10 103 2020-10-01
The reason for wanting the last TWO is that one month is brand new data and the one before is revised. From here, I update a SQL database.
An option is slice_tail() after arranging by 'area' and the Date-converted 'date' (in case they are not already in order):
library(dplyr)
df1 %>%
  arrange(area, as.Date(date)) %>%
  group_by(area) %>%
  slice_tail(n = 2) %>%
  ungroup()
-output
# A tibble: 4 x 5
# area periodyear period employment date
# <chr> <int> <int> <int> <chr>
#1 01 2020 9 105 2020-09-01
#2 01 2020 10 110 2020-10-01
#3 02 2020 9 102 2020-09-01
#4 02 2020 10 103 2020-10-01
data
df1 <- structure(list(area = c("01", "01", "01", "02", "02", "02"),
periodyear = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
period = c(8L, 9L, 10L, 8L, 9L, 10L), employment = c(100L,
105L, 110L, 101L, 102L, 103L), date = c("2020-08-01", "2020-09-01",
"2020-10-01", "2020-08-01", "2020-09-01", "2020-10-01")),
row.names = c(NA,
-6L), class = "data.frame")
Maybe this:
library(dplyr)
#Code
df %>%
  arrange(area, date) %>%
  group_by(area) %>%
  filter(row_number() %in% 2:n())
Output:
# A tibble: 4 x 5
# Groups: area [2]
area periodyear period employment date
<int> <int> <int> <int> <date>
1 1 2020 9 105 2020-09-01
2 1 2020 10 110 2020-10-01
3 2 2020 9 102 2020-09-01
4 2 2020 10 103 2020-10-01
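If you prefer data.table, a similar sketch (assuming df1 from the first answer's data): order by area and date, then take the last two rows of each group with tail():
library(data.table)
# order within i, then keep the last two rows per area
setDT(df1)[order(area, as.Date(date)), tail(.SD, 2), by = area]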