How to fill values using a conditional row in R?

I have the following dataframe.
I want to add a column 'Constant Vol' in which every value is the 'Vol' from the row where the 'Year' column is 2006. The result should look like the following dataframe.

Using dplyr, we can group_by Seg and get the corresponding Vol where Year = 2006
library(dplyr)
df %>%
group_by(Seg) %>%
mutate(Constnt_Vol = Vol[Year == 2006])
# Seg Year Vol Constnt_Vol
# <fct> <int> <dbl> <dbl>
#1 Agri 2006 23 23
#2 Agri 2007 29 23
#3 Agri 2008 16 23
#4 Agri 2009 31 23
#5 Auto 2006 12 12
#6 Auto 2007 34 12
#7 Auto 2008 45 12
#8 Auto 2009 32 12
and in data.table that would be
library(data.table)
setDT(df)[, Constnt_Vol := Vol[Year == 2006], Seg]
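This assumes each Seg has exactly one row with Year = 2006. If there are several, we can use which.max to pick the first one: Vol[which.max(Year == 2006)]. Note that which.max returns 1 when no row matches, so a Seg without a 2006 row would silently get its first Vol.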
data
df <- data.frame(Seg = rep(c("Agri", "Auto"), each =4),
Year =2006:2009, Vol = c(23, 29, 16, 31, 12, 34, 45, 32))
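For comparison, here is a minimal base R sketch of the same lookup without dplyr or data.table. It assumes, as the answer above does, that each Seg has at most one 2006 row; the helper name ref is made up for illustration.

```r
# Same sample data as in the answer above.
df <- data.frame(Seg = rep(c("Agri", "Auto"), each = 4),
                 Year = 2006:2009,
                 Vol = c(23, 29, 16, 31, 12, 34, 45, 32))

# One reference row per Seg: the Vol recorded in 2006.
# (ref is a hypothetical helper name, not part of the original answer.)
ref <- df[df$Year == 2006, c("Seg", "Vol")]

# Look up each row's Seg in the reference table and copy that Vol.
df$Constnt_Vol <- ref$Vol[match(df$Seg, ref$Seg)]
```

With this approach a Seg that has no 2006 row gets NA rather than an error or a wrong value.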

We can use
library(dplyr)
df %>%
group_by(Seg) %>%
mutate(Constnt_Vol = Vol[match(2006, Year)])
data
df <- data.frame(Seg = rep(c("Agri", "Auto"), each =4),
Year =2006:2009, Vol = c(23, 29, 16, 31, 12, 34, 45, 32))

Related

Use lapply on condition in R

It's easier to explain what I want to do if you look at the code first, but essentially I think I want to use lapply on a condition, and I wasn't able to do it.
library("tidyverse")
names <- rep(c("City A", "City B"), each = 11)
year <- rep(c(2010:2020), times = 2)
col_1 <- c(1, 17, 34, 788, 3, 4, 78, 98, 650, 45, 20,
23, 45, 56, 877, 54, 12, 109, 167, 12, 19, 908)
col_2 <- c(3, 4, 23, 433, 2, 45, 34, 123, 98, 76, 342,
760, 123, 145, 892, 23, 5, 90, 40, 12, 67, 98)
df <- as.data.frame(cbind(names, year, col_1, col_2))
df <- df %>%
mutate(col_1 = as.numeric(col_1),
col_2 = as.numeric(col_2))
I want every numeric column in rows where the year is 2018 or later to be rounded to a multiple of three, using plyr::round_any with accuracy 3.
What I tried is this:
df_2018 <- df %>%
filter(year >= 2018)
df <- df %>%
filter(!(year >= 2018))
df_2018[, c(3:4)] <- lapply(df_2018[, c(3:4)], plyr::round_any, 3)
df <- rbind(df, df_2018)
In reality, there are about 50 numeric columns and tons of rows. What I tried works in theory, but I would like to achieve it with less and cleaner code.
I am new to using lapply, and I failed when trying to combine it with ifelse, because I don't want it to change my year column.
Thank you to everyone who takes the time out of their day to look at this :)
Using dplyr::across and if_else you could do:
library(dplyr)
df |>
mutate(across(-c(names, year), ~ if_else(year >= 2018, plyr::round_any(.x, 3), .x)))
#> names year col_1 col_2
#> 1 City A 2010 1 3
#> 2 City A 2011 17 4
#> 3 City A 2012 34 23
#> 4 City A 2013 788 433
#> 5 City A 2014 3 2
#> 6 City A 2015 4 45
#> 7 City A 2016 78 34
#> 8 City A 2017 98 123
#> 9 City A 2018 651 99
#> 10 City A 2019 45 75
#> 11 City A 2020 21 342
#> 12 City B 2010 23 760
#> 13 City B 2011 45 123
#> 14 City B 2012 56 145
#> 15 City B 2013 877 892
#> 16 City B 2014 54 23
#> 17 City B 2015 12 5
#> 18 City B 2016 109 90
#> 19 City B 2017 167 40
#> 20 City B 2018 12 12
#> 21 City B 2019 18 66
#> 22 City B 2020 909 99
Using data.table (round(x / 3) * 3 is what plyr::round_any(x, 3) computes):
library(data.table)
cols <- grep("^col_[0-9]+$", names(df), value = TRUE)
setDT(df)[year >= 2018, (cols) := round(.SD / 3) * 3, .SDcols = cols]
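The same row-conditional rounding can also be sketched in base R with a logical row index. This is only an illustration: it builds the data frame directly with data.frame() so that year stays numeric (an assumption that differs from the cbind() construction in the question), and the names num_cols and rows are made up here.

```r
# Sample data from the question, built so that year stays numeric.
df <- data.frame(
  names = rep(c("City A", "City B"), each = 11),
  year  = rep(2010:2020, times = 2),
  col_1 = c(1, 17, 34, 788, 3, 4, 78, 98, 650, 45, 20,
            23, 45, 56, 877, 54, 12, 109, 167, 12, 19, 908),
  col_2 = c(3, 4, 23, 433, 2, 45, 34, 123, 98, 76, 342,
            760, 123, 145, 892, 23, 5, 90, 40, 12, 67, 98)
)

# Every numeric column except year (hypothetical helper names).
num_cols <- setdiff(names(df)[sapply(df, is.numeric)], "year")
rows <- df$year >= 2018

# Round only the selected rows and columns to the nearest multiple of 3.
df[rows, num_cols] <- round(df[rows, num_cols] / 3) * 3
```

Because the columns are picked by type rather than by position, the same lines should carry over to a data frame with many more numeric columns.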

Creating a matrix from a three-column data frame in R

I have a data frame with three columns where each row is unique:
df1
# state val_1 season
# 1 NY 3 winter
# 2 NY 10 spring
# 3 NY 24 summer
# 4 BOS 14 winter
# 5 BOS 26 spring
# 6 BOS 19 summer
# 7 WASH 99 winter
# 8 WASH 66 spring
# 9 WASH 42 summer
I want to create a matrix with the state names for rows and the seasons for columns with val_1 as the values. I have previously used:
library(reshape2)
df <- acast(df1, state ~ season, value.var='val_1')
This used to create the desired matrix, with each state name appearing once. Recently, though, whenever I use acast or dcast it automatically defaults to the length function and gives 1's for the values. Can anyone recommend a solution?
data
state <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)
You can set the fun.aggregate argument explicitly.
library(reshape2)
acast(df1, state~season, value.var = 'val_1', fun.aggregate=sum)
# spring summer winter
# BOS 26 19 14
# NY 10 24 3
# WASH 66 42 99
This also works
library(reshape2)
state = c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1 = c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season = c('winter', 'spring', 'summer', 'winter', 'spring', 'summer', 'winter', 'spring', 'summer')
df1 = data.frame(state,
val_1,
season)
dcast(df1, state~season, value.var = 'val_1')
#> state spring summer winter
#> 1 BOS 26 19 14
#> 2 NY 10 24 3
#> 3 WASH 66 42 99
Created on 2022-04-08 by the reprex package (v2.0.1)
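As far as I know, reshape2 falls back to length when the casting formula does not identify each row uniquely, so it can help to check for duplicated state/season pairs first. The sketch below does that check and then builds the matrix with base R's tapply instead of acast; the names n_dup and m are made up for illustration.

```r
state  <- c('NY', 'NY', 'NY', 'BOS', 'BOS', 'BOS', 'WASH', 'WASH', 'WASH')
val_1  <- c(3, 10, 24, 14, 26, 19, 99, 66, 42)
season <- c('winter', 'spring', 'summer', 'winter', 'spring', 'summer',
            'winter', 'spring', 'summer')
df1 <- data.frame(state, val_1, season)

# 0 means every state/season pair occurs only once;
# otherwise this is the index of the first duplicated row.
n_dup <- anyDuplicated(df1[c("state", "season")])

# Rows are states, columns are seasons, cells hold val_1.
# sum only matters if a pair occurs more than once.
m <- with(df1, tapply(val_1, list(state, season), sum))
```

If n_dup is not 0 in the real data, the duplicated rows are the likely reason an aggregation function is being requested.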

Creating a new column in R conditioned on the values in a different column and different row

I'm trying to figure out how to create a new column in an R dataframe whose values are based on the values in another column, but in a different row. My data is as follows:
player <- c('Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce',
'Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce',
'Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce')
t <- c(3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1)
min_per_game <- c(30, 36, 34, 33, 31, 36, 34, 32, 29, 35, 32, 36)
pts_per_36_min <- c(19, 28, 27, 24, 22, 27, 25, 28, 23, 28, 29, 29)
df <- data.frame(player, t, min_per_game, pts_per_36_min)
What I want to do is create a new column called "pts_per_game" that will look at each row in the dataframe, examine the value in the 't' column, then find the row that has the same value in the 'player' column but a value in the 't' column that is smaller by 1, and then fill the new "pts_per_game" column using data from that row (specifically min_per_game/36 * pts_per_36_min).
So for example, in the first row of this dataframe, the value in the 'player' column is "Tim Duncan" and the value in the 't' column is 3. I want R to see that, find the row where player == "Tim Duncan" and t == 2, take the data from that row and compute (min_per_game/36) * pts_per_36_min, and then put the resulting value in the first dataframe row (where player is Tim Duncan and t is 3) in a new column called "pts_per_game". I want it to loop through the whole dataframe and do that for every row, with the understanding that rows with the lowest possible value of t (1, in this case) cannot have a "pts_per_game" value computed for them and should therefore receive NA. Can anyone help me figure out how to do this?
You may try using dplyr::lead
library(dplyr)
df %>%
arrange(player, desc(t)) %>%
group_by(player) %>%
mutate(pts_per_game = lead(min_per_game)/36 * lead(pts_per_36_min))
player t min_per_game pts_per_36_min pts_per_game
<chr> <dbl> <dbl> <dbl> <dbl>
1 Kobe Bryant 3 34 27 23.6
2 Kobe Bryant 2 34 25 25.8
3 Kobe Bryant 1 32 29 NA
4 Lebron James 3 36 28 27
5 Lebron James 2 36 27 27.2
6 Lebron James 1 35 28 NA
7 Paul Pierce 3 33 24 24.9
8 Paul Pierce 2 32 28 29
9 Paul Pierce 1 36 29 NA
10 Tim Duncan 3 30 19 18.9
11 Tim Duncan 2 31 22 18.5
12 Tim Duncan 1 29 23 NA
This also works (the pipe still requires dplyr to be loaded):
library(dplyr)
data.frame(player, t, min_per_game, pts_per_36_min) %>%
dplyr::arrange(player, desc(t)) %>%
dplyr::group_by(player) %>%
dplyr::mutate(pts_per_game = dplyr::lead(min_per_game)/36 * dplyr::lead(pts_per_36_min))
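The lead() approach relies on each player's t values being consecutive once sorted. A hedged alternative, sketched here in base R, is an explicit self-join on player and t - 1, which gives NA whenever the previous period is missing. The names prev and out are made up for illustration.

```r
player <- c('Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce',
            'Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce',
            'Tim Duncan', 'Lebron James', 'Kobe Bryant', 'Paul Pierce')
t <- c(3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1)
min_per_game <- c(30, 36, 34, 33, 31, 36, 34, 32, 29, 35, 32, 36)
pts_per_36_min <- c(19, 28, 27, 24, 22, 27, 25, 28, 23, 28, 29, 29)
df <- data.frame(player, t, min_per_game, pts_per_36_min)

# For every row, compute the value and label it with the period
# it should be attached to, i.e. the next one (t + 1).
prev <- data.frame(player = df$player,
                   t = df$t + 1,
                   pts_per_game = df$min_per_game / 36 * df$pts_per_36_min)

# Left join: rows of df without a previous period keep NA.
out <- merge(df, prev, by = c("player", "t"), all.x = TRUE)
```

Note that merge() returns the rows sorted by player and t, not in the original order.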

In R - Check vector elements in another vector and write occurrences

I have the following dfs reporting the occurrences of some codes in week 1 and week 2
This for Week 1:
w1 <- data.frame("Code" = c("B00F328AFW", "B0792HCFTG", "B071SDVC6Z", "B0792H8GHP", "X000MLAQUJ"), "Occs" = c(31, 23, 19, 18, 16))
# Code # Occs
B00F328AFW 31
B0792HCFTG 23
B071SDVC6Z 19
B0792H8GHP 18
X000MLAQUJ 16
And this for Week 2:
w2 <- data.frame("Code" = c("X000VID7DV", "X000MLAQUJ", "B000FNFSPY", "X000Z94DWZ", "B01I3DT21I", "X000SC7OO3", "B00F328AFW", "B071SDVC6Z"), "Occs" = c(27, 21, 20, 20, 19, 19, 15, 14))
# Code # Occs
X000VID7DV 27
X000MLAQUJ 21
B000FNFSPY 20
X000Z94DWZ 20
B01I3DT21I 19
X000SC7OO3 19
B00F328AFW 15
B071SDVC6Z 14
I would like to understand whether the codes from the first week are present in the second week and how many occurrences (if they are more or less compared with the previous one).
Unfortunately, I don't have any idea how to proceed. I've only tried using the %in% operator to compare the first columns of the dfs, but the result is very far from what I expect.
Here is a dplyr solution.
library(dplyr)
w1 %>% inner_join(w2, by = "Code") %>%
mutate(compare = case_when(
Occs.y > Occs.x ~ "more",
Occs.y < Occs.x ~ "less",
Occs.y == Occs.x ~ "same")) %>%
rename(Occ_week1 = Occs.x, Occ_week2 = Occs.y)
Code Occ_week1 Occ_week2 compare
1 B00F328AFW 31 15 less
2 B071SDVC6Z 19 14 less
3 X000MLAQUJ 16 21 more
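A base R sketch of the same comparison, using merge() for the join and sign() for the direction of change. It also lists the week 1 codes that do not appear in week 2, which the inner join drops. The names both and dropped are made up for illustration.

```r
w1 <- data.frame(Code = c("B00F328AFW", "B0792HCFTG", "B071SDVC6Z",
                          "B0792H8GHP", "X000MLAQUJ"),
                 Occs = c(31, 23, 19, 18, 16))
w2 <- data.frame(Code = c("X000VID7DV", "X000MLAQUJ", "B000FNFSPY",
                          "X000Z94DWZ", "B01I3DT21I", "X000SC7OO3",
                          "B00F328AFW", "B071SDVC6Z"),
                 Occs = c(27, 21, 20, 20, 19, 19, 15, 14))

# Keep only codes present in both weeks; suffixes label the two Occs columns.
both <- merge(w1, w2, by = "Code", suffixes = c("_week1", "_week2"))

# sign() gives -1, 0 or 1; adding 2 turns that into an index 1, 2 or 3.
both$compare <- c("less", "same", "more")[sign(both$Occs_week2 - both$Occs_week1) + 2]

# Week 1 codes that are absent from week 2.
dropped <- setdiff(w1$Code, w2$Code)
```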

How to access columns that contains the argument in data frame and sort in decreasing order

I am working with a data frame whose columns are company name, division name, all_production_2017, bad_production_2017, and so on, going back many years.
I am writing a function that takes a company name and a year as arguments and summarizes the company's production in that year, sorted in decreasing order of all_production_year.
I have already converted the year to a string and filtered the required rows and columns. But how can I sort by a specific column? I do not know how to access that column name, because the year argument is its suffix.
Here is a rough sketch of the structure of my data frame.
structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
with data from 2000 to 2017.
I want to write a function that, given a company name and a year, keeps only the relevant company and year and sorts by that year's all_production column in decreasing order.
Here is what I have so far:
ExportCompanyYear <- function(company.name, year){
year.string <- toString(year)
x <- filter(company.data, company == company.name) %>%
select(company, division, contains(year.string))
}
I just do not know how to sort in decreasing order, because I do not know how to access the column whose name contains the year argument.
You definitely need to reshape your data in such a way that year values could be passed as a parameter.
To create a reproducible example, I have added another year 2001 in the data.
df = data.frame(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"), division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"), all_production_2000 = c(15, 25, 25, 10, 25, 18), good_production_2000 = c(10, 24, 10, 8, 10, 10), bad_production_2000 = c(2, 1, 2, 1, 3, 5),all_production_2001 = 2*c(15, 25, 25, 10, 25, 18), good_production_2001 = 2*c(10, 24, 10, 8, 10, 10), bad_production_2001 = 2*c(2, 1, 2, 1, 3, 5))
Now you can reshape the data using the reshape function in R.
Here, the variables "all_production","good_production","bad_production" are varying with time, and year values are changing for those variables.
So we specify v.names = c("all_production","good_production","bad_production").
Pass varying as a list with one vector of column names per variable, in the same order as v.names. If varying is given as a plain vector, reshape splits the columns into groups sorted alphabetically by variable name but labels them in the order given in v.names, so the good_production and bad_production values end up under each other's names.
df2 = reshape(df,direction="long",
v.names = c("all_production","good_production","bad_production"),
varying = list(c("all_production_2000","all_production_2001"),
c("good_production_2000","good_production_2001"),
c("bad_production_2000","bad_production_2001")),
idvar = c("company","division"),
timevar = "year",times = c(2000,2001))
For your data.frame you can specify times=2000:2017 and build each element of varying with paste0, e.g. paste0("all_production_", 2000:2017).
>df2
company division year all_production good_production bad_production
DLT.Marketing.2000 DLT Marketing 2000 15 10 2
DLT.CHANG1.2000 DLT CHANG1 2000 25 24 1
DLT.CAHNG2.2000 DLT CAHNG2 2000 25 10 2
MSF.MARKETING.2000 MSF MARKETING 2000 10 8 1
MSF.CHANG1M.2000 MSF CHANG1M 2000 25 10 3
MSF.CHANG2M.2000 MSF CHANG2M 2000 18 10 5
DLT.Marketing.2001 DLT Marketing 2001 30 20 4
DLT.CHANG1.2001 DLT CHANG1 2001 50 48 2
DLT.CAHNG2.2001 DLT CAHNG2 2001 50 20 4
MSF.MARKETING.2001 MSF MARKETING 2001 20 16 2
MSF.CHANG1M.2001 MSF CHANG1M 2001 50 20 6
MSF.CHANG2M.2001 MSF CHANG2M 2001 36 20 10
Now you can filter and sort like this:
library(dplyr)
somefunc <- function(company.name, yearval){
df2 %>% filter(company == company.name, year == yearval) %>% arrange(-all_production)
}
>somefunc("DLT",2001)
company division year all_production good_production bad_production
1 DLT CHANG1 2001 50 48 2
2 DLT CAHNG2 2001 50 20 4
3 DLT Marketing 2001 30 20 4
Note that the OP's sample is very simple and contains data for the year 2000 only.
A possible approach:
1. Convert the list to a data.frame
2. Use gather from tidyr to rearrange the data frame so that filter can be applied
ll <- structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M",
"CHANG2M"), all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
df <- as.data.frame(ll)
library(tidyr)
gather(df, key = "key", value = "value", -c("company", "division"))
#result:
# company division key value
#1 DLT Marketing all_production_2000 15
#2 DLT CHANG1 all_production_2000 25
#3 DLT CAHNG2 all_production_2000 25
#4 MSF MARKETING all_production_2000 10
#5 MSF CHANG1M all_production_2000 25
#6 MSF CHANG2M all_production_2000 18
#7 DLT Marketing good_production_2000 10
#8 DLT CHANG1 good_production_2000 24
#9 DLT CAHNG2 good_production_2000 10
#10 MSF MARKETING good_production_2000 8
#11 MSF CHANG1M good_production_2000 10
#12 MSF CHANG2M good_production_2000 10
#13 DLT Marketing bad_production_2000 2
#14 DLT CHANG1 bad_production_2000 1
#15 DLT CAHNG2 bad_production_2000 2
Now, filter can be applied easily on above data.frame.
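If reshaping is not an option, the column name can also be built from the year argument and used for sorting directly. The following base R sketch assumes the wide layout from the question; the function name export_company_year and the object company_data are hypothetical, not the OP's actual code.

```r
company_data <- data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5)
)

# Hypothetical helper: keep one company and one year's columns,
# then sort by that year's all_production column, largest first.
export_company_year <- function(data, company_name, year) {
  # Columns whose names end in "_<year>", e.g. "bad_production_2000".
  year_cols <- grep(paste0("_", year, "$"), names(data), value = TRUE)
  # Name of the column to sort by, assembled from the year argument.
  sort_col <- paste0("all_production_", year)
  out <- data[data$company == company_name, c("company", "division", year_cols)]
  out[order(out[[sort_col]], decreasing = TRUE), ]
}

res <- export_company_year(company_data, "DLT", 2000)
```

The key step is out[[sort_col]], which accesses a column by a name held in a variable.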
