I am new to R and am having trouble creating a new variable using conditions from already existing variables. I have a dataset that has a few columns: Name, Month, Binary for Gender, and Price. I want to create a new variable, Price2, that will:
make the price charged 20 if [the month is 6-9(Jun-Sept) and Gender is 0]
make the price charged 30 if [the month is 6-9(Jun-Sept) and Gender is 1]
make the price charged 0 if [the month is 1-5(Jan-May) or month is 10-12(Oct-Dec)]
--
structure(list(Name = c("ADI", "SLI", "SKL", "SNK", "SIIEL", "DJD"), Mon = c(1, 2, 3, 4, 5, 6), Gender = c(1, NA, NA, NA, 1, NA), Price = c(23, 34, 32, 64, 23, 34)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
Using case_when() from the dplyr package:
mydf$newprice <- dplyr::case_when(
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 0 ~ 20,
mydf$Mon >= 6 & mydf$Mon <= 9 & mydf$Gender == 1 ~ 30,
mydf$Mon < 6 | mydf$Mon > 9 ~ 0)
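For comparison, here is a base R sketch of the same rules with nested ifelse(), using the sample data from the question rebuilt as a plain data frame. The column name Price2 follows the question, and the helper vector in_season is just an illustrative name. Note that a June-September row with a missing Gender comes out as NA, which is also what the case_when() version gives, since none of its conditions is TRUE for such a row.

```r
# Sample data from the question, rebuilt as a plain data frame
mydf <- data.frame(
  Name = c("ADI", "SLI", "SKL", "SNK", "SIIEL", "DJD"),
  Mon = c(1, 2, 3, 4, 5, 6),
  Gender = c(1, NA, NA, NA, 1, NA),
  Price = c(23, 34, 32, 64, 23, 34)
)

# TRUE for June through September (illustrative helper name)
in_season <- mydf$Mon >= 6 & mydf$Mon <= 9

# Outside the season the price is 0; inside it depends on Gender.
# A missing Gender inside the season yields NA.
mydf$Price2 <- ifelse(!in_season, 0, ifelse(mydf$Gender == 0, 20, 30))

mydf$Price2
# [1]  0  0  0  0  0 NA
```

If a missing Gender inside the season should get a specific price instead of NA, that rule would have to be added explicitly.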
I have a database like this:
structure(list(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50),
car = c(0, 1, 0, 1)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
I want to create a column "drivers under 40" with these conditions:
0 if Age<40 & car==0
1 if Age<40 & car==1
How do I create the new column with these conditions?
I tried using ifelse to create the variable, but it doesn't work.
drivers <- ifelse((age <= 40) & (car==0), 0, ifelse((age<=40) & (car==1), 1))
Is the code maybe written wrong?
Is there another way to do it? I am afraid of messing up the parentheses, so I'd prefer a different method, ideally a faster one, if there is any.
Here is a dplyr version with case_when
library(dplyr)
df %>%
mutate(drivers_under_40 = case_when(age <= 40 & car==0 ~ 0,
age <= 40 & car==1 ~ 1,
TRUE ~ NA_real_))
code age car drivers_under_40
<dbl> <dbl> <dbl> <dbl>
1 1 25 0 0
2 2 30 1 1
3 3 45 0 NA
4 4 50 1 NA
A base R option
df1$drivers_under_40 <- with(df1, (age <= 40 & car == 1)* NA^(age> 40))
df1$drivers_under_40
[1] 0 1 NA NA
Unless you work with dplyr, you have to specify the data frame in your ifelse statement, for example data$column.
You also have to assign the result of the operation to a new column.
And the last "else" value is missing.
So your ifelse statement should look like this:
data = structure(list(code = c(1, 2, 3, 4), age = c(25, 30, 45, 50),
car = c(0, 1, 0, 1)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
# The last argument is the value used when neither condition holds; NA is only a placeholder, replace it with the value you need
data$drivers <- ifelse((data$age <= 40) & (data$car==0), 0, ifelse((data$age<=40) & (data$car==1), 1, NA))
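If the nested parentheses are the main worry, assigning through a logical index in base R avoids ifelse() altogether. This is a minimal sketch, assuming rows that do not meet the age condition should stay NA; the column name drivers_under_40 is borrowed from the dplyr answer and the helper under_40 is an illustrative name. It uses age < 40 as written in the question, whereas the posted code uses <= 40.

```r
# Sample data from the question as a plain data frame
data <- data.frame(
  code = c(1, 2, 3, 4),
  age = c(25, 30, 45, 50),
  car = c(0, 1, 0, 1)
)

# Start with NA everywhere, then fill in only the rows under 40
data$drivers_under_40 <- NA_real_
under_40 <- data$age < 40
data$drivers_under_40[under_40] <- data$car[under_40]

data$drivers_under_40
# [1]  0  1 NA NA
```

Because car is already coded 0/1, the value for the under-40 rows can be copied directly instead of testing car == 0 and car == 1 separately.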
I am trying to make a table like this -
The table contains several scenarios and risk_type.
The scenarios are basically filters. For example
0 - loan_age > 18
1 - interest_rate > 8%
2 - interest_rate > 18% AND referee == "MALE" AND new_LTV > 50
risk_type are columns in the original dataset like
A - flood risk
B - wildfire risk
C - foundation risk
What I want to do is to create a summary table of all these different risks for all the filters.
This is what the data looks like -
Damage and new LTV are functions of the risk score, and I want to filter for risk score > 4
Edit - The first 5 rows of the dummy dataframe.
structure(list(ID = c(1, 2, 3, 4, 5), LTV_value = c(43, 43, 32,
34, 35), loan_age = c(17, 65, 32, 33, 221), referee = c("MALE",
"FEMALE", "MALE", "MALE", "FEMALE"), interest_rate = c(0.02,
0.03, 0.05, 0.0633333333333333, 0.0783333333333333), value = c(70000,
80000, 90000, 1e+05, 45000), flood_risk_score = c(3, 4, 5, 0,
1), wildfire_risk_score = c(3, 4, 3, 3, 2), foundation_risk_score = c(5,
5, 2, 0, 1), flood_damage = c(21000, 32000, 45000, 0, 4500),
wildfire_damage = c(21000, 32000, 27000, 30000, 9000), foundation_damage = c(35000,
40000, 18000, 0, 4500), new_LTV_flood = c(40, 39, 27, 34,
34), new_LTV_wildfire = c(40, 39, 29, 31, 33), new_LTV_foundation = c(38,
38, 30, 34, 34)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
Till now I have tried these methods.
risk_list = c("flood_risk_score"
, "wildfire_risk_score"
, "foundation_risk_score")
table_append = NULL # start empty so that rbind() works on the first iteration
for (i in risk_list){
table <- df %>%
filter(.data[[i]] > 3) %>% # .data[[i]] looks up the column named by i
summarise(Count = n()
, mean = mean(value, na.rm = TRUE)
, LTV = mean(LTV_value))
# Using rbind() to append the output of one iteration to the dataframe
table_append = rbind(table_append, table)
}
This helps me get the values for all the risk scores, however, I have two issues here.
I am unable to filter according to a filter list
For the filter list, I tried this code, but I am unable to add it in a loop -
filters_list = list(which(df$interest_rate > 0.08) # 8%, since the rates are stored as fractions
, which(df$loan_age > 18))
For LTV update, all of them have different new LTV
All of them need to be filtered for high LTV using their new LTV scores
risk_type_list = c("flood"
, "wildfire"
, "foundation")
table_append = NULL # start empty so that rbind() works on the first iteration
for (i in risk_type_list){
table <- df %>%
filter(.data[[paste0(i,"_risk_score")]] > 3) %>%
summarise(Count = n())
#Using rbind() to append the output of one iteration to the dataframe
table_append = rbind(table_append, table)
}
In the end, I want to have code that will generate data from the given data by putting in required filters for all different risk types and also use their new LTV values.
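One possible way to structure this, sketched in base R: keep each scenario as a function that returns a logical vector, then loop over scenarios and risk types and collect one summary row per combination. The scenario names, the summary column names, and the score cutoff of 3 (taken from the loop above; the text mentions 4) are assumptions. The 8% rate is written as 0.08 because the dummy data stores rates as fractions, and only the columns needed for the summary are rebuilt here.

```r
# Subset of the dummy data from the question
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  loan_age = c(17, 65, 32, 33, 221),
  interest_rate = c(0.02, 0.03, 0.05, 0.0633333333333333, 0.0783333333333333),
  value = c(70000, 80000, 90000, 1e+05, 45000),
  flood_risk_score = c(3, 4, 5, 0, 1),
  wildfire_risk_score = c(3, 4, 3, 3, 2),
  foundation_risk_score = c(5, 5, 2, 0, 1),
  new_LTV_flood = c(40, 39, 27, 34, 34),
  new_LTV_wildfire = c(40, 39, 29, 31, 33),
  new_LTV_foundation = c(38, 38, 30, 34, 34)
)

# Hypothetical scenario definitions: each returns a logical vector over rows
scenarios <- list(
  loan_age_gt_18 = function(d) d$loan_age > 18,
  rate_gt_8pct = function(d) d$interest_rate > 0.08
)
risk_types <- c("flood", "wildfire", "foundation")

rows <- list()
for (s in names(scenarios)) {
  keep_scenario <- scenarios[[s]](df)
  for (r in risk_types) {
    # Assumed cutoff: risk score above 3 counts as high risk
    high_risk <- df[[paste0(r, "_risk_score")]] > 3
    sub <- df[keep_scenario & high_risk, ]
    rows[[length(rows) + 1]] <- data.frame(
      scenario = s,
      risk_type = r,
      Count = nrow(sub),
      mean_value = mean(sub$value),
      # Each risk type uses its own new LTV column
      mean_new_LTV = mean(sub[[paste0("new_LTV_", r)]])
    )
  }
}
summary_table <- do.call(rbind, rows)
summary_table
```

With only five dummy rows, no loan has a rate above 8%, so that scenario yields a Count of 0 and NaN means. A risk-specific scenario such as new_LTV > 50 could be handled by letting the scenario function take the risk type as a second argument.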
I have a list of dataframes (file1, file2, ..., file72). For each dataframe I want to create one variable containing information from another dataframe based on two conditions.
The idea is simple:
condition 1: if file*$countryid equals source$country, and
condition 2: if file*$year is higher than source$starting but lower than source$ending, then if true I want to create a column file*$rank with the value in source$rank
I have been trying code lines like this but this code does not go through all lines in source:
file1$rank<-ifelse(file1$countryid==source$country & file1$year>source$starting & file1$year<source$ending,source$rank,NA)
In addition I would like to implement this within a loop to avoid iterating manually through all these dataframes:
dflist<-Filter(is.data.frame, mget(ls()))
dflist<-function(df,x){df$rank<-ifelse(df$countryid==source$country & df$year>source$starting & df$year<source$ending,source$rank,NA)}
Here is an example of the data I have.
Thank you!
> dput(file1)
structure(list(id = c(1, 2, 3), countryid = c(10, 10, 13), year = c(1948,
1954, 1908)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
dput(file2)
structure(list(id = c(1, 2, 3), countryid = c(13, 10, 13), year = c(1907,
1908, 1907)), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
> dput(source)
structure(list(country = c(13, 13, 13, 10, 10, 10), rank = c(1,
2, 3, 1, 2, 3), starting = c(1885, 1909, 1940, 1902, 1907, 1931
), ending = c(1908, 1939, 1960, 1906, 1930, 1960)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We can use a non-equi join after getting all the file\\d+ datasets into a list
library(data.table)
out <- lapply(mget(ls(pattern = '^file\\d+$')), function(dat)
setDT(dat)[, year := as.integer(year)][as.data.table(source), rank := i.rank,
on = .(countryid = country, year > starting, year < ending)])
-output
out
#$file1
# id countryid year rank
#1: 1 10 1948 3
#2: 2 10 1954 3
#3: 3 13 1908 NA
#$file2
# id countryid year rank
#1: 1 13 1907 1
#2: 2 10 1908 2
#3: 3 13 1907 1
If the original objects need to be updated, use list2env
list2env(out, .GlobalEnv)
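For readers without data.table, the same lookup can be sketched in base R by checking each row against the source table. The helper names lookup_rank and add_rank are hypothetical, and the sketch assumes at most one matching period per row (it takes the first match otherwise). The source table is called src here only to avoid masking base::source().

```r
# Sample data from the question as plain data frames
file1 <- data.frame(id = c(1, 2, 3), countryid = c(10, 10, 13),
                    year = c(1948, 1954, 1908))
file2 <- data.frame(id = c(1, 2, 3), countryid = c(13, 10, 13),
                    year = c(1907, 1908, 1907))
src <- data.frame(country = c(13, 13, 13, 10, 10, 10),
                  rank = c(1, 2, 3, 1, 2, 3),
                  starting = c(1885, 1909, 1940, 1902, 1907, 1931),
                  ending = c(1908, 1939, 1960, 1906, 1930, 1960))

# Hypothetical helper: rank for one (countryid, year) pair, or NA if no period matches
lookup_rank <- function(countryid, year, src) {
  hit <- src$rank[src$country == countryid &
                    src$starting < year & src$ending > year]
  if (length(hit) == 0) NA_real_ else hit[1]
}

# Hypothetical helper: add a rank column to one data frame
add_rank <- function(dat, src) {
  dat$rank <- mapply(lookup_rank, dat$countryid, dat$year,
                     MoreArgs = list(src = src))
  dat
}

out <- lapply(list(file1 = file1, file2 = file2), add_rank, src = src)
out$file1$rank
# [1]  3  3 NA
out$file2$rank
# [1] 1 2 1
```

This checks every row against every period, so it is slower than the non-equi join on large data, but it has no package dependencies.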
I'm performing some light analysis on an NFL kickers' dataset, and am trying to find the total number of kicks made from 18-29yds grouped by each kicker. The dataset's rows contain every made or missed field goal for each kicker, along with the distance and some other variables irrelevant to this issue. I'm using group_by() and then the sum function within the summarise function, but it is returning 1 for every kicker. I have tried different combinations, trying to use filter() as well, but the results keep returning 1 for each kicker. Pics of my code are attached. Any help is appreciated :)
Some code I have tried:
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
count(filter(nfl_kicks1$`FG Length`>=18 & nfl_kicks1$`FG Length`<=29))
kicks20to29 <- nfl_kicks1%>%
group_by(Kicker)%>%
filter(`FG Length`>=18 & `FG Length`<=29)
dput output:
structure(list(Quarter = c(1, 2, 1, 2, 2, 4), `Possession Team` = c("NE",
"NE", "NE", "NE", "NE", "NE"), `Wind Speed` = c("6", "6", "12",
"12", "12", "12"), Down = c(4, 4, 4, 4, 4, 4), Distance = c(13,
7, 2, 6, 9, 12), YardLine = c(22, 20, 2, 6, 35, 25), `FG Length` = c(39,
37, 19, 23, 52, 42), `4Q to tie or take lead` = c(0, 0, 0, 0,
0, 0), Result = c("Miss", "Miss", "Good", "Good", "Good", "Miss"
), `Success Rate` = c(0, 0, 1, 1, 1, 0), Kicker = c("A.Vinatieri",
"A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri", "A.Vinatieri"
), `# career kicks in study` = c(766, 766, 766, 766, 766, 766
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
One approach is to use the tally function, which counts the number of rows per group.
library(tidyverse)
nfl_kicks1 %>%
group_by(Kicker) %>%
dplyr::filter(`FG Length` >= 18 & `FG Length` <= 29) %>%
tally(name = "Number of Kicks")
## A tibble: 1 x 2
# Kicker `Number of Kicks`
# <chr> <int>
#1 A.Vinatieri 2
You can use group_by + summarise :
library(dplyr)
nfl_kicks1 %>%
group_by(Kicker) %>%
summarise(n_kicks = sum(`FG Length` >= 18 & `FG Length` <= 29))
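Both answers count attempts in the 18-29 yard range. Since the question asks for kicks made, a base R sketch that also requires a successful result might look like this. It assumes Result is "Good" for a made kick, as in the dput sample, rebuilds only the columns needed, and uses made_in_range and n_made as illustrative names.

```r
# Only the columns needed from the dput sample; check.names = FALSE keeps "FG Length"
nfl_kicks1 <- data.frame(
  `FG Length` = c(39, 37, 19, 23, 52, 42),
  Result = c("Miss", "Miss", "Good", "Good", "Good", "Miss"),
  Kicker = rep("A.Vinatieri", 6),
  check.names = FALSE
)

# TRUE for made kicks between 18 and 29 yards (inclusive)
made_in_range <- nfl_kicks1$`FG Length` >= 18 &
  nfl_kicks1$`FG Length` <= 29 &
  nfl_kicks1$Result == "Good"

# Summing a logical vector counts the TRUE values, here per kicker
n_made <- tapply(made_in_range, nfl_kicks1$Kicker, sum)
n_made
# A.Vinatieri 
#           2 
```

The same extra condition can be added inside the filter() or sum() calls of the dplyr answers.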
I have a dataset like this:
data <- data.frame(Time = c(1,4,6,9,11,13,16, 25, 32, 65),
A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))
What I want to retrieve are the values in Time that correspond to the last numerical value in A, B, and C.
For example, the last numerical values for A, B, and C are 12, 13, and 32 respectively.
So, the Time values that correspond are 65, 25, and 4.
I've tried something like data[which(data$Time== max(data$A)), ], but this doesn't work.
We can multiply the row index with the logical matrix, and get the colMaxs (from matrixStats) to subset the 'Time' column
library(matrixStats)
data$Time[colMaxs((!is.na(data[-1])) * row(data[-1]))]
#[1] 65 25 4
Or using base R, we get the index with which/arr.ind, get the max index using a group by operation (tapply) and use that to extract the 'Time' value
m1 <- which(!is.na(data[-1]), arr.ind = TRUE)
data$Time[tapply(m1[,1], m1[,2], FUN = max)]
#[1] 65 25 4
Or with summarise/across (available in dplyr 1.0.0 and later)
library(dplyr)
data %>%
summarise(across(A:C, ~ tail(Time[!is.na(.)], 1)))
# A B C
#1 65 25 4
Or using summarise_at, which still works but has been superseded by across()
data %>%
summarise_at(vars(A:C), ~ tail(Time[!is.na(.)], 1))
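A base R sketch of the same idea without extra packages: for each column, find the last row that is not NA and take Time at that row. It assumes every column has at least one non-missing value; last_time is an illustrative name.

```r
data <- data.frame(Time = c(1, 4, 6, 9, 11, 13, 16, 25, 32, 65),
                   A = c(10, NA, 13, 2, 32, 19, 32, 34, 93, 12),
                   B = c(1, 99, 32, 31, 12, 13, NA, 13, NA, NA),
                   C = c(2, 32, NA, NA, NA, NA, NA, NA, NA, NA))

# For each of A, B, C: index of the last non-NA row, then the Time at that row
last_time <- sapply(data[c("A", "B", "C")],
                    function(col) data$Time[max(which(!is.na(col)))])
last_time
#  A  B  C 
# 65 25  4 
```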