I am trying to make a new column in my data.frame based on another column.
My data frame is called dat.cp2, and one of its columns, "ar", holds a year from 1990 to 2017.
I need to make a new column called "TB" with periods: e.g. period one is 1990-1996 and I want that period to be called "TB1", 1997-2003 is "TB2", and so on. So for a person born in 1995 the new column should say "TB1".
I tried:
dat.cp2 %>% mutate(TB =
case_when(ar <=1996 ~ "TB1",
ar >=1997&<=2003 ~ "TB2",
ar >=2004&<=2010 ~ "TB3",
ar >=2011 ~ "TB4")
But I get this error message:
Error: unexpected '<=' in:
" case_when(ar <=1996 ~ "TB1",
ar >=1997&<="
I have tried looking for answers but can't find any. Can anyone help?
The syntax &<= may be acceptable in some other languages, but in R each comparison needs its own left-hand side: both expressions should reference ar and be connected by &.
library(dplyr)
dat.cp2 %>%
mutate(TB =
case_when(ar <=1996 ~ "TB1",
ar >=1997 & ar <=2003 ~ "TB2",
ar >=2004 & ar <=2010 ~ "TB3",
ar >=2011 ~ "TB4"))
NOTE: There are many ways to simplify this, but the above is just to show where the OP's code mistake is.
You don't actually need the &, since case_when evaluates the conditions sequentially, and you can finish with TRUE as a catch-all:
dat.cp2 %>%
mutate(
TB = case_when(ar <= 1996 ~ 'TB1',
ar <= 2003 ~ 'TB2',
ar <= 2010 ~ 'TB3',
TRUE ~ 'TB4')
)
You could also do:
dat.cp2 %>%
mutate(TB = cut(ar, breaks = c(1989,1996, 2003, 2010, 2017),
labels = c("TB1", "TB2","TB3","TB4")))
This is a follow-up question to a question that I asked before (R apply multiple functions when large number of categories/types are present using case_when (R vectorization)). Unfortunately I have not been able to figure out the problem. I think I may have narrowed down its source and wanted to check whether someone with a better understanding than me could help me figure out a solution.
Suppose I have the following dataset:
set.seed(100)
City=c("City1","City2","City2","City1")
Business=c("B","A","A","B")
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
Here, suppose there exist two different businesses named "A" and "B", and two different cities, City1 and City2. My original dataset contains about 200K observations with multiple businesses and about 100 cities. For each city, I have a unique pre-written function to compute adjusted revenue. Instead of running them row by row, I want to use case_when to run the function for the relevant city (e.g. take the observations for City1, run a vectorized function for City1 if possible, then move to City2, and so on).
For the purposes of illustration, suppose I have the following highly simplified functions for the two cities.
#Writing the custom functions for the categories here
City1=function(full_data,observation){
NewSet=full_data[which(full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
City2=function(full_data,observation){
NewSet=full_data[which(full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-1000*rnorm(1)
return(BusinessMax)
}
These simple functions essentially subset the data for the city and add (City1) or subtract (City2) some random number from the maximum expected revenue for that city. Once again, they are simply for illustration and do not reflect the actual functions. I also manually check whether the functions work by typing:
City1(full_data = zz,observation = zz[1,])
City1(full_data = zz,observation = zz[4,])
and get "29.97808" and "36.31531". Note that in the above functions, since I add or subtract a random number, I would expect to get different values for two observations in the same city like I have obtained here.
Finally, I try to use case_when to run the code as follows:
library(dplyr) #I use dplyr here
zz[,"AdjustedRevenue"] = case_when(
zz[["City"]]=="City1"~City1(full_data=zz,observation=zz[,]),
zz[["City"]]=="City2"~City2(full_data=zz,observation=zz[,])
)
The output I receive is the following:
City Business ExpectedRevenue AdjustedRevenue
1 City1 B 35 43.86785
2 City2 A 20 -81.97127
3 City2 A 15 -81.97127
4 City1 B 19 43.86785
Here, the adjusted values are the same for observations 1 and 4, and for observations 2 and 3. Instead I would expect to obtain different values for each observation (since I add or subtract some random number for each observation, or at least intended to). Following Martin Gal's answer to my previous question (https://stackoverflow.com/a/62378991/3988575), I suspect this is due to not calling the 2nd argument of my City1 and City2 functions correctly in the final step. However, I have been somewhat lost trying to figure out why, and what to do in order to fix it.
It'd be really helpful if someone could point out why this is happening and how to fix this error. Thanks in advance!
P.S.
I am also open to other vectorized solutions. I am relatively new to vectorization and do not have much experience in it and would appreciate any suggestions.
Converted the City functions to dplyr. If CityMaster is too simplified for the final function, then mer could be moved inside the case_when as applicable. If a new city is added to the data, it will return NA until a case is defined.
library(dplyr)
CityMaster <- function(data, city) {
mer <- data %>%
filter(City == city) %>%
pull(ExpectedRevenue) %>%
max()
case_when(city == 'City1' ~ mer + 10 * rnorm(1),
city == 'City2' ~ mer - 1000 * rnorm(1),
TRUE ~ NA_real_)
}
set.seed(100)
zz %>%
rowwise() %>%
mutate(AdjustedRevenue = CityMaster(., City))
# A tibble: 4 x 4
# Rowwise:
City Business ExpectedRevenue AdjustedRevenue
<chr> <chr> <dbl> <dbl>
1 City1 B 35 30.0
2 City2 A 20 -867.
3 City2 A 15 -299.
4 City1 B 19 29.2
Breaking City functions apart
City1 <- function(data, city) {
data %>%
filter(City == city) %>%
pull(ExpectedRevenue) %>%
max() + 10 * rnorm(1)
}
City2 <- function(data, city) {
data %>%
filter(City == city) %>%
pull(ExpectedRevenue) %>%
max() - 1000 * rnorm(1)
}
set.seed(100)
zz %>%
rowwise() %>%
mutate(AdjustedRevenue = case_when(City == 'City1' ~ City1(., City),
City == 'City2' ~ City2(., City),
TRUE ~ NA_real_))
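For reference on why the original attempt returned duplicated values: case_when evaluates each right-hand side once for the whole vector, and a length-1 result is recycled across every row that matches, so a single rnorm(1) draw gets reused. A minimal sketch with toy values, not the OP's data (dplyr already loaded above):
set.seed(1)
# rnorm(1) is evaluated once; the single draw is recycled to every matching row
case_when(c(TRUE, TRUE, TRUE) ~ rnorm(1))
#> [1] -0.6264538 -0.6264538 -0.6264538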
I have a dataframe called SCWB from 2001.
The variable YR_IMM captures the year of immigration for each individual (=observation).
1) I would like to delete the "don't know" (=9998) and "refuse" (=9999) observations.
How should I go about this? I tried the dplyr package, but I can't figure out how to work with a "continuous" variable (immigration years go from 1920 to 2000).
2) I would like to recode YR_IMM into "years spent in the US". Would this code be correct?
YRSinUS <- transform(SCWB, YR_IMM = 2001 - YR_IMM)
Delete (filter out) "don't know" and "refuse" and recode YR_IMM:
library(dplyr)
SCWB %>%
filter(YR_IMM != 9998 & YR_IMM != 9999) %>%
mutate(YR_IMM = 2001 - YR_IMM)
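Equivalently, you can drop both codes with a single %in% condition and, if you would rather keep the original year, store the recode in a new column (YRS_IN_US is just a hypothetical name):
library(dplyr)
SCWB %>%
  filter(!YR_IMM %in% c(9998, 9999)) %>%   # drop "don't know" and "refuse"
  mutate(YRS_IN_US = 2001 - YR_IMM)        # years spent in the US as of 2001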
I would like to select cases with values in some variables above the corresponding third quartile (Q3).
As my dataset is very large, I am going to use the 'airquality' dataset that comes with R as an example.
df <- airquality[complete.cases(airquality),]
The objective was to filter by certain columns
('Ozone', 'Solar.R', 'Wind', 'Temp').
So far I have been able to develop this solution:
filtro_Ozone = df$Ozone>quantile(df$Ozone)[4]
filtro_Solar.R = df$Solar.R>quantile(df$Solar.R)[4]
filtro_Wind = df$Wind>quantile(df$Wind)[4]
filtro_Temp = df$Temp>quantile(df$Temp)[4]
df[filtro_Ozone & filtro_Solar.R & filtro_Wind & filtro_Temp,]
With which I obtain:
Ozone Solar.R Wind Temp Month Day
40 71 291 13.8 90 6 9
Is there another, fancier way to get this?
UPDATE: per OP's updated request, you can use filter_at from dplyr to filter only at the selected variables:
df <- airquality[complete.cases(airquality),]
filter_at(df, vars(Ozone, Solar.R, Wind, Temp), ~. > quantile(., probs = 0.75))
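As a side note, filter_at() is superseded in newer versions of dplyr (1.0.4 and later); the same filter can be written with if_all(), roughly like this:
# requires dplyr >= 1.0.4
df %>%
  filter(if_all(c(Ozone, Solar.R, Wind, Temp), ~ .x > quantile(.x, probs = 0.75)))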
I have constructed a dataset from the GSS data (https://gss.norc.org/), grouping the data into decades:
env_data <- select(gss, year, sex, degree, natenvir) %>% na.omit()
env_datadecades <- env_data %>%
mutate(decade=as.factor(ifelse(year<1980,
"70s",
ifelse(year>1980 & year<=1990,
"80s",
ifelse(year>1990 & year<2000, "90s", "00s")))))
I want to plot it with ggplot2 and facet_grid(), but the order of the facets is not right, so I reordered the factor as I had seen somewhere else:
set.seed(6809)
env_datadecades$decade <- factor(env_datadecades$decade,
levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
It worked the first time but when I try to run the code again I get NA for all data in decade. What is happening?
I just made a simple dataset of years
df <- data.frame(Years = sample(1970:2010, 20, replace = T))
and converted it into the required factors with this method:
df <- df %>%
mutate(Decades = case_when(Years < 1980 ~ "Seventies",
1980 <= Years & Years < 1990 ~ "Eighties",
1990 <= Years & Years < 2000 ~ "Nineties",
2000 <= Years ~ "TwoThousands"))
df$Decades <- factor(df$Decades, levels = c("Seventies", "Eighties", "Nineties", "TwoThousands"), ordered = T)
and now try faceting.
I think the problem with your code was that you gave the levels one set of names when you first converted the variable to a factor, and then in the second piece of code you gave them another set of names. Stick to the same set and it should work.
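In other words, keeping your original "70s"/"80s"/"90s"/"00s" labels, the reordering step would need those same names, something like:
env_datadecades$decade <- factor(env_datadecades$decade,
                                 levels = c("70s", "80s", "90s", "00s"))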
Building off of this question Pass a data.frame with column names and fields as filter
Let's say we have the following data set:
filt = data.table(X1 = c("Gender","Male"),
X2 = c('jobFamilyGroup','Finance'),
X3 = c('jobFamilyGroup','Software Dev'))
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))
and I want to use filt as a filter for df. filt is done by grabbing an input from Shiny and transforming it a bit to get me that data.table above. My goal is to filter df so we have: All rows that are MALE AND (Software Dev OR Finance).
Currently, I'm hardcoding it to always be an AND but that isn't ideal for situations like this. My thought would be to have multiple if conditions to catch things like this, but I feel like there could be an easier approach for building this logic in.
UPDATE
Once I have a table like filt I can pass code like:
if(!is.null(primary))
{
if(ncol(primary)==1){
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1])]
}
else if(length(primary)==2){
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
get(as.character(primary[1,2]))==as.character(primary[2,2])]
}
else{
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
get(as.character(primary[1,2]))==as.character(primary[2,2]) &
get(as.character(primary[1,3]))==as.character(primary[2,3])]
}
}
But this code doesn't account for the OR Logical needed if there are multiple inputs for one type of grouping. Meaning the current code says give me all rows where: Gender == Male & Job Family Group == 'Finance'& Job Family Group == 'Software Dev' When really it should be Gender == Male & (Job Family Group == 'Finance'| Job Family Group == 'Software Dev')
This is a minimal example, meaning there are many other columns, so ideally the solution can determine when multiple inputs for one grouping are present.
Given your problem, what if you parsed it so your logic looked like:
Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')
By lumping all filter values with the same column name together in an %in% you get your OR and you keep your AND between column names.
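Applied directly to your sample data, that hardcoded version would look something like this (a minimal sketch using data.table subsetting):
df[Gender %in% "Male" & jobFamilyGroup %in% c("Finance", "Software Dev")]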
UPDATE
Consider the case discussed in comments below.
Your reactive input is a data.table specifying:
Gender IS Male
Country IS China OR US
EmployeeStatus IS Active
In the sample data you provided there is no Country column, so I added one. I extract the columns to be filtered and the values to be filtered, and split the values by column. I pass this into an lapply which does the logical check for each column using %in% rather than ==, so that options within the same column are treated as | instead of &. Then I rbind the logical results together, apply all() over the columns, and filter df by the result.
This approach handles the & between columns and the | within columns. It supports any number of columns to be searched, removing the need for your if/else logic.
library(data.table)
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'),
Country = c('China','China','US','US','China'))
filt = data.table(x1 = c('Gender' , 'Male'),x2 = c('Country' , 'China'),x3 = c('Country','US'), x4 = c('EmployeeStatus','Active'))
column = unlist(filt[1,])
value = unlist(filt[2,])
tofilter = split(value,column)
tokeep = apply(do.call(rbind,lapply(names(tofilter),function(x){
`[[`(df,x) %in% tofilter[[x]]
})),2,all)
df[tokeep==TRUE]
#> Gender EmployeeStatus jobFamilyGroup Country
#> 1: Male Active Finance China
#> 2: Male Active HR US
#> 3: Male Active Finance US
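If you prefer to skip the do.call/apply step, an equivalent way to express the same idea (a sketch, assuming the filt and df objects above) is to build one logical vector per filtered column and Reduce them with &:
column <- unlist(filt[1, ])        # columns to filter on
value  <- unlist(filt[2, ])        # values to keep
tofilter <- split(value, column)   # values grouped by column (the OR groups)
# %in% gives the OR within a column; Reduce(`&`, ...) gives the AND across columns
tokeep <- Reduce(`&`, lapply(names(tofilter), function(x) df[[x]] %in% tofilter[[x]]))
df[tokeep]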