Summarizing outcomes by groups in R - r

The following code works....
sum( (WASDATj$HCNT == 1 | WASDATj$HCNT == -1 | WASDATj$HCNT == 0 ) & WASDATj$Region=='United States'
& WASDATj$Unit=='Million Bushels'
& WASDATj$Commodity=='Soybeans'
& WASDATj$Attribute == 'Production'
& WASDATj$Fdex.x == 10
,na.rm=TRUE
)
It counts the number of observations where HCNT takes a value of -1, 1, or 0,
and it returns a single number for this category.
The variable WASDATj$Fdex.x takes a value from 1-20.
How can I generalize this to count the number of observations that take a value -1,1,0 for each of the values of Fdex.x (so provide me 20 sums for Fdex.x from 1-20)? I did look for an answer, but I'm such a novice I may have missed what is an obvious answer....

Extend your sum over a logical vector to aggregate() with FUN=length. Counting the rows in each group of the filtered data is analogous to summing the TRUE values:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0) &
WASDATj$Region=='United States' &
WASDATj$Unit=='Million Bushels' &
WASDATj$Commodity=='Soybeans' &
WASDATj$Attribute=='Production', ],
FUN=length)
The result should be a data frame with two columns and one row per distinct Fdex.x value (20 rows if every value from 1-20 has at least one matching observation), giving each value and its count.
And if needed, you can extend grouping for other counts by adjusting formula and data filter:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x + Region + Unit + Commodity + Attribute,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0), ],
FUN=length)
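Note that aggregate() on filtered data only returns groups that still have rows after filtering. If a zero count should show up for every Fdex.x value, one option is to keep the logical vector from the question and sum it per group. A minimal base R sketch, assuming a made-up data frame named toy in place of WASDATj (only three Fdex.x values, for brevity):

```r
# Toy stand-in for WASDATj (hypothetical values, for illustration only)
toy <- data.frame(
  HCNT   = c(1, -1, 0, 2, 5, NA, 0, 1),
  Fdex.x = c(1, 1, 1, 2, 2, 3, 3, 3),
  Region = "United States",
  stringsAsFactors = FALSE
)

# Same kind of logical filter as in the question's sum() call;
# %in% returns FALSE for NA, so no na.rm is needed here
keep <- toy$HCNT %in% c(-1, 0, 1) & toy$Region == "United States"

# Sum the TRUE values within each Fdex.x group
counts <- tapply(keep, factor(toy$Fdex.x, levels = 1:3), sum)
print(counts)
```

Here group 2 has rows but none of them match, so it is reported with a count of 0 instead of being dropped.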

Related

Perform conditional calculations in a R data frame

I have data in a dataframe in R like this:
Value | Metric
10 | KG
5 | lbs etc.
I want to create a new column (weight) where I can calculate a converted weight based on the Metric - something like if Metric = "Kg" then Value * 1, if Metric = "lbs" then Value * 2.20462
I also have another use case where I want to do a similar conditional calculation, but based on continuous values, i.e. if x >= 2 then "Classification", else if x >= 1 then "Classification 2", else "Other".
Any ideas that might work for both in R?
Does this work:
library(dplyr)
df %>% mutate(converted_wt = case_when(Metric == 'lbs' ~ Value * 2.20462, TRUE ~ Value))
Value Metric converted_wt
1 10 KG 10.0000
2 5 lbs 11.0231
If you have other units apart from "KG" and "lbs" you need to include those in case_when condition accordingly.
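For the second use case (classifying a continuous x), the same vectorized idea applies. A minimal base R sketch with nested ifelse(), using a made-up vector x; with dplyr, case_when(x >= 2 ~ ..., x >= 1 ~ ..., TRUE ~ ...) should behave the same way, since its conditions are checked in order:

```r
# Hypothetical input values, for illustration only
x <- c(2.5, 1.2, 0.4, 2, 1)

# Conditions are checked from the largest threshold down,
# so each value gets the first label whose condition it meets
label <- ifelse(x >= 2, "Classification",
         ifelse(x >= 1, "Classification 2", "Other"))

print(data.frame(x, label))
```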

Mutate character column to adjust values for inflation with else if statement in R

I am trying to mutate a column for salary in my data frame to adjust for inflation since I have a multi-year sample, called adj_SALARY. The salary column is a character vector (indicated by unadj_SALARY), and I need to multiply the values by a ratio of Consumer Price Indices(shown below as a fraction) to convert all values to 2017 dollars. I also have columns as dummy variables indicating YEAR_2014, YEAR_2015, YEAR_2016, YEAR_2017, and YEAR_2018. I have tried running the code below and am still being met with an error message that "In if (YEAR_2014 == 1) { :
the condition has length > 1 and only the first element will be used". Would love some help on the best way to set this up! Here's my code right now:
NHIS_test <- NHIS1 %>%
mutate(adj_SALARY = if(YEAR_2014 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/230.280) }
else if(YEAR_2015 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/233.916) }
else if (YEAR_2016 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/233.707) }
else if (YEAR_2017 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(242.839/236.916)}
else if (YEAR_2018 == 1) {
as.numeric(as.character(NHIS1$unadj_SALARY))*(1)})
We can use ifelse/case_when instead of if/else. Both are vectorized, whereas if only evaluates a single condition:
library(dplyr)
NHIS1 %>%
mutate(unadj_SALARY = as.numeric(as.character(unadj_SALARY)),
adj_SALARY =
case_when(
YEAR_2014 == 1 ~ unadj_SALARY *(242.839/230.280),
YEAR_2015 == 1 ~ unadj_SALARY *(242.839/233.916),
YEAR_2016 == 1 ~ unadj_SALARY *(242.839/233.707),
YEAR_2017 == 1 ~ unadj_SALARY *(242.839/236.916),
YEAR_2018 == 1 ~ unadj_SALARY))
NOTE: Instead of converting 'unadj_SALARY' to numeric multiple times, it is better to do it once and then use the result for further transformations and calculations.
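Because the YEAR_* columns are one-hot dummies, another way to pick the per-row ratio is to multiply the dummy matrix by a vector of ratios. A minimal base R sketch, assuming exactly one dummy equals 1 in each row; the toy data frame and the names cpi_ratio and row_ratio are made up for illustration, and only three of the five years are shown:

```r
# Toy stand-in for NHIS1 (hypothetical values)
toy <- data.frame(
  unadj_SALARY = c("50000", "60000", "70000"),
  YEAR_2014 = c(1, 0, 0),
  YEAR_2017 = c(0, 1, 0),
  YEAR_2018 = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# One ratio per dummy column, taken from the question
cpi_ratio <- c(YEAR_2014 = 242.839 / 230.280,
               YEAR_2017 = 242.839 / 236.916,
               YEAR_2018 = 1)

# Each row of the dummy matrix selects exactly one ratio
dummies   <- as.matrix(toy[, names(cpi_ratio)])
row_ratio <- as.vector(dummies %*% cpi_ratio)

# Convert the character salary once, then scale it
toy$adj_SALARY <- as.numeric(toy$unadj_SALARY) * row_ratio
print(toy)
```

A row with no dummy set would get a ratio of 0 here, so this only fits data where the one-hot assumption holds.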

r data.table filter based on count of rows satisfying a condition

I am learning data.table and got confused at one point. I need help understanding how the following can be achieved. In my data, I need to filter out those brands which have sales of 0 in the 1st period OR do not have sales > 0 in at least 14 periods. I think I have achieved the 1st part, but I cannot work out the second part: filtering out the brands which do not have sales > 0 in at least 14 periods.
Below are my sample data and the code I have written. Please suggest how I can achieve the second part.
library(data.table)
#### set the seed value
set.seed(9901)
#### create the sample variables for creating the data
group <- sample(1:7,1200,replace = T)
brn <- sample(1:10,1200,replace = T)
period <- rep(101:116,75)
sales <- sample(0:50,1200,replace = T)
#### create the data.table
df1 <- data.table(cbind(group,brn,period,sales))
#### taking the minimum value by group x brand x period
df1_min <- df1[,.(min1 = min(sales,na.rm = T)),by = c('group','brn','period')][order(group,brn,period)]
#### creating the filter
df1_min$fil1 <- ifelse(df1_min$period == 101 & df1_min$min1 == 0,1,0)
Thank you !!
This assumes that the first restriction applies to the dataset-wide minimum period (101), so brn/group pairs whose own first period is later than 101 and has 0 sales are still included:
# 1. brn/group pairs with sales of 0 in the 1st period.
brngroup_zerosales101 = df1[sales == 0 & period == min(period), .(brn, group)]
# 2a. Identify brn/group pairs with <14 positive sale periods
df1[, posSale := ifelse(sales > 0, 1, 0)] # Was the period sale positive?
# 2b. For each brn/group pair, sum posSale and filter posSale < 14
brngroup_sub14 = df1[, .(GroupBrnPosSales = sum(posSale)), by = .(brn, group)][GroupBrnPosSales < 14, .(brn, group)]
# 3. Join the two restrictions
restr = rbindlist(list(brngroup_zerosales101, brngroup_sub14))
df1[, ID := paste(brn, group)] # Create a brn-group ID
restr[, ID := paste(brn, group)] # See above
filtered = df1[!(ID %in% restr[,ID]),]
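The sample data can hold several rows for the same brand and period, so counting rows with positive sales is not the same as counting distinct periods with positive sales. If distinct periods are what is wanted, the count could be sketched as follows. This is a base R illustration on a made-up data frame, grouped by brand only and with a hypothetical threshold of 2 instead of 14; in data.table, uniqueN(period[sales > 0]) inside a by= group should give the same per-group count:

```r
# Toy data (hypothetical): brand A sells in 2 distinct periods, brand B in 1
toy <- data.frame(
  brn    = c("A", "A", "A", "A", "B", "B", "B"),
  period = c(101, 101, 102, 103, 101, 102, 102),
  sales  = c(5, 3, 0, 7, 0, 4, 6),
  stringsAsFactors = FALSE
)
min_periods <- 2  # stand-in for the 14 used in the question

# Number of distinct periods with positive sales, per brand
pos <- toy[toy$sales > 0, ]
n_pos_periods <- tapply(pos$period, pos$brn, function(p) length(unique(p)))

# Brands with a 0-sales row in the first period of the dataset
zero_first <- unique(toy$brn[toy$period == min(toy$period) & toy$sales == 0])

# Keep brands that pass both restrictions
keep_brn <- setdiff(names(n_pos_periods)[n_pos_periods >= min_periods], zero_first)
filtered <- toy[toy$brn %in% keep_brn, ]
print(filtered)
```

Brand A has three rows with positive sales but only two distinct periods, which is the difference between the two ways of counting.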

r - create function to calculate count of filtered rows in dataset

I'm trying to create a helper function that counts the rows in a data.frame that match the given parameters.
getTotalParkeds <- function(place, weekday, entry_hour){
data <- PARKEDS[
PARKEDS$place == place,
PARKEDS$weekday == weekday,
PARKEDS$entry_hour == entry_hour
]
return(nrow(data))
}
Then I'm running this like:
getTotalParkeds('MyPlace', 'mon', 1)
So it is returning this error:
Warning: Error in : Length of logical index vector must be 1 or 11 (the number of columns), not 10000
I'm totally new to R, so I have no idea on what is happening.
Here's the correction you need for your approach -
getTotalParkeds <- function(place, weekday, entry_hour){
data <- PARKEDS[
PARKEDS$place == place &
PARKEDS$weekday == weekday &
PARKEDS$entry_hour == entry_hour,
]
return(nrow(data))
}
Allowing different PARKEDS data, say next month's data:
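getTotalParkeds <- function(input, place, weekday, entry_hour){
  # Refer to the columns as input$...: inside subset(), `place == place`
  # would compare the column with itself and keep every row.
  keep <- input$place == place &
    input$weekday == weekday &
    input$entry_hour == entry_hour
  row.count <- sum(keep, na.rm = TRUE) # number of matching rows
  return(row.count)
}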
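One thing to watch when function arguments share their names with columns: subset() evaluates its condition inside the data frame first, so a bare place == place compares the column with itself. A small sketch on a made-up data frame shows the difference; count_subset and count_index are hypothetical helper names used only for this comparison:

```r
# Toy stand-in for PARKEDS (hypothetical values)
toy <- data.frame(
  place      = c("MyPlace", "Other", "MyPlace"),
  weekday    = c("mon", "mon", "tue"),
  entry_hour = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Inside subset(), both sides of `place == place` resolve to the column
count_subset <- function(input, place) nrow(subset(input, place == place))

# With input$place the right-hand side is the function argument
count_index <- function(input, place) sum(input$place == place, na.rm = TRUE)

print(count_subset(toy, "MyPlace"))  # counts every row
print(count_index(toy, "MyPlace"))   # counts only the matching rows
```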

Find match in the backward direction from a row in dataframe

I have a data frame which stores a hierarchical data, a part of which is as shown below:
print(data)
level Name
1 WRG ASM ENGINE
2 MOUNT CLAMP
3 Carbon Steel
4 Carbon
3 PA
4 F-Fibre
Now say I want to find the immediate parent of row with name "Carbon". I am currently using the below code:
1.Finding the level value for Carbon,the immediate parent will have level
value 1 less than Carbon's.
level_carbon <-data[which(data$Name=="Carbon"),"level"]
2.Finding position of carbon in the data frame
row_num_carbon <-which(data$Name=="Carbon")
3.Getting index of all the possible immediate parents
Parents_Carbon_index <- which(data$level==level_carbon-1 )
4.Index of immediate parent will be less than that of carbon and it will be
closest to carbon in the data frame
Act_Parent_Carbon <- max(Parents_Carbon_index[Parents_Carbon_index < row_num_carbon])
Carbon_Parent <- data[Act_Parent_Carbon ,"Name"]
print(Carbon_Parent)
"Carbon Steel"
The above code serves the purpose, but I am looking for a shorter code which looks cleaner and takes less execution time.
# create an identifier for order
data <- tibble::rowid_to_column(data)
# update following @r2evans' comment:
# below is a base R option to get rowids since rowid_to_column requires tibble
# data$rowid <- seq_len(nrow(data))
# conditions: one level up + before the given row + closest to given row
tail(data$Name[data$level == data$level[data$Name == "Carbon"] - 1 & data$rowid < data$rowid[data$Name == "Carbon"]], 1)
You can create a function to find the parent of a given item:
data$rowid <- seq_len(nrow(data)) # using base R option as @r2evans suggested
find_parent <- function(item) {
tail(data$Name[data$level == data$level[data$Name == item] - 1 & data$rowid < data$rowid[data$Name == item]], 1)
}
find_parent("Carbon")
# [1] "Carbon Steel"
with(data,Name[(s<-which(level==level["Carbon"==Name]-1))[sum(s<which("Carbon"==Name))]])
[1] "Carbon Steel"
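The lookup can also be written without a helper rowid column and with an explicit result for items that have no parent. A minimal base R sketch, rebuilding the sample data from the question; the function name find_parent_in and the choice to return NA for a missing parent are assumptions made here, and item is assumed to occur once in Name:

```r
# Sample hierarchy from the question
data <- data.frame(
  level = c(1, 2, 3, 4, 3, 4),
  Name  = c("WRG ASM ENGINE", "MOUNT CLAMP", "Carbon Steel",
            "Carbon", "PA", "F-Fibre"),
  stringsAsFactors = FALSE
)

find_parent_in <- function(df, item) {
  i <- which(df$Name == item)[1]  # row of the item (first match)
  # rows one level up that appear before the item
  candidates <- which(df$level == df$level[i] - 1 & seq_len(nrow(df)) < i)
  if (length(candidates) == 0) return(NA_character_)  # no parent found
  df$Name[max(candidates)]  # the closest such row above the item
}

print(find_parent_in(data, "Carbon"))
print(find_parent_in(data, "F-Fibre"))
print(find_parent_in(data, "WRG ASM ENGINE"))
```

Passing the data frame as an argument keeps the function independent of a global object named data.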
