I have a large database of data like the sample below. I am trying to create a function that filters the Dept column one department at a time and, from each filtered subset, creates a table for all of that department's managers, like below.
The number of managers per department varies (sometimes 1, sometimes 23 or more), so the table has to be created dynamically for the filtered managers.
I tried filtering and transposing the data and then binding it, but that is not working properly for me, as I am only moderately proficient in dplyr.
df <- data.frame(Dept = c("CA","HR","CA","HR","HR","CA","HR","HR","CA","CA","HR","CA","CA"),
Manager = c("AKASH","MANU","AMAN","SANU","NISH","KAMAL","VEER","SANIL","SAMEER","KANU","NUKUL","KUNAL","RAMIT"),
PF = c("Yes","No","Yes","Yes","Yes","No","No","Yes","No","Yes","Yes","Yes","No"),
Yearlybonus=c(6946,5871,0,7173,2161,3008,0,3025,4323,4196,0,5594,2313),
Quaterlybonus=c(2683,3846,0,2391,6716,6012,5479,3869,3764,0,4632,0,2371),
monthlybonus=c(4453,6466,2811,6845,4377,2617,0,7631,7761,2944,6270,3534,5856))
Using data.table for a large dataset
This solution does not accumulate the transformed data in a list. Instead, it uses a function that filters the data for one department at a time.
library('data.table')
setDT(df)
df[, `:=`(Yearlybonus = as.character(Yearlybonus),
Quaterlybonus = as.character(Quaterlybonus),
monthlybonus = as.character(monthlybonus))]
This function is pure: it receives the data (df) as an argument on every call instead of reading it from the global scope. If memory is tight, you could refactor it to use df from the global scope; the refactored code would look like the commented lines.
# myfun <- function(x) {
myfun <- function(df, x) {
# wide to long
# y <- melt( df[Dept == x],
df <- melt( df[Dept == x],
id.vars = c('Manager'),
measure.vars = c('PF', 'Yearlybonus', 'Quaterlybonus', 'monthlybonus'),
variable.name = 'T' )
# long to wide
# y <- dcast(y, T ~ Manager, value.var = 'value')
df <- dcast(df, T ~ Manager, value.var = 'value')
# add dept column
#y[, Dept := x ]
df[, Dept := x ]
# set column order in memory
#nm <- names(y)
nm <- names(df)
nm <- c('Dept', nm[nm != 'Dept'])
#setcolorder(y, nm)
setcolorder(df, nm)
#return(y[])
return(df[])
}
# create index for speed
setkey(df, Dept)
# myfun(x = 'CA')
myfun(df = df, x = 'CA')
# Dept T AKASH AMAN KAMAL KANU KUNAL RAMIT SAMEER
#1: CA PF Yes Yes No Yes Yes No No
#2: CA Yearlybonus 6946 0 3008 4196 5594 2313 4323
#3: CA Quaterlybonus 2683 0 6012 0 0 2371 3764
#4: CA monthlybonus 4453 2811 2617 2944 3534 5856 7761
# myfun(x = 'HR')
myfun(df = df, x = 'HR')
# Dept T MANU NISH NUKUL SANIL SANU VEER
#1: HR PF No Yes Yes Yes Yes No
#2: HR Yearlybonus 5871 2161 0 3025 7173 0
#3: HR Quaterlybonus 3846 6716 4632 3869 2391 5479
#4: HR monthlybonus 6466 4377 6270 7631 6845 0
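To get one table per department without calling the function by hand, the same melt/dcast idea can be applied over the unique Dept values. This is a minimal, self-contained sketch on a subset of the sample data; `dept_table()` and `tables` are hypothetical names used only for illustration, not part of the answer above.

```r
library(data.table)

# Subset of the question's sample data; the bonus column is stored as
# character so it can share one value column with PF after melting
dt <- data.table(
  Dept        = c("CA", "HR", "CA", "HR"),
  Manager     = c("AKASH", "MANU", "AMAN", "SANU"),
  PF          = c("Yes", "No", "Yes", "Yes"),
  Yearlybonus = as.character(c(6946, 5871, 0, 7173))
)

# Hypothetical helper: builds the wide table for a single department
dept_table <- function(dt, dept) {
  # wide to long
  long <- melt(dt[Dept == dept], id.vars = "Manager",
               measure.vars = c("PF", "Yearlybonus"), variable.name = "T")
  # long to wide, one column per manager
  wide <- dcast(long, T ~ Manager, value.var = "value")
  wide[, Dept := dept]
  setcolorder(wide, c("Dept", setdiff(names(wide), "Dept")))
  wide[]
}

# One table per department, stored in a list named by department
depts  <- unique(dt$Dept)
tables <- setNames(lapply(depts, function(d) dept_table(dt, d)), depts)
```

Each element can then be accessed by name, for example `tables[["HR"]]`. If a single combined table is preferred, `rbindlist(tables, fill = TRUE)` would stack them and fill the manager columns that do not exist in a department with NA.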
How about this:
library(tidyr)
library(dplyr)
library(purrr)
map(unique(df$Dept), ~df %>% filter(Dept == .x) %>%
mutate(across(Yearlybonus:monthlybonus, ~as.character(.x))) %>%
pivot_longer(PF:monthlybonus, names_to="T", values_to="vals") %>%
pivot_wider(names_from="Manager", values_from="vals"))
# [[1]]
# # A tibble: 4 x 9
# Dept T AKASH AMAN KAMAL SAMEER KANU KUNAL RAMIT
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 CA PF Yes Yes No No Yes Yes No
# 2 CA Yearlybonus 6946 0 3008 4323 4196 5594 2313
# 3 CA Quaterlybonus 2683 0 6012 3764 0 0 2371
# 4 CA monthlybonus 4453 2811 2617 7761 2944 3534 5856
#
# [[2]]
# # A tibble: 4 x 8
# Dept T MANU SANU NISH VEER SANIL NUKUL
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 HR PF No Yes Yes No Yes Yes
# 2 HR Yearlybonus 5871 7173 2161 0 3025 0
# 3 HR Quaterlybonus 3846 2391 6716 5479 3869 4632
# 4 HR monthlybonus 6466 6845 4377 0 7631 6270
I want to group by district, summing the 'incoming' values across quarters and taking the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but only in three steps, and I think that approach is inefficient and error prone given the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes that the dataset only has 3 quarters (not 4) and that the rows are ordered by quarter within each district. If that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
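Since the question mentions 8 stock-type variables among 41 columns, the same one-step summary can be generalised with across(). This is a sketch under assumptions: `stock_vars` and `flow_vars` are hypothetical character vectors that you would fill with your real column names, and the data are sorted by quarter first so that last() really picks the final quarter.

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Assumption: these vectors list the stock-type and flow-type columns
stock_vars <- c("stock")
flow_vars  <- c("incoming")

res <- df %>%
  arrange(district, quarter) %>%                  # make "last" mean "latest quarter"
  group_by(district) %>%
  summarise(across(all_of(stock_vars), last),     # stocks: value in the last quarter
            across(all_of(flow_vars), sum),       # flows: sum over quarters
            .groups = "drop")
```

With the sample data, `res` has one row per district with the last-quarter stock and the summed incoming values, matching the expected result in the question.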
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function that can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but instead of using a function to pick the value, it just subsets with []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I am trying to avoid writing a long nested ifelse statement in Excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the countycode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
We can use a join
library(dplyr)
library(tidyr)
df2 <- df2 %>%
left_join(Abbre %>%
separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match
df2$COUNTY_NAME <- with(
df1,
COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
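The same lookup can also be written with a named vector, which may read more naturally than match(). This is a minimal sketch that assumes the first table is a data frame df1 with separate Abbre and COUNTY_NAME columns (as in the match answer above); `lookup` is a hypothetical name.

```r
# Assumed layout of the two tables from the question
df1 <- data.frame(
  Abbre       = c("AD", "AS", "BE", "CH", "CM", "CR"),
  COUNTY_NAME = c("Adams", "Asotin", "Benton", "Chelan", "Clallam", "Clark"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  CountyCode = c("WM", "AS", "WM", "WM", "SJ"),
  Votes      = c(97, 14, 163, 144, 21),
  stringsAsFactors = FALSE
)

# Named vector: names are the abbreviations, values are the full names
lookup <- setNames(df1$COUNTY_NAME, df1$Abbre)

# Index by code; codes missing from the lookup come back as NA
df2$COUNTY_NAME <- unname(lookup[df2$CountyCode])
```

The codes must be character (not factor) for this indexing to work by name, which is why stringsAsFactors = FALSE is set explicitly.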
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21
Here's a brief look at my data
X name sex X1880 X1881
1 1 Mary F 7065 6919
2 2 Anna F 2604 2698
3 3 Emma F 2003 2034
4 4 Elizabeth F 1939 1852
5 5 Minnie F 1746 1653
Each "X----" represents a year (up to 2010), the column "name" represents a unique name for a child, and so the corresponding number between any name and year is the number of children born in year "X---" with the specified name (for example, there were 7065 Marys born in 1880).
I would like to loop through columns covering the years 1931 to 2010, find the total number of children born in that year, and then find the total number of children born in that year whose name begins with each letter of the alphabet. Finally, I would like to get the percent of children born in each year whose name begins with each letter, and store it to a list so I can plot trend lines for all letters/all years on the same graph.
Here is the code I have
allnames <- read.csv("SSA-longtail-names.csv")
girls <- subset(allnames, allnames$sex=="F")
year_columns <- as.vector(names(girls)[54:134])
percs <- list()
years <- length(year_columns)
letters <- length(LETTERS)
for (i in range(1:years)){
total = sum(girls[year_columns[i]])
for (n in range(1:letters)){
l <- toString(LETTERS[n])
sub <- girls[(grep(l, girls$name)),year_columns[i]]
sub_total <- sum(sub[year_columns[i]])
percent <- (sub_total / total) * 100
percs <- append(percs, percent)
}
}
But the for loops only go through 8 iterations, and the list percs (which is supposed to store the calculated percentages) is full of NAs. Can anyone suggest a way to fix these loops, or perhaps an even easier way to accomplish this task?
Here is an approach using dplyr, tidyr, and stringr to make a long data table by pivoting your year columns.
library(dplyr)
library(tidyr)
library(stringr)
data2 <- data %>%
pivot_longer(cols = c(-X, -name, -sex), names_to = "year", values_to = "births") %>%
drop_na() %>% # remove NA rows
mutate(year = as.integer(str_remove(year, "X")),
first_letter = str_sub(name, start = 1, end = 1)) %>%
filter(year >= 1931 & year <= 2010)
Now you can do something like:
data3 <- data2 %>%
group_by(first_letter, year) %>%
summarize(total = sum(births))
This gives you a data.frame of three columns:
first_letter year total
A 1880 17972
A 1881 16426
# etc.
Now you can do some plotting, for example with ggplot2
library(ggplot2)
# this only looks at the English vowels to make a manageable example
ggplot(data = data3 %>% filter(first_letter %in% c("A", "E", "I", "O", "U")),
aes(x = year, y = total, color = first_letter)) +
geom_line()
As mentioned, consider reshaping data to long format (the better format in data analytics for merging, cleaning, aggregating, modeling, and plotting).
Reshape
girls_long <- reshape(girls, varying = names(girls)[4:ncol(girls)], times = names(girls)[4:ncol(girls)],
idvar = c("X", "name", "sex"),
v.names = "count", timevar = "year", ids=NULL,
new.row.names = 1:1E5, direction = "long")
girls_long$year <- as.integer(gsub("X", "", girls_long$year))
girls_long
# X name sex year count
# 1 1 Mary FALSE 1880 7065
# 2 2 Anna FALSE 1880 2604
# 3 3 Emma FALSE 1880 2003
# 4 4 Elizabeth FALSE 1880 1939
# 5 5 Minnie FALSE 1880 1746
# 6 1 Mary FALSE 1881 6919
# 7 2 Anna FALSE 1881 2698
# 8 3 Emma FALSE 1881 2034
# 9 4 Elizabeth FALSE 1881 1852
# 10 5 Minnie FALSE 1881 1653
Aggregations
# Total number of children born in that year
total_df <- aggregate(count ~ year, girls_long, FUN=sum)
total_df
# year count
# 1 1880 15357
# 2 1881 15156
# Total number of children born in that year whose name begins with each letter of the alphabet
girls_long$name_letter <- substring(girls_long$name, 1, 1)
girls_agg <- aggregate(count ~ name_letter + year, girls_long, FUN=sum)
girls_agg
# name_letter year count
# 1 A 1880 2604
# 2 E 1880 3942
# 3 M 1880 8811
# 4 A 1881 2698
# 5 E 1881 3886
# 6 M 1881 8572
# Percent of children born in each year whose name begins with each letter
girls_agg$percent <- with(girls_agg, count / ave(count, year, FUN=sum))
girls_agg
# name_letter year count percent
# 1 A 1880 2604 0.1695644
# 2 E 1880 3942 0.2566908
# 3 M 1880 8811 0.5737449
# 4 A 1881 2698 0.1780153
# 5 E 1881 3886 0.2564001
# 6 M 1881 8572 0.5655846
I've split the solution into the three parts you describe. If you are only after the percentages, you can ignore the first part (total) and combine the second and third:
library(dplyr)
library(stringr)
library(tidyr)
data <- tibble(name = c('Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'),
sex = rep('F', 5),
X1880 = c(7065, 2604, 2003, 1939, 1746),
X1881 = c(6919, 2698, 2034, 1852, 1653))
total <- data %>%
summarise(across(X1880:X1881, sum)) %>%
pivot_longer(everything(), names_to = 'year', values_to = 'total')
total
# year total
# <chr> <dbl>
# 1 X1880 15357
# 2 X1881 15156
totalPerLetter <- data %>%
mutate(letter = str_extract(name, '^.')) %>%
select(letter, starts_with('X')) %>%
pivot_longer(-letter, names_to = 'year', values_to = 'count') %>%
group_by(letter, year) %>%
mutate(count = sum(count)) %>%
distinct()
totalPerLetter
# letter year count
# <chr> <chr> <dbl>
# 1 M X1880 8811
# 2 M X1881 8572
# 3 A X1880 2604
# 4 A X1881 2698
# 5 E X1880 3942
# 6 E X1881 3886
pctPerLetter <- totalPerLetter %>%
group_by(year) %>%
mutate(total = sum(count)) %>%
ungroup() %>%
mutate(percent = count/(total/100))
pctPerLetter
# letter year count total percent
# <chr> <chr> <dbl> <dbl> <dbl>
# 1 M X1880 8811 15357 57.4
# 2 M X1881 8572 15156 56.6
# 3 A X1880 2604 15357 17.0
# 4 A X1881 2698 15156 17.8
# 5 E X1880 3942 15357 25.7
# 6 E X1881 3886 15156 25.6
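For completeness, the loop from the question can also be repaired directly. Two things go wrong in it: range(1:years) returns only the minimum and maximum (two values), so the loops do not visit every index, and sub is already a plain vector, so indexing it by column name yields NA. In addition, grep(l, girls$name) matches the letter anywhere in the name rather than at the start. A minimal sketch on the five sample rows, storing the percentages in a matrix instead of a list (my choice, not the asker's):

```r
# Sample rows from the question; the real data has many more year columns
girls <- data.frame(
  name  = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie"),
  X1880 = c(7065, 2604, 2003, 1939, 1746),
  X1881 = c(6919, 2698, 2034, 1852, 1653),
  stringsAsFactors = FALSE
)
year_columns <- c("X1880", "X1881")

# One row per letter, one column per year
percs <- matrix(0, nrow = length(LETTERS), ncol = length(year_columns),
                dimnames = list(LETTERS, year_columns))

for (i in seq_along(year_columns)) {      # seq_along() visits every index
  yr    <- year_columns[i]
  total <- sum(girls[[yr]])
  for (l in LETTERS) {
    rows <- startsWith(girls$name, l)     # match the first letter only
    percs[l, yr] <- 100 * sum(girls[rows, yr]) / total
  }
}
```

Each column of `percs` sums to 100, and a single row such as `percs["M", ]` gives the trend for one letter across years, which can be passed to a plotting function such as matplot().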
Below is a subset of my dataset. For each id, with rows sorted from earliest to latest date, how do I create a new column that flags the row immediately before an "LTD" status? The purpose is to identify the diagnosis (row) that precedes the "LTD" status for that id. Thanks in advance!
Dataset:
id <- c(123,123,123,123,123,321,321)
diag <- c("injury1", "injury2" , "cancer","injury4","cancer", "injury5", "cancer")
date <- as.Date(c('2008-11-1','2009-3-25','2010-3-14',"2010-10-14","2010-11-14", '2015-3-14', '2015-4-15'))
status <- (c("STD", "STD", "LTD", "STD","LTD","STD", "LTD"))
data <- data.frame(id, diag, date, status)
Result (N for no, Y for yes):
123 injury1 2008-11-01 STD N
123 injury2 2009-03-25 STD Y
123 cancer 2010-03-14 LTD NA
123 injury4 2010-10-14 STD Y
123 Cancer 2010-11-14 LTD NA
321 injury5 2015-03-14 STD Y
321 cancer 2015-04-15 LTD NA
We can convert date to a Date object, arrange by date, group_by id, and use case_when with the conditions.
library(dplyr)
data %>%
mutate(date = as.Date(date)) %>%
arrange(date) %>%
group_by(id) %>%
mutate(result = case_when(lead(status == "LTD") ~"Y",
status == "LTD" ~ NA_character_,
TRUE~ "N"))
# id diag date status result
# <dbl> <fct> <date> <fct> <chr>
#1 123 injury1 2008-11-01 STD N
#2 123 injury2 2009-03-25 STD Y
#3 123 cancer 2010-03-14 LTD NA
#4 123 injury4 2010-10-14 STD Y
#5 123 cancer 2010-11-14 LTD NA
#6 321 injury5 2015-03-14 STD Y
#7 321 cancer 2015-04-15 LTD NA
Using by() and step-by-step assignment.
d <- data[order(data$date), ] # sort once so the rows and the grouping ids stay aligned
do.call(rbind, by(d, d$id, function(x) {
x$diag <- "N"
x$diag[which(x$status == "LTD") - 1] <- "Y"
x$diag[x$status == "LTD"] <- NA
return(x[c(1, 3:4, 2)])
}))
# id date status diag
# 123.1 123 2008-11-01 STD N
# 123.2 123 2009-03-25 STD Y
# 123.3 123 2010-03-14 LTD <NA>
# 123.4 123 2010-10-14 STD Y
# 123.5 123 2010-11-14 LTD <NA>
# 321.6 321 2015-03-14 STD Y
# 321.7 321 2015-04-15 LTD <NA>
I have two tables, one with property listings and another with contacts made for a property (i.e. if someone is interested in the property, they will "contact" the owner).
Sample "listings" table below:
listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"), listing_date = c("01/03/2015", "14/03/2015", "30/03/2015", "07/04/2015", "18/04/2015"))
listings$listing_date <- as.Date(listings$listing_date, "%d/%m/%Y")
listings
# id city listing_date
#1 6174 A 2015-03-01
#2 2175 B 2015-03-14
#3 9176 B 2015-03-30
#4 4176 B 2015-04-07
#5 9177 A 2015-04-18
Sample "contacts" table below:
contacts <- data.frame (id = c ("6174", "6174", "6174", "6174", "2175", "2175", "2175", "9176", "9176", "4176", "4176", "9177"), contact_date = c("13/03/2015","14/04/2015", "27/03/2015", "13/04/2015", "15/03/2015", "16/03/2015", "17/03/2015", "30/03/2015", "01/06/2015", "08/05/2015", "09/05/2015", "23/04/2015" ))
contacts$contact_date <- as.Date(contacts$contact_date, "%d/%m/%Y")
contacts
# id contact_date
#1 6174 2015-03-13
#2 6174 2015-04-14
#3 6174 2015-03-27
#4 6174 2015-04-13
#5 2175 2015-03-15
#6 2175 2015-03-16
#7 2175 2015-03-17
#8 9176 2015-03-30
#9 9176 2015-06-01
#10 4176 2015-05-08
#11 4176 2015-05-09
#12 9177 2015-04-23
Problem
1. I need to count the number of contacts made for a property within 'x' days of listing. The output should be a new column added to "listings" with # contacts:
Sample ('x' = 30 days)
listings
# id city listing_date ngs
#1 6174 A 2015-03-01 2
#2 2175 B 2015-03-14 3
#3 9176 B 2015-03-30 1
#4 4176 B 2015-04-07 0
#5 9177 A 2015-04-18 1
I have done this with a for loop, but it is horribly slow on live data:
n <- nrow(listings)
mat <- vector ("integer", n)
for (i in 1:n) {
mat[i] <- nrow (contacts[contacts$id==listings[i,"id"] & as.numeric (contacts$contact_date - listings[i,"listing_date"]) <=30,])
}
listings$ngs <- mat
2. I need to prepare a histogram of # contacts vs days with 'x' as a variable, using the manipulate function. I can't figure out a way to do all this inside the manipulate function.
Here's a possible solution using data.table rolling joins
library(data.table)
# key `listings` by the proper columns in order to perform the binary join
setkey(setDT(listings), id, listing_date)
# Perform a binary rolling join while extracting the matched indices and counting them
indx <- data.table(listings[contacts, roll = 30, which = TRUE])[, .N, by = V1]
# Joining back to `listings` by proper rows while assigning the counts by reference
listings[indx$V1, ngs := indx$N]
# id city listing_date ngs
# 1: 2175 B 2015-03-14 3
# 2: 4176 B 2015-04-07 NA
# 3: 6174 A 2015-03-01 2
# 4: 9176 B 2015-03-30 1
# 5: 9177 A 2015-04-18 1
I'm not sure if your actual id values are factor, but I'll start by making those numeric. Using them as factors will cause you problems:
listings$id <- as.numeric(as.character(listings$id))
contacts$id <- as.numeric(as.character(contacts$id))
Then, the strategy is to calculate the "days since listing" value for each contact and add this to your contacts data.frame. Then, aggregate this new data.frame (in your example, sum of contacts within 30 days), and then merge the resulting count back into your original data.
contacts$ngs <- contacts$contact_date - listings$listing_date[match(contacts$id, listings$id)]
a <- aggregate(ngs ~ id, data = contacts, FUN = function(x) sum(x <= 30))
merge(listings, a)
# id city listing_date ngs
# 1 2175 B 2015-03-14 3
# 2 4176 B 2015-04-07 0
# 3 6174 A 2015-03-01 2
# 4 9176 B 2015-03-30 1
# 5 9177 A 2015-04-18 1
Or:
indx <- match(contacts$id, listings$id)
days_since <- contacts$contact_date - listings$listing_date[indx]
n <- with(contacts[days_since <= 30, ], tapply(id, id, length))
listings$n <- n[match(listings$id, names(n))]
listings$n[is.na(listings$n)] <- 0 # listings with no contact inside the window
It's similar to Thomas' answer but utilizes tapply and match instead of aggregate and merge.
You could use the dplyr package. First merge the data:
all.data <- merge(contacts,listings,by = "id")
Set a target number of days:
number.of.days <- 30
Then gather the data by ID (group_by), exclude the results that are not within the time frame (filter) and count the number of occurrences/rows (summarise).
result <- all.data %>% group_by(id) %>% filter(contact_date <= listing_date + number.of.days) %>% summarise(count.of.contacts = length(id))
I think there are a number of ways this could potentially be solved, but I have found dplyr to be very helpful in a lot of circumstances.
EDIT:
Sorry, I should have thought about that a little more. Does this work?
result <- all.data %>% group_by(id,city,listing_date) %>% summarise(ngs = length(id[which(contact_date < listing_date + number.of.days)]))
I don't think zero results can be passed sensibly through the filter stage (understandably, the goal is usually the opposite). I'm not too sure what sort of impact the 'which' component will have on processing time, likely to be slower than using the 'filter' function but might not matter.
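One way to keep listings with zero contacts without going through filter() is to start from listings, left-join the contacts, and count with sum() over a logical condition. This is a minimal sketch on the sample data; the inclusive 0 to x day window and the variable name `x` are assumptions about what "within x days" means.

```r
library(dplyr)

listings <- data.frame(
  id   = c("6174", "2175", "9176", "4176", "9177"),
  city = c("A", "B", "B", "B", "A"),
  listing_date = as.Date(c("2015-03-01", "2015-03-14", "2015-03-30",
                           "2015-04-07", "2015-04-18")),
  stringsAsFactors = FALSE
)
contacts <- data.frame(
  id = c("6174", "6174", "6174", "6174", "2175", "2175", "2175",
         "9176", "9176", "4176", "4176", "9177"),
  contact_date = as.Date(c("2015-03-13", "2015-04-14", "2015-03-27", "2015-04-13",
                           "2015-03-15", "2015-03-16", "2015-03-17", "2015-03-30",
                           "2015-06-01", "2015-05-08", "2015-05-09", "2015-04-23")),
  stringsAsFactors = FALSE
)

x <- 30  # window length in days (assumed inclusive)

result <- listings %>%
  left_join(contacts, by = "id") %>%        # keeps listings without any contact
  group_by(id, city, listing_date) %>%
  summarise(ngs = sum(contact_date >= listing_date &
                        contact_date <= listing_date + x,
                      na.rm = TRUE),        # TRUE counts as 1, so zeros survive
            .groups = "drop")
```

Because the count is a sum of TRUE/FALSE values, a listing whose contacts all fall outside the window (such as id 4176 here) gets 0 rather than being dropped. The result is ordered by id, not by the original row order of listings.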
Using dplyr for your first problem:
left_join(contacts, listings, by = c("id" = "id")) %>%
filter(abs(listing_date - contact_date) < 30) %>%
group_by(id) %>% summarise(cnt = n()) %>%
right_join(listings)
And the output is:
id cnt city listing_date
1 6174 2 A 2015-03-01
2 2175 3 B 2015-03-14
3 9176 1 B 2015-03-30
4 4176 NA B 2015-04-07
5 9177 1 A 2015-04-18
I'm not sure I understand your second question well enough to answer it.