I have a dataset that looks like (A) and I'm trying to get (B):
#(A)
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday)
#(B)
person <- c('Ann', 'Sally', 'Ryan')
A <- c(1, 1, 1)
B <- c(1, 0, 1)
C <- c(0, 1, 0)
D <- c(1, 1, 1)
birthday <- c('1990-10-10', '1950-04-02', '1985-01-01')
data <- data.frame(person, A, B, C, D, birthday)
Basically, I have a sign-up list of events and can see which people attended each one. I want to get a list of all the unique people with columns for which events they did/didn't attend. I also got profile data from some of the events, but some had more data than others, so I also want to keep the most complete data (e.g. Ryan's birthday is missing from event D but available from event B).
I've tried looking up many different things but get confused between whether I should be looking at reshaping, vs. dcast, vs. spread/gather... new to R so any help is appreciated!
EDIT: Additional question: instead of indicating 1/0 for whether someone went to an event, if multiple events were in the same category, how would you count how many times someone went to that category of event? E.g., I would have events called A1, A2, and A3 in the dataset as well. The final table would still have a column called A, but instead of just 1/0, it would say 0 if the person attended no A events, and 1, 2, or 3 if the person attended 1, 2, or 3 A events.
A data.table option
library(data.table)
dcast(
  setDT(data),
  # fill each person's birthday from any row where it is known
  person + na.omit(birthday)[match(person, person[!is.na(birthday)])] ~ event,
  fun.aggregate = length
)
gives
person birthday A B C D
1: Ann 1990-10-10 1 1 0 1
2: Ryan 1985-01-01 1 1 0 1
3: Sally 1950-04-02 1 0 1 1
A base R option using reshape
reshape(
transform(
data,
birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
cnt = 1
),
direction = "wide",
idvar = c("person", "birthday"),
timevar = "event"
)
gives
person birthday cnt.A cnt.B cnt.C cnt.D
1 Ann 1990-10-10 1 1 NA 1
2 Sally 1950-04-02 1 NA 1 1
3 Ryan 1985-01-01 1 1 NA 1
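The reshape() result above marks events a person did not attend with NA, whereas the question asks for 0. A minimal sketch of one way to finish the job, assuming the sample data (A) from the question; the names wide and cnt_cols are only illustrative:

```r
# Sample data (A) from the question
event <- c('A', 'A', 'A', 'B', 'B', 'C', 'D', 'D', 'D')
person <- c('Ann', 'Sally', 'Ryan', 'Ann', 'Ryan', 'Sally', 'Ann', 'Sally', 'Ryan')
birthday <- c('1990-10-10', NA, NA, NA, '1985-01-01', NA, '1990-10-10', '1950-04-02', NA)
data <- data.frame(event, person, birthday, stringsAsFactors = FALSE)

# Same reshape() call as in the answer above
wide <- reshape(
  transform(
    data,
    birthday = na.omit(birthday)[match(person, person[!is.na(birthday)])],
    cnt = 1
  ),
  direction = "wide",
  idvar = c("person", "birthday"),
  timevar = "event"
)

# Replace NA (event not attended) with 0 in the count columns only
cnt_cols <- grep("^cnt\\.", names(wide), value = TRUE)
for (col in cnt_cols) {
  wide[[col]][is.na(wide[[col]])] <- 0
}

# Optionally drop the "cnt." prefix so the columns are named A, B, C, D
names(wide) <- sub("^cnt\\.", "", names(wide))
wide
```

Only the count columns are touched, so a birthday that is genuinely unknown would stay NA.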
First, isolate the birthdays, which are not stored cleanly in your table (they are missing on some rows); then reshape, and finally merge the birthdays back in.
Using the package reshape2:
birthdays <- unique(data[!is.na(data$birthday),c("person","birthday")])
reshaped <- reshape2::dcast(data,person ~ event, value.var = "event",fun.aggregate = length)
final <- merge(reshaped,birthdays)
Explanation: I just told reshape2::dcast to put person into rows and event into columns, and to count every occurrence of event (the aggregation function length does the counting).
EDIT: for your additional question, it works the same way; just apply substr() to the event variable so that A1, A2 and A3 all collapse to A:
reshaped <- reshape2::dcast(data,person ~ substr(event,1,1), value.var = "event",fun.aggregate = length)
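To illustrate the category counting without reshape2, here is a rough base R sketch. The data frame events and its values are made up for the example (the question does not show data with A1, A2, A3), and the category is taken to be the event name with trailing digits removed:

```r
# Hypothetical sign-up data with several events per category
events <- data.frame(
  event  = c("A1", "A2", "A3", "A1", "B1", "A2"),
  person = c("Ann", "Ann", "Ann", "Ryan", "Ryan", "Sally"),
  stringsAsFactors = FALSE
)

# Assumption: the category is the event name without its trailing digits
events$category <- sub("[0-9]+$", "", events$event)

# Cross-tabulate: one row per person, one column per category,
# each cell holding the number of events attended in that category
counts <- table(events$person, events$category)
counts

# Convert to a regular data frame if needed
counts_df <- as.data.frame.matrix(counts)
counts_df$person <- rownames(counts_df)
```

Stripping digits with sub() rather than taking substr(event, 1, 1) also copes with category names longer than one character.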
I have two data frames that hold sales data from fruit stores.
The 1st data frame has sales data from 'Store A',
and the 2nd data frame has the combined data from 'Store A + Store B'.
StoreA = data.frame(
Fruits = c('Apple', 'Banana', 'Blueberry'),
Customer = c('John', 'Peter', 'Jenny'),
Quantity = c(2, 3, 1)
)
Total = data.frame(
Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
Customer = c('Jenny' , 'John', 'Peter', 'John', 'Peter'),
Quantity = c(4, 7, 3, 5, 3)
)
StoreA
Total
I wish to subtract the sales data of 'StoreA' from 'Total' to get sales data for 'StoreB'.
At the end, I wish to have something like
Great Question! There is a simple and graceful way of achieving exactly what you want.
The title to this question is: "subtracting values in a data frame with conditons() from another data frame"
This subtraction can be accomplished just like the title says. But there is a better way than using subtraction. Turning a subtraction problem into an addition problem is often the easiest way of solving a problem.
To make this into an addition problem, just convert one of the data frames (StoreA$Quantity) into negative values. Only convert the Quantity variable into negative values. And then rename the other data frame (Total) into StoreB.
Once that is done, it's easy to finish. Just use the join function with the two data frames (StoreA & StoreB). Doing that brings the negative and positive values together and the data is more understandable. When there are the same things with positive and negative values, then it's obvious these things need to be combined.
To combine those similar items, use the group_by() function and pipe it into a summarize() function. Doing the coding this way makes the code easy to read and easy to understand. The code can almost be read like a book.
Create data frames:
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
StoreB = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
Convert StoreA$Quantity to negative values:
StoreA_ <- StoreA
StoreA_$Quantity <- StoreA$Quantity * -1
StoreA_
StoreA_ now looks like this:
Fruits Customer Quantity
<fct> <fct> <dbl>
Apple John -2
Banana Peter -3
Blueberry Jenny -1
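Now combine StoreA_ and StoreB. Use the full_join() function from dplyr to join the two stores. Because the negated quantities never match the positive ones, every row from both data frames is kept:
library(dplyr)
Total <- full_join(StoreA_, StoreB, by = c("Fruits", "Customer", "Quantity"))
Total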
The last thing is accomplished using the group_by function. This will combine the positive and negative values together.
Total %>% group_by(Fruits, Customer) %>% summarize(s = sum(Quantity))
It's Done! The output is shown at this link:
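For reference, the same negate-then-sum idea can be sketched in base R without any packages. This is only an illustration using the data from the question; StoreA_neg and stacked are names made up here:

```r
StoreA <- data.frame(
  Fruits = c('Apple', 'Banana', 'Blueberry'),
  Customer = c('John', 'Peter', 'Jenny'),
  Quantity = c(2, 3, 1),
  stringsAsFactors = FALSE
)
Total <- data.frame(
  Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
  Customer = c('Jenny', 'John', 'Peter', 'John', 'Peter'),
  Quantity = c(4, 7, 3, 5, 3),
  stringsAsFactors = FALSE
)

# Turn the Store A sales into negative quantities
StoreA_neg <- transform(StoreA, Quantity = -Quantity)

# Stack both tables, then sum per fruit/customer pair:
# Total - StoreA is left over for each pair
stacked <- rbind(StoreA_neg, Total)
StoreB <- aggregate(Quantity ~ Fruits + Customer, data = stacked, FUN = sum)
StoreB
```

Pairs that only Store A sold would show up with a quantity of 0 (Banana/Peter here); whether to drop those rows depends on what the final table should contain.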
You could do a full join first, then rename the columns, fill the missing values resulting from the join and then compute the difference.
library(tidyverse)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
Total = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
full_join(StoreA %>%
rename(Qty_A = Quantity),
Total %>%
rename(Qty_Total = Quantity), by = c("Fruits", "Customer")) %>%
# fill NAs with zero
replace_na(list(Qty_A = 0)) %>%
# compute the difference
mutate(Qty_B = Qty_Total - Qty_A)
#> Fruits Customer Qty_A Qty_Total Qty_B
#> 1 Apple John 2 7 5
#> 2 Banana Peter 3 3 0
#> 3 Blueberry Jenny 1 4 3
#> 4 Blueberry John 0 5 5
#> 5 Pineapple Peter 0 3 3
Created on 2020-09-28 by the reprex package (v0.3.0)
I am trying to achieve this using a while loop, but it's too complex for me; there must be a way using the dplyr library.
I have a warehouse with:
product_id amount
1 1001 1
2 4911 100
3 4014 32
I am writing a function that takes a product_id and an amount and removes the required amount from the warehouse. If the product_id does not exist, or the amount is higher than what is available, it should return an error.
So, if I ran the function:
remove_warehouse(1001,1)
Result should be:
product_id amount
1 4911 100
2 4014 32
And if I run either:
remove_warehouse(240,1)
or
remove_warehouse(4014,60)
I should get a generic error "not enough amount or product not present"
One way of writing the function could be
remove_warehouse <- function(df, product_id, amount) {
  # logical index of the row(s) holding this product
  id = df$product_id == product_id
  if (any(id)) {
    amount_base = df$amount[id]
  } else {
    stop("No id present")
  }
  if (amount > amount_base) {
    stop("No sufficient amount")
  } else {
    df$amount[id] = df$amount[id] - amount
  }
  df
}
remove_warehouse(df, 4911, 90)
# product_id amount
#1 1001 1
#2 4911 10
#3 4014 32
remove_warehouse(df, 1234, 12)
#Error in remove_warehouse(df, 1234, 12) : No id present
remove_warehouse(df, 1001, 100)
#Error in remove_warehouse(df, 1001, 100) : No sufficient amount
This assumes each product_id appears in only one row of your df.
data
df <- structure(list(product_id = c(1001L, 4911L, 4014L), amount = c(1L,
100L, 32L)), .Names = c("product_id", "amount"), class = "data.frame",
row.names = c("1", "2", "3"))
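The function above keeps a row with amount 0 and raises two different errors, while the question shows the emptied product disappearing and asks for one generic message. A possible variant is sketched below; remove_warehouse2 is a made-up name, and dropping rows once the amount reaches zero is an assumption read off the expected output in the question:

```r
warehouse <- data.frame(
  product_id = c(1001L, 4911L, 4014L),
  amount = c(1L, 100L, 32L)
)

remove_warehouse2 <- function(df, product_id, amount) {
  idx <- which(df$product_id == product_id)
  # One generic error for both failure cases, as requested in the question
  if (length(idx) != 1 || df$amount[idx] < amount) {
    stop("not enough amount or product not present")
  }
  df$amount[idx] <- df$amount[idx] - amount
  # Assumption: products whose stock reaches zero are removed from the table
  df <- df[df$amount > 0, , drop = FALSE]
  rownames(df) <- NULL
  df
}

remove_warehouse2(warehouse, 1001, 1)
```

Like the original, this returns a modified copy, so the caller has to assign the result back (warehouse <- remove_warehouse2(warehouse, 1001, 1)) for the change to persist.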
I have a data frame in which I have different duplicates of ID and dates. I just want to detect the duplicates of one column that are also in the other so I can say:
1. Remove the rows with a duplicate id, a duplicate datee and a missing T (the second record in this table).
2. And then say: if there is a duplicate id and a duplicate date, choose the row with T=="high".
id<-c("a", "a", "a", "a", "b", "c")
datee<-c("12/02/10", "12/02/10", "12/02/10","10/03/11", "10/04/18","1/04/18" )
T<-c("high", NA, "low","high", "low", "medium")
mydata<-data.frame(id, datee, T)
It looks like this:
id datee T
a 12/02/10 high
a 12/02/10 <NA>
a 12/02/10 low
a 10/03/11 high
b 10/04/18 low
c 1/04/18 medium
A step by step solution
Step 1 - Remove missing
mydata<-mydata[!is.na(mydata[,3]),]
Step 2 - Identify duplicated rows on ID
dup_rows_ID<-duplicated(mydata[,c(1)],fromLast = TRUE) | duplicated(mydata[,c(1)],fromLast = FALSE)
mydata_dup<-mydata[dup_rows_ID,]
Step 3 - Identify duplicated rows on ID and datee
dup_rows_ID_datee<-duplicated(mydata_dup[,c(1,2)],fromLast = TRUE) | duplicated(mydata_dup[,c(1,2)],fromLast = FALSE)
Step 4 - Among the rows duplicated on ID and datee, keep only T=="high"
mydata_dup2<-mydata_dup[!dup_rows_ID_datee | mydata_dup$T=="high",]
Your output
rbind(mydata_dup[rownames(mydata_dup) %in% rownames(mydata_dup2),],
      mydata[!dup_rows_ID,])
id datee T
1 a 12/02/10 high
4 a 10/03/11 high
5 b 10/04/18 low
6 c 1/04/18 medium
For ID==a, two dates remain with T=="high" (12/02/10 and 10/03/11); you have to decide whether to keep the one with the later or the earlier datee.
You can do like this first:
is_duplicate <- lapply(X = mydata, FUN = duplicated, incomparables = FALSE)
is_na <- lapply(X = mydata, FUN = is.na)
and use the resulting lists to, for example, remove the rows with duplicate id, duplicate datee and missing T like this:
drop_idx <- is_duplicate$id & is_duplicate$datee & is_na$T
mydata[!drop_idx, ]
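Both rules from the question can also be applied in a single pass over the id/datee groups. The following is a rough base R sketch on the sample data; the helper names key, n_per_key, step1 and result are made up here:

```r
mydata <- data.frame(
  id = c("a", "a", "a", "a", "b", "c"),
  datee = c("12/02/10", "12/02/10", "12/02/10", "10/03/11", "10/04/18", "1/04/18"),
  T = c("high", NA, "low", "high", "low", "medium"),
  stringsAsFactors = FALSE
)

# Number of rows sharing the same id and datee
key <- paste(mydata$id, mydata$datee)
n_per_key <- ave(seq_along(key), key, FUN = length)

# Rule 1: within duplicated id/datee groups, drop rows where T is missing
step1 <- mydata[!(n_per_key > 1 & is.na(mydata$T)), ]

# Rule 2: where duplicates remain, keep only the row with T == "high"
key1 <- paste(step1$id, step1$datee)
n1 <- ave(seq_along(key1), key1, FUN = length)
result <- step1[n1 == 1 | step1$T == "high", ]
result
```

Note that a duplicated group without any "high" row would disappear entirely under rule 2; whether that is acceptable depends on the data.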
Description of the data
My data.frame represents the salary of people living in different cities (city) in different countries (country). city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.
Goal
For each country and each city, I want to consider a window of size WindowSize of salaries and calculate D = sum(arg1)/sum(arg2) over this window. Then the window slides by WindowStep and D should be recalculated, and so on. For example, let's consider WindowSize = 1000 and WindowStep = 10. Within each country and within each city, I would like to get D for the range of salaries between 0 and 1000, for the range between 10 and 1010, for the range between 20 and 1020, etc.
At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
Note on performance
I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.
Example data
set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
country city salary arg1 arg2
1 1 1 22791 -1.4606212 1.07084528
2 1 1 34598 0.9244679 1.19519158
3 1 1 76411 0.8288587 0.86737330
4 1 1 76790 1.3013056 0.07380115
5 1 1 87297 -1.4021137 1.62395596
6 1 2 12581 1.3062181 -1.03360620
With this example, if windowSize = 70000 and windowStep = 30000, the first values of D are -0.236604 and 0.439462 which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.
Unless I've misunderstood something, the following might be helpful.
Define a simple function regardless of hierarchical groupings:
ff = function(salary, wSz, wSt, arg1, arg2)
{
froms = (wSt * (0:ceiling(max(salary) / wSt)))
tos = froms + wSz
Ds = mapply(function(from, to, salaries, args1, args2) {
inds = salaries > from & salaries < to
sum(args1[inds]) / sum(args2[inds])
},
from = froms, to = tos,
MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
list(from = froms, to = tos, D = Ds)
}
Compute on the groups with, for example, data.table:
library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
# country city from to D
# 1: 1 1 0 70000 -0.2366040
# 2: 1 1 30000 100000 0.4394620
# 3: 1 1 60000 130000 0.2838260
# 4: 1 1 90000 160000 NaN
# 5: 1 2 0 70000 1.8112196
# 6: 1 2 30000 100000 0.6134090
# 7: 1 2 60000 130000 0.5959344
# 8: 1 2 90000 160000 NaN
# 9: 1 3 0 70000 1.3216255
#10: 1 3 30000 100000 1.8812397
I.e. a faster equivalent of
lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
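The output above shows NaN for empty windows, while the question asks for NA. A small variant that returns NA explicitly could look like the sketch below. The function name window_D and the toy vectors are made up, and treating a window as from <= salary < to is an assumption (ff above uses strict inequalities on both sides):

```r
window_D <- function(salary, arg1, arg2, size, step) {
  # Window start points: 0, step, 2 * step, ... up to the largest salary
  starts <- seq(0, max(salary), by = step)
  D <- vapply(starts, function(from) {
    inside <- salary >= from & salary < from + size
    if (!any(inside)) return(NA_real_)  # empty window: NA rather than NaN
    sum(arg1[inside]) / sum(arg2[inside])
  }, numeric(1))
  data.frame(from = starts, to = starts + size, D = D)
}

# Toy input: nobody earns between 1000 and 2000
res <- window_D(
  salary = c(100, 500, 2500),
  arg1 = c(1, 2, 3),
  arg2 = c(2, 2, 3),
  size = 1000, step = 1000
)
res
```

It returns a data frame per group, so it can be used in the same by-group data.table call or the lapply(split(...)) form in place of ff.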
Without your expected outcome it is a bit hard to guess whether my result is correct but it should give you a head start for the first step. From a performance point of view the data.table package is very fast. Much faster than loops.
set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)
WindowSize <- 10
WindowStep <- 3
dt[, .(D = rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
           rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
   by = .(country, city)]
You can achieve the latter part of your goal by melting the data and writing a custom summary function that you then use to dcast the data back together.
Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0, WindowSize)
WindowStep = 100
# unique() so that each country (and each city below) is processed only once
for (x in unique(dt$country)) {
  # subset of data for that country
  CountrySubset = dt[dt$country == x, , drop = FALSE]
  for (y in unique(CountrySubset$city)) {
    # subset of data for that city within the country
    CitySubset = CountrySubset[CountrySubset$city == y, , drop = FALSE]
    for (z in 1:StepNumber) {
      # z - 1 so that the first window starts at 0
      WinRange = WindowRange + ((z - 1) * WindowStep)
      # subset of salaries within this city that fall inside the window
      WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
      CalcD = sum(WindowData$arg1) / sum(WindowData$arg2)
      Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
      Table = rbind(Table, Output)
    }
  }
}
Using your example code this should work; it's just a series of nested loops that write to Table. Looping over unique(dt$country) and unique(CountrySubset$city) matters: looping over the raw columns would visit the same country and city once per row and write duplicate lines to Table.
Results are added to Table with rbind, which is the only way I know to keep appending to a table.
WindowStep is the offset between the starts of consecutive windows.
StepNumber is how many steps you want to take in total; it might be best to find out what the maximum salary is and then adjust for that.
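As a side note on collecting the results: calling rbind inside the loop copies the whole table on every iteration, which matters given the performance concern in the question. A common alternative is to store each result in a pre-allocated list and bind once at the end. The sketch below only demonstrates that pattern with made-up window bounds; it is not the full country/city loop:

```r
WindowSize <- 1000
WindowStep <- 100
steps <- 0:2  # toy number of steps, just for illustration

# Pre-allocate one list slot per result row
rows <- vector("list", length(steps))
for (i in seq_along(steps)) {
  start <- steps[i] * WindowStep
  rows[[i]] <- data.frame(WinStart = start, WinEnd = start + WindowSize)
}

# Bind everything in a single call after the loop
Table <- do.call(rbind, rows)
Table
```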