Finding the starting location of specific consecutive values in every row - r

I am kind of new to R and am still learning how to use all the functions. I'm stuck with this problem on my project, so would definitely appreciate any help!
I have the spending behavior data of customers 1, 2, 3, and 4 from Jan to Dec. Every row is a unique customer, and every column is the customer's spending activity for that month (1 is active, 0 is inactive).
+------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Name       | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
+------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Customer 1 |  1  |  1  |  0  |  1  |  1  |  0  |  0  |  0  |  1  |  1  |  0  |  1  |
| Customer 2 |  0  |  0  |  0  |  1  |  1  |  0  |  1  |  1  |  0  |  0  |  0  |  1  |
| Customer 3 |  1  |  1  |  1  |  0  |  1  |  1  |  0  |  1  |  0  |  0  |  0  |  0  |
| Customer 4 |  1  |  1  |  1  |  0  |  1  |  1  |  0  |  1  |  0  |  0  |  1  |  0  |
+------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I am trying to find the "last" occurrence of 3 consecutive zeros in every row, if any, and locate its starting point. This would help me identify when the customer goes into "idle" or "hibernation" status.
The expected results would be:
Jun (or 6 as column index) for Customer 1
Sep (or 9 as column index) for Customer 2
Oct (or 10 as column index) for Customer 3, since the last 3-month occurrence begins at Oct instead of Sep
NA for Customer 4, since there's no such occurrence
After going through some similar questions on Stack Overflow, I figured using rle and apply might be the right approach, but I have been struggling with how to turn this into actual code. I sincerely appreciate any ideas!

I would melt the dataframe so there are 3 columns:
Name, Month, Spending Behaviour.
I would then do a rollmean from the zoo package with left alignment, subset for instances where the rollmean is 0, and pick the last observation for each customer.
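A minimal sketch of that idea, assuming the data frame is df with a Name column followed by the Jan-Dec columns in order (the melt/dplyr details here are my own choices, not spelled out above):

library(reshape2)  # melt()
library(zoo)       # rollmean()
library(dplyr)

# long format: one row per customer-month; months stay in Jan..Dec order
long <- melt(df, id.vars = "Name",
             variable.name = "Month", value.name = "Spending")

long %>%
  group_by(Name) %>%
  # left-aligned 3-month rolling mean: 0 means a window of 3 zeros starts here
  mutate(roll3 = rollmean(Spending, 3, fill = NA, align = "left")) %>%
  filter(roll3 == 0) %>%
  slice_tail(n = 1)   # the last such window start per customer

This gives Jun, Sep and Oct for Customers 1-3; Customer 4 has no 3-zero window and simply drops out, so an NA row would have to be joined back in if needed.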

Related

Read a three way output table in R

I hope you can help me.
I would like to know how to read an output in R.
So I have three binary variables Q3A1, Q3A2 and Q3A3, where each line corresponds to the answers of one subject:
| Q3A1 | Q3A2 | Q3A3 |
|------|------|------|
| 0    | 1    | 0    |
| 1    | 0    | 1    |
| 0    | 1    | 0    |
| 1    | 0    | 1    |
I wanted to have the number of people for each possible answer, so I wrote the command
table(dataQ$Q3A1,dataQ$Q3A2,dataQ$Q3A3)
and I got this as an output in R console
, , = 0

     0  1
  0 90 16
  1 80 61

, , = 1

     0  1
  0 19  4
  1 52 13
but I don't know how to read it: which rows and columns correspond to which variable?
Thanks in advance for your help!
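In table(x, y, z) the first argument gives the rows, the second the columns, and the third the ", , = value" slices; so here the rows are Q3A1, the columns Q3A2, and the two slices are Q3A3 = 0 and Q3A3 = 1. One way to make the printout self-describing (a small aside, reusing the question's own variable names) is to name the arguments:

table(Q3A1 = dataQ$Q3A1, Q3A2 = dataQ$Q3A2, Q3A3 = dataQ$Q3A3)
# the slices now print as ", , Q3A3 = 0", with rows labelled Q3A1
# and columns labelled Q3A2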

SQLite count occurrence per year

So let's say I have a table in my Sqlite database with some information about some files, with the following structure:
| id | file format | creation date              |
|----|-------------|----------------------------|
| 1  | Word        | 2010:02:12 13:31:33+01:00  |
| 2  | PSD         | 2021:02:23 15:44:51+01:00  |
| 3  | Word        | 2019:02:13 14:18:11+01:00  |
| 4  | Word        | 2010:02:12 13:31:20+01:00  |
| 5  | Word        | 2003:05:25 18:55:10+02:00  |
| 6  | PSD         | 2014:07:20 20:55:58+02:00  |
| 7  | Word        | 2014:07:20 21:09:24+02:00  |
| 8  | TIFF        | 2011:03:30 11:56:56+02:00  |
| 9  | PSD         | 2015:07:15 14:34:36+02:00  |
| 10 | PSD         | 2009:08:29 11:25:57+02:00  |
| 11 | Word        | 2003:05:25 20:06:18+02:00  |
I would like results that show me a chronology of how many of each file format were created in a given year – something along the lines of this:
|Format| 2003 | 2009 | 2010 | 2011 | 2014 | 2015 | 2019 | 2021 |
----------------------------------------------------------------
| Word | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 0 |
| PSD | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| TIFF | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
I've gotten kinda close (I think) with this, but am stuck:
SELECT
file_format,
COUNT(CASE file_format WHEN creation_date LIKE '%2010%' THEN 1 ELSE 0 END),
COUNT(CASE file_format WHEN creation_date LIKE '%2011%' THEN 1 ELSE 0 END),
COUNT(CASE file_format WHEN creation_date LIKE '%2012%' THEN 1 ELSE 0 END)
FROM
fileinfo
GROUP BY
file_format;
When I do this I am getting unique amounts for each file format, but the same count for every year…
|Format| 2010 | 2011 | 2012 |
-----------------------------
| Word |  6   |  6   |  6   |
| PSD  |  4   |  4   |  4   |
| TIFF |  1   |  1   |  1   |
Why am I getting that incorrect tally, and moreover, is there a smarter way of querying that doesn't rely on the year being statically searched for as a string for every single year? If it helps, the column headers and row headers could be switched – doesn't matter to me. Please help a n00b :(
Your CASE compares file_format against the result of the LIKE test, and either way it returns a non-NULL 0 or 1 for every row; since COUNT() counts every non-NULL value, each year's column ends up being just the total number of rows per format. Use the SUM() aggregate function for conditional aggregation instead; in SQLite a LIKE comparison evaluates to 1 or 0, which can be summed directly:
SELECT file_format,
       SUM(creation_date LIKE '2010%') AS `2010`,
       SUM(creation_date LIKE '2011%') AS `2011`,
       ...
FROM fileinfo
GROUP BY file_format;
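To avoid hard-coding one expression per year, an alternative sketch (not from the original answer) is to derive the year once with substr() and group on it; this returns one row per format/year pair rather than a pivoted table:

-- the year is the first 4 characters of the 'YYYY:MM:DD hh:mm:ss' strings
SELECT file_format,
       substr(creation_date, 1, 4) AS year,
       COUNT(*) AS n
FROM fileinfo
GROUP BY file_format, year
ORDER BY file_format, year;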

(AVB)&(AV~B) is logically equivalent to ~~A, is that true or false?

Is (AVB)&(AV~B) logically equivalent to ~~A?
It kind of confused me, since the two expressions look so different.
Starting from:
(A∨B)∧(A∨¬B)
One can first apply one of De Morgan's laws to the AND:
¬(¬(A∨B)∨¬(A∨¬B))
Next, one of De Morgan's laws can be applied to each side:
¬((¬A∧¬B)∨(¬A∧B))
By applying distributivity of AND over OR in reverse, ¬A can be extracted:
¬(¬A∧(¬B∨B))
By complementation, ¬B∨B is 1:
¬(¬A∧1)
By identity for AND:
¬¬A
Applying double negation:
A
This route reaches your target ¬¬A along the way, without needing plain A first. It's much simpler, though, not to do it that way.
Starting again from:
(A∨B)∧(A∨¬B)
By applying distributivity of OR over AND in reverse, A can be extracted:
A∨(B∧¬B)
By complementation, B∧¬B is 0:
A∨0
By identity for OR:
A
Applying double negation in reverse:
¬¬A
Since you mention doing it by truth table, here's my go at showing that they're equivalent that way:
+---+---+----+----+-------+--------+--------------+-----+
| A | B | ¬A | ¬B | (A∨B) | (A∨¬B) | (A∨B)∧(A∨¬B) | ¬¬A |
+---+---+----+----+-------+--------+--------------+-----+
| 0 | 0 | 1  | 1  |   0   |   1    |      0       |  0  |
| 0 | 1 | 1  | 0  |   1   |   0    |      0       |  0  |
| 1 | 0 | 0  | 1  |   1   |   1    |      1       |  1  |
| 1 | 1 | 0  | 0  |   1   |   1    |      1       |  1  |
+---+---+----+----+-------+--------+--------------+-----+
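For a quick mechanical check, a few lines of R (an aside, not part of the proof) enumerate all four assignments:

A <- c(FALSE, FALSE, TRUE, TRUE)
B <- c(FALSE, TRUE, FALSE, TRUE)
lhs <- (A | B) & (A | !B)   # (A∨B)∧(A∨¬B)
rhs <- !!A                  # ¬¬A
all(lhs == rhs)             # TRUE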

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I have no data; hence no proportion table can be calculated. So, what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here are some insights into my dataframe and function:
My raw dataframe looks more or less as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year & quarter) and area for which X number of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age = 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # expand each row to one line per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  # plot before returning, so that alkplot = TRUE actually draws
  if (alkplot) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for either quarter 1 or 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year), and replace the NA in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is the LAK function which I'm trying to update:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  # in case of an empty (or all-NA) subset combination
  if (nrow(sALK) == 0 || all(is.na(sALK$no_individuals))) {
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin = 1), 3)
  if (alkplot) {
    alkPlot(key, "area", xlab = "Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of interest to someone:
LAK <- function(df, Year = "2005", Quarter = "1", Area = "22", alkplot = TRUE) {
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year == Year & quarter == Quarter & area == Area)
  print(sALK)
  if (nrow(sALK) == 1) {  # a single all-NA row marks an empty combination
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all records for the same year and area, regardless of quarter
    sALK2 <- subset(df, year == syear & area == sarea)
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    idx <- which(vals$Freq > 1)  # combinations that actually contain data
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year == syear & area == sarea & time_comb %in% quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), ]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin = 1), 3)
    print(key2)
    if (alkplot) {
      alkPlot(key2, "area", xlab = "Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), ]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin = 1), 3)
    print(key)
    if (alkplot) {
      alkPlot(key, "area", xlab = "Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out what to do when I do not have data for a particular Year & Area combination at all. In this case I need to borrow data from the closest year that contains data for all the quarters of the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data is distributed, so that you can choose the method most suited for your problem. Check this brief explanation and the original package description.
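A minimal sketch of what that could look like here (the column selection, method and seed are assumptions, not a tested recipe):

library(mice)

# impute the NA-carrying columns; "pmm" = predictive mean matching
imp <- mice(df[, c("no_individuals", "lenCls", "age")],
            m = 5, method = "pmm", seed = 1)

completed <- complete(imp, 1)  # one of the m completed datasets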

Group by Week with Columns for each day of week

I'm trying to create a report for my ASP.NET application which will show the quantity of each item/unit combination that was ordered for each day of the week. The days of the week are the columns.
To be more specific:
I have two tables. One is the Orders table with order id, customer name, date, etc.
The second is OrderItems, which has order id as a foreign key, plus order item id, item name, unit (e.g. each, box, case), quantity, and price.
When a user picks a date range for the report, for example from 3/2/12 to 4/2/12, on my asp page, the report will group order items by week and will look as follows:
**week (1) starting from Sunday of such date to Saturday of such date**

| item  | unit | Sun | Mon | Tue | Wed | Thu | Fri | Sat | Total price for week |
|-------|------|-----|-----|-----|-----|-----|-----|-----|----------------------|
| item1 | bag  | 3   | 0   | 12  | 8   | 45  | 1   | 4   | $1234                |
| item4 | box  | 2   | 4   | 5   | 0   | 5   | 2   | 6   | $1234                |

**week (2) starting from Sunday of such date to Saturday of such date**

| item  | unit | Sun | Mon | Tue | Wed | Thu | Fri | Sat | Total price for week |
|-------|------|-----|-----|-----|-----|-----|-----|-----|----------------------|
| item1 | bag  | 3   | 0   | 12  | 8   | 45  | 1   | 4   | $1234                |
| item4 | box  | 2   | 4   | 5   | 0   | 5   | 2   | 6   | $2354                |
I wish I had something to show that I've already started, but Crystal isn't my strong point and I don't even know where to start tackling this one. I do know how to pass parameters and a datatable that I pre-filtered myself before passing it to the report, for example filtering items by date range and customer or order id.
Any help would be much appreciated.
Create a formula for each day of the week that totals the order.
i.e. Sunday quantity:
if dayOfWeek(dateField) = 'Sun'
then order.quantity
else 0
Add each day formula to the detail section of the report and then summarize it for each group level. To group it by week, just group by the date field, then set the grouping option to by week. Suppress the detail, and you'll have what you are looking for.
I don't remember the exact name of the dayOfWeek function, but it's something like that.
