I'm learning to use R, and I want to do an analysis of my customers and the distribution of them into categories. This is an example of my data:
I want to know how many customers (cards) have tickets in A, B or C, and the amount spent on each of these categories, and then do the same thing but for each type of segment.
I want to do a bar chart with the percentage of tickets in each category of all my customers and then 3 charts of each segment.
Here's an example of what I want to get:
Example
The problem is that some cards have 0 tickets in a given category.
I've been thinking it might work to add a column beside each category with a rule such as "if the row has tickets in category A, put A in that row". Would creating such a column make this easier?
If you think that might work, how could I run that?
Sorry, but I'm very new to R and I have a huge database :(
Thanks for your help!!
I am trying to produce a user-interactive table like this one: Target table
The yellow columns are editable, the white ones can't be edited, and they work in the following fashion:
The user updates column B and allocates 100% of the distribution among the three rows (40, 35, 25 in this case). A total is shown in the footer to help the user confirm that the allocation adds up to 100%.
Column A updates in the backend by multiplying each % by a predetermined number (1000 in this case). A total is shown, which will be 1000 if the user entered the percentages correctly.
The user then enters values in column C. No total is necessary here.
Column D is then calculated in the backend as the product of column A and column C.
Is this achievable by any simple means?
I then also want to reuse this table for more calculations, since it serves as the input prompt.
Thank you very much.
You can do most or all of this with the tools you tagged but it won't be simple:
The DT package will not do all of this for you: you will also have to use reactivity.
This link discusses working with edited values in DT, providing an idea of the level of difficulty to expect: R -shiny- DT: how to update col filters.
In addition to what you see in the example, you want to add formatting, a totals row, and only allow editing of some columns, which will require more work.
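To give an idea of the calculation the reactive code would have to redo after every edit, here is a minimal sketch of the table logic on its own, written in plain Python rather than Shiny/DT. The function name `compute_table` and the column C values (2, 3, 4) are made up for illustration; only the percentages 40/35/25 and the base of 1000 come from the question.

```python
BASE_AMOUNT = 1000  # the predetermined number from the question


def compute_table(percentages, multipliers, base=BASE_AMOUNT):
    """Return one dict per row (columns B, A, C, D) plus footer totals for B and A."""
    if len(percentages) != len(multipliers):
        raise ValueError("columns B and C must have the same number of rows")
    rows = []
    for pct, c in zip(percentages, multipliers):
        a = base * pct / 100  # column A: this row's share of the base amount
        d = a * c             # column D: product of column A and column C
        rows.append({"B": pct, "A": a, "C": c, "D": d})
    # Footer totals: B should come to 100 and A to the base amount.
    totals = {"B": sum(r["B"] for r in rows), "A": sum(r["A"] for r in rows)}
    return rows, totals


# Column C values here are hypothetical user input.
rows, totals = compute_table([40, 35, 25], [2, 3, 4])
for row in rows:
    print(row)
print("totals:", totals)
```

In a Shiny app the same recalculation would sit inside whatever reactive code responds to a cell edit, with the result written back to the table.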
I have a table visualisation in PowerBI that sums the top 10 products sold by sales quantity. I have a calculated column which shows the rate of sale, using other fields from the data source:
(quantity / # stores with product) / weeks on sale
The ROS calculates correctly, but it still sums and appears in the total row. The number of stores and number of weeks are set to 'Don't Summarize', but they still add together and give some meaningless number in the total row. If I set ROS to 'Don't Summarize' to remove its total, the summing of the rest of the table, and therefore the filter I have on top N by quantity, drops out.
It is very frustrating! Is there an option somewhere to simply not display the total for a field? I don't want to remove the total row completely, as a sum is useful for the other fields (e.g. Qty, Value, Margin). It seems very strange that it is so difficult to do something so minor.
Additional info:
Qty is a SUM field.
Stores is not summarized and simply refers to the average number of stores that stock that product over the weeks of the trading season.
Weeks is not summarized and refers to the weeks that have passed in the trading season.
Example data:
Item     Qty      Stores    Weeks    ROS
Itm1     600      390       2        0.77
Itm2     444      461       2        0.48
Itm3     348      440       2        0.40
Total    1,392    1,291*    6*       1.65*
Fields marked with a * are those where the sum is a meaningless figure unrelated to the data. I do not actually need Stores and Weeks to show in the table, so the fact that they sum does not matter. However, ROS is essential, but the sum part is totally irrelevant and I do not want it to show. Any ideas? I am open to the idea of using R to overcome the lack of flexibility in the standard tables although my knowledge in this area is fairly limited.
I suspect you've made a common mistake - using a Calculated Column for ROS where you should've used a Measure.
If you rebuild that calculation as a Measure, then you can wrap the HASONEVALUE function around it, with the objective of showing a blank when there are multiple Item values in context (the Total row).
Roughly the Measure formula would be:
ROS = IF ( HASONEVALUE ( Mytable[Item] ) , << calculation >> , BLANK() )
I would also replace your use of / with the DIVIDE function, to avoid divide by zero errors.
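As a rough illustration of the logic only (this is Python, not DAX, and the function names are made up for the example), the sketch below computes ROS per item from the sample data in the question, returns a blank when more than one item is in context (as on the Total row), and guards the divisions against a zero denominator.

```python
def safe_divide(numerator, denominator):
    """Return None (standing in for a blank) instead of failing on a zero denominator."""
    if denominator == 0:
        return None
    return numerator / denominator


def ros(rows):
    """Rate of sale for the rows in context; None when more than one item is present."""
    items = {r["item"] for r in rows}
    if len(items) != 1:  # same idea as the HASONEVALUE check: the Total row sees many items
        return None
    qty = sum(r["qty"] for r in rows)
    per_store = safe_divide(qty, rows[0]["stores"])
    if per_store is None:
        return None
    return safe_divide(per_store, rows[0]["weeks"])


# Sample data copied from the question.
data = [
    {"item": "Itm1", "qty": 600, "stores": 390, "weeks": 2},
    {"item": "Itm2", "qty": 444, "stores": 461, "weeks": 2},
    {"item": "Itm3", "qty": 348, "stores": 440, "weeks": 2},
]

per_item = {r["item"]: ros([r]) for r in data}  # one item in context per table row
total_row = ros(data)                           # all items in context on the Total row

for item, value in per_item.items():
    print(item, round(value, 2))
print("Total", total_row)
```

The per-item values match the ROS column in the question, while the Total row comes out blank instead of 1.65.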
You can remove individual totals for columns in tables and matrix objects in a round-about way by using field formatting.
Click the object, go to formatting, click the field formatting accordion, select the column or columns you want to affect from the drop-down list, set the font color to white, set 'apply to values' to off, and set 'apply to totals' to on.
A bit tedious if you have many columns, but you will have, in effect, whited-out the column totals.
Heads up, you might still have a problem with exporting data, though.
Cheers
Click on the table -> Fields -> expand the value field you don't want to include -> Select "Don't Summarize." This will exclude it from the "Total" row.
Select the "Don't summarize" option for the metrics whose totals you don't want.
Select the table you want to change
In the Visualizations pane:
Go to Format,
Find the Field Formatting option,
Choose the field you don't want to summarize.
Turn off 'apply to header',
Turn off 'apply to values',
Turn ON 'apply to total',
Change the font color to white.
I'm trying to develop a ChartCustomizer that takes the data from a chart and converts it into a histogram (because JR does not directly support histograms). It's a fairly simple implementation with hard-coded intervals, etc. mostly as a proof-of-concept at this point.
The data I'm analyzing is HTTP response-time data of the form [date, response-time] and I have a CSV file with 18512 records in it. In my summary band, I have 3 items:
A text field dumping $V{REPORT_COUNT} (it reports 18512 in iReport's report preview)
A time series showing all the data points [date, response-time]
A category plot containing all the data points in a single series [category=$F{DATE}, value=$F{RESPONSE_TIME}]
I decided that the most straightforward way to build a histogram would be to use the Category plot because it had the right structure for the final histogram chart.
When the ChartCustomizer runs, it dumps out all kinds of good information about the data set, including the size. Strangely, the size is 10252: it's missing something like 8000 data points. I can't understand why the category plot would have fewer data points than the whole data set.
Any ideas?
Answering my own question in case others run across this foolish user error.
The problem was that CategoryDataset only allows one data point per "category", and in my case, "category" was a java.util.Date captured from the web server log. Apparently, nearly half of my dates were duplicates and so part of the data set overwrote the other half, leaving a subset of the data.
That should have been totally obvious to me at the outset, because that is exactly how a category dataset works.
Anyhow, simply changing the category plot series's "category expression" from $F{DATE} to $V{REPORT_COUNT} gave each datum a unique category which makes everything work.
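The effect is easy to reproduce outside JasperReports. The sketch below is only an analogy in Python, using a dictionary in place of the category dataset and three made-up records: keying by timestamp silently drops a record, while keying by a running record count keeps all of them.

```python
from datetime import datetime

# Made-up sample records of the form (date, response-time); two share a timestamp.
records = [
    (datetime(2012, 1, 1, 12, 0, 0), 120),
    (datetime(2012, 1, 1, 12, 0, 0), 95),
    (datetime(2012, 1, 1, 12, 0, 1), 210),
]

# Keyed by timestamp: a later record with the same key replaces the earlier one.
by_date = {}
for date, response_time in records:
    by_date[date] = response_time

# Keyed by a running record count: every record gets its own key.
by_count = {}
for count, (date, response_time) in enumerate(records, start=1):
    by_count[count] = response_time

print(len(records), len(by_date), len(by_count))
```

With real log data the gap between the first two numbers is the number of records lost to duplicate timestamps.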
This is very similar to the question here:
How to use ggplot to group and show top X categories?
Except in my case I don't have a discrete value to go on. I've got data about users posting messages to a user forum. Similar to:
Year, Month, Day, User, Message
I've got an entry for every single message a person posted and I want to plot the top 5 users per year in terms of total Messages posted. In the previous question there was a distinct list of values that could be keyed off of.
In my case, I'm curious if I can do it easily in ggplot2, or if I need to do something like:
Load the data into a dataframe
Construct a new dataframe which is the same data collapsed & summarized by year
Plot from the new frame using the same approach as the previous question
If this is the best way to do it, what's the "correct" way to do #2? That new dataframe should probably be of the form:
Year, User, Total number of Messages
Any help is appreciated.
Based on Joran's comment, I found this plyr approach:
ddply(posts, .(year, poster), summarise, freq=length(year))
Which gives me the posts per year per user. From there I can trim it down as suggested in other posts to get the top X posters per year.
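For anyone who wants to see the count-then-trim idea end to end, here is a minimal sketch in Python rather than plyr. The function name `top_posters_per_year` and the sample posts are made up for illustration; the real data would have one (year, poster) pair per message.

```python
from collections import Counter, defaultdict


def top_posters_per_year(posts, n=5):
    """Count messages per (year, poster), then keep the n most active posters per year."""
    counts = defaultdict(Counter)
    for year, poster in posts:
        counts[year][poster] += 1
    return {year: counter.most_common(n) for year, counter in counts.items()}


# Hypothetical sample: one (year, poster) pair per message.
posts = [
    (2010, "ann"), (2010, "bob"), (2010, "ann"),
    (2011, "bob"), (2011, "cat"), (2011, "bob"), (2011, "cat"), (2011, "bob"),
]

top = top_posters_per_year(posts, n=1)
print(top)
```

The result has the "Year, User, Total number of Messages" shape described in the question, already trimmed to the top n per year, so it can be handed straight to a plotting step.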
I have a column of year values by which I am sorting. I'd like to find the quantity per year (read: number of repeats of each year value). I'd like to chart said values. I'm not sure how to make this happen.
I am using Apple's Numbers '08, but if possible a general solution that multiple people could use would be preferred.
You should use the countif() function: http://office.microsoft.com/en-us/excel/HP052090291033.aspx
I did a similar thing to count how many hours of work there are for each upcoming version of my iPhone app. I was doing sumif(), but you just want countif().
See cells N4-N6 here: http://spreadsheets.google.com/ccc?key=0AhL0igVI9HVNdGpaS3U1cS1qOGVNd3h0Slg0a21vUWc&hl=en
On a new sheet, list the unique years in one column, then their quantity count in the column next to them. Select the entire range created, then create a chart.
I'm unsure from your question what you would specifically need more than this (and I work in Excel 2003).
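If it helps to see what that two-column range contains, here is a small sketch of the same idea in Python, outside any spreadsheet. The function name `year_counts` and the sample years are made up; each count is what countif() would return for that year.

```python
def year_counts(years):
    """List each unique year alongside the number of times it appears."""
    unique_years = sorted(set(years))
    # For each unique year, count the matching entries, as countif() would.
    return [(y, sum(1 for value in years if value == y)) for y in unique_years]


# Hypothetical column of year values.
table = year_counts([2008, 2007, 2008, 2009, 2008, 2007])
for year, count in table:
    print(year, count)
```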