Y Axis wrong index - r

I'm reading a book and found this code. I tried it, and I'm a little confused about the graph I'm getting.
This is a sample of the data:
consumption[sample(1:nrow(consumption), 5, replace=F),]
                    Food  Units Year    Amount
8  Fruits and Vegetables Pounds 1980 603.57948
31    Caloric sweeteners Pounds 1995 144.08113
16 Fruits and Vegetables Pounds 1985 630.24491
28                  Eggs Number 1995 232.28203
19    Fish and Shellfist Pounds 1990  14.94411
And I'm getting this graph, in which the Y axis labels are numbers from 1 to 20 rather than the correct "Amount" values.
What can I do so that the Y axis shows the Amount values correctly?

The figure you show is just like the one in the book, R in a Nutshell, that provided you with the code. Actually, the book provides the code for two different versions of the same plot. I suggest trying them both.
library(nutshell)
data(consumption)
library(lattice)
dotplot(Amount ~ Year | Food, consumption)
dotplot(Amount ~ Year | Food, consumption,
        aspect="xy", scales=list(relation="sliced", cex=.4))
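One possible explanation for the 1 to 20 labels, offered as a sketch rather than a definitive diagnosis: lattice's dotplot expects one of the two variables to be categorical. If neither is a factor (assuming Year is stored as a number in consumption), the y variable is the one that gets converted, so the axis shows level positions instead of amounts. The toy data frame below is a hypothetical stand-in with made-up values, not the real consumption data. It shows two ways to keep Amount numeric: make Year a factor, or use xyplot, which treats both variables as numeric.

```r
library(lattice)

# Hypothetical stand-in for the consumption data (values are made up)
toy <- data.frame(
  Food   = rep(c("Eggs", "Fruits and Vegetables"), each = 3),
  Year   = rep(c(1980, 1985, 1990), times = 2),
  Amount = c(271.1, 254.9, 234.3, 603.6, 630.2, 650.1)
)

# Option 1: make Year the categorical variable so Amount stays numeric
p <- dotplot(Amount ~ factor(Year) | Food, data = toy)
print(p)

# Option 2: xyplot treats both variables as numeric
q <- xyplot(Amount ~ Year | Food, data = toy)
print(q)
```

If the panels have very different ranges, scales = list(y = list(relation = "free")) can be added to either call.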

Related

Constrained K-means, R

I am currently using k-means to cluster my data, but I want each cluster to appear only once in each given year. I searched for answers for a whole night with no result. Does anyone have ideas on how to approach this in R, or know of a package I should look at? Thanks.
More background info:
I am trying to replicate the clusters of relationships using the reported gender, education level, and birth year. I am doing this because this is survey data whose respondents are elderly people, and they sometimes report inaccurate age or education information. My main challenge now is that I want each cluster label to appear only once in each survey year. For example, I do not want to see two "cluster 3" entries in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (including sons, daughters, and neighbors). Since these older people are sometimes imprecise about their family's demographic information, such as birth year and education level, we might need to delete a large part of the data if the reports do not match.
(E.g., A reported that his first son was 30 years old in 1997 but 29 years old in 1999, so this data could be problematic.) My task is to save as much data as possible when the imprecision is not too high.
Therefore, I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I ran k-means only if the family members were detected to be imprecise. In this way, I saved much of the data. Although I did not solve the above problem, it occurs so rarely that I can almost ignore it or drop these observations.
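The precision check described above could look roughly like the following base R sketch. The column names, the simplified relationship labels, the choice of the first report as the reference value, and the tolerance of one year are all assumptions made for illustration; the original post does not spell them out.

```r
# Toy panel modelled on the table in the question (names are assumptions)
panel <- data.frame(
  survey_year  = c(2000, 2003, 2000, 2003, 2000, 2003),
  relationship = c("first daughter", "first daughter",
                   "second daughter", "second daughter",
                   "third daughter", "third daughter"),
  birth_year   = c(1997, 1997, 1999, 1999, 1999, 2001)
)

# Error of each report relative to the first report for the same family member
panel$birth_year_error <- panel$birth_year -
  ave(panel$birth_year, panel$relationship, FUN = function(x) x[1])

# Hypothetical tolerance: reports within one year count as precise
tolerance <- 1
panel$precise <- abs(panel$birth_year_error) <= tolerance

print(panel)
```

Only the rows flagged as imprecise would then go on to the clustering step.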

Calculating a ratio in a ggplot2 graph while retaining faceting variables

I don't think this has been asked before, but SO search might just be getting confused by combinations of 'ratio' and 'faceting'. I'm trying to calculate a productivity ratio: the number of widgets produced per number of workers on a given day or period. My data is structured in a single data frame, with each widget produced each day by each worker in its own record, and other workers who worked that day but didn't produce a widget also in their own records, along with various metadata.
Something like this:
widget_ind  employee_active_ind  employee_id  day       product_type  employee_bu
1           1                    123          6/1/2021  pc            americas
0           1                    234          6/1/2021  mac           emea
0           1                    345          6/1/2021  mac           apac
1           1                    444          6/1/2021  mac           americas
1           1                    333          6/1/2021  pc            emea
0           1                    356          6/1/2021  pc            americas
I'm trying to find the ratio of widget_inds to employee_active_inds over time while retaining the metadata, so that I can filter or facet within the ggplot2 code, something like:
plot <- ggplot(data = df[df$employee_bu == 'americas',], aes(y = (widget_ind/employee_active_ind), x = day)) +
  geom_bar(stat = 'identity', position = 'stack') +
  facet_wrap(product_type ~ ., scales = 'fixed') # change these to look at different cuts of metadata
print(plot)
Retaining the metadata is more appealing than building a separate summarized data frame for each combination, but the results are wrong even with no faceting (e.g. the ggplot bar chart shows a height of ~18 widgets per person, while a summarized data frame with no faceting shows a ratio of less than 1 widget per person).
I'm currently getting this warning when I run the ggplot code:
Warning message:
Removed 9865 rows containing missing values (geom_bar).
This doesn't make sense to me, since neither widget_ind nor employee_active_ind has any NA values in my data frame, so calculating the ratio of the two should always work.
Edit 1: Clarifying employee_active_ind: I should not have any employee_active_ind = 0, but my current joins produce them (and it passes the reality sniff test; the process we are trying to model allows you to do work on day 1 that results in a widget on day 2, when you may not do any work and so wouldn't be counted as active). I think I need to rethink my data structure. Even so, I'm assuming here that ggplot2 is acting as it would for any bar chart: it takes the number in each widget_ind record for a given day (along with any facets and filters), sums that set, and displays the result. The wrinkle I'm adding is dividing by the number of active employees on that day, and while you can have someone out on a given day, you'd never have everyone out. But that isn't what ggplot is doing, is it?
I agree with MrFlick, especially on the question about rows where employee_active_ind is 0. If you have them, the division does not produce a usable ratio: in R, 0/0 is NaN and 1/0 is Inf, which would explain rows being dropped from the plot even though neither column contains NA.
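To illustrate the point, here is a minimal base R sketch built on the six sample rows from the question. It shows what row-wise division by zero returns, and one way to get a ratio of sums (rather than a stacked sum of per-row ratios) by aggregating per day and per facet variable before plotting. Treating the summary as one row per day and product_type is an assumption about what the asker wants; the name sums is made up for this example.

```r
# The six sample rows from the question
df <- data.frame(
  widget_ind          = c(1, 0, 0, 1, 1, 0),
  employee_active_ind = c(1, 1, 1, 1, 1, 1),
  employee_id         = c(123, 234, 345, 444, 333, 356),
  day                 = as.Date(rep("2021-06-01", 6)),
  product_type        = c("pc", "mac", "mac", "mac", "pc", "pc"),
  employee_bu         = c("americas", "emea", "apac", "americas", "emea", "americas")
)

# Row-wise division by zero never yields a usable ratio
print(0 / 0)  # NaN
print(1 / 0)  # Inf

# Sum widgets and active employees per day and product type, then divide
sums <- aggregate(cbind(widget_ind, employee_active_ind) ~ day + product_type,
                  data = df, FUN = sum)
sums$ratio <- sums$widget_ind / sums$employee_active_ind
print(sums)
```

The summarized frame can then be passed to ggplot with y = ratio and facet_wrap(~ product_type). Adding another metadata column to the right-hand side of the formula keeps it available for faceting.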

Extracting words based on a $ and unnesting the data

I am trying to do two things to my data. The data looks like:
# A tibble: 10 x 2
grp newCol
<int> <chr>
1 6303 "The company sees earnings of $3.40 to $3.60 a share for all of 2008, agai…
2 7686 " -- reaffirmed its fiscal 2015 guidance of per share diluted earnings b…
3 9577 "Analysts polled by FactSet anticipate earnings of 96 cents a share, down …
4 6475 ""
5 5229 "The company also expects income to be \"significantly impacted\" by costs…
6 2648 "Hoku also expects losses for the foreseeable future on significant cost i…
7 3691 "St. Louis-based Emerson now sees full-year earnings of $2.40 to $2.60 a s…
8 9297 ""
9 2080 "The restaurant group also raised its earnings guidance for fiscal 2007 to…
10 3513 "Guidance, For the full fiscal year 2008, the Company is moderating its pr…
I can run the following:
x <- d %>%
  mutate(
    extractedWords = str_extract_all(newCol, "([^\\s]+\\s){2}earnings(\\s[^\\s]+){12}")
  )
Where I get:
# A tibble: 10 x 1
extractedWords
<list>
1 <chr [1]>
2 <chr [3]>
3 <chr [1]>
4 <chr [0]>
5 <chr [0]>
6 <chr [0]>
7 <chr [1]>
8 <chr [0]>
9 <chr [2]>
10 <chr [4]>
First, I want to modify str_extract_all(newCol, "([^\\s]+\\s){2}earnings(\\s[^\\s]+){12}"), which currently extracts the 2 words before the word earnings and the 12 words after it. I want to change it so that it extracts the words before the dollar ($) symbol instead.
Secondly I want to unnest the columns. When I run:
x %>%
  unnest(extractedWords)
The number of rows in the data goes from 10 to 12. I want to unnest it but paste the c("text", "more text") into something like text, more text or separated by | (or some variation).
Data:
d <- structure(list(grp = c(6303L, 7686L, 9577L, 6475L, 5229L, 2648L,
3691L, 9297L, 2080L, 3513L), newCol = c("The company sees earnings of $3.40 to $3.60 a share for all of 2008, against $2.74 a share from continuing operations in 2007, an increase of 24% to 31%. ",
" -- reaffirmed its fiscal 2015 guidance of per share diluted earnings between , We believe the guidance outlook for fiscal 2015 remains realistic and takes into consideration the heightened competitive market trends for the Diagnostics segment offset by strategic investments that target the growing outpatient segment and further growth of our Life Science segment through our focus on global expansion, increasing industrial market efforts, and emerging success in the AgriBio and genomics research areas., FISCAL 2015 GUIDANCE REAFFIRMED, For the fiscal year ending September 30, 2015, management expects net revenues to be in the range of $193 million to $200 million and per share diluted earnings to be between $0.85 and $0.91. The per share estimates assume an increase in average diluted shares outstanding from approximately 41.9 million at fiscal 2014 year end to approximately 42.4 million at fiscal 2015 year end. The revenue and earnings guidance provided in this press release is from expected internal growth and does not include the impact of any additional acquisitions the Company might complete during fiscal 2015.",
"Analysts polled by FactSet anticipate earnings of 96 cents a share, down 10 cents from a year earlier. Revenue is expected to have decreased 5.5% to $30.4 billion, which would mark the fourth consecutive quarterly decline after six years of growth. Verizon has said it expects earnings and sales will be roughly flat this year.",
"", "The company also expects income to be \"significantly impacted\" by costs related to its pending acquisition of SRS Labs Inc. (SRSL)., The company also expects income to be \"significantly impacted\" by costs related to its pending acquisition of SRS Labs Inc. (SRSL).",
"Hoku also expects losses for the foreseeable future on significant cost increases. ",
"St. Louis-based Emerson now sees full-year earnings of $2.40 to $2.60 a share, down from its February forecast of $2.70 to $2.95 a share. The company also expects net sales for the fiscal year to fall 13% to 15% to $21 billion to $21.7 billion. Sales are expected to be hurt by about 5% because of currency translations, but boosted by 1% because of acquisitions.",
"", "The restaurant group also raised its earnings guidance for fiscal 2007 to a range of $3.45 a share to $3.50 a share, and said its expects earnings for the third quarter ending in July of 85 cents to 89 cents a share and same-store sales growth of 5% to 6%. ",
"Guidance, For the full fiscal year 2008, the Company is moderating its previously issued guidance and expects net sales to be approximately $950 million and earnings per diluted share to be approximately $1.00, which includes approximately $0.45 per diluted share of restructuring charges and other unusual items. For the twelve months ended February 2, 2008, net sales were $1.09 billion and earnings per diluted share were $2.59., Set forth below is our reconciliation of net earnings per share, calculated in accordance with generally accepted accounting principles, or GAAP, to net earnings per share, as adjusted, for certain historical periods and certain future periods. For reference, we also include our previous guidance for third quarter fiscal 2008. Net earnings per share, as adjusted, excludes (i) the net impact of certain restructuring costs and other unusual items as well as the write off of unamortized financing costs during the first three quarters of fiscal 2008 and (ii) the anticipated impact of certain restructuring costs in the fourth quarter of fiscal 2008. We believe that investors often look at ongoing operations as a measure of assessing performance and as a basis for comparing past results against future results. Therefore, we believe that presenting our results and expected results excluding these items provides useful information to investors because this allows investors to make decisions based on our ongoing operations. We use the results excluding these items to discuss our business with investment institutions, our board of directors and others. Further, we believe that presenting our results and expected results excluding these items provides useful information to investors because this allows investors to compare our results and our expected results for the periods presented to other periods., Guidance Results for Guidance Guidance"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
I'm not 100% sure what you mean by the first part of your question. Assuming you want to extract all words after the word earnings and before $, this should do what you want. It uses a 'positive lookahead' and allows for any number of words until it finds the first dollar sign (hence the *?).
Rather than unnesting, I loop over the extractedWords column using purrr::map_chr, which returns a character vector and makes further unnesting unnecessary.
library(tidyverse)
d %>%
  mutate(
    extractedWords = str_extract_all(newCol, "([^\\s]+\\s){2}\\$(\\s?[^\\s]+){12}")
  ) %>%
  mutate(result = map_chr(extractedWords, str_c, "", collapse="|"))
EDIT: edited the regular expression to extract 2 words before the dollar sign and 12 words after it. Note I had to escape the dollar sign (\\$) for it to work, since a dollar sign has a special meaning in a regular expression.
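For readers without the tidyverse installed, the same idea can be sketched in base R. This swaps str_extract_all for gregexpr plus regmatches and map_chr for vapply; it is an illustrative equivalent, not the code from the answer. The three input strings are taken from the sample data, and the names txt, matches, and result are made up for this example.

```r
# Three strings from the sample data: one with dollar amounts, one empty,
# and one without any dollar sign
txt <- c(
  "The company sees earnings of $3.40 to $3.60 a share for all of 2008, against $2.74 a share from continuing operations in 2007, an increase of 24% to 31%. ",
  "",
  "Hoku also expects losses for the foreseeable future on significant cost increases. "
)

# Two words, an escaped dollar sign, then twelve whitespace-separated tokens
pattern <- "([^\\s]+\\s){2}\\$(\\s?[^\\s]+){12}"

# One character vector of matches per input string (possibly of length zero)
matches <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

# Collapse each vector of matches into a single string separated by "|"
result <- vapply(matches, paste, character(1), collapse = "|")
print(result)
```

Strings with no match collapse to an empty string, so the result has exactly one element per input row and nothing needs to be unnested.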

RStudio: Separate YYYY-MM-DD into Individual Columns

I am fairly new to R and I am pulling my hair out trying to do what is probably something super simple.
I downloaded the crime data for Los Angeles from 2010 - 2019. There are 2,114,010 rows of data. Right now, it is called 'df' in my Global Environment area.
I want to manipulate one specific column titled "Occurred" - which is a date reference to when the crime occurred.
Right now, it is set up as YYYY-MM-DD (i.e., 2010-02-20).
I am trying to separate all three into individual columns. I have Googled, and Googled, and Googled and tried and tried and tried things from this forum and StackExchange and just cannot get it to work.
I have tried Lubridate and followed instructions to other answers, but it simply won't create new columns (one each for Year, Month, Day).
Here is a bit of the reprex from the dataset ... I did not include all of the different variables, because they aren't the issue.
As mentioned, I am trying to separate 'occurred' into individual Year, Month, and Day columns.
> head(df, 10)[c('dr_no','occurred','time','area_name')]
dr_no occurred time area_name
1 1307355 2010-02-20 1350 Newton
2 11401303 2010-09-12 45 Pacific
3 70309629 2010-08-09 1515 Newton
4 90631215 2010-01-05 150 Hollywood
5 100100501 2010-01-02 2100 Central
6 100100506 2010-01-04 1650 Central
7 100100508 2010-01-07 2005 Central
8 100100509 2010-01-08 2100 Central
9 100100510 2010-01-09 230 Central
10 100100511 2010-01-06 2100 Central
We can do this with dplyr and lubridate:
library(dplyr)
library(lubridate)
df <- df %>%
  mutate(occurred = as.Date(occurred),
         year = year(occurred), month = month(occurred), day = day(occurred))
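If the packages are not available, a base R alternative is possible. This is a sketch on three rows copied from the question, assuming occurred is stored as text in YYYY-MM-DD form; it uses format() on a Date instead of the lubridate helpers.

```r
# Three rows from the sample in the question
df <- data.frame(
  dr_no    = c(1307355, 11401303, 70309629),
  occurred = c("2010-02-20", "2010-09-12", "2010-08-09")
)

# Convert to Date first, then pull out each component as an integer
df$occurred <- as.Date(df$occurred)
df$year  <- as.integer(format(df$occurred, "%Y"))
df$month <- as.integer(format(df$occurred, "%m"))
df$day   <- as.integer(format(df$occurred, "%d"))

print(df)
```

In either version the result has to be assigned back to df (or to new columns of df); otherwise the new columns are computed and then discarded, which may be why earlier attempts appeared to do nothing.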

Line graph with ggplot2 in R Studio [closed]

I am trying to learn the R programming language to analyse and visualize my data. I have made some good progress so far and I am really enjoying learning R, but I am stumped here.
I am having some trouble creating line graphs for products in specific categories. I have no problem creating graphs that show sales across all categories, but I would like to specify a particular category and show the sales of its products.
This is what my data set looks like.
Can someone show me how I could do this? E.g I would like to create a line graph to show the sales of Products in the Bakery category where the X axis would have the product name and the Y axis would have the quantity sold.
Any help would be greatly appreciated.
Next time, please include the head of your data. This can be done using
head(Store_sales)
  ProductID      category sales                  product
1       101        Bakery  9468              White bread
2       102 Personal Care  9390 Everday Female deodorant
3       103        Cereal  9372                 Weetabix
4       104       Produce  9276                    Apple
5       105          Meat  9268          Chicken Breasts
6       106        Bakery  9252                Pankcakes
I reproduced the relevant fields to help you out. The first step is to filter the Bakery items out of the category column.
> install.packages("tidyverse")
> library(tidyverse)
Store sales before filter
> Store_sales
  ProductID      category sales                  product
1       101        Bakery  9468              White bread
2       102 Personal Care  9390 Everday Female deodorant
3       103        Cereal  9372                 Weetabix
4       104       Produce  9276                    Apple
5       105          Meat  9268          Chicken Breasts
6       106        Bakery  9252                Pankcakes
7       107       Produce  9228                   Carrot
Filter out "Bakery" from category column into Store_sales_bakery
> Store_sales_bakery <- filter(Store_sales, category == "Bakery")
What Store_sales_bakery includes
> Store_sales_bakery
  ProductID category sales     product
1       101   Bakery  9468 White bread
2       106   Bakery  9252   Pankcakes
Unfortunately, the picture you gave us does not contain enough information to produce a line graph (there is only one data point for each product, which is not enough to draw a line), so I created a point plot instead.
ggplot(Store_sales, aes(x = product, y = sales)) + geom_point()
ggplot point
Here is a bar plot with two variables
ggplot(Store_sales, aes(x = product, y = sales)) + geom_bar(stat = "identity")
bar plot
If you had enough data to make a line graph, you would replace geom_bar() or geom_point() with geom_line().
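As a rough sketch of what that could look like for the Bakery subset, the data frame below is rebuilt by hand from the rows shown above. One detail to be aware of: when the x axis is categorical (product names), geom_line() needs group = 1 in the aesthetics so that the points are treated as a single series. With only two Bakery products the "line" is a single segment, so this is mainly useful once more products are available.

```r
library(ggplot2)

# Rebuilt by hand from the rows shown in the answer
Store_sales <- data.frame(
  ProductID = 101:107,
  category  = c("Bakery", "Personal Care", "Cereal", "Produce",
                "Meat", "Bakery", "Produce"),
  sales     = c(9468, 9390, 9372, 9276, 9268, 9252, 9228),
  product   = c("White bread", "Everday Female deodorant", "Weetabix",
                "Apple", "Chicken Breasts", "Pankcakes", "Carrot")
)

# Keep only the Bakery rows (base R equivalent of dplyr::filter)
Store_sales_bakery <- subset(Store_sales, category == "Bakery")

# group = 1 joins the discrete x values into one line
p <- ggplot(Store_sales_bakery, aes(x = product, y = sales, group = 1)) +
  geom_line() +
  geom_point()
print(p)
```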
Here is a link to ggplot cheatsheet that may help you in the future
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
