Existing object is not found in complete function - r

I have a data frame with the following columns: Entity, Customer Class (CClass), Month, and others.
|CClass |Entity |Month| Sales volume|
|-------|--------|-----|-------------|
|Bakery | 1 | 1 |100 |
|Bakery | 1 | 2 |106 |
|Bakery | 1 | 3 |103 |
|Bakery | 1 | 5 |135 |
|Bakery | 1 | 6 |121 |
|Bakery | 1 | 7 |176 |
|Bakery | 1 | 10 |133 |
|Bakery | 1 | 11 |100 |
|Bakery | 1 | 12 |112 |
|Bakery | 2 | 1 |136 |
|Bakery | 2 | 3 |123 |
|Bakery | 2 | 4 |108 |
|Bakery | 2 | 5 |101 |
|Bakery | 2 | 7 |105 |
|Bakery | 3 | 10 |103 |
|Bakery | 3 | 11 |106 |
|Bakery | 3 | 12 |110 |
|Grocery| 1 | 1 |120 |
|Grocery| 1 | 2 |150 |
When I try to fill in the missing Month values for each Customer Class using the complete() function:
DF <- complete(DF, nesting(Entity, CClass), Month)
I get the error message "! object 'Entity' not found". Here is the actual call and the full error:
st <- complete(ST, nesting(Entity, CClass), SBMONTH)
Error in dplyr::summarise():
! Problem while computing ..1 = complete(data = dplyr::cur_data(), ..., fill = fill, explicit = explicit).
i The error occurred in group 1: CClass = "Bagel Shop", End Market = "Food Service", Entity = 1.
Caused by error:
! object 'Entity' not found
Run rlang::last_error() to see where the error occurred.
With the test samples, however, the function works.
Please advise.

I can't reproduce the error. Starting from a fresh R session and using this data:
DF = read.table(text = 'CClass Entity Month Sales_volume
Bakery 1 1 100
Bakery 1 2 106
Bakery 1 3 103
Bakery 1 5 135
Bakery 1 6 121
Bakery 1 7 176
Bakery 1 10 133
Bakery 1 11 100
Bakery 1 12 112
Bakery 2 1 136
Bakery 2 3 123
Bakery 2 4 108
Bakery 2 5 101
Bakery 2 7 105
Bakery 3 10 103
Bakery 3 11 106
Bakery 3 12 110
Grocery 1 1 120
Grocery 1 2 150', header = T)
I load tidyr and run the complete command you have in your question and get reasonable-looking output:
library(tidyr)
complete(DF, nesting(Entity, CClass), Month)
# # A tibble: 40 × 4
# Entity CClass Month Sales_volume
# <int> <chr> <int> <int>
# 1 1 Bakery 1 100
# 2 1 Bakery 2 106
# 3 1 Bakery 3 103
# 4 1 Bakery 4 NA
# 5 1 Bakery 5 135
# 6 1 Bakery 6 121
# 7 1 Bakery 7 176
# 8 1 Bakery 10 133
# 9 1 Bakery 11 100
# 10 1 Bakery 12 112
# # … with 30 more rows
# # ℹ Use `print(n = ...)` to see more rows
Some ideas: make sure you are using tidyr::complete. If you loaded another package with a complete function, it might be masking the correct version. You can check conflicts() to see whether complete is listed, and if so, call tidyr::complete explicitly to get the correct version. Also check names(DF) and make sure your column names are exactly what you think they are: no extra whitespace, capitalized correctly, etc. Finally, check class(DF) and make sure it is data.frame or tbl_df, and have a look at str(DF) to make sure the columns have appropriate classes. Since you didn't use dput() to share your data, we can't be sure what the class of the data or of the columns is.
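The checks above can be sketched in base R as follows. The DF here is a hypothetical stand-in (I deliberately gave one column a trailing space), not your real data. The last check is my own guess: the error text in the question mentions dplyr::summarise() and "group 1", which hints that ST may be a grouped data frame. If it is, calling dplyr::ungroup() before complete() might be worth trying, but I have not confirmed that this is the cause.

```r
# Hypothetical stand-in for the asker's data. One column name has a trailing
# space on purpose, the kind of mismatch that names() would reveal.
DF <- data.frame(CClass = c("Bakery", "Grocery"),
                 Entity = c(1L, 1L),
                 Month  = c(1L, 2L))
names(DF)[2] <- "Entity "

# Which of the columns used in complete() are not found under that exact name?
wanted <- c("Entity", "CClass", "Month")
missing_cols <- setdiff(wanted, names(DF))
print(missing_cols)

# Strip stray whitespace from the names and check again
names(DF) <- trimws(names(DF))
still_missing <- setdiff(wanted, names(DF))
print(still_missing)

# Class and column types
print(class(DF))
str(DF)

# Assumption (a guess from the error text): the real data may be grouped
is_grouped <- inherits(DF, "grouped_df")
print(is_grouped)

# Is `complete` defined in more than one attached package?
print("complete" %in% conflicts())
```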
If you still have trouble, please try to find a sample of data that reproduces the problem and please use dput to share it.

Related

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to run a Friedman test. My data are repeated measures and nonparametric.
The data are organized like this in the CSV. I imported them with RStudio's Import Dataset function, so they are a table in RStudio:
score| treatment | day
10 | 1 | 1
20 | 1 | 1
40 | 1 | 1
7 | 2 | 1
100| 2 | 1
58 | 2 | 1
98 | 3 | 1
89 | 3 | 1
40 | 3 | 1
70 | 4 | 1
10 | 4 | 1
28 | 4 | 1
86 | 5 | 1
200| 5 | 1
40 | 5 | 1
77 | 1 | 2
100| 1 | 2
90 | 1 | 2
33 | 2 | 2
15 | 2 | 2
25 | 2 | 2
23 | 3 | 2
54 | 3 | 2
67 | 3 | 2
1 | 4 | 2
2 | 4 | 2
400| 4 | 2
16 | 5 | 2
10 | 5 | 2
90 | 5 | 2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data) #not sure if this should be used only because according to https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf it says to use attach for R to use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's test.
This is the error I get:
>Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day, :
not an unreplicated complete block design
Not sure what is wrong.
Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor
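A note for readers: friedman.test() expects exactly one observation for each combination of group and block, and the data above have three scores per treatment and day. The sketch below rebuilds the posted data, counts the observations per cell, and then collapses each cell to its median before running the test. Collapsing to the median is only one possible workaround and is my assumption, not something the asker chose. The roles of treatment (groups) and day (blocks) follow the call in the question.

```r
# Rebuild the posted data: 5 treatments x 2 days x 3 scores per cell
score <- c(10, 20, 40, 7, 100, 58, 98, 89, 40, 70, 10, 28, 86, 200, 40,
           77, 100, 90, 33, 15, 25, 23, 54, 67, 1, 2, 400, 16, 10, 90)
sample_data <- data.frame(
  score     = score,
  treatment = factor(rep(rep(1:5, each = 3), times = 2)),
  day       = factor(rep(1:2, each = 15))
)

# Count observations per treatment/day cell; any count other than 1
# triggers the "not an unreplicated complete block design" error
cell_counts <- table(sample_data$treatment, sample_data$day)
print(cell_counts)

# One possible workaround (an assumption): one median score per cell
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = median)

# Formula interface: response ~ groups | blocks
res <- friedman.test(score ~ treatment | day, data = agg)
print(res)
```

Note that print(res) shows the test result. summary() on the returned object only lists its components.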

r data.table groupby join in pyspark 1.6

I have the following datatables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1 | 2 | 8 |2017-12-31
2 | 2 | 12 |2017-12-31
3 | 6 | 20 |2017-12-31
4 | 1 | 5 |2017-12-31
1 | 2 | 8 |2018-12-31
2 | 2 | 12 |2018-12-31
3 | 6 | 20 |2018-12-31
4 | 1 | 5 |2018-12-31
1 | 2 | 8 |2019-12-31
2 | 2 | 12 |2019-12-31
3 | 6 | 20 |2019-12-31
4 | 1 | 5 |2019-12-31
1 | 2 | 8 |2020-12-31
2 | 2 | 12 |2020-12-31
3 | 6 | 20 |2020-12-31
4 | 1 | 5 |2020-12-31
I've been trying hard to find something similar with PySpark, but I've not been able to. What should be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give me, this one is important for me.
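It looks like you're trying to do a cross join? Assuming both data frames are registered as tables named accounts and timeline, something like this should work (spark is a SparkSession, which exists from Spark 2.0 on; in 1.6, call sql on your SQLContext instead):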
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts, timeline
''')
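For comparison, here is a minimal base R sketch of the same cross join, which can help check that the Spark result has the expected shape. It uses plain data frames instead of data.table and relies on merge() returning the Cartesian product when by has length zero. The DATE column of accounts is dropped first so that only the timeline dates appear in the result.

```r
# The two tables from the question, as plain data frames
accounts <- data.frame(
  ACC_ID = 1:4,
  DATE   = "2017-12-31",
  RATIO  = c(2, 2, 6, 1),
  VALUE  = c(8, 12, 20, 5)
)
timeline <- data.frame(
  DATE = c("2017-12-31", "2018-12-31", "2019-12-31", "2020-12-31")
)

# by = NULL makes merge() return every pairing of rows (a cross join)
projected <- merge(accounts[, c("ACC_ID", "RATIO", "VALUE")],
                   timeline, by = NULL)
print(projected)
```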

R data.table check if a row exists in another data.table

I have two data.tables like so:
tests
id | test | score
=================
1 | 1 | 90
1 | 2 | 100
2 | 1 | 70
2 | 2 | 80
3 | 1 | 100
3 | 2 | 95
cheaters
id | test | score
=================
1 | 2 | 100
3 | 1 | 100
3 | 2 | 95
Say I now want to include a boolean column in tests to tell whether that particular test was cheated on, so the output would be like this:
tests
id | test | score | cheat
=========================
1 | 1 | 90 | FALSE
1 | 2 | 100 | TRUE
2 | 1 | 70 | FALSE
2 | 2 | 80 | FALSE
3 | 1 | 100 | TRUE
3 | 2 | 95 | TRUE
Is there an easy way to do this? The tables are keyed on id and test.
Create the cheat column with an initial value of FALSE, then join with cheaters and update the cheat column to TRUE where there is a match:
library(data.table)
setkey(setDT(tests), id, test)
setkey(setDT(cheaters), id, test)
tests[, cheat := FALSE][cheaters, cheat := TRUE]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
Or, without setting the keys, use the on parameter to specify the columns to join on:
setDT(tests)
setDT(cheaters)
tests[, cheat := FALSE][cheaters, cheat := TRUE, on = .(id, test)]
tests
# id test score cheat
#1: 1 1 90 FALSE
#2: 1 2 100 TRUE
#3: 2 1 70 FALSE
#4: 2 2 80 FALSE
#5: 3 1 100 TRUE
#6: 3 2 95 TRUE
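If data.table is not available, the same flag can be sketched in base R. This is an alternative technique, not the update join shown above: it builds a composite key from id and test on both sides and checks membership with %in%. It assumes that pasting the two columns with a space gives an unambiguous key, which holds for integer ids like these.

```r
# The two tables from the question, as plain data frames
tests <- data.frame(
  id    = c(1, 1, 2, 2, 3, 3),
  test  = c(1, 2, 1, 2, 1, 2),
  score = c(90, 100, 70, 80, 100, 95)
)
cheaters <- data.frame(
  id    = c(1, 3, 3),
  test  = c(2, 1, 2),
  score = c(100, 100, 95)
)

# Composite key per row, e.g. "1 2" for id 1, test 2
key_tests    <- paste(tests$id, tests$test)
key_cheaters <- paste(cheaters$id, cheaters$test)

# TRUE where the (id, test) pair also appears in cheaters
tests$cheat <- key_tests %in% key_cheaters
print(tests)
```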

R: stem and leaf plot issue

I have the following vector:
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
208.67, 235.06, 240.53, 251.65,254.47, 254.47, 333.29)
I want to get the following stem and leaf plot in R:
Stem Leaf
5 4 8
6 0
8 6 9
9 1 5 5 6 8
10 0 2 2 7
11 3
13 0 0 8
14 7 9
15 3 8
16 5
20 1 8
23 5
24 0
25 1 4 4
33 3
However, when I try the stem() function in R, I get the following:
> stem(x)
The decimal point is 2 digit(s) to the right of the |
0 | 566999
1 | 000000011334
1 | 55567
2 | 0144
2 | 555
3 | 3
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |
4 | 48
6 | 1
8 | 7025579
10 | 02283
12 | 119
14 | 7048
16 | 5
18 |
20 | 19
22 | 5
24 | 1244
26 |
28 |
30 |
32 | 3
Question: Am I missing an argument in the stem() function? If not, is there another solution?
I believe what you want is a little non-standard: a stem-and-leaf plot should have equally spaced numbers/digits on its left, and you're asking for irregularly spaced ones. I understand your frustration that 54 and 58 are grouped within the 40s, but the stem-and-leaf plot is really just a textual representation of a horizontal histogram, and the numbers on the side reflect the "bins", which will often begin or end outside the known data. Think of the left-hand numbers in stem(x, scale=2) as 40-59, 60-79, etc.
You probably already tried this, but
stem(x, scale=3)
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 7 |
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 12 |
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 17 |
# 18 |
# 19 |
# 20 | 19
# 21 |
# 22 |
# 23 | 5
# 24 | 1
# 25 | 244
# 26 |
# 27 |
# 28 |
# 29 |
# 30 |
# 31 |
# 32 |
# 33 | 3
This is a good start, and is "proper" in that the bins are equally sized.
If you must remove the empty rows (which to me are still statistically significant, relevant, informative, etc), then because stem's default is to print to the console, you'll need to capture the console output (might have problems in rmarkdown docs), filter out the empty rows, and re-cat them to the console.
cat(Filter(function(s) grepl("decimal|\\|.*[0-9]", s),
capture.output(stem(x, scale=3))),
sep="\n")
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 20 | 19
# 23 | 5
# 24 | 1
# 25 | 244
# 33 | 3
(My grepl regex could likely be improved to handle something akin to "if there is a pipe, then it must be followed by one or more digits", but I think this suffices for now.)
There are some inequalities, in that you want 6 | 0, but your 60.82 is rounding to 61 (ergo the "1"). If you really want the 60.82 to be a 6 | 0, then truncate it with stem(trunc(x), scale=3). It's not exact, but I'm guessing that's because your sample output is hand-jammed.
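If the exact layout from the question matters more than using stem(), the display can also be built by hand. The sketch below is my own illustration, not part of stem(): it truncates each value to a whole number, uses the tens and above as the stem and the units digit as the leaf, and prints only the stems that occur.

```r
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
       95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
       102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
       147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
       208.67, 235.06, 240.53, 251.65, 254.47, 254.47, 333.29)

# Truncate to whole numbers (the values are positive, so floor() truncates)
whole  <- floor(sort(x))
stems  <- whole %/% 10   # everything above the units digit
leaves <- whole %% 10    # the units digit

# Group leaves by stem; stems with no data simply never appear
leaf_list <- split(leaves, stems)

# One line per stem, leaves separated by spaces
out <- sprintf("%4s | %s", names(leaf_list),
               vapply(leaf_list, paste, character(1), collapse = " "))
cat(out, sep = "\n")
```

Because empty stems are dropped, the vertical spacing no longer reflects the shape of the distribution, which is the trade-off described above.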

Reshape dataframe based on the datediff

I have activity data from a website:
DAY | NB_USERS_CONNECTED
1 | 10
2 | 14
3 | 15
4 | 11
5 | 17
6 | 11
How can I reshape the data frame to create a column with the number of users
who were connected the day before?
DAY | NB_USERS_CONNECTED_DAY0 | NB_USERS_CONNECTED_DAY_-1
1 | 10 | NA
2 | 14 | 10
3 | 15 | 14
4 | 11 | 15
5 | 17 | 11
6 | 11 | 17
If possible, I'd like a method that also works with a lag of 2 days, giving both
NB_USERS_CONNECTED_DAY_-1 and NB_USERS_CONNECTED_DAY_-2.
You can use head with a negative argument:
transform(dat,day_before=c(NA,head(dat$NB_USERS_CONNECTED,-1)))
DAY NB_USERS_CONNECTED day_before
1 1 10 NA
2 2 14 10
3 3 15 14
4 4 11 15
5 5 17 11
6 6 11 17
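To cover the 2-day lag as well, the same idea can be wrapped in a small helper. This is a sketch: lag_by is a hypothetical name, and it assumes the rows are already sorted by DAY with no missing days.

```r
dat <- data.frame(DAY = 1:6,
                  NB_USERS_CONNECTED = c(10, 14, 15, 11, 17, 11))

# Shift a vector down by k positions, padding the start with NA
# (hypothetical helper; assumes rows are ordered by DAY with no gaps)
lag_by <- function(v, k) {
  head(c(rep(NA, k), v), length(v))
}

dat$day_minus_1 <- lag_by(dat$NB_USERS_CONNECTED, 1)
dat$day_minus_2 <- lag_by(dat$NB_USERS_CONNECTED, 2)
print(dat)
```

Padding first and then trimming with head() keeps the result the same length as the input for any k, including 0.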
