R data.table: check if a row exists in another data.table

I have two data.tables like so:
tests

| id | test | score |
|----|------|-------|
| 1  | 1    | 90    |
| 1  | 2    | 100   |
| 2  | 1    | 70    |
| 2  | 2    | 80    |
| 3  | 1    | 100   |
| 3  | 2    | 95    |
cheaters

| id | test | score |
|----|------|-------|
| 1  | 2    | 100   |
| 3  | 1    | 100   |
| 3  | 2    | 95    |
Say I now want to include a boolean column in tests to tell whether that particular test was cheated on, so the output would look like this:
tests

| id | test | score | cheat |
|----|------|-------|-------|
| 1  | 1    | 90    | FALSE |
| 1  | 2    | 100   | TRUE  |
| 2  | 1    | 70    | FALSE |
| 2  | 2    | 80    | FALSE |
| 3  | 1    | 100   | TRUE  |
| 3  | 2    | 95    | TRUE  |
Is there an easy way to do this? The tables are keyed on id and test.

Create the cheat column with an initial value of FALSE, then join with cheaters and update cheat to TRUE wherever there's a match:
library(data.table)
setkey(setDT(tests), id, test)
setkey(setDT(cheaters), id, test)
tests[, cheat := FALSE][cheaters, cheat := TRUE]
tests
#    id test score cheat
#1:   1    1    90 FALSE
#2:   1    2   100  TRUE
#3:   2    1    70 FALSE
#4:   2    2    80 FALSE
#5:   3    1   100  TRUE
#6:   3    2    95  TRUE
Or, without setting keys, use the on argument to specify the columns to join on:
setDT(tests)
setDT(cheaters)
tests[, cheat := FALSE][cheaters, cheat := TRUE, on = .(id, test)]
tests
#    id test score cheat
#1:   1    1    90 FALSE
#2:   1    2   100  TRUE
#3:   2    1    70 FALSE
#4:   2    2    80 FALSE
#5:   3    1   100  TRUE
#6:   3    2    95  TRUE
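If you prefer to skip the two-step initialize-then-update, here is a hedged one-step sketch (same tables assumed; not from the original answer). In data.table, X[i, which = TRUE] returns, for each row of i, the matching row number in X, with NA when there is no match, so the flag is simply "the lookup found a row":
# one-step sketch: NA from the lookup means "no match in cheaters"
tests[, cheat := !is.na(cheaters[tests, on = .(id, test), which = TRUE])]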

Existing object is not found by the complete() function

I have a data frame with the following columns: Entity, Customer Class (CClass), Month, and others.
|CClass |Entity |Month| Sales volume|
|-------|--------|-----|-------------|
|Bakery | 1 | 1 |100 |
|Bakery | 1 | 2 |106 |
|Bakery | 1 | 3 |103 |
|Bakery | 1 | 5 |135 |
|Bakery | 1 | 6 |121 |
|Bakery | 1 | 7 |176 |
|Bakery | 1 | 10 |133 |
|Bakery | 1 | 11 |100 |
|Bakery | 1 | 12 |112 |
|Bakery | 2 | 1 |136 |
|Bakery | 2 | 3 |123 |
|Bakery | 2 | 4 |108 |
|Bakery | 2 | 5 |101 |
|Bakery | 2 | 7 |105 |
|Bakery | 3 | 10 |103 |
|Bakery | 3 | 11 |106 |
|Bakery | 3 | 12 |110 |
|Grocery| 1 | 1 |120 |
|Grocery| 1 | 2 |150 |
When I try to populate the missing Month to each Customer Class using the complete() function:
DF <- complete(DF, nesting(Entity, CClass), Month)
I get the error message "! object 'Entity' not found". With my actual data (the object ST, whose month column is SBMONTH), the call and full error are:
st <- complete(ST, nesting(Entity, CClass), SBMONTH)
Error in dplyr::summarise():
! Problem while computing ..1 = complete(data = dplyr::cur_data(), ..., fill = fill, explicit = explicit).
ℹ The error occurred in group 1: CClass = "Bagel Shop", End Market = "Food Service", Entity = 1.
Caused by error:
! object 'Entity' not found
Run rlang::last_error() to see where the error occurred.
But the function works with test samples.
Please advise.
I can't reproduce the error. Starting from a fresh R session and using this data:
DF = read.table(text = 'CClass Entity Month Sales_volume
Bakery 1 1 100
Bakery 1 2 106
Bakery 1 3 103
Bakery 1 5 135
Bakery 1 6 121
Bakery 1 7 176
Bakery 1 10 133
Bakery 1 11 100
Bakery 1 12 112
Bakery 2 1 136
Bakery 2 3 123
Bakery 2 4 108
Bakery 2 5 101
Bakery 2 7 105
Bakery 3 10 103
Bakery 3 11 106
Bakery 3 12 110
Grocery 1 1 120
Grocery 1 2 150', header = TRUE)
I load tidyr and run the complete command you have in your question and get reasonable-looking output:
library(tidyr)
complete(DF, nesting(Entity, CClass), Month)
# # A tibble: 40 × 4
# Entity CClass Month Sales_volume
# <int> <chr> <int> <int>
# 1 1 Bakery 1 100
# 2 1 Bakery 2 106
# 3 1 Bakery 3 103
# 4 1 Bakery 4 NA
# 5 1 Bakery 5 135
# 6 1 Bakery 6 121
# 7 1 Bakery 7 176
# 8 1 Bakery 10 133
# 9 1 Bakery 11 100
# 10 1 Bakery 12 112
# # … with 30 more rows
# # ℹ Use `print(n = ...)` to see more rows
Some ideas: make sure you are using tidyr::complete; if you loaded another package that also exports a complete() function, it may be masking the one you want. Check conflicts() to see whether complete is listed, and if so call tidyr::complete() explicitly. Also check names(DF) to confirm the column names are exactly what you think they are (no stray whitespace, correct capitalization), and check class(DF) and str(DF) to confirm the object and its columns have appropriate classes. Notably, your error trace goes through dplyr::summarise() and mentions "group 1", which suggests your real data may be a grouped tibble; trying ungroup() first is worthwhile. Since you didn't use dput() to share your data, we can't be sure what the classes of the data or the columns are.
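A quick diagnostic sketch along those lines (object and column names taken from the question; the ungroup() call only matters if DF turns out to be grouped):
conflicts()                    # is "complete" listed more than once?
environment(complete)          # which namespace the visible complete() comes from
names(DF); class(DF); str(DF)  # exact column names, object class, column types
# fully qualified call on an explicitly ungrouped copy:
tidyr::complete(dplyr::ungroup(DF), tidyr::nesting(Entity, CClass), Month)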
If you still have trouble, please try to find a sample of data that reproduces the problem and please use dput to share it.

R: How to use the map function to find min value within a subset of columns

I am trying to find out how to efficiently compute the minimum of runtime_sec over a subset of the hour column, potentially using an anonymous function. Currently I take the long way around: I create a new data frame and then join it back to the existing one, but I would like to do this more efficiently, without saving out an intermediate data frame. I've been looking at the map (purrr) functions but am having a bit of trouble understanding them. Apologies in advance if this is confusing; this is my first post on here.
Existing df:
| index | hour | runtime_sec |
|-----: |-----:| -----------:|
| 1 | 6 | 50 |
| 1 | 7 | 100 |
| 1 | 8 | 120 |
| 1 | 9 | 90 |
| 1 | 10 | 100 |
| 1 | 11 | 100 |
| 2 | 10 | 100 |
Current code:
df_min <- df %>%
  group_by(index) %>%
  filter(hour >= 8 & hour < 10) %>%
  summarize(min_ref = min(runtime_sec))

df_join <- df %>%
  left_join(df_min, by = "index")
Desired output:
| index | hour | runtime_sec | min_ref |
|----: |----: | ----: | ----: |
| 1 | 6 | 50 | 90 |
| 1 | 7 | 100 | 90 |
| 1 | 8 | 120 | 90 |
| 1 | 9 | 90 | 90 |
| 1 | 10 | 100 | 90 |
| 1 | 11 | 100 | 90 |
| 2 | 10 | 100 | 100 |
dat %>%
  group_by(index) %>%
  mutate(min_ref = if (any(hour >= 8 & hour < 10)) min(runtime_sec[hour >= 8 & hour < 10]) else NA) %>%
  ungroup()
# # A tibble: 7 x 4
# index hour runtime_sec min_ref
# <int> <int> <int> <int>
# 1 1 6 50 90
# 2 1 7 100 90
# 3 1 8 120 90
# 4 1 9 90 90
# 5 1 10 100 90
# 6 1 11 100 90
# 7 2 10 100 NA
Your expectation of min_ref = 100 for index == 2 goes against your own rule: hour 10 is not < 10, so no rows meet the condition. If you expect it to match, you should be using hour <= 10, in which case hour >= 8 & hour <= 10 can be replaced with between(hour, 8, 10), as sketched below.
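For reference, a sketch of that between() variant (dplyr::between() is inclusive on both ends, so hour 10 now qualifies):
dat %>%
  group_by(index) %>%
  mutate(min_ref = if (any(between(hour, 8, 10))) min(runtime_sec[between(hour, 8, 10)]) else NA) %>%
  ungroup()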
You can reduce the code slightly if you accept that Inf is a reasonable "minimum" when no values qualify:
dat %>%
  group_by(index) %>%
  mutate(min_ref = suppressWarnings(min(runtime_sec[hour >= 8 & hour < 10]))) %>%
  ungroup()
# # A tibble: 7 x 4
# index hour runtime_sec min_ref
# <int> <int> <int> <dbl>
# 1 1 6 50 90
# 2 1 7 100 90
# 3 1 8 120 90
# 4 1 9 90 90
# 5 1 10 100 90
# 6 1 11 100 90
# 7 2 10 100 Inf
though this just shortens the code a little.
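Since the question specifically asked about purrr, here is a hedged map-based sketch of the same idea: split by index, compute each group's minimum, and bind the pieces back together (list_rbind() needs purrr >= 1.0; bind_rows() works on older versions):
library(dplyr)
library(purrr)
dat %>%
  group_split(index) %>%   # one data frame per index
  map(~ mutate(.x, min_ref = suppressWarnings(min(runtime_sec[hour >= 8 & hour < 10])))) %>%
  list_rbind()             # back to a single data frame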

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to do a Friedman's test; yes, my data is repeated measures, but it is nonparametric.
The data is organized like this in the CSV, and I used RStudio's Import Dataset feature, so it is a table in RStudio:
| score | treatment | day |
|-------|-----------|-----|
| 10    | 1         | 1   |
| 20    | 1         | 1   |
| 40    | 1         | 1   |
| 7     | 2         | 1   |
| 100   | 2         | 1   |
| 58    | 2         | 1   |
| 98    | 3         | 1   |
| 89    | 3         | 1   |
| 40    | 3         | 1   |
| 70    | 4         | 1   |
| 10    | 4         | 1   |
| 28    | 4         | 1   |
| 86    | 5         | 1   |
| 200   | 5         | 1   |
| 40    | 5         | 1   |
| 77    | 1         | 2   |
| 100   | 1         | 2   |
| 90    | 1         | 2   |
| 33    | 2         | 2   |
| 15    | 2         | 2   |
| 25    | 2         | 2   |
| 23    | 3         | 2   |
| 54    | 3         | 2   |
| 67    | 3         | 2   |
| 1     | 4         | 2   |
| 2     | 4         | 2   |
| 400   | 4         | 2   |
| 16    | 5         | 2   |
| 10    | 5         | 2   |
| 90    | 5         | 2   |
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment)  # treatment as a categorical independent variable
sample_data$day <- as.factor(sample_data$day)              # day as a categorical independent variable
summary(sample_data)
# attach(sample_data)  # not sure if this is needed; https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf says to use attach() so R can use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's.
This is the error I get:
Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day, :
  not an unreplicated complete block design
I am not sure what is wrong. Prior to writing the Friedman part of the code, I only specified day and treatment as categorical using as.factor().
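For what it's worth, the error message is informative: friedman.test() expects an unreplicated complete block design, meaning exactly one score per treatment-day combination, while this data has three replicate scores per cell. A hedged sketch of one workaround, assuming sample_data as above (whether aggregating replicates is statistically appropriate is a separate question):
# collapse the three replicate scores per treatment/day cell, e.g. with the median
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = median)
friedman.test(score ~ treatment | day, data = agg)  # y ~ groups | blocks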

Reshape a ragged, wide array with repeated variables to long in R

I have a table like
+------+---------+---------+---------+----------+---------+
| Code | Display | Synonym | Synonym | Synonym | Synonym |
+------+---------+---------+---------+----------+---------+
| 1 | A | Cat | Dog | Lion | |
| 2 | B | Horse | Penguin | | |
| 3 | C | Donkey | Giraffe | Mongoose | Rabbit |
+------+---------+---------+---------+----------+---------+
I want to output a table like
+------+---------+----------+
| Code | Display | Synonym |
+------+---------+----------+
| 1 | A | Cat |
| 1 | A | Dog |
| 1 | A | Lion |
| 2 | B | Horse |
| 2 | B | Penguin |
| 3 | C | Donkey |
| 3 | C | Giraffe |
| 3 | C | Mongoose |
| 3 | C | Rabbit |
+------+---------+----------+
In other words, I want to pair off Code and Display with each Synonym that is presented, and each Code can have 1 to several synonyms. I've seen examples of reshape used in other contexts, but haven't been able to figure out how to apply it here.
You can use standard reshaping on a ragged array. With melt() from reshape2, the na.rm argument removes NAs as you go; otherwise you can drop them afterward:
library(reshape2)
dat.m <- melt(dat, id.vars = c("Code", "Display"), value.name = "Synonym", na.rm = TRUE)
# Code Display variable Synonym
#1 1 A Synonym Cat
#2 2 B Synonym Horse
#3 3 C Synonym Donkey
#4 1 A Synonym.1 Dog
#5 2 B Synonym.1 Penguin
#6 3 C Synonym.1 Giraffe
#7 1 A Synonym.2 Lion
#9 3 C Synonym.2 Mongoose
#12 3 C Synonym.3 Rabbit
You can drop the variable column if you like:
dat.m$variable <- NULL
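A hedged modern alternative (not in the original answers), using tidyr::pivot_longer and assuming the blank cells were read in as NA:
library(tidyr)
# after reading, the duplicated headers become Synonym, Synonym.1, Synonym.2, ...
dat.long <- pivot_longer(dat, cols = starts_with("Synonym"),
                         values_to = "Synonym", values_drop_na = TRUE)
dat.long$name <- NULL  # drop the source-column label, as with variable above
# if the blanks came in as "" rather than NA: dat.long <- subset(dat.long, Synonym != "")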
Here are two base R approaches.
stack
cbind(mydf[1:2], stack(lapply(mydf[-c(1:2)], as.character)))
# Code Display values ind
# 1 1 A Cat Synonym
# 2 2 B Horse Synonym
# 3 3 C Donkey Synonym
# 4 1 A Dog Synonym.1
# 5 2 B Penguin Synonym.1
# 6 3 C Giraffe Synonym.1
# 7 1 A Lion Synonym.2
# 8 2 B Synonym.2
# 9 3 C Mongoose Synonym.2
# 10 1 A Synonym.3
# 11 2 B Synonym.3
# 12 3 C Rabbit Synonym.3
reshape
Make life easier by renaming your columns first to a pattern like "Synonym_1", "Synonym_2", and so on. (Actually, R likes "Synonym.1", "Synonym.2", and so on better.)
A <- grep("Synonym", names(mydf))
names(mydf)[A] <- paste0("Synonym_", seq_along(A))
Now, reshape...
reshape(mydf, direction = "long", varying = A, sep = "_")
# Code Display time Synonym id
# 1.1 1 A 1 Cat 1
# 2.1 2 B 1 Horse 2
# 3.1 3 C 1 Donkey 3
# 1.2 1 A 2 Dog 1
# 2.2 2 B 2 Penguin 2
# 3.2 3 C 2 Giraffe 3
# 1.3 1 A 3 Lion 1
# 2.3 2 B 3 2
# 3.3 3 C 3 Mongoose 3
# 1.4 1 A 4 1
# 2.4 2 B 4 2
# 3.4 3 C 4 Rabbit 3
I figured out a somewhat indirect way to do this shortly after asking the question:
allergies_output <- reshape(allergies_input,
                            varying = list(grep('Synonym', names(allergies_input), value = TRUE)),
                            direction = 'long',
                            idvar = c('Code', 'Display'),
                            v.names = 'Synonym',
                            names(allergies_input))
This gives some wonky results, but nothing that can't be fixed by dropping some column names.

order grouping variable in R

I have a database like this:
| ID   | familysize | age | gender |
|------|------------|-----|--------|
| 1001 | 4          | 26  | 1      |
| 1001 | 4          | 38  | 2      |
| 1001 | 4          | 30  | 2      |
| 1001 | 4          | 7   | 1      |
| 1002 | 3          | 25  | 2      |
| 1002 | 3          | 39  | 1      |
| 1002 | 3          | 10  | 2      |
| 1003 | 5          | 60  | 1      |
| 1003 | 5          | 50  | 2      |
| 1003 | 5          | 26  | 2      |
| 1003 | 5          | 23  | 1      |
| 1003 | 5          | 20  | 1      |
| 1004 | ...        |     |        |
I want to order this data frame by the age of the people within each ID, so I use this command:
library(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ]))
but when I use this command I lose some observations. What should I do to order this data frame?
b2 <- b[order(b$ID, -b$age), ]
should do the trick.
The arrange function in plyr does a great job here: order by ID, then by age in descending order.
arrange(b, ID, desc(age))
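For completeness, the same ordering as a dplyr sketch (dplyr's arrange() and desc() mirror plyr's here; b assumed as above):
library(dplyr)
b2 <- b %>% arrange(ID, desc(age))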
