`by` can't contain join column/inner_join - r

I'm trying to run this code but I keep getting the error below. I'm applying inner_join to 2 lists, student_db and grade_db, by student_id, and then joining course_db by course_id.
Could anyone help with this issue?
q2 <- inner_join(student_db, grade_db, by = "student_id") %>%
inner_join(course_db, by = "course_id", suffix = c(".student", ".course")) %>%
filter(name.student == "Ava Smith" | name.student == "Freddie Haris")
Error in common_by.list():
! by can't contain join column
course_id which is missing from LHS.
Run rlang::last_error() to see where the error occurred.

Does this work?
library(dplyr)
q2 <- student_db %>%
inner_join(grade_db, by = c("student_id"="student_id")) %>%
inner_join(course_db, by = c("course_id"="course_id")) %>%
filter(name.student %in% c("Ava Smith","Freddie Haris"))
If the names of your id variables are different in the two data frames, the by argument allows you to tell R which variables should be matched, e.g. by = c("var_from_df1" = "var_from_df2"). My guess is that your data frames have different column names, so this might be what you need to fix.
I'm not sure why you've included the suffix argument. It is only needed when both data sets contain a non-join column with the same name, so that the two copies can be told apart in the result. If you need it you can add it back. It's hard to tell exactly what the problem is without seeing your data frames or similar example data.
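A minimal sketch of the situation described above, using made-up data because the real student_db, grade_db and course_db were not shown. It assumes (hypothetically) that the course key is called course in grade_db but course_id in course_db, which would produce the "missing from LHS" error. Checking names() on the intermediate result shows which columns are actually available to join on.

```r
library(dplyr)

# Hypothetical toy data; column names are assumptions for illustration only.
student_db <- data.frame(student_id = c(1, 2),
                         name = c("Ava Smith", "Freddie Haris"))
grade_db   <- data.frame(student_id = c(1, 2),
                         course = c(10, 20),
                         grade = c(90, 75))
course_db  <- data.frame(course_id = c(10, 20),
                         name = c("Maths", "Physics"))

# First join: inspect the result before joining again.
joined <- inner_join(student_db, grade_db, by = "student_id")
has_course_id <- "course_id" %in% names(joined)  # FALSE here, so by = "course_id" would fail

# Second join: map the left-hand name to the right-hand name explicitly.
# suffix disambiguates the two non-join columns that are both called name.
q2 <- joined %>%
  inner_join(course_db, by = c("course" = "course_id"),
             suffix = c(".student", ".course")) %>%
  filter(name.student %in% c("Ava Smith", "Freddie Haris"))
```

If names(joined) already contains course_id, the problem lies elsewhere (for example, one of the inputs not being a data frame).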

Related

Suppose the name of a column in a dataframe is unknown to me, how can I sort the df according to the values in that column?

I'm trying to sort a dataframe in descending order according to the values in a specific column whose name is supposed to be unknown to me (i.e. I know it but I am not allowed to use it). The only clue is that it is the last column of this dataframe.
I've tried arrange() and order() but they don't work. I also noticed that if I use names(df)[ncol(df)], I get the name of that column as a character string. However, arrange() seems to expect the column name written bare or in backticks (`columnName`) rather than as a string ("columnName"). So I don't know how to correctly pass the name I got to the functions I want to use.
Base R
mtcars[order(mtcars[[tail(names(mtcars), 1)]]), ] #ascending; [[ ]] extracts the column as a vector
mtcars[order(mtcars[[tail(names(mtcars), 1)]], decreasing = TRUE), ] #descending
tidyverse
library(dplyr)
mtcars %>% arrange_at(vars(last(names(.)))) #ascending
mtcars %>% arrange_at(vars(last(names(.))), desc) #descending
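As a hedged alternative to arrange_at(), which recent dplyr versions mark as superseded, the column name held in a string can be passed through the .data pronoun. This is only a sketch on mtcars; last_name is an illustrative variable name.

```r
library(dplyr)

# Name of the last column, obtained without typing it out.
last_name <- names(mtcars)[ncol(mtcars)]

# .data[[...]] looks the column up by the string stored in last_name.
asc <- mtcars %>% arrange(.data[[last_name]])        # ascending
dsc <- mtcars %>% arrange(desc(.data[[last_name]]))  # descending
```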

How to interpret column length error from dplyr::mutate?

I'm trying to apply a function (more complex than the one used below, but I was trying to simplify) across two vectors. However, I receive the following error:
mutate_impl(.data, dots) :
Column `diff` must be length 2 (the group size) or one, not 777
I think I may be getting this error because the difference between rows results in one row less than the original dataframe, per some posts that I read. However, when I followed that advice and tried to add a vector to add 0/NA on the final line I received another error. Did I at least identify the source of the error correctly? Ideas? Thank you.
Original code:
diff_df <- DF %>%
group_by(DF$var1, DF$var2) %>%
mutate(diff = map2(DF$duration, lead(DF$duration), `-`)) %>%
as.data.frame()
We don't need map2 to get the difference between 'duration' and the lead of 'duration', because subtraction is vectorized. map2 would loop through each element of 'duration' with the corresponding element of lead(duration), which is unnecessary.
DF %>%
group_by(var1, var2) %>%
mutate(diff = duration - lead(duration))
NOTE: When we extract the column with DF$duration after the group_by, it bypasses the grouping and returns the full column from the original data frame (here 777 values rather than the 2 in the group), which is what the error is reporting. Also, inside the pipe there is no need for dataset$columnname; use columnname on its own. (In certain situations, when we want the full column for some comparison, dataset$columnname can be used.)
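A small sketch of the grouped version on made-up data (the original DF was not shown, so the values below are assumptions). Within each group, lead() returns NA for the last row, which is why the final difference in every group is NA.

```r
library(dplyr)

# Hypothetical data: two groups of two rows each.
DF <- data.frame(var1 = c("a", "a", "b", "b"),
                 var2 = c("x", "x", "y", "y"),
                 duration = c(5, 3, 10, 4))

diff_df <- DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration)) %>%  # computed separately per group
  ungroup() %>%
  as.data.frame()
```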

Filtering one dataframe by multiple columns in another

Sorry if this is a silly question!
My aim is basically the same as this post here: Take dates from one dataframe and filter data in another dataframe - R, and I want to continue using dplyr as I am later going to run this code across all rows of my dataset using rowwise().
However, in my case I wish to take the 'start' and 'end' years from 2 different columns in the second dataframe.
Here's some dummy data (taken from the original post and adapted to my problem):
main_data = data.frame(year=c(1966:2017))
second_data = data.frame(Participant = c(1:6),
Start_year = c(2012,1994,1974,1983,1969,2002),
End_year = c(2017,2017,2017,2017,2017,2017))
and wrote this code based on the original post:
filtered.total =
main_data %>%
rowwise() %>%
mutate(year = any(year >= second_data$Start_year & year <=
second_data$End_year)) %>%
filter(year) %>%
data.frame()
I'm also filtering my data by location (country and county), but it just gives me the following error message for my dataset:
Error in filter_impl(.data, quo) : Result must have length 2299, not 0
and for the dummy data above:
In year <= second_data$End_year :
longer object length is not a multiple of shorter object length
Thanks for any help - quite new to R and my PhD is testing my minimal knowledge right now!
You might need to use min(second_data$Start_year) and max(second_data$End_year), as at the moment you're providing many values to compare against, and I think it's complaining about that.
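A sketch of that suggestion on the dummy data from the question. It assumes the goal is to keep every year that falls inside the overall range; the second variant keeps a year if it falls inside at least one participant's own range, and on this data both give the same rows.

```r
library(dplyr)

main_data <- data.frame(year = 1966:2017)
second_data <- data.frame(Participant = 1:6,
                          Start_year = c(2012, 1994, 1974, 1983, 1969, 2002),
                          End_year = c(2017, 2017, 2017, 2017, 2017, 2017))

# Variant 1: compare against single values (overall earliest start, latest end).
filtered_range <- main_data %>%
  filter(year >= min(second_data$Start_year),
         year <= max(second_data$End_year))

# Variant 2: test each year against every participant's interval.
in_any_interval <- vapply(
  main_data$year,
  function(y) any(y >= second_data$Start_year & y <= second_data$End_year),
  logical(1)
)
filtered_any <- main_data %>% filter(in_any_interval)
```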

Select and left join: Error: could not find function, Syntax issue? [duplicate]

I have an imported data frame that has column names with various punctuations including parentheses, e.g. BILLNG.STATUS.(COMPLETED./.INCOMPLTE) .
I was trying to use group_by from dplyr to do some summarizing, something like
df <- df %>% group_by(ORDER.NO, BILLNG.STATUS.(COMPLETED./.INCOMPLTE))
which brings the error Error in mutate_impl(.data, dots) :
could not find function "BILLNG.STATUS."
Short of changing the column names, is there a way to handle such column names directly in group_by ?
I think you can make this work if you enclose the "illegal" column names in backticks. For example, let's say I start with this data frame (called dat):
  BILLING.STATUS.(COMPLETED./.INCOMPLETE) ORDER.VALUE.(USD)
1                                       A        0.01544196
2                                       A        0.95522706
3                                       B        1.13479303
4                                       B        1.22848285
Then I can summarise it like this:
dat %>% group_by(`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`) %>%
summarise(count=n(),
mean = mean(`ORDER.VALUE.(USD)`))
Giving:
  BILLING.STATUS.(COMPLETED./.INCOMPLETE) count      mean
1                                       A     2 0.4853345
2                                       B     2 1.1816379
Backticks also come in handy for referring to or creating variable names with whitespace. You can find a number of questions related to dplyr and backticks on SO, and there's also some discussion of backticks in the help for Quotes.
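A self-contained sketch of the backtick approach, with made-up order values (not the numbers printed above). check.names = FALSE is used so that data.frame() keeps the non-syntactic names exactly as written.

```r
library(dplyr)

# Hypothetical data with non-syntactic column names.
dat <- data.frame(
  `BILLING.STATUS.(COMPLETED./.INCOMPLETE)` = c("A", "A", "B", "B"),
  `ORDER.VALUE.(USD)` = c(1, 3, 10, 20),
  check.names = FALSE
)

# Backticks make R treat each whole name as a single symbol.
res <- dat %>%
  group_by(`BILLING.STATUS.(COMPLETED./.INCOMPLETE)`) %>%
  summarise(count = n(),
            mean = mean(`ORDER.VALUE.(USD)`))
```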
I'm just using this not-an-answer as a counter-example or illustration of the limitations of the backtick method. (It was the first stratagem I tried. Perhaps it is the fact that two language operations ("(" and "/") are being handled adjacently that makes this fail.)
names(iris)[5] <- "Specie(/)s"
library(dplyr)
by_species <- iris %>% group_by(`Specie(/)s`)
by_species %>% summarise_each(funs(mean(., na.rm = TRUE)))
#Error: cannot modify grouping variable
I tried a variety of other language-oriented efforts with quote, as.name and substitute that also failed. (I wish there were a mechanism to request that this sink to the bottom of the answers.)

How many columns can be selected in a data frame in R?

I want to select 3117 columns out of a data frame,
I tried to select them by column names:
dataframe %>%
select(
'AAACCTGAGCACGCCT-1',
'AAACCTGAGCGCTTAT-1',
'AAACCTGAGCGTTGCC-1',
......,
'TTGGAACCACGGACAA-1'
)
or
firstpickupnames <- c('AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-1',......,'TTGGAACCACGGACAA-1')
Both ways the R console just replied
'AAACCTGAGCACGCCT-1','AAACCTGAGCGCTTAT-1','AAACCTGAGCGTTGCC-
1',......,'TTGGAACCACGGACAA-1'
+ )
+
What does this mean? Is there a limitation of columns that I can select in R?
Without a reproducible example, it's difficult to know what exactly you're looking for, but dplyr::select() has several options for selecting columns, and dplyr::everything() might be what you're looking for:
library(dplyr)
# this reorders the column names, but keeps everything without having to name the columns specifically:
mtcars %>%
select(carb, gear, everything())
# from a list of column names:
keep_columns <- c('cyl','disp','hp')
mtcars %>%
select(one_of(keep_columns))
# specific names, and a range of names:
mtcars %>%
select(hp, qsec:gear)
#You could also use `contains()`, `starts_with()`, `ends_with()`, or `matches()`. Note that calling all of the following at once will give you no results:
mtcars %>%
select(contains('t')) %>%
select(starts_with('a')) %>%
select(ends_with('b')) %>%
select(matches('^m.+g$'))
The way that the console replies (with the + indicating that it is waiting for the rest of the expression) strongly suggests that you are hitting a limit on how long a command the console can process (you are assembling a very long command by pasting from the clipboard) rather than an inherent limit on the number of columns which can be selected. The only mention of this limitation that I could find in the documentation says "Command lines entered at the console are limited to about 4095 bytes."
In the comments you said that the column names that you wanted to select were in a csv file. You didn't say much about the structure of the csv file, but say that you have a csv file that contains a single list of column names. As an example, I created a file named "colnames.csv" which has a single line:
Sepal.Width, Petal.Length
Note that there is no need to manually place quote marks around the column names in the text file. Then in the R console I typed:
iris %>% select(one_of(as.character(read.csv("colnames.csv",header = FALSE, strip.white = TRUE,stringsAsFactors = FALSE))))
which worked as expected. Even though this example only used 2 columns, there is no reason that it should fail with 3000+, since the number of columns per se wasn't the problem with what you were doing.
If the structure of the csv file is different from the example then you would need to adjust the call to read.csv and perhaps the way that you convert it to a character vector, but you should be able to tweak this approach to your situation.
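A runnable sketch of the same idea that avoids pasting long commands into the console. It swaps in scan() for read.csv() and all_of() for one_of(); the temporary file is a hypothetical stand-in for the asker's CSV, whose real layout is unknown.

```r
library(dplyr)

# Hypothetical stand-in for the file holding the wanted column names.
path <- tempfile(fileext = ".csv")
writeLines("Sepal.Width, Petal.Length", path)

# Read the comma-separated names straight into a character vector.
wanted <- scan(path, what = character(), sep = ",",
               strip.white = TRUE, quiet = TRUE)

# all_of() raises an error if any requested column is absent from the data.
picked <- iris %>% select(all_of(wanted))
```

Because the names live in a vector rather than in the typed command, the same call works whether the file lists 2 names or several thousand.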

Resources