Left join without returning NA values

I am using R and the library dplyr.
I want to join a larger database with a smaller database (in terms of rows).
I use left join because I want to have a final database that has the same number of rows as the larger one.
This naturally returns NA values when the smaller database does not have a value corresponding to the joining key.
What I want is to carry the previous value of each joined column forward into the rows where the left join returns NA.
In other words:
for (j in 2:nrow(largerdataset)) {
  if (is.na(columnvalue[j])) columnvalue[j] <- columnvalue[j - 1]
}
where columnvalue is a column joined from the smaller database.
A loop like this should work, but it is cumbersome. Is there a smarter solution?
Thank you.

If you update the question with some sample data, I could provide full code for this. The general solution is to use fill() from the tidyr package on the joined data, possibly with a group_by() on the key if needed. Since you want each NA replaced by the previous value, fill downwards:
library(tidyverse)
data %>%
  # group_by(key) %>%
  tidyr::fill(var1, var2, var3, .direction = "down")
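As a minimal sketch of how the join and the fill fit together, here is a toy example. The data frames larger and smaller and the column names key and value are made up for illustration, since the question gives no sample data.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: 'larger' has every key, 'smaller' only some of them
larger  <- tibble(key = 1:6)
smaller <- tibble(key = c(1L, 4L), value = c("a", "b"))

joined <- larger %>%
  left_join(smaller, by = "key") %>%   # value is NA where 'smaller' has no match
  fill(value, .direction = "down")     # copy the previous non-NA value downwards

joined$value
```

Note that a leading NA (a first row with no match) stays NA, because there is no earlier value to copy.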

Related

How to mutate the negative values of some columns into positive in my data frame

I have the following issue:
In my data frame (89 columns), 4 of the columns contain negative values, as you can see in the following image:
![1]: https://i.stack.imgur.com/ZFF0U.png
How can I mutate those specific columns so that their values become positive (absolute value)?
Many thanks

Here's one option:
library(dplyr)
your_data %>%
  mutate(across(c("DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION", "DAYS_ID_PUBLISH"), abs))
Depending on which columns you want to mutate and which you want to leave, you might be able to use a simpler select helper, like mutate(across(starts_with("DAYS"), abs)), for example.
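To illustrate the select-helper variant, here is a small sketch on made-up data. The values and the AMT_INCOME column are assumptions for the example; only the DAYS_ column names come from the answer.

```r
library(dplyr)

# Hypothetical data; AMT_INCOME is included to show that other columns are left alone
your_data <- tibble(
  DAYS_BIRTH    = c(-9461, -16765),
  DAYS_EMPLOYED = c(-637, -1188),
  AMT_INCOME    = c(-100, 250)
)

result <- your_data %>%
  mutate(across(starts_with("DAYS"), abs))  # only columns whose name starts with "DAYS"
```

Here result has positive DAYS_BIRTH and DAYS_EMPLOYED, while AMT_INCOME keeps its original sign.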

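A more general solution, which converts every numeric column whose values are all negative:
library(dplyr)
# is.numeric() guards against character/factor columns; na.rm = TRUE keeps NAs from breaking the test
data %>% mutate_if(function(x) is.numeric(x) && all(x < 0, na.rm = TRUE), abs)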

How can I copy and append rows of data to fill in missing records in a defined sequence?

I have a sequence of numeric labels for records that can be shared by a variable number of records per label (labelsequence). I also have the records themselves, but unfortunately for some of the sequence values, all records have been lost (dataframe df). I need to identify when a numeric label from labelsequence does not appear in the label column of df, copy all records within df that are associated with the closest label value that is less than the missing value, and append these to a newly filled-in dataframe, say df2.
I am trying to accomplish this in R (a dplyr answer would be ideal). I have looked at answers to questions about filling in missing rows, such as "Fill in missing rows in R" and "fill missing rows in a dataframe", and I have a working solution below, but I was wondering whether anyone has a better way of doing this.
Take, for instance, this example data:
labelsequence<-data.frame(label=c(1,2,3,4,5,6))
and
df<-data.frame(label=c(1,1,1,1,3,3,4,4,4),
place=c('vermont','kentucky',
'wisconsin','wyoming','nevada',
'california','utah','georgia','kentucky'),
animal=c('wolf','wolf','cougar','cougar','lamb',
'cougar','donkey','lamb','wolf'))
with desired result...
desired_df2<-data.frame(label=c(1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6),
place=c('vermont','kentucky',
'wisconsin','wyoming','vermont','kentucky',
'wisconsin','wyoming','nevada',
'california','utah','georgia','kentucky','utah',
'georgia','kentucky','utah','georgia','kentucky'),
animal=c('wolf','wolf','cougar','cougar','wolf',
'wolf','cougar','cougar','lamb','cougar',
'donkey','lamb','wolf','donkey','lamb','wolf',
'donkey','lamb','wolf'))
Is there a better way (in terms of code efficiency, flexibility, or resource usage) than the following?
library(dplyr)
df2 <- df %>%
  full_join(expand.grid(label = unique(df$label), newlabel = labelsequence$label)) %>%
  mutate(missing = ifelse(newlabel %in% label, 0, 1)) %>%
  filter(label < newlabel) %>%
  group_by(newlabel) %>%
  filter(label == max(label) & missing == 1) %>%
  ungroup() %>%
  mutate(label = newlabel, missing = NULL, newlabel = NULL) %>%
  bind_rows(df) %>%
  arrange(label)
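One possible alternative, offered as a sketch rather than an answer from the thread: for each label in labelsequence, look up the closest label in df that is less than or equal to it, and copy that label's rows. The helper names present and src are made up here, and the sketch assumes every label in the sequence has at least one label with records at or below it (otherwise max() returns -Inf with a warning and no rows are copied).

```r
library(dplyr)

labelsequence <- data.frame(label = c(1, 2, 3, 4, 5, 6))
df <- data.frame(
  label  = c(1, 1, 1, 1, 3, 3, 4, 4, 4),
  place  = c("vermont", "kentucky", "wisconsin", "wyoming", "nevada",
             "california", "utah", "georgia", "kentucky"),
  animal = c("wolf", "wolf", "cougar", "cougar", "lamb",
             "cougar", "donkey", "lamb", "wolf")
)

# Labels that actually have records in df
present <- sort(unique(df$label))

df2 <- bind_rows(lapply(labelsequence$label, function(l) {
  # Closest label with records that is <= l (l itself when it is present)
  src <- max(present[present <= l])
  df %>%
    filter(label == src) %>%
    mutate(label = l)
}))
```

Because the rows are built in sequence order, no final arrange() is needed. For long sequences a join-based version may scale better than lapply().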

Vector addition with vector indexing

This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for-loop but I'm looking for something faster & elegant.
Thanks

If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
  merge(B, by = "ColumnToMatch") %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
  inner_join(B, by = "ColumnToMatch") %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are:
rows are kept in existing order
much faster
tells you what keys you're merging by (if you don't supply)
also work with database tables.
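If you would rather stay close to the original indexing expression, base R's match() may be what you were reaching for. This is a sketch on the same dummy data as above; idx is a name made up for the example. match() returns, for each row of A, the position of its key in B, so the lookup has the same length as A and the recycling warning goes away.

```r
# Same dummy data as in the answer above
A <- data.frame(ColumnToAdd   = c(1, 1, 1, 1, 1, 1, 1, 1),
                ColumnToMatch = c('a', 'b', 'b', 'b', 'c', 'a', 'c', 'd'))
B <- data.frame(ColumnToAdd   = c(1, 2, 3, 4),
                ColumnToMatch = c('a', 'b', 'c', 'd'))

# Position in B of each key in A (NA where a key has no match in B)
idx <- match(A$ColumnToMatch, B$ColumnToMatch)

A$sum <- A$ColumnToAdd + B$ColumnToAdd[idx]
A$sum  # 2 3 3 3 4 2 4 5
```

Unlike inner_join(), this keeps every row of A, with NA in sum where B has no matching key.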

Using multiple columns in dplyr window functions?

Coming from SQL, I would expect to be able to do something like the following in dplyr. Is this possible?
# R
tbl %>% mutate(n = dense_rank(Name, Email))
-- SQL
SELECT Name, Email, DENSE_RANK() OVER (ORDER BY Name, Email) AS n FROM tbl
Also, is there an equivalent for PARTITION BY?

I struggled with this problem too, and here is my solution:
If you can't find a function that supports ordering by multiple variables, I suggest concatenating them in priority order, from left to right, using paste().
Here is a code sample:
tbl %>%
  mutate(n = dense_rank(paste(Name, Email))) %>%
  arrange(Name, Email) %>%
  view()
Note that this ranks by the sort order of the concatenated strings, so numeric columns are compared as text.
Moreover, group_by() is the closest equivalent of PARTITION BY in SQL.
The drawback of this solution is that all the variables must be ordered in the same direction. If you need to order by multiple columns in different directions, say one ascending and one descending, I suggest you try this:
Calculate rank with ties based on more than one variable
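As a sketch of another way to get the same ascending dense rank without pasting strings, you could group by both columns and take the group index with cur_group_id() (available in dplyr 1.0.0 and later). The tbl below is made-up data. The second pipeline illustrates group_by() playing the role of PARTITION BY.

```r
library(dplyr)

# Hypothetical data
tbl <- tibble(
  Name  = c("bob", "alice", "bob", "alice"),
  Email = c("b@x.org", "a@x.org", "a@x.org", "a@x.org")
)

# DENSE_RANK() OVER (ORDER BY Name, Email): groups are numbered in sorted key order
ranked <- tbl %>%
  group_by(Name, Email) %>%
  mutate(n = cur_group_id()) %>%
  ungroup()

# DENSE_RANK() OVER (PARTITION BY Name ORDER BY Email)
partitioned <- tbl %>%
  group_by(Name) %>%
  mutate(n = dense_rank(Email)) %>%
  ungroup()
```

With this data, ranked$n is 3, 1, 2, 1 and partitioned$n is 2, 1, 1, 1. Like the paste() approach, the group-index trick only covers the case where every column is ordered ascending.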

Binding a tibble with more than 1 rows and another one with only one row in R

I have two tibbles: the first one with more than one row and second one, with exactly one row.
I want to col bind them, and, for this purpose, I want the second one to have the same number of rows as the first.
I can do this operation with this trick:
for (i in colnames(df2)) {
  # df2[[i]] is a length-one vector, which is recycled to the number of rows of df1
  df1[[i]] <- df2[[i]]
}
However, this sounds like a workaround to me. Is there a "tidier" way of doing this (I mean, with tidyverse)?

You can just use cbind(df1, df2); it recycles the one-row data frame to match the number of rows of the longer one.
If you want to use dplyr, what you want is a cross join. Older versions of dplyr had no cross join (cross_join() was added in dplyr 1.1.0), so the usual workaround is to
create a dummy column on both tables and inner_join on it:
df1 %>%
  mutate(dummy_id = 1) %>%
  inner_join(df2 %>% mutate(dummy_id = 1), by = "dummy_id") %>%
  mutate(dummy_id = NULL)
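On dplyr 1.1.0 or later, the dummy column should no longer be necessary. Here is a minimal sketch using cross_join(); the contents of df1 and df2 are made up, and it assumes the two tibbles share no column names.

```r
library(dplyr)

# Hypothetical data: df1 has several rows, df2 has exactly one
df1 <- tibble(id = 1:3, x = c("a", "b", "c"))
df2 <- tibble(y = 10, z = "const")

# Every row of df1 is paired with the single row of df2
combined <- cross_join(df1, df2)
```

The result keeps the three rows of df1 and repeats the values of df2 in each of them.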
