Join smaller data frame to larger data frame by index in tidyverse? - r

Suppose I have the following data:
df <- data.frame(a=c(1,2,3,4))
index <- data.frame(a=c(1,3), data=c('x', 'y'))
I want to join df and index such that I end up with a result that has the rows of df, but with index$data joined for appropriate index$a. For some reason, English words fail me, but 'x' should be applied to 1 and 2 (because index$a has 1, and 3 is the "next" index value), and 'y' should be applied to 3 and 4.
Here is the data I'd like to end up with:
df2 <- data.frame(a=c(1,2,3,4), data=c('x', 'x', 'y', 'y'))
Ideally this solution is compatible with tidyverse without loading any other libraries.
Suggestions?

Is this what you want? First, left-join index onto df so that all rows of df are kept. Then fill each NA in data with the last non-NA value above it.
df %>% left_join(index, by = "a") %>% fill(data)
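To see why this works, here is a self-contained sketch on the question's data that keeps the intermediate result. It assumes dplyr and tidyr are installed (both are part of tidyverse) and that df is sorted by a, because fill() carries values downward in row order. The names joined and result are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(a = c(1, 2, 3, 4))
index <- data.frame(a = c(1, 3), data = c("x", "y"))

# Step 1: keep every row of df; rows with no match in index get NA
joined <- left_join(df, index, by = "a")
# joined$data is "x", NA, "y", NA

# Step 2: carry the last non-NA value down into the NA rows
result <- fill(joined, data)
# result$data is "x", "x", "y", "y"
```

If df is not already ordered by a, sorting it first (for example with arrange(a)) before calling fill() should give the same behaviour.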

We can use data.table
library(data.table)
library(zoo)
setDT(df)[index, data := i.data, on = .(a)][, data := na.locf0(data)]

Related

How to subset the data frame based on selected variable with limited column?

I would like to subset selected rows and a limited set of columns, as my data frame has many columns.
My sample data:
df <- data.frame('ID'=c('A','B','C'),'YEAR'=c('2020','2020','2020'),'MONTH'=c('1','1','1'),'DAY'=c('16','16','16'),'HOUR'=c('15','15','15'),'VALUE1'=c(1,2,3))
I would like to keep only the rows where ID == 'C' and only the columns ID and VALUE1.
Expected output:
ID VALUE1
1 C 3
Appreciate any help...!
What I have tried so far:
df1 <- subset(df,df$ID=='C')
df2 <- subset(df1,select=c('ID','VALUE1'))
Is there a more efficient way to do this? Creating an intermediate data frame for every step is not good when there are multiple conditions.
You can use dplyr chaining too:
df %>% select(ID,VALUE1) %>% filter(ID=="C")
We can do both in a single call, since subset() takes both a subset and a select argument:
subset(df, subset = ID=='C', select = c('ID', 'VALUE1'))

How to filter a dataframe using a list of multiple ranges of a variable

I'm attempting to filter a large signal intensity dataframe using a list of ranges of one variable (chromosome position) in the dataframe. The list has 256 ranges in total, with start and end positions. I can successfully filter the dataframe using a single range, but I can't seem to get this to loop over the entire dataframe.
DT is the original signal intensity dataframe (SNP, Chr, Position, Intensity Ratio) and PR is a dataframe with the chromosome and the start and end Position of each range:
Chr Start End
1 130104 207101
1 1423247 4459324
1 6543121 7924836
This line of code works to extract the data from a single range:
test <- DT %>% filter(Chr %in% ("1")) %>% filter(Position %in% c(PR$Start[1]:PR$End[1]))
This does NOT work:
for (i in 1:nrow(PR)){
help <- DT %>% filter(Chr %in% ("1")) %>% filter(Position %in% c(PR$Start[i]:PR$End[i]))
}
The above code produces a dataframe with a random selection of data that doesn't correspond to the range of positions.
This doesn't work either:
range = data.table(start=PR$Start,end=PR$End)
x <- DT[Position %inrange% range]
Thank you in advance!
Your data.table solution worked for me. Does this work for you, with my made up data?
library(data.table)
dt <- data.table(id = 1:100, var=runif(100))
ranges <- data.table(start=c(20,50,70), end=c(30,55,72))
dt[id %inrange% ranges]
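As for the for loop in the question: each pass overwrites help, so only the result for the last range survives. A minimal sketch of one way around that, assuming dplyr is available, is to collect one filtered piece per range and bind them together. The DT and PR below are hypothetical stand-ins with the columns described in the question (the intensity column is left out), and pieces and result are illustrative names. Unlike a plain %inrange% on Position, this version also matches on Chr.

```r
library(dplyr)

# Hypothetical stand-in data with the columns described in the question
DT <- data.frame(
  SNP = paste0("snp", 1:6),
  Chr = c("1", "1", "1", "1", "2", "1"),
  Position = c(100, 150000, 1500000, 5000000, 150000, 7000000)
)
PR <- data.frame(
  Chr = c("1", "1", "1"),
  Start = c(130104, 1423247, 6543121),
  End = c(207101, 4459324, 7924836)
)

# One filtered data frame per range, kept in a list instead of overwritten
pieces <- lapply(seq_len(nrow(PR)), function(i) {
  filter(DT,
         Chr == PR$Chr[i],
         Position >= PR$Start[i],
         Position <= PR$End[i])
})

# Stack the per-range pieces into a single data frame
result <- bind_rows(pieces)
# result keeps snp2, snp3 and snp6; snp5 has a matching Position but is on Chr 2
```

Comparing with >= and <= also avoids building the full integer sequence Start:End for every range, which can be large for chromosome positions.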

Calculate fraction of complete/not missing values of variables in a data frame for output in a long format [duplicate]

This question already has answers here:
How to find the percentage of NAs in a data.frame?
(6 answers)
Closed 5 years ago.
I've got a data frame (df1) with four variables, a, b, c, and d.
I'd like to get the completeness (!is.na(x)) for each variable in the data frame. I'd like the output to be in long format (df2).
The problem's that I can't get the nrow() part of my code to work (therefore I don't know if it works overall). Or is there a dplyr+tidyr way of doing it?
Any help would be much appreciated.
Starting point (df1):
df1 <- data.frame(a=c(1,2,3,NA),b=c(1,2,NA,NA),c=c(1,2,3,4),d=c(NA,NA,NA,NA),stringsAsFactors = TRUE)
Current code:
sapply(df1, function(x) sum(!is.na(df1$x)) / nrow(df1$x))
Desired outcome (df2):
df2 <- data.frame(nameofvar=c("a","b","c","d"),completeness=c(75,50,100,0))
As you wanted the answer to be in the long format, here’s how:
df2 = df1 %>%
gather(NameOfVar, Value) %>%
group_by(NameOfVar) %>%
summarize(Completeness = mean(! is.na(Value)) * 100)
As for why your (base R) code isn’t working:
When sapplying over a data.frame, the argument to your function (x) is the column data itself. So instead of df1$x [1] you just need to use x, and instead of nrow you need to use length, since each column x is a vector.
[1] In addition, $-subsetting with a variable never works, so even if x were a column name/index, df1$x wouldn’t work anyway. You’d have to use df1[[x]] instead.
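Putting those two fixes together, a base R sketch of the corrected sapply call could look like this. The data frame construction at the end is just one way to get the long format asked for in the question, and the name completeness is illustrative.

```r
df1 <- data.frame(a = c(1, 2, 3, NA),
                  b = c(1, 2, NA, NA),
                  c = c(1, 2, 3, 4),
                  d = c(NA, NA, NA, NA))

# x is the column itself, so use x directly and length(x) instead of nrow()
completeness <- sapply(df1, function(x) sum(!is.na(x)) / length(x) * 100)

# Reshape the named vector into the long format from the question
df2 <- data.frame(nameofvar = names(completeness),
                  completeness = unname(completeness))
# df2$completeness is 75, 50, 100, 0
```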
Try the purrr package, which is part of tidyverse:
df1 %>%
map_df(~ sum(!is.na(.)) / length(.) * 100)
With data.table:
dt1 <- as.data.table(df1)
dt1[, sapply(.SD, function(x) {sum(!is.na(x)) / .N}), .SDcols = names(dt1)]
Or very simply with base R (dividing by the number of rows, not columns):
colSums(!is.na(df1)) / nrow(df1) * 100
Using only dplyr package:
library(dplyr)
df1 <- data.frame(a=c(1,2,3,NA),
b=c(1,2,NA,NA),
c=c(1,2,3,4),
d=c(NA,NA,NA,NA),
stringsAsFactors = TRUE)
# get fraction of non-NA values (multiply by 100 for a percentage)
df1 %>% summarise_all(function(x) mean(! is.na(x)))
# a b c d
# 1 0.75 0.5 1 0

How to add NA rows to an incomplete dataframe based on a complete index?

For the given incomplete dataframe df and complete index t:
t = seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-01 01:00:00"), by = '5 min')
index<-t[c(1,2,4:7,9,12)]
a<-(1:8)
b<-(1:8)
df<-data.frame(index,a,b)
With my approach, the missing rows can be added by the following code:
index<-t #complete index
a<-vector('numeric',12)
a<-NA
b<-vector('numeric',12)
b<-NA
empty_df<-data.frame(index,a,b) # build a complete NA dataframe
for (i in 1:12) {
if(!(df$index[i]==empty_df$index[i]))
df<-rbind(rbind(df[1:i-1,],empty_df[i,]),df[i:length(df$index),])} # comparison and revision
However, my solution has two problems:
It cannot deal with the situation where the first row is missing.
When the dataframe is large, the computation takes hours.
So I'm wondering whether there is an easier way to do this?
We can do this with merge (base R) or left_join (from dplyr)
library(dplyr)
data.frame(index = t) %>%
left_join(., df)
Or join from data.table
library(data.table)
setDT(df)[data.table(index=t), on = "index"]
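The merge variant mentioned above is not spelled out, so here is a minimal base R sketch of it on the question's data. The name full is illustrative; all.x = TRUE is what keeps the timestamps that have no row in df.

```r
t <- seq(as.POSIXct("2016-01-01 00:05:00"),
         as.POSIXct("2016-01-01 01:00:00"),
         by = "5 min")
df <- data.frame(index = t[c(1, 2, 4:7, 9, 12)], a = 1:8, b = 1:8)

# all.x = TRUE keeps every timestamp of the complete index;
# timestamps that are missing from df get NA in a and b
full <- merge(data.frame(index = t), df, by = "index", all.x = TRUE)
# full has 12 rows, 4 of them with NA in a and b
```

Because the complete index is on the left-hand side, this also covers the case where the first row of df is missing.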

Subsetting data.frame upon two constraints

Say I want to subset using two constraints:
1. the values in the first column are identical, and
2. at the same time, the values in the second column are the same.
For example, I have a data frame
a <- rep(1:5, 2)
b <- c(1,2,2,2,1,1,1,2,2,2)
data <- data.frame(a,b)
Say a is the pair identification number and b represents the gender.
Now we want to subset to create a dataset where we have a matched pair ID and gender.
Would one create a loop using the while command, or use the duplicated function?
The expected result is the subset of data that is highlighted here in green.
You can try
data[with(data, !!ave(b, a, FUN=function(x)
length(unique(x))==1)),]
Or
library(dplyr)
data %>%
group_by(a) %>%
filter(n_distinct(b)==1)
Or
library(data.table)
setDT(data)[,.(b=b[length(unique(b))==1]) , a]
Or another data.table solution provided by @David Arenburg
setDT(data)[, if (length(unique(b)) == 1) .SD, a]
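As a quick check, here is the dplyr variant run on the sample data as a self-contained sketch. It assumes a holds each pair ID twice (written out as rep(1:5, 2), which is the recycling data.frame applies to 1:5 anyway), and the name matched is illustrative.

```r
library(dplyr)

a <- rep(1:5, 2)                      # each pair ID appears twice
b <- c(1, 2, 2, 2, 1, 1, 1, 2, 2, 2)  # gender
data <- data.frame(a, b)

# Keep only the pairs whose two members share the same value of b
matched <- data %>%
  group_by(a) %>%
  filter(n_distinct(b) == 1) %>%
  ungroup()
# pairs 1, 3 and 4 remain, 6 rows in total
```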
