How to make a specific binding of two tables - r

I have a single column dataframe with all possible IDs:
ID
a1
a2
b1
b11
c1
I get a dataframe from my database with the same column "ID", but it may not contain all of the IDs. Here is an example of that table:
ID value
a1 18
a2 10
b1 10
I want to bind those two tables so that IDs which are missing from my table get the value zero. How can I bind these two tables to get this:
ID value
a1 18
a2 10
b1 10
b11 0
c1 0

Join the two tables and replace the NA values with 0.
Using dplyr:
library(dplyr)
df1 %>%
full_join(df2, by = 'ID') %>%
mutate(value = replace(value, is.na(value), 0))
# ID value
#1 a1 18
#2 a2 10
#3 b1 10
#4 b11 0
#5 c1 0
In base R, you can do this as:
transform(merge(df1, df2, by = 'ID', all = TRUE),
value = replace(value, is.na(value), 0))

We can also use case_when:
library(dplyr)
df1 %>%
full_join(df2, by = 'ID') %>%
mutate(value = case_when(is.na(value) ~ 0, TRUE ~ value))
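For a self-contained check, here is a minimal base R sketch that builds the two tables from the question and fills the gaps. The names df1 (all possible IDs) and df2 (the database result) are assumed to match the answers above. Since df1 already holds every ID, a left join (all.x = TRUE) is used here instead of a full join.

```r
# Assumed inputs: df1 lists every possible ID, df2 is the partial table
df1 <- data.frame(ID = c("a1", "a2", "b1", "b11", "c1"))
df2 <- data.frame(ID = c("a1", "a2", "b1"), value = c(18, 10, 10))

# Keep every row of df1; IDs without a match in df2 get value = NA
res <- merge(df1, df2, by = "ID", all.x = TRUE)

# Replace the NAs introduced by the join with 0
res$value[is.na(res$value)] <- 0
res
```

Note that merge sorts the result by the key column, so the row order may differ from the order in df1.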

Related

R convert matrix into table

there is a matrix:
mat<-matrix(0,ncol = 10, nrow = 5)
colnames(mat)<-c("A1","A2","A3","A4","A5","A6","A7","A8","A9","A10")
rownames(mat)<-c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5")
mat[1,] <-c(0,0,1,1,1,1,0,0,0,0)
mat[2,]<-c(0,0,0,1,1,1,0,0,0,0)
mat[3,]<-c(0,0,0,1,1,1,1,1,1,0)
mat[4,]<-c(0,0,0,0,0,1,1,1,1,0)
mat[5,]<-c(0,0,0,0,0,0,1,1,1,1)
I want to convert this matrix into a table with three columns: "ID", "start" and "stop", where "start" is the name of the first column holding the value 1 in the row for that ID, and "stop" is the name of the last column holding 1 in that row. I would like to receive this output:
Could you please help me?
Thanks in advance.
Here is one way using dplyr: convert the matrix to a dataframe, turn the rownames into a column, reshape the data to long format, filter the rows with value == 1, and select the first and last column name for each id.
library(dplyr)
mat %>%
as.data.frame() %>%
tibble::rownames_to_column('id') %>%
tidyr::pivot_longer(cols = -id) %>%
filter(value == 1) %>%
group_by(id) %>%
summarise(start = first(name), stop = last(name))
# A tibble: 5 x 3
# id start stop
# <chr> <chr> <chr>
#1 ID_1 A3 A6
#2 ID_2 A4 A6
#3 ID_3 A4 A9
#4 ID_4 A6 A9
#5 ID_5 A7 A10
In base R, keeping mat as a matrix:
t(apply(mat, 1, function(x) {
inds <- which(x == 1)
c(start = colnames(mat)[min(inds)], stop = colnames(mat)[max(inds)])
}))
You can do this using the ties.method argument in max.col. Use the result to subset the colnames.
data.frame(id = rownames(mat),
start = colnames(mat)[max.col(mat, "first")],
stop = colnames(mat)[max.col(mat, "last")])
# id start stop
# 1 ID_1 A3 A6
# 2 ID_2 A4 A6
# 3 ID_3 A4 A9
# 4 ID_4 A6 A9
# 5 ID_5 A7 A10
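One caveat with max.col: it assumes every row contains at least one 1. For a row of all zeros, every entry ties for the maximum, so max.col still returns a column index and the name it picks is meaningless. Below is a minimal sketch of one way to guard against that. It uses a small hypothetical matrix (toy) rather than the data above, and returning NA for such rows is an assumption about the desired behaviour.

```r
# Hypothetical 3 x 4 example; ID_3 is left as all zeros on purpose
toy <- matrix(0, nrow = 3, ncol = 4,
              dimnames = list(c("ID_1", "ID_2", "ID_3"),
                              c("A1", "A2", "A3", "A4")))
toy["ID_1", c("A2", "A3")] <- 1
toy["ID_2", "A4"] <- 1

# TRUE for rows that contain at least one 1
has_one <- unname(rowSums(toy == 1) > 0)

# First and last column holding the row maximum
first_col <- colnames(toy)[max.col(toy, ties.method = "first")]
last_col  <- colnames(toy)[max.col(toy, ties.method = "last")]

# Only trust the column names for rows that actually contain a 1
res <- data.frame(id    = rownames(toy),
                  start = ifelse(has_one, first_col, NA_character_),
                  stop  = ifelse(has_one, last_col, NA_character_))
res
```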

find duplicate, compare a condition, erase one row r

Using the following reproducible example:
ID1<-c("a1","a4","a6","a6","a5", "a1" )
ID2<-c("b8","b99","b5","b5","b2","b8" )
Value1<-c(2,5,6,6,2,7)
Value2<- c(23,51,63,64,23,23)
Year<- c(2004,2004,2004,2004,2005,2004)
df<-data.frame(ID1,ID2,Value1,Value2,Year)
I want to select rows where ID1, ID2 and Year have the same values in their respective columns. For these rows, I want to compare Value1 and Value2 across the duplicate rows and, if the values are not the same, erase the row with the smaller value.
Expected result:
ID1 ID2 Value1 Value2 Year new
2 a4 b99 5 51 2004 a4_b99_2004
4 a6 b5 6 64 2004 a6_b5_2004
5 a5 b2 2 23 2005 a5_b2_2005
6 a1 b8 7 23 2004 a1_b8_2004
I tried the following:
Create a unique identifier for the conditions I am interested in:
df$new<-paste(df$ID1,df$ID2, df$Year, sep="_")
I can use the unique identifier to find the rows of the database that contain the duplicates:
IND<-which(duplicated(df$new) | duplicated(df$new, fromLast = TRUE))
In a for loop, if a unique identifier has duplicates, compare the values and erase the rows. But the loop is too complicated and I cannot get it to work:
for (i in df$new) {
if(sum(df$new == i)>1)
{
ind<-which(df$new==i)
m= min(df$Value1[ind])
df<-df[-which.min(df$Value1[ind]),]
m= min(df$Value2[ind])
df<-df[-which.min(df$Value2[ind]),]
}
}
Some different possibilities. Using dplyr:
library(dplyr)
df %>%
group_by(ID1, ID2, Year) %>%
filter(Value1 == max(Value1) & Value2 == max(Value2))
Or:
df %>%
rowwise() %>%
mutate(max_val = sum(Value1, Value2)) %>%
ungroup() %>%
group_by(ID1, ID2, Year) %>%
filter(max_val == max(max_val)) %>%
select(-max_val)
Using data.table:
library(data.table)
setDT(df)[df[, .I[Value1 == max(Value1) & Value2 == max(Value2)], by = list(ID1, ID2, Year)]$V1]
Or:
setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)
][filter != FALSE
][, -c("max_val", "filter")]
Or:
subset(setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)], filter != FALSE)[, -c("max_val", "filter")]
Consider aggregate to retrieve the max values by your grouping, ID1, ID2, and Year:
df_new <- aggregate(.~ID1 + ID2 + Year, df, max)
df_new
# ID1 ID2 Year Value1 Value2
# 1 a6 b5 2004 6 64
# 2 a1 b8 2004 7 23
# 3 a4 b99 2004 5 51
# 4 a5 b2 2005 2 23
Solution without loading libraries:
ID1 ID2 Value1 Value2 Year
a6.b5.2004 a6 b5 6 64 2004
a1.b8.2004 a1 b8 7 23 2004
a4.b99.2004 a4 b99 5 51 2004
a5.b2.2005 a5 b2 2 23 2005
Code
do.call(rbind, lapply(split(df, list(df$ID1, df$ID2, df$Year)), # make identifiers
function(x) {return(x[which.max(x$Value1 + x$Value2),])})) # take max of sum
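Another base R option is to sort so that the preferred row comes first within each (ID1, ID2, Year) group and then drop the later duplicates. This is only a sketch, and it assumes that Value1 takes priority and Value2 merely breaks ties, which is one reading of the question rather than something it states.

```r
# Data from the question
df <- data.frame(ID1    = c("a1", "a4", "a6", "a6", "a5", "a1"),
                 ID2    = c("b8", "b99", "b5", "b5", "b2", "b8"),
                 Value1 = c(2, 5, 6, 6, 2, 7),
                 Value2 = c(23, 51, 63, 64, 23, 23),
                 Year   = c(2004, 2004, 2004, 2004, 2005, 2004))

# Sort by the grouping keys, then by Value1 and Value2 in descending order,
# so the row to keep is the first one in each group
ord <- order(df$ID1, df$ID2, df$Year, -df$Value1, -df$Value2)
sorted <- df[ord, ]

# duplicated() flags every repeat of a key after its first occurrence
res <- sorted[!duplicated(sorted[c("ID1", "ID2", "Year")]), ]
res
```

Unlike the aggregate approach, this keeps whole rows intact, so Value1 and Value2 in the result always come from the same original row.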

Sorting the values of column in ascending order in R

The script below builds a data frame with four columns. I want to process one pair of values (a1, a2) at a time. Within each (a1, a2) pair, the values in column "a3" are in ascending order as you go down the data. When a pair occurs more than once in the table, I want the "a4" values for that pair to be arranged in ascending order as well, just like the corresponding "a3" values. Take the first pair ("A","D"): it appears three times and its a3 values are in ascending order. I want its a4 values arranged in ascending order in the same way. Please check the expected output. Thanks, and please suggest a solution.
a1 = c("A","B","C","A","B","C","A","C")
a2 = c("D","E","F","D","E","E","D","F")
a3 = c(5,15,12,10,40,35,20,50)
a4 = c(100,160,66,65,130,150,80,49)
a123= data.frame(a1,a2,a3,a4)
library(dplyr)
a123_r <- a123 %>%
group_by(a1, a2) %>%
mutate(a3 = sort(a3)) %>%
ungroup()
a123_r
Expected Output
a1 = c("A","B","C","A","B","C","A","C")
a2 = c("D","E","F","D","E","E","D","F")
a3 = c(5,15,12,10,40,35,20,50)
a4 = c(65,130,66,80,160,150,100,49)
a123_r <- data.frame(a1,a2,a3,a4)
For the sake of completeness, here is an answer using data.table:
library(data.table)
cols <- c("a3", "a4")
setDT(a123)[, (cols) := lapply(.SD, sort), by = .(a1, a2), .SDcols = cols][]
a1 a2 a3 a4
1: A D 5 65
2: B E 15 130
3: C F 12 49
4: A D 10 80
5: B E 40 160
6: C E 35 150
7: A D 20 100
8: C F 50 66
Data
a1 = c("A","B","C","A","B","C","A","C")
a2 = c("D","E","F","D","E","E","D","F")
a3 = c(5,15,12,10,40,35,20,50)
a4 = c(100,160,66,65,130,150,80,49)
a123= data.frame(a1,a2,a3,a4)
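The same per-group sort can be sketched in base R with ave, which applies a function within each combination of the grouping vectors and writes the results back in the original row positions. This is a minimal sketch that sorts only a4, since a3 is already ascending within each pair. Like the data.table answer, it sorts every pair, so for (C, F) it yields 49 and then 66.

```r
# Data from the question
a123 <- data.frame(a1 = c("A", "B", "C", "A", "B", "C", "A", "C"),
                   a2 = c("D", "E", "F", "D", "E", "E", "D", "F"),
                   a3 = c(5, 15, 12, 10, 40, 35, 20, 50),
                   a4 = c(100, 160, 66, 65, 130, 150, 80, 49))

# Sort a4 in ascending order within each (a1, a2) pair;
# FUN must be passed by name because of the ... argument in ave()
a123$a4 <- ave(a123$a4, a123$a1, a123$a2, FUN = sort)
a123
```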

Conditional count in data frame

I have a dataframe (df) with three columns like so:
Structure:
id id1 age
A1 a1 32
A1 a2 45
A1 a3 45
A1 a4 12
A2 b1 15
A2 b5 34
A2 b64 17
Expected Output:
id count count1
A1 4 1
A2 3 2
Logic:
Column "count" is the number of times "id" is repeated
Column "count1" is the number of rows where age is less than 21
Current Code:
library(dplyr)
df_summarized <- df %>%
group_by(id) %>%
summarise(count = n(),count1 = count(age<21))
Problem:
Error: no applicable method for 'group_by_' applied to an object of class "logical"
We need to use sum on the logical vector instead:
df %>%
group_by(id) %>%
summarise(count = n(),count1 = sum(age < 21))
# A tibble: 2 × 3
# id count count1
# <chr> <int> <int>
#1 A1 4 1
#2 A2 3 2
This is needed because count applies to a data.frame or tbl_df, not to a single column inside summarise.
Or using data.table
library(data.table)
setDT(df)[, .(count = .N, count1 = sum(age < 21)), id]
Or with base R
cbind(count = rowSums(table(df[-2])), count1 = as.vector(rowsum(+(df$age < 21), df$id)))
# count count1
#A1 4 1
#A2 3 2
Or using aggregate based on the sum
do.call(data.frame, aggregate(age~id, df, FUN =
function(x) c(count = length(x), count1 = sum(x<21))))
NOTE: All the above methods return a dataset with proper columns. This matters most for aggregate, whose output column is a matrix; that is why it is converted to proper columns with do.call(data.frame, ...).
With base R, we can use aggregate to find the number of rows for each group (id) as well as the number of rows with age less than 21:
aggregate(age~id, df, function(x) c(count = length(x),
count1 = length(x[x < 21])))
# id age.count age.count1
#1 A1 4 1
#2 A2 3 2
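All of the answers above rely on the same fact: a comparison such as age < 21 produces a logical vector, and summing a logical vector counts its TRUE values. A minimal base R sketch of that idea follows; the data frame is rebuilt by hand from the table in the question, and under_21 is just an illustrative name.

```r
# Data frame reconstructed from the question's table
df <- data.frame(id  = c("A1", "A1", "A1", "A1", "A2", "A2", "A2"),
                 id1 = c("a1", "a2", "a3", "a4", "b1", "b5", "b64"),
                 age = c(32, 45, 45, 12, 15, 34, 17))

# A logical vector: TRUE where the condition holds
under_21 <- df$age < 21

# TRUE is treated as 1 and FALSE as 0, so sum() counts the matches
total_under_21 <- sum(under_21)

# The same idea per group, using tapply
count  <- tapply(df$age, df$id, length)
count1 <- tapply(under_21, df$id, sum)

res <- data.frame(id     = names(count),
                  count  = as.vector(count),
                  count1 = as.vector(count1))
res
```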

Convert columns i to j to percentage

Suppose I have the following data:
df1 <- data.frame(name=c("A1","A1","B1","B1"),
somevariable=c(0.134,0.5479,0.369,NA),
othervariable=c(0.534, NA, 0.369, 0.3333))
In this example, I want to convert columns 2 and 3 to percentages (with one decimal place). I can do it with this code:
library(dplyr)
library(scales)
df1 %>%
mutate(somevariable=try(percent(somevariable),silent = T),
othervariable=try(percent(othervariable),silent = T))
But I'm hoping there is a better way, particularly for the case where I have many columns instead of just 2.
I tried mutate_each but I'm doing something wrong...
df1 %>%
mutate_each(funs = try(percent(),silent = T), -name)
Thanks!
Here's an alternative approach using a custom function. This function only modifies numeric vectors, so there is no need to worry about try or about removing non-numeric columns. It also handles NAs by default:
myfun <- function(x) {
if(is.numeric(x)){
ifelse(is.na(x), x, paste0(round(x*100L, 1), "%"))
} else x
}
df1 %>% mutate_each(funs(myfun))
# name somevariable othervariable
# 1 A1 13.4% 53.4%
# 2 A1 54.8% <NA>
# 3 B1 36.9% 36.9%
# 4 B1 <NA> 33.3%
Try
df1 %>%
mutate_each(funs(try(percent(.), silent=TRUE)), -name)
# name somevariable othervariable
#1 A1 13.4% 53.4%
#2 A1 54.8% NA%
#3 B1 36.9% 36.9%
#4 B1 NA% 33.3%
If you need to keep the NAs from being turned into percentages:
df1 %>%
mutate_each(funs(try(ifelse(!is.na(.), percent(.), NA),
silent=TRUE)),-name)
# name somevariable othervariable
#1 A1 13.4% 53.4%
#2 A1 54.8% <NA>
#3 B1 36.9% 36.9%
#4 B1 <NA> 33.3%
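mutate_each and funs are deprecated in current dplyr releases; across (available from dplyr 1.0.0) is the replacement for applying one function to many columns. A minimal sketch of the same conversion written that way is shown here. The helper name to_percent is hypothetical, and it formats with sprintf instead of scales::percent so that one decimal place and NA handling are explicit.

```r
library(dplyr)

# Data from the question
df1 <- data.frame(name = c("A1", "A1", "B1", "B1"),
                  somevariable = c(0.134, 0.5479, 0.369, NA),
                  othervariable = c(0.534, NA, 0.369, 0.3333))

# Hypothetical helper: format a proportion as a percentage with one
# decimal place, leaving missing values as NA
to_percent <- function(x) {
  ifelse(is.na(x), NA_character_, sprintf("%.1f%%", 100 * x))
}

# Apply the helper to every column except name
res <- df1 %>%
  mutate(across(-name, to_percent))
res
```

With many columns, only the selection inside across needs to change, for example a range such as somevariable:othervariable.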
