R: Merging data frames: Exclude specific column value, but keep skipped rows

I want to merge two data frames, skipping rows based on a specific column value, but still keep the skipped rows in the final merged data frame. I can manage the first part (skipping), but not the second.
Here are the data frames:
# Data frame 1 values
ids1 <- c(1:3)
x1 <- c(100, 101, 102)
doNotMerge <- c(1, 0, 0)
# Data frame 2 values
ids2 <- c(1:3)
x2 <- c(200, 201, 202)
# Creating the data frames
df1 <- as.data.frame(matrix(c(ids1, x1, doNotMerge),
                            nrow = 3,
                            ncol = 3,
                            dimnames = list(c(), c("ID", "X1", "DoNotMerge"))))
df2 <- as.data.frame(matrix(c(ids2, x2),
                            nrow = 3,
                            ncol = 2,
                            dimnames = list(c(), c("ID", "X2"))))
# df1 contents:
# ID X1 DoNotMerge
# 1 1 100 1
# 2 2 101 0
# 3 3 102 0
# df2 contents:
# ID X2
# 1 1 200
# 2 2 201
# 3 3 202
I used merge:
merged <- merge(df1[df1$DoNotMerge != 1,], df2, by = "ID", all = T)
# merged contents:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 2 101 0 201
# 3 3 102 0 202
I was able to do the skipping part, but what I actually want is to also keep the df1 row where DoNotMerge == 1, like so:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 1 100 1 NA
# 3 2 101 0 201
# 4 3 102 0 202
Can anyone please help? Thanks.

Update: I actually found the solution while writing the question (I ran into this question), so I figured I'd post it in case someone else encounters this problem:
require(plyr)
rbind.fill(merged, df1[df1$DoNotMerge == 1,])
# Result:
# ID X1 DoNotMerge X2
# 1 1 NA NA 200
# 2 2 101 0 201
# 3 3 102 0 202
# 4 1 100 1 NA
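For completeness, the same result can be had without the plyr dependency. A minimal base-R sketch, reusing the merged object from above (the skipped name is just for illustration): add the missing X2 column to the excluded rows by hand, then rbind.
# Base R alternative: give the skipped df1 rows the missing X2 column, then rbind
skipped <- df1[df1$DoNotMerge == 1, ]
skipped$X2 <- NA   # rbind() for data frames matches columns by name
rbind(merged, skipped)
This gives the same four rows as the rbind.fill() result above.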

Related

how to add a new row with extra column in R?

I was trying to add the results of a for loop to a data frame as new rows, but I get an error when a new result has more columns than the original data frame. How can I add the new result with its extra columns, adding the extra column names to the original data frame?
e.g.
original data frame:
   A B C
x1 1 1 1
x2 2 2 2
x3 3 3 3
I want to get:
   A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
I tried rbind (Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match),
rbind.fill (Error: All inputs to rbind.fill must be data.frames),
and bind_rows (Argument 2 must have names).
In base R, this can be done by creating a new column 'D' with NA and then assigning a new row with 4.
df1$D <- NA
df1['x4', ] <- 4
Output:
> df1
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Or in a single line
rbind(cbind(df1, D = NA), x4 = 4)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Regarding the error in bind_rows, it happens when the for loop output is not a named vector
library(dplyr)
> vec1 <- c(4, 4, 4, 4)
> bind_rows(df1, vec1)
Error: Argument 2 must have names.
Run `rlang::last_error()` to see where the error occurred.
If it is a named vector, then it should work
> vec1 <- c(A = 4, B = 4, C = 4, D = 4)
> bind_rows(df1, vec1)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
...4 4 4 4 4
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3),
                 class = "data.frame",
                 row.names = c("x1", "x2", "x3"))
You probably have something like this, if you list the elements of your for loop.
(l <- list(x1, x2, x3, x4, x5))
# [[1]]
# [1] 1 1 1
#
# [[2]]
# [1] 2 2 2 2
#
# [[3]]
# [1] 3 3
#
# [[4]]
# [1] 4
#
# [[5]]
# NULL
Multiple elements can be row-bound using a do.call(rbind, .) approach; your problem is how to rbind elements that differ in length.
There's a `length<-` replacement function with which you can adjust the length of a vector. To know which length to use, there's another function, lengths, which gives the length of each list element; you are interested in the maximum.
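A quick illustration of those two helpers in isolation (base R; the toy vector v is just for demonstration):
v <- 1:3
length(v) <- 5        # `length<-` pads the vector with NA up to the new length
v
# [1]  1  2  3 NA NA
lengths(list(1:3, 1:4, NULL))
# [1] 3 4 0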
I include the special case where an element is NULL (our 5th element of l); since the length of NULL cannot be changed, those elements are replaced with NA.
So altogether you may do:
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, max(lengths(l))))
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 NA
# [2,] 2 2 2 2
# [3,] 3 3 NA NA
# [4,] 4 NA NA NA
# [5,] NA NA NA NA
Or, since you probably want a data frame with pretty row and column names:
ml <- max(lengths(l))
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, ml)) |>
  as.data.frame() |>
  `dimnames<-`(list(paste0('x', 1:length(l)), LETTERS[1:ml]))
# A B C D
# x1 1 1 1 NA
# x2 2 2 2 2
# x3 3 3 NA NA
# x4 4 NA NA NA
# x5 NA NA NA NA
Note: R >= 4.1 is used (for the native pipe |>).
Data:
x1 <- rep(1, 3); x2 <- rep(2, 4); x3 <- rep(3, 2); x4 <- rep(4, 1); x5 <- NULL

Using IFELSE function across multiple columns [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I want to create a new column based on multiple columns of different data types
Names   1    2    3
A     000   NA  030
B     100  DDD   NA
C     XXX  000  050
Based on columns 1-3, I want to add another column with the condition: if any value >= 30, then 1, else 0.
Output will be:
Names   1    2    3   4
A     000   NA  030   1
B     100  DDD   NA   1
C     XXX  000  015   0
Note : There are 36 such columns (1-36) across where I want to use the if condition and then create a new column.
Adding some more details:
These variables are extracted from one long string like "030060000XXX010", which is split into 030, 060, 000, XXX, 010. Using an IFELSE-style condition: if any of the number-looking values is >= 30, then 1, else 0.
Consider using if_any. Loop over the columns other than 'Names', create a logical condition after converting to integer class, replace the NA with FALSE, and coerce the logical output from if_any to binary (+).
library(dplyr)
library(tidyr)
df1 %>%
  mutate(new = +(if_any(-Names, ~ replace_na(as.integer(.) >= 30, FALSE))))
Since you want to group by 3, one way is to split.default the columns by 3, operate on one three-pack at a time, then combine them later.
I'll demonstrate on the data, but with the three data columns repeated so that we can show the iteration.
dat <- structure(list(Names = c("A", "B", "C"), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L), X1 = c("000", "100", "XXX"), X2 = c(NA, "DDD", "000"), X3 = c(30L, NA, 50L)), class = "data.frame", row.names = c(NA, -3L))
split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3)
# $`0`
# X1 X2 X3
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
# $`1`
# X1.1 X2.1 X3.1
# 1 000 <NA> 30
# 2 100 DDD NA
# 3 XXX 000 50
With this, we'll work on one three-pack at a time.
func <- function(x, lim = 30) {
  x <- as.matrix(x)
  x <- `dim<-`(suppressWarnings(as.numeric(x)), dim(x))
  cbind(x, (+(rowSums(x <= lim, na.rm = TRUE) > 0)))
}
lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)
# $`0`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 0
# [3,] NA 0 50 1
# $`1`
# [,1] [,2] [,3] [,4]
# [1,] 0 NA 30 1
# [2,] 100 NA NA 0
# [3,] NA 0 50 1
Now we just need to recombine them all again:
do.call(cbind, c(list(dat[,1,drop=FALSE]), lapply(split.default(dat[,-1], (seq_along(dat)[-1]-2) %/% 3), func)))
# Names 0.1 0.2 0.3 0.4 1.1 1.2 1.3 1.4
# 1 A 0 NA 30 1 0 NA 30 1
# 2 B 100 NA NA 0 100 NA NA 0
# 3 C NA 0 50 1 NA 0 50 1

Column Split into columns and rows in R

My Data looks like
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}
'))
Desired data view
user_id answer_id1 answer_id2 answer_id3 answer_id4
13 A B C D
13 A1 B1 C1 D1
15 W X Y Z
15 W1 X1 Y1 Z1
I'm new to R and hope to get a solution soon, as I always do here.
This may not be the best solution, but it can get you from your sample input to your desired output using stringr, purrr, and tidyr. See regex101 for an explanation of the regex used in the stringr::str_match_all() call.
df <- data.frame(user_id=c('13','15'),
answer_id = c('{"row[0][0]":"A","row[0][1]":"B","row[0][2]":"C","row[0][3]":"D","row[1][0]":"A1","row[1][1]":"B1","row[1][2]":"C1","row[1][3]":"D1"}', '{"row[0][0]":"W","row[0][1]":"X","row[0][2]":"Y","row[0][3]":"Z","row[1][0]":"W1","row[1][1]":"X1","row[1][2]":"Y1","row[1][3]":"Z1"}'),
stringsAsFactors=F)
#use regex to extract row ids and answers
regex_matches <- stringr::str_match_all(df$answer_id, '\\"row\\[(\\d+)\\]\\[(\\d+)\\]\\":\\"([^\\"]*)\\"')
#add user id to each result
answers_by_user <- purrr::map2(df$user_id, regex_matches, ~cbind(.x, .y[,-1]))
#combine list of matrices and convert to df
answers_df <- data.frame(do.call(rbind, answers_by_user))
#add meaningful names
names(answers_df) <- c("user_id", "row_1", "row_2", "value")
#convert to wide
spread_row_1 <- tidyr::spread(answers_df, row_1, value)
final_df <- tidyr::spread(answers_df, row_2, value)
#remove row column
final_df$row_1 <- NULL
#clean up names
names(final_df) <- c("user_id", "answer_id1", "answer_id2", "answer_id3", "answer_id4")
final_df
#output
user_id answer_id1 answer_id2 answer_id3 answer_id4
1 13 A B C D
2 13 A1 B1 C1 D1
3 15 W X Y Z
4 15 W1 X1 Y1 Z1
Column 2 looks like JSON, so you could do something like this to get it into a form that you can do something with...
library(rjson)
df2 <- lapply(1:nrow(df), function(i)
  data.frame(user = df[i, 1],
             answer = unlist(fromJSON(as.character(df[i, 2]))),
             stringsAsFactors = FALSE))
df2 <- do.call(rbind, df2)
df2[,"r1"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\1",rownames(df2))
df2[,"r2"] <- gsub(".+\\[(\\d)]\\[(\\d)].*","\\2",rownames(df2))
df2
user answer r1 r2
row[0][0] 13 A 0 0
row[0][1] 13 B 0 1
row[0][2] 13 C 0 2
row[0][3] 13 D 0 3
row[1][0] 13 A1 1 0
row[1][1] 13 B1 1 1
row[1][2] 13 C1 1 2
row[1][3] 13 D1 1 3
row[0][0]1 15 W 0 0
row[0][1]1 15 X 0 1
row[0][2]1 15 Y 0 2
row[0][3]1 15 Z 0 3
row[1][0]1 15 W1 1 0
row[1][1]1 15 X1 1 1
row[1][2]1 15 Y1 1 2
row[1][3]1 15 Z1 1 3
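From there, one way to pivot this long result into the wide layout asked for is a sketch assuming tidyr is available (the wide name is just for illustration; the generated columns come out as answer_id0 through answer_id3 and can be renamed afterwards):
library(tidyr)
# one output row per (user, r1) combination, one column per r2 value
wide <- pivot_wider(df2, id_cols = c(user, r1), names_from = r2,
                    values_from = answer, names_prefix = "answer_id")
wide$r1 <- NULL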

R: Find the Variance of all Non-Zero Elements in Each Row

I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(20, 25),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the var would be var(15, 32, 16), and for ID 4 it would again return NA because it has no numbers at all to calculate variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len <- nrow(d)
variances <- numeric(len)
for (i in 1:len) {
  # get all nonzero values in the ith row of d into a vector nonzerodat here
  currentvar <- var(nonzerodat)
  variances[i] <- currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[!x == 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
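If you want the variances attached to the IDs, you can assign the result straight back to the data frame:
Data$variance <- apply(Data[, -1], 1, var_nonzero)
Data
#   ID Value1 Value2 Value3 variance
# 1  1     20     25      0     12.5
# 2  2      2      0      0       NA
# 3  3     15     32     16     91.0
# 4  4      0      0      0       NA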
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
  melt(., id.var = "id") %>%
  group_by(id) %>%
  summarise(variance = var(value[value != 0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA

Order data frame by columns with different calling schemes

Say I have the following data frame:
df <- data.frame(x1 = c(2, 2, 2, 1),
                 x2 = c(3, 3, 2, 1),
                 let = c("B", "A", "A", "A"))
df
x1 x2 let
1 2 3 B
2 2 3 A
3 2 2 A
4 1 1 A
If I want to order df by x1, then x2 then let, I do this:
df2 <- df[with(df, order(x1, x2, let)), ]
df2
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
However, x1 and x2 have actually been saved as an id <- c("x1", "x2") vector earlier in the code, which I use for other purposes.
So my problem is that I want to reference id instead of x1 and x2 in my order call, but unfortunately anything like df[order(df[id], df$let), ] results in an 'argument lengths differ' error.
From what I can tell (and this has been addressed at another SO thread), the problem is that length(df[id]) == 2 and length(df$let) == 4.
I have been able to make it through this workaround:
df3 <- df[order(df[, id[1]], df[, id[2]], df[, "let"]), ]
df3
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
But it looks ugly and depends on knowing the size of id.
Is there a more elegant solution to sorting my data frame by id then let?
I would suggest using do.call(order, ...) and combining id and "let" with c():
id <- c("x1", "x2")
df[do.call(order, df[c(id, "let")]), ]
# x1 x2 let
# 4 1 1 A
# 3 2 2 A
# 2 2 3 A
# 1 2 3 B
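For comparison, a dplyr sketch of the same sort (assuming dplyr >= 1.0 is available) uses arrange() with across()/all_of(), which also takes the id vector directly:
library(dplyr)
df %>% arrange(across(all_of(c(id, "let"))))
# same ordering as above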
