Compare values of two dataframes and substitute them - r

I've two data frames with the same number of rows and columns, 113x159 with this structure:
df1:
1 2 3 4
a AT AA AG CT
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
df2:
1 2 3 4
a NA 23 12 NA
b NA 23 44 12
c 11 14 27 55
d NA NA 12 34
I want to compare df1 and df2 value by value: if the value in df2 is NA and the value in df1 isn't, replace the df1 value with NA (the same applies when the df1 value is NA and the df2 value is not).
At the end, my df has to be this:
1 2 3 4
a NA AA AG NA
b NA AG AT CC
c AG GG GT AA
d NA NA TT TC
I've written this if loop but it doesn't work:
merge.na<-function(x){
for (i in df2) AND (k in df1){
if (i==NA) AND (k!=NA)
k==NA}
Any idea?

We can use replace
replace(df1, is.na(df2), NA)
# X1 X2 X3 X4
#a <NA> AA AG <NA>
#b <NA> AG AT CC
#c AG GG GT AA
#d <NA> <NA> TT TC
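A minimal, self-contained sketch of the same idea, rebuilding the example data from the question (the column names X1..X4 and the object names mask and result are chosen here for illustration). It also shows why the loop in the question cannot work: comparing with NA using == or != yields NA rather than TRUE or FALSE, so missingness has to be tested with is.na().

```r
# Example data from the question (column names X1..X4 are an assumption)
df1 <- data.frame(
  X1 = c("AT", NA, "AG", NA),
  X2 = c("AA", "AG", "GG", NA),
  X3 = c("AG", "AT", "GT", "TT"),
  X4 = c("CT", "CC", "AA", "TC"),
  row.names = c("a", "b", "c", "d")
)
df2 <- data.frame(
  X1 = c(NA, NA, 11, NA),
  X2 = c(23, 23, 14, NA),
  X3 = c(12, 44, 27, 12),
  X4 = c(NA, 12, 55, 34),
  row.names = c("a", "b", "c", "d")
)

# NA == NA evaluates to NA, not TRUE, so `if (i == NA)` can never succeed
na_compare <- NA == NA

# is.na() on a data frame returns a logical matrix of the same shape
mask <- is.na(df2)

# Set every cell of df1 to NA where df2 is NA; cells already NA in df1 stay NA
result <- replace(df1, mask, NA)
```

Because cells that are NA in df1 are left untouched, the second condition from the question (NA in df1 but not in df2) needs no extra step.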

Related

R strsplit for uneven number of columns in a huge data set

I have a huge data set with about 200 columns and 25k+ rows, with the separator ';'. The columns are of an uneven number.
I read it in as a delimited txt file with df <- read.delim("~path/data.txt", sep=";", header = FALSE)
which reads nicely as a table.
My issue is that many of the rows are so long that, in the txt file, they spill onto new lines, and R does not recognise that these lines should continue the same row. As a result, the columns contain information that belongs elsewhere.
Each observation of data is a dbl.
I have created a new example below for ease of reading, therefore it is not possible to simply sort conditions into columns.
***EDIT: x, y and z contain spatial coordinates, but I have substituted them for their corresponding letters for ease of reading.
The data is X-profile data giving me coordinates of the centre point along a line, followed by offsets of 1m (up to 100m either side of 0, the centre line) in each column with its corresponding height ***
My data ends up looking something like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9]
[1] x y z 1 2 3 N/A N/A N/A
[2] x y z 1 2 3 4 5 6
[3] 7 8 9 10 N/A N/A N/A N/A N/A
[4] x y z 1 2 3 4 5 7
[5] 7 8 9 N/A N/A N/A N/A N/A N/A
[6] x y z 1 2 3 N/A N/A N/A
[7] x y z 1 2 3 4 5 N/A
And I'd like it to look like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9] [c10] [c11] [c12] [c13]
[1] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[2] x y z 1 2 3 4 5 6 7 8 9 10
[3] x y z 1 2 3 4 5 6 7 8 9 N/A
[4] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[5] x y z 1 2 3 4 5 N/A N/A N/A N/A N/A
I have tried strsplit(as.character(df), split = "\n", fixed = TRUE) and it returns an error that it is not a character string. I have tried the same function with split = "\t" and split = "\r" and it returns the same error. Each attempt takes around half an hour to process so I was also wondering if there is a more efficient way to do this.
I hope I have explained clearly my aim.
EDIT
The text file is similar to the following example:
x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15
In some cases a number is split between the previous line and that below:
E.G.
101;102;103;10
4;105;106
This layout is exactly how it is being read into R.
Use scan, which omits empty lines by default. Next, find the positions of the elements that begin with "x", use findInterval to assign every element to a group, split on those groups and paste each group together. After that it is the usual strsplit, padding every row to the same length, and binding the rows into a data frame.
r <- scan('foo.txt', what='A', quiet=TRUE)
r <- split(r, findInterval(seq_len(length(r)), grep('^x', r))) |>
lapply(paste, collapse='') |>
lapply(strsplit, ';') |>
lapply(el) |>
{\(.) lapply(., `length<-`, max(lengths(.)))}() |>
do.call(what=rbind) |>
as.data.frame()
r
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 x y z 1 2 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 x y z 1 2 3 4 5 6 7 8 9 10 <NA> <NA> <NA> <NA> <NA>
# 3 x y z 1 2 3 4 5 6 7 8 9 <NA> <NA> <NA> <NA> <NA> <NA>
# 4 x y z 1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data:
writeLines(text='x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15', 'foo.txt')
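The same grouping idea can be sketched in base R without the pipe, which may be easier to step through. This is an illustration under stated assumptions, not a drop-in solution: it assumes every real record starts with "x", that all other lines are continuations, and that blank lines have already been dropped. The names lines, grp, joined, fields, width and out are made up for the example, and the input is an in-memory vector standing in for the result of readLines() on the real file. The third record reproduces the case from the question where a number is split across two lines.

```r
# Stand-in for readLines("foo.txt"); the last record has a number ("104")
# broken across two physical lines, as described in the question
lines <- c(
  "x;y;z;1;2;3",
  "x;y;z;1;2;3;4;5;6;",
  "7;8;9;10",
  "x;y;z;101;102;103;10",
  "4;105;106"
)

# A new record starts at every line beginning with "x";
# cumsum() gives continuation lines the group number of their record
grp <- cumsum(startsWith(lines, "x"))

# Glue the pieces of each record back together with no separator,
# which also repairs numbers that were split across lines
joined <- vapply(split(lines, grp), paste, character(1), collapse = "")

# Split on ";" and pad every record with NA to the longest length
fields <- strsplit(joined, ";", fixed = TRUE)
width <- max(lengths(fields))
out <- as.data.frame(do.call(rbind, lapply(fields, `length<-`, width)))
```

All columns come back as character; converting the numeric ones (for example with type.convert()) would be a separate step.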
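Using data.table:
library(data.table)
dt <- data.table(df)
# start a new group at every row whose first column is "x"
dt[, grp := cumsum(c1 == "x")]
# join each "x" row with its continuation row (if any), then drop the helper column
dt <- merge(dt[c1 == "x"], dt[c1 != "x"], by = "grp", all = TRUE)[, grp := NULL]
names(dt) <- paste0("c", seq_len(ncol(dt)))
Resulting in: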
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18
1: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
2: x y z 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA
3: x y z 1 2 3 4 5 7 7 8 9 NA NA NA NA NA NA
4: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
5: x y z 1 2 3 4 5 NA NA NA NA NA NA NA NA NA NA

Combine two dataframes same/different names [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
I have 2 dataframes and I am trying to combine them by rows, keeping not only the columns they have in common but also the columns that appear in only one of them, with NA wherever the respective value is not found.
I tried normal rbind but it asks for same column names.
Dataframes:
d1 <- data.frame(a=c('a1','a2','a3'), b = c("a51","a52","a53"), d = c(12,13,14))
d2 <- data.frame(a=c('a4','a5','a6'), g = c("a151","a152","a153"), k = c(122,123,124))
Expected Output:
a b d g k
1 a1 a51 12 <NA> NA
2 a2 a52 13 <NA> NA
3 a3 a53 14 <NA> NA
4 a4 <NA> NA a151 122
5 a5 <NA> NA a152 123
6 a6 <NA> NA a153 124
Here is an option with bind_rows
library(dplyr)
bind_rows(d1, d2)
# a b d g k
#1 a1 a51 12 <NA> NA
#2 a2 a52 13 <NA> NA
#3 a3 a53 14 <NA> NA
#4 a4 <NA> NA a151 122
#5 a5 <NA> NA a152 123
#6 a6 <NA> NA a153 124
Or using rbindlist
library(data.table)
rbindlist(list(d1, d2), fill = TRUE)
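If neither package is available, the same result can be approximated in base R by adding the missing columns to each data frame before calling rbind. This is a sketch, not the method used by either package; the helper fill_missing and the names all_cols and combined are hypothetical, and it assumes R 4.0 or later, where data.frame() keeps strings as character rather than factor.

```r
d1 <- data.frame(a = c("a1", "a2", "a3"), b = c("a51", "a52", "a53"), d = c(12, 13, 14))
d2 <- data.frame(a = c("a4", "a5", "a6"), g = c("a151", "a152", "a153"), k = c(122, 123, 124))

# Every column name that occurs in either data frame, in order of appearance
all_cols <- union(names(d1), names(d2))

# Hypothetical helper: add absent columns as NA and put columns in a common order
fill_missing <- function(df, cols) {
  df[setdiff(cols, names(df))] <- NA
  df[cols]
}

# With identical column sets, plain rbind() now works
combined <- rbind(fill_missing(d1, all_cols), fill_missing(d2, all_cols))
```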

How to make all elements of all columns align by creating empty spaces so that it stays in the same pattern

I have a temporal dataset, however, it is incomplete so I can not reconstruct the series accurately. These are the data:
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
I need it to end up like this:
df2<-data.frame(year=c(2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,
2015,2016,2017,2018),
sample1=c(NA,NA,"D","D","DDD","D","U","UU","UUU","U","D","DDD",NA,NA,NA),
sample2=c("U","UU","D","D","DDD","D","U","UU","UUU","U",NA,NA,NA,NA,NA),
sample3=c(NA,NA,NA,"D","DDD","D","U","UU","UUU","U","D","DDD","D",NA,NA),
sample4=c(NA,NA,"D","D",NA,NA,NA,NA,"UUU","U","D","DDD","D","U","U"),
sample5=c(NA,"UU","D",NA,NA,NA,"U","UU","UUU","U",NA,NA,"D","U",NA))
I need all the columns aligned to the same pattern. The best result so far came from DNA alignment functions, but to find the best alignment they sometimes invert the elements, which must not happen in my case.
I have no idea how to do this.
dplyr's add_row function makes this pretty easy, once the initial dataframe exists.
library(dplyr)
df<-data.frame(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1 = c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2 = c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3 = c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4 = c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5 = c(NA,"UU","D","U","UU","UUU","U","D","U",NA))
df2 <- df %>%
add_row(year = 2016:2018)
library(dplyr)
df <- tibble(year=c(2006,2007,2008,2009,2010,2011,2012,2013,2014,2015),
sample1=c("D","D","DDD","D","U","UU","UUU","U","D","DDD"),
sample2=c("U","UD","D","D","DDD","D","U","UU","UUU","U"),
sample3=c("D","DDD","D","U","UU","UUU","U","D","DDD","D"),
sample4=c("D","D","UUU","U","D","DDD","D","U","U",NA),
sample5=c(NA,"UU","D","U","UU","UUU","U","D","U",NA)) %>%
add_row(year = c(2004, 2005), .before = 1) %>%
add_row(year = c(2016:2018))
Result:
# A tibble: 15 x 6
year sample1 sample2 sample3 sample4 sample5
<dbl> <chr> <chr> <chr> <chr> <chr>
1 2004 NA NA NA NA NA
2 2005 NA NA NA NA NA
3 2006 D U D D NA
4 2007 D UD DDD D UU
5 2008 DDD D D UUU D
6 2009 D D U U U
7 2010 U DDD UU D UU
8 2011 UU D UUU DDD UUU
9 2012 UUU U U D U
10 2013 U UU D U D
11 2014 D UUU DDD U U
12 2015 DDD U D NA NA
13 2016 NA NA NA NA NA
14 2017 NA NA NA NA NA
15 2018 NA NA NA NA NA
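The padding step can also be sketched in base R by merging against the full range of years. Note the limits of this illustration: like add_row, it only extends the year range with NA rows and does not attempt the per-column shifting shown in the desired df2. The names full_years and padded are made up for the example, and only one sample column is included to keep it short.

```r
# One column of the original data is enough to show the idea
df <- data.frame(
  year = 2006:2015,
  sample1 = c("D", "D", "DDD", "D", "U", "UU", "UUU", "U", "D", "DDD")
)

# Target range of years, taken from the desired output in the question
full_years <- data.frame(year = 2004:2018)

# all.x = TRUE keeps every target year; years absent from df get NA
padded <- merge(full_years, df, by = "year", all.x = TRUE)
```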

Left join (or equivalent) to number index by group

I have a sequence of numbers (days):
dayNum <- c(1:10)
And I have a dataframe of id, day, and event:
id = c("aa", "aa", "aa", "bb", "bb", "cc")
day = c(1, 2, 3, 1, 6, 2)
event = c("Y", "Y", "Y", "Y", "Y", "Y")
df = data.frame(id, day, event)
Which looks like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
bb 1 Y
bb 6 Y
cc 2 Y
I am trying to put this dataframe into a form that resembles left joining dayNum with df for each id. That is, even if id "aa" had no event on day 5, I should still get a row for "aa" on day 5 with N/A or something under event. Like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
aa 4 N/A
aa 5 N/A
aa 6 N/A
aa 7 N/A
aa 8 N/A
aa 9 N/A
aa 10 N/A
bb 1 Y
bb 2 N/A
bb 3 N/A
bb 4 N/A
bb 5 N/A
bb 6 Y
bb 7 N/A
...etc
I can make this work using dplyr and left_join when my dataframe only contains one unique id, but I am stuck trying to make this work with a dataframe that has many different ids.
A push in the right direction would be greatly appreciated.
Thank you!
We can use expand.grid and merge. We create a new dataset using the unique 'id' of 'df' and the 'dayNum'. Then merge with the 'df' to get the expected output.
merge(expand.grid(id=unique(df$id), day=dayNum), df, all.x=TRUE)
# id day event
#1 aa 1 Y
#2 aa 2 Y
#3 aa 3 Y
#4 aa 4 <NA>
#5 aa 5 <NA>
#6 aa 6 <NA>
#7 aa 7 <NA>
#8 aa 8 <NA>
#9 aa 9 <NA>
#10 aa 10 <NA>
#11 bb 1 Y
#12 bb 2 <NA>
#13 bb 3 <NA>
#14 bb 4 <NA>
#15 bb 5 <NA>
#16 bb 6 Y
#17 bb 7 <NA>
#18 bb 8 <NA>
#19 bb 9 <NA>
#20 bb 10 <NA>
#21 cc 1 <NA>
#22 cc 2 Y
#23 cc 3 <NA>
#24 cc 4 <NA>
#25 cc 5 <NA>
#26 cc 6 <NA>
#27 cc 7 <NA>
#28 cc 8 <NA>
#29 cc 9 <NA>
#30 cc 10 <NA>
A similar option using data.table would be to convert the 'data.frame' to 'data.table' (setDT(df)), set the 'key' columns, and join with the dataset derived from the cross join (CJ) of unique 'id' and 'dayNum'.
library(data.table)
setDT(df, key=c('id', 'day'))[CJ(id=unique(id), day=dayNum)]
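Since the question mentions dplyr, a tidyverse-style sketch may also be useful. It assumes the tidyr package is installed and uses its complete() function, which expands the data to every combination of the listed columns and fills the rest with NA. The name out is made up for the example.

```r
library(tidyr)

dayNum <- c(1:10)
id <- c("aa", "aa", "aa", "bb", "bb", "cc")
day <- c(1, 2, 3, 1, 6, 2)
event <- c("Y", "Y", "Y", "Y", "Y", "Y")
df <- data.frame(id, day, event)

# Every id is paired with every value of dayNum;
# combinations missing from df get NA in event
out <- complete(df, id, day = dayNum)
```

Passing day = dayNum supplies the full set of days explicitly, so days that never occur in df (such as 7 to 10) are still included for each id.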

reshape data with non-unique id and varying time frames

I have a dataset with the following format:
name1 year name2 profits2010 profits2009 count
AA 2009 AA 10 15 20
AA 2010 AA 10 15 3
BB 2009 BB 4 NA 34
BB 2010 BB 4 NA 4
I need to reshape the data to this format. Any ideas on how this can be done?
name1 year name2 profits count
AA 2009 AA 15 20
AA 2010 AA 10 3
BB 2009 BB NA 34
BB 2010 BB 4 4
Try
indx <- grep('profits', names(df1))
indx2 <- cbind(1:nrow(df1), match(df1$year,
as.numeric(sub('\\D+', '', names(df1)[indx]))))
df1$profits <- df1[indx][indx2]
df1[-indx]
# name1 year name2 count profits
#1 AA 2009 AA 20 15
#2 AA 2010 AA 3 10
#3 BB 2009 BB 34 NA
#4 BB 2010 BB 4 4
This isn't really reshaping, just defining a new variable. Try this:
df$profits <- ifelse(df$year==2009,df$profits2009,df$profits2010)
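For more than two year columns, the ifelse approach does not scale, while the matrix-indexing idea from the first answer does. Below is a self-contained sketch of that idea on the example data; the names profit_cols, col_years, pick and result are chosen here for illustration, and it assumes every profit column is named "profits" followed by the year.

```r
df1 <- data.frame(
  name1 = c("AA", "AA", "BB", "BB"),
  year = c(2009, 2010, 2009, 2010),
  name2 = c("AA", "AA", "BB", "BB"),
  profits2010 = c(10, 10, 4, 4),
  profits2009 = c(15, 15, NA, NA),
  count = c(20, 3, 34, 4)
)

# Locate the profit columns and extract the year encoded in each name
profit_cols <- grep("^profits[0-9]+$", names(df1), value = TRUE)
col_years <- as.numeric(sub("^profits", "", profit_cols))

# Two-column index matrix: (row number, position of the matching year column)
pick <- cbind(seq_len(nrow(df1)), match(df1$year, col_years))

# Matrix indexing pulls exactly one cell per row
df1$profits <- as.matrix(df1[profit_cols])[pick]

# Drop the per-year columns
result <- df1[setdiff(names(df1), profit_cols)]
```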
