Merging 2 dataframes with duplicate columns? - r

I have an empty dataframe as such:
a <- data.frame(x = rep(NA,10))
which gives the following:
x
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
and I have another dataframe, b, as such (the non-sequential row numbers are because this dataframe is a subset of a much larger dataframe):
x
1 NA
2 4
3 NA
5 NA
6 5
7 71
8 3
What I want to do is merge the 2 dataframes so that the values from b replace the current values in x, for an output like this:
x
1 NA
2 4
3 NA
4 NA
5 NA
6 5
7 71
8 3
9 NA
10 NA
My first instinct is to use a for loop like this:
for (i in rownames(b)){
a[i,"x"] <- b[i,"x"]
}
However, this is inefficient for large dataframes. I haven't seen an implementation of this using merge and cbind/rbind yet.
Is there a more efficient way to accomplish this?

transform(a, x = b[row.names(a),])
# x
#1 NA
#2 4
#3 NA
#4 NA
#5 NA
#6 5
#7 71
#8 3
#9 NA
#10 NA
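For comparison, the loop in the question can also be collapsed into a single vectorized assignment. This is only a sketch: it assumes every row name of b also occurs in a, and the row names given to b below are reconstructed from the printout in the question.

```r
a <- data.frame(x = rep(NA, 10))
# b's row names are an assumption taken from the question's printout (row 4 is absent)
b <- data.frame(x = c(NA, 4, NA, NA, 5, 71, 3),
                row.names = c("1", "2", "3", "5", "6", "7", "8"))

# positions in a of the rows that also occur in b
idx <- match(rownames(b), rownames(a))
# one vectorized assignment instead of a loop over row names
a$x[idx] <- b$x
a$x
```

If b could contain row names that are missing from a, idx would contain NA and the assignment would need a guard first.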

We can merge based on rownames:
a <- data.frame(x = rep(NA,10))
b <- data.frame(x = c(NA,4,NA,NA,5,71,3))
data.frame(x=merge(a, b, by=0, suffixes = c(".a","") ,all=TRUE)[,"x"])
#> x
#> 1 NA
#> 2 NA
#> 3 4
#> 4 NA
#> 5 NA
#> 6 5
#> 7 71
#> 8 3
#> 9 NA
#> 10 NA
d.b's answer is the more efficient one.
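One caveat when merging on row names, offered as a hedged note: merge matches row names as character strings, so the merged rows may come back in lexicographic order ("1", "10", "2", ...) rather than numeric order. Note also that the b above is built with default row names 1 to 7 rather than the question's 1, 2, 3, 5, 6, 7, 8. A minimal sketch that keeps the question's row names (assumed from its printout) and reorders explicitly:

```r
a <- data.frame(x = rep(NA, 10))
# row names assumed from the question's printout
b <- data.frame(x = c(NA, 4, NA, NA, 5, 71, 3),
                row.names = c("1", "2", "3", "5", "6", "7", "8"))

# by = 0 merges on row names; the default suffixes give x.x (from a) and x.y (from b)
m <- merge(a, b, by = 0, all = TRUE)
# restore numeric row order before extracting b's values
m <- m[order(as.integer(as.character(m$Row.names))), ]
res <- data.frame(x = m$x.y, row.names = as.character(m$Row.names))
res$x
```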

Related

Keep first duplicate in a sequence across all sequences of numerical values and replace the remaining values with NA in R

I have the following dataset, where numerical values in column x are intertwined with NAs. I would like to keep the first instance of the numerical values across all numerical sequences and replace the remaining duplicated values in each sequence with NAs.
x = c(1,1,1,NA,NA,NA,3,3,3,NA,NA,1,1,1,NA)
data = data.frame(x)
> data
x
1 1
2 1
3 1
4 NA
5 NA
6 NA
7 3
8 3
9 3
10 NA
11 NA
12 1
13 1
14 1
15 NA
So that the final result should be:
> data
x
1 1
2 NA
3 NA
4 NA
5 NA
6 NA
7 3
8 NA
9 NA
10 NA
11 NA
12 1
13 NA
14 NA
15 NA
I would appreciate some suggestions, ideally with dplyr. Thanks!
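This simple solution seems to work as I expected. It doesn't use a dplyr pipeline, but it does rely on dplyr::lag (stats::lag does not shift a plain vector, so dplyr must be installed).
data$x[data$x == dplyr::lag(data$x)] <- NA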
> data
x
1 1
2 NA
3 NA
4 NA
5 NA
6 NA
7 3
8 NA
9 NA
10 NA
11 NA
12 1
13 NA
14 NA
15 NA
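If dplyr is not available at all, the same comparison with the previous element can be sketched in base R. This is an illustrative alternative, not part of the original answer; the name dup is made up here.

```r
x <- c(1, 1, 1, NA, NA, NA, 3, 3, 3, NA, NA, 1, 1, 1, NA)
data <- data.frame(x)

# compare each element with its predecessor; the first element has none
dup <- c(FALSE, data$x[-1] == data$x[-nrow(data)])
# which() drops the NA comparisons, so only true repeats are blanked
data$x[which(dup)] <- NA
data$x
```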
For those who want to stay within a dplyr workflow:
library(dplyr)
data %>%
as_tibble() %>%
mutate(x = na_if(x, lag(x)))
#> # A tibble: 15 × 1
#> x
#> <dbl>
#> 1 1
#> 2 NA
#> 3 NA
#> 4 NA
#> 5 NA
#> 6 NA
#> 7 3
#> 8 NA
#> 9 NA
#> 10 NA
#> 11 NA
#> 12 1
#> 13 NA
#> 14 NA
#> 15 NA

How to extract values of existing variable and paste them in top rows of dataframe (using R)

Probably there's a very easy solution to this but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new which is the exact description of what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope this is self-explanatory. What I need is to take the values of "value" from the rows where id is NA (i.e. the last five rows) and paste them into the rows where id is not NA (i.e. the first five rows) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I basically need to cut out the bottom half and paste it as new variable(s) in the top section of the dataframe. Hope this makes sense.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
In case you have more rows with a non-NA id compared to NA id you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
This removes the leading run of NAs and appends the same number of NAs to the end. The ids are not considered here.

R: In dataframe: set first non-NA value in column to NA

I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for each column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe to reproduce my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first value with NA in 1 column (provided by @Joshua Ulrich here), however I would like to apply it to all columns without manually editing the code for 300+ columns:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!
We can use base R (df1 here is the question's df):
df1[] <- lapply(df1, function(x) replace(x, which(!is.na(x))[1], NA))
df1
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or as @thelatemail suggested
df1[] <- lapply(df1, function(x) replace(x, Position(Negate(is.na), x), NA))
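Since the question mentions trying a function plus a for loop, here is a minimal sketch of that route. The helper name blank_first_non_na is made up for illustration; unlike min(which(...)), it leaves an all-NA column untouched instead of producing a warning.

```r
x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df <- data.frame(x1, x2, x3, x4)

# hypothetical helper: set the first non-NA element of a vector to NA
blank_first_non_na <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) > 0) x[idx[1]] <- NA  # skip columns that are entirely NA
  x
}

# assign back column by column so the changes are stored in df
for (col in names(df)) {
  df[[col]] <- blank_first_non_na(df[[col]])
}
```

The key point is assigning the result back to df[[col]]; calling the function without assignment leaves the dataframe unchanged.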
Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information. In particular, you may want to look at some of the examples shown here.
library(dplyr)
mutate_all(df, funs(if_else(row_number() == min(which(!is.na(.))), NA_integer_, .)))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA
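In more recent dplyr versions, funs() is deprecated and mutate_all() is superseded by across(). A sketch of the same idea in that style, assuming dplyr 1.0 or later is installed:

```r
library(dplyr)

x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df <- data.frame(x1, x2, x3, x4)

# for every column, replace the first non-NA element with NA
out <- mutate(df, across(everything(), ~ replace(.x, which(!is.na(.x))[1], NA)))
out
```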

Insertion of blank columns at repetitive positions in a large data frame in R

For example,
dataX = data.frame(a=c(1:5),b=c(2:6),c=c(3:7),d=c(4:8),e=c(5:9),f=c(6:10))
How do I insert a blank column after every 2 columns?
Here is a similar method that uses a trick with matrices and integer selection of columns. The original data.frame gets an NA column with cbind. The columns of this new object are then referenced with every two columns and then the final NA column using a matrix to fill in the final column with rbind.
cbind(dataX, NewCol=NA)[c(rbind(matrix(seq_along(dataX), 2), ncol(dataX)+1))]
a b NewCol c d NewCol.1 e f NewCol.2
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA
We can use split to split the dataset at unique positions into a list of data.frames, loop through the list, cbind each element with NA, and then cbind the elements together
res <- do.call(cbind, setNames(lapply(split.default(dataX, (seq_len(ncol(dataX))-1)%/%2),
function(x) cbind(x, NewCol = NA)), NULL))
res
# a b NewCol c d NewCol e f NewCol
#1 1 2 NA 3 4 NA 5 6 NA
#2 2 3 NA 4 5 NA 6 7 NA
#3 3 4 NA 5 6 NA 7 8 NA
#4 4 5 NA 6 7 NA 8 9 NA
#5 5 6 NA 7 8 NA 9 10 NA
names(res) <- make.unique(names(res))
Let us construct an empty data frame with the same number of rows as dataX
empty_df <- data.frame(x1=rep(NA,nrow(dataX)),x2=rep(NA,nrow(dataX)),x3=rep(NA,nrow(dataX)))
dataX<-cbind(dataX,empty_df)
dataX<-dataX[c("a","b","x1","c","d","x2","e","f","x3")]
resulting in:
a b x1 c d x2 e f x3
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA
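For a large data frame, hard-coding the column order does not scale, so the split-based idea can be wrapped in a small function. This is a sketch only; the function name insert_blank_every and its argument k are made up for illustration.

```r
# hypothetical helper: append an NA column after every k columns of df
insert_blank_every <- function(df, k = 2) {
  n <- ncol(df)
  # column positions grouped into consecutive blocks of size k
  groups <- split(seq_len(n), (seq_len(n) - 1) %/% k)
  pieces <- lapply(groups, function(idx) cbind(df[idx], NewCol = NA))
  out <- do.call(cbind, unname(pieces))
  # make the repeated NewCol names unique
  names(out) <- make.unique(names(out))
  out
}

dataX <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8, e = 5:9, f = 6:10)
res <- insert_blank_every(dataX, k = 2)
names(res)
```

If the number of columns is not a multiple of k, the last (shorter) block also gets a blank column after it.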

Search and replace entries in a dataframe in two columns

I have a data set in which there are a few missing values.
The dataset looks like the following:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 NA 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 NA 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 NA 2
3 10 NA NA 5 1 NA 5
4 5 NA NA 6 2 NA 3
4 11 25 10 NA NA 2.5 NA
My data is in the format shown above. Column a is a time period, in sequence, with multiple codes corresponding to each period.
Column b shows an item. An item either appears repeatedly over time or appears only once.
Columns g and h are computed as g = c0/d0 and h = c1/d1. Of the two, column g is the more important.
There are a few NAs, and some of the column b entries are duplicated while the rest are unique.
I have to perform the following steps to fill in the NAs in column g:
First, find out whether each entry in column b is repeated or unique. E.g. entries 5 and 6 are repeated, whereas 7, 8, 9, 10 and 11 are unique.
Next, check whether column g already holds a value for that item.
If it does, take the average of the item's non-NA values in column g. For item 5, the values are 2 and 2.5, so the average of 2.25 should be placed in column g for the repeated 5 at a=4.
If the item is repeated but column g is still NA, simply use the column h value as the value of column g.
For the non-repeated items, like 7, 9, 10, etc., just replace the missing column g entry with column h.
The final output should be as follows:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 4 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 1 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 2 2
3 10 NA NA 5 1 5 5
4 5 NA NA 6 2 2.25 3
4 11 25 10 NA NA 2.5 NA
Please help me out with this. If anything in the question is unclear, or if more details are required, do let me know.
Your desired output is inconsistent. You have one row missing, column h has been altered, and hence column g in the seventh row looks inconsistent too.
Either way, following your description, I would do this in two steps.
First, subset your data to only the b values that have dupes and replace the NAs with the mean of the rest of the group.
Then replace all the NAs that are left with column h.
I'd suggest data.table as it allows comfortable operations on subsets
library(data.table)
setDT(df)[duplicated(b) | duplicated(b, fromLast = TRUE), # operate only on the dupes
g := replace(g, is.na(g), mean(g, na.rm = TRUE)), by = b] # replace NA by group
df[is.na(g), g := as.double(h)] # subset by NAs and replace with corresponding values in h
df
# a b c0 d0 c1 d1 g h
# 1: 1 5 20 10 NA NA 2.00 NA
# 2: 1 6 NA NA 8 2 4.00 4
# 3: 2 5 25 10 NA NA 2.50 NA
# 4: 2 7 NA NA 2 2 1.00 1
# 5: 2 8 50 10 NA NA 5.00 NA
# 6: 3 9 10 10 NA NA 1.00 NA
# 7: 3 6 NA NA 8 2 4.00 4
# 8: 3 10 NA NA 5 1 5.00 5
# 9: 4 5 NA NA 6 2 2.25 3
# 10: 4 11 25 10 NA NA 2.50 NA
We can reduce it to "one" step once we recognize that when grouped by b, duplicates imply that there are more than one row grouped. Therefore, the condition to replace the NA values in g by the mean of its group (that are not NA) is if:
the number of rows grouped by b is greater than one and not all of g in the group is NA
Otherwise, replace the NA values in g with h:
library(data.table)
setDT(df)[, g := if (.N > 1 && !all(is.na(g))) {
replace(g, is.na(g), mean(g, na.rm = TRUE))
} else {
replace(g, is.na(g), as.double(h))
}, by=b][]
## a b c0 d0 c1 d1 g h
## 1: 1 5 20 10 NA NA 2.00 NA
## 2: 1 6 NA NA 8 2 4.00 4
## 3: 2 5 25 10 NA NA 2.50 NA
## 4: 2 7 NA NA 2 2 1.00 1
## 5: 2 8 50 10 NA NA 5.00 NA
## 6: 3 9 10 10 NA NA 1.00 NA
## 7: 3 6 NA NA 8 2 4.00 4
## 8: 3 10 NA NA 5 1 5.00 5
## 9: 4 5 NA NA 6 2 2.25 3
##10: 4 11 25 10 NA NA 2.50 NA
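For readers without data.table, the same rule can be sketched in base R with ave(). This is an illustrative alternative, not the answerers' code; it uses only columns b, g and h, typed in from the question's input table, and the intermediate names are made up.

```r
# columns b, g and h copied from the question's input table
df <- data.frame(
  b = c(5, 6, 5, 7, 8, 9, 6, 10, 5, 11),
  g = c(2, NA, 2.5, NA, 5, 1, NA, NA, NA, 2.5),
  h = c(NA, 4, NA, 1, NA, NA, 2, 5, 3, NA)
)

# group mean of g per item; NaN when every g in the group is NA
grp_mean <- ave(df$g, df$b, FUN = function(v) mean(v, na.rm = TRUE))
# number of rows per item
grp_n <- ave(seq_along(df$b), df$b, FUN = length)

# repeated item with at least one known g: use the group mean, otherwise use h
fill <- ifelse(grp_n > 1 & !is.nan(grp_mean), grp_mean, df$h)
df$g <- ifelse(is.na(df$g), fill, df$g)
df$g
```

With the question's own h values, row 7 is filled with 2 rather than the 4 shown in the answers' output, because the answers' data has d1 = 2 in that row.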
