I have a data set with a few missing values.
The dataset looks like the following:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 NA 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 NA 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 NA 2
3 10 NA NA 5 1 NA 5
4 5 NA NA 6 2 NA 3
4 11 25 10 NA NA 2.5 NA
My data is in the format above. Column a is a time period; it runs in sequence and has multiple item codes for each period.
Column b identifies an item. An item either appears repeatedly over time or appears only once.
Columns g and h are derived by division: g = c0/d0 and h = c1/d1. Of the two, column g is the more important.
As shown, there are a few NAs, and some of the column b entries are duplicated while the rest are unique.
I have to perform the following steps to fill in the NAs in column g:
First, determine whether each entry in column b is repeated or unique. E.g., entries 5 and 6 are repeated, whereas 7, 8, 9, 10 and 11 are unique.
For a repeated item, the next step is to check whether column g already holds a value for that item.
If it does, take the average of the item's non-NA values in column g. For item 5 the values are 2 and 2.5, so the average of 2.25 should be placed in column g for the repeated item 5 at a=4.
If an item is repeated but column g is NA in all of its rows, simply use the column h value as the column g value.
For non-repeated items such as 7, 9 and 10, just replace the NA in column g with column h.
The final output should be as follows:
a b c0 d0 c1 d1 g h
1 5 20 10 NA NA 2 NA
1 6 NA NA 8 2 4 4
2 5 25 10 NA NA 2.5 NA
2 7 NA NA 2 2 1 1
2 8 50 10 NA NA 5 NA
3 9 10 10 NA NA 1 NA
3 6 NA NA 8 4 2 2
3 10 NA NA 5 1 5 5
4 5 NA NA 6 2 2.25 3
4 11 25 10 NA NA 2.5 NA
Please help me out with this. If anything in the question is unclear or more details are needed, do let me know.
Your desired output is inconsistent. You have one row missing, column h has been altered and hence column g at the seventh row looks inconsistent too.
Either way, following your description, I would do this in two steps.
First, subset your data to the b values that have duplicates and replace the NAs with the mean of the rest of the group.
Then replace all the remaining NAs with column h.
I'd suggest data.table, as it allows comfortable operations on subsets:
library(data.table)
setDT(df)[duplicated(b) | duplicated(b, fromLast = TRUE), # operate only on the dupes
g := replace(g, is.na(g), mean(g, na.rm = TRUE)), by = b] # replace NA by group
df[is.na(g), g := as.double(h)] # subset by NAs and replace with corresponding values in h
df
# a b c0 d0 c1 d1 g h
# 1: 1 5 20 10 NA NA 2.00 NA
# 2: 1 6 NA NA 8 2 4.00 4
# 3: 2 5 25 10 NA NA 2.50 NA
# 4: 2 7 NA NA 2 2 1.00 1
# 5: 2 8 50 10 NA NA 5.00 NA
# 6: 3 9 10 10 NA NA 1.00 NA
# 7: 3 6 NA NA 8 2 4.00 4
# 8: 3 10 NA NA 5 1 5.00 5
# 9: 4 5 NA NA 6 2 2.25 3
# 10: 4 11 25 10 NA NA 2.50 NA
We can reduce it to "one" step once we recognize that, when grouped by b, a duplicated value implies more than one row in its group. Therefore, the NA values in g are replaced by the mean of the group's non-NA values if:
the number of rows in the b group is greater than one, and not every g in the group is NA.
Otherwise, replace the NA values in g with h:
library(data.table)
setDT(df)[, g := if (.N > 1 && !all(is.na(g))) {
replace(g, is.na(g), mean(g, na.rm = TRUE))
} else {
replace(g, is.na(g), as.double(h))
}, by=b][]
## a b c0 d0 c1 d1 g h
## 1: 1 5 20 10 NA NA 2.00 NA
## 2: 1 6 NA NA 8 2 4.00 4
## 3: 2 5 25 10 NA NA 2.50 NA
## 4: 2 7 NA NA 2 2 1.00 1
## 5: 2 8 50 10 NA NA 5.00 NA
## 6: 3 9 10 10 NA NA 1.00 NA
## 7: 3 6 NA NA 8 2 4.00 4
## 8: 3 10 NA NA 5 1 5.00 5
## 9: 4 5 NA NA 6 2 2.25 3
##10: 4 11 25 10 NA NA 2.50 NA
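For readers without data.table, the same logic can be sketched in base R with ave(). This is only an illustration: it uses the question's example values, leaves out columns c0, d0, c1 and d1 for brevity, and the helper name fill_group_mean is made up here.

```r
# Example values from the question (c0, d0, c1, d1 left out for brevity)
df <- data.frame(
  a = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 6, 5, 7, 8, 9, 6, 10, 5, 11),
  g = c(2, NA, 2.5, NA, 5, 1, NA, NA, NA, 2.5),
  h = c(NA, 4, NA, 1, NA, NA, 2, 5, 3, NA)
)

# Hypothetical helper: replace NA by the mean of the non-NA values.
# If every value is NA the mean is NaN, which is.na() still reports as missing.
fill_group_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Step 1: fill NA in g with the group mean, grouping by b
df$g <- ave(df$g, df$b, FUN = fill_group_mean)

# Step 2: anything still missing falls back to h
still_na <- is.na(df$g)
df$g[still_na] <- df$h[still_na]

df
```

Because single-row groups and all-NA groups both end up as NaN after step 1, they are handled by the fallback to h in step 2 without a separate duplicate check.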
I need to create a dataframe with all possible combinations of a variable. I found an example using data.table that works like this:
df <- data.frame("Age"=1:10)
df <- setDT(df)
df[,lag.Age1 := c(NA,Age[-.N])]
That creates this:
Age lag.Age1
1: 1 NA
2: 2 1
3: 3 2
.. .. ..
10: 10 9
Now, I want to keep adding lagged vectors that produce something like this:
Age lag.Age1 lag.Age2 lag.Age3
1: 1 NA NA NA
2: 2 1 NA NA
3: 3 2 1 NA
.. .. .. .. ..
10: 10 9 8 7
I tried this for the third column:
df[,lag.Age2 := c(NA,NA,Age[1:8])]
But I really don't get how data.table works here. That line runs but it doesn't do anything.
EDIT: what if the dataframe has a group variable and I want the lag to be done by group? For the first lag it is just:
df <- data.frame("Age"=1:10, "Group"=c(rep("A",4),rep("B",6)))
setDT(df)
df[,lag.Age1 := c(NA,Age[-.N]), by="Group"]
How would this work for further lags? Note that the groups have different lengths.
data.table::shift() is very powerful because you can provide a vector of offsets. For example, if you want n lag columns (from 1 to n), you can do this:
n=3
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3
<int> <char> <int> <int> <int>
1: 1 A NA NA NA
2: 2 A 1 NA NA
3: 3 A 2 1 NA
4: 4 A 3 2 1
5: 5 B NA NA NA
6: 6 B 5 NA NA
7: 7 B 6 5 NA
8: 8 B 7 6 5
9: 9 B 8 7 6
10: 10 B 9 8 7
Alternatively:
df[, c(paste0("lag.Age",1:3)):=shift(Age,1:3), Group]
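To make clear what a lag of k does, here is a rough base R sketch of the same idea. It is not how shift() is implemented; lag_vec is a hypothetical helper written only for illustration. The question's attempt c(NA,NA,Age[1:8]) is the k = 2 case of this pattern for an ungrouped vector of length 10.

```r
# Hypothetical helper: move x down by k positions, padding the top with NA
# and keeping the original length.
lag_vec <- function(x, k) {
  c(rep(NA, k), x)[seq_along(x)]
}

df <- data.frame(Age = 1:10, Group = c(rep("A", 4), rep("B", 6)))

# Add lag.Age1 .. lag.Age3, computed separately within each Group
for (k in 1:3) {
  df[[paste0("lag.Age", k)]] <- ave(df$Age, df$Group,
                                    FUN = function(x) lag_vec(x, k))
}

df
```

Because the padding is cut back to the length of each group, groups of different sizes need no special handling.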
If you want to have the number of lags vary by group, where the number equals the number of observations in that group-1, then one approach is to do this:
# make function to return lags based on length of x
f <- function(x) shift(x,1:(length(x)-1))
# get unique groups
grps= unique(df$Group)
# set as DT, and use lapply()
setDT(df)
grp_lags = lapply(grps, \(g) f(df[Group==g, Age]))
names(grp_lags)<-grps
Output:
$A
$A[[1]]
[1] NA 1 2 3
$A[[2]]
[1] NA NA 1 2
$A[[3]]
[1] NA NA NA 1
$B
$B[[1]]
[1] NA 5 6 7 8 9
$B[[2]]
[1] NA NA 5 6 7 8
$B[[3]]
[1] NA NA NA 5 6 7
$B[[4]]
[1] NA NA NA NA 5 6
$B[[5]]
[1] NA NA NA NA NA 5
Or, if you are okay with lots of extra columns (i.e. for the groups with fewer observations), you can do this:
n = df[, .N, Group][,max(N)]
cols = paste0("lag.Age",1:n)
df[, c(cols):=shift(Age,1:n), Group]
Output:
Age Group lag.Age1 lag.Age2 lag.Age3 lag.Age4 lag.Age5 lag.Age6
1: 1 A NA NA NA NA NA NA
2: 2 A 1 NA NA NA NA NA
3: 3 A 2 1 NA NA NA NA
4: 4 A 3 2 1 NA NA NA
5: 5 B NA NA NA NA NA NA
6: 6 B 5 NA NA NA NA NA
7: 7 B 6 5 NA NA NA NA
8: 8 B 7 6 5 NA NA NA
9: 9 B 8 7 6 5 NA NA
10: 10 B 9 8 7 6 5 NA
I have an empty dataframe as such:
a <- data.frame(x = rep(NA,10))
which gives the following:
x
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
and I have another dataframe, b, as such (the non-sequential row numbers are because this dataframe is a subset of a much larger dataframe):
x
1 NA
2 4
3 NA
5 NA
6 5
7 71
8 3
What I want to do is merge the 2 dataframes so that the values from b replace the current values in x, for an output like this:
x
1 NA
2 4
3 NA
4 NA
5 NA
6 5
7 71
8 3
9 NA
10 NA
My first instinct is to use a for loop like this:
for (i in rownames(b)){
a[i,"x"] <- b[i,"x"]
}
However, this is inefficient for large dataframes. I haven't seen an implementation of this using merge and cbind/rbind yet.
Is there a more efficient way to accomplish this?
transform(a, x = b[row.names(a),])
# x
#1 NA
#2 4
#3 NA
#4 NA
#5 NA
#6 5
#7 71
#8 3
#9 NA
#10 NA
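The loop from the question can also be written as a single vectorized assignment that indexes a by b's row names. The sketch below assumes b looks like the data shown in the question; its construction with explicit row names is a guess, since the question does not show how b was created.

```r
a <- data.frame(x = rep(NA, 10))

# Assumed reconstruction of b: a subset whose row names skip "4"
b <- data.frame(x = c(NA, 4, NA, NA, 5, 71, 3),
                row.names = c("1", "2", "3", "5", "6", "7", "8"))

# Assign all of b's values at once, matching on row names
a[rownames(b), "x"] <- b$x

a
```

This relies on every row name of b already existing in a; a row name that is missing from a would be appended as a new row instead.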
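We can merge based on rownames. Note that b is rebuilt here with row names 1 to 7 rather than the question's 1, 2, 3, 5, 6, 7, 8, and that merge sorts the row names as character strings ("1", "10", "2", ...), which is why the 4 lands in the third position of the output below: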
a <- data.frame(x = rep(NA,10))
b <- data.frame(x = c(NA,4,NA,NA,5,71,3))
data.frame(x=merge(a, b, by=0, suffixes = c(".a","") ,all=TRUE)[,"x"])
#> x
#> 1 NA
#> 2 NA
#> 3 4
#> 4 NA
#> 5 NA
#> 6 5
#> 7 71
#> 8 3
#> 9 NA
#> 10 NA
d.b's answer is the efficient one.
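If the merge approach is preferred, the rows can be put back into numeric order afterwards. The following is a sketch under the assumption that b has the row names shown in the question; the column name x_b is made up to avoid a name clash.

```r
a <- data.frame(x = rep(NA, 10))

# Assumed reconstruction of b with the question's row names
b <- data.frame(x = c(NA, 4, NA, NA, 5, 71, 3),
                row.names = c("1", "2", "3", "5", "6", "7", "8"))

# Merge on row names (by = 0), keeping every row of a
m <- merge(a, setNames(b, "x_b"), by = 0, all.x = TRUE)

# The Row.names key can come back in text order ("1", "10", "2", ...),
# so sort it numerically before extracting the values
m <- m[order(as.integer(as.character(m$Row.names))), ]

result <- data.frame(x = m$x_b)
result
```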
Probably there's a very easy solution to this but I can't figure it out for some reason. This is what my data (in R) look like (except for value_new which is the exact description of what I need!):
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
I hope that this is self explanatory. What I need is to take the values of "value" from the rows where id is NA (i.e. the last five rows) and paste them into the first five rows (i.e. where id is not NA) of a new variable I'd like to call "value_new".
What is an easy way of doing this? I'd basically need to cut out the bottom half and paste it as new variable(s) in the top section of the dataframe. Hope this makes sense.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9))
dat$value_new = NA
dat$value_new[!is.na(dat$id)] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 NA 7 NA
# 7 NA NA NA
# 8 NA 4 NA
# 9 NA 1 NA
# 10 NA 9 NA
In case you have more rows with a non-NA id compared to NA id you can use:
dat<-data.frame("id"=c(1,2,3,4,5,6,NA,NA,NA,NA,NA),
"value"=c(rep(NA,6),7,NA,4,1,9))
k = sum(is.na(dat$id))
dat$value_new = NA
dat$value_new[!is.na(dat$id)][1:k] = dat$value[is.na(dat$id)]
dat
# id value value_new
# 1 1 NA 7
# 2 2 NA NA
# 3 3 NA 4
# 4 4 NA 1
# 5 5 NA 9
# 6 6 NA NA
# 7 NA 7 NA
# 8 NA NA NA
# 9 NA 4 NA
# 10 NA 1 NA
# 11 NA 9 NA
where k is the number of values you'll replace in the top part of your new column.
dat<-data.frame("id"=c(1,2,3,4,5,NA,NA,NA,NA,NA),
"value"=c(rep(NA,5),7,NA,4,1,9),
"value_new"=c(7,NA,4,1,9,rep(NA,5)))
ind <- which(!is.na(dat$value))[1]
newcol <- `length<-`(dat$value[ind:nrow(dat)], nrow(dat))
dat$value_new2 <- newcol
# id value value_new value_new2
#1 1 NA 7 7
#2 2 NA NA NA
#3 3 NA 4 4
#4 4 NA 1 1
#5 5 NA 9 9
#6 NA 7 NA NA
#7 NA NA NA NA
#8 NA 4 NA NA
#9 NA 1 NA NA
#10 NA 9 NA NA
Short version:
dat$value_new2 <- `length<-`(dat$value[which(!is.na(dat$value))[1]:nrow(dat)], nrow(dat))
I remove the leading run of NAs and append the same number of NAs to the end. The ids are not considered here.
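The replacement function `length<-` is what does the padding: called on a vector with a larger length, it extends the vector with NA. A small illustration on a made-up vector v:

```r
v <- c(NA, NA, 7, NA, 4)

first <- which(!is.na(v))[1]                  # position of the first non-NA value
tail_part <- v[first:length(v)]               # drop the leading NAs
padded <- `length<-`(tail_part, length(v))    # pad with NA back to the original length

padded
```

Only the leading NAs are moved; an NA inside the series (the fourth element of v here) stays where it is relative to its neighbours.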
I have a large dataframe, 300+ columns (time series) with about 2600 observations. The columns are filled with a lot of NA's and then a short time series, and then typically NA's again. I would like to find the first non-NA value in each column and replace it with NA.
This is what I'm hoping to achieve, only with a much bigger dataframe:
Before:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 1 1 NA NA
4 2 2 1 1
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
After:
x1 x2 x3 x4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 2 2 NA NA
5 3 3 2 2
6 4 4 3 3
7 5 5 4 4
8 6 6 5 5
9 7 7 6 6
10 8 8 7 7
11 9 9 NA NA
12 10 10 NA NA
13 NA NA NA NA
14 NA NA NA NA
I've searched around and found a way to do this for a single column, but my efforts to apply it to the whole dataframe have proven difficult.
I have created an example dataframe to reproduce my original dataframe:
#Dataframe with NA
x1=x2=c(NA,NA,1:10,NA,NA)
x3=x4=c(NA,NA,NA,1:7,NA,NA,NA,NA)
df=data.frame(x1,x2,x3,x4)
I have used this to replace the first value with NA in 1 column (provided by #Joshua Ulrich here); however, I would like to apply it to all columns without manually editing the code 300+ times:
NonNAindex <- which(!is.na(df[,1]))
firstNonNA <- min(NonNAindex)
is.na(df[,1]) <- seq(firstNonNA, length.out=1)
I have tried to set the above as a function and run it for all columns with apply/lapply, as well as a for loop, but haven't really figured out how to apply the changes to my dataframe. I'm sure there is something I've completely overlooked as I'm just taking my first small steps in R.
All suggestions would be highly appreciated!
We can use base R
df1[] <- lapply(df1, function(x) replace(x, which(!is.na(x))[1], NA))
df1
# x1 x2 x3 x4
#1 NA NA NA NA
#2 NA NA NA NA
#3 NA NA NA NA
#4 2 2 NA NA
#5 3 3 2 2
#6 4 4 3 3
#7 5 5 4 4
#8 6 6 5 5
#9 7 7 6 6
#10 8 8 7 7
#11 9 9 NA NA
#12 10 10 NA NA
#13 NA NA NA NA
#14 NA NA NA NA
Or as #thelatemail suggested
df1[] <- lapply(df1, function(x) replace(x, Position(Negate(is.na), x), NA))
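The same idea can be spelled out as a named function, which also makes the all-NA case explicit. This is just a sketch; drop_first_value is a hypothetical name, and df1 is rebuilt from the example data in the question.

```r
x1 <- x2 <- c(NA, NA, 1:10, NA, NA)
x3 <- x4 <- c(NA, NA, NA, 1:7, NA, NA, NA, NA)
df1 <- data.frame(x1, x2, x3, x4)

# Hypothetical helper: blank out the first non-NA value of a vector.
# A vector that is entirely NA is returned unchanged.
drop_first_value <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) > 0) x[idx[1]] <- NA
  x
}

# df1[] keeps the data.frame structure while replacing every column
df1[] <- lapply(df1, drop_first_value)

df1
```

Assigning into df1[] rather than df1 is what keeps the result a data.frame instead of a plain list.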
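Since you would like to do this for all columns, you could use the mutate_all function from dplyr. See http://dplyr.tidyverse.org/ for more information and examples. Note that funs() has been deprecated since dplyr 0.8.0 and mutate_all() is superseded by across() as of dplyr 1.0.0, so the call below may warn or fail on current versions.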
library(dplyr)
mutate_all(df, funs(if_else(row_number() == min(which(!is.na(.))), NA_integer_, .)))
#> x1 x2 x3 x4
#> 1 NA NA NA NA
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 2 2 NA NA
#> 5 3 3 2 2
#> 6 4 4 3 3
#> 7 5 5 4 4
#> 8 6 6 5 5
#> 9 7 7 6 6
#> 10 8 8 7 7
#> 11 9 9 NA NA
#> 12 10 10 NA NA
#> 13 NA NA NA NA
#> 14 NA NA NA NA
For example,
dataX = data.frame(a=c(1:5),b=c(2:6),c=c(3:7),d=c(4:8),e=c(5:9),f=c(6:10))
How do I insert a blank column after every 2 columns?
Here is a similar method that uses a trick with matrices and integer column selection. cbind first appends a single NA column to the original data.frame. The column indices are then arranged in a two-row matrix (one column per pair), rbind adds the index of the NA column as a third row, and flattening that matrix yields every two columns followed by the NA column.
cbind(dataX, NewCol=NA)[c(rbind(matrix(seq_along(dataX), 2), ncol(dataX)+1))]
a b NewCol c d NewCol.1 e f NewCol.2
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA
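Breaking the one-liner into steps shows where the column order comes from. This is a sketch using the question's dataX; the intermediate names pairs and idx are introduced here only for illustration.

```r
dataX <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8, e = 5:9, f = 6:10)

# Column indices laid out two per matrix column: (1,2), (3,4), (5,6)
pairs <- matrix(seq_along(dataX), nrow = 2)

# Add the index of the extra NA column (7) under every pair, then flatten
# column by column: 1 2 7 3 4 7 5 6 7
idx <- c(rbind(pairs, ncol(dataX) + 1))

# Selecting the same column several times makes R de-duplicate the names
res <- cbind(dataX, NewCol = NA)[idx]
names(res)
```

Note that matrix(seq_along(dataX), nrow = 2) assumes the number of columns is a multiple of 2; with an odd number of columns the indices would be recycled with a warning.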
We can use split to split the dataset at every second column into a list of data.frames, loop through the list, cbind each element with an NA column, and cbind the elements together
res <- do.call(cbind, setNames(lapply(split.default(dataX, (seq_len(ncol(dataX))-1)%/%2),
function(x) cbind(x, NewCol = NA)), NULL))
res
# a b NewCol c d NewCol e f NewCol
#1 1 2 NA 3 4 NA 5 6 NA
#2 2 3 NA 4 5 NA 6 7 NA
#3 3 4 NA 5 6 NA 7 8 NA
#4 4 5 NA 6 7 NA 8 9 NA
#5 5 6 NA 7 8 NA 9 10 NA
names(res) <- make.unique(names(res))
Let us construct an empty data frame with the same number of rows as dataX
empty_df <- data.frame(x1=rep(NA,nrow(dataX)),x2=rep(NA,nrow(dataX)),x3=rep(NA,nrow(dataX)))
dataX<-cbind(dataX,empty_df)
dataX<-dataX[c("a","b","x1","c","d","x2","e","f","x3")]
resulting in:
a b x1 c d x2 e f x3
1 1 2 NA 3 4 NA 5 6 NA
2 2 3 NA 4 5 NA 6 7 NA
3 3 4 NA 5 6 NA 7 8 NA
4 4 5 NA 6 7 NA 8 9 NA
5 5 6 NA 7 8 NA 9 10 NA