Replace values of rows with missing values by values of another row - r

I'm trying to work with conditionals but can't find an easy way to do it.
I have a dataset with missing values in column A. I want to create a new column C that takes the original value from A for every row where A is not missing, and takes the value from another column (column B) for rows where A is missing.
All columns are character variables.
A       B
13 A 1  15 A 2
15 A 2  15 A 2
NA      15 A 8
10 B 3  15 A 2
NA      15 A 5
What I want is:
A       B       C
13 A 1  15 A 2  13 A 1
15 A 2  15 A 2  15 A 2
NA      15 A 8  15 A 8
10 B 3  15 A 2  10 B 3
NA      15 A 5  15 A 5
I tried a loop, but the result is not satisfactory:
for(i in 1:length(df$A)) {
  if(is.na(df$A[i])) {
    df$C <- df$B
  }
  else {
    df$C <- df$A
  }
}
If anyone can help me,
Thanks in advance

In general, if you find yourself looping over a data frame, there is probably a more efficient solution: either use a vectorised function, as Jonathan does in his answer, or use dplyr as follows.
We can check whether A is NA: if so, we set c equal to B; otherwise we keep the value of A.
library(dplyr)
dat %>% mutate(c = if_else(is.na(A), B, A))
       A      B      c
1 13 A 1 15 A 2 13 A 1
2 15 A 2 15 A 2 15 A 2
3   <NA> 15 A 8 15 A 8
4 10 B 3 15 A 2 10 B 3
5   <NA> 15 A 5 15 A 5

df$C <- ifelse(is.na(df$A), df$B, df$A)
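As a minimal, self-contained sketch of this one-liner, the data frame below is rebuilt from the table in the question (character columns are assumed, hence `stringsAsFactors = FALSE`). It also suggests why the loop in the question gives an unsatisfactory result: each iteration overwrites the whole column C, so only the last row's test decides the outcome. The name `C_loop` is only an illustration.

```r
# Rebuild the example data from the question (character columns assumed)
df <- data.frame(
  A = c("13 A 1", "15 A 2", NA, "10 B 3", NA),
  B = c("15 A 2", "15 A 2", "15 A 8", "15 A 2", "15 A 5"),
  stringsAsFactors = FALSE
)

# Vectorised: test every row at once and take B where A is missing
df$C <- ifelse(is.na(df$A), df$B, df$A)

# Loop equivalent, fixed by indexing a single element with [i] on both sides
C_loop <- character(nrow(df))
for (i in seq_len(nrow(df))) {
  C_loop[i] <- if (is.na(df$A[i])) df$B[i] else df$A[i]
}

# Both approaches should agree
print(identical(df$C, C_loop))
print(df)
```

The vectorised form is usually preferable: it is shorter and avoids the element-by-element bookkeeping.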

We could use coalesce:
library(dplyr)
df %>%
  mutate(c = coalesce(A, B))
       A      B      c
1 13 A 1 15 A 2 13 A 1
2 15 A 2 15 A 2 15 A 2
3   <NA> 15 A 8 15 A 8
4 10 B 3 15 A 2 10 B 3
5   <NA> 15 A 5 15 A 5

Related

Subtracting 1 column from multiple columns

df <- data.frame(a=1:3, b=4:6, c=7:9, d=10:12, e=13:15)
a b c d e
1 4 7 10 13
2 5 8 11 14
3 6 9 12 15
Is it possible to subtract 'column a' from all of the other columns without doing each calculation individually?
I have a dataset of 1001 columns and would like to know if it is possible to do so without doing 1000 calculations manually.
Many Thanks
Try this:
#Data
df <- data.frame(a=1:3, b=4:6, c=7:9, d=10:12, e=13:15)
#Isolate the first column (drop=FALSE keeps it as a data frame)
df1 <- df[,1,drop=FALSE]
#Subtract it from every other column
dfr <- cbind(df1,as.data.frame(apply(df[,-1],2,function(x) x-df1)))
names(dfr)<-names(df)
  a b c d  e
1 1 3 6 9 12
2 2 3 6 9 12
3 3 3 6 9 12
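A shorter variant, offered as a sketch rather than the answerer's method: subtracting a vector with one value per row from a data frame applies it to every column, so the other columns can be updated in a single assignment. The name `out` is only an illustration.

```r
# Same example data as in the question
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# Work on a copy so the original stays untouched
out <- df

# df$a has one value per row, so it is subtracted from each remaining column
out[-1] <- df[-1] - df$a

print(out)
```

This scales to 1001 columns unchanged, since `-1` simply means "every column except the first".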

How do I select rows in a data frame before and after a condition is met?

I've been searching the web for a few days now and I can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. Now I want to select the 100 rows before, all rows during, and the 1000 rows after a specific condition is met, and fill the rest with NAs. I tried a for loop with if/ifelse, but it hasn't worked so far. I think it shouldn't be a big thing, but at the moment I just can't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My Data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a (in this example) 2 rows before, all rows while and 3 rows after the value of b is >1 and fill the rest with NA's. [Because this is just an example I guess you can imagine that after these 15 rows there are more rows with the value for b changing from 1 to 4 several times (I did not post it, so I won't spam the question with unnecessary data).]
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Thank you.
Best regards,
Chris
Here is the same approach as missuse's, but with data.table:
library(data.table)
foo<-data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[ unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
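Here
unique(c(DT[,.I[b>1] ],DT[,.I[b>1]+3 ],DT[,.I[b>1]-2 ]))
gives you the desired indices: the unique row indices where the condition holds, plus the same indices shifted by +3 and by -2. Note that this covers the whole window only when each run of b > 1 is at least 3 rows long; for shorter runs the in-between offsets (-1, +1, +2) would need to be added as well.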
Here is an attempt.
Get the indices that satisfy the condition b > 1:
z <- which(foo$b > 1)
Get the indices for (z - 2) : (z + 3):
ind <- unique(unlist(lapply(z, function(x){
  g <- pmax(x - 2, 1) # in case x - 2 is less than 1
  h <- pmin(x + 3, nrow(foo)) # in case x + 3 runs past the last row
  g : h
})))
Create a d column filled with NA:
foo$d <- NA
Replace the elements at those indices with the values of foo$a:
foo$d[ind] <- foo$a[ind]
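The same idea can be sketched without a per-index function by adding every offset to every matching row at once. The values of a below are deterministic placeholders chosen for illustration (the question uses `sample()`), and `before` and `after` are hypothetical parameter names; for the real data they would be 100 and 1000.

```r
# Toy data: b follows the question, a is a placeholder sequence
foo <- data.frame(t = 1:15,
                  a = 101:115,
                  b = c(1, 1, 1, 1, 1, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1))

before <- 2  # rows to keep before the condition holds
after  <- 3  # rows to keep after the condition holds

# Rows where the condition holds
z <- which(foo$b > 1)

# Add every offset from -before to +after to every matching row
ind <- unique(as.vector(outer(z, seq(-before, after), "+")))

# Drop indices that fall outside the data frame
ind <- ind[ind >= 1 & ind <= nrow(foo)]

# Start from NA and copy a only inside the windows
foo$d <- NA
foo$d[ind] <- foo$a[ind]

print(foo)
```

With a million rows and wide windows, `outer()` builds a large intermediate matrix, so this sketch trades memory for brevity.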
library(dplyr)
library(purrr)
# example dataset
foo<-data.frame(t = 1:15,
a = sample(1:15),
b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
v = (x-2) : (x+3)
v[v > 0]
}
foo %>% # from your dataset
filter(b > 1) %>% # keep rows where b > 1
pull(t) %>% # get the positions
map(GetIDsBeforeAfter) %>% # for each position apply the function
unlist() %>% # unlist all sets indices
unique() -> ids_to_remain # keep unique ones and save them in a vector
foo$d = foo$a # copy column a as d (the question asks for the values of a)
foo$d[-ids_to_remain] = NA # put NA to all positions not in our vector
foo
#     t  a b  c  d
# 1   1  5 1  8 NA
# 2   2  6 1 14 NA
# 3   3  4 1 10 NA
# 4   4  1 1  7  1
# 5   5 10 1  5 10
# 6   6  8 4  9  8
# 7   7  9 4 15  9
# 8   8  3 4  6  3
# 9   9  7 4  2  7
# 10 10 12 1  3 12
# 11 11 11 1  1 11
# 12 12 15 1  4 15
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15  2 1 12 NA

How to generate new variables based on the name of the variables in the data frame

For example, I have a toy dataset as the one I created below,
a1<-1:10
a2<-11:20
v<-c(1,2,1,NA,2,1,2,1,2,1)
data<-data.frame(a1,a2,v,stringsAsFactors = F)
Then I want to create a new variable y that is assigned the value of a1 or a2 (or NA) depending on the value of variable v. So y should equal 1 12 3 NA 15 6 17 8 19 10.
I want to generate it with a command similar to the ones listed below. They don't work, I guess because of a vectorisation issue. How can I fix it?
In reality, I have several a columns, say 10, and the actual values are characters instead of numbers.
data$y[!is.na(data$v)]<-data[,paste0('a',data$v)]
or
data%>%
mutate(y=ifelse(!is.na(v),get(paste0('a',v)),NA))
You could use standard indexing with cbind for that:
dat$y <- dat[cbind(1:nrow(dat), dat$v)]
The result:
> dat
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
(I used dat instead of data, because it is not wise to call a dataframe the same as a function; see ?data)
Only idea that comes to my mind:
data%>%
mutate(y=ifelse(!is.na(v),paste0('a',v),NA)) %>%
mutate(z=ifelse(!is.na(y),(ifelse(y=="a1",get("a1"),get("a2"))),NA))
a1 a2 v y z
1 1 11 1 a1 1
2 2 12 2 a2 12
3 3 13 1 a1 3
4 4 14 NA <NA> NA
5 5 15 2 a2 15
6 6 16 1 a1 6
7 7 17 2 a2 17
8 8 18 1 a1 8
9 9 19 2 a2 19
10 10 20 1 a1 10
or more directly:
data%>%
mutate(y=ifelse(!is.na(v),(ifelse(v==1, get("a1"),get("a2"))),NA))
a1 a2 v y
1 1 11 1 1
2 2 12 2 12
3 3 13 1 3
4 4 14 NA NA
5 5 15 2 15
6 6 16 1 6
7 7 17 2 17
8 8 18 1 8
9 9 19 2 19
10 10 20 1 10
still based on ifelse :(
You need to use a matrix accessor:
# Get the indices of the non-missing values
ind <- which(!is.na(data$v))
# Map column names to column indices
tab <- structure(match(c("a1", "a2"), names(data)), .Names = c("a1", "a2"))
# Start with y all NA so rows with a missing v stay NA
data$y <- NA
# Access data with a matrix accessor
data$y[ind] <- data[cbind(ind, tab[paste0('a', data$v[ind])])]
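For the real case mentioned in the question (several a columns holding character values), matrix indexing can be sketched as follows. The data frame `dat`, its three a columns and their contents are invented for illustration; only the indexing technique is taken from the answers above.

```r
# Hypothetical data: three character columns a1..a3 and a selector v
dat <- data.frame(a1 = c("a", "b", "c", "d", "e"),
                  a2 = c("A", "B", "C", "D", "E"),
                  a3 = c("1", "2", "3", "4", "5"),
                  v  = c(1, 3, NA, 2, 3),
                  stringsAsFactors = FALSE)

# Character matrix holding only the candidate columns, in order a1, a2, a3
src <- as.matrix(dat[paste0("a", 1:3)])

# One (row, column) pair per row; a pair containing NA yields NA
dat$y <- src[cbind(seq_len(nrow(dat)), dat$v)]

print(dat)
```

Restricting the matrix to the a columns keeps v out of the character conversion and lets v be used directly as the column index.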

R: fill new columns in data.frame based on row values by condition?

I want to create a new column in my data.frame, based on values in my rows.
If "type" is not equal to "a", my "new.area" column should contain the "area" value of type "a" at the same "distance". This applies to multiple "distances".
Example:
# create data frame
distance<-rep(seq(1,5, by = 1),2)
area<-c(11:20)
type<-rep(c("a","b"),each = 5)
# check data.frame
(my.df<-data.frame(distance, area, type))
distance area type
1 1 11 a
2 2 12 a
3 3 13 a
4 4 14 a
5 5 15 a
6 1 16 b
7 2 17 b
8 3 18 b
9 4 19 b
10 5 20 b
I want to create a new column (my.df$new.area) where, for every "distance", each row holds the "area" value of type "a".
distance area type new.area
1 1 11 a 11
2 2 12 a 12
3 3 13 a 13
4 4 14 a 14
5 5 15 a 15
6 1 16 b 11
7 2 17 b 12
8 3 18 b 13
9 4 19 b 14
10 5 20 b 15
I know how to make this manually for a single row:
my.df$new.area[my.df$distance == 1 ] <- 11
But how to make it automatically?
Here is a base R solution using index subsetting ([) and match:
my.df$new.area <- with(my.df, area[type == "a"][match(distance, distance[type == "a"])])
which returns
my.df
distance area type new.area
1 1 11 a 11
2 2 12 a 12
3 3 13 a 13
4 4 14 a 14
5 5 15 a 15
6 1 16 b 11
7 2 17 b 12
8 3 18 b 13
9 4 19 b 14
10 5 20 b 15
area[type == "a"] supplies the vector of possibilities. match is used to return the indices from this vector through the distance variable. with is used to avoid the repeated use of my.df$.
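The one-liner can be unpacked into its intermediate steps, which may make the role of match easier to see. This is only a sketch of the same technique; the names `is_a`, `lookup_area`, `lookup_dist` and `pos` are introduced here for illustration.

```r
# Same example data as in the question
distance <- rep(1:5, 2)
area <- 11:20
type <- rep(c("a", "b"), each = 5)
my.df <- data.frame(distance, area, type)

# Rows of type "a" form the lookup table
is_a <- my.df$type == "a"
lookup_area <- my.df$area[is_a]      # areas of type "a"
lookup_dist <- my.df$distance[is_a]  # their distances

# For every row, find where its distance sits in the lookup table
pos <- match(my.df$distance, lookup_dist)

# Pull the matching type "a" area for every row
my.df$new.area <- lookup_area[pos]

print(my.df)
```

Unlike indexing by `distance` directly, match does not require the distances to be the sequence 1, 2, 3, and it returns NA for a distance that has no type "a" row.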
We can use data.table
library(data.table)
setDT(my.df)[, new.area := area[type=="a"] , distance]
my.df
# distance area type new.area
# 1: 1 11 a 11
# 2: 2 12 a 12
# 3: 3 13 a 13
# 4: 4 14 a 14
# 5: 5 15 a 15
# 6: 1 16 b 11
# 7: 2 17 b 12
# 8: 3 18 b 13
# 9: 4 19 b 14
#10: 5 20 b 15
Or we can use the numeric index of distance as it is in a sequence
with(my.df, area[type=="a"][distance])
#[1] 11 12 13 14 15 11 12 13 14 15

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the data frames above while keeping only the most recent Times for each name, according to the year. It should look like this:
names year Times
a     12   2
b     12   1
c     16   7
d     16   7
e     13   4
I'm guessing that you do not mean to merge these but rather to combine them by stacking. Your question is ambiguous, since the "duplication" could occur at the dataframe level or at the vector level. Your example does not display any duplication at the dataframe level, but it does at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
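We could also use aggregate. Note that this takes the maximum of year and of Times separately within each name, which matches the most recent row here because Times increases with year: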
df <- rbind(df1,df2)
aggregate(cbind(df$year,df$Times)~df$names,df,max)
# df$names V1 V2
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
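The same "sort, then keep the last row per name" idea can also be sketched in base R, ordering explicitly by year since the question defines "most recent" by year. The names `all_rows`, `sorted` and `latest` are only illustrative.

```r
# Example data from the question
df1 <- data.frame(names = c("a", "b", "c", "c", "d"),
                  year  = c(11, 12, 13, 14, 15),
                  Times = c(1, 1, 3, 5, 6))
df2 <- data.frame(names = c("a", "e", "e", "c", "c", "d"),
                  year  = c(12, 12, 13, 15, 16, 16),
                  Times = c(2, 2, 4, 6, 7, 7))

# Stack the two data frames
all_rows <- rbind(df1, df2)

# Sort by name, then by year, so the most recent row of each name comes last
sorted <- all_rows[order(all_rows$names, all_rows$year), ]

# Keep only the last row for each name
latest <- sorted[!duplicated(sorted$names, fromLast = TRUE), ]

print(latest)
```

Because whole rows are kept, year and Times always come from the same observation, even if Times does not increase with year.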
