I have two data frames, x and y. For each value of x[,2], I check whether it equals any element of y[,1]. If so, I put the corresponding value of y[,2] in a third column of the first data frame.
I managed to do that with loops, but how can I do this in a vectorized way?
x=data.frame(1:15,15:1)
y=data.frame(3:5,c(7.2,8.5,0.3))
for ( i in 1:nrow(x)) {
for (j in 1:nrow(y)) {
if (x[i,2]==y[j,1]){
x[i,3]=y[j,2]
}else{
}
}
}
Use a join instead of loops. In the loop, the second column of 'x' is compared with the first column of 'y', so those columns go in the 'on' argument. Then assign (:=) the second column (col2) of the second dataset to create the new column 'col3' in the first dataset.
library(data.table)
setDT(x)[y, col3 := i.col2, on = .(col2 = col1)]
-output
> x
col1 col2 col3
1: 1 15 NA
2: 2 14 NA
3: 3 13 NA
4: 4 12 NA
5: 5 11 NA
6: 6 10 NA
7: 7 9 NA
8: 8 8 NA
9: 9 7 NA
10: 10 6 NA
11: 11 5 0.3
12: 12 4 8.5
13: 13 3 7.2
14: 14 2 NA
15: 15 1 NA
data
x <- data.frame(col1 = 1:15, col2 = 15:1)
y <- data.frame(col1 = 3:5, col2 = c(7.2,8.5,0.3))
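For comparison, the same lookup can be sketched in base R with match(), which returns, for each element of x$col2, the position of its first match in y$col1 (or NA when there is none). This is only a sketch using the data above; the name idx is made up for illustration and col3 follows the join answer.

```r
x <- data.frame(col1 = 1:15, col2 = 15:1)
y <- data.frame(col1 = 3:5, col2 = c(7.2, 8.5, 0.3))

# position of each x$col2 value in y$col1; NA where there is no match
idx <- match(x$col2, y$col1)

# indexing with NA yields NA, so unmatched rows are filled automatically
x$col3 <- y$col2[idx]

x[11:13, ]
```

Like the join, this assumes the keys in y$col1 are unique; with duplicated keys match() silently uses the first one.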
Update: Many thanks to @TrainingPizza, who drew my attention to the incorrect output of my first answer and also showed how it could work:
library(dplyr)
x %>%
rowwise() %>%
mutate(col3 = ifelse(col2 %in% y$col1, y$col2[y$col1==col2], NA))
col1 col2 col3
<int> <int> <dbl>
1 1 15 NA
2 2 14 NA
3 3 13 NA
4 4 12 NA
5 5 11 NA
6 6 10 NA
7 7 9 NA
8 8 8 NA
9 9 7 NA
10 10 6 NA
11 11 5 0.3
12 12 4 8.5
13 13 3 7.2
14 14 2 NA
15 15 1 NA
First answer (not correct)
Here is a dplyr way to avoid the for loop:
library(dplyr)
x %>%
mutate(V3 = ifelse(V2 %in% y$V1, y$V2, NA))
V1 V2 V3
1 1 15 NA
2 2 14 NA
3 3 13 NA
4 4 12 NA
5 5 11 NA
6 6 10 NA
7 7 9 NA
8 8 8 NA
9 9 7 NA
10 10 6 NA
11 11 5 8.5
12 12 4 0.3
13 13 3 7.2
14 14 2 NA
15 15 1 NA
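Why the first attempt goes wrong can be sketched in base R: ifelse() recycles its "yes" argument to the length of the test and then picks the element at the same row position, so the values are taken by position rather than by key. The column names V1 and V2 are assumed here to mirror that answer; hit, recycled and wrong are illustrative names.

```r
x <- data.frame(V1 = 1:15, V2 = 15:1)
y <- data.frame(V1 = 3:5, V2 = c(7.2, 8.5, 0.3))

hit <- x$V2 %in% y$V1                    # TRUE for rows 11, 12, 13

# what ifelse() effectively does with a length-3 "yes" argument
recycled <- rep_len(y$V2, nrow(x))

wrong <- ifelse(hit, y$V2, NA)

# the values picked for the matching rows are the recycled ones
identical(wrong[hit], recycled[hit])
```

Row 13 only looks right because position 13 of the recycled vector happens to hold 7.2.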
In this type of dataframe:
df <- data.frame(
x = c(3,3,1,12,2,2,10,10,10,1,5,5,2,2,17,17)
)
how can I create a new column recording the run-length ID of only a subset of x values, say, 3-20?
My own attempt only succeeds at inserting NA where the run-length count should be interrupted; but internally it seems the count is uninterrupted:
library(data.table)
library(dplyr)
df %>%
mutate(rle = ifelse(x %in% 3:20, rleid(x), NA))
x rle
1 3 1
2 3 1
3 1 NA
4 12 3
5 2 NA
6 2 NA
7 10 5
8 10 5
9 10 5
10 1 NA
11 5 7
12 5 7
13 2 NA
14 2 NA
15 17 9
16 17 9
The expected result:
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
In base R (borrowing only rleid() from data.table):
keep <- df$x %in% 3:20
df[keep, "rle"] <- data.table::rleid(df$x[keep])
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
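With left_join (this numbers the distinct values, so it equals the run-length ID only when no value comes back in a later run):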
left_join(df, df %>%
filter(x %in% 3:20) %>%
distinct() %>%
mutate(rle = row_number()))
Joining, by = "x"
x rle
1 3 1
2 3 1
3 1 NA
4 12 2
5 2 NA
6 2 NA
7 10 3
8 10 3
9 10 3
10 1 NA
11 5 4
12 5 4
13 2 NA
14 2 NA
15 17 5
16 17 5
With data.table:
library(data.table)
setDT(df)
df[x %between% c(3,20),rle:=rleid(x)][]
x rle
<num> <int>
1: 3 1
2: 3 1
3: 1 NA
4: 12 2
5: 2 NA
6: 2 NA
7: 10 3
8: 10 3
9: 10 3
10: 1 NA
11: 5 4
12: 5 4
13: 2 NA
14: 2 NA
15: 17 5
16: 17 5
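If data.table is not available, a run-length ID restricted to the subset can also be sketched in base R with cumsum(). The idea is to flag the first element of every run that falls in 3:20 and count those flags. The names keep, starts and rle_id are made up for this sketch.

```r
x <- c(3, 3, 1, 12, 2, 2, 10, 10, 10, 1, 5, 5, 2, 2, 17, 17)

keep <- x %in% 3:20

# TRUE where a value differs from its predecessor (the first element always starts a run)
changed <- c(TRUE, x[-1] != x[-length(x)])

# a counted run starts where the value is kept and a new run begins
starts <- keep & changed

# running count of starts, blanked out for the values outside the subset
rle_id <- ifelse(keep, cumsum(starts), NA)

data.frame(x, rle = rle_id)
```

One design difference to be aware of: this counts runs in the original vector, so in something like 3, 3, 1, 3 the last 3 gets a new ID. Applying rleid() to the filtered values would merge it with the first run instead.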
I have a dataframe with three columns: Time, an observed value (Obs.Value), and an interpolated value (Interp.Value). If Obs.Value is NA, then Interp.Value should also be NA. I can make the whole row NA, but I need to keep the Time value.
Here is the reprex:
dat <- data.frame(matrix(ncol = 3, nrow = 10))
x <- c("Time", "Obs.Value", "Interp.Value")
colnames(dat) <- x
dat$Time <- seq(1,10,1)
dat$Obs.Value <- c(5,6,7,NA,NA,5,4,3,NA,2)
interp <- approx(dat$Time,dat$Obs.Value,dat$Time)
dat$Interp.Value <- round(interp$y,1)
Here is the code that makes the whole row NA
dat[with(dat, is.na(Obs.Value)|is.na("Interp.Value")),] <- NA
Here is what the output should look like:
Time Obs.Value Interp.Value
1 1 5 5
2 2 6 6
3 3 7 7
4 4 NA NA
5 5 NA NA
6 6 5 5
7 7 4 4
8 8 3 3
9 9 NA NA
10 10 2 2
dat$Interp.Value[is.na(dat$Obs.Value)] <- NA
dat
# Time Obs.Value Interp.Value
# 1 1 5 5
# 2 2 6 6
# 3 3 7 7
# 4 4 NA NA
# 5 5 NA NA
# 6 6 5 5
# 7 7 4 4
# 8 8 3 3
# 9 9 NA NA
# 10 10 2 2
Or if either column being NA is sufficient, then
dat[!complete.cases(dat[,-1]),-1] <- NA
If there is only one column to change, @r2evans' answer is pretty straightforward and the way to go. If there is more than one column that you want to change, you can use across in dplyr.
library(dplyr)
dat %>%
mutate(across(-c(Time,Obs.Value), ~replace(., is.na(Obs.Value), NA)))
# Time Obs.Value Interp.Value
#1 1 5 5
#2 2 6 6
#3 3 7 7
#4 4 NA NA
#5 5 NA NA
#6 6 5 5
#7 7 4 4
#8 8 3 3
#9 9 NA NA
#10 10 2 2
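The same multi-column masking can be sketched in base R by selecting the target columns by name and assigning NA to the rows where Obs.Value is missing. Interp.Const is a hypothetical second interpolated column, added only so there is more than one column to mask.

```r
dat <- data.frame(Time = 1:10,
                  Obs.Value = c(5, 6, 7, NA, NA, 5, 4, 3, NA, 2))

# two interpolated columns; Interp.Const is a made-up extra for illustration
dat$Interp.Value <- round(approx(dat$Time, dat$Obs.Value, dat$Time)$y, 1)
dat$Interp.Const <- approx(dat$Time, dat$Obs.Value, dat$Time,
                           method = "constant")$y

# every column except Time and Obs.Value gets masked
cols <- setdiff(names(dat), c("Time", "Obs.Value"))
dat[is.na(dat$Obs.Value), cols] <- NA

dat
```

Because only the columns in cols are assigned, Time is left untouched.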
I have the following data.frame:
df=data.frame(x=c(1:3,8:10,15),y=rnorm(7))
x y
1 0.05976784
2 -1.01992023
3 -1.16075185
8 0.48641141
9 0.54460423
10 -0.59915799
15 -0.60785783
I simply need to add the missing rows, with y set to NA, so that df$x runs through the full sequence from 1 to 17.
Here is my expected output:
x y
1 0.05976784
2 -1.01992023
3 -1.16075185
4 NA
5 NA
6 NA
7 NA
8 0.48641141
9 0.54460423
10 -0.59915799
11 NA
12 NA
13 NA
14 NA
15 -0.60785783
16 NA
17 NA
How can I achieve this?
Any suggestion?
Using base::match:
data.frame(x=1:17, y=df$y[match(1:17, df$x)])
We could use complete from tidyr
tidyr::complete(df, x = 1:17)
# A tibble: 17 x 2
# x y
# <dbl> <dbl>
# 1 1 -0.560
# 2 2 -0.230
# 3 3 1.56
# 4 4 NA
# 5 5 NA
# 6 6 NA
# 7 7 NA
# 8 8 0.0705
# 9 9 0.129
#10 10 1.72
#11 11 NA
#12 12 NA
#13 13 NA
#14 14 NA
#15 15 0.461
#16 16 NA
#17 17 NA
data
set.seed(123)
df=data.frame(x=c(1:3,8:10,15),y=rnorm(7))
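Another base R way to think about it is as a left join of the full sequence onto df. A minimal sketch with merge(), using the seeded data above (the name full is illustrative):

```r
set.seed(123)
df <- data.frame(x = c(1:3, 8:10, 15), y = rnorm(7))

# keep every x from 1 to 17; rows without a partner in df get y = NA
full <- merge(data.frame(x = 1:17), df, by = "x", all.x = TRUE)

full
```

merge() sorts the result by the key by default, so the rows come out in sequence order.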
Some of the values >=10 in the data frame below (31, 89, 12, 69) are made of consecutive digits, like 89 and 12. By that I mean that, in the order 123456789, their digits are adjacent to each other. I would like to set to NA the values whose digits are not adjacent (31 and 69: for 31, the digit 2 is missing in between; for 69, the digits 7 and 8 are missing). How can I code this? Imagine a big dataset! :)
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,67,8,9,0,6,7,9)
B <- c(5,6,31,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,89,3,2,9,NA,12,69,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 31 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 69
10 b 9 4 6 8
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
Another solution defining a vector of the values to keep beforehand (only up to two-digit numbers, but could be extended):
numerals <- 1:9
vector <- 0:9
for (i in numerals) {
j <- numerals[i+1]
if (!is.na(j)) {
number <- as.numeric(paste(c(i, j), collapse = ""))
number_reverse <- as.numeric(paste(c(j, i), collapse = ""))
vector <- c(vector, number, number_reverse)
}
}
vector
[1] 0 1 2 3 4 5 6 7 8 9 12 21 23 32 34 43 45 54 56 65 67 76 78 87 89 98
Function to replace number if not in vector:
replace <- function(x) {
x <- ifelse(!x %in% vector, NA, x)
return(x)
}
Result:
library(dplyr)
df %>% mutate_at(c("A", "B", "C", "D"), replace)
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
Here is a function that tests individual numbers
MyFunction <- function(A){
NumbersToCheck <- lapply(strsplit(as.character(A),""),as.integer)
check <- lapply(2:length(unlist(NumbersToCheck)), function(X) ifelse(NumbersToCheck[[1]][X]-NumbersToCheck[[1]][X-1]==1,TRUE,FALSE))
return(ifelse(FALSE %in% check,NA,A))
}
Which can then be applied to your entire df as follows
df[,2:ncol(df)] <- lapply(2:ncol(df), function(X) unlist(lapply(df[,X],MyFunction)))
to get the following result
> df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA NA 5 89
4 a 67 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 12
9 b 7 7 4 NA
10 b 9 4 6 8
df[] <- lapply(df, function(col) {
# Split each value character by character
NAs <- sapply(strsplit(as.character(col), split = ""), function(chars) {
# Convert them back to integer to compare with `diff`
# and verify the increment is always 1 or -1
diff <- diff(as.integer(chars))
!all(diff == 1) && !all(diff == -1)
})
# If not, replace those values with NA
col[NAs] <- NA
col
})
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
#> ...
#> Warning in diff(as.integer(chars)): NAs introduced by coercion
df
#> id A B C D
#> 1 a 1 5 2 6
#> 2 a 2 6 3 5
#> 3 a NA NA 5 89
#> 4 a 67 9 NA 3
#> 5 a 8 8 NA 2
#> 6 b 9 1 2 9
#> 7 b 0 NA 7 NA
#> 8 b 6 9 6 12
#> 9 b 7 7 4 NA
#> 10 b 9 4 6 8
Created on 2020-03-31 by the reprex package (v0.3.0)
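For a big dataset, a fully vectorized variant may be worth considering. The sketch below assumes the values are non-negative whole numbers below 100 (as in the example) and compares the tens digit with the ones digit arithmetically, with no string splitting. keep_adjacent is a hypothetical helper name.

```r
id <- factor(rep(letters[1:2], each = 5))
A <- c(1, 2, NA, 67, 8, 9, 0, 6, 7, 9)
B <- c(5, 6, 31, 9, 8, 1, NA, 9, 7, 4)
C <- c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6)
D <- c(6, 5, 89, 3, 2, 9, NA, 12, 69, 8)
df <- data.frame(id, A, B, C, D)

# assumption: whole numbers in 0..99; single digits and NA are left alone
keep_adjacent <- function(v) {
  two_digit <- !is.na(v) & v >= 10 & v < 100
  tens <- v %/% 10
  ones <- v %% 10
  # blank two-digit values whose digits are not neighbours (12 and 21 both pass)
  v[two_digit & abs(tens - ones) != 1] <- NA
  v
}

df[-1] <- lapply(df[-1], keep_adjacent)
df
```

Values with three or more digits pass through unchanged here, so this would need extending before being used on wider ranges.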
I have a data frame where each condition (in the example: hope, dream, joy) has 5 variables (in the example, coded with suffixes x, y, z, a, b - they are the same for each condition).
df <- data.frame(matrix(1:16,5,16))
names(df) <- c('ID','hopex','hopey','hopez','hopea','hopeb','dreamx','dreamy','dreamz','dreama','dreamb','joyx','joyy','joyz','joya','joyb')
df[1,2:6] <- NA
df[3:5,c(7,10,14)] <- NA
This is what the data looks like:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16
I want to create a new variable for each condition (hope, dream, joy) that codes whether all of the variables x...b for that condition are NA (0 if all are NA, 1 if any is non-NA). And I want the new variables to be stored in the data frame. Thus, the output should be this:
ID hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope joy dream
1 1 NA NA NA NA NA 15 4 9 14 3 8 13 2 7 12 0 1 1
2 2 7 12 1 6 11 16 5 10 15 4 9 14 3 8 13 1 1 1
3 3 8 13 2 7 12 NA 6 11 NA 5 10 15 NA 9 14 1 1 1
4 4 9 14 3 8 13 NA 7 12 NA 6 11 16 NA 10 15 1 1 1
5 5 10 15 4 9 14 NA 8 13 NA 7 12 1 NA 11 16 1 1 1
The code below does it, but I'm looking for a more elegant solution (e.g., for a case where I have even more conditions). I've tried with various combinations of all(), select(), mutate(), but while they all seem useful, I cannot figure out how to combine them to get what I want. I'm stuck and would be interested in learning to code more efficiently. Thanks in advance!
df$hope <- 0
df[is.na(df$hopex) == FALSE | is.na(df$hopey) == FALSE | is.na(df$hopez) == FALSE | is.na(df$hopea) == FALSE | is.na(df$hopeb) == FALSE, "hope"] <- 1
df$dream <- 0
df[is.na(df$dreamx) == FALSE | is.na(df$dreamy) == FALSE | is.na(df$dreamz) == FALSE | is.na(df$dreama) == FALSE | is.na(df$dreamb) == FALSE, "dream"] <- 1
df$joy<- 0
df[is.na(df$joyx) == FALSE | is.na(df$joyy) == FALSE | is.na(df$joyz) == FALSE | is.na(df$joya) == FALSE | is.na(df$joyb) == FALSE, "joy"] <- 1
Here is an option with tidyverse
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(hope = select(., starts_with('hope')) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer)
# hopex hopey hopez hopea hopeb dreamx dreamy dreamz dreama dreamb joyx joyy joyz joya joyb hope
#1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0
#2 1 1 4 3 2 3 5 4 5 2 5 NA 4 3 1 1
#3 2 NA 4 4 4 3 5 NA 5 5 4 NA 4 5 1 1
#4 4 3 NA 1 1 1 5 2 NA 5 1 2 1 1 1 1
#5 1 NA 4 NA NA 2 1 5 1 2 NA 3 1 2 5 1
Or with rowSums
df %>%
mutate(hope = +(rowSums(!is.na(select(., starts_with('hope'))))!= 0))
For multiple columns, we can create a function
f1 <- function(dat, colSubstr) {
dplyr::select(dat, starts_with(colSubstr)) %>%
is.na %>%
`!` %>%
rowSums %>%
is_greater_than(0) %>%
as.integer
}
df %>%
mutate(hope = f1(., 'hope'),
dream = f1(., 'dream'),
joy = f1(., 'joy'))
Or using base R
cbind(df, sapply(split.default(df, sub(".$", "", names(df))),
function(x) +(rowSums(!is.na(x)) != 0)))
If we want to subset columns
nm1 <- setdiff(names(df), "ID")
cbind(df, sapply(split.default(df[nm1], sub(".$", "", names(df[nm1]))),
function(x) +(rowSums(!is.na(x)) != 0)))
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 5 * 15, replace = TRUE),
ncol = 15, dimnames = list(NULL, paste0(rep(c("hope", "dream", "joy"),
each = 5), c('x', 'y', 'z', 'a', 'b')))))
df[1,] <- NA
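To scale this to more conditions without naming each one in mutate(), a plain loop over the condition names is another option. This is a sketch on the data from the question; it assumes every condition uses the same five suffixes, and the names suffixes, cond and cols are illustrative.

```r
df <- data.frame(matrix(1:16, 5, 16))
names(df) <- c("ID", paste0(rep(c("hope", "dream", "joy"), each = 5),
                            c("x", "y", "z", "a", "b")))
df[1, 2:6] <- NA
df[3:5, c(7, 10, 14)] <- NA

suffixes <- c("x", "y", "z", "a", "b")

for (cond in c("hope", "dream", "joy")) {
  # build the exact column names so the new indicator column is never picked up
  cols <- paste0(cond, suffixes)
  # 1 if at least one of the five variables is non-NA, otherwise 0
  df[[cond]] <- as.integer(rowSums(!is.na(df[cols])) > 0)
}

df[c("ID", "hope", "dream", "joy")]
```

Spelling out the column names, rather than matching on a prefix, keeps the loop safe to rerun after the hope, dream and joy columns already exist.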