Removing rows from each data frame in a list with a condition in R

I have such a list:
df1 <- data.frame(a=c(NA, NA, 1:10), b=c(NA, 1:11))
df2 <- data.frame(a=1:10, b=c(NA,1:9))
mylist <- list(df1, df2)
> mylist
[[1]]
a b
1 NA NA
2 NA 1
3 1 2
4 2 3
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 8 9
11 9 10
12 10 11
[[2]]
a b
1 1 NA
2 2 1
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
I'd like to remove all rows with more than 1 NA in a row in each data frame. How can I do that?
I found out how to delete rows
lapply(mylist, `[`, -1,)
and how to calculate the sum of NAs
NAsums <- function(x) {rowSums(is.na(x))}
lapply(mylist, NAsums)
But I can't figure out how to combine the two steps.

We loop through the list (lapply), use rowSums to get the number of NA elements in each row, convert to a logical vector (<2), and use that to subset the rows.
lapply(mylist, function(x) x[rowSums(is.na(x))<2,])
#[[1]]
# a b
#2 NA 1
#3 1 2
#4 2 3
#5 3 4
#6 4 5
#7 5 6
#8 6 7
#9 7 8
#10 8 9
#11 9 10
#12 10 11
#[[2]]
# a b
#1 1 NA
#2 2 1
#3 3 2
#4 4 3
#5 5 4
#6 6 5
#7 7 6
#8 8 7
#9 9 8
#10 10 9
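The same filter can be wrapped in a small helper so that the NA threshold is explicit. This is only a sketch: the name drop_na_rows and the max_na parameter are made up for illustration, and drop = FALSE is an addition so that a single-column data frame stays a data frame.

```r
# Hypothetical helper (not from the answer): keep rows that have
# at most `max_na` missing values.
drop_na_rows <- function(df, max_na = 1) {
  df[rowSums(is.na(df)) <= max_na, , drop = FALSE]
}

df1 <- data.frame(a = c(NA, NA, 1:10), b = c(NA, 1:11))
df2 <- data.frame(a = 1:10, b = c(NA, 1:9))
mylist <- list(df1, df2)

# Extra arguments to lapply() are passed on to the helper.
cleaned <- lapply(mylist, drop_na_rows, max_na = 1)
sapply(cleaned, nrow)
# 11 10
```

Only the first row of df1 has two NAs, so it is the only row dropped.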

Related

How to collect outputs of vector-valued function into a dataframe?

I have a function f1 that takes a number k as input and returns 3 numbers k, k+1, k+2. I would like to ask how to concatenate these results into a dataframe for k from 1 to 10. In this way, the line k corresponds to the output f1(k).
f1 <- function(k){
return (c(k, k+1, k+2))
}
f1(1)
f1(2)
One option is to Vectorize the function f1 and pass it the values 1 to 10, which returns a matrix with 3 rows and 10 columns, and then convert that to a data frame with as.data.frame
as.data.frame(Vectorize(f1)(1:10))
If each k should be a row instead (a vertical layout), transpose the output first and then apply as.data.frame
as.data.frame(t(Vectorize(f1)(1:10)))
-output
# V1 V2 V3
#1 1 2 3
#2 2 3 4
#3 3 4 5
#4 4 5 6
#5 5 6 7
#6 6 7 8
#7 7 8 9
#8 8 9 10
#9 9 10 11
#10 10 11 12
Or we can use outer
as.data.frame(outer(1:10, 0:2, `+`))
You can also use:
as.data.frame(do.call(rbind,lapply(1:10,f1)))
Output:
as.data.frame(do.call(rbind,lapply(1:10,f1)))
V1 V2 V3
1 1 2 3
2 2 3 4
3 3 4 5
4 4 5 6
5 5 6 7
6 6 7 8
7 7 8 9
8 8 9 10
9 9 10 11
10 10 11 12
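Another base R variant, sketched here as an alternative rather than as part of either answer, is vapply. It states the expected shape of each result up front, so a call that returns something other than three numbers fails loudly instead of silently changing the output.

```r
f1 <- function(k) c(k, k + 1, k + 2)

# vapply() checks that every call returns a numeric vector of length 3.
# The result is a 3 x 10 matrix (one column per k), so transpose it
# to get one row per k before converting to a data frame.
res <- as.data.frame(t(vapply(1:10, f1, numeric(3))))
dim(res)
# 10 3
```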

Merging duplicated columns by keeping whichever row value is greater

I have a list of data frames, and the data frames have some duplicated columns. I want to merge the duplicated columns by keeping, in each row, whichever value is greater (some data frames have many more duplicates).
example data:
temp <- data.frame(seq_len(15), 5, 3)
colnames(temp) <- c("A", "A", "B")
temp$A[5]=NA
temp$A[3]=NA
temp$A[2]=NA
temp[7,2]=NA
A A B
<int> <dbl> <dbl>
1 5 3
NA 5 3
NA 5 3
4 5 3
NA 5 3
6 5 3
7 NA 3
8 5 3
9 5 3
10 5 3
final output
A B
<int> <dbl>
1 3
5 3
5 3
5 3
5 3
6 3
7 3
8 3
9 3
10 3
Thanks, everyone.
A base R approach would be to split the data frame into groups of columns that share a name and select the row-wise maximum within each group using do.call + pmax.
data.frame(sapply(split.default(temp, names(temp)), function(x)
do.call(pmax, c(x, na.rm = TRUE))))
# A B
#1 5 3
#2 5 3
#3 5 3
#4 5 3
#5 5 3
#6 6 3
#7 7 3
#8 8 3
#9 9 3
#10 10 3
#11 11 3
#12 12 3
#13 13 3
#14 14 3
#15 15 3
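To see what pmax contributes on its own, here is a minimal sketch with two made-up vectors standing in for a pair of duplicated columns (the names first and second are illustrative only).

```r
# Two columns that share a name, each with gaps in different rows.
first  <- c(1, NA, NA, 4, 7)
second <- c(5, 5, 5, 5, NA)

# pmax() compares element by element; with na.rm = TRUE an NA is
# ignored as long as the other vector has a value in that position.
merged <- pmax(first, second, na.rm = TRUE)
merged
# 5 5 5 5 7
```

do.call(pmax, c(x, na.rm = TRUE)) in the answer is the same call, with the columns of each group supplied as the arguments.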

Is there any way to replace a missing value based on another column's value to match the column name

I have a dataset:
a day day.1.time day.2.time day.3.time day.4.time day.5.time
1 NA 2 4 5 7 10 4
2 NA 5 4 1 1 6 NA
3 NA 3 7 9 6 7 4
4 NA 3 6 8 8 4 5
5 NA 3 5 2 4 5 6
6 NA 3 87 3 2 1 78
7 NA 1 NA 7 5 9 54
8 NA 5 6 6 3 2 3
9 NA 2 5 10 9 8 3
10 NA 3 9 4 10 3 3
I am trying to use the day column value to match with the day.x.time column to replace the missing value in column a. For instance, in the first row, the first value in the day column is 2, then we should use day.2.time value 5 to replace the first value in column a.
If the day.x.time value is missing, we should use -1 day or +1 day to replace the missing in column a. For instance, in the second row, the day column shows 5, so we should use the value in day.5.time column, but it's also a missing value. In this case, we should use the value in day.4.time column to replace the missing value in column a.
You can use dat = data.frame(a = rep(NA,10), day = c(2,5,3,3,3,3,1,5,2,3), day.1.time = c(4,4,7,6,5,87,NA,6,5,9), day.2.time = sample(10), day.3.time = sample(10), day.4.time = sample(10), day.5.time = c(4,NA,4,5,6,78,54,3,3,3)) to generate the sample data.
I have tried grep(paste0("^day.", dat$day, ".time$"), names(dat)) to match with the column, but my code isn't matching in every row, so any help would be appreciated!
Here is one way to do this.
The first part, matching the day column with the corresponding day.x.time column, is easy. We can do this using matrix subsetting.
cols <- grep('day\\.\\d+\\.time', names(dat))
dat$a <- dat[cols][cbind(1:nrow(dat), dat$day)]
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 NA 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 NA 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
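Matrix subsetting here means indexing with a two-column matrix in which each row is one (row, column) pair. A minimal sketch on a small made-up matrix (m and picks are illustrative names, not part of the answer):

```r
m <- matrix(1:12, nrow = 3)   # 3 rows, 4 columns, filled column by column
picks <- c(2, 4, 1)           # column to take from rows 1, 2 and 3

# Each row of idx is one (row, column) pair, so this returns
# m[1, 2], m[2, 4] and m[3, 1] in a single call.
idx <- cbind(seq_len(nrow(m)), picks)
m[idx]
# 4 11 3
```

In the answer, dat[cols] plays the role of m and dat$day plays the role of picks.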
To fill values where the matched day.x.time column is NA, we can select the closest non-NA value in that row.
inds <- which(is.na(dat$a))
dat$a[inds] <- mapply(function(x, y)
na.omit(unlist(dat[x, cols[order(abs(y- seq_along(cols)))]])[1:4])[1],
inds, dat$day[inds])
dat
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
#1 3 2 4 3 3 3 4
#2 2 5 4 4 10 2 NA
#3 1 3 7 8 1 8 4
#4 4 3 6 6 4 5 5
#5 6 3 5 10 6 7 6
#6 8 3 87 5 8 9 78
#7 1 1 NA 1 7 10 54
#8 3 5 6 7 9 1 3
#9 2 2 5 2 5 6 3
#10 2 3 9 9 2 4 3
Using sapply to loop over the rows and pick column day[i] + 2 in each one (the day.x.time columns start at column 3).
res <- transform(dat, a=sapply(1:nrow(dat), function(i) dat[i, dat$day[i] + 2]))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 5 2 4 5 7 10 4
# 2 NA 5 4 1 1 6 NA
# 3 6 3 7 9 6 7 4
# 4 8 3 6 8 8 4 5
# 5 4 3 5 2 4 5 6
# 6 2 3 87 3 2 1 78
# 7 NA 1 NA 7 5 9 54
# 8 3 5 6 6 3 2 3
# 9 10 2 5 10 9 8 3
# 10 10 3 9 4 10 3 3
Edit
The +/- days fallback would require a decision rule for what to choose if the value for day is NA but neither day - 1 nor day + 1 is NA.
Here is a solution that goes backwards from day and takes the first non-NA value. If it is day one, as is the case in row 7, we get NA.
res <- transform(dat, a=sapply(1:nrow(dat), function(i) {
days <- dat[i, -(1:2)]
day.value <- days[dat$day[i]]
if (is.na(day.value)) {
day.value <- tail(na.omit(unlist(days[1:dat$day[i]])), 1)
if (length(day.value) == 0) day.value <- NA
}
return(day.value)
}))
res
# a day day.1.time day.2.time day.3.time day.4.time day.5.time
# 1 10 2 4 10 1 2 4
# 2 10 5 4 1 3 10 NA
# 3 2 3 7 7 2 7 4
# 4 6 3 6 2 6 6 5
# 5 10 3 5 9 10 5 6
# 6 8 3 87 6 8 4 78
# 7 NA 1 NA 3 7 1 54
# 8 3 5 6 4 4 9 3
# 9 8 2 5 8 5 8 3
# 10 9 3 9 5 9 3 3
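One possible decision rule, sketched under an assumption that is not in either answer: take the nearest non-NA day, and when the day before and the day after are equally close, prefer the earlier one. The helper name nearest_value is hypothetical.

```r
# Hypothetical helper: the value for `day`, or failing that the value of
# the nearest day that is not NA. order() leaves ties in their original
# order, so when day - 1 and day + 1 are equally close the earlier day
# wins. That tie-break is an assumption, not a rule from the question.
nearest_value <- function(values, day) {
  by_distance <- order(abs(seq_along(values) - day))
  candidates <- values[by_distance]
  candidates[!is.na(candidates)][1]
}

nearest_value(c(4, 1, 1, 6, NA), day = 5)   # day 5 is NA, falls back to day 4: 6
nearest_value(c(NA, 7, 5, 9, 54), day = 1)  # day 1 is NA, falls back to day 2: 7
nearest_value(c(4, NA, 9, 6, 4), day = 2)   # days 1 and 3 tie, day 1 wins: 4
```

Applied per row, for example with mapply over the row's day.x.time values and dat$day, this would fill column a in one pass.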

Create a function to impute values from one data frame into another

The NA values in column A should be filled by the A value from the dat data frame and so on for the other variables.
id <- factor(rep(letters[1:2], each=5))
A <- c(1,2,NA,6,8,9,0,6,7,9)
B <- c(5,6,1,9,8,1,NA,9,7,4)
C <- c(2,3,5,NA,NA,2,7,6,4,6)
D <- c(6,5,8,3,2,9,NA,2,6,8)
df <- data.frame(id, A, B,C,D)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a NA 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
dat <- data.frame(col=c("A","B","C","D"), value=c(23,45,26,89))
dat
col value
1 A 23
2 B 45
3 C 26
4 D 89
It should look like:
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
I was thinking of something like this, but I don't know how to connect those data frames in a function...
test <- function(i){
df[,i][is.na(df[,i])] <- dat$value
}
test(2)
If you want to keep the format of your function:
test <- function(i){
df[,i][is.na(df[,i])] <<- dat$value[dat$col==i]
}
test("A")
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 NA 3
5 a 8 8 NA 2
6 b 9 1 2 9
7 b 0 NA 7 NA
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
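The function above changes the global df through <<-. A variant that takes both data frames as arguments and returns the modified copy is sketched below; the name impute_col and its parameter names are hypothetical.

```r
id <- factor(rep(letters[1:2], each = 5))
df <- data.frame(id,
                 A = c(1, 2, NA, 6, 8, 9, 0, 6, 7, 9),
                 B = c(5, 6, 1, 9, 8, 1, NA, 9, 7, 4),
                 C = c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6),
                 D = c(6, 5, 8, 3, 2, 9, NA, 2, 6, 8))
dat <- data.frame(col = c("A", "B", "C", "D"), value = c(23, 45, 26, 89))

# Hypothetical variant of test(): no global assignment, the caller
# decides where the result is stored.
impute_col <- function(data, lookup, name) {
  fill <- lookup$value[lookup$col == name]       # replacement for this column
  data[[name]][is.na(data[[name]])] <- fill      # overwrite only the NAs
  data
}

df <- impute_col(df, dat, "A")
df$A
# 1 2 23 6 8 9 0 6 7 9
```

Only column A is filled here; the NAs in B, C and D stay until the function is called for those columns as well.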
One approach is to iterate over the columns and values and use coalesce():
library(dplyr)
library(purrr)
df[-1] <- map2_df(df[-1], dat$value, coalesce)
df
id A B C D
1 a 1 5 2 6
2 a 2 6 3 5
3 a 23 1 5 8
4 a 6 9 26 3
5 a 8 8 26 2
6 b 9 1 2 9
7 b 0 45 7 89
8 b 6 9 6 2
9 b 7 7 4 6
10 b 9 4 6 8
Or same using replace():
map2_df(df[-1], dat$value, ~ replace(.x, is.na(.x), .y))
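Both map2_df calls pair columns and values by position, so they rely on dat being in the same order as the columns of df. A base R sketch that selects the columns by the names in dat$col instead, assuming every name in dat$col exists in df:

```r
id <- factor(rep(letters[1:2], each = 5))
df <- data.frame(id,
                 A = c(1, 2, NA, 6, 8, 9, 0, 6, 7, 9),
                 B = c(5, 6, 1, 9, 8, 1, NA, 9, 7, 4),
                 C = c(2, 3, 5, NA, NA, 2, 7, 6, 4, 6),
                 D = c(6, 5, 8, 3, 2, 9, NA, 2, 6, 8))
dat <- data.frame(col = c("A", "B", "C", "D"), value = c(23, 45, 26, 89))

# as.character() guards against dat$col being a factor (the default
# for data.frame() before R 4.0).
cols <- as.character(dat$col)

# Map() walks over the selected columns and their fill values in parallel.
df[cols] <- Map(function(column, fill) replace(column, is.na(column), fill),
                df[cols], dat$value)
df
```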

R, Using reshape to pull pre post data

I have a simple data frame as follows
x = data.frame(id = seq(1,10),val = seq(1,10))
x
id val
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
I want to add 4 more columns. The first 2 are the previous two rows and the next two are the next two rows. For the first two rows and last two rows it needs to write out as NA.
How do I accomplish this using cast in the reshape package?
The final output would look like
1 1 NA NA 2 3
2 2 NA 1 3 4
3 3 1 2 4 5
4 4 2 3 5 6
... and so on...
Thanks much in advance
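After you gave the example, I changed the solution:
# The question's data frame is called x.
mat <- cbind(x,
             c(c(NA,NA),head(x$id,-2)),     # value two rows back
             c(c(NA),head(x$val,-1)),       # value one row back
             c(tail(x$id,-1),c(NA)),        # value one row ahead
             c(tail(x$val,-2),c(NA,NA)))    # value two rows ahead
colnames(mat) <- c('id','val','idp','valp','idn','valn')
mat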
id val idp valp idn valn
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
Here is a solution with sapply. First, choose the row offsets for the new columns:
lags <- c(-2, -1, 1, 2)
Create the new columns:
newcols <- sapply(lags,
function(l) {
tmp <- seq.int(nrow(x)) + l;
x[replace(tmp, tmp < 1 | tmp > nrow(x), NA), "val"]})
Bind together:
cbind(x, newcols)
The result:
id val 1 2 3 4
1 1 1 NA NA 2 3
2 2 2 NA 1 3 4
3 3 3 1 2 4 5
4 4 4 2 3 5 6
5 5 5 3 4 6 7
6 6 6 4 5 7 8
7 7 7 5 6 8 9
8 8 8 6 7 9 10
9 9 9 7 8 10 NA
10 10 10 8 9 NA NA
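The index trick used inside the sapply call can also be pulled out into a small helper, which makes it easy to give the new columns readable names. This is a sketch only: shift and the column names prev2, prev1, next1 and next2 are made up for illustration.

```r
x <- data.frame(id = 1:10, val = 1:10)

# Hypothetical helper: shift a vector by k positions, padding with NA.
# Negative k looks back (previous rows), positive k looks ahead.
shift <- function(v, k) {
  idx <- seq_along(v) + k
  idx[idx < 1 | idx > length(v)] <- NA   # positions outside the vector
  v[idx]                                 # indexing with NA yields NA
}

out <- cbind(x,
             prev2 = shift(x$val, -2), prev1 = shift(x$val, -1),
             next1 = shift(x$val, 1),  next2 = shift(x$val, 2))
head(out, 3)
```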
