If there's a "6" in df$a, I'd like 1:9 from the previous September to next May, to be in a new column, shown here as df$b, with NA as the rest.
library(tidyverse)
library(lubridate)
date <- c("2/29/1940","3/31/1940","4/30/1940","5/31/1940","6/30/1940","7/31/1940","8/31/1940","9/30/1940","10/31/1940","11/30/1940","12/31/1940","1/31/1941","2/28/1941",
"3/31/1941","4/30/1941","5/31/1941","6/30/1941","7/31/1941","8/31/1941","9/30/1941","10/31/1941","11/30/1941", "12/31/1941","1/31/1942","2/28/1942","3/31/1942",
"4/30/1942","5/31/1942", "6/30/1942","7/31/1942","8/31/1942","9/30/1942","10/31/1942","11/30/1942","12/31/1942","1/31/1943","2/28/1943","3/31/1943","4/30/1943",
"5/31/1943","6/30/1943","7/31/1943", "8/31/1943","9/30/1943")
a <- c("NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA",6,"NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA",
"NA","NA","NA","NA","NA","NA","NA","NA","NA","NA","NA")
df <- data.frame(date, a)
df <- df %>% mutate(date = mdy(date))
df:
date a b
2/29/1940 NA NA
3/31/1940 NA NA
4/30/1940 NA NA
5/31/1940 NA NA
6/30/1940 NA NA
7/31/1940 NA NA
8/31/1940 NA NA
9/30/1940 NA 1
10/31/1940 NA 2
11/30/1940 NA 3
12/31/1940 NA 4
1/31/1941 NA 5
2/28/1941 6 6
3/31/1941 NA 7
4/30/1941 NA 8
5/31/1941 NA 9
6/30/1941 NA NA
7/31/1941 NA NA
8/31/1941 NA NA
9/30/1941 NA NA
10/31/1941 NA NA
11/30/1941 NA NA
12/31/1941 NA NA
1/31/1942 NA NA
2/28/1942 NA NA
3/31/1942 NA NA
4/30/1942 NA NA
5/31/1942 NA NA
6/30/1942 NA NA
7/31/1942 NA NA
8/31/1942 NA NA
9/30/1942 NA NA
10/31/1942 NA NA
11/30/1942 NA NA
12/31/1942 NA NA
1/31/1943 NA NA
2/28/1943 NA NA
3/31/1943 NA NA
4/30/1943 NA NA
5/31/1943 NA NA
6/30/1943 NA NA
7/31/1943 NA NA
8/31/1943 NA NA
9/30/1943 NA NA
For more context, I have a hundred years or so of monthly data in a data frame, and I'm looking for an efficient way to produce the third column given the first two columns, to process/visualize other data not shown. Only sometimes is there a 6 for February in df$a. When there is, I'd like the previous September through the next May to be populated as shown in a new column (I'm looking to produce df$b). I tried some clumsy ways, mostly a bunch of lines with variations of mutate(), lag(), and lead(), but I have a good feeling there are more direct routes.
thank you,
dave
A solution using case_when, lead, and lag from dplyr. It is not the most concise solution, but it still works when the 6 is close to the edge of the data.
library(tidyverse)
df2 <- df %>%
  mutate(b = case_when(
    lead(a, n = 5L) == 6 ~ 1,
    lead(a, n = 4L) == 6 ~ 2,
    lead(a, n = 3L) == 6 ~ 3,
    lead(a, n = 2L) == 6 ~ 4,
    lead(a, n = 1L) == 6 ~ 5,
    a == 6               ~ 6,
    lag(a, n = 1L) == 6  ~ 7,
    lag(a, n = 2L) == 6  ~ 8,
    lag(a, n = 3L) == 6  ~ 9,
    TRUE                 ~ NA_real_
  ))
DATA
Notice that I changed the way you specified NA in column a: real NA values instead of the string "NA".
library(lubridate)
date <- c("2/29/1940","3/31/1940","4/30/1940","5/31/1940","6/30/1940","7/31/1940","8/31/1940","9/30/1940","10/31/1940","11/30/1940","12/31/1940","1/31/1941","2/28/1941",
"3/31/1941","4/30/1941","5/31/1941","6/30/1941","7/31/1941","8/31/1941","9/30/1941","10/31/1941","11/30/1941", "12/31/1941","1/31/1942","2/28/1942","3/31/1942",
"4/30/1942","5/31/1942", "6/30/1942","7/31/1942","8/31/1942","9/30/1942","10/31/1942","11/30/1942","12/31/1942","1/31/1943","2/28/1943","3/31/1943","4/30/1943",
"5/31/1943","6/30/1943","7/31/1943", "8/31/1943","9/30/1943")
a <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA , 6, NA , NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA , NA , NA , NA , NA , NA , NA,
NA, NA, NA , NA , NA , NA , NA , NA , NA , NA , NA)
df <- data.frame(date, a)
df <- df %>% mutate(date = mdy(date))
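For a hundred years of data, the same window can also be filled by position rather than with nine lead/lag comparisons. The following is only a sketch in base R: fill_window and its arguments are hypothetical names, not part of dplyr, and it assumes one row per month with no gaps (the same assumption the case_when version makes).

```r
# Hypothetical helper: number the rows around each hit from 1 to
# before + after + 1, leaving every other row NA.
fill_window <- function(a, target = 6, before = 5L, after = 3L) {
  b <- rep(NA_real_, length(a))
  labels <- seq_len(before + after + 1L)
  for (hit in which(a == target)) {        # which() ignores NA comparisons
    idx <- (hit - before):(hit + after)
    keep <- idx >= 1L & idx <= length(a)   # clip windows that run off either end
    b[idx[keep]] <- labels[keep]
  }
  b
}

a <- c(rep(NA, 12), 6, rep(NA, 31))  # same layout as the example data
b <- fill_window(a)
```

With the data above this would be used as df$b <- fill_window(df$a). If two windows overlap, the later one overwrites the earlier one.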
Related
I would like to know how I can remove from a dataset the records that have more than 5 null values in the columns that define them. The following code allows you to delete records with any NA in any column. However, how can I modify it to do exactly what I ask? Any ideas?
df[complete.cases(df), ]
Here is an example data frame. One of the rows has 6 NA values.
We sum the NA values by row in a new column, filter where the number of NA is less than or equal to 5, then remove the new column.
df <- data.frame(a = c(1,NA,1,1),
b = c(1, NA, NA, 1),
c = c(1, NA, NA, NA),
d = c(1, NA, NA ,NA),
e = c(1, NA, NA, NA),
f = c(1, NA, NA, NA))
a b c d e f
1 1 1 1 1 1 1
2 NA NA NA NA NA NA
3 1 NA NA NA NA NA
4 1 1 NA NA NA NA
df %>%
mutate(count = rowSums(is.na(df))) %>%
filter(count <= 5) %>%
select(-count)
a b c d e f
1 1 1 1 1 1 1
2 1 NA NA NA NA NA
3 1 1 NA NA NA NA
I'm assuming you are referring to values of NA in your data indicating a missing value. NULL is returned by expressions and functions whose value is undefined. First create some reproducible data:
set.seed(42)
vals <- sample.int(1000, 250)
idx <- sample.int(250, 100)
vals[idx] <- NA
example <- as.data.frame(matrix(vals, 25))
Now compute the number of missing values by row and exclude the rows with more than 5 missing values:
na.count <- rowSums(is.na(example))
example[na.count<=5, ]
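If the threshold needs to change, the same rowSums(is.na(...)) idea can be wrapped in a small function. This is a sketch only; drop_sparse_rows and max_na are made-up names used for illustration.

```r
# Hypothetical helper: keep rows with at most `max_na` missing values.
drop_sparse_rows <- function(data, max_na = 5) {
  data[rowSums(is.na(data)) <= max_na, , drop = FALSE]
}

df <- data.frame(a = c(1, NA, 1, 1),
                 b = c(1, NA, NA, 1),
                 c = c(1, NA, NA, NA),
                 d = c(1, NA, NA, NA),
                 e = c(1, NA, NA, NA),
                 f = c(1, NA, NA, NA))

kept <- drop_sparse_rows(df)  # drops only the row with 6 NA values
```

Unlike dplyr's filter(), base subsetting keeps the original row names, so the surviving rows are still labelled 1, 3 and 4.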
I have to find all columns with all NA-values. If there are not all NA-values in column, I have to replace NAs with 0.
My solution is:
NA_check <- colSums(is.na(frame)) == nrow(frame) #True or False - all NA or not
frame[is.na(frame) & which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])] <- 0
These conditions work separately, but they don't work together or I get some errors combining them. How can I solve my problem?
P.S. This modification also doesn't work if NA_check is not all FALSE:
frame[is.na(frame[which(names(frame) %in% names(NA_check)[which(NA_check == FALSE, arr.ind=T)])])] <- 0
You can find the columns that have at least one non-NA value (i.e. not all values are NA) and replace the NAs in that subset with 0.
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
We can check this with an example :
frame <- data.frame(a = c(NA, NA, 3, 4), b = NA, c = c(NA, 1:3), d = NA)
frame
# a b c d
#1 NA NA NA NA
#2 NA NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
not_all_NA <- colSums(!is.na(frame)) > 0
frame[not_all_NA][is.na(frame[not_all_NA])] <- 0
frame
# a b c d
#1 0 NA 0 NA
#2 0 NA 1 NA
#3 3 NA 2 NA
#4 4 NA 3 NA
We can also do this with dplyr :
library(dplyr)
frame %>% mutate(across(where(~ any(!is.na(.))), ~ tidyr::replace_na(., 0)))
My data consists of columns and rows. Each column has "NA" and different numbers.
For example column1 is:
2
1
1
NA
1
NA
NA
NA
I want to assign a column id to the numbers in each column.
for(j in 1:54){
if(!(col[j] <-"NA")){
col[j] <- i
}
}
Expected result for column1:
1
1
NA
NA
NA
1
NA
NA
1
Expected result for column2:
2
2
NA
NA
NA
2
NA
NA
2
You can use
v <- c(2, 1, NA, NA, 4, 5, NA)
id <- ifelse(!is.na(v), 1, NA)
id
[1]  1  1 NA NA  1  1 NA
This means you don't need the for loop here. If you can apply a function to a vector you should avoid using the for loop.
Also, please provide your data so that others can actually use it (like in my code above).
EDIT
According to the comments, you have multiple columns. You can use the same code, applied column by column:
df <- data.frame(a= c(2, 1, NA, NA, 4, 5, NA), b= c(3, NA, NA, NA, 5, NA, 6))
id <- sapply(1:ncol(df), function(i){
ifelse(!is.na(df[ , i]), i, NA)})
id
     [,1] [,2]
[1,] 1 2
[2,] 1 NA
[3,] NA NA
[4,] NA NA
[5,] 1 2
[6,] 1 NA
[7,] NA 2
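For many columns, the loop over column numbers can probably be avoided altogether with base R's col(), which returns a matrix holding each cell's column index. A minimal sketch, assuming all columns can be treated as one matrix:

```r
df <- data.frame(a = c(2, 1, NA, NA, 4, 5, NA),
                 b = c(3, NA, NA, NA, 5, NA, 6))

m  <- as.matrix(df)
id <- col(m)               # every cell holds its own column number
id[is.na(m)] <- NA         # blank out the cells that were missing
colnames(id) <- names(df)  # keep the original column names
```

The result is an integer matrix with the same shape as df; as.data.frame(id) turns it back into a data frame if needed.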
I have a data frame that consists of 3 columns, with each column representing the group which respondents belong to. Respondents belong to one of those groups and are tasked to provide their numerical responses in the group column that they belonged to. Hence, for a given row, 2 other columns will be blank.
I need to create a column that has their score, regardless of which group they belonged to. On Stack Overflow, there is a similar question to mine, but it is for Python (see here).
The following is how the data would look like and what I have done:
library(dplyr)
df <- data.frame(grp_A = c(13, NA, NA, NA, NA, 20, NA),
grp_B = c(NA, 59, 66, NA, NA, NA, NA),
grp_C = c(NA, NA, NA, 23, 42, NA, NA))
df$value <- apply(select(df, grp_A, grp_B, grp_C), 1,
function(x) x[!is.na(x)])
As there are missing data in some rows, R incorrectly converts that new column into a list. I have tried to reconvert it back into a data frame using as.data.frame, but it did not work.
Please kindly advise how to prevent the newly created column from turning into a list.
There is no need to use apply here. Since each row has at most one non-NA value, we can pick that value with max.col without worrying about ties.
df$value <- df[cbind(1:nrow(df), max.col(!is.na(df)))]
df
# grp_A grp_B grp_C value
#1 13 NA NA 13
#2 NA 59 NA 59
#3 NA 66 NA 66
#4 NA NA 23 23
#5 NA NA 42 42
#6 20 NA NA 20
#7 NA NA NA NA
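max.col gives the column index of the maximum value in each row; since we pass it !is.na(df), that is the index of the TRUE (non-NA) entry. For a row that is all NA, every entry is FALSE and the tie is broken at random (the default ties.method), which does not matter here because the value picked is NA either way.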
max.col(!is.na(df))
#[1] 1 2 2 3 3 1 2
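If a reproducible index is preferred for the all-NA row, max.col also accepts ties.method = "first". A small sketch of the same lookup on a fresh copy of the data (without the value column), where idx and m are just illustrative names:

```r
df <- data.frame(grp_A = c(13, NA, NA, NA, NA, 20, NA),
                 grp_B = c(NA, 59, 66, NA, NA, NA, NA),
                 grp_C = c(NA, NA, NA, 23, 42, NA, NA))

m   <- as.matrix(df)
idx <- max.col(!is.na(m), ties.method = "first")  # all-NA row always picks column 1
value <- m[cbind(seq_len(nrow(m)), idx)]          # one (row, column) pair per row
```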
The reason your apply didn't work is that your last row is all NA, so x[!is.na(x)] returns a zero-length vector for it. Because the results then have different lengths, apply returns a list instead of a vector. If you remove that row and run your function, it works:
apply(df[-7, ], 1, function(x) x[!is.na(x)])
# 1 2 3 4 5 6
#13 59 66 23 42 20
We could also take the max of each row after removing NAs, but this returns -Inf (with a warning) for rows that are all NA:
apply(df, 1, max, na.rm = TRUE)
#[1] 13 59 66 23 42 20 -Inf
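If you would rather keep apply, one possible fix is to make the function always return exactly one value, so apply can simplify the result to a vector. This is a sketch; first_non_na is a hypothetical helper name.

```r
df <- data.frame(grp_A = c(13, NA, NA, NA, NA, 20, NA),
                 grp_B = c(NA, 59, 66, NA, NA, NA, NA),
                 grp_C = c(NA, NA, NA, 23, 42, NA, NA))

# Hypothetical helper: first non-missing value, or NA if the row has none.
first_non_na <- function(x) {
  v <- x[!is.na(x)]
  if (length(v) > 0) v[[1]] else NA_real_
}

# Every call returns length 1, so the new column is a plain numeric vector.
df$value <- apply(df[c("grp_A", "grp_B", "grp_C")], 1, first_non_na)
```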
Base R rowMeans
df$new <- rowMeans(df, na.rm = TRUE)
df
grp_A grp_B grp_C new
1 13 NA NA 13
2 NA 59 NA 59
3 NA 66 NA 66
4 NA NA 23 23
5 NA NA 42 42
6 20 NA NA 20
7 NA NA NA NaN
How about using Reduce with dplyr::coalesce?
library(dplyr)
df <- data.frame(grp_A = c(13, NA, NA, NA, NA, 20, NA),
grp_B = c(NA, 59, 66, NA, NA, NA, NA),
grp_C = c(NA, NA, NA, 23, 42, NA, NA))
mutate(df, value = Reduce(coalesce, df))
Result:
grp_A grp_B grp_C value
1 13 NA NA 13
2 NA 59 NA 59
3 NA 66 NA 66
4 NA NA 23 23
5 NA NA 42 42
6 20 NA NA 20
7 NA NA NA NA
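To see what Reduce is doing here without dplyr, the same fold can be written with a two-argument base R function. This is only a sketch; coalesce2 is a made-up stand-in for dplyr::coalesce that handles exactly two vectors.

```r
df <- data.frame(grp_A = c(13, NA, NA, NA, NA, 20, NA),
                 grp_B = c(NA, 59, 66, NA, NA, NA, NA),
                 grp_C = c(NA, NA, NA, 23, 42, NA, NA))

# Hypothetical stand-in for coalesce(): take x where present, otherwise y.
coalesce2 <- function(x, y) ifelse(is.na(x), y, x)

# A data frame is a list of columns, so Reduce folds them left to right:
# coalesce2(coalesce2(grp_A, grp_B), grp_C)
value <- Reduce(coalesce2, df)
```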
Another option is to use rowSums:
df$value <- rowSums(df, na.rm = TRUE)
df$value[df$value == 0] <- NA  # all-NA rows sum to 0; note this also blanks a genuine score of 0
Also, performance-wise, the base Reduce solution seems to be the most efficient:
microbenchmark::microbenchmark(
Reduce = Reduce(coalesce, df),
purrr = purrr::reduce(df, coalesce),
rowMeans = rowMeans(df,na.rm=T),
rowSums = rowSums(df, na.rm = T),
cbind = df[cbind(1:nrow(df), max.col(!is.na(df)))],
times = 1000
)
Unit: microseconds
expr min lq mean median uq max neval cld
Reduce 83.507 107.2095 145.4134 121.4320 137.8410 12190.845 1000 a
purrr 205.667 269.1175 357.5908 304.8540 342.4135 24316.051 1000 b
rowMeans 129.089 159.3555 196.1438 174.4890 194.9095 5481.523 1000 a
rowSums 129.454 157.1680 197.2731 173.5775 196.0035 7685.874 1000 a
cbind 267.294 331.8385 408.3179 368.4860 410.2400 4533.050 1000 b
I have 15 datasets. The 1st column is "subject" and is identical in all sets. The number of the rest of the columns is not the same in all datasets. I need to combine all of this data in a single dataframe. I found the command "Reduce" but I am just starting with R and I couldn't understand if this is what I need and if so, what is the syntax? Thanks!
I suggest including a reproducible example in the future so that others can see the format of data you're working with and what you're trying to do.
Here is some randomly generated example data, each with the "Subject" column:
list_of_dfs <- list(
df1 = data.frame(Subject = 1:4, a = rnorm(4), b = rnorm(4)),
df2 = data.frame(Subject = 5:8, c = rnorm(4), d = rnorm(4), e = rnorm(4)),
df3 = data.frame(Subject = 7:10, f = rnorm(4)),
df4 = data.frame(Subject = 2:5, g = rnorm(4), h = rnorm(4))
)
Reduce with merge is a good choice:
combined_df <- Reduce(
function(x, y) { merge(x, y, by = "Subject", all = TRUE) },
list_of_dfs
)
And the output:
> combined_df
Subject a b c d e f g h
1 1 1.1106594 1.2530046 NA NA NA NA NA NA
2 2 -1.0275630 0.6437101 NA NA NA NA -1.9393347 -0.4361952
3 3 0.1558639 1.2792212 NA NA NA NA -0.8861966 1.0137530
4 4 0.4283585 -0.1045530 NA NA NA NA 1.8924896 -0.3788198
5 5 NA NA 0.08261190 0.77058804 -1.165042 NA 0.7950784 -1.3467386
6 6 NA NA 2.51214598 0.62024328 1.496520 NA NA NA
7 7 NA NA 0.01581309 -0.04777196 -1.327884 1.5111734 NA NA
8 8 NA NA 0.80448136 -0.33347573 -2.290428 -0.3863564 NA NA
9 9 NA NA NA NA NA -1.2371795 NA NA
10 10 NA NA NA NA NA 1.6819063 NA NA
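Since the question was about the syntax of Reduce: it takes a two-argument function and a list, and applies the function cumulatively from left to right. A small deterministic sketch of what that means for merge (d1, d2, d3 and full_merge are illustrative names, not from your data):

```r
d1 <- data.frame(Subject = 1:3, a = c(10, 20, 30))
d2 <- data.frame(Subject = 2:4, b = c("x", "y", "z"))
d3 <- data.frame(Subject = c(1L, 4L), c = c(TRUE, FALSE))

# Full outer join on Subject: keep subjects that appear in either input.
full_merge <- function(x, y) merge(x, y, by = "Subject", all = TRUE)

by_hand     <- full_merge(full_merge(d1, d2), d3)  # what Reduce does internally
with_reduce <- Reduce(full_merge, list(d1, d2, d3))
```

Both objects should be the same 4-row data frame; with 15 datasets, only the list passed to Reduce gets longer.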