Removing NAs from a large dataframe - r

I have a very large data frame with 10,703,009 rows. I want to remove the rows containing NAs, but I get this error: 'Calloc could not allocate memory of 10703009 bytes'.
My input data frame is 'a', which has many rows with NAs:
IDs  Codes
1    C493
1    NA
2    E348
3    NA
I need an output that keeps only the rows without NAs:
IDs  Codes
1    C493
2    E348
I tried both of the following, but each gives a memory error:
drop_na(a,Codes)
subset(a,Codes)
Please suggest the solution to this in R.

A data frame of 10,703,009 rows is no problem for R. See below. I generated a tibble with exactly that number of rows, in which the variable Codes is NA with probability probNA = 0.3.
library(tidyverse)
n=10703009
probNA = 0.3
df = tibble(IDs = 1:n,
Codes = paste0(sample(LETTERS[1:10], n, replace = TRUE),
sample(100:999, n, replace = TRUE))) %>%
mutate(Codes = ifelse(sample(c(T,F), n, replace = TRUE,
prob = c(probNA, 1-probNA)), NA, Codes))
df
output
# A tibble: 10,703,009 x 2
IDs Codes
<int> <chr>
1 1 I586
2 2 A188
3 3 H674
4 4 D641
5 5 A793
6 6 B455
7 7 B837
8 8 A805
9 9 NA
10 10 E380
# ... with 10,702,999 more rows
The size of such a tibble, as reported by object.size(df), is 128,941,096 bytes (roughly 129 MB).
Now let's get rid of the rows with NA values.
df %>% filter(!is.na(Codes))
output
# A tibble: 7,490,809 x 2
IDs Codes
<int> <chr>
1 1 I586
2 2 A188
3 3 H674
4 4 D641
5 5 A793
6 6 B455
7 7 B837
8 8 A805
9 10 E380
10 11 C231
# ... with 7,490,799 more rows
Now let's replace all NA values with an empty string.
df %>% mutate(Codes = ifelse(is.na(Codes), "", Codes))
output
# A tibble: 10,703,009 x 2
IDs Codes
<int> <chr>
1 1 "I586"
2 2 "A188"
3 3 "H674"
4 4 "D641"
5 5 "A793"
6 6 "B455"
7 7 "B837"
8 8 "A805"
9 9 ""
10 10 "E380"
# ... with 10,702,999 more rows
As you can see, everything works smoothly and without any problems.
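For reference, here is a minimal base R sketch of the same filter, assuming the data frame is called a as in the question (the four rows below are only a small stand-in for the real data). It also shows that subset() expects a logical condition as its second argument rather than a bare column name, which may be part of why the original attempt failed.

```r
# Small stand-in for the asker's data frame `a`
a <- data.frame(IDs = c(1, 1, 2, 3),
                Codes = c("C493", NA, "E348", NA))

# Keep only rows where Codes is not NA, using a logical index
a_clean <- a[!is.na(a$Codes), ]

# The same with subset(): the second argument must be a logical condition
a_clean2 <- subset(a, !is.na(Codes))

print(a_clean)
```

Neither call needs an extra package, so this can be a useful check when memory is tight.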

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to continue this - can someone please show me how to continue?
Thanks!
A base R option using cumsum to label the in-between consecutive rows
subset(
my_data,
ave(
exam_no,
student_id,
FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
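To see why the cumsum trick yields a consecutive block, here is a small worked sketch for a single group of five rows. The positions 2 and 5 are a hypothetical draw standing in for sample.int(5, 2).

```r
picked <- c(2, 5)                    # hypothetical result of sample.int(5, 2)
marks <- seq_len(5) %in% picked      # FALSE TRUE FALSE FALSE TRUE
running <- cumsum(marks)             # 0 1 1 1 2
kept <- which(running == 1)          # rows from the first mark up to just before the second
print(kept)
```

Note that the row at the larger sampled position is itself excluded, so with this approach the last row of a group is never part of the selection.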
A more compact version by data.table with a similar idea as above is
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089
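The original idea from the question (draw a start row, then an end row at or after it) can also be made to work if the draw happens once per student instead of on the whole counts column. Below is a minimal base R sketch of that; pick_run is a hypothetical helper name, and sample.int() is used rather than sample(start:n, 1) because sample() applied to a single number k samples from 1:k.

```r
set.seed(42)
my_data <- data.frame(student_id = rep(1:3, each = 5),
                      exam_no = rep(1:5, times = 3),
                      result = rnorm(15, 60, 10))

# Hypothetical helper: pick a random consecutive run of rows from one student
pick_run <- function(rows) {
  n <- nrow(rows)
  start <- sample.int(n, 1)                          # random first row
  end <- start + sample.int(n - start + 1, 1) - 1    # random last row, start <= end <= n
  rows[start:end, ]
}

# Apply the helper to each student separately and stack the pieces
selected <- do.call(rbind, lapply(split(my_data, my_data$student_id), pick_run))
print(selected)
```

Because start and end are drawn inside the helper, groups of different sizes are handled without any change.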

R Fill backwards with flexible window based on number of rows in a separate column

I am trying to carry a value in one column backwards by a number of rows given in a second column and fill everything in between.
So column y mainly has 1s in it but might have individual numbers up to about 20 (in my real data, up to 3 in my example below). If the number in y is 20, I need the 19 rows before that row and that row itself to equal the value of x for the row where y is 20. If the value in y is 1 the output will just equal x.
y also has many NAs, these NAs are either legitimate NAs where I want an NA output or are placeholders where the filling should occur if a y value afterwards is > 1.
I thought I could use dplyr::lead, but it does not accept a variable n to look forward a different number of steps per row, and it would not fill in between. I also wondered about making a new, always increasing column and using RcppRoll::roll_max, but I ran into similar problems with the flexible window size.
Typically the y values leading up to a y > 1 will be 0 or NA, but if there are conflicts I still want to adopt the later value. For example, in row 8 of my data frame y is 1, followed by y = 2 in row 9, so I want the value associated with row 9 in both rows. If y is NA and the row is not covered by filling backwards, I want it to remain NA (or 0 would be fine).
Thanks for any thoughts
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
#> x y out
#> 1 9 NA NA
#> 2 4 NA NA
#> 3 7 NA NA
#> 4 1 1 1
#> 5 2 NA 11
#> 6 13 NA 11
#> 7 11 3 11
#> 8 3 1 8
#> 9 8 2 8
#> 10 12 1 12
#> 11 5 1 5
#> 12 6 NA NA
#> 13 15 NA NA
#> 14 10 NA 14
#> 15 14 2 14
#try adopting #sirius answer before I specified about the extra NAs
test$y <- ifelse(is.na(test$y),0,test$y)
test$out <- with( test, rep( x, y ) )
#> Error in `$<-.data.frame`(`*tmp*`, out, value = c(1L, 11L, 11L, 11L, 3L, : replacement has 11 rows, data has 15
Created on 2021-04-08 by the reprex package (v0.3.0)
Things got a bit complex, but essentially: calculate all the repeated x values for each y > 0, and then let later x values overwrite earlier ones.
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe
set.seed(1)
test <- data.frame(x = sample(1:15,replace = F), y = c(NA,NA,NA,1,NA,NA,3,1,2,1,1,NA,NA,NA,2))
desired_out <- test
desired_out$out <- c(NA,NA,NA,1,11,11,11,8,8,12,5,NA,NA,14,14)
desired_out
test %<>% mutate( id = seq(n()) ) %>%
filter( !is.na(y) & y != 0 ) %>%
group_by(id) %>%
slice( rep(1,y) ) %>%
mutate( id = rev( max(id)+1-1:n() ) ) %>%
group_by(id) %>%
summarize( out = as.numeric(last(x)) ) %>%
right_join( test %>% mutate( id=seq(n()) ) ) %>%
arrange( id ) %>% select( -id ) %>% relocate( x, y, out )
identical( as.data.frame(test), desired_out ) ## TRUE
test
Output:
> test
# A tibble: 15 x 3
x y out
<int> <dbl> <dbl>
1 9 NA NA
2 4 NA NA
3 7 NA NA
4 1 1 1
5 2 NA 11
6 13 NA 11
7 11 3 11
8 3 1 8
9 8 2 8
10 12 1 12
11 5 1 5
12 6 NA NA
13 15 NA NA
14 10 NA 14
15 14 2 14
What the algorithm does, which after a few piped lines is no longer very clear, is the following:
temporarily add id as original row number
take away 0 and NA rows for y
repeat each row y times
within each such repeated row, create a new id that counts backwards (these will be the new row numbers for the x-values to go)
group by id again this time to let later values overwrite earlier values (so keep only the highest row number for any collision)
join these data back on the original data, using the newly calculated row numbers, repeated x's will now be inserted
sort and clean up
Sequencing and indexing to the rescue:
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
# x y out
#1 9 NA NA
#2 4 NA NA
#3 7 NA NA
#4 1 1 1
#5 2 NA 11
#6 13 NA 11
#7 11 3 11
#8 3 1 8
#9 8 2 8
#10 12 1 12
#11 5 1 5
#12 6 NA NA
#13 15 NA NA
#14 10 NA 14
#15 14 2 14
I'm generating a row number, getting the row numbers prior to the key rows, and then overwriting those rows with repeats of the selected rows. Sometimes they specify the same location, but the later value will be taken as you can see in the output.
Should be pretty efficient as everything is vectorised and there's only one major assignment operation back to the original dataset for updating all the rows at once. Here's 4.5M rows processed in a fraction of a second:
test <- test[rep(1:15, 3e5),]
system.time({
test$rn <- seq_len(nrow(test))
src <- with(test[!is.na(test$y),],
list(val = rep(x,y), idx = rep(rn,y) - sequence(y) + 1) )
test$out[src$idx] <- src$val
test$rn <- NULL
})
# user system elapsed
# 0.28 0.00 0.28
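The same indexing idea can be wrapped in a function. This is only a sketch under a few assumptions of mine: fill_back is a hypothetical name, the output is pre-filled with NA so that trailing rows without a y value do not shorten it, rows with y of 0 are skipped, and target positions that would fall before the first row are dropped.

```r
# Hypothetical helper: carry x backwards over y rows, later values win
fill_back <- function(x, y) {
  out <- x
  out[] <- NA                                   # same length and type as x
  keep <- which(!is.na(y) & y > 0)              # rows that trigger a fill
  val <- rep(x[keep], y[keep])                  # value to write, repeated y times
  idx <- rep(keep, y[keep]) - sequence(y[keep]) + 1   # target rows, counting backwards
  ok <- idx >= 1                                # ignore targets before row 1
  out[idx[ok]] <- val[ok]                       # for repeated targets the last value wins
  out
}

x <- c(9, 4, 7, 1, 2, 13, 11, 3, 8, 12, 5, 6, 15, 10, 14)
y <- c(NA, NA, NA, 1, NA, NA, 3, 1, 2, 1, 1, NA, NA, NA, 2)
out <- fill_back(x, y)
print(out)
```

With the example data from the question this reproduces the desired out column.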

R - dataframe - every x rows new number in other column

My question is:
I have a matrix of 200,000 rows and 3 different columns (productID, week, order).
I want to put the productID (starting with 1) in the product column and create 26 rows for each ID. Then I want to put 1-26 in the week column for every ID.
I know it's not that hard, but I keep making mistakes.
Thank you so much for your help!
Are you looking for something like this?
library(dplyr)  # provides tibble() and %>%
tibble(productID = 1:4, week = 5:8, order = "Test") %>%
tidyr::complete(week = 1:26, productID = 1:4, fill = list(order = NA_character_))
# A tibble: 104 x 3
week productID order
<int> <int> <chr>
1 1 1 NA
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 2 1 NA
6 2 2 NA
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 NA
# ... with 94 more rows
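If the goal is simply a fresh table with 26 rows per product, a base R sketch with rep() may be enough. The value of n_products below is a hypothetical placeholder, and the order column is left as NA because the question does not say how it should be filled.

```r
n_products <- 4    # hypothetical; replace with the real number of products
n_weeks <- 26

orders <- data.frame(
  productID = rep(seq_len(n_products), each = n_weeks),   # 1,1,...,1,2,2,...
  week = rep(seq_len(n_weeks), times = n_products),       # 1..26 repeated per product
  order = NA
)

head(orders, 3)
```

Here each = n_weeks repeats every ID 26 times in a row, while times = n_products restarts the week counter for every product.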

if condition is true find max in 3 consecutive rows and report it in a new column - r

Reproducible example:
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
The output I am after is a new column "test" that holds the maximum of the column "Value" for each value of the column "Label" when there are 3 consecutive rows with the same label, and otherwise just reports the value of the column "Value".
I do not mind the missing values at the beginning and at the end; they can stay.
Expected result of the column test: NA, NA, 3,3,3,1,2,3,3,3,NC,3,3,3,NA,NA
In Excel this was very easy, and I coded it successfully as follows:
=IF(AND(BN6=BN5,BN6=BN4),X4,Y6)
but in R I cannot.
I tried several methods, the closest to a result is the following:
test <-c(NA,NA)
test_tot <-NULL
for(i in 3:length(dat1$Label)){
test_tot<-c(test_tot, test)
if( dat1$Label[i]==dat1$Label[i+1]&& dat1$Label[i]==dat1$Label[i+2] ){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i+1],dat1$Value[i+2])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i+1]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i+1])))
}
if(dat1$Label[i]==dat1$Label[i-1]&& dat1$Label[i]==dat1$Label[i-2]){
test<-max(as.numeric(c(dat1$Value[i],dat1$Value[i-1],dat1$Value[i-2])))
}
else {test<-dat1$Value[i]}
}
test_tot<-c(test_tot,NA,NA)
dat1$test<-test_tot
EDIT:
The difficulty apparently is that the column "Value" has character based values. Any solution able to deal with it is greatly appreciated.
Edit: The OP has pointed out that column Value may contain character-based values which are important to identify a specific behaviour happened at a specific time.
Consequently, the whole vector or column is of type character in R (or factor). The code below has been amended to handle this by extracting numeric values to a separate column, computing the maximum values per group, coercing the result back to character and to copy the character-based values into the result.
The data.table solution below
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,"NC",1,3,2,1,NA)
Expected <- c(NA, NA, 3,3,3,1,2,3,3,3,"NC",3,3,3,NA,NA)
dat1<-data.frame(Label, Value, Expected)
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table
setDT(dat1)[
# create temporary column with only numeric values
, Value_num := as.numeric(as.character(Value))][
# create temp cols for group id and group size
, `:=`(grp = .GRP, N = .N), by = rleid(Label)][
# for sufficiently large groups compute max values and coerce to char
N >= 3, new := as.character(max(Value_num)), by = grp][
# copy missing values
is.na(new), new := as.character(Value)][
# clean up
, c("grp", "N", "Value_num") := NULL][]
returns the expected result
Label Value Expected new
1: 0 NA NA NA
2: 0 NA NA NA
3: 1 1 3 3
4: 1 2 3 3
5: 1 3 3 3
6: 2 1 1 1
7: 2 2 2 2
8: 3 3 3 3
9: 3 2 3 3
10: 3 1 3 3
11: 4 NC NC NC
12: 5 1 3 3
13: 5 3 3 3
14: 5 2 3 3
15: 6 1 NA 1
16: 6 NA NA NA
except for row 15, where I believe the expected result should be 1 if we follow the OP's own words: "otherwise just report the values of the column 'Value'".
The warning message:
In eval(jsub, SDenv, parent.frame()) : NAs introduced by coercion
can be ignored as it's intended to convert non-numbers to NA, here.
Here is a dplyr solution. NOTE: NC was changed to NA.
Label<-c(0,0,1,1,1,2,2,3,3,3,4,5,5,5,6,6)
Value<-c(NA,NA,1,2,3,1,2,3,2,1,NA,1,3,2,1,NA)
dat1<-as.data.frame(cbind(Label, Value))
library(dplyr)
dat1 %>%
filter(!is.na(Value)) %>%
group_by(Label) %>%
summarize(n = n(), max_Value = max(Value)) %>%
mutate(test = if_else(n>=3, max_Value, as.numeric(NA))) %>%
right_join(dat1, by = "Label") %>%
mutate(test = if_else(is.na(test), Value, test)) %>%
select(Label, Value, test)
# # A tibble: 16 × 3
# Label Value test
# <dbl> <dbl> <dbl>
# 1 0 NA NA
# 2 0 NA NA
# 3 1 1 3
# 4 1 2 3
# 5 1 3 3
# 6 2 1 1
# 7 2 2 2
# 8 3 3 3
# 9 3 2 3
# 10 3 1 3
# 11 4 NA NA
# 12 5 1 3
# 13 5 3 3
# 14 5 2 3
# 15 6 1 1
# 16 6 NA NA
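A base R variant that keeps the "NC" entry is also possible. The sketch below is my own illustration rather than part of either answer above: it uses rle() to find runs of equal labels, computes the maximum per run on a numeric copy of Value, and falls back to the original character value for runs shorter than 3.

```r
Label <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 6)
Value <- c(NA, NA, 1, 2, 3, 1, 2, 3, 2, 1, "NC", 1, 3, 2, 1, NA)   # character vector

runs <- rle(Label)
run_id <- rep(seq_along(runs$lengths), runs$lengths)    # id of the run each row belongs to
run_len <- rep(runs$lengths, runs$lengths)              # length of that run

num <- suppressWarnings(as.numeric(Value))              # "NC" becomes NA on purpose
run_max <- ave(num, run_id, FUN = max)                  # maximum within each run

# Runs of 3 or more get the run maximum, everything else keeps its original value
test <- ifelse(run_len >= 3, as.character(run_max), Value)
print(test)
```

Like the data.table answer, this reports 1 rather than NA for row 15, since label 6 only occurs twice.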

how to replace the NA in a data frame with the average number of this data frame

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
And I can know the average "nums" of each "id" by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is to replace the NA with the value of the average number of the corresponding id. for example, the average "nums" of 1,2,3 are 1000, 2000, 3000, respectively. The NA when id == 3 will be replaced by 3000, the last NA whose id == 1 will be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always pass NA to the NAs I want to replace.
I don't know where I was wrong, or do you have better method to do this?
Thank you
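Or you can fix it by indexing id_avg with temp directly (this works here because the ids are 1, 2, 3 and so coincide with the row positions in id_avg):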
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
library(dplyr)
dat %>%
group_by(id) %>%
mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = T))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
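Another base R option is ave(), which returns the group mean aligned with every row, so no lookup by id is needed. A minimal sketch, rebuilding dat from the values shown in the question:

```r
dat <- data.frame(nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
                  id = c(1, 2, 3, 1, 3, 3, 2, 3, 1))

# Mean of nums within each id, repeated for every row of that id
group_mean <- ave(dat$nums, dat$id, FUN = function(v) mean(v, na.rm = TRUE))

# Replace only the missing entries with their group mean
dat$nums <- ifelse(is.na(dat$nums), group_mean, dat$nums)
print(dat)
```

Unlike indexing id_avg by position, this does not depend on the ids being 1, 2, 3.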

Resources