sum() in dplyr and aggregate: NA values

I have a dataset with about 3,000 rows. The data can be accessed via https://pastebin.com/i4dYCUQX
Problem: I get NA in the output, even though there appear to be no NA values in the data. Here is what happens when I try to sum the total volume within each category of the size column via dplyr or aggregate:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
Out:
# A tibble: 4 x 2
size volume
<fctr> <int>
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
# aggregate
aggregate(volume ~ size, data=example, FUN=sum)
Out:
size volume
1 Extra Large NA
2 Large NA
3 Medium 937581572
4 Small NA
When trying to access the value via colSums, it seems to work:
# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)
Out:
volume
3869267348
Does anyone have an idea what the issue could be?

The first thing to note is that, running your example, I get:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <int>
#> 1 Extra Large NA
#> 2 Large NA
#> 3 Medium 937581572
#> 4 Small NA
which clearly states that your sums are overflowing the integer type. (This also explains why the colSums() call works: colSums() converts to double internally and returns a double, so it cannot overflow.) If we do as the warning message suggests, we can convert the integers to numerics and then sum:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#> size volume
#> <fctr> <dbl>
#> 1 Extra Large 3609485056
#> 2 Large 11435467097
#> 3 Medium 937581572
#> 4 Small 3869267348
Here funs(sum) has been replaced by funs(sum(as.numeric(.))), which is the same, executing sum on each group, but converting to numeric (double) first so the result can no longer overflow.

It's because volume is an integer column and not numeric. Converting the column to numeric (double) before aggregating avoids the overflow:
example$volume <- as.numeric(example$volume)
aggregate(volume ~ size, data = example, FUN = sum)
size volume
1 Extra Large 3609485056
2 Large 11435467097
3 Medium 937581572
4 Small 3869267348
For more, check here:
What is integer overflow in R and how can it happen?
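To see the limit directly, here is a minimal sketch of integer overflow in base R; .Machine$integer.max is the largest value a 32-bit R integer can hold:
# R integers are 32-bit, so the maximum is 2^31 - 1
.Machine$integer.max
#> [1] 2147483647
# any integer sum past that limit returns NA with the overflow warning
sum(c(.Machine$integer.max, 1L))
#> Warning: integer overflow - use sum(as.numeric(.))
#> [1] NA
# converting to double first gives the correct result
sum(as.numeric(c(.Machine$integer.max, 1L)))
#> [1] 2147483648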

Related

How can I do Stratified sampling with proportionate size

I have a dataset named "Tree_all_exclusive" of 7607 rows and 39 columns, which contains different information about trees such as age, height, name etc. I am able to create a sample of size 1200 with the code below, which picks trees randomly:
sam1 <- sample_n(Tree_all_exclusive, size = 1200)
But I would like to generate a proportionate stratified sample of 1200 trees, which picks the number of trees of each type according to that type's proportion of the total. To do this I am using the code below:
sam3 <- Tree_all_exclusive %>%
  group_by(TaxonNameFull) %>%
  summarise(total_numbers = n()) %>%
  arrange(-total_numbers) %>%
  mutate(pro = total_numbers / 7607) %>% # 7607 = total number of trees
  mutate(sz = pro * 1200) %>%            # 1200 = sample size
  mutate(siz = as.integer(sz) + 1)       # some sizes are as small as 0.01, so make them at least 1
sam3
s <- stratified(sam3, group = "TaxonNameFull", sam3$siz)
But it is giving me the below error:
Error in s_n(indt, group, size) : 'size' should be entered as a named vector.
Would you please point me in any direction to solve this issue?
Also, if there is any other way to do stratified sampling with proportionate sizes, please guide me.
Thanks a lot.
How about using sample_frac():
library(dplyr)
data(mtcars)
mtcars %>%
  group_by(cyl) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 11
#> 2 6 7
#> 3 8 14
mtcars %>%
  group_by(cyl) %>%
  sample_frac(.5) %>%
  tally()
#> # A tibble: 3 × 2
#> cyl n
#> <dbl> <int>
#> 1 4 6
#> 2 6 4
#> 3 8 7
Created on 2023-01-24 by the reprex package (v2.0.1)
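Note that sample_frac() is superseded in current dplyr; the same proportionate draw can be written with slice_sample() (a sketch on the same mtcars example):
mtcars %>%
  group_by(cyl) %>%
  slice_sample(prop = .5) %>% # sample 50% of the rows within each cyl group
  tally()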

Expanding mean over time per subgroup in dataframe

Still quite new to R, so I am trying to figure out what I am doing wrong in the following.
I am trying to calculate the expanding mean over time per subgroup for a dataframe. My code works when there is only a single subgroup in the dataframe, but starts to break when multiple subgroups are present.
Apologies if I have overlooked something, but I can't figure out where exactly my code is incorrect. My hunch is that I am not filling in the width correctly, but I have not been able to figure out how to change width to a dynamically expanding window over time per subgroup.
See my data below:
sample file
See my code below:
library(ggplot2)
library(zoo)
library(RcppRoll)
library(dplyr)
x <- read.csv("stackoverflow.csv")
x$datatime <- as.POSIXlt(x$datatime, format = "%m/%d/%Y %H:%M", tz = Sys.timezone())
x$Event <- as.factor(x$Event)
x2 <- arrange(x, x$Event, x$datatime) %>%
  group_by(x$Event) %>%
  mutate(ma = rollapply(data = x$Actual, width = seq_along(x$Actual), FUN = mean,
                        partial = TRUE, fill = NA,
                        align = "right"))
Any help is very much appreciated!
Thanks
EDIT:
A fix has been found! Thanks to all the useful feedback.
The working code is:
x <- arrange(x, x$Event, x$datatime) %>%
  group_by(Event) %>%
  mutate(ma = rollapply(data = Actual,
                        width = seq_along(Actual),
                        FUN = mean,
                        partial = TRUE,
                        fill = NA,
                        align = "right"))
I think the problem here is that you’re using x$ to extract columns from
the original data in mutate(), rather than using the column name directly
to refer to the column in the grouped slice.
In dplyr verbs you can (and in case of grouped operations, must) refer to the columns directly.
The solution is to just remove
all x$ references from your code in dplyr functions.
Here’s a small example that illustrates what’s going on:
library(dplyr, warn.conflicts = FALSE)
tbl <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5)
tbl
#> # A tibble: 5 x 2
#> g x
#> <dbl> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 2 5
tbl %>%
  group_by(g) %>%
  mutate(y = cumsum(tbl$x))
#> Error in `mutate_cols()`:
#> ! Problem with `mutate()` column `y`.
#> i `y = cumsum(tbl$x)`.
#> i `y` must be size 2 or 1, not 5.
#> i The error occurred in group 1: g = 1.
And how to fix it:
tbl %>%
  group_by(g) %>%
  mutate(y = cumsum(x))
#> # A tibble: 5 x 3
#> # Groups: g [2]
#> g x y
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 1 2 3
#> 3 2 3 3
#> 4 2 4 7
#> 5 2 5 12
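As an aside, because the window here always expands from the start of each group, dplyr's cummean() computes the same running mean without rollapply() (a sketch, assuming Actual contains no NA values, since cummean() does not skip them):
x %>%
  arrange(Event, datatime) %>%
  group_by(Event) %>%
  mutate(ma = cummean(Actual)) # expanding mean within each Event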

Find cohorts in dataset in r dataframe

I'm trying to find the biggest cohort in a dataset of about 1000 candidates and 100 test questions. Every candidate is asked 15 questions out of a pool of 100 test questions. Candidates in the same cohort were given the same set of randomly sampled questions. I'm trying to find the largest group of candidates who all took the same test.
I'm working in R. The data.frame has about 1000 rows and 100 columns. Each column indicates which test question we're working with. For each row (candidate), all column entries are NA apart from the ones for the questions that candidate was shown and filled in. The entries in these question instances are either 0 or 1 (see picture).
Is there an elegant way to solve this? The only thing I could think of was using dplyr and filtering per 15-question subset, then checking how many rows remain. However, with 100 columns this means it has to check (I think) 100-choose-15 different possibilities. Many thanks!
data.frame structure
We can infer the cohort based on the NA pattern:
library(tidyverse)
answers <- tribble(
  ~candidate, ~q1, ~q2, ~q3,
  1, 0, NA, NA,
  2, 1, NA, NA,
  3, 0, 0, 0
)
answers
#> # A tibble: 3 x 4
#> candidate q1 q2 q3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 NA NA
#> 2 2 1 NA NA
#> 3 3 0 0 0
# infer cohort by NA pattern
cohorts <- answers %>%
  group_by(candidate) %>%
  mutate_at(vars(-group_cols()), ~ ifelse(is.na(.x), NA, TRUE)) %>%
  unite(-candidate, col = "cohort")
cohorts
#> # A tibble: 3 x 2
#> # Groups: candidate [3]
#> candidate cohort
#> <dbl> <chr>
#> 1 1 TRUE_NA_NA
#> 2 2 TRUE_NA_NA
#> 3 3 TRUE_TRUE_TRUE
answers %>%
  pivot_longer(-candidate) %>%
  left_join(cohorts) %>%
  # count filled answers per candidate and cohort
  group_by(cohort, candidate) %>%
  filter(!is.na(value)) %>%
  count() %>%
  # get the largest cohort
  arrange(-n) %>%
  pull(cohort) %>%
  first()
#> Joining, by = "candidate"
#> [1] "TRUE_TRUE_TRUE"
Created on 2021-09-21 by the reprex package (v2.0.1)
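As a follow-up, if "biggest cohort" is measured by the number of candidates in it rather than by answered questions, here is a sketch counting members per cohort directly (reusing the cohorts table built above):
cohorts %>%
  ungroup() %>%              # drop the per-candidate grouping
  count(cohort, sort = TRUE) # candidates per cohort, largest first
#> # A tibble: 2 x 2
#> cohort n
#> <chr> <int>
#> 1 TRUE_NA_NA 2
#> 2 TRUE_TRUE_TRUE 1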

Unable to import whole data from an excel using R

I have an Excel file (.xlsx) with 15000 records which I loaded into R; there is a column 'X' which only has data after the first 10000 rows.
Data <- read_excel("Business_Data.xlsx", sheet = 3, skip = 2)
When I checked the dataframe after importing the file, I could see only NA in that 'X' column, even though column X has factors like "Cost +, Resale-, Purchase" which are not getting captured. Is it because the data for this column only starts after 10000 records? Or am I missing something?
read_excel tries to infer the type of the data using the first 1000 rows by default.
If it can't get the right type and can't coerce the data to this type, you'll get NA.
You probably had a warning: "There were 50 or more warnings (use warnings() to see the first 50)"
And checking the warnings tells you something like:
> warnings()
Warning messages:
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting logical in B15002 / R15002C2: got 'A'
...
Solution: add the argument guess_max = 20000
library(tidyverse)
library(writexl)
library(readxl)
# create a dataframe with a character column that is empty at the top
df1 <- tibble(x = 1:20000,
              y = c(rep(NA_character_, 15000), rep("A", 5000)))
# bottom rows are OK
tail(df1)
#> # A tibble: 6 x 2
#> x y
#> <int> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
write_xlsx(df1, "d:/temp/test.xlsx")
# read it back; the bottom values are missing!
df2 <- read_xlsx("d:/temp/test.xlsx")
tail(df2)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <lgl>
#> 1 19995 NA
#> 2 19996 NA
#> 3 19997 NA
#> 4 19998 NA
#> 5 19999 NA
#> 6 20000 NA
# everything is fine with guess_max = 20000
df3 <- read_xlsx("d:/temp/test.xlsx", guess_max = 20000)
tail(df3)
#> # A tibble: 6 x 2
#> x y
#> <dbl> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
So, check your warnings!
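Applied to the original call from the question, the fix is simply:
Data <- read_excel("Business_Data.xlsx", sheet = 3, skip = 2,
                   guess_max = 20000)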
To be sure, you can also coerce the type explicitly:
df4 <- read_xlsx("d:/temp/test.xlsx",
                 col_types = c("numeric", "text"))
In any case, note that the xlsx format does not preserve integer types, so you may need to convert your numbers back to integers to recover the exact original dataframe:
df4 %>%
  mutate(x = as.integer(x)) %>%
  identical(df1)
#> [1] TRUE

Select column that has the fewest NA values

I am working with a data frame that produces two output columns. One column always has more NA values than the other, but not in any predictable fashion. Here is my question: how can I use dplyr to select the column with the fewest NA values? I was thinking of using which.min to decide, but I am not sure how to put it all together. Note that both columns contain NA values, and I want to select the one with fewer of them.
You can do this with dplyr and purrr.
Inside which.min() you first calculate the number of NAs in each column with map_dbl() (this works for as many columns as your data.frame has). The keep() part retains only those columns which actually contain NAs. which.min() then returns a named index, whose name we take and supply to dplyr's select() function.
I have laid the code out a bit so you can easily see which parts belong where.
library(purrr)
library(dplyr)

df %>%
  select(names(which.min(
    df %>%
      map_dbl(~ sum(is.na(.x))) %>% # count NAs in each column
      keep(~ .x > 0)                # keep only columns that contain NAs
  )))
library(dplyr)
df <- tibble(a = c(rep(c(NA, 1:5), 4)),  # df with different NA counts per column
             b = c(rep(c(NA, NA, 2:5), 4)))
df %>%
  summarise_all(funs(sum(is.na(.)))) # NA counts
#> # A tibble: 1 x 2
#> a b
#> <int> <int>
#> 1 4 8
df %>% # answer
  select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#> a
#> <int>
#> 1 NA
#> 2 1
#> 3 2
#> 4 3
#> 5 4
#> 6 5
#> 7 NA
#> 8 1
#> 9 2
#> 10 3
#> # ... with 14 more rows
Created on 2018-05-25 by the reprex package (v0.2.0).
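For completeness, here is a shorter sketch of the same idea that avoids the now-deprecated funs(): count the NAs per column with base colSums(), then select by name:
na_counts <- colSums(is.na(df)) # named vector of NA counts per column
df %>% select(all_of(names(which.min(na_counts))))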
