Unable to import whole data from an Excel file using R

I have an Excel file (.xlsx) with 15000 records which I loaded into R. There is a column 'X' that only has data after the first 10000 rows.
Data <- read_excel("Business_Data.xlsx", sheet = 3, skip = 2)
When I checked the data frame after importing the file, I saw only NA in that 'X' column, even though column X actually contains values like "Cost +", "Resale-", "Purchase", which are not getting captured. Is it because the data for this column only starts after 10000 records? Or am I missing something?

read_excel tries to infer the column types using the first 1000 rows by default.
If it guesses the wrong type and can't coerce the data to that type, you'll get NA.
You probably got a warning: "There were 50 or more warnings (use warnings() to see the first 50)".
And checking the warnings tells you something like:
> warnings()
Warning messages:
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting logical in B15002 / R15002C2: got 'A'
...
Solution: add the argument guess_max = 20000:
library(tidyverse)
library(writexl)
library(readxl)
# create a data frame whose character column is empty (NA) for the first 15000 rows
df1 <- tibble(x = 1:20000,
              y = c(rep(NA_character_, 15000), rep("A", 5000)))
# bottom rows are OK
tail(df1)
#> # A tibble: 6 x 2
#>       x y
#>   <int> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
write_xlsx(df1, "d:/temp/test.xlsx")
# read it back; the bottom rows are missing!
df2 <- read_xlsx("d:/temp/test.xlsx")
tail(df2)
#> # A tibble: 6 x 2
#>       x y
#>   <dbl> <lgl>
#> 1 19995 NA
#> 2 19996 NA
#> 3 19997 NA
#> 4 19998 NA
#> 5 19999 NA
#> 6 20000 NA
# everything is fine with guess_max = 20000
df3 <- read_xlsx("d:/temp/test.xlsx", guess_max = 20000)
tail(df3)
#> # A tibble: 6 x 2
#>       x y
#>   <dbl> <chr>
#> 1 19995 A
#> 2 19996 A
#> 3 19997 A
#> 4 19998 A
#> 5 19999 A
#> 6 20000 A
So, check your warnings!
To be safe, you can also coerce the types explicitly:
df4 <- read_xlsx("d:/temp/test.xlsx",
                 col_types = c("numeric", "text"))
In any case, note that the xlsx format does not preserve integer types, so you may need to convert your numbers back to integers to recover the exact original data frame:
df4 %>%
  mutate(x = as.integer(x)) %>%
  identical(df1)
#> [1] TRUE
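If you don't know the number of rows in advance, another option is simply to make guess_max generously large; 1048576 is the maximum number of rows an xlsx sheet can contain. A sketch (note that very large guess_max values slow reading down):
# scan every possible xlsx row when guessing column types;
# slower on big files, but removes the guessing problem entirely
df5 <- read_xlsx("d:/temp/test.xlsx", guess_max = 1048576)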

Related

How do I create multiple flag columns based on multiple columns with NA Values using ifelse?

I tried using this first:
for (i in names(data)) {
  data[paste0('FLAG_NA_', i)] <- ifelse(is.na(data$i), 1, 0)
}
But this code only creates new columns filled entirely with NA values.
I found a similar solution to what I want here: How to apply ifelse function across multiple columns and create new columns in R.
The answer is:
data %>%
  mutate(across(starts_with('C'), ~ ifelse(.x == "Off", 1, 0),
                .names = 'scr_{sub("C", "", .col)}'))
But when I try to use the is.na() condition in the code, it doesn't work:
data %>%
  mutate(across(names(data), ~ ifelse(.x %>% is.na, 1, 0),
                .names = paste0('FLAG_NA_', names(data))))
Error message:
Error: Problem with `mutate()` input `..1`.
ℹ `..1 = across(...)`.
✖ All unnamed arguments must be length 1
The .names argument of across should not be a vector. It should be a single character string that serves as a "glue specification" for the new names, using "{.col} to stand for the selected column name, and {.fn} to stand for the name of the function being applied". So in this case you could use 'FLAG_NA_{.col}', producing the output below.
## Example data
set.seed(2022)
library(magrittr)
data <- letters[1:3] %>%
  setNames(., .) %>%
  purrr::map_dfc(~ sample(c(1, NA, 3), 5, TRUE))
data
#> # A tibble: 5 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     3     3     3
#> 2    NA    NA     1
#> 3     3     3    NA
#> 4     3     1     3
#> 5    NA    NA    NA
## Create new variables
library(dplyr, warn.conflicts = FALSE)
data %>%
  mutate(across(everything(), ~ as.numeric(is.na(.x)),
                .names = 'FLAG_NA_{.col}'))
#> # A tibble: 5 × 6
#>       a     b     c FLAG_NA_a FLAG_NA_b FLAG_NA_c
#>   <dbl> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
#> 1     3     3     3         0         0         0
#> 2    NA    NA     1         1         1         0
#> 3     3     3    NA         0         0         1
#> 4     3     1     3         0         0         0
#> 5    NA    NA    NA         1         1         1
Created on 2022-02-17 by the reprex package (v2.0.1)
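As an aside, the original for loop fails because data$i looks for a column literally named "i"; with [[ indexing the loop itself works as well (a base R sketch equivalent to the across() call above):
# fixed loop: data[[i]] extracts the column whose name is stored in i,
# whereas data$i looks for a column literally named "i"
for (i in names(data)) {
  data[[paste0('FLAG_NA_', i)]] <- ifelse(is.na(data[[i]]), 1, 0)
}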

Finding duplicate observations of selected variables in a tibble

I have a rather large tibble (called df.tbl, with ~26k rows and 22 columns) and I want to find the "twins" of each object, i.e. each row that has the same values in columns 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i, ], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything is working as expected, spitting out a 2 x 22 tibble, and I can expand this using:
x <- NULL
for (i in 1:nrow(df.tbl)) {
  x[[i]] <- as_vector(inner_join(df.tbl,
                                 df.tbl[i, ],
                                 by = c("date",
                                        "forge",
                                        "serNum",
                                        "PinMain",
                                        "PinMainNumber",
                                        "Pos")) %>%
                        select(rowNum.x))
}
to create a list containing the row numbers for each twin for each object (row).
However I try, I cannot use map to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
                                  .,
                                  by = c("date",
                                         "forge",
                                         "serNum",
                                         "PinMain",
                                         "PinMainNumber",
                                         "Pos")) %>%
               select(rowNum.x))
All I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about converting the for loop into an equivalent using map?
My original data look like this:
> head(df.tbl, 3)
# A tibble: 3 x 22
  rowNum date       forge serNum PinMain PinMainNumber Pos   FrontBack flow  Sharped SV    OP      max   min  mean
   <dbl> <date>     <chr> <fct>  <fct>   <fct>         <fct> <fct>     <chr> <fct>   <fct> <chr> <dbl> <dbl> <dbl>
1      1 2017-10-18 NA    179    Pin     1             W     F         NA    3       36237 235    77.7  55.3  64.7
2      2 2017-10-18 NA    179    Pin     2             W     F         NA    3       36237 235    77.5  52.1  67.4
3      3 2017-10-18 NA    179    Pin     3             W     F         NA    3       36237 235    79.5  58.6  69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>,
#   dataSize <int>, fileName <chr>
and I would like a list of the same length as nrow(df.tbl), looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin/duplicate (as above), but a few have two or even three duplicates (as defined above, i.e. columns 2:7 are the same).
A bit late to the party, but you can do it much more neatly with nest().
tbl.df1 <- tbl.df %>%
  group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>%
  nest(rowNum)
The twins will be in the list of tibbles created by nest.
tbl.df1$data
# [[1]]
# A tibble: 2 x 1
#   rowNum
#    <dbl>
# 1      1
# 2      7
#
# [[2]]
# A tibble: 2 x 1
#   rowNum
#    <dbl>
# 1      2
# 2      8
# etc
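If you do still want a map() version: the tbl_vars error above comes from map() iterating over the columns of df.tbl rather than its rows. A direct translation of the for loop maps over the row indices instead; a sketch, assuming the columns from the question:
library(purrr)
library(dplyr)

# iterate over row indices (map() over a data frame iterates over columns)
twins <- map(seq_len(nrow(df.tbl)),
             ~ inner_join(df.tbl, df.tbl[.x, ],
                          by = c("date", "forge", "serNum",
                                 "PinMain", "PinMainNumber", "Pos")) %>%
               pull(rowNum.x))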
Do you really need to solve it with map?
I would solve it by combining duplicated and semi_join from the dplyr package, like this:
defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")
dplyr::semi_join(
  df.tbl,
  df.tbl[duplicated(df.tbl[defining_columns]), ],
  by = defining_columns
) %>%
  group_by_at(defining_columns) %>%
  arrange(.by_group = TRUE) %>%
  summarise(twins = paste0(rowNum, collapse = ",")) %>%
  pull(twins) %>%
  strsplit(",")
duplicated tells us which rows are duplicated, and semi_join keeps only the rows of x that have a match in y.
Hope this helps!

Select column that has the fewest NA values

I am working with a data frame that produces two output columns. One column always has more NA values than the other, but not in any predictable fashion. Here is my question: how can I use dplyr to select the column with the fewest NA values? I was thinking of using which.min to decide, but I'm not sure how to put it all together. Note that both columns contain NA values, and I want to select the one with fewer of them.
You can do this with dplyr and purrr.
Inside which.min you first calculate the number of NAs in each column with map (this works for however many columns your data frame has). The keep part retains only those columns that actually contain NAs. which.min then returns a named vector, of which we take the name and supply it to dplyr's select.
I have indented the code a bit so you can easily see which parts belong where.
library(purrr)
library(dplyr)
df %>% select(names(which.min(df %>%
                                map(function(x) sum(is.na(x))) %>%
                                keep(~ .x > 0)
                              )
                    )
              )
library(dplyr)
df <- tibble(a = c(rep(c(NA, 1:5), 4)),      # df with different NA counts per column
             b = c(rep(c(NA, NA, 2:5), 4)))
df %>%
  summarise_all(funs(sum(is.na(.)))) # NA counts
#> # A tibble: 1 x 2
#>       a     b
#>   <int> <int>
#> 1     4     8
df %>% # answer
  select_if(funs(which.min(sum(is.na(.)))))
#> # A tibble: 24 x 1
#>        a
#>    <int>
#>  1    NA
#>  2     1
#>  3     2
#>  4     3
#>  5     4
#>  6     5
#>  7    NA
#>  8     1
#>  9     2
#> 10     3
#> # ... with 14 more rows
Created on 2018-05-25 by the reprex package (v0.2.0).
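For what it's worth, the same selection can be written without purrr by counting NAs with colSums; a small sketch, using the df defined above:
library(dplyr)

na_counts <- colSums(is.na(df))                    # named vector: NA count per column
df %>% select(all_of(names(which.min(na_counts)))) # all_of() needs dplyr >= 1.0
# keeps only column "a", the one with the fewest NAs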

Sum() in dplyr and aggregate: NA values

I have a dataset with about 3,000 rows. The data can be accessed via https://pastebin.com/i4dYCUQX
Problem: NA results appear in the output, though there are no NAs in the data. Here is what happens when I try to sum the total value in each category of a column via dplyr or aggregate:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
Out:
# A tibble: 4 x 2
         size    volume
       <fctr>     <int>
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA
# aggregate
aggregate(volume ~ size, data=example, FUN=sum)
Out:
         size    volume
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA
When trying to access the value via colSums, it seems to work:
# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)
Out:
    volume
3869267348
Can anyone imagine what the issue could be?
The first thing to note is that, running your example, I get:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#>          size    volume
#>        <fctr>     <int>
#> 1 Extra Large        NA
#> 2       Large        NA
#> 3      Medium 937581572
#> 4       Small        NA
which clearly states that your sums are overflowing the integer type. If we do as the warning message suggests, we can convert the integers to numeric and then sum:
example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#>          size      volume
#>        <fctr>       <dbl>
#> 1 Extra Large  3609485056
#> 2       Large 11435467097
#> 3      Medium   937581572
#> 4       Small  3869267348
Here funs(sum) has been replaced by funs(sum(as.numeric(.))), which does the same thing, executing sum on each group, but converts to numeric first. (This is also why your colSums call worked: colSums accumulates in double precision, so it cannot overflow.)
It's because volume is an integer and not numeric:
example$volume <- as.numeric(example$volume)
aggregate(volume ~ size, data=example, FUN=sum)
         size      volume
1 Extra Large  3609485056
2       Large 11435467097
3      Medium   937581572
4       Small  3869267348
For more, check here:
What is integer overflow in R and how can it happen?
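As a quick sanity check, the overflow is easy to reproduce directly, independent of the data above:
# R integers are 32-bit, so sums beyond 2147483647 overflow to NA
.Machine$integer.max                         # 2147483647
sum(c(.Machine$integer.max, 1L))             # NA, with an overflow warning
sum(as.numeric(c(.Machine$integer.max, 1L))) # 2147483648; doubles don't overflow here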

Mass changing columns of a data set to numeric

I've imported an Excel data set and want to set nearly all columns (more than 90) to numeric when they are initially character. What is the best way to achieve this? Importing and changing each column to numeric one by one isn't the most efficient approach.
This should do as you wish:
# Random data frame for illustration (100 columns wide)
df <- data.frame(replicate(100, sample(0:1, 1000, rep = TRUE)))
# Check column names / return the number of columns (just in case you want to check)
colnames(df)
# Specify columns
cols <- c(1:length(df)) # length(df) is useful in case you ever add more columns later
# Or, if you only want specific column numbers:
# cols <- c(1:100)
# With the help of magrittr's compound assignment pipe, change them all to numeric
library(magrittr)
df[, cols] %<>% lapply(function(x) as.numeric(as.character(x)))
# Check that our columns are numeric
str(df)
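A related shortcut, not part of the answer above: if the character columns genuinely hold numbers stored as text (as often happens after an Excel import), readr::type_convert() re-guesses the type of every character column in one call:
# sketch: re-guess the type of every character column at once;
# assumes df has character columns holding numbers as text
library(readr)
df <- type_convert(df)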
Assuming your data is already imported with all character columns, you can convert the relevant columns to numeric using mutate_at by position or name:
suppressPackageStartupMessages(library(tidyverse))
# Assume the imported excel file has 5 columns a to e
df <- tibble(a = as.character(1:3),
             b = as.character(5:7),
             c = as.character(8:10),
             d = as.character(2:4),
             e = as.character(2:4))
# select the columns by position (convert all except 'b')
df %>% mutate_at(c(1, 3:5), as.numeric)
#> # A tibble: 3 x 5
#>       a b         c     d     e
#>   <dbl> <chr> <dbl> <dbl> <dbl>
#> 1     1 5         8     2     2
#> 2     2 6         9     3     3
#> 3     3 7        10     4     4
# or drop the columns that shouldn't be used ('b' and 'd' should stay as chr)
df %>% mutate_at(-c(2, 4), as.numeric)
#> # A tibble: 3 x 5
#>       a b         c d         e
#>   <dbl> <chr> <dbl> <chr> <dbl>
#> 1     1 5         8 2         2
#> 2     2 6         9 3         3
#> 3     3 7        10 4         4
# select the columns by name
df %>% mutate_at(c("a", "c", "d", "e"), as.numeric)
#> # A tibble: 3 x 5
#>       a b         c     d     e
#>   <dbl> <chr> <dbl> <dbl> <dbl>
#> 1     1 5         8     2     2
#> 2     2 6         9     3     3
#> 3     3 7        10     4     4
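In dplyr >= 1.0.0, mutate_at() is superseded by across(); sketches of the equivalent calls:
# same conversions with across() (dplyr >= 1.0.0)
df %>% mutate(across(c(1, 3:5), as.numeric))            # by position
df %>% mutate(across(c(a, c, d, e), as.numeric))        # by name
df %>% mutate(across(where(is.character), as.numeric))  # every character column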
