New df in R pulling from large existing df

In R, I am trying to take a df I have called "gorilla" and create four new dfs split by a column identifier. The "gorilla" data frame has a column called "order", whose values are 1, 2, 3, or 4. I want to create a new df with the "1" values only, another with the "2" values only, etc. What is the best way to do this?

If you do:
list2env(setNames(split(gorilla, gorilla$order), paste0("gorilla", 1:4)),
envir = globalenv())
Then you will have the four data frames gorilla1, gorilla2, gorilla3 and gorilla4 in your workspace (this assumes all four order values actually occur, so that split() returns exactly four pieces).
For example, if we have this dataset:
set.seed(100)
gorilla <- data.frame(data = rnorm(10), order = sample(4, 10, TRUE))
gorilla
#> data order
#> 1 -0.50219235 3
#> 2 0.13153117 4
#> 3 -0.07891709 2
#> 4 0.88678481 1
#> 5 0.11697127 4
#> 6 0.31863009 3
#> 7 -0.58179068 3
#> 8 0.71453271 4
#> 9 -0.82525943 2
#> 10 -0.35986213 1
We can do:
list2env(setNames(split(gorilla, gorilla$order), paste0("gorilla", 1:4)),
envir = globalenv())
#> <environment: R_GlobalEnv>
And now we can see we have these objects available:
gorilla1
#> data order
#> 4 0.8867848 1
#> 10 -0.3598621 1
gorilla2
#> data order
#> 3 -0.07891709 2
#> 9 -0.82525943 2
gorilla3
#> data order
#> 1 -0.5021924 3
#> 6 0.3186301 3
#> 7 -0.5817907 3
gorilla4
#> data order
#> 2 0.1315312 4
#> 5 0.1169713 4
#> 8 0.7145327 4
Note though that it is probably best in most circumstances to keep the data frames in a list:
gorillas <- split(gorilla, gorilla$order)
That way, you can simply access gorillas[[1]], gorillas[[2]], etc.
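As a minimal sketch of the list approach (using the same seeded example data as above): split() names the resulting list elements after the grouping values, so each subset can also be retrieved by name.

```r
set.seed(100)
gorilla <- data.frame(data = rnorm(10), order = sample(4, 10, TRUE))

# split() returns a named list, one element per distinct `order` value
gorillas <- split(gorilla, gorilla$order)
names(gorillas)
#> [1] "1" "2" "3" "4"

# Access by position or by name
gorillas[["2"]]    # all rows where order == 2; same as gorillas[[2]] here
```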

An option with group_split
library(dplyr)
gorillas <- gorilla %>%
group_split(order)


is.na or complete.cases in R using column number

In this example I need to drop all rows with NA values. I tried:
drop <- is.na(df[,c(3,4,5)])
Error in df[, c(3, 4, 5)] : incorrect number of dimensions
My dataframe has 5 columns.
I am not trying to select columns by column name.
I also tried:
df[complete.cases(df[ , 3:5]),]
This gives the same error: incorrect number of dimensions.
Dropping missing values from vectors
The errors indicate that your data are likely a vector, not a data.frame. Accordingly, there are no rows or columns (a vector has no dim), and so subsetting with [ , ] throws these errors. To support this, below I create a vector, reproduce the errors, and demonstrate how to drop missing values from it.
# Create vector, show it's a vector
vec <- c(NA,1:4)
vec
#> [1] NA 1 2 3 4
is.vector(vec)
#> [1] TRUE
# Reproduces your errors for both methods
is.na(vec[ ,2:3])
#> Error in vec[, 2:3]: incorrect number of dimensions
vec[complete.cases(vec[ , 2:3]), ]
#> Error in vec[, 2:3]: incorrect number of dimensions
# Remove missing values from the vector
vec[!is.na(vec)]
#> [1] 1 2 3 4
vec[complete.cases(vec)]
#> [1] 1 2 3 4
I'll additionally show you below how to check if your data object is a data.frame and how to omit rows with missing values in case it is.
Create data and check it's a data.frame
# Create an example data.frame
set.seed(123)
N <- 10
df <- data.frame(
x1 = sample(c(NA_real_, 1, 2, 3), N, replace = T),
x2 = sample(c(NA_real_, 1, 2, 3), N, replace = T),
x3 = sample(c(NA_real_, 1, 2, 3), N, replace = T)
)
print(df)
#> x1 x2 x3
#> 1 2 3 NA
#> 2 2 1 3
#> 3 2 1 NA
#> 4 1 NA NA
#> 5 2 1 NA
#> 6 1 2 2
#> 7 1 3 3
#> 8 1 NA 1
#> 9 2 2 2
#> 10 NA 2 1
# My hunch is that you are not using a data.frame. You can check as follows:
class(df)
#> [1] "data.frame"
Approaches to removing rows with missing values from data.frames
Your first approach returns logical values indicating whether each value is missing in the specified columns. You can then use rowSums() to count the missing values per row and drop those rows, as below.
# Example: shows whether values are missing for second and third columns
miss <- is.na(df[ ,2:3])
print(miss)
#> x2 x3
#> [1,] FALSE TRUE
#> [2,] FALSE FALSE
#> [3,] FALSE TRUE
#> [4,] TRUE TRUE
#> [5,] FALSE TRUE
#> [6,] FALSE FALSE
#> [7,] FALSE FALSE
#> [8,] TRUE FALSE
#> [9,] FALSE FALSE
#> [10,] FALSE FALSE
# We can sum these values by row (`TRUE` = 1, `FALSE` = 0 in R) and keep only
# the rows that sum to 0. Notice that the row names retain the original
# numbering, and that the NA in x1 (row 10) survives because only columns
# 2:3 were checked.
df[rowSums(miss) == 0, ]
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
Your second approach is to use complete.cases. This also works and produces the same result as the first approach.
miss_cases <- df[complete.cases(df[ ,2:3]), ]
miss_cases
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
#> 10 NA 2 1
A third approach is to use na.omit(); however, it doesn't let you specify columns, so if you need to filter on specific columns you should use complete.cases() instead.
na.omit(df)
#> x1 x2 x3
#> 2 2 1 3
#> 6 1 2 2
#> 7 1 3 3
#> 9 2 2 2
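One side note, sketched below and easy to verify: unlike the subsetting approaches, na.omit() records which rows it dropped in an "na.action" attribute on the result.

```r
d <- data.frame(x = c(1, NA, 3))
cleaned <- na.omit(d)
# The dropped row indices are stored on the result:
attr(cleaned, "na.action")   # a named integer vector of class "omit"
```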
A fourth approach is to use the tidyr package; its appeal is that drop_na() accepts column indices as well as unquoted column names. Note that it also resets the row names.
library(tidyr)
drop_na(df, 2:3)
#> x1 x2 x3
#> 1 2 1 3
#> 2 1 2 2
#> 3 1 3 3
#> 4 2 2 2
#> 5 NA 2 1

merge xts objects with suffixes dynamically

The task is to dynamically merge multiple xts objects into one big blob with suffixes, where the suffixes are appended to the column names to identify which xts object each column came from.
Sample Data:
library(xts)
a <- data.frame(alpha=1:10, beta=2:11)
xts1 <- xts(x=a, order.by=Sys.Date() - 1:10)
b <- data.frame(alpha=3:12, beta=4:13)
xts2 <- xts(x=b, order.by=Sys.Date() - 1:10)
c <- data.frame(alpha=5:14, beta=6:15)
xts3 <- xts(x=c, order.by=Sys.Date() - 1:10)
Static way of merging:
$> merge.zoo(xts1, xts2, xts3, suffixes=c("A", "B", "C"))
# output
alpha.A beta.A alpha.B beta.B alpha.C beta.C
2022-03-11 10 11 12 13 14 15
2022-03-12 9 10 11 12 13 14
2022-03-13 8 9 10 11 12 13
2022-03-14 7 8 9 10 11 12
2022-03-15 6 7 8 9 10 11
2022-03-16 5 6 7 8 9 10
2022-03-17 4 5 6 7 8 9
2022-03-18 3 4 5 6 7 8
2022-03-19 2 3 4 5 6 7
2022-03-20 1 2 3 4 5 6
I might have more than 3 xts objects; 3 is just an arbitrary number for demonstration.
I've tried do.call, but my attempts failed both with and without suffixes, because I can't wrap the 3 xts objects into a data structure that do.call accepts as a list (in R terms it should be a vector of 3 items).
do.call Demo:
# do.call with xts objects as separate args and suffixes, works
do.call(merge.zoo, list(xts1, xts2, xts3, suffixes=c("A", "B", "C")))
# do.call with the xts objects wrapped up in a list plus suffixes: fails,
# because R passes the whole list as a single argument; a list of xts objects
# is one object with 3 elements, not 3 separate xts arguments.
xts.list <- list(xts1, xts2, xts3)
# check data type
class(xts.list[[1]]) # output: xts,zoo
class(xts.list[1]) # output: list
# do.call failed attempt
do.call(merge.zoo, list(xts.list, suffixes=c("A", "B", "C")))
# Error Message
Error in zoo(structure(x, dim = dim(x)), index(x), ...) :
"x" : attempt to define invalid zoo object
In other words, if I could unpack the list into a dynamic number of arguments, I'd be able to make this idea work; however, I can't seem to find a way to unpack arguments in R, or any other solution.
Disclaimer: The ultimate problem I am trying to solve is to be able to plot the time series data in the multi-panel view eventually; ggplot does not work with most of the packages I am using on a daily basis.
Disclaimer 2: merge.xts ignores suffixes (a bug); merge.zoo is the working alternative. For more information take a look here.
We can pass everything in a single list, i.e.
library(zoo)
c(xts.list, list(suffixes=c("A", "B", "C")))
Now, use merge in do.call:
do.call(merge.zoo, c(xts.list, list(suffixes=c("A", "B", "C"))))
Output:
alpha.A beta.A alpha.B beta.B alpha.C beta.C
2022-03-11 10 11 12 13 14 15
2022-03-12 9 10 11 12 13 14
2022-03-13 8 9 10 11 12 13
2022-03-14 7 8 9 10 11 12
2022-03-15 6 7 8 9 10 11
2022-03-16 5 6 7 8 9 10
2022-03-17 4 5 6 7 8 9
2022-03-18 3 4 5 6 7 8
2022-03-19 2 3 4 5 6 7
2022-03-20 1 2 3 4 5 6
Note that the first argument to merge is the variadic component (...), which can take one or more xts objects, whereas all the other arguments are named; that is why we create the list with names only for those components, i.e. suffixes. According to ?merge:
merge(...,
all = TRUE,
fill = NA,
suffixes = NULL,
join = "outer",
retside = TRUE,
retclass = "xts",
tzone = NULL,
drop=NULL,
check.names=NULL)
Thus, when we want to append a named element to a list such as xts.list, we wrap the named arguments in their own list and then concatenate with c(). It is similar to:
> c(list(1), list(a = 1, b = 2))
[[1]]
[1] 1
$a
[1] 1
$b
[1] 2
and not the following, as this creates a nested list:
> list(list(1), list(a = 1, b = 2))
[[1]]
[[1]][[1]]
[1] 1
[[2]]
[[2]]$a
[1] 1
[[2]]$b
[1] 2
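The same unpacking pattern works for any function, not just merge.zoo. As a minimal sketch with base paste():

```r
# do.call(f, args) calls f with each list element as a separate argument;
# named list elements become named arguments.
args <- c(list("a", "b", "c"), list(sep = "-"))
do.call(paste, args)
#> [1] "a-b-c"
```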
This is now (finally) fixed in this commit. Thanks for the nudge to get this fixed!
library(xts)
idx <- Sys.Date() - 1:10
x1 <- xts(cbind(alpha = 1:10, beta = 2:11), idx)
x2 <- xts(cbind(alpha = 3:12, beta = 4:13), idx)
x3 <- xts(cbind(alpha = 5:14, beta = 6:15), idx)
suffixes <- LETTERS[1:3]
merge(x1, x2, x3, suffixes = suffixes)
## alpha.A beta.A alpha.B beta.B alpha.C beta.C
## 2022-05-13 10 11 12 13 14 15
## 2022-05-14 9 10 11 12 13 14
## 2022-05-15 8 9 10 11 12 13
## 2022-05-16 7 8 9 10 11 12
## 2022-05-17 6 7 8 9 10 11
## 2022-05-18 5 6 7 8 9 10
## 2022-05-19 4 5 6 7 8 9
## 2022-05-20 3 4 5 6 7 8
## 2022-05-21 2 3 4 5 6 7
## 2022-05-22 1 2 3 4 5 6

R: Categorizing column dependent on value in another column (same characters exist)

I'm sure there's a painfully easy solution to this, but given I'm new to R I'm a bit stumped.
I have a large dataset with the data structured accordingly.
v1
1 US2
2 L1_US24
3 US2_0
4 US24
5 US245
6 US245
7 US24 L
8 US3
What I'd like to do is create a categorisation column dependent upon the values in v1 like so:
v1 Cat
1 US2 1
2 L1_US24 2
3 US2_0 1
4 US24 2
5 US245 3
6 US245 3
7 US24 L 2
8 US3 4
Now if it were a binary choice it would be quite easy, since I could use grepl with ifelse to assign the values accordingly. However, I'm unsure whether that is an efficient approach for a large dataset where the same characters appear in multiple values.
Can anyone provide some advice on how to achieve the desired result?
Here is a more general solution that should handle the different cases you may encounter.
Reprex
Solution with Base R only
Code
# Extract codes 'USXXX'
code <- regmatches(df$V1, regexpr("US\\d+", df$V1))
# Convert codes into numeric categories and add them in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))
Output
df
#> V1 Cat
#> 1 US2 1
#> 2 L1_US24 2
#> 3 US2_0 1
#> 4 US24 2
#> 5 US245 3
#> 6 US245 3
#> 7 US24 L 2
#> 8 US3 4
Solution using stringr
Code
# Extract codes 'USXXX'
code <- stringr::str_extract(df$V1, "US\\d+")
# Convert codes into numeric categories and add them in the 'Cat' column
df$Cat <- as.numeric(factor(code, levels = unique(code)))
Output
df
#> V1 Cat
#> 1 US2 1
#> 2 L1_US24 2
#> 3 US2_0 1
#> 4 US24 2
#> 5 US245 3
#> 6 US245 3
#> 7 US24 L 2
#> 8 US3 4
Data
df <- data.frame(V1 = c("US2", "L1_US24", "US2_0", "US24", "US245", "US245", "US24 L", "US3"))
Created on 2022-02-04 by the reprex package (v2.0.1)
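A base-R variant of the same idea, sketched below: match() against unique() produces the identical first-appearance numbering without constructing a factor.

```r
df <- data.frame(V1 = c("US2", "L1_US24", "US2_0", "US24",
                        "US245", "US245", "US24 L", "US3"))

# Extract codes 'USXXX', then number them by first appearance
code <- regmatches(df$V1, regexpr("US\\d+", df$V1))
df$Cat <- match(code, unique(code))
df$Cat
#> [1] 1 2 1 2 3 3 2 4
```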
You can convert to factor and then to numeric (shown here on simplified example data):
df$Cat <- as.numeric(factor(df$v1, levels = unique(df$v1)))
df
v1 Cat
1 US2 1
2 US24 2
3 US2 1
4 US24 2
5 US245 3
6 US245 3
7 US243 4
8 US3 5

map() into an argument that is not the first argument

I have a function that takes multiple arguments (simple reproducible example below):
return_numbers <- function(first = 1, last = 10){
seq(first, last)
}
If I then have a vector that I want to map(), for example:
x <- c(5, 6, 7)
It's quite easy to map() the vector x into the first argument of the function:
map(x, return_numbers)
[[1]]
[1] 5 6 7 8 9 10
[[2]]
[1] 6 7 8 9 10
[[3]]
[1] 7 8 9 10
But I can't work out how to map x into the second argument (last = ).
I referred to Hadley Wickham's Advanced R:
https://adv-r.hadley.nz/functionals.html#change-argument
and tried this, but I must be doing something wrong:
map(x, ~ return_numbers(x, last = .x))
My desired output would be:
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4 5 6
[[3]]
[1] 1 2 3 4 5 6 7
This should work:
map(x, ~return_numbers(last = .))
You can also specify the first argument explicitly:
return_numbers <- function(first = 1, last = 10){
seq(first, last)
}
x <- c(5, 6, 7)
purrr::map(x, return_numbers, first=1)
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 1 2 3 4 5 6
#>
#> [[3]]
#> [1] 1 2 3 4 5 6 7
Created on 2019-11-10 by the reprex package (v0.3.0)
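If purrr is unavailable, both patterns can be sketched in base R with lapply(): an anonymous function can route each element into any argument, and naming the extra ... argument pins the first formal so each element falls through to the second.

```r
return_numbers <- function(first = 1, last = 10) {
  seq(first, last)
}
x <- c(5, 6, 7)

# Anonymous function: route each element of x into `last`
lapply(x, function(i) return_numbers(last = i))

# Or name `first`, so each element of x matches `last` positionally
lapply(x, return_numbers, first = 1)
# both return list(1:5, 1:6, 1:7)
```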

How to add a column to lists within a list without losing their names?

I made several attempts to add a specific column to the data frames (i.e. lists) within a list, but all my *apply() attempts failed to preserve the names of the data frames.
For example for list l,
l <- list(alpha=data.frame(1:3), bravo=data.frame(4:6), charly=data.frame(7:9))
> l
$`alpha`
X1.3
1 1
2 2
3 3
$bravo
X4.6
1 4
2 5
3 6
$charly
X7.9
1 7
2 8
3 9
I want the initial letters of the list elements' names as a second id column. I tried these approaches, which give me basically what I want:
lapply(seq_along(l), function(x) cbind(l[[x]], id=substr(names(l)[x], 1, 1)))
# or
lapply(seq_along(l), function(x) data.frame(l[[x]], id=substr(names(l)[x], 1, 1)))
# [[1]]
# X1.3 id
# 1 1 a
# 2 2 a
# 3 3 a
#
# [[2]]
# X4.6 id
# 1 4 b
# 2 5 b
# 3 6 b
#
# [[3]]
# X7.9 id
# 1 7 c
# 2 8 c
# 3 9 c
but the resulting list has lost its names. The option USE.NAMES=TRUE from the lapply() documentation (it is an argument of sapply(), not lapply()) didn't work.
I also tried these two attempts, but they failed even worse.
lapply(seq_along(l), function(x) mapply(cbind, l[[x]], id=substr(names(l)[x], 1, 1),
SIMPLIFY=FALSE))
rapply(l, function(x) cbind(x, id=substr(names(l)[x], 1, 1)), how="list")
I know I could do this like so:
l1 <- lapply(seq_along(l), function(x) cbind(l[[x]], id=substr(names(l)[x], 1, 1)))
names(l1) <- names(l)
or do a for loop:
for(i in seq_along(l)) {
l[[i]] <- data.frame(l[[i]], id=substr(names(l)[i], 1, 1))
}
but I'd like to know whether an *apply() solution could be improved to bring the expected output, which would be:
$`alpha`
X1.3 id
1 1 a
2 2 a
3 3 a
$bravo
X4.6 id
1 4 b
2 5 b
3 6 b
$charly
X7.9 id
1 7 c
2 8 c
3 9 c
Try Map:
Map(`[<-`, l, "id", value = substr(names(l), 1, 1))
#$alpha
# X1.3 id
#1 1 a
#2 2 a
#3 3 a
#$bravo
# X4.6 id
#1 4 b
#2 5 b
#3 6 b
#$charly
# X7.9 id
#1 7 c
#2 8 c
#3 9 c
The first argument is a function. Map then applies the function "to the first elements of each ... argument, the second elements, the third elements, and so on" (see ?mapply).
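A tiny sketch of that parallel behaviour, including recycling: shorter arguments are reused, which is why a single string like "id" above pairs with every data frame in the list.

```r
# Elements are consumed in parallel: 1+10, 2+20, 3+30
Map(`+`, 1:3, c(10, 20, 30))
# returns list(11, 21, 31)

# Length-1 arguments are recycled across the longer one
Map(paste, c("a", "b"), "suffix")
```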
sapply over the names with simplify = FALSE.
addId <- function(x) cbind(l[[x]], id = substring(x, 1, 1))
sapply(names(l), addId, simplify = FALSE)
giving:
$`alpha`
X1.3 id
1 1 a
2 2 a
3 3 a
$bravo
X4.6 id
1 4 b
2 5 b
3 6 b
$charly
X7.9 id
1 7 c
2 8 c
3 9 c
Alternatively:
replace(l, TRUE, lapply(names(l), addId))
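A quick sketch of why the replace() variant keeps the names: assigning into every position of the list overwrites the contents but leaves the list's own attributes, including names(), untouched.

```r
l2 <- list(a = 1, b = 2)
# l2[TRUE] <- ... replaces all elements; the names stay attached to l2
replace(l2, TRUE, lapply(names(l2), toupper))
#> $a
#> [1] "A"
#>
#> $b
#> [1] "B"
```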
If you don't mind switching from the apply family to the purrr::map family, purrr::imap takes two arguments: the item being mapped over and its names. You can then use the same cbind call, now with easy access to the names of the data frames.
l <- list(alpha=data.frame(1:3), bravo=data.frame(4:6), charly=data.frame(7:9))
purrr::imap(l, function(df, name) cbind(df, id = substr(name, 1, 1)))
#> $alpha
#> X1.3 id
#> 1 1 a
#> 2 2 a
#> 3 3 a
#>
#> $bravo
#> X4.6 id
#> 1 4 b
#> 2 5 b
#> 3 6 b
#>
#> $charly
#> X7.9 id
#> 1 7 c
#> 2 8 c
#> 3 9 c
Or if you want to go full tidyverse, you can add a column with dplyr::mutate inside your imap.
library(tidyverse)
imap(l, function(df, name) df %>% mutate(id = str_sub(name, 1, 1)))
#> $alpha
#> X1.3 id
#> 1 1 a
#> 2 2 a
#> 3 3 a
#>
#> $bravo
#> X4.6 id
#> 1 4 b
#> 2 5 b
#> 3 6 b
#>
#> $charly
#> X7.9 id
#> 1 7 c
#> 2 8 c
#> 3 9 c
As noted by @markus, you can also use the ~ formula shorthand instead of spelling out your functions. In that case, purrr::imap's two arguments become .x (the data frames) and .y (the names). This looks like:
purrr::imap(l, ~cbind(.x, id = substr(.y, 1, 1)))
