R: Split large character lines into slices

For a large dataframe containing 99150000 rows, the following code splits the data my_df into chunks of 1000 rows and writes to the disk.
lapply(seq(1, nrow(my_df), by = 1000),
function(i) write.table(my_df[i:(i+1000-1),] # parentheses needed: ':' binds tighter than '+'
, file = paste0('path_to_logal_dir/data'
, i, '-', i+1000-1, '.csv')
,row.names = F,col.names = F,quote = F)
)
Now, I have the same data (99150000 elements) in the character format, sample data below:
[1] "1979_1,532,40,7.7,12.9,116.9,12.9,85,2,2.001,4,25,55,5.3,55,85,7.7,85,145,7.5,145,265,5.0"
[2] "1979_2,532,40,7.7,12.9,116.9,12.9,85,2,2.001,4,25,55,5.3,55,85,7.7,85,145,7.5"
[3] "1979_3,532,40,7.7,12.9,116.9,12.9,85,2,2.001,4,25,55,5.3,55,85,7.7,85"
...
[99150000] ...
How could I achieve the same task above, that is, splitting the character format data into chunks (files containing 1000 lines)?

This is a solution using only base R. You can easily generalize it with the apply family or the purrr package. First, I create some fake data:
fake_data <- c("A", "B", "C", "D", "E", "F", "G", "H")
fake_data
#> [1] "A" "B" "C" "D" "E" "F" "G" "H"
You want to divide your character vector into groups of 1000 lines. For simplicity I divide this vector into groups of 2 lines
group_length <- 2
This means that the first 2 elements of the character vector belong to the first group, the second 2 elements belong to the second group and so on
groups <- rep(1 : (length(fake_data) / group_length), each = group_length)
groups
#> [1] 1 1 2 2 3 3 4 4
Now I divide the character vector into subgroups based on this grouping vector
splitted_groups <- split(fake_data, groups)
splitted_groups
#> $`1`
#> [1] "A" "B"
#>
#> $`2`
#> [1] "C" "D"
#>
#> $`3`
#> [1] "E" "F"
#>
#> $`4`
#> [1] "G" "H"
and create a for loop to save each subgroup to a file
for (i in seq_len(length(fake_data) / group_length)) {
table_data <- data.frame(x = splitted_groups[[i]])
write.csv(table_data, file = paste0("data", i, ".csv"), row.names = FALSE)
}
Created on 2019-07-30 by the reprex package (v0.3.0)
You could also replace the last for loop using the map family defined in the purrr package.
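The grouping above assumes that the vector length is an exact multiple of group_length. A minimal sketch for the general case follows. The helper name write_chunks, its arguments, and the temporary output directory are hypothetical and not part of the original answer. It uses ceiling(seq_along(x) / chunk_size) so that the last chunk may be shorter, and writeLines so that each string becomes one line of the file.

```r
# Hypothetical helper (not from the original answer): write a character
# vector to files holding at most `chunk_size` lines each.
write_chunks <- function(x, chunk_size, out_dir, prefix = "data") {
  # Chunk id for every element; the last chunk may be shorter than the rest.
  chunk_id <- ceiling(seq_along(x) / chunk_size)
  pieces <- split(x, chunk_id)
  # sprintf("%d") avoids scientific notation in file names for large counts.
  paths <- file.path(out_dir,
                     paste0(prefix, sprintf("%d", seq_along(pieces)), ".csv"))
  for (k in seq_along(pieces)) {
    # writeLines writes one element per line, without quotes or a header.
    writeLines(pieces[[k]], paths[k])
  }
  invisible(paths)
}

fake_data <- c("A", "B", "C", "D", "E", "F", "G")  # 7 elements, not a multiple of 2
out_dir <- tempfile("chunks")
dir.create(out_dir)
paths <- write_chunks(fake_data, chunk_size = 2, out_dir = out_dir)
```

For the data in the question, chunk_size would be 1000 and out_dir the target directory.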

Related

Turn a datatable into a two-dimensional list in R

I have a data.table (see dt). I want to turn it into a 2-dimensional list for future use (e.g. a, b and c are column names of another dt. I want to select the value of a non-missing column among a, b and c then impute into x, and so on). So the 2-dimensional list will act like a reference object for fcoalesce function.
# example
dt <- data.table(col1 = c("a", "b", "c", "d", "e", "f"),
col2 = c("x", "x", "x", "y", "y", "z"))
# desirable result
list.1 <- list(c("a", "b", "c"), c("d", "e"), c("f"))
list.2 <- list("x", "y", "z")
list(list.1, list.2)
Since the actual dt is much larger than the example dt, is there a more efficient way to do it?
You can use split():
lst1 <- split(dt$col1, dt$col2)
lst2 <- as.list(names(lst1))
result <- list(unname(lst1), lst2)
result
# [[1]]
# [[1]][[1]]
# [1] "a" "b" "c"
#
# [[1]][[2]]
# [1] "d" "e"
#
# [[1]][[3]]
# [1] "f"
#
#
# [[2]]
# [[2]][[1]]
# [1] "x"
#
# [[2]][[2]]
# [1] "y"
#
# [[2]][[3]]
# [1] "z"

Save unique values of variable for each combination of two variables in a dataset

I have a (large) dataset with three variables. For each combination of sub1 and sub2, I would like to save all unique IVs in a separate vector or dataset, ignoring id, and name it using the pattern "sub1.and.sub2.IV". As my dataset is quite large, I would like to avoid using which and instead extract all combinations automatically.
id sub1 sub2 IV
<chr> <chr> <chr> <chr>
1 3 a a p
2 3 a a f
3 6 a b z
4 6 a b e
5 7 a c b
6 7 a c b
In the end, I would have three vector or datasets:
> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"
MRE example:
structure(list(id = c("3", "3", "6", "6", "7", "7"), sub1 = c("a",
"a", "a", "a", "a", "a"), sub2 = c("a", "a", "b", "b", "c", "c"
), IV = c("p", "f", "z", "e", "b", "b")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
Maybe split
> split(df$IV, df[c("sub1","sub2")])
$a.a
[1] "p" "f"
$a.b
[1] "z" "e"
$a.c
[1] "b" "b"
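The split() output above still contains the duplicated "b", while the question asks for unique values. A minimal sketch, assuming the data from the question's MRE is rebuilt as a plain data.frame named df, applies unique() to each group. The name iv_by_group is chosen here for illustration.

```r
# Data from the question's MRE, rebuilt as a plain data.frame.
df <- data.frame(
  id   = c("3", "3", "6", "6", "7", "7"),
  sub1 = c("a", "a", "a", "a", "a", "a"),
  sub2 = c("a", "a", "b", "b", "c", "c"),
  IV   = c("p", "f", "z", "e", "b", "b"),
  stringsAsFactors = FALSE
)

# Split IV by every sub1/sub2 combination, then deduplicate each group.
# drop = TRUE discards combinations that never occur in the data.
iv_by_group <- lapply(split(df$IV, df[c("sub1", "sub2")], drop = TRUE), unique)
iv_by_group
```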
One possibility could be:
a.and.a.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="a"),]$IV)
a.and.b.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="b"),]$IV)
a.and.c.IV<-unique(df[which(df$sub1 == "a" & df$sub2=="c"),]$IV)
> a.and.a.IV
[1] "p" "f"
> a.and.b.IV
[1] "z" "e"
> a.and.c.IV
[1] "b"
I used @ThomasIsCoding's comment to search for more solutions. I found three ways to split the data frame into a list, each followed by a for loop that turns the list elements into separate data frames. The for loop stays the same for every solution:
Solution 1:
Using a custom function by @romainfrancois to split the data and name the resulting data.frames after the corresponding combinations of sub1 and sub2.
library(dplyr, warn.conflicts = FALSE)
named_group_split <- function(.tbl, ...) {
grouped <- group_by(.tbl, ...)
names <- rlang::eval_bare(rlang::expr(paste(!!!group_keys(grouped), sep = " / ")))
grouped %>%
group_split() %>%
rlang::set_names(names)
}
df_split1 <- df %>%
named_group_split(sub1, sub2) %>%
unique()
for(i in 1:length(df_split1)) {
assign(paste0(names(df_split1[i])), as.data.frame(df_split1[[i]]))
}
Solution 2:
Using dplyr::group_split to split the dataset into a list with all the original variables and their respective names. Unfortunately, this solution is not able to name the data.frames. Solution found here.
df_split2 <- df %>%
group_split(sub1, sub2)
for(i in 1:length(df_split2)) {
assign(paste0(names(df_split2[i])), as.data.frame(df_split2[[i]]))
}
Solution 3:
Using base::split to split the dataset into a list that contains only the IV values, followed by the same for loop.
df_split3 <- split(df$IV, df[c("sub1","sub2")])
for(i in 1:length(df_split3)) {
assign(paste0(names(df_split3[i])), as.data.frame(df_split3[[i]]))
}
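As an alternative to calling assign() in a loop, the split result can be renamed to the requested "sub1.and.sub2.IV" pattern and copied into an environment in one step. This is only a sketch: the names iv and iv_env are hypothetical, the renaming assumes that the sub1 values contain no "." themselves, and a separate environment is used here so the global workspace is left untouched.

```r
df <- data.frame(
  sub1 = c("a", "a", "a", "a", "a", "a"),
  sub2 = c("a", "a", "b", "b", "c", "c"),
  IV   = c("p", "f", "z", "e", "b", "b"),
  stringsAsFactors = FALSE
)

# Unique IV values per sub1/sub2 combination, named "a.a", "a.b", ...
iv <- lapply(split(df$IV, df[c("sub1", "sub2")], drop = TRUE), unique)

# Rename "a.b" to "a.and.b.IV"; only the first "." is replaced.
names(iv) <- paste0(sub(".", ".and.", names(iv), fixed = TRUE), ".IV")

# Copy every list element into its own variable inside a new environment.
iv_env <- list2env(iv, envir = new.env())
get("a.and.c.IV", envir = iv_env)
```

Passing envir = globalenv() instead would create the variables in the workspace, which is what the assign() loops do.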

How to obtain the list of elements from a Venn diagram

I have a Venn diagram made from 3 lists. I would like to obtain all the different sub-lists: the elements common to two lists, the elements common to all three of them, and the unique elements of each list. Is there a way to make this as straightforward as possible?
AW.DL <- c("a","b","c","d")
AW.FL <- c("a","b", "e", "f")
AW.UL <- c("a","c", "e", "g")
library(VennDiagram)
venn.diagram(
x = list(AW.DL, AW.FL, AW.UL),
category.names = c("AW.DL" , "AW.FL","AW.UL" ),
filename = '#14_venn_diagramm.png',
output=TRUE,
na = "remove"
)
I found that the VennDiagram package has a function calculate.overlap(), but I wasn't able to find a way to name the sections it returns. However, the gplots package has the function venn(), whose return value carries the intersections as an attribute.
AW.DL <- c("a","b","c","d")
AW.FL <- c("a","b", "e", "f")
AW.UL <- c("a","c", "e", "g")
library(gplots)
lst <- list(AW.DL,AW.FL,AW.UL)
ItemsList <- venn(lst, show.plot = FALSE)
lengths(attributes(ItemsList)$intersections)
Output:
> lengths(attributes(ItemsList)$intersections)
    A     B     C   A:B   A:C   B:C A:B:C
    1     1     1     1     1     1     1
To get elements, just print attributes(ItemsList)$intersections:
> attributes(ItemsList)$intersections
$A
[1] "d"
$B
[1] "f"
$C
[1] "g"
$`A:B`
[1] "b"
$`A:C`
[1] "c"
$`B:C`
[1] "e"
$`A:B:C`
[1] "a"
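If installing a package is not an option, the same seven regions can be computed with base R set operations. This is a sketch rather than the method used in the answer above, and the region names (DL_only, DL_FL, and so on) are made up here for readability.

```r
AW.DL <- c("a", "b", "c", "d")
AW.FL <- c("a", "b", "e", "f")
AW.UL <- c("a", "c", "e", "g")

regions <- list(
  # Elements that appear in exactly one list.
  DL_only  = setdiff(AW.DL, union(AW.FL, AW.UL)),
  FL_only  = setdiff(AW.FL, union(AW.DL, AW.UL)),
  UL_only  = setdiff(AW.UL, union(AW.DL, AW.FL)),
  # Elements shared by exactly two lists.
  DL_FL    = setdiff(intersect(AW.DL, AW.FL), AW.UL),
  DL_UL    = setdiff(intersect(AW.DL, AW.UL), AW.FL),
  FL_UL    = setdiff(intersect(AW.FL, AW.UL), AW.DL),
  # Elements shared by all three lists.
  DL_FL_UL = Reduce(intersect, list(AW.DL, AW.FL, AW.UL))
)
regions
```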

Finding specific elements in lists

I am stuck at one of the challenges proposed in a tutorial I am reading.
# Using the following code:
challenge_list <- list(words = c("alpha", "beta", "gamma"),
numbers = 1:10,
letter = letters)
# challenge_list
# Extract the following things:
#
# - The word "gamma"
# - The letters "a", "e", "i", "o", and "u"
# - The numbers less than or equal to 3
I have tried using the following:
## 1
challenge_list$"gamma"
## 2
challenge_list [[1]["gamma"]]
But nothing works.
> challenge_list$words[challenge_list$words == "gamma"]
[1] "gamma"
> challenge_list$letter[challenge_list$letter %in% c("a","e","i","o","u")]
[1] "a" "e" "i" "o" "u"
> challenge_list$numbers[challenge_list$numbers<=3]
[1] 1 2 3
We can write a function that subsets its input differently depending on whether it is numeric, then use Map to pair each element of the list with the corresponding filter value and apply f1. This returns a new list with the filtered values:
f1 <- function(x, y) if(is.numeric(x)) x[ x <= y] else x [x %in% y]
out <- Map(f1, challenge_list, list('gamma', 3, c("a","e","i","o","u")))
out
-output
#$words
#[1] "gamma"
#$numbers
#[1] 1 2 3
#$letter
#[1] "a" "e" "i" "o" "u"
Try this. Most R objects can be filtered using brackets. For lists you need two sets, as in [[ ]][ ]: the first selects the object inside the list and the second selects elements within that object. For plain vectors a single pair of brackets with a condition is enough. Here is the code:
#Data
challenge_list <- list(words = c("alpha", "beta", "gamma"),
numbers = 1:10,
letter = letters)
#Code
challenge_list[[1]][3]
challenge_list$letter[challenge_list$letter %in% c("a", "e", "i", "o","u")]
challenge_list$numbers[challenge_list$numbers<=3]
As I have noticed your data is in a list, you can also play with the position of the elements like this:
#Data 2
challenge_list <- list(words = c("alpha", "beta", "gamma"),numbers = 1:10,letter = letters)
#Code 2
challenge_list[[1]][3]
challenge_list[[3]][challenge_list[[3]] %in% c("a", "e", "i", "o","u")]
challenge_list[[2]][challenge_list[[2]]<=3]
Output:
challenge_list[[1]][3]
[1] "gamma"
challenge_list[[3]][challenge_list[[3]] %in% c("a", "e", "i", "o","u")]
[1] "a" "e" "i" "o" "u"
challenge_list[[2]][challenge_list[[2]]<=3]
[1] 1 2 3
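The difference between single and double brackets on a list can be seen in a short sketch. The variable names sub_list, words and gamma_pos are introduced here only for illustration. The example also looks up the position of "gamma" with which() instead of hard-coding an index.

```r
challenge_list <- list(words = c("alpha", "beta", "gamma"),
                       numbers = 1:10,
                       letter = letters)

sub_list <- challenge_list["words"]    # single bracket: still a list of length 1
words    <- challenge_list[["words"]]  # double bracket: the character vector itself

# Find the position of the value rather than assuming where it sits.
gamma_pos <- which(words == "gamma")
words[gamma_pos]
```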

R - How do I check if an element is in a list of vectors?

Ok, my question might be a bit weirder than what the title suggests.
I have this list:
x <- list(
c("a", "d"),
c("a", "c"),
c("d", "e"),
c("e", "f"),
c("b", "c"),
c("f", "c"), # row 6
c("c", "e"),
c("f", "b"),
c("b", "a")
)
And I need to copy this stuff in another list called T. The only condition is that both letters of the pair must not be in T already. If one of them is already in T and the other isn't it's fine.
Basically in this example I would take the first 5 positions and copy them in T one after another because either one or both letters are new to T.
Then I would skip the 6th position because the letter "f" was already in the 4th position of T and the letter "c" is already in the 2nd and 5th positions of T.
Then I would skip the remaining 3 positions for the same reason (the letters "c", "e", "f", "b", "a" are already in T at this point)
I tried doing this
for(i in 1:length(T){
if (!( *first letter* %in% T && *second letter* %in% T)) {
T[[i]] <- c(*first letter*, *second letter*)
}
}
But it's like the "if" isn't even there, and I'm pretty sure I'm using %in% in the wrong way.
Any suggestions? I hope what I wrote makes sense, I'm new to R and to this site in general.
Thanks for your time
Effectively, for each element of the list, you want to lose it if both of its elements exist in earlier elements. A logical index is helpful here.
# Make a logical vector the length of x.
lose <- logical(length(x))
Now you can run a loop over the length of lose and compare each element of x against all previous elements. Using seq_len saves us the headache of having to guard against the special case of i = 1: seq_len(0) returns a zero-length integer vector, whereas 1:0 would count down and yield c(1, 0).
for (i in seq_along(lose)){
lose[i] <- all(x[[i]] %in% unique(unlist(x[seq_len(i - 1)])))
}
Now let's use the logical vector to subset x to T
T <- x[!lose]
T
#> [[1]]
#> [1] "a" "d"
#>
#> [[2]]
#> [1] "a" "c"
#>
#> [[3]]
#> [1] "d" "e"
#>
#> [[4]]
#> [1] "e" "f"
#>
#> [[5]]
#> [1] "b" "c"
# Created on 2018-07-19 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
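The remark about seq_len in this answer can be checked directly. The sketch below uses a shortened x and the hypothetical names earlier_safe and earlier_wrong to compare what the first loop iteration would see with seq_len(i - 1) and with 1:(i - 1).

```r
x <- list(c("a", "d"), c("a", "c"))

length(seq_len(0))   # zero-length index: nothing is selected for i = 1
1:0                  # counts down to c(1, 0) instead of being empty

# What "all earlier elements" looks like on the first iteration (i = 1):
earlier_safe  <- unlist(x[seq_len(0)])  # NULL, so no letter counts as seen
earlier_wrong <- unlist(x[1:0])         # the first pair is compared with itself
```

With 1:(i - 1) the first pair would therefore be flagged as already present and dropped.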
You can put the set of all previous elements in a list cum.sets, then use Map to check if all elements of the current vector are in the lagged cumulative set.
cum.sets <- lapply(seq_along(x), function(y) unlist(x[1:y]))
keep <- unlist(
Map(function(x, y) !all(x %in% y)
, x
, c(NA, cum.sets[-length(cum.sets)])))
x[keep]
# [[1]]
# [1] "a" "d"
#
# [[2]]
# [1] "a" "c"
#
# [[3]]
# [1] "d" "e"
#
# [[4]]
# [1] "e" "f"
#
# [[5]]
# [1] "b" "c"
tidyverse version (same output)
library(tidyverse)
cum.sets <- imap(x, ~ unlist(x[1:.y]))
keep <- map2_lgl(x, lag(cum.sets), ~!all(.x %in% .y))
x[keep]
You can use Reduce. If not all values of the new pair are already in the accumulated list, append the pair to the list; otherwise drop it. The initial value is the first element of the list:
Reduce(function(i, y) c(i, if(!all(y %in% unlist(i))) list(y)), x[-1],init = x[1])
[[1]]
[1] "a" "d"
[[2]]
[1] "a" "c"
[[3]]
[1] "d" "e"
[[4]]
[1] "e" "f"
[[5]]
[1] "b" "c"
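To see how the accumulated list grows, the same Reduce call can be run with accumulate = TRUE, which keeps every intermediate result. This is only an illustrative sketch. The names keep_new and steps are not from the answer.

```r
x <- list(c("a", "d"), c("a", "c"), c("d", "e"), c("e", "f"), c("b", "c"),
          c("f", "c"), c("c", "e"), c("f", "b"), c("b", "a"))

# Same step function as in the one-liner: append the pair only if at least
# one of its letters has not been seen yet.
keep_new <- function(acc, pair) c(acc, if (!all(pair %in% unlist(acc))) list(pair))

# accumulate = TRUE returns the accumulated list after every step.
steps <- Reduce(keep_new, x[-1], x[1], accumulate = TRUE)

# Number of kept pairs after each step; it stops growing once pairs are skipped.
lengths(steps)
```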
The most straightforward option is that you could store unique entries in another vector as you're looping through your input data.
Here's a solution that does not depend on the position (1 or 2) of a letter within a pair; the letters seen so far are collected in idx.
dat <- list(c('a','d'),c('a','c'),c('d','e'),c('e','f'),c('b','c'),
c('f','c'),c('c','e'),c('f','b'),c('b','a'))
Dat <- list()
idx <- list()
for(i in dat){
if(!all(i %in% idx)){
Dat <- append(Dat, list(i))
## append to idx if not previously observed
if(! i[1] %in% idx) idx <- append(idx, i[1])
if(! i[2] %in% idx) idx <- append(idx, i[2])
}
}
print(Dat)
#> [[1]]
#> [1] "a" "d"
#>
#> [[2]]
#> [1] "a" "c"
#>
#> [[3]]
#> [1] "d" "e"
#>
#> [[4]]
#> [1] "e" "f"
#>
#> [[5]]
#> [1] "b" "c"
On another note, I'd advise against using T as your vector name as it's used as TRUE in R.
We can unlist, check duplicated values with duplicated, reformat as a matrix and filter out pairs of TRUE values:
x[colSums(matrix(duplicated(unlist(x)), nrow = 2)) != 2]
# [[1]]
# [1] "a" "d"
#
# [[2]]
# [1] "a" "c"
#
# [[3]]
# [1] "d" "e"
#
# [[4]]
# [1] "e" "f"
#
# [[5]]
# [1] "b" "c"
#
And I recommend you don't use T as a variable name. It means TRUE by default (though using it that way is discouraged), so reassigning it could lead to unpleasant debugging.
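The reason is that T is an ordinary variable that base R binds to TRUE, while TRUE itself is a reserved word. A small sketch with the hypothetical names shadow_demo and result shows the difference.

```r
shadow_demo <- function() {
  T <- 0       # allowed: this masks base R's T inside the function
  isTRUE(T)    # the masked T is no longer TRUE
}
shadow_demo()

# TRUE is reserved, so the same assignment is rejected with an error.
result <- tryCatch(
  eval(parse(text = "TRUE <- 0")),
  error = function(e) "assignment to TRUE is an error"
)
result
```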
