Subset dataframe into equal subgroup chunks

I have a data frame df that needs to be split into chunks of 2 names each. In the example below there are 4 unique names: a, b, c, d. I need two one-column matrices, one for a and b and one for c and d.
Output format:
name1
item_value
item_value
...
END
name2
item_value
item_value
...
END
Example:
#dummy data
df <- data.frame(name=sort(c(rep(letters[1:4],2),"a","a","c")),
item=round(runif(11,1,10)),
stringsAsFactors=FALSE)
#tried approach - split per name. I need to split per 2 names.
lapply(split(df,f=df$name),
function(x)
{name <- unique(x$name)
as.matrix(c(name,x[,2],"END"))
})
#expected output
[,1]
[1,] "a"
[2,] "8"
[3,] "9"
[4,] "6"
[5,] "4"
[6,] "END"
[1,] "b"
[2,] "2"
[3,] "10"
[4,] "END"
[,2]
[1,] "c"
[2,] "6"
[3,] "6"
[4,] "2"
[5,] "END"
[1,] "d"
[2,] "4"
[3,] "1"
[4,] "END"
Note: Actual df has ~300000 rows with ~35000 unique names.

You may try this.
# for each 'name', "pad" 'item' with 'name' and 'END'
l1 <- lapply(split(df, f = df$name), function(x){
name <- unique(x$name)
as.matrix(c(name, x$item, "END"))
})
# create a sequence of offsets (0, 2, 4, ...) to select list elements two by two
# (the upper bound is the last offset needed, so every pair of names is covered)
steps <- seq(from = 0, to = length(l1) - 1, by = 2)
# loop over 'steps' to bind together list elements, two by two.
l2 <- lapply(steps, function(x){
do.call(rbind, l1[1:2 + x])
})
l2
# [[1]]
# [,1]
# [1,] "a"
# [2,] "6"
# [3,] "4"
# [4,] "10"
# [5,] "3"
# [6,] "END"
# [7,] "b"
# [8,] "6"
# [9,] "7"
# [10,] "END"
#
# [[2]]
# [,1]
# [1,] "c"
# [2,] "2"
# [3,] "6"
# [4,] "10"
# [5,] "END"
# [6,] "d"
# [7,] "5"
# [8,] "4"
# [9,] "END"
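For the real data (~35000 names), the same idea can be written so that the chunk size is a parameter. The following is only a sketch under the assumption that df has the columns name and item from the question; chunk_size, padded, chunk_id and chunks are hypothetical names introduced for illustration.

```r
# Dummy data shaped like the question's example (values are random).
set.seed(1)
df <- data.frame(name = sort(c(rep(letters[1:4], 2), "a", "a", "c")),
                 item = round(runif(11, 1, 10)),
                 stringsAsFactors = FALSE)

chunk_size <- 2  # hypothetical parameter: number of names per chunk

# One character vector per name: the name, its items, then "END".
by_name <- split(df$item, df$name)
padded <- Map(function(nm, items) c(nm, items, "END"),
              names(by_name), by_name)

# Assign consecutive names to chunks: 1, 1, 2, 2, ...
chunk_id <- ceiling(seq_along(padded) / chunk_size)

# Bind the padded vectors of each chunk into a one-column matrix.
chunks <- lapply(split(padded, chunk_id), function(p) {
  matrix(unlist(p, use.names = FALSE), ncol = 1)
})

chunks
```

Because the grouping comes from chunk_id rather than from fixed offsets, a final chunk with fewer than chunk_size names is kept as it is.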

Instead of building the list from individual names, build each element from a pair of names, subsetting the data.frame directly. Written out by hand for the example:
res <- list("a_b" = c("a", df[df$name == "a", 2], "END", "b", df[df$name == "b", 2], "END"),
            "c_d" = c("c", df[df$name == "c", 2], "END", "d", df[df$name == "d", 2], "END"))
The same thing for any even number of names, stepping through the unique names two at a time:
name <- unique(df$name)
res2 <- lapply(seq(1, length(name) - 1, by = 2), function(y) {
  as.matrix(c(name[y], df[df$name == name[y], 2], "END",
              name[y + 1], df[df$name == name[y + 1], 2], "END"))
})
answer <- res2
A single lapply is enough here. Nesting two sapply calls repeats the work for every chunk and returns a matrix of lists, of which only the first column is needed.

Related

How to use tidyverse map to iterate filtering and writing to csv in R

I've got a data frame full of study metadata, with two key columns: citation information and questions of mine that they pertain to:
library(tidyverse)
citation <- c(letters)
study_question <- rep(1:3, len = length(citation))
df <- as.data.frame(cbind(citation, study_question))
#so that df looks like:
citation study_question
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "1"
[5,] "e" "2"
[6,] "f" "3"
[7,] "g" "1"
[8,] "h" "2"
[9,] "i" "3"
[10,] "j" "1"
[11,] "k" "2"
[12,] "l" "3"
[13,] "m" "1"
[14,] "n" "2"
[15,] "o" "3"
[16,] "p" "1"
[17,] "q" "2"
[18,] "r" "3"
[19,] "s" "1"
[20,] "t" "2"
[21,] "u" "3"
[22,] "v" "1"
[23,] "w" "2"
[24,] "x" "3"
[25,] "y" "1"
[26,] "z" "2"
What I'd like to do is use an iterative function to filter for study question = 1, to get:
> df %>% filter(study_question == 1)
citation study_question
1 a 1
2 d 1
3 g 1
4 j 1
5 m 1
6 p 1
7 s 1
8 v 1
9 y 1
then write that list of citations to a csv named "sq1_papers.csv", then do the same for study question = 2, with the output "sq2_papers.csv", and then the same for question 3.
I have tried this with a for loop, which has not worked, and would prefer to try it with a map function, which I have gotten to work in the past. Here is the code I tried:
for(i in study_question) {
file <- df %>%
filter(study_question == study_question[[i]])
write_csv(file, "data/sq[i]_papers.csv")
}

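With the tidyverse, we can split the data by 'study_question' with group_split, loop over the resulting list with iwalk, and write each piece to a csv file with write_csv from readr. Note that group_split returns an unnamed list, so .y is the position of the group (1, 2, 3), which here coincides with the study_question values. The 'data' directory must already exist.
library(dplyr)
library(purrr)
library(readr)
library(stringr)
df %>%
  group_split(study_question) %>%
  iwalk(~ write_csv(.x, str_c('data/sq', .y, '_papers.csv')))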
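If the study_question values were not simply 1, 2, 3, relying on the position would produce misleading file names. A possible variant, sketched here under the assumption that base split() is acceptable in place of group_split(), names each list element after its group value, so .y is the study_question value itself. The output directory out_dir is a hypothetical location under tempdir(), chosen only so that the example runs anywhere.

```r
library(purrr)
library(readr)

# Example data shaped like the question's.
df <- data.frame(citation = letters,
                 study_question = rep(1:3, length.out = 26))

# Hypothetical output directory; create it if it does not exist yet.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# split() returns a list named by the group values ("1", "2", "3").
groups <- split(df, df$study_question)

# iwalk() passes each data frame as .x and its name as .y.
iwalk(groups, ~ write_csv(.x, file.path(out_dir, paste0("sq", .y, "_papers.csv"))))

list.files(out_dir)
```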

The below code should work for you:
## iterate through unique "study_questions"
for(i in unique(df$study_question)) {
## filter data frame by current "study question"
file <- df %>%
filter(study_question == i)
## create file name for export
fileName <- paste0("data/sq",i,"_papers.csv")
## export filtered data to path of fileName
write_csv(file, fileName)
}
I hope this helps!

Comparing rows of matrix and replacing matching elements

I want to compare two matrices. If row elements in the first matrix match row elements in the second matrix, then I want those rows of the second matrix to be kept. If the rows do not match, then I want those rows to be empty. I apologise that I had a quite similar question recently, but I still haven't been able to solve this one.
INPUT:
> mat1<-cbind(letters[3:8])
> mat1
[,1]
[1,] "c"
[2,] "d"
[3,] "e"
[4,] "f"
[5,] "g"
[6,] "h"
> mat2<-cbind(letters[1:5],1:5)
> mat2
[,1] [,2]
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
Expected OUTPUT:
> mat3
[,1] [,2]
[1,] "NA" "NA"
[2,] "NA" "NA"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
I have unsuccessfully attempted this:
> mat3<-mat2[ifelse(mat2[,1] %in% mat1[,1],mat2,""),]
Error in mat2[ifelse(mat2[, 1] %in% mat1[, 1], mat2, ""), ] :
no 'dimnames' attribute for array
I have been struggling for hours, so any suggestions are welcomed.

You were on the right track, but the answer is a little simpler than what you were trying. mat2[, 1] %in% mat1[, 1] returns the matches as a logical vector, and we can just set the non-matches to NA using that vector as an index.
mat1<-cbind(letters[3:8])
mat2<-cbind(letters[1:5],1:5)
is_match <- mat2[,1] %in% mat1[,1] # logical vector: TRUE where mat2's first column appears in mat1
mat3 <- mat2
mat3[!is_match,] <- NA # named is_match to avoid masking base::match

Replace values in one matrix with values from another

I am a programming newbie attempting to compare two matrices. If an element from the first column of mat1 matches any element from the first column of mat2, I want that element in mat1 to be replaced with the neighbour of the match in mat2 (same row, different column).
INPUT:
mat1<-matrix(letters[1:5])
mat2<-cbind(letters[4:8],1:5)
> mat1
[,1]
[1,] "a"
[2,] "b"
[3,] "c"
[4,] "d"
[5,] "e"
> mat2
[,1] [,2]
[1,] "d" "1"
[2,] "e" "2"
[3,] "f" "3"
[4,] "g" "4"
[5,] "h" "5"
Desired OUTPUT:
> mat3
[,1]
[1,] "a"
[2,] "b"
[3,] "c"
[4,] "1"
[5,] "2"
I have attempted the following without succeeding:
> for(x in mat1){mat3<-ifelse(x==mat2,mat2[which(x==mat2),2],mat1)}
> mat3
[,1] [,2]
[1,] "a" "a"
[2,] "2" "b"
[3,] "c" "c"
[4,] "d" "d"
[5,] "e" "e"
Any advice will be very appreciated. Have spent a whole day without making it work. It doesn't matter to me if the elements are in a matrix or a data frame.
Thanks.

ifelse is vectorized, so we can use it on the whole column. The test condition checks whether each value in the first column of 'mat1' is %in% the first column of 'mat2'. Where it is, match gives the row index of that value in 'mat2', which we use to extract the corresponding value from the second column; otherwise we keep the original value from 'mat1'.
mat3 <- matrix(ifelse(mat1[,1] %in% mat2[,1],
mat2[,2][match(mat1[,1], mat2[,1])], mat1[,1]))
mat3
# [,1]
#[1,] "a"
#[2,] "b"
#[3,] "c"
#[4,] "1"
#[5,] "2"

Here is another base R solution
v <- `names<-`(mat2[,2],mat2[,1])
mat3 <- matrix(unname(ifelse(is.na(v[mat1]),mat1,v[mat1])))
which gives
> mat3
[,1]
[1,] "a"
[2,] "b"
[3,] "c"
[4,] "1"
[5,] "2"

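An option using plain indexing rather than ifelse. The rows of mat1 that have a match are overwritten, and match looks up the corresponding row of mat2 for each of them, so the result does not depend on the two matrices listing the shared values in the same order:
mat3 <- mat1
idx <- match(mat1[,1], mat2[,1])
mat3[!is.na(idx), 1] <- mat2[idx[!is.na(idx)], 2]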
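To illustrate why the lookup order matters, here is a small sketch with made-up data in which the shared values appear in a different order in the two matrices. The matrices and the names idx and hit are assumptions for illustration only, not taken from the question.

```r
# Shared values "e" and "d" appear in opposite order in the two matrices.
mat1 <- matrix(c("a", "e", "c", "d"))
mat2 <- cbind(c("d", "e", "f"), c("1", "2", "3"))

# For each element of mat1, the row of mat2 holding the same value (or NA).
idx <- match(mat1[, 1], mat2[, 1])
hit <- !is.na(idx)

# Overwrite only the matched elements with their neighbour from mat2.
mat3 <- mat1
mat3[hit, 1] <- mat2[idx[hit], 2]

mat3[, 1]
# "a" "2" "c" "1"
```

Pairing two %in% masks would assign "1" to "e" and "2" to "d" here, because each mask preserves the order of its own matrix.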

use dplyr to summarize first element of a nested list (2-d array?)

I'm trying to understand the appropriate usage of dplyr for summarizing a nested list within a tibble.
The structure is as follows:
> glimpse(mydata)
Rows: 1,000
Columns: 3
$ meta <df[,6]> <data.frame[40 x 6]>
$ independent_variable <list> [<"A", "B", "B", "B", "A", "A", "B", "A…
$ dependent_variables <df[,4]> <data.frame[40 x 4]>
> head(mydata$independent_variable)
[[1]]
[,1] [,2] [,3] [,4]
[1,] "A" "FALSE" "5" NA
[2,] "B" "FALSE" "5" "NA"
[3,] "B" "FALSE" "5" "NA"
[4,] "B" "FALSE" "5" "NA"
[5,] "A" "FALSE" "13" "NA"
[6,] "A" "FALSE" "5" "NA"
[7,] "B" "FALSE" "12" "NA"
[8,] "A" "FALSE" "133" "NA"
[9,] "A" "FALSE" "131" "NA"
[10,] "A" "TRUE" "0" "NA"
[[2]]
[,1] [,2] [,3] [,4]
[1,] "A" "FALSE" "77" NA
[2,] "B" "FALSE" NA "NA"
[3,] "B" "FALSE" NA "NA"
[4,] "B" "FALSE" NA "NA"
[5,] "B" "FALSE" NA "NA"
[6,] "A" "TRUE" "1" "NA"
independent_variable is a list of 1000 entries, each an N x 4 matrix (that is, all 1000 entries have 4 columns and a varying number of rows). The first column is the only one I'm currently interested in, and each of its elements can only be "A" or "B". I want to count the number of "A"s within each of the 1000 entries and get that count back for each entry.
It seems like I should use purrr, but I'm not sure how to structure this within dplyr.

Here is an approach using purrr:
library(purrr)
library(dplyr)
# my example data
tmp = list(cbind(c("A","A","B"),1),cbind(c("B","A","B"),2))
# define a summary function
count_A = function(x){
x %>%
as.data.frame() %>% # needed as the input data is of type 'matrix'
select(V1) %>% # the default column name for column 1
filter(V1 == "A") %>%
ungroup() %>% # unnecessary, but clear you are summarising the whole df
summarise(num_A = n())
}
# test summary function
count_A(tmp[[1]])
# apply function to every element of list
map(tmp, count_A)
In this pattern, your summary function can be any function that takes a single argument and returns the desired result. If the function works correctly when applied to the first element of the list (see in the code, I test my summary function) then you can expect that map will apply the function to every element of the list.
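When only a single count per element is needed, a shorter variant may be enough. This sketch assumes each list element is a character matrix whose first column holds "A" or "B", as in the example data above. It uses map_int so that the result is a plain integer vector rather than a list of data frames.

```r
library(purrr)

# Same example data as above.
tmp <- list(cbind(c("A", "A", "B"), 1), cbind(c("B", "A", "B"), 2))

# Count the "A"s in the first column of each matrix.
num_A <- map_int(tmp, ~ sum(.x[, 1] == "A", na.rm = TRUE))

num_A
# 2 1
```

With the tibble from the question, the same call could presumably be used inside mutate(), for example mutate(num_A = map_int(independent_variable, ~ sum(.x[, 1] == "A", na.rm = TRUE))), assuming independent_variable is a list column of such matrices.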

Replace multiple values in a matrix

a is a matrix:
a <- matrix(1:9,3)
> a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
I want to replace every 1 with good, every 4 with medium, and every 9 with bad.
I use the following code:
a[a==1] <- "good"
a[a==4] <- "medium"
a[a==9] <- "bad"
> a
[,1] [,2] [,3]
[1,] "good" "medium" "7"
[2,] "2" "5" "8"
[3,] "3" "6" "bad"
It works, but is this the simplest way to do it? Can I combine these statements into one command?

Using cut():
matrix(cut(a, breaks = c(0:9),
labels = c("good", 2:3, "medium", 5:8, "bad")), 3)
But I'm not really happy with having to write out the labels manually.
Using match() may be more flexible:
res <- matrix(c("good", "medium", "bad")[match(a, c(1, 4, 9))], 3)
res <- ifelse(is.na(res), a, res)

car::recode() does nicely here, returning the same matrix structure as was given as input.
car::recode(a, "1='good';4='medium';9='bad'")
# [,1] [,2] [,3]
# [1,] "good" "medium" "7"
# [2,] "2" "5" "8"
# [3,] "3" "6" "bad"
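Another base R possibility is a named lookup vector, which keeps the old-to-new mapping in one place. This is only a sketch: lookup, res and hit are hypothetical names, and the matrix is converted to character first because the replaced result has to be character anyway.

```r
a <- matrix(1:9, 3)

# Mapping from old values (as names) to new labels.
lookup <- c("1" = "good", "4" = "medium", "9" = "bad")

# Work on a character copy so the dimensions are preserved.
res <- a
storage.mode(res) <- "character"

# Replace only the cells whose value has an entry in the lookup.
hit <- res %in% names(lookup)
res[hit] <- unname(lookup[res[hit]])

res
```

Adding another replacement then only requires one more entry in lookup.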
